Basic Analyses
Additional Information on Regression Analysis
Regression analysis is a method for estimating a relationship of a specified functional form between a response variable and one or more explanatory variables. Common examples of regression relationships in stream ecosystems include modeling biological characteristics (e.g., species richness) as a function of different environmental factors, or modeling local environmental conditions (e.g., stream temperature) as a function of other data that can be extracted from maps.
The most common functional form that is assumed for a regression analysis is a linear relationship:
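
In its usual textbook form, a linear relationship with p explanatory variables can be written as shown here. The symbols are assumed notation for illustration and are not taken from the source: y is the response variable, x_1 through x_p are the explanatory variables, the β terms are coefficients estimated from the data, and ε is the sampling error in y.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon
```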

Regression analysis can be used to fit relationships between any variables with little consideration of the underlying assumptions. However, when the estimated relationships are used to predict likely values of y at new values of the explanatory variables, or when the estimated relationships are interpreted with respect to whether they accurately represent the underlying physical or biological relationships, the theoretical assumptions must be considered more carefully. More specifically, one must assess whether the assumed functional form is sufficiently representative of the actual relationship, whether the sampling variability in y is distributed as assumed, whether the magnitude of the sampling variability in y changes across the range of predictions, whether the samples used to fit the model are independent, and whether errors in the measured values of the explanatory variables are small enough to be ignored. We discuss each of these assumptions in more detail and illustrate methods for assessing the degree to which assumptions are supported by the data.
Is the Assumed Functional Form Appropriate?

Is the Assumed Distribution for Sampling Variability Appropriate?
Regression analysis seeks to find a relationship between the expected (mean) value of the response variable, y, and one or more explanatory variables. Accurately estimating this relationship requires an assumption about the distribution of the error, or variability, in observed values of y. The most common assumption is that sampling error in y is normally distributed. That is, for any combination of values of the explanatory variables, we assume that the scatter of observed values about the mean follows a normal distribution.
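
One informal way to examine the normality assumption is to compare the sorted residuals from a fitted line with the quantiles of a standard normal distribution, which is the idea behind a normal quantile-quantile plot. The sketch below is illustrative only: the data are simulated, the coefficients and sample size are arbitrary, and the helper names (`fit_line`, `qq_correlation`) are hypothetical. A correlation close to 1 is consistent with normal scatter, while a clearly lower value suggests a skewed or heavy-tailed error distribution.

```python
import random
from statistics import NormalDist, mean


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def residuals(x, y):
    """Observed y minus the value predicted by the fitted line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    abar, bbar = mean(a), mean(b)
    num = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    den = (sum((ai - abar) ** 2 for ai in a)
           * sum((bi - bbar) ** 2 for bi in b)) ** 0.5
    return num / den


def qq_correlation(resid):
    """Correlation between sorted residuals and standard normal quantiles."""
    n = len(resid)
    normal_q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return pearson(sorted(resid), normal_q)


# Simulated data (assumed values, not from any real stream survey).
rng = random.Random(0)
x = [rng.uniform(5, 25) for _ in range(200)]
y_normal = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]        # normal errors
y_skewed = [2.0 + 0.5 * xi + rng.expovariate(1.0) for xi in x]   # skewed errors

r_normal = qq_correlation(residuals(x, y_normal))
r_skewed = qq_correlation(residuals(x, y_skewed))
print(f"Q-Q correlation, normal errors: {r_normal:.3f}")
print(f"Q-Q correlation, skewed errors: {r_skewed:.3f}")
```

In practice the same comparison is usually made graphically, by plotting the sorted residuals against the normal quantiles and looking for systematic departures from a straight line.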

In assessing the assumption of normal sampling variability, it is often useful to consider the characteristics of the response variable. Many typical response variables are constrained, so by definition their sampling variability cannot be exactly normal. For example, variables that measure a count (e.g., total taxon richness) have a minimum value of 0, and those that measure a proportion (e.g., relative abundance) have a minimum value of 0 and a maximum value of 1. Normal distributions do not allow for such constraints and therefore may not be appropriate. However, some constrained variables (e.g., multimetric indices) are reasonably well approximated by a normal distribution. Other variables that have a minimum value of zero (e.g., chemical concentrations, watershed area) are often approximately normally distributed after a log transformation. Generalized linear models also allow one to directly model certain types of data with non-normal distributions.
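
The effect of a log transformation on a positive, right-skewed variable can be illustrated with a small simulation. This is a sketch under stated assumptions: the "concentrations" are drawn from a lognormal distribution with arbitrary parameters rather than measured, and `skewness` is a hypothetical helper that computes the standardized third moment. A symmetric distribution such as the normal has skewness near 0.

```python
import math
import random
from statistics import mean, pstdev


def skewness(values):
    """Sample skewness: the mean of the cubed standardized values."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Simulated positive, right-skewed values (assumed, not real measurements).
rng = random.Random(1)
concentrations = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]

# Log-transform each value; this is only defined for values greater than 0.
log_concentrations = [math.log10(c) for c in concentrations]

skew_raw = skewness(concentrations)
skew_log = skewness(log_concentrations)
print(f"skewness of raw values:       {skew_raw:.2f}")
print(f"skewness of log10 of values:  {skew_log:.2f}")
```

Real data may contain zeros or values below a detection limit, which cannot be log-transformed directly and need separate handling.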
Is the Sampling Variance Constant?

Are Samples Independent?
Regression models typically assume that samples are independent of one another. When this assumption is violated, more confidence may be ascribed to results than is supported by the data. Considerations of whether samples are independent are covered in depth on the Autocorrelation page.
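
As a rough first screen, and assuming the samples have a natural ordering (for example, by sampling date), one can compute the correlation between each residual and the one that follows it. The sketch below uses simulated series as stand-ins for regression residuals. The function name `lag1_autocorrelation` and the dependence strength of 0.8 are illustrative assumptions. Values near 0 are consistent with independence, while values well above 0 indicate that successive samples carry overlapping information.

```python
import random


def lag1_autocorrelation(series):
    """Correlation between each value and the next one in the ordered series."""
    n = len(series)
    m = sum(series) / n
    num = sum((series[i] - m) * (series[i + 1] - m) for i in range(n - 1))
    den = sum((v - m) ** 2 for v in series)
    return num / den


rng = random.Random(2)

# Stand-in for residuals from independent samples.
independent = [rng.gauss(0, 1) for _ in range(300)]

# Stand-in for residuals where each value depends on the previous one
# (a first-order autoregressive series with an assumed coefficient of 0.8).
correlated = [0.0]
for _ in range(299):
    correlated.append(0.8 * correlated[-1] + rng.gauss(0, 1))

r_ind = lag1_autocorrelation(independent)
r_cor = lag1_autocorrelation(correlated)
print(f"lag-1 autocorrelation, independent series: {r_ind:.2f}")
print(f"lag-1 autocorrelation, dependent series:   {r_cor:.2f}")
```

This check only addresses dependence along a single ordering. Spatial dependence among sites requires other methods.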