APPENDIX A. DATA EVALUATION AND STATISTICAL HYPOTHESIS TESTING
A1 INTRODUCTION
A variety of data (including rainfall intensities and depths; discharge rates and flow volumes; the concentrations of chemical parameters; and the measurement of physical parameters) are generated during a stormwater monitoring program. We can examine these data for patterns and trends, comparing stormwater quality between different areas over time; input/output comparisons of structural best management practices (BMPs); and pre/post monitoring in a basin to compare source control BMP(s) implementation. However, the timing and magnitude of stormwater quality phenomena are influenced by many highly variable factors, such as: storm intensity and duration, the length of the antecedent dry period, and the magnitude and frequency of pollution causing activities within the catchment area. We can only describe in a general way the potential influence of each factor. It is nearly impossible to assess in a statistical sense (i.e., with some level of error) interactions among all factors. We therefore use the tools of statistical analysis to infer, with a predictable level of error, generalities about average conditions (or trends over time) and the variability from the limited information obtained from our monitoring programs.
The first step in the process of evaluating a stormwater data set is to validate the chemical data, "qualifying" those that do not meet the criteria established in the QA/QC plan. After completing the data validation process, we conduct an initial evaluation using summary (univariate) statistics (Section A3). The initial evaluation shows whether the data can be used in statistical hypothesis testing. The type of hypothesis tested is determined by the program objectives. These usually include one or more of the following:
 Characterization of stormwater discharges (e.g., average conditions, variability, ranges, etc.)
 Comparing stormwater discharge quality to state and federal water quality criteria
 Monitoring to detect trends in discharge quality over time and between different locations
 Monitoring to assess the effectiveness of BMPs for stormwater control
The statistical testing techniques appropriate to each of these objectives are discussed in Sections A5 through A8.
A2 DATA EDITING, VALIDATION, AND TREATMENT
Prior to conducting a statistical test, data should be screened to eliminate potentially biased or nonrepresentative values. Biased and nonrepresentative values may arise due to equipment malfunctions, field or laboratory protocol errors, weather problems, human error, and similar events. In addition, there are procedures for addressing data below laboratory detection limits and for estimating particulate fractions of metals. Finally, data should be transformed to a normal distribution if parametric statistical tests will be used, because those tests rely on normality of the data as one of their assumptions.
Percent Capture. If samples were taken using automated flow weighted compositing equipment, estimate the percent of the total discharge that was captured (i.e., the amount of the total flow that was sampled by the equipment during the time the equipment was activated) for each sample. As a general rule, samples with less than 60% capture should be rejected as not representative of the event. In some circumstances, samples with less than 60% capture may be used, depending on the objective of the analysis. For example, the 60% capture criterion may not be applicable to a sample collected to characterize the "first flush" of a storm event. In compiling data, it is suggested that data with less than 80% capture (but greater than 60%) be noted.
QA/QC Qualifiers. Based on the results of the QA/QC evaluation, laboratory data considered suspect due to the contamination of blanks, exceedance of holding times, or low surrogate recoveries should be qualified or rejected. Ideally, statistical tests will be performed only on data that have passed this screening process. Although it is possible to use data that have been qualified as estimated values, a higher level of uncertainty is associated with the test results. It is up to the data user to make an educated decision whether to include estimated values.
Event Mean Concentrations (EMCs)
If EMCs will be used for data comparisons, then all data should either be collected as an EMC or, if individual samples are analyzed, an EMC should be computed. This can be accomplished by integrating the hydrograph (plot of flow rate versus time) and pollutograph (plot of concentration versus time). Pollutant mass is estimated by applying the trapezoidal rule to a number of corresponding time segments of the hydrograph and the pollutograph. The product of the partial flow volume and associated concentration estimates the mass in that segment of the discharge. The sum of all such segment masses estimates the total mass discharged by the event. The estimation of the total area under the hydrograph provides the total volume of runoff. Total mass divided by the total runoff volume provides the desired value for the EMC.
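The integration described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name and argument names are hypothetical, and it assumes that the "associated concentration" for each time segment is the average of the two concentrations bounding that segment.

```python
def event_mean_concentration(times, flows, concs):
    """Estimate runoff volume, pollutant mass, and EMC for one event.

    times: sample times, strictly increasing (e.g., seconds)
    flows: flow rate at each time (e.g., L/s)
    concs: concentration at each time (e.g., mg/L)
    Units must be consistent; with the units above, volume is in L and
    mass is in mg, so the EMC is in mg/L.
    """
    if not (len(times) == len(flows) == len(concs)) or len(times) < 2:
        raise ValueError("need at least two matched (time, flow, conc) points")
    total_volume = 0.0
    total_mass = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        # Partial flow volume: trapezoid under this hydrograph segment.
        seg_volume = 0.5 * (flows[i] + flows[i - 1]) * dt
        # Assumption: segment concentration is the mean of its end points.
        seg_conc = 0.5 * (concs[i] + concs[i - 1])
        total_volume += seg_volume
        total_mass += seg_volume * seg_conc
    return total_volume, total_mass, total_mass / total_volume


# Hypothetical four-point hydrograph and pollutograph.
volume, mass, emc = event_mean_concentration(
    times=[0, 600, 1200, 1800],
    flows=[0.0, 10.0, 6.0, 0.0],
    concs=[50.0, 30.0, 20.0, 10.0],
)
print(volume, mass, emc)
```

Because high concentrations early in the event coincide with low flows in this example, the EMC is lower than the simple average of the four concentrations.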
PQL and MDL. The method detection limit (MDL) is defined as the "minimum concentration of an analyte that can be measured and reported with a 99% confidence that the analyte concentration is greater than zero" (40 CFR 136.2). The practical quantification limit (PQL) is the minimum concentration of an analyte that can be accurately and precisely quantified. In general, the PQL is 5 to 10 times the MDL, depending on the analyte. In general, statistical tests will be more accurate if the data values are above the PQL. However, statistical tests can be performed even if large amounts of the data are between the MDL and PQL, but the confidence (power) of the test may be lower due to increased uncertainty. Prior to conducting statistical tests, the data set should be examined to determine the percentage of points that are below the MDL and PQL. If a large proportion of the data is below the MDL, statistical testing may not be appropriate.
Averaging of Duplicates. Data from duplicate samples (laboratory or field) should be averaged prior to statistical analysis. That is, the average value should be used in place of either of the two duplicate values.
Calculating Metal Fractions. Where total and dissolved fractions of metals are measured, it is possible to estimate several other fractions from these numbers and the concentration of TSS. Perform the following calculations and add these data to the data set:
 Percent dissolved = (dissolved conc. / total conc.) * 100
 Suspended (µg/L) = (total conc. - dissolved conc.)
 Particulate (µg/g) = [suspended conc. (µg/L) / TSS conc. (mg/L)] * 1,000 mg/g
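The three calculations can be combined in a short helper. The sketch below is illustrative only; the function name is hypothetical, and it assumes that percent dissolved is expressed as 100 times the dissolved-to-total ratio and that the suspended concentration is the total minus the dissolved concentration.

```python
def metal_fractions(total_ug_l, dissolved_ug_l, tss_mg_l):
    """Derive additional metal fractions from total, dissolved, and TSS.

    total_ug_l:     total metal concentration (ug/L)
    dissolved_ug_l: dissolved metal concentration (ug/L)
    tss_mg_l:       total suspended solids concentration (mg/L)
    """
    # Assumption: "percent" means the ratio multiplied by 100.
    percent_dissolved = 100.0 * dissolved_ug_l / total_ug_l
    # Suspended metal is what remains after removing the dissolved part.
    suspended_ug_l = total_ug_l - dissolved_ug_l
    # (ug/L) / (mg/L) gives ug of metal per mg of solids; x 1,000 mg/g.
    particulate_ug_g = suspended_ug_l / tss_mg_l * 1000.0
    return percent_dissolved, suspended_ug_l, particulate_ug_g


# Hypothetical sample: 120 ug/L total, 30 ug/L dissolved, 60 mg/L TSS.
print(metal_fractions(120.0, 30.0, 60.0))
```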
Distributional Tests. Many commonly used statistical tests (e.g., parametric Analysis of Variance) are based on the assumption that the data were sampled at random from a population with a normal distribution. Therefore, another attribute of the data that should be investigated is its apparent probability distribution. It is important to determine whether the probability distribution is normal or lognormal. Researchers have found that generally the lognormal distribution provides the best fit to stormwater quality data (USEPA 1983; Driscoll et al., 1990). If the data are not normally distributed, or if the data set contains a very high proportion of nondetects, a nonparametric statistical procedure should be utilized for testing trends. Non parametric techniques examine the data based on rank rather than distribution.
Several methods can be used to determine the normality of a data set or of the transformed values, including the W test, Probability Plot Correlation Coefficient (PPCC), and a graphical check of the data. These methods are useful for the analysis of stormwater quality data.
The procedure employed for the graphical test is to develop a log probability plot for visual assessment of the lognormal distribution. First, compute the mean and standard deviation of the natural (base e) logarithm transforms of the EMCs. The theoretical distribution is constructed from these values (the log mean [U] and the log standard deviation [W]). When combined with the plotting position based on the normal distribution, this derived distribution indicates the expected value (assuming that the data follow a lognormal distribution) of a pollutant's concentration at any probability of occurrence. This expected probability distribution then can be compared with the data by plotting the two on the same log probability plot.
The plotting position of the individual data points can be determined by assigning an expected probability for each EMC in the ranked series of observed values. This position varies with the number of observations (N) in the sample, and is provided by the following general equation (Driscoll et al. 1990):
Pr = (m - ½) / N
where m is the rank order of the observation and Pr is the plotting position (probability). A visual check of the data using a log probability plot can be a very effective test, and is recommended. For further quantitative information, the PPCC can be used.
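The values needed for a log probability plot can be assembled as sketched below. This is a minimal illustration under stated assumptions: the function name is hypothetical, the plotting position is taken as (m - 0.5) / N, and the theoretical concentration at each probability is computed as EXP(U + W * Z), where Z is the standard normal deviate for that probability. Plotting itself is left to whatever graphing tool is available.

```python
import math
from statistics import NormalDist, mean, stdev


def log_probability_points(emcs):
    """Return (Pr, observed, expected) for each ranked EMC.

    expected is the concentration predicted by a lognormal distribution
    fitted to the data (log mean U, log standard deviation W).
    """
    ranked = sorted(emcs)
    n = len(ranked)
    logs = [math.log(x) for x in ranked]
    u = mean(logs)   # log mean (U)
    w = stdev(logs)  # log standard deviation (W)
    std_normal = NormalDist()
    points = []
    for m, observed in enumerate(ranked, start=1):
        pr = (m - 0.5) / n             # plotting position
        z = std_normal.inv_cdf(pr)     # standard normal deviate for Pr
        expected = math.exp(u + w * z)
        points.append((pr, observed, expected))
    return points


# Hypothetical EMCs (mg/L) from five storms.
for pr, observed, expected in log_probability_points([12, 45, 7, 30, 18]):
    print(f"{pr:.2f}  observed={observed:6.1f}  expected={expected:6.1f}")
```

If the observed values track the expected values closely across the range of probabilities, the lognormal assumption is visually supported.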
The PPCC test provides both quantitative and graphical representations of the goodness of fit of the distribution with respect to normality. Although the PPCC test is less commonly available in statistical programs than the W test, it is straightforward in its application. It consists of creating a plot of the data on probability paper (i.e., paper with a probability scale along the long edge and a linear scale along the short edge). Plots of data that are normally distributed form a straight line. The correlation coefficient for the best fit straight line can be calculated and compared with the critical value for that number of data points, as provided in the literature (Vogel, 1986). If data are better predicted by a lognormal distribution than a normal distribution, the lognormal distribution should be utilized for estimation of population statistics and analysis of variance tests.
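The correlation coefficient at the heart of the PPCC test can be computed without probability paper. The sketch below is illustrative: the function name is hypothetical, and it assumes the (m - 0.5) / N plotting position given earlier. Published critical values are tied to a particular plotting position formula, which may differ from the one assumed here, so the comparison against a table should be treated as approximate unless the formulas match.

```python
import math
from statistics import NormalDist


def ppcc(values, lognormal=False):
    """Correlation between ranked data and standard normal quantiles.

    Set lognormal=True to test the natural logs of the data instead.
    Values near 1 indicate a nearly straight probability plot.
    """
    data = sorted(math.log(v) for v in values) if lognormal else sorted(values)
    n = len(data)
    std_normal = NormalDist()
    # Assumption: plotting position (m - 0.5) / N for rank m.
    z = [std_normal.inv_cdf((m - 0.5) / n) for m in range(1, n + 1)]
    mean_x = sum(data) / n
    mean_z = sum(z) / n
    sxz = sum((x - mean_x) * (q - mean_z) for x, q in zip(data, z))
    sxx = sum((x - mean_x) ** 2 for x in data)
    szz = sum((q - mean_z) ** 2 for q in z)
    return sxz / math.sqrt(sxx * szz)


# Hypothetical EMCs: compare the normal and lognormal fits.
emcs = [7, 12, 18, 30, 45, 110]
print(ppcc(emcs), ppcc(emcs, lognormal=True))
```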
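The W test (Shapiro and Wilk, 1965) is available in most statistical software programs. The W test result is significant (i.e., the hypothesis that the data are normally distributed is rejected) if the reported probability "Prob < W" (the probability of obtaining a W statistic as small as the one calculated if the data were in fact normally distributed) is less than the chosen significance level. Typically the significance level is set at 0.05.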
Treatment of Nondetects. When stormwater data sets include some nondetects within the data, separate data analysis techniques are required to accurately estimate sample statistics. When below-detection-limit data exist in a data set, they will affect statistical parameters computed from that set. For example, when below-detection-limit data are set to the detection limit (often cited as a conservative approach), it causes an overestimation of central tendency measures and an underestimation of dispersion measures, as opposed to what would have been obtained had the true values of the below-detection-limit data been known. Figure A1 shows an example of this phenomenon using a hypothetical lognormal distribution with a detection limit artificially set at 1.0.
The magnitude of the error made by failing to properly treat detection limit data is a function of the size of the data set (i.e., the total number of events for which a concentration was reported [N]), the percentage of the total set represented by detection limit data, and the value of the detection limit relative to the median of the data above the detection limit.
The treatment of detection limit data varies among workers in the field and the objectives for which the data are being analyzed. The traditional practice has been simply to take all detection limit data at their face value, the argument being that since the actual values are really lower, the average so calculated will be conservative for prediction of concentrations near the median. However, prediction of values that are exceeded rarely (i.e., pollutant concentrations that are observed less than 5% of the time) may very likely be underpredicted (see Figure A1a). Others have set the values equal to one half (or some other fraction) of the detection limit. When a significant percentage of a data set is at or below the detection limit, the treatment method can seriously affect analytical results and their interpretation. In statistical parlance, data sets with "less than" observations are termed "censored data."
Figure A1 (a and b). Comparison of Approaches to Analysis of Detection Limit Data: (a) using the detection limit; (b) using maximum likelihood estimator.
Simply stated, the approach to treating detection limit data has been to ignore their magnitude, but use their probability (or plotting position) in determining the lognormal distribution that best fit the data set in question. That is, using regression, all of the data above the detection limit is fit to a lognormal curve and it is assumed that the detection limit data follows the same lognormal frequency distribution. This is accomplished as follows:
Transform the data to a normal distribution (in this case using a log transformation).
Rank order the data set in question (m = 1, 2,...., N).
Compute the probability (i.e., plotting position) associated with the rank order (m) as discussed earlier.
Compute the corresponding Z score (the standard normal deviate whose cumulative area under the standard normal curve equals the probability; i.e., the number of standard deviations away from the mean) for each probability value.
Determine the regression line that best fits the data subset above the detection limit (i.e., regression fit of transformed data with Z score values).
Determine the log mean and log standard deviation from the regression line (i.e., the mean is the intercept at a Z score value of zero, while the standard deviation is the slope of the line).
Compute the arithmetic statistical parameters from these values as discussed in Section A3.
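The steps above can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of any particular program: the function name is hypothetical, a single detection limit lower than every detected value is assumed (so nondetects occupy the lowest ranks), and the plotting position is taken as (m - 0.5) / N.

```python
import math
from statistics import NormalDist


def censored_lognormal_fit(detected, n_nondetect):
    """Estimate log mean (U) and log std deviation (W) from detected data.

    detected:    concentrations reported above the detection limit
    n_nondetect: number of results reported below the detection limit
    Nondetects contribute only their rank (plotting position), not a value.
    """
    if len(detected) < 2:
        raise ValueError("need at least two detected values for regression")
    n = len(detected) + n_nondetect
    std_normal = NormalDist()
    # Log-transform and rank the detected values.
    logs = sorted(math.log(c) for c in detected)
    # Assumption: nondetects hold ranks 1..n_nondetect, detects the rest.
    zs = [std_normal.inv_cdf((m - 0.5) / n)
          for m in range(n_nondetect + 1, n + 1)]
    # Ordinary least squares of log concentration on Z score.
    mean_z = sum(zs) / len(zs)
    mean_y = sum(logs) / len(logs)
    slope = (sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, logs))
             / sum((z - mean_z) ** 2 for z in zs))
    intercept = mean_y - slope * mean_z
    return intercept, slope  # U (value at Z = 0), W (slope)


# Hypothetical data set: 7 detected values and 3 nondetects.
u, w = censored_lognormal_fit([1.4, 2.2, 3.1, 4.8, 6.5, 9.9, 17.0], 3)
print(u, w)
```

The returned U and W can then be converted to arithmetic statistics using the lognormal relationships discussed in Section A3.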
The actual execution of the correction is much simpler than its description. A graphic illustration of the results of the procedure is presented by Figure A1b, which also indicates how the pertinent statistics are affected. Newman and Dixon (1990) have developed a public domain software program called UNCENSOR to perform these calculations.
A3 DESCRIPTIVE STATISTICS
The purpose of calculating general descriptive statistics is to gain an overview of the data and to prepare for more formal statistical hypothesis testing. The data are displayed in a variety of ways and summary statistics are generated. These exploratory techniques can provide clues as to the presence of major treatment effects (e.g., station, year, land use type) that can be tested for statistical significance. Descriptive statistics also indicate how groups of data can be combined or "pooled" prior to statistical testing. "Pooling" effectively increases the sample size and the power of the analysis to detect significant differences.
For example, if data collected at two physically similar or nearby highway monitoring stations have been demonstrated to not differ statistically from each other, the data could be pooled for further testing to compare to other locations or configurations. The reverse may be demonstrated by the descriptive statistics as well.
Summary Statistics. First, calculate simple descriptive statistics, characterizing the central tendency, variability, and distribution of the data set. Central tendency is measured by the sample mean (if normal, the arithmetic average of the data), the median (the 50th percentile of the distribution), and the mode (the most probable value). The variability of the data set is represented by the sample standard deviation and by its squared value, the variance. For non parametric tests, data variability is measured by the interquartile difference, the difference between the values of the 1st (25th percentile) and 3rd quartile (75th percentile) values. Any statistical software program and most hand calculators can be used to calculate these parameters.
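As one example of such a calculation, the sketch below uses Python's standard statistics module. The function name and the data are hypothetical, and the quartiles depend on the interpolation convention of the software used, so small differences between programs are to be expected.

```python
from statistics import mean, median, quantiles, stdev, variance


def summary_statistics(values):
    """Return basic measures of central tendency and variability."""
    q1, q2, q3 = quantiles(values, n=4)  # 25th, 50th, 75th percentiles
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "std_dev": stdev(values),        # sample standard deviation
        "variance": variance(values),    # square of the standard deviation
        "interquartile_range": q3 - q1,
    }


# Hypothetical TSS concentrations (mg/L) from six storms.
print(summary_statistics([4, 8, 15, 16, 23, 42]))
```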
Descriptive Statistics Utilizing the Lognormal Distribution
This guidance applies when computing descriptive statistics utilizing the lognormal assumption. If a sample (a data set of N observations) is drawn from an underlying population that has a lognormal distribution, the following apply:
 An estimate of the mean and variance of the population is obtained by computing the mean and standard deviation of the log transforms of the data.
 The arithmetic statistical parameters of the population (mean, median, standard deviation, coefficient of variation) should be determined from the theoretical relationships (see Table A1) between these values and the mean and standard deviation of the transformed data.
 The arithmetic mean so computed will not match that produced by a straight average of the data. Both provide an estimate of the population mean, but the approach just described provides a better estimator. As the sample size increases, the two values converge. For the entire population, both approaches would produce the same value.
A few mathematical formulas based on probability theory summarize the pertinent statistical relationships for lognormal probability distributions. These provide the basis for back and forth conversions between arithmetic properties of the untransformed data (in which concentrations, flows, and loads are reported) and properties of the transformed data (in which probability and frequency characteristics are defined and computed).
Using a two parameter lognormal distribution, the definition of one single central tendency (e.g., median, mean) and one dispersion (e.g., standard deviation, coefficient of variation) parameter automatically defines the values for all of the other measures of central tendency and dispersion as well as the entire distribution. Table A1 presents the formulas that define these relationships from which other values can be computed.
Box and Whisker Plots. The Box and Whisker Plot is a graphical method of displaying the variability, spread, and distribution of the data set. The "box" shows the 25th, 50th, and 75th percentiles. One method of assessing variability is the interquartile range, defined above. The "whiskers," which illustrate the spread of the data, extend beyond the box by a distance obtained by multiplying the interquartile range by 1.5. These plots can also be used to display the degree of overlap between two data sets, used as an indication (but not proof) of whether the data sets are likely to be derived from the same populations. If data are lognormal, the plots can be produced using the log-transformed data.
TABLE A1 RELATIONSHIPS OF LOGNORMAL DISTRIBUTIONS

T = EXP(U)                      S = M * CV
M = EXP(U + 0.5 * W^{2})        W = SQRT(LN(1 + CV^{2}))
M = T * SQRT(1 + CV^{2})        U = LN(M / EXP(0.5 * W^{2}))
CV = SQRT(EXP(W^{2}) - 1)       U = LN(M / SQRT(1 + CV^{2}))

Parameter designations are defined as:

                        Arithmetic    Logarithmic
MEAN                    M             U
STD DEVIATION           S             W
COEF OF VARIATION       CV
MEDIAN                  T

LN(x) designates the base e logarithm of the value x
SQRT(x) designates the square root of the value x
EXP(x) designates e to the power x
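The Table A1 relationships translate directly into code. The sketch below is a minimal illustration with hypothetical function names; it converts in both directions between the logarithmic parameters (U, W) and the arithmetic parameters (M, T, S, CV).

```python
import math


def arithmetic_from_log(u, w):
    """Arithmetic statistics from the log mean (U) and log std dev (W)."""
    t = math.exp(u)                         # T = EXP(U)
    m = math.exp(u + 0.5 * w ** 2)          # M = EXP(U + 0.5 * W^2)
    cv = math.sqrt(math.exp(w ** 2) - 1.0)  # CV = SQRT(EXP(W^2) - 1)
    s = m * cv                              # S = M * CV
    return {"mean": m, "median": t, "std_dev": s, "cv": cv}


def log_from_arithmetic(m, cv):
    """Log mean (U) and log std dev (W) from arithmetic mean and CV."""
    w = math.sqrt(math.log(1.0 + cv ** 2))      # W = SQRT(LN(1 + CV^2))
    u = math.log(m / math.sqrt(1.0 + cv ** 2))  # U = LN(M / SQRT(1 + CV^2))
    return u, w


# Hypothetical log-space parameters for an EMC data set.
stats = arithmetic_from_log(u=3.0, w=0.75)
print(stats)
print(log_from_arithmetic(stats["mean"], stats["cv"]))
```

Converting to arithmetic statistics and back returns the original U and W, which is a convenient check that the formulas have been entered consistently.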


A4 HYPOTHESIS TESTING
Hypothesis testing is performed using statistical procedures to measure the significance of a particular effect (e.g., TSS concentration or station location). Statistical analysis is used to determine whether a particular mathematical model describes the pattern of variability in the data set better than a "random" model. Two types of models are commonly used. Respectively, they state that:
 There is a significant, mathematical relationship between a change in the magnitude of one variable to that of another variable (e.g., total suspended solids and total zinc concentrations in stormwater runoff).
 There is a significant effect of a treatment on the magnitude of a variable (e.g., an effect of station location or monitoring year or input/output of a BMP on total zinc concentration in stormwater runoff).
These hypotheses are tested using the tools of Correlation Analysis and Analysis of Variance (ANOVA), respectively. The following steps are common to both procedures:
 Formulate the hypothesis to be tested, called the null hypothesis (Ho)
 Determine the test statistic
 Define the rejection criterion for the test statistic
 Determine whether the calculated value of the test statistic falls above or below the rejection criterion
Test statistics, significance levels, and rejection criteria are described below.
Test Statistics. The sum of squares (of the deviations of the measurements from the mean) is used as a measure of the amount of variability in the data set that is explained by the statistical model. The total sum of squares can be decomposed into a portion due to variation among treatment groups ("sum of squares for treatments") and a portion due to variation within groups ("sum of squares for error"). The "mean square" for each source is calculated by dividing the sum of squares for that source (treatment, error, or total) by the number of degrees of freedom for that source. This "normalizes" the variability from one source for comparison with the variability from another. The "F ratio" is then calculated as the ratio of the mean square for treatments to the mean square for error (the unexplained variability). If treatments have only a small effect on the variable of interest, then the portion of the total variability due to variation between groups will be small relative to the portion within groups, and the F ratio will be small.
The probability that a given F ratio could be generated by a random model (i.e., by chance alone) is measured by the parameter "P > F," where "F" is the test statistic. A "P > F" value of 0.10, for example, would mean that an F ratio at least as large as the one observed could have been generated 10% of the time by chance alone. The effect of treatments is said to be "significant" if this probability is less than the chosen significance level (alpha), which is commonly set at 0.05.
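The decomposition of the sums of squares can be illustrated for the simplest case, a single treatment factor (one way ANOVA). The sketch below uses a hypothetical function name and hypothetical data; it computes only the F ratio and its degrees of freedom. Converting the F ratio to a "P > F" probability requires the F distribution, which is provided by statistical software and is not reproduced here.

```python
def f_ratio(groups):
    """One way ANOVA F ratio for a list of treatment groups.

    Returns (F, df_treatments, df_error).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_treat = 0.0  # variation among treatment group means
    ss_error = 0.0  # variation within groups
    for g in groups:
        group_mean = sum(g) / len(g)
        ss_treat += len(g) * (group_mean - grand_mean) ** 2
        ss_error += sum((x - group_mean) ** 2 for x in g)
    df_treat = k - 1
    df_error = n - k
    ms_treat = ss_treat / df_treat  # mean square for treatments
    ms_error = ss_error / df_error  # mean square for error
    return ms_treat / ms_error, df_treat, df_error


# Hypothetical log-transformed concentrations at three stations.
print(f_ratio([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]))
# F = 13.0 with 2 and 6 degrees of freedom
```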
Significance Levels. It is important to realize that statistical tests are not absolutely conclusive. There is always some degree of risk that one of two types of error will be committed:
 Rejection of a true hypothesis (Type I error); or
 Failure to reject a false hypothesis (Type II error).
If a calculated test statistic meets the rejection criterion, then reject the null hypothesis; otherwise, continue to assume that the null hypothesis is correct. The probability of committing a Type I error is denoted by the Greek symbol alpha (α), that of committing a Type II error by beta (β). Alpha is also called the "significance level of the test" (i.e., the probability of rejecting a true hypothesis). Common values for alpha are 0.10, 0.05, and 0.01. As the value of alpha decreases, the confidence in the test increases. However, at the same time, the probability of committing a Type II error (beta) also increases. Therefore, setting alpha too low will result in too strict a test, which will reduce the chance of rejecting a true hypothesis, but fail to reject many false ones. Statistical tests of runoff data generally use a target alpha of 0.05 or a 95% level of confidence.
Correlation Analysis. Correlation analysis considers the linear relationship between two variables. Correlation analysis can be used to identify parameters that may explain or reduce some of the variability inherent in the process of statistical hypothesis testing, but a correlation doesn't necessarily imply a cause and effect relationship. Correlation is expressed on a scale from -1 to 1, with 1 representing perfect correlation; -1 representing perfect inverse correlation; and 0 representing no correlation.
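A linear (Pearson) correlation coefficient can be computed as sketched below. The function name and data are hypothetical, and the choice of the Pearson coefficient is an assumption; the text does not name a specific coefficient. For lognormal data, the calculation would ordinarily be applied to the log-transformed values.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient between paired observations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired TSS (mg/L) and total zinc (ug/L) results.
tss = [20.0, 45.0, 60.0, 110.0, 150.0]
zinc = [35.0, 60.0, 70.0, 140.0, 160.0]
print(correlation(tss, zinc))
```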
Two way Analysis of Variance. ANOVA is a statistical technique used to assess the effects of different treatments on a particular water quality parameter and to determine whether the effects of different levels of each treatment are significantly different from each other. For example, a two way ANOVA can be used to determine the relationship between effects of the treatments station location and monitoring year on the total concentration of a parameter of interest. The ANOVA model tests whether:
 Stations differ from each other across all monitoring years; and
 Monitoring years differ from each other across all stations.
In addition, by testing for interactions in the station and year combinations, the model tests whether monitoring year influences the total zinc concentration at each station equally. In this approach, the null hypothesis states that there are no significant effects of station location or monitoring year on total zinc concentrations in stream samples. The two way ANOVA is used to determine whether the null hypothesis can be rejected, indicating that significant differences between treatment effects were observed. If the null hypothesis is rejected, additional analyses are conducted to identify which of the stations or monitoring years were significantly different from each other.
Checking Assumptions. Two tests must be performed before the results of the ANOVA can be considered valid. These tests, performed on the "residuals" (i.e., that portion of the variability in the data set that is not explained by the statistical model), are used to check the validity of two important assumptions:
 Data were normally distributed; and
 Variability was homogenous across treatment effects (e.g., stations and years).
The degree to which the residuals are normally distributed is checked by performing the W test. The homogeneity of the variances is checked using Levene's test for absolute values of residuals. To perform this test, the absolute values of the residuals from the ANOVA are used in a new two way ANOVA as the response (y) values. The assumption of homogeneity is satisfied if no significant station or year effects are detected (i.e., Prob > F is greater than alpha for all effects).
Nonparametric Analysis of Variance
If the assumptions of a parametric ANOVA cannot be met or if the proportion of nondetects in the data set exceeds approximately 15%, a Kruskal-Wallis nonparametric ANOVA can be used to examine hypotheses regarding significant differences in constituent concentrations between outfalls and between years. The nonparametric ANOVA evaluates the ranks of the observed concentrations within each treatment. Nondetects are treated as tied values and are assigned an average rank. If a significant difference between treatments is detected, a nonparametric multiple comparison procedure can be used to determine which treatments are heterogeneous. It should be noted that in general, nonparametric methods are less powerful than their parametric counterparts, reducing the likelihood that a (true) significant difference between treatments will be detected.
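The rank-based test statistic can be sketched as follows. This is an illustration only: the function names are hypothetical, and nondetects are assumed to be entered at a single common value below all detected results so that they tie and share an average rank. The statistic H is compared with a chi-square distribution with (number of treatments - 1) degrees of freedom, a lookup that is left to statistical software or tables.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (N^3 - N) over tied groups.
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    correction = 1.0 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0.0:
        raise ValueError("all values are tied; H is undefined")
    return h / correction


# Hypothetical concentrations at three outfalls; 0.5 marks nondetects.
print(kruskal_wallis_h([[0.5, 0.5, 1.2], [0.5, 2.0, 3.1], [4.0, 5.5, 6.3]]))
```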
A5 CHARACTERIZATION OF STORMWATER DISCHARGES
The characterization of runoff provides both qualitative and quantitative overviews of a storm event. The qualitative analysis for each monitored event should include a narrative describing the timing and nature of the field activities. The narrative should include, at a minimum:
 Station identification;
 Date of storm event;
 Names of field personnel;
 Time precipitation started and ended (if known), times samples were taken, time monitoring ended; and
 Information regarding any problems encountered and changes to the sampling protocol that can affect the interpretation of the data.
After writing the narrative, graph the hydrologic data (flow and precipitation). Examine the graphs for patterns in the timing and intensity of runoff relative to those of precipitation. After sampling a minimum of three or four storms, calculate summary statistics from the analytical results (Section A3). Use these results to determine whether the data set is sufficiently robust to support statistical hypothesis testing. If not, additional monitoring at selected locations in order to obtain more data may be warranted.
Stormwater Discharge and Rainfall Information
Produce a hydrograph for each storm, displaying storm duration on the horizontal (X) axis and discharge rate on the vertical (Y) axis. Rainfall should be plotted on the same graph (or in a different graph on the same page). The collection times of the subsamples used for compositing should be noted on the horizontal axis of each plot. Analysis of these graphs for data gaps and outlying (i.e., extreme) data points may provide some information about the functioning of the automated equipment during the storm. Outliers should be rejected from the data set for the purpose of statistical analysis if the cause of their behavior can be identified (e.g., poor QA/QC of a particular data point, poor storm capture, etc.).
Typical Applications of Hypothesis Testing to Characterization Data
Typical applications of statistical testing procedures to discharge quality data include determining whether any of the following are significant:
 Differences between stations;
 Differences between monitoring years; and
 Correlations between different water quality parameters.
A6 COMPARISON TO STATE AND FEDERAL WATER QUALITY OBJECTIVES
The validated analytical results for samples from piped or open channel drainage systems from an individual storm event can be compared to water quality criteria for the protection of aquatic life under acute (short term) conditions. Although the pipe or open channel (in many cases) is not a receiving water body that supports beneficial uses, comparison to criteria can provide an indication of potential toxicity. For parameters other than metals, this will entail a simple comparison of the observed grab or flow weighted composite concentration and the corresponding criterion. The toxicity of several trace metals increases as hardness decreases. Consequently, the acute criteria for most metals must be calculated for each sample based on the hardness measured in the sample. The equations to be used for these calculations are contained in the state water quality standards regulations (WAC 173-201A-040).
Surface water criteria have not been developed for some parameters on the priority pollutant list. Moreover, in many cases there are no state criteria for conventional parameters. It may be appropriate to compare the results for these parameters to other benchmarks, such as mean or median values from the Nationwide Urban Runoff Program (USEPA 1983), or more recently collected local or regional data, to identify potential pollutants of concern.
If the initial statistical analysis indicates that the data set is adequate, statistical testing can be conducted to assess the probability that a water quality criterion will be exceeded at a given location. The procedure described on page 17.16 of Maidment (1992) can be used. A minimum of seven samples is generally required to achieve a meaningful result.
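One simple way to estimate an exceedance probability, assuming the concentrations follow a lognormal distribution, is sketched below. This is not necessarily the procedure given in Maidment (1992); it is a hedged illustration with a hypothetical function name and hypothetical data, and it ignores the uncertainty in the estimated log mean and log standard deviation, which can be substantial for small sample sizes.

```python
import math
from statistics import NormalDist, mean, stdev


def exceedance_probability(concentrations, criterion):
    """Estimated probability that a concentration exceeds the criterion.

    Assumption: concentrations are lognormally distributed, so their
    natural logs are normal with mean U and standard deviation W.
    """
    logs = [math.log(c) for c in concentrations]
    u = mean(logs)   # log mean (U)
    w = stdev(logs)  # log standard deviation (W)
    return 1.0 - NormalDist(mu=u, sigma=w).cdf(math.log(criterion))


# Hypothetical dissolved zinc EMCs (ug/L) and a hypothetical criterion.
emcs = [18.0, 25.0, 31.0, 40.0, 52.0, 66.0, 90.0]
print(exceedance_probability(emcs, criterion=100.0))
```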
Pollutant loading estimates may provide an indication of the potential impact of a stormwater discharge on a receiving water body. The calculation of pollutant loads provides a direct quantitative measurement of the pollutants in stormwater discharge to the receiving water. Pollutant loadings can be calculated using either an estimate of flow in an average year (annual load), or flow measured during a specific storm event (instantaneous load). Loadings can be calculated using Schueler's Simple Model (described in USEPA 1992), the SUNOM generated by the Center for Watershed Protection, statistical models, or one of several dynamic models. The simple model estimates the mean pollutant loading from a particular outfall or subbasin to a receiving water. A statistically based model, such as the FHWA model (Driscoll et al. 1990), can be used to characterize the variability of pollutant loading and concentrations, including the expected frequency of exceeding water quality criteria. A dynamic model also can calculate the expected frequency of exceedances. In addition, a dynamic model can account for the variability inherent in stormwater discharge data including variations in concentration, flow rate, and runoff volume. Thus, it can be used to calculate the entire frequency distribution for the concentration of a pollutant and the theoretical frequency distribution (i.e., the probability distribution) for loadings from the outfall or subbasin. This enables the modeler to describe the effects of observed discharges on receiving water quality in terms of the frequency at which water quality standards are likely to be exceeded. Dynamic models include USEPA's Stormwater Management Model (SWMM) and Hydrological Simulation Program-FORTRAN (HSPF), the U.S. Army Corps of Engineers' Storage, Treatment, Overflow, Runoff Model (STORM), and Illinois State Water Survey's Model QILLUDAS (or Auto QI) (USEPA 1992).
Whatever method is used to estimate annual pollutant loadings, an estimate of the event mean concentration (EMC) should be used as input. Note that buildup/washoff functions, which are available in SWMM and several other models, cannot accurately simulate all of the ways pollutants can enter stormwater; thus, the results should be interpreted with caution. The EMC is defined as the constituent mass discharge divided by the flow volume; that is, the pollutant mass per unit of discharge volume. In stormwater monitoring programs, the EMC is estimated from the concentration of a constituent in a flow-weighted composite sample. Studies by Collins and Dickey (1992) demonstrate that the EMC derived from a flow-weighted composite sample provides a good estimate of the true event mean concentration for all but very short, intense storms. During short storms, the automated sampler cannot be programmed to collect a sufficient number of samples to ensure that the results are representative.
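The EMC definition (mass discharged divided by flow volume) can be illustrated with discrete samples taken at equal time intervals. The sketch below is an assumption-laden example rather than a prescribed procedure: the function name, the concentrations, the flow rates, and the 10-minute interval are all hypothetical.

```python
def event_mean_concentration(concentrations_mg_l, flows_l_s, interval_s):
    """EMC (mg/L) = total constituent mass / total flow volume."""
    # Volume (L) represented by each sample: flow rate x time interval.
    volumes_l = [q * interval_s for q in flows_l_s]
    # Mass (mg) carried in each interval: concentration x volume.
    mass_mg = sum(c * v for c, v in zip(concentrations_mg_l, volumes_l))
    return mass_mg / sum(volumes_l)


# Hypothetical storm: three samples taken 10 minutes (600 s) apart.
tss_mg_l = [120.0, 80.0, 40.0]
flow_l_s = [10.0, 30.0, 20.0]
emc = event_mean_concentration(tss_mg_l, flow_l_s, interval_s=600.0)
print(f"EMC = {emc:.1f} mg/L")
```

In this example the flow-weighted EMC is lower than the simple arithmetic mean of the three concentrations (80 mg/L) because the highest concentration coincides with the lowest flow.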
A7 ASSESSING TRENDS IN STORMWATER DISCHARGE QUALITY
Power Analysis
Initial analyses can be used to determine whether statistical tests of hypotheses concerning a data set will be of sufficient power to reveal true differences between treatments (e.g., outfalls or years). Factors that influence the power of a test to detect a difference between treatments include:
 Magnitude of the trend to be detected
 Variability in the data set
 Number of independent samples per treatment
 Desired confidence level for the test
The power of a test to detect a difference of a given magnitude (e.g., 20%) between treatments when it is truly present (equivalent to providing insurance against a "false negative") can be increased by increasing the number of observations in the data set. Most high-powered statistical packages for the personal computer provide the ability to conduct a power analysis or to create a power curve.
If the data are highly variable, the number of samples required to adequately ensure against a false negative test may require financial resources beyond the project scope. Clearly, where historical data are available, this type of analysis is of great benefit to project planning. Where historical data are not available, a power analysis can be conducted after the first year of sampling, preparatory to designing the monitoring program for successive years.
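The relationship among these factors can be sketched with a standard normal-approximation formula for comparing two treatment means. This is only a planning-level illustration: it assumes a two-sided test, equal variances, and equal sample sizes, and a t-based calculation would give slightly larger numbers. The function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_treatment(difference, std_dev, alpha=0.05, power=0.80):
    """Approximate samples per treatment needed to detect `difference`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # guards against false positives
    z_beta = z.inv_cdf(power)               # guards against false negatives
    n = 2.0 * ((z_alpha + z_beta) * std_dev / difference) ** 2
    return math.ceil(n)


# Hypothetical case: detect a 20 mg/L difference (20% of a 100 mg/L mean).
n_moderate = samples_per_treatment(difference=20.0, std_dev=30.0)
n_variable = samples_per_treatment(difference=20.0, std_dev=60.0)
print(n_moderate, n_variable)
```

Doubling the standard deviation roughly quadruples the required number of samples, which is why highly variable data can push the sampling effort beyond the project budget.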
Time Trends
Several statistical methods, both parametric and nonparametric, are available for detecting trends. They include graphical methods, regression methods, the Mann-Kendall test, Sen's nonparametric estimator of slope, the Seasonal Kendall test (Pitt, 1994), and ANOVA. Preliminary evaluations of data correlations and seasonal effects should be made prior to trend analysis. Data correlations are likely if data are collected close together in time or space; such observations can influence one another and do not each provide unique information. Seasonal effects should be removed, or a procedure that is unaffected by data cycles (such as the Seasonal Kendall test) should be selected. The correlation between concentration and flow should be checked by fitting a regression equation to a plot of concentration versus flow. The effect of any such correlation should be subtracted from the data prior to the trend analysis.
Graphical Methods
Plots of trends in constituent concentrations over time can be examined for seasonal or annual patterns:
 Sort the data set by station and sampling date (i.e., the first station and oldest sampling date form the first line of data);
 For each station, select "date" as the x variable and plot the parameter of interest on the y axis; and
 Visually inspect the data for upward or downward trends and note any large "peaks" or "valleys."
Regression Methods
Linear least-squares regression of water quality on time, with a t-test to determine whether the true slope differs from zero, can be used if the data are not cyclic or correlated and are normally distributed.
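The regression approach can be sketched as follows. The example computes the least-squares slope and its t statistic, which is then compared with a tabulated critical value. The function name and the annual mean concentrations are hypothetical, and the sketch assumes the data satisfy the conditions stated above.

```python
import math


def slope_t_statistic(times, values):
    """Return (slope, t statistic, degrees of freedom) for y = a + b*t."""
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    # Residual sum of squares around the fitted line.
    sse = sum((y - (intercept + slope * t)) ** 2
              for t, y in zip(times, values))
    std_err = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    return slope, slope / std_err, n - 2


# Hypothetical annual mean zinc concentrations (ug/L) over six years.
years = [0, 1, 2, 3, 4, 5]
zinc = [10.0, 9.2, 9.1, 8.0, 7.9, 7.0]
slope, t_stat, dof = slope_t_statistic(years, zinc)
print(f"slope = {slope:.3f} ug/L per year, t = {t_stat:.2f}, df = {dof}")
```

For 4 degrees of freedom, the two-sided critical t value at the 5% level is 2.776; a t statistic larger in absolute value indicates a slope significantly different from zero.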
Mann-Kendall Test
This test is useful when data are missing. It can consider multiple data observations per time period, and it enables examination of trends at multiple stations and comparison of trends between stations. Seasonal cycles and other data relationships (such as flow versus concentration correlation) affect this test and must be corrected. If data are highly correlated, the test can be applied to median values in discrete time groupings.
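A minimal sketch of the Mann-Kendall calculation is given below. It assumes one observation per time period, no tied values (so no tie correction is applied to the variance), and enough observations for the normal approximation to be reasonable. The function name and the data are hypothetical.

```python
import math
from statistics import NormalDist


def mann_kendall(values):
    """Return (S, Z, two-sided p) for a series in time order."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)   # +1 increase, -1 decrease, 0 tie
    var_s = n * (n - 1) * (2 * n + 5) / 18.0  # variance of S without ties
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)     # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return s, z, p


# Hypothetical concentrations from ten successive sampling periods.
series = [12.0, 11.5, 11.8, 10.9, 10.2, 10.5, 9.8, 9.1, 9.4, 8.7]
s_stat, z_stat, p_value = mann_kendall(series)
print(f"S = {s_stat}, Z = {z_stat:.2f}, p = {p_value:.4f}")
```

A negative S indicates a downward trend; here the small p value would lead to rejecting the hypothesis of no trend at the 5% level.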
Sen's Nonparametric Estimator of Slope
This is a nonparametric procedure based on rank that estimates the magnitude of a trend. It is not sensitive to extreme values, gross data errors, or missing data (Gilbert 1987).
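Sen's estimator is commonly computed as the median of the slopes between all pairs of observations. The sketch below illustrates that calculation; the function name and the data, which include one deliberately extreme value, are hypothetical.

```python
from statistics import median


def sen_slope(times, values):
    """Median of the slopes between every pair of observations."""
    slopes = [
        (values[j] - values[i]) / (times[j] - times[i])
        for i in range(len(times))
        for j in range(i + 1, len(times))
        if times[j] != times[i]   # skip pairs sampled at the same time
    ]
    return median(slopes)


# Hypothetical yearly values with one gross outlier in year 3.
years = [1, 2, 3, 4, 5]
lead = [10.0, 9.0, 50.0, 7.0, 6.0]
print(sen_slope(years, lead))
```

Despite the outlier, the estimated slope reflects the steady decline of one unit per year shown by the other four observations.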
Seasonal Kendall Test
This method is preferred to most regression methods if the data are skewed, serially correlated, or cyclic (Gilbert 1987). It can be used for data sets having missing values, tied values, censored values (below detection limits) and single or multiple data observations in each time period. Data correlations and dependence must be considered in the analysis (Pitt, 1994).
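The idea behind the Seasonal Kendall test is to compute the Mann-Kendall statistic separately within each season and then combine the results, so that differences between seasons are not mistaken for a trend. The sketch below is a simplified illustration only: it assumes one value per season per year, no ties or missing values, and independence between seasons (no covariance term). The function names and the data are hypothetical.

```python
import math
from statistics import NormalDist


def kendall_s(values):
    """Mann-Kendall S statistic for one series in time order."""
    s = 0
    for i in range(len(values) - 1):
        for j in range(i + 1, len(values)):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)
    return s


def seasonal_kendall(data_by_season):
    """Sum S and Var(S) over seasons; return (S, Z, two-sided p)."""
    total_s = 0
    total_var = 0.0
    for series in data_by_season.values():
        n = len(series)
        total_s += kendall_s(series)
        total_var += n * (n - 1) * (2 * n + 5) / 18.0
    if total_s > 0:
        z = (total_s - 1) / math.sqrt(total_var)
    elif total_s < 0:
        z = (total_s + 1) / math.sqrt(total_var)
    else:
        z = 0.0
    return total_s, z, 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical five-year record, one value per season per year.
record = {
    "wet": [20.0, 18.0, 17.0, 15.0, 14.0],
    "dry": [8.0, 9.0, 7.0, 6.0, 5.0],
}
sk_s, sk_z, sk_p = seasonal_kendall(record)
print(f"S = {sk_s}, Z = {sk_z:.2f}, p = {sk_p:.4f}")
```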
Analysis of Variance
ANOVA can be used to detect significant differences in stormwater quality at two or more monitoring events. Refer to Section A4 for a detailed description of ANOVA.
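As a brief illustration of the calculation, the sketch below computes the one-way ANOVA F statistic for several groups of observations. The function name and the values are hypothetical, and the usual ANOVA assumptions (independent, approximately normal data with similar variances) are taken for granted.

```python
def one_way_anova_f(groups):
    """Return (F, df between groups, df within groups)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical (e.g., log-transformed) results from three monitoring events.
events = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(events)
print(f"F = {f_stat:.1f} with ({df_b}, {df_w}) degrees of freedom")
```

For (2, 6) degrees of freedom, the tabulated critical F value at the 5% level is 5.14, so an F statistic above that value indicates a significant difference among events.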
A8 ASSESSING THE EFFECTIVENESS OF BEST MANAGEMENT PRACTICES
Existing BMPs
The effectiveness of existing BMPs can be qualitatively evaluated by comparing sampling results for drainage basins with BMPs to results for basins without BMPs. After a minimum of three sampling events, an exploratory data analysis (Section A2) can be conducted to determine whether the use of statistical methods to detect significant differences between sampling locations is appropriate. Alternatively, it may be necessary to collect more data (i.e., sample during additional storms) before statistical methods can be applied.
Statistical analysis of water quality data for locations with and without BMPs is performed using the ANOVA procedures described in Section A3. As described above, the data set will consist of stormwater samples collected from each location during three or more storm events. Ideally, the locations being compared will be sampled during the same storm events.
Data on constituent concentrations or pollutant loadings from several locations with or without a BMP can be pooled for comparison. Pooling data under each treatment makes the data set more robust by capturing more of the potential variability while sampling the same number of storms. Pooled drainage basins should be similar in most respects. Data from markedly different drainage basins should not be pooled, even if both locations have the same BMP. Correlation analysis can be performed to determine if metals concentrations are highly correlated with TSS.
Future BMPs
To evaluate the effectiveness of BMPs not yet in place, water samples collected prior to BMP implementation can be quantitatively compared to samples collected at the same location after BMP implementation. If the data set appears to follow a normal or lognormal distribution and does not contain a high proportion of non-detects, the Student's t-test should be used to determine whether "post-BMP" water quality differs significantly from "pre-BMP" water quality. If the data set does not appear to follow a normal distribution and/or contains a high proportion of non-detects, nonparametric methods should be used to test for significant differences.
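The parametric comparison can be sketched as a pooled-variance (Student's) t-test on log-transformed concentrations, which suits data that appear lognormal. The sketch assumes independent samples with similar variances and no non-detects. The function name and the concentrations are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t_on_logs(pre, post):
    """Return (t statistic, degrees of freedom) for log-transformed data."""
    log_pre = [math.log(x) for x in pre]
    log_post = [math.log(x) for x in post]
    n1, n2 = len(log_pre), len(log_post)
    # Pooled sample variance of the two groups.
    pooled_var = ((n1 - 1) * variance(log_pre) +
                  (n2 - 1) * variance(log_post)) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean(log_pre) - mean(log_post)) / std_err, n1 + n2 - 2


# Hypothetical TSS concentrations (mg/L) before and after a BMP is installed.
pre_bmp = [100.0, 120.0, 90.0, 110.0, 130.0]
post_bmp = [60.0, 70.0, 55.0, 80.0, 65.0]
t_value, t_dof = pooled_t_on_logs(pre_bmp, post_bmp)
print(f"t = {t_value:.2f} with {t_dof} degrees of freedom")
```

For 8 degrees of freedom, the two-sided critical t value at the 5% level is 2.306; a larger absolute t value indicates a significant difference between pre-BMP and post-BMP water quality.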
APPENDIX A REFERENCES
Collins, M.A. and R.O. Dickey. 1992. Observations on Stormwater Quality Behavior. Invited Paper, Proceedings of the 1992 state/USEPA Water Quality Data Assessment Seminar, USEPA Region VI.
Driscoll, E., P.E. Shelley, and E.W. Strecker. 1990. Pollutant Loadings and Impacts from Highway Stormwater Runoff, Volume III: Analytical Investigation and Research Report. FHWA-RD-88-008.
Gilbert, Richard O. 1987. Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold Co.: New York, NY.
Maidment, D.R. 1992. Handbook of Hydrology. McGraw-Hill, Inc.: New York, NY.
Newman, M.C. and P.M. Dixon. 1990. UNCENSOR: A program to estimate means and standard deviations for data sets with below detection limit observations. American Environmental Laboratory 2(2):26-30.
Pitt, Robert. 1994. Detecting Water Quality Trends from Stormwater Discharge Reductions. Proceedings of the Engineering Foundation Conference on Stormwater Monitoring. August 7-12. Crested Butte, CO.
Shapiro, S.S. and Wilk, M.B. 1965. "An Analysis of Variance Test for Normality (Complete Samples)," Biometrika, 52.
U.S. Environmental Protection Agency. 1983. Final Report of the Nationwide Urban Runoff Program (NURP). Water Planning Division, Washington, D.C.
U.S. Environmental Protection Agency. 1992. Guidance Manual for the Preparation of Part 2 of the NPDES Permit Applications for Discharges from Municipal Separate Storm Sewer Systems. USEPA 833-B-92-002, November 1992.
Vogel, R.M. 1986. "The Probability Plot Correlation Coefficient Test for Normal, Lognormal, and Gumbel Distributional Hypotheses," Water Resour. Res., Vol. 22, No. 4, pp. 587-590.
