Lecture 10


In this discussion, we'll focus on correlation.

Correlation can be thought of as "strength of relationship" -- that is, if two things are very correlated, they are strongly associated with each other, strongly related to each other. You can generally predict one with a high degree of accuracy if you know the other. So, for instance, class attendance among undergraduates and grades are often highly correlated.

However, keep in mind that correlation is merely association--not causation.

Correlation is measured with a "correlation coefficient", which ranges from -1 to 1.

Scatterplots--where data are plotted based on two variables, X and Y--are useful ways to graphically illustrate how correlated two variables are. In a positive correlation the points trend upward from left to right; in a negative correlation they trend downward (see the sketch below).
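The examples originally linked here aren't reproduced, but a minimal Python sketch like the following (assuming numpy and matplotlib are installed; the data are invented for illustration) draws one positively and one negatively correlated scatterplot:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 50)

    # Positive correlation: y tends to rise with x.
    y_pos = 2 * x + rng.normal(0, 2, 50)
    # Negative correlation: y tends to fall as x rises.
    y_neg = -2 * x + rng.normal(0, 2, 50)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
    ax1.scatter(x, y_pos)
    ax1.set_title(f"Positive: r = {np.corrcoef(x, y_pos)[0, 1]:.2f}")
    ax2.scatter(x, y_neg)
    ax2.set_title(f"Negative: r = {np.corrcoef(x, y_neg)[0, 1]:.2f}")
    plt.show()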


I. Example #1

Consider the following example, taken from Lucy (2006) (originally from Grim (2002)): data on the average molecular weight of the dye methyl violet versus UV irradiation time, from an accelerated aging experiment.

Time (min)    Weight (Da)
  0.0         367.20
 15.3         368.97
 30.6         367.42
 45.3         366.19
 60.2         365.91
 75.5         365.68
 90.6         365.12
105.7         363.59

A scatterplot showing the correlation between these two variables has the points trending downward from left to right--the weight falls as irradiation time increases.
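If you'd like to draw the plot yourself, here is a minimal sketch (assuming Python with numpy and matplotlib; the data are the eight pairs tabulated above):

    import numpy as np
    import matplotlib.pyplot as plt

    # Data from Lucy (2006), as tabulated above.
    time = np.array([0.0, 15.3, 30.6, 45.3, 60.2, 75.5, 90.6, 105.7])
    weight = np.array([367.20, 368.97, 367.42, 366.19, 365.91, 365.68,
                       365.12, 363.59])

    plt.scatter(time, weight)
    plt.xlabel("UV irradiation time (min)")
    plt.ylabel("Average molecular weight (Da)")
    plt.show()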
The formula for the correlation coefficient is:

        r =       Σ (X - mean X) (Y - mean Y)
              ________________________________________
              √ [ Σ (X - mean X)² × Σ (Y - mean Y)² ]

The numerator in this formula looks like the variance formula that we've seen for a single variable--but it represents the covariance, which is essentially a measure of how much two variables vary together.

The correlation is essentially a standardized version of the covariance--it is the covariance adjusted for the standard deviations of x and y.
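As a concrete illustration of that standardization, here is a minimal sketch (assuming Python with numpy; np.cov and np.std with ddof=1 give the sample covariance and sample standard deviations):

    import numpy as np

    time = np.array([0.0, 15.3, 30.6, 45.3, 60.2, 75.5, 90.6, 105.7])
    weight = np.array([367.20, 368.97, 367.42, 366.19, 365.91, 365.68,
                       365.12, 363.59])

    # Covariance: how much the two variables vary together.  np.cov
    # returns a 2x2 matrix; the off-diagonal entry is cov(x, y).
    cov_xy = np.cov(time, weight)[0, 1]

    # Dividing by both standard deviations "standardizes" the
    # covariance into the correlation coefficient.
    r = cov_xy / (np.std(time, ddof=1) * np.std(weight, ddof=1))
    print(r)                                # ≈ -.897
    print(np.corrcoef(time, weight)[0, 1])  # same value, built in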

What is "r" in example #1? We can calculate out the mean of time as 52.9; we can calculate out the mean of weight as 366.26. Given that,

Time (min)   X - mean X   (X - mean X)²   Weight (Da)   Y - mean Y   (Y - mean Y)²   (X - mean X)(Y - mean Y)
  0.0          -52.90        2798.41        367.20          .94           .883             -49.72
 15.3          -37.61        1414.51        368.97         2.71          7.344            -101.92
 30.6          -22.33         498.63        367.42         1.16          1.345             -25.90
 45.3           -7.61          57.91        366.19         -.07           .005                .53
 60.2            7.33          53.73        365.91         -.35           .122              -2.57
 75.5           22.61         511.21        365.68         -.58           .336             -13.11
 90.6           37.67        1419.03        365.12        -1.14          1.300             -42.94
105.7           52.84        2792.06        363.59        -2.67          7.129            -141.08
Sum                          9545.49                                    18.465            -376.72


The numerator for "r" is -376.72.
The denominator for "r" is the square root of (18.465 * 9545.49), or 419.83.

The "r" correlation, therefore, is -.8973.

It is negative because as time increases, weight decreases.
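If you want to check the hand calculation by machine, here is a minimal sketch (assuming Python with numpy; tiny discrepancies from the table come from the table's intermediate values being rounded to two decimal places):

    import numpy as np

    time = np.array([0.0, 15.3, 30.6, 45.3, 60.2, 75.5, 90.6, 105.7])
    weight = np.array([367.20, 368.97, 367.42, 366.19, 365.91, 365.68,
                       365.12, 363.59])

    dx = time - time.mean()      # X - mean X   (mean X = 52.9)
    dy = weight - weight.mean()  # Y - mean Y   (mean Y = 366.26)

    numerator = np.sum(dx * dy)                           # ≈ -376.6
    denominator = np.sqrt(np.sum(dx**2) * np.sum(dy**2))  # ≈ 419.7
    print(numerator / denominator)                        # ≈ -.8972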


II. Significance Testing for Correlations

We can use t-tests to test for significance of a sample correlation.

We calculate

        t =          r × √ df
                   _________________
                      √ (1 - r²)

We've actually used up two pieces of information--we've estimated two means (and their standard deviations). (You can also think of this in terms of needing, or "using up", two data points to pin down a line.) So now our "degrees of freedom" are n - 2.

So, in this case, the t statistic would be [(-.8973) × (√6)] / √(1 - .8973²) = -4.98.

With 6 degrees of freedom, we see that 95% of the t distribution's area is within plus or minus 2.447. Our value of -4.98 is beyond -2.447, so we can say that the linear correlation coefficient is significant at 95% confidence. Indeed, our t tells us that the correlation coefficient is significant even at the 99% level, because the "critical value" of the t at 99% is 3.707--that is, 99% of the area under the t-curve falls between -3.707 and 3.707. Another way to think about this: there's well under a .01 chance that we would get a t this large in magnitude if our null hypothesis of "no correlation" were true in the population--if these two variables weren't truly associated with each other in the population.
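If you'd rather let software do this, here is a minimal sketch of the same test (assuming Python with numpy and scipy; stats.t.sf and stats.t.ppf are scipy's survival-function and quantile calls for the t distribution):

    import numpy as np
    from scipy import stats

    r = -0.8973   # correlation from example #1
    n = 8         # number of (time, weight) pairs
    df = n - 2    # two means estimated, so n - 2 degrees of freedom

    t = r * np.sqrt(df) / np.sqrt(1 - r**2)
    print(t)  # ≈ -4.98

    # Two-tailed p-value, and the 95% critical value for df = 6.
    p = 2 * stats.t.sf(abs(t), df)
    print(p)                       # ≈ .0025, well below .05
    print(stats.t.ppf(0.975, df))  # ≈ 2.447, the critical value used above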

There are three major limitations of Pearson's correlation:



III. Correlation Coefficients for Non-Linear Data

One complicating issue is that the correlation coefficient "r" discussed so far (called Pearson's correlation coefficient) only measures linear relationships. Therefore, if you have two variables that are related, but in a non-linear fashion, you may get a deceptively low r and (in error) fail to reject the null hypothesis. In other words, you may have a real relationship, but Pearson's r fails to give evidence of that relationship. The sketch below illustrates this.
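Here is a minimal sketch of this failure mode (assuming Python with numpy; the quadratic data are invented for illustration): y is determined exactly by x, yet Pearson's r comes out essentially zero.

    import numpy as np

    # A perfect but non-linear (quadratic) relationship.
    x = np.linspace(-5, 5, 101)
    y = x**2

    # Pearson's r only detects *linear* association, so it comes out
    # essentially zero here despite the perfect relationship.
    print(np.corrcoef(x, y)[0, 1])  # ≈ 0.0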

In order to account for non-linear relationships, you have two options: