Correlation Analysis

When looking at the relationship between two variables measured on the objects in a sample, correlation is the appropriate approach when we are interested in the strength of the association between the variables but cannot assume causality (that is, we have two potentially interdependent variables, not one independent and one dependent variable).  The question addressed by correlation analysis is the extent to which two variables covary.  In graphical terms, this amounts to asking how closely the points on a scatterplot fall to an imaginary line drawn through the long axis of the scatter.  It is important to remember that we are not saying anything about the line itself (slope, intercept, etc.), only about where the points lie in relation to that line.
 
 


 
 
Principles of Correlation 

To measure the degree of correlation, we compute one of several correlation coefficients, which are measures of the tendency for two variables to change together.  Correlation coefficients range from -1.0 to +1.0.  A correlation of +1.0 indicates that the variables always change together and in the same direction (perfect positive correlation), while a value of -1.0 indicates a perfect negative correlation, where larger values of one variable are always associated with smaller values of the other, and vice versa.  A correlation of 0.0 indicates that the variables show no consistent tendency to change together (they are uncorrelated, or show no joint dependence).  Values between these extremes represent different degrees of positive and negative correlation.

Graphically, a perfect correlation implies that the points fall along an imaginary line of some non-zero slope, whereas completely uncorrelated variables generate a scatterplot that is circular.  Thus, the correlation coefficient measures how elongated (elliptical) the scatter of points is.  Points falling along a line of zero slope, however, are also uncorrelated, because one of the variables shows no variance (and thus cannot covary with the other).

While there are multiple types of correlation coefficients, there are two that are used most commonly.  Both of these depend on computing the product of the two deviations of X1 and X2 from their respective means.
 
 


 
 
Pearson Correlation Coefficient

The Pearson correlation coefficient is a parametric statistic, which assumes (1) a random sample, (2) that both variables are measured on an interval or ratio scale, (3) that both variables are more or less normally distributed, and (4) that any relationship that exists is linear.  To calculate the Pearson correlation, we must first calculate the covariance, which is the sum of the products of the deviations of the two variables from their respective means, divided by n - 1.  The covariance (cov) is calculated as

cov(X1,X2) = 1/(n-1) * SUM ((X1i - X1 bar)(X2i - X2 bar))

While the covariance shows the same tendencies as the correlation, its actual value depends on the original units of measurement (so cov can range from negative to positive infinity).  We would like to standardize the covariance so that we can compare correlations among pairs of variables measured on different scales.  To do this, we divide the covariance by the product of the standard deviations of the two variables to generate the Pearson correlation coefficient (rp), as

rp = cov (X1,X2) / (SX1 * SX2)

It is important to remember that r is not a test of significance, just a measure of the degree of association. 
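
The covariance and Pearson formulas above can be sketched in plain Python as follows.  This is a minimal illustration, not a reference implementation: the function names and the measurements (lengths and masses) are invented for the example.  It also shows that rescaling one variable changes the covariance but leaves rp unchanged.

```python
import math

def covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Covariance divided by the product of the sample standard deviations."""
    s_x = math.sqrt(covariance(x, x))  # cov(X, X) is the sample variance of X
    s_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)

# Hypothetical measurements, invented for illustration only
length_cm = [1.0, 2.0, 3.0, 4.0, 5.0]
mass_g = [2.0, 4.0, 5.0, 4.0, 5.0]
length_mm = [10 * v for v in length_cm]  # same lengths in different units

print(covariance(length_cm, mass_g))  # 1.5
print(covariance(length_mm, mass_g))  # 15.0 (depends on the units)
print(pearson_r(length_cm, mass_g))   # about 0.7746
print(pearson_r(length_mm, mass_g))   # about 0.7746 (units cancel out)
```

Note that pearson_r divides by the standard deviations, so it fails with a division by zero if either variable has no variance.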



 
 
Spearman Correlation Coefficient

The Spearman correlation is nonparametric, and is also known as a rank correlation, as it is conducted on the ranks of the observations for data that are at least ordinal.  Specifically, this correlation evaluates the differences in ranks of an object that is ranked for two different variables.  So, the sample of objects is ranked twice (once for each of the variables for which the correlation is to be assessed), and the difference in the ranks is calculated for each object.  The Spearman correlation from these data is given by

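rs = 1 - (6 * SUM(d^2)) / (n * (n^2 - 1))

where d^2 = (RX1 - RX2)^2 is the squared difference between an object's ranks on the two variables, and n is the number of objects.  If the rank order is the same for both variables, then the correlation is perfectly positive (1.0); if the rank order is exactly reversed, it is perfectly negative (-1.0).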
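
As a rough sketch of the ranking procedure described above, the following Python ranks a sample twice and applies the rank-difference formula.  The function names and data are hypothetical, and the simple ranking helper assumes there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n. Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rs(x, y):
    """Spearman correlation from the squared differences in ranks."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

# Hypothetical data, invented for illustration only
x1 = [10.2, 3.5, 7.1, 8.8, 1.0]
x2 = [30.0, 12.0, 25.0, 20.0, 15.0]

print(ranks(x1))                           # [5, 2, 3, 4, 1]
print(ranks(x2))                           # [5, 1, 4, 3, 2]
print(spearman_rs(x1, x2))                 # 0.8
print(spearman_rs(x1, x1))                 # 1.0 (identical rank order)
print(spearman_rs(x1, [-v for v in x1]))   # -1.0 (reversed rank order)
```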

As with the Pearson correlation, we do not know from the value of r alone whether the observed correlation is significant.  For either type of correlation, we can test the null hypothesis that the true correlation is zero (that is, that the variables are uncorrelated) by calculating

t = r * SQRT((n - 2) / (1 - r^2))

and comparing this to a critical t at the 0.05 level with n - 2 degrees of freedom (n - 2 since one degree of freedom is lost for each variable).
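
The test statistic can be sketched in Python as shown below.  The values of r and n are hypothetical, and the sketch assumes a two-tailed test; the critical value of 3.182 is the standard table entry for a two-tailed 0.05 level with 3 degrees of freedom.

```python
import math

def correlation_t(r, n):
    """t statistic for a correlation r from n objects (n - 2 degrees of freedom).

    Not defined for r = 1.0 or r = -1.0, where the denominator is zero.
    """
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical example: r = 0.8 from a sample of n = 5 objects
r, n = 0.8, 5
t = correlation_t(r, n)
critical_t = 3.182  # assumed two-tailed 0.05 critical value, 3 degrees of freedom

print(round(t, 4))          # 2.3094
print(abs(t) > critical_t)  # False: cannot reject the null hypothesis
```

With so few objects, even a correlation as large as 0.8 is not significant at the 0.05 level in this example.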

 
