# Common Mistakes in the Application of Statistics in Social Science Research – I: Interpretation of the Correlation Coefficient

Reviews of social science research reveal common errors in formulating hypotheses, choosing the model(s), and interpreting results, even when the work uses quality data from a primary survey or from authenticated secondary sources. This article attempts to address a few of these mistakes for the benefit of researchers, so that their results are credible to society and later useful for sound policy formulation. The errors chosen for discussion here are those made in interpreting the correlation coefficient.

Correlation measures linear association, not causality

In statistics there is a difference between the words “relation” and “association” that may not exist in everyday English. The term “correlation” contains the word “relation”, but it should be understood only as the degree or strength of (linear) “association” between two quantifiable variables, and not as the strength of “dependence” or “relation” between them. The moment we use the word “relation”, we fall into the trap of assuming “dependence” between the two variables. Correlation treats the two variables symmetrically: neither is an independent variable and neither is a dependent variable. They are simply two variables, and the Pearson product moment correlation coefficient, also called the Pearson correlation coefficient (PCC) or simply the correlation coefficient, measures the strength or degree of linear association between them. The PCC is the covariance between the two variables divided by the product of their standard deviations, and it varies between -1 and +1. It is therefore crucial to use academic judgment in the choice of variables; otherwise the result is a spurious correlation. Oil extraction in OPEC countries and the GDP of importing countries may show a strong correlation, but reading a relation into it may be wrong. “Co-relation” is a misnomer; “association” is the appropriate term.
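The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_r` and the two small data series are hypothetical and not taken from any study. It computes the covariance divided by the product of the standard deviations and shows that swapping the two variables leaves the coefficient unchanged, which is what symmetric treatment means.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance / (sd of x * sd of y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and sample standard deviations (divisor n - 1).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)

# Hypothetical paired observations on two quantifiable variables.
x = [8, 10, 12, 12, 14, 16, 16, 18]
y = [11, 14, 15, 19, 21, 24, 30, 33]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)  # same value: neither variable is "dependent"
print(round(r_xy, 3), round(r_yx, 3))
```

Because the coefficient is identical in both directions, it cannot by itself say which variable, if either, depends on the other.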

Even pronouncing correlation as “co-relation” or “co-related” is misleading. As mentioned earlier, although the word “correlation” contains the word “relation”, it does not measure relation; it measures the strength or degree of linear association between two quantifiable variables. “Quantifiable” means only that both variables must be measured numerically; they may be discrete or continuous. For example, the correlation between weight and height measures only the strength of the association between weight and height; it does not allow the conclusion that a change in weight brings about a change in height, or vice versa. In the social sciences, an example could be the per capita income of individuals and their number of years of schooling. Suppose the correlation between the two is positive and statistically significant. This means only that per capita income and years of schooling are positively associated; it does not mean that more years of schooling lead to higher per capita income, or vice versa. To establish either of these dependencies, the researcher must use regression (simple or multiple).

Steps before correlation

Before calculating the PCC, the researcher must confirm that both variables are measured on an interval or ratio scale. If the two variables are ranks, for example, the PCC should not be calculated; the Spearman rank correlation must be calculated instead. Next, a scatter plot should be examined in MS Excel or other software to confirm that the association between the two variables is linear. If the scatter indicates a curvilinear association, the correlation coefficient should not be calculated; the “correlation ratio” should be calculated instead. Only if the scatter shows a linear association should the PCC, given by the covariance between the two variables divided by the product of their standard deviations, be calculated.
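Two of these checks can be illustrated with a short Python sketch. The helper names (`pearson_r`, `average_ranks`, `spearman_rho`) and the data are hypothetical. Spearman's coefficient is computed here as the Pearson coefficient of the ranks, with tied values given the average of their ranks. The sketch does not compute the correlation ratio; it only shows why a scatter plot matters, since a perfectly curvilinear pattern can produce a PCC of zero.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions i..j are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Curvilinear (U-shaped) hypothetical data: y is fully determined by x,
# yet the linear association measured by the PCC is zero.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [v ** 2 for v in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# Monotonic but non-linear hypothetical data: the ranks agree perfectly.
x_mono = [1, 2, 3, 4, 5]
y_mono = [v ** 3 for v in x_mono]
rho_mono = spearman_rho(x_mono, y_mono)
r_mono = pearson_r(x_mono, y_mono)

print(r_curve, rho_mono, round(r_mono, 3))
```

In the first data set a researcher who skipped the scatter plot would wrongly report "no association"; in the second, the rank correlation is perfect while the PCC is below one because the pattern is not a straight line.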

Interpretation of the correlation coefficient

After calculating the correlation coefficient between X and Y, it should be tested for statistical significance by comparing the calculated t-value with the t-value in the table at a significance level of 5% or 1% and at (n-2) degrees of freedom, where n is the number of pairs of observations on X and Y. If the absolute value of the calculated t-value is greater than the t-value in the table, the correlation coefficient is statistically significant and is interpreted accordingly: the two variables are positively or negatively associated. Otherwise, the data do not support an association between them. For example, one may be interested in finding the correlation (coefficient) between (1) the frequency of smoking and the incidence of lung cancer, (2) scores on statistics and mathematics exams, (3) per capita income and per capita expenditure, (4) crop productivity and irrigation water, (5) fertility rate and per capita income, (6) female labor participation rate and per capita income, etc. In all of these cases, if we find a statistically significant correlation, there is a tendency to conclude: as smoking increases, lung cancer increases; since candidates perform well in mathematics, they will also perform well in statistics; as per capita income increases, the fertility rate decreases; and so on. But such conclusions are unwarranted, since the correlation coefficient does not indicate causality; it indicates only the extent of the association between the two variables. Correlation therefore does not help infer what depends on what; that is the task of regression. Thus, to draw a conclusion about dependence, the researcher must run a regression of the dependent variable on the independent variable. Before running a regression, the researcher should state a hypothesis, name the dependent variable and the independent variable(s), choose a linear or nonlinear model, and then regress the dependent variable on the independent variable(s).
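The significance test described above can be sketched as follows. The calculated t-value uses the standard formula t = r·sqrt(n-2)/sqrt(1-r²); the function names, the example values r = 0.6 and n = 27, and the table value 2.060 (two-tailed, 5% level, 25 degrees of freedom) are illustrative assumptions rather than figures from this article. The table value is passed in by hand, as it would be read from a printed t-table.

```python
import math

def correlation_t_statistic(r, n):
    """Return (t, degrees of freedom) for testing a correlation r from n pairs."""
    if n < 3:
        raise ValueError("at least 3 pairs are needed for n - 2 degrees of freedom")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and +1")
    df = n - 2
    t = r * math.sqrt(df / (1.0 - r ** 2))
    return t, df

def is_significant(r, n, t_table):
    """Compare |calculated t| with the table value chosen by the researcher."""
    t, _ = correlation_t_statistic(r, n)
    return abs(t) > t_table

# Hypothetical example: r = 0.6 from 27 pairs of observations.
t_calc, df = correlation_t_statistic(0.6, 27)
significant = is_significant(0.6, 27, t_table=2.060)
print(round(t_calc, 2), df, significant)
```

A significant result here says only that the two variables are associated; it gives no information about which variable, if any, depends on the other.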

Degrees of freedom

It is crucial to have sufficient degrees of freedom for correlation or regression. Sometimes the correlation is calculated for only 8 pairs of observations, in which case there are 6 degrees of freedom and the two-tailed ‘t’ value in the table at the 5 percent level is 2.45. Unless the calculated ‘t’ value exceeds 2.45, the correlation is not statistically significant. It is therefore crucial to have a large number of observations in order to obtain statistically reliable estimates. Overall, researchers are advised to familiarize themselves thoroughly with “correlation”, a simple measure of association between two variables.
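To see what the small-sample case implies, the t-test can be inverted to give the smallest correlation that would reach significance for a given table value. The sketch below is an illustration under stated assumptions: the function name `minimum_significant_r` is hypothetical, and the table values 2.447 (6 degrees of freedom, which the text rounds to 2.45) and 2.048 (28 degrees of freedom) are two-tailed 5% values entered by hand.

```python
import math

def minimum_significant_r(n, t_table):
    """Smallest |r| whose calculated t exceeds t_table with n - 2 degrees of freedom."""
    df = n - 2
    if df < 1:
        raise ValueError("at least 3 pairs of observations are required")
    # Solving t = r * sqrt(df / (1 - r^2)) for r gives r = t / sqrt(t^2 + df).
    return t_table / math.sqrt(t_table ** 2 + df)

r_min_8 = minimum_significant_r(8, t_table=2.447)    # 8 pairs, 6 df
r_min_30 = minimum_significant_r(30, t_table=2.048)  # 30 pairs, 28 df
print(round(r_min_8, 3), round(r_min_30, 3))
```

With only 8 pairs, a correlation has to be roughly 0.71 or larger in absolute value before it counts as significant at the 5 percent level, whereas with 30 pairs a value of about 0.36 is enough.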

###### Disclaimer

The opinions expressed above are those of the author.
