Help With Correlation Statistical Analysis
What is Correlation in Statistics
Correlation is a statistical technique used to measure the association between two or more quantitative variables observed on the same subjects or units. It describes the degree of association between the variables, including its magnitude and direction. The most common measure, Pearson’s correlation coefficient, assumes that the relationship between the variables is linear.
Scatter plots are used to check whether the data contain outliers, non-linear relationships, or subgroups. The correlation coefficient (r) expresses the degree of association between two variables on a continuous scale. The two main methods used to calculate it are Pearson’s (r) and Spearman’s (ρ, rho). The coefficient of determination (r²) is the proportion of the variation in one variable that is accounted for by its linear relationship with the other.
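As a minimal sketch of how r and r² are obtained, the function below computes Pearson’s coefficient directly from its definition (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name pearson_r and the hours/score data are hypothetical and serve only as an illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from the two means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Note: a constant variable gives sxx or syy of zero, so r is undefined
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: hours studied and exam score for six students
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 71]

r = pearson_r(hours, score)
r_squared = r ** 2  # coefficient of determination
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

For these made-up values r is close to +1, so r² is also close to 1, meaning that most of the variation in score is accounted for by its linear relationship with hours.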
Interpretation of Correlation
The first factor used to interpret correlation is direction, that is, whether the variables increase or decrease together. Two variables are positively correlated if they move in the same direction, that is, both increase or both decrease. They are negatively correlated if one variable increases as the other decreases.
The second factor used to interpret correlation is magnitude, the strength of the relationship between the variables. The correlation coefficient ranges from -1.0 to +1.0. A coefficient of +1.0 or -1.0 indicates a perfect linear relationship, meaning that all points lie on a straight line. A coefficient of zero indicates no linear relationship. The closer the absolute value of the coefficient is to 1, the stronger the relationship between the two variables.
The third factor used to interpret correlation is the degree of association between the variables. The correlation coefficient is calculated using Pearson’s or Spearman’s method. Pearson’s method is used when the data are normally distributed, linearly related, and independent; otherwise, Spearman’s method, which works on the ranks of the values, is used. The coefficient of determination (r²) is obtained by squaring the r-value and gives the proportion of variability shared by the two variables.
Errors in Correlation and Factors Affecting Correlation Analysis
Correlation is often mistaken for causation. However, correlation shows only that the variables are related, not why they are related. For the same reason, correlation analysis does not distinguish between dependent and independent variables.
Scatter plots should be drawn to reveal outliers, non-linearity, and subgroups in the data. An outlier is a value that lies far from the rest of the dataset and can distort the correlation coefficient. Data with a non-linear relationship should not be analysed with a linear correlation coefficient because the result can be misleading.
Subgroups within the data can produce a misleading correlation because the characteristics of the subgroups differ. The sample size should be consistent with the study objective. With a small sample, a real relationship may go undetected and the estimated coefficient is unstable. A medium to large sample makes it possible to detect a smaller correlation coefficient at the chosen type 1 error rate (significance level).
Summary
Correlation analysis is used to measure the association between two or more variables. The correlation coefficient has a maximum magnitude of one, and its sign shows the direction of the correlation. Pearson’s correlation coefficient is suited to data that are normally distributed, while Spearman’s correlation coefficient is a non-parametric method suited to data that are not normally distributed.
Correlation is often mistaken for causation, but it shows only the degree of association and not the reason for the relationship. Factors such as outliers, small sample size, non-linearity, and subgroups may lead to incorrect correlation analysis, false results, or a missed association. It is therefore important to draw a scatter plot to determine whether the data are suitable for correlation analysis.
If the points lie close to a straight line, the data are suitable for correlation analysis. The coefficient of determination shows the proportion of variability shared by the quantitative variables and can be presented in decimal form or converted to a percentage.