Read this tutorial before you use Proc Corr
At some point while examining data, all of us check for correlations among the variables, especially pair-wise correlations.
Many business analysts in industry treat the ‘linear correlation coefficient’ as the only criterion for pair-wise correlation, so at most a Proc Corr is run in SAS to check for it. This is of course useful for finding correlations between continuous variables. However, it more often than not fails on real-life data, which frequently contains all kinds of variables: continuous, binary, multi-level categorical and so on.
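To make that limitation concrete, here is a minimal sketch in plain Python (not SAS, and not the output of Proc Corr). It computes the linear (Pearson) correlation coefficient by hand for a made-up three-level category under two equally arbitrary numeric codings. The data, the level names and the `pearson_r` helper are all assumptions for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: a three-level category and a continuous measurement.
category = ["A", "A", "B", "B", "C", "C"]
cont_var = [10.0, 12.0, 11.0, 13.0, 20.0, 22.0]

# Two arbitrary ways of turning the same labels into numbers.
coding_1 = {"A": 1, "B": 2, "C": 3}
coding_2 = {"A": 2, "B": 3, "C": 1}

r1 = pearson_r([coding_1[c] for c in category], cont_var)
r2 = pearson_r([coding_2[c] for c in category], cont_var)

# Same data, different labels: the coefficient changes in sign and size.
print(round(r1, 3), round(r2, 3))
```

Nothing about the data changes between the two runs, only the labels, yet the coefficient is different each time. That is the reason to pick a test that respects the variable type.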
- When you are looking for correlation between continuous variables, Proc Corr is a good choice, because it simply gives you linear correlation coefficients.
- When you are looking at correlation between a binary variable and a continuous variable, your idea of correlation needs a change in perspective. A simple linear correlation coefficient is hard to interpret here, because one side of the pair is no longer a measurement but a set of categories. In many datasets these categories have been given numbers, but don’t confuse them with real numeric variables; they are just represented using numbers. They could just as well have been given other numbers: recoding a binary variable can flip the sign of the coefficient, and for a variable with more than two levels an arbitrary recoding changes its magnitude as well. How do you work around this problem?
- Observe what a binary variable does to your continuous variable when the two are taken together. It divides the continuous variable into two groups, one for each level of the binary variable. Correlation between the binary and the continuous variable means that when you move from one level of the binary variable to the other, the behaviour of the continuous variable changes as well. Whether that change is statistically significant can be checked with ‘Proc Ttest’, using the continuous variable as the “variable in question” and the binary variable as the “class variable”.
As you can see, for the category “1” here cont_var seems to have higher values, and that is how bin_var is affecting cont_var, or is correlated with it.
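The comparison described above can be sketched in plain Python (again, not SAS) to show what a two-sample t-test measures. The data are made up; `bin_var` and `cont_var` only mirror the names used in the text, and `two_sample_t` is a hypothetical helper. The sketch computes the pooled (equal-variance) and the Welch (unequal-variance) t statistics only. It does not compute p-values, which a statistics package reports alongside them.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(group0, group1):
    """Return (pooled_t, welch_t) for the mean difference group1 - group0."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = mean(group0), mean(group1)
    v0, v1 = variance(group0), variance(group1)  # sample variances (n - 1)

    # Pooled version: assumes both groups share one variance.
    pooled_var = ((n0 - 1) * v0 + (n1 - 1) * v1) / (n0 + n1 - 2)
    pooled_t = (m1 - m0) / sqrt(pooled_var * (1 / n0 + 1 / n1))

    # Welch version: lets each group keep its own variance.
    welch_t = (m1 - m0) / sqrt(v0 / n0 + v1 / n1)
    return pooled_t, welch_t


# Hypothetical data: the binary variable splits cont_var into two groups.
bin_var = [0, 0, 0, 0, 1, 1, 1, 1]
cont_var = [5.1, 4.8, 5.5, 5.0, 7.2, 6.9, 7.5, 7.0]

groups = {0: [], 1: []}
for level, value in zip(bin_var, cont_var):
    groups[level].append(value)

pooled_t, welch_t = two_sample_t(groups[0], groups[1])
print(mean(groups[0]), mean(groups[1]), pooled_t, welch_t)
```

A t statistic far from zero says that the group means differ by much more than the within-group scatter would explain, which is the sense of “correlation” used here.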
- Correlation between a multilevel categorical variable and a continuous variable is simply an extension of what we discussed above. Instead of just two levels, we now have several, so ‘Proc ANOVA’ comes into the picture. You use the continuous variable as the “variable in question” and the categorical variable as the “class variable”. The results of Proc ANOVA tell you whether the mean of the continuous variable differs significantly for any of the groups defined by the levels of the categorical variable. If it does, you can use the Bonferroni test in conjunction with Proc ANOVA to find out which of the classes are affecting your continuous variable.
An additional tip: if you are running these pair-wise checks between your dependent variable and the categorical variables under consideration, you can use the Bonferroni test to decide which categories of a categorical variable are worth converting to dummy variables for use in model building.
As you can see here, cont_var is affected by only one of the categories, C, whereas the other two categories show more or less the same behaviour. If cont_var were your dependent variable, you should create a dummy variable for category C of that categorical variable.
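As a rough illustration of what a one-way ANOVA measures, the plain-Python sketch below (not SAS) computes the F statistic as the ratio of between-group to within-group variance. The data, the level names A, B and C, and the `one_way_anova_f` helper are assumptions. The last lines show only the Bonferroni-adjusted significance level for the pairwise comparisons, not the pairwise tests themselves.

```python
from statistics import mean


def one_way_anova_f(groups):
    """groups: dict of level -> list of values. Return (F, df_between, df_within)."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (v - mean(values)) ** 2 for values in groups.values() for v in values
    )

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical data: level C behaves differently from A and B.
groups = {
    "A": [10, 11, 9, 10],
    "B": [10, 9, 11, 10],
    "C": [15, 16, 14, 15],
}

f_stat, df_between, df_within = one_way_anova_f(groups)
group_means = {level: mean(values) for level, values in groups.items()}

# Bonferroni: divide the significance level by the number of pairwise tests.
n_pairs = len(groups) * (len(groups) - 1) // 2
bonferroni_alpha = 0.05 / n_pairs

print(f_stat, df_between, df_within, group_means, bonferroni_alpha)
```

A large F says that at least one group mean stands apart. The group means show which one, and the Bonferroni adjustment keeps the pairwise follow-up comparisons from overstating significance.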
- When it comes to correlation between categorical variables, whether binary or multilevel, the simple choice is the chi-squared test, which can be carried out with ‘Proc Freq’. Let’s see how we view “correlation” in this context. Look at the proportions in which one of the categorical variables, call it C1, is divided between its categories for the overall population. Correlation between C1 and a second categorical variable C2 means that this proportional distribution changes across the levels of C2. The statistical significance of this change is determined by the chi-squared test.
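The idea of proportions shifting across levels can be sketched in plain Python (not SAS) as Pearson’s chi-squared statistic on a table of counts. The counts and the `chi_square` helper are made up for illustration. Rows stand for the levels of C2 and columns for the levels of C1, and the sketch stops at the statistic and its degrees of freedom without computing a p-value.

```python
def chi_square(table):
    """table: list of rows of observed counts. Return (statistic, df)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if C1's proportions were the same in every C2 level.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: C1's split flips between the two levels of C2.
related = [[30, 10],
           [10, 30]]
# Hypothetical counts: C1's split is 50/50 in both levels of C2.
unrelated = [[20, 20],
             [10, 10]]

stat_related, df_related = chi_square(related)
stat_unrelated, df_unrelated = chi_square(unrelated)
print(stat_related, df_related, stat_unrelated, df_unrelated)
```

When the split of C1 is identical in every level of C2, the observed and expected counts coincide and the statistic is zero. The more the proportions shift, the larger it grows.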
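Below is a simple reference for which test to use when your data has mixed variable types:
- Continuous and continuous: Proc Corr (linear correlation coefficient)
- Binary and continuous: Proc Ttest
- Multilevel categorical and continuous: Proc ANOVA, with the Bonferroni test to compare levels
- Categorical and categorical (binary or multilevel): Proc Freq (chi-squared test)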
Congratulations! Now you are armed to deal with checking correlation between all kinds of variable types rather than erroneously using Proc Corr!