Read this tutorial before you use Proc Corr

All of us, at some point while examining data, check for correlations among the variables in the data, especially pair-wise correlations.

Many business analysts in industry treat the ‘linear correlation coefficient’ as the only criterion for pair-wise correlation, so at most a Proc Corr is run in SAS to check for it. This is of course useful for finding correlations between continuous variables. However, more often than not it falls short when confronted with real-life data, which frequently contains all kinds of variables: continuous, binary, multi-level categorical and so on.

  • When you are trying to find the correlation between continuous variables, Proc Corr is a good choice, because it simply gives you linear correlation coefficients.
  • Now when you are looking at correlation between a binary variable and a continuous variable, your idea of correlation needs a small change in perspective. A simple linear correlation coefficient is hard to interpret here, because one of the variables is no longer a meaningful number but a category. In many datasets these categories are coded as numbers, but don’t confuse them with real numeric variables; the numbers are just labels. They could just as well have been coded differently: for a two-level variable, swapping the codes flips the sign of the coefficient, and for a variable with more than two levels an arbitrary recoding changes its value altogether. How do you work around this problem?
    • Observe what a binary variable does to your continuous variable when the two are taken together. It divides the continuous variable into two groups, one for each level of the binary variable. Correlation between the binary and the continuous variable means that when you move from one level of the binary variable to the other, the behaviour of the continuous variable changes as well. Whether that change is statistically significant can be checked with ‘Proc Ttest’, using the continuous variable as the “variable in question” and the binary variable as the “class variable”.
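
The two cases above could be run roughly as follows. This is only a sketch: the dataset name mydata and the second continuous variable other_cont_var are hypothetical, while cont_var and bin_var are the names used in the figure discussed just below.

```sas
/* Continuous vs continuous: linear (Pearson) correlation coefficient. */
/* mydata and other_cont_var are placeholder names.                    */
proc corr data=mydata;
    var cont_var other_cont_var;
run;

/* Binary vs continuous: compare cont_var across the two levels of bin_var. */
proc ttest data=mydata;
    class bin_var;   /* binary variable as the "class variable"           */
    var cont_var;    /* continuous variable as the "variable in question" */
run;
```

Proc Ttest reports both a pooled and a Satterthwaite (unequal variance) t test, together with a test for equality of variances that helps you decide which of the two to read.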

Figure: Continuous & Binary Variable

As you can see, for category “1” cont_var seems to take higher values; that is how bin_var affects, or is correlated with, cont_var.

  • Correlation between a multilevel categorical variable and a continuous variable is simply an extension of what we discussed above. Instead of just two levels, we now have several, so ‘Proc ANOVA’ comes into the picture. You use the continuous variable as the “variable in question” and the categorical variable as the “class variable”. The results of Proc ANOVA tell you whether the mean of the continuous variable differs significantly for any of the groups defined by the levels of the categorical variable. If it does, you can use the Bonferroni test in conjunction with Proc ANOVA to find out which of the classes are affecting your continuous variable.
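
A minimal sketch of this one-way analysis is shown here, assuming a hypothetical dataset mydata and a hypothetical categorical variable cat_var; only cont_var comes from the figures in this article.

```sas
/* One-way ANOVA: does the mean of cont_var differ across levels of cat_var? */
/* mydata and cat_var are placeholder names.                                 */
proc anova data=mydata;
    class cat_var;               /* categorical variable as the "class variable" */
    model cont_var = cat_var;    /* continuous variable as the response          */
    means cat_var / bon;         /* Bonferroni comparisons of the level means    */
run;
quit;                            /* Proc ANOVA is interactive, so end it with QUIT */
```

The MEANS statement with the BON option produces the pair-wise comparisons that show which levels differ from the others. Proc GLM accepts the same CLASS, MODEL and MEANS statements if you prefer a procedure that is not tied to balanced designs.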

An additional tip: if you are doing this pair-wise check between your derived variable and the other categorical variables in question, you can use the Bonferroni test to decide which categories of a categorical variable are worth converting to dummy variables for use in model building.

Figure: Continuous variable and multi-categorical variable correlation

As you can see here, cont_var is affected by one of the categories, C, whereas the other two categories behave more or less the same. If cont_var here were the variable you are trying to model, you should create a dummy variable for category C of that categorical variable.
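
Creating such a dummy variable could look like the sketch below. It assumes a hypothetical dataset mydata in which the categorical variable is a character variable named cat_var with the value 'C' for that category; the output names mydata_dummy and cat_var_C are also made up for illustration.

```sas
/* Add a 0/1 dummy variable flagging category C.                */
/* mydata, mydata_dummy, cat_var and cat_var_C are placeholders. */
data mydata_dummy;
    set mydata;
    /* A comparison in SAS evaluates to 1 when true and 0 when false. */
    cat_var_C = (cat_var = 'C');
run;
```

Character comparisons are case sensitive, so the literal has to match the stored value exactly.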

  • When it comes to correlation between categorical variables, whether binary or multilevel, the simple choice is the chi-square test, which can be carried out with ‘Proc Freq’. Let’s see how we view “correlation” in this context. Call the two categorical variables C1 and C2, and look at the proportions in which the overall population is split across the categories of C1. Correlation between C1 and C2 means that this proportional distribution changes across the levels of C2. The statistical significance of this change is determined by the chi-square test.
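
As a rough sketch, assuming a hypothetical dataset mydata that holds the two categorical variables as c1 and c2:

```sas
/* Chi-square test of association between two categorical variables. */
/* mydata is a placeholder dataset name.                              */
proc freq data=mydata;
    /* Rows are levels of c2, columns are levels of c1.              */
    /* NOCOL and NOPERCENT leave only counts and row percentages, so */
    /* each row shows the distribution of c1 within one level of c2. */
    tables c2*c1 / chisq nocol nopercent;
run;
```

If the row percentages look much the same from one row to the next, the distribution of c1 is not changing with c2, and the chi-square statistic will be small.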

Below is a simple chart for reference showing which test to use when your data has mixed-type variables.

Figure: Data Table

Congratulations! Now you are armed to deal with checking correlation between all kinds of variable types rather than erroneously using Proc Corr!

Comments

  • Richard

    Thanks a lot for the article. I agree with the above suggestions. I have been asking so many people about these correlations between types of variables. Crystal clear.

    regards
    Richard.

  • Lalit Sachan

    Vadim,

    The idea here is that the perspective of correlation, when it comes to a categorical variable [binary or multilevel], is not the same as linear correlation. Agreed that in the case of binary variables there might not be a difference in the directly calculated values, but that’s not the only thing we are looking at here.
    Thanks!
    lalit

  • Vadim Pliner

    Lalit,

    The value of the Pearson correlation between a binary variable and a continuous variable does not depend on the binary variable’s coding; it could be 0 & 1, or -1 & 1, or any other pair of numbers. Therefore your advice not to use PROC CORR when one variable is binary and the other is continuous is questionable, to say the least.

    Regards,
    Vadim

  • Lalit Sachan

    Hi Michael, the article talks about pair-wise analysis, i.e. 1-way ANOVA only. Moreover, it suggests an improvement over simply using the linear correlation coefficient. Yes, you are right: ANOVA comes with a lot of assumptions of its own, and PROC GLM would be the safer choice in the strictest sense. It would be even better than what is suggested, but the aim was to keep it simple and suggest something better suited than the linear correlation coefficient.
    Thanks!
    Lalit

  • Michael Esposito

    Aatash,

    It should be explicitly pointed out that PROC ANOVA is strictly for balanced data, with some exceptions such as 1-way ANOVA, Latin squares, completely nested designs, etc.

    It may be better (safer) to suggest the use of PROC GLM instead?

    Cheers,
    Michael.