Analytics in R can be as easy as analytics in Excel

For many who are trying to start with data analytics, R comes across as a good programming tool being free and famous. But then you stumble onto a big hurdle before you can even type an ‘r’ of R [oh i wish they had given a longer name to R]. The word “programming” scares half of you back to looking for an alternative.

Today let’s establish the fact that analytics in R can be taken  just as an extension to writing formulas on Excel. Seriously, that’s true! Although as you progress, you get to know and learn more complex features of R, but don’t let that deter you, because that comes as an advantage over Excel and by which time you would have fallen in love with R.

Programming in R is very different than JAVA/C/C++ and cousins:

Some of you had a passing brush with JAVA, C, C++ at some point of your academic life. Because of that short lived experience, programming means to you writing 50 lines in an alien language just to calculate mean/average of 10 numbers. You’d rather use a calculator, or better, Excel. Right? Now, how different is R in that respect? Very!

In R, you don’t have to write all those lines of code. In fact, most of these mundane tasks have formulas in R just like Excel. For example calculating averages, how would you do it in Excel? Write a formula in one of those cells; argument of those formulas are going to be the cell references where the values are for whom you want to sum or get an average of. Nothing different in R vis a vis Excel.

You have formula “mean()” [Although in R we call formulae “functions” ]. Pass the values to this formula and you have your averages. Want to concatenate two strings, you will use formula “concatenate” in excel, in R the function names is “paste”. Using vlookup/hlookup? How about using functions join/merge in R?! I can go on and on with these parallels.

R vs. Excel

If it’s all the same as Excel, then why learn R?

Because R is not only similar, it’s much more, and then some more and so on. Few differences which current exclusively Excel users can identify with:

  • Say, goodbye to excel templates. You have R scripts instead where you have written all those formulas, they take in the raw data file and generate whatever report you had to generate in one click. No need to apply the template or painfully putting all that data in the template as you did in excel.
  • The plots? Selected data, insert plot. No need to go by that path. For example just write plot(x,y) for a scatter plot. And you are done. And you can use this over and over again, no need to follow steps of clicking and pointing and dragging each and every time.
  • Yes, you have add-ons in excel for regression and optimization and other stuff, but have you seen that output? Once you see how smooth things are in R when it comes to statistics and predictive modelling, you won’t go back to Excel.
  • Talking about add-ons, R has 4,000+ of them, all of them free, just imagine how many functionalities you will have at your finger-tips.

And I’m not talking about how much better predictive modelling and data visualization capabilities are in R. Let’s not even go there.

Ok, I’m somewhat convinced. Tell me more.

You are only somewhat convinced because we haven’t yet really done anything in R, where you’d realise the easy transition from Excel to R.

Let’s start with a reporting task which we’ll do in R. How to do that in Excel is already straight forward for you, so just imagine that in your head and compare.

The business problem is as follows:

You have two raw data files in csv format. First file contains store ids, their location, this month’s sales, store category [small, medium, large] ,and their next month targets. Second file contains store ids, their average sales last year and area manager affiliated to them.

Your job: Select stores which are present in both the files. Get average sale by category of stores for this month. Get average targets by category for this month. Plot boxplots for sales to observe sales distribution for each category for last year and this month. Carry out a ‘ttest’ to see if average sales for last year are much different from this month’s sales for medium sized stores.

Here is what you would have written in R to do all this:

#importing files to R

data1=read.csv(“file1.csv”)

data2=read.csv(“file1.csv”)

#merging the two

library(plyr)

merged=join(data1,data2,by=”storeid”,type=”inner”)

#get average sales by category

average_sales=tapply(current_sales,category,mean)

average_target=tapply(target,category,mean)

#plot them

plot(last_year_sales,category)

plot(current_sales,category)

#carry out t-test

newdata=merged[category==”medium”,]

attach(newdata)

t.test(current_sales,lastyear_sales)

Did you notice, it is all formulas and it takes just two lines to do each step. And if you give me that raw data again I don’t have to go and write all of this all over again. I will just use the same script again.

So, where do I start?

Start with downloading R and R studio. Since you haven’t really used any editor, I’ll not praise R Studio here. Let’s just say that, currently R studio is the best way to work with R and yes it is free!

There are a lot of open resources to learn R from. Problem with them is they are mostly not structured. Those sources, forums, youtube videos and discussions are excellent when you already have some idea of R and are struggling with some specific issue. As you are looking to start learning R, a complete, structured course, which enables you to start working from scratch would be what you require.

My personal suggestion would be to take a cost-effective, online professional course which not only gradually builds your knowledge but assesses your progress with real world case studies and quizzes as you progress in the course. You get a structured path with professional guidance and you can also learn at your own pace and time.

You will find exactly such a course here: www.edvancer.in/course/r-tutorial.

Share this on
facebooktwittergoogle_plusredditlinkedinmail

Follow us on
facebooktwittergoogle_pluslinkedinrss

Comments