Edvancer's Knowledge Hub

Top mistakes beginners make while learning data science (Part 1)

Manu Jeevan 20/07/2017

There are a handful of extremely common mistakes that you should avoid while learning data science. These pitfalls can make it incredibly difficult to gain momentum early in the process. So, if you’re just starting out, remember to steer clear of these common missteps.

1) Not learning data cleansing

Steve Lohr of The New York Times said: “Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”

Unsurprisingly, much of a data scientist’s work is cleaning data and converting it into a usable form. In the real world, the data you analyze is going to be messy and difficult to work with, so cleaning data is a vital skill. Common problems in messy data sets include missing values and inconsistent string formatting (e.g., ‘San Fransico’ versus ‘san francisco’ versus ‘ny’). Most beginners skip learning how to clean messy data sets and jump directly to learning programming and algorithms.

2) Applying to all the generic job roles

Data science is made up of many components, and it includes several more specialized roles: data engineer, data analyst, machine learning engineer, and business analytics professional. How much you need to know of each component varies by role. For instance, a data engineer needs to be good at coding and building projects. A data analyst needs less statistics and math but more visualization. Machine learning engineers focus more on software and ML.

People trying to get into data science often apply for all the generic roles without analysing their background, strengths and weaknesses. You need to apply to roles that suit your background. Let me give you an example: if you are a Java programmer, Hadoop will be easy for you to learn, so you should target jobs related to Hadoop.
If you have a management background, you should target roles related to business analytics. If you really want to become a data scientist, target roles like junior data scientist or data analyst instead of applying for roles that require advanced machine learning skills.

3) Learning only the syntax

It is really tempting to learn just the syntax when you want to acquire a skill or some knowledge quickly. For instance, you might know how to wrangle data in pandas, how to run an algorithm using scikit-learn, or how to create visualizations using ggplot2 in R. All of these libraries do a task automatically, so you can get results without understanding what is going on under the hood. But you still need that understanding.

For example, linear regression fails in a lot of cases. It fails when the relationship in your data is non-linear. It struggles when your data is high-dimensional (for example, when you have more features than observations). Unless you really understand how linear regression works, you will never learn to catch those cases.

When you are visualizing data, you have to understand why you are using a bar chart instead of a line chart. Why is it better than a line chart? What are the cases when you should use a bar chart? This knowledge comes from working on projects.

So it is not enough to learn how to do things. It is important to learn why you are doing certain steps and how those things work. I can’t emphasize this enough. Learning only the syntax is just like reading a workbook that says, “type in this command.” This is one of the reasons why “Learn Python the Hard Way” is not a good way to learn: it doesn’t give you a deep understanding of how something works.

4) Learning only through guided tutorials

Learning by watching guided tutorials is great in the beginning, but not advisable for the long run. It is critical to work on real-world projects that push you beyond your comfort zone. In real life, the data science workflow keeps changing.
You need to know how to move quickly from data wrangling to analysis, and then from analysis to writing algorithms. Guided tutorials won’t teach you how to adjust to these situations. So you need to work on projects that deal with different aspects of data science: data wrangling, programming, writing algorithms, etc.

Once you complete one or two guided tutorials, start working on data science problems that you are really interested in. It can be figuring out new and interesting things about your city, mapping all the devices on the internet, finding the real positions NBA players play, or anything else. The best thing about learning data science is that there are endless interesting things to work on; it’s all about asking questions and finding a way to get answers.

In my next post, I will talk about a few more mistakes that beginners make while learning data science. Till then, stay tuned!
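
P.S. For readers who want to see what the messy-data problems from mistake 1 look like in practice, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the alias table, the function names, the sample rows, and the choice of filling missing ages with the median are all hypothetical, and a real project would more likely use a library such as pandas.

```python
from statistics import median

# Hypothetical alias table: maps known misspellings and abbreviations
# to a single canonical city name. Real data needs a much larger table.
CITY_ALIASES = {
    "san fransico": "San Francisco",
    "san francisco": "San Francisco",
    "ny": "New York",
    "new york": "New York",
}


def clean_city(raw):
    """Return a canonical city name, or None if missing or unrecognized."""
    if raw is None:
        return None
    key = " ".join(raw.split()).lower()  # collapse whitespace, ignore case
    return CITY_ALIASES.get(key)


def fill_missing(values):
    """Replace None entries with the median of the observed values.

    Median filling is just one simple strategy, chosen here for illustration.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Made-up sample rows showing both problems: inconsistent strings
# and a missing numeric value.
rows = [
    {"city": "San Fransico", "age": 34},
    {"city": " san francisco ", "age": None},
    {"city": "ny", "age": 28},
]

ages = fill_missing([row["age"] for row in rows])
cleaned = [
    {"city": clean_city(row["city"]), "age": age}
    for row, age in zip(rows, ages)
]

for row in cleaned:
    print(row)
```

Even in a toy example like this, most of the code is about deciding what counts as "the same value" and what to do when a value is absent, which is exactly the kind of judgment that guided tutorials on clean data sets tend to skip.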


Manu Jeevan

Manu Jeevan is a self-taught data scientist and loves to explain data science concepts in simple terms. You can connect with him on LinkedIn, or email him at manu@bigdataexaminer.com.