Edvancer's Knowledge Hub

4 Hadoop Job Roles & Their Pre-requisites

Edvancer Support 08/03/2016

If you are new to Hadoop & Big Data, you are probably wondering what the various Hadoop job roles entail, which one would suit you best, and what you need to do to get one of those jobs. Datameer has predicted that the global Hadoop market will reach $50.2 billion by 2020.

Source: Datameer

Even though the Hadoop market is growing quickly, a Gartner survey shows that the lack of skilled Hadoop professionals is the primary obstacle keeping companies from investing more. Companies are unable to find skilled professionals who know how to derive value from Hadoop. In other words, there is a large and widening gap between the demand for and the supply of Hadoop professionals. Let us break down the various Hadoop job roles and their key prerequisites in terms of skills.

Building blocks of Hadoop

Apache Hadoop began as a subproject of Nutch, a web-search software project. The Nutch team built Hadoop in Java simply because they were more comfortable programming in it; there was no other particular reason for the choice. Even though Hadoop can run on Windows, it was built on Linux, and Linux remains the preferred platform for installing and managing Hadoop. A fundamental knowledge of Linux (an operating system) will help you understand the Hadoop ecosystem much better. It will also help you work efficiently with HDFS (Hadoop's distributed file system).

Analysis of Hadoop Job market

I will show you four examples of Hadoop job positions and the skills companies look for when hiring for them. This is not a complete list of Hadoop postings, just a few examples.

Hadoop Developer

Hadoop developers write the jobs and tasks that store, manage and analyze Big Data in the cluster. They need to understand how the storage and processing layers of Hadoop work. Hadoop developers also have to access some of Hadoop's advanced features through its Java API. All of these tasks require a basic knowledge of core Java programming. Knowledge of SQL is also helpful. Most companies look for Hadoop developers to manage and process all their unstructured data and to build Hadoop products on top of the traditional database systems that already exist in the organization.

Data Warehouse Analyst / ETL Developer using Hadoop

ETL is one bridge leading organizations towards greater Hadoop adoption. Today many enterprises use Hadoop for the pedestrian purposes of storage and ETL. ETL, which stands for extract, transform and load, involves people “taking data from the traditional data source and dumping it into Hadoop.”

Source: Cloudera

Cloudera shows the three most common use cases for Hadoop (data transformation, archiving, and exploration). Many analysts say that more than 75% of Hadoop adoption falls within the first two use cases. If you are aiming to become an ETL developer or a data warehouse analyst using Hadoop, you need a strong knowledge of database and SQL concepts. Java programming is not a prerequisite for these roles, but a fundamental knowledge of Linux will be useful.

Data Scientist - Hadoop

Here, the company is looking for data scientists with knowledge of Hadoop. As a data scientist, you need to write a lot of MapReduce, Hive or Pig code to access and analyze the huge volumes of unstructured data stored in Hadoop. You can write MapReduce code in your preferred programming language: utilities such as Hadoop Streaming let the Java-based framework run mappers and reducers written in other languages, so your own code does not have to be Java. Hence, Java programming knowledge is not a prerequisite for this job, but you do need to know at least one programming language. If you do not know any programming, you can start by learning Hive and Pig to analyze the data. Intermediate knowledge of SQL is required because, as a data scientist, you should know how to work with a relational database.
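To make the point about writing MapReduce in a language other than Java more concrete, here is a minimal word-count sketch in Python, written in the style that Hadoop Streaming expects: the mapper emits tab-separated key/value records, the framework sorts them by key, and the reducer sums the values for each key. The function names (map_lines, reduce_lines, run_locally) are illustrative choices for this sketch, not part of any Hadoop API, and run_locally only imitates the sort step on a single machine so the logic can be tried without a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per word. Input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Imitate map -> sort -> reduce on one machine (assumption: for testing only)."""
    return list(reduce_lines(sorted(map_lines(lines))))


if __name__ == "__main__":
    for record in run_locally(["Hadoop stores data", "hadoop processes data"]):
        print(record)
```

On a real cluster, the mapper and reducer would typically live in two separate scripts that read from standard input and print to standard output, and they would be handed to the Hadoop Streaming jar through its -mapper and -reducer options. The exact jar path and job settings depend on your installation.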

Hadoop Administrator

As a Hadoop admin, it is your job to set up the entire Hadoop cluster, configure, maintain and manage it, and take care of security, backup and recovery. The admin also handles performance tuning and capacity planning, anticipating when the cluster's capacity needs to be increased or reduced. A Hadoop admin needs excellent knowledge of Linux, networking, hardware, databases and SQL, among other administration skills. Knowledge of Java is not mandatory.

Conclusion

Many articles in the big data community say that Java is not a prerequisite for two reasons:

Tools like Hive and Pig, which are built on top of Hadoop, offer their own languages for working with the data on your cluster: Pig is programmed in Pig Latin and Hive in HiveQL. So you don’t need to learn Java to use tools that are built on top of Hadoop.
You can write MapReduce code in a language of your preference. Interfaces such as Hadoop Streaming and Hadoop Pipes let the Java-based framework run mappers and reducers written in other languages (for example C, C++ or Python). So you don’t have to learn Java to program MapReduce.

Yes, these arguments are valid, so do not let a lack of Java knowledge hold you back from learning and using Hadoop. Spark is an emerging processing engine that is increasingly replacing MapReduce; Java is not required there either, though you will need to know Python or Scala. Suffice it to say that as long as you have a programming background, you should not hesitate to start learning Hadoop. Why? Because by 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions. You can learn Hadoop & Big Data through Edvancer’s Certified Hadoop & Big Data Expert course.