Anatomy of a data science project
Before the cool algorithms and complex data architectures comes the process that all data science projects must follow. (I know this already sounds a little bookish, but trust me: I am just breaking the subject down into little pieces!)
Different projects follow different approaches, but the objective remains largely the same – let the data give you inputs and present the insights learned in an appealing and understandable way to the client.
An incremental approach (similar to the agile model), as opposed to a waterfall development model, can save time, effort, and money. Deriving continuous value from the data in each cycle is safer and quicker. This is because the approach yields actionable insights early and leaves us better prepared to work on the data with every iteration.
If we were to group some of the sub-tasks and broadly present the stages of a data science project, this is what it would look like:
The first stage involves asking relevant questions and completely nailing down the business use case. Some of those questions could be:
- What is the ultimate objective of the project? Is it to optimize ad spend during the holidays or to build a system capable of changing air conditioning settings to match the number of customers in the store?
- Will data science really add value, and why is it needed? The team and the client both have to believe that there is a need for using machine learning algorithms or analytics techniques to get insights or build a system.
- What are the data sources? Will they be given to the team by the client, or will the team have to find them from public sources?
The next phase comprises collecting data from different sources and managing it. If the data has to be pulled from public sources, crawlers (programs that scrape data from websites) and other open information sources are used.
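To give a rough idea of what the scraping part of a crawler does, here is a minimal sketch using only Python's standard library. It pulls the link targets out of a page so a crawler could decide what to fetch next. The page content, the `LinkCollector` class, and the `extract_links` helper are all made up for illustration; a real crawler would also download pages over HTTP and respect each site's terms of use.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content; a real crawler would download this instead.
SAMPLE_PAGE = """
<html><body>
  <a href="/stores/1">Store 1</a>
  <a href="/stores/2">Store 2</a>
  <a name="top">An anchor without a link target</a>
</body></html>
"""


def extract_links(html):
    """Return the link targets found in an HTML string, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


links = extract_links(SAMPLE_PAGE)
print(links)
```

The anchor without an `href` is skipped, so only the two store links are collected.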
Another important subtask of this phase is data cleaning, which involves handling missing entries, semantic errors, and other inconsistencies. Some experts estimate that data preparation takes 60 to 80 percent of the effort spent on the whole analytical pipeline in a typical data science project.
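Here is a small sketch of what such cleaning can look like in plain Python. The rows, the column names, the list of missing-value markers, and the plausible ranges are all assumptions invented for this example; real projects usually lean on a library such as pandas and on rules agreed with the client.

```python
# Hypothetical raw rows, as they might arrive from a CSV export.
RAW_ROWS = [
    {"store": "Downtown", "visitors": "120", "temperature_c": "22.5"},
    {"store": "downtown ", "visitors": "", "temperature_c": "23.0"},
    {"store": "Mall", "visitors": "n/a", "temperature_c": "21.0"},
    {"store": "Mall", "visitors": "95", "temperature_c": "-999"},
]

# Strings treated as "no value recorded" (an assumption for this sketch).
MISSING_MARKERS = {"", "n/a", "na", "null"}


def to_number(text, lowest, highest):
    """Parse text as a float; return None if missing or implausible."""
    cleaned = text.strip().lower()
    if cleaned in MISSING_MARKERS:
        return None
    try:
        value = float(cleaned)
    except ValueError:
        return None
    # Values outside the plausible range are treated as semantic errors.
    return value if lowest <= value <= highest else None


def clean_row(row):
    """Normalize the store label and convert numeric fields."""
    return {
        "store": row["store"].strip().title(),
        "visitors": to_number(row["visitors"], 0, 10_000),
        "temperature_c": to_number(row["temperature_c"], -30, 60),
    }


cleaned_rows = [clean_row(row) for row in RAW_ROWS]
for row in cleaned_rows:
    print(row)
```

Missing entries and the implausible `-999` reading become `None` rather than being silently dropped, so later stages can decide whether to fill them in or exclude them.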
Exploratory data analysis
This exercise can pay off early, depending on how well the data is already organized. It involves analyzing data sets to summarize their main characteristics, and is often done with visual methods. Inconsistencies and missing values in the data may also reveal themselves at this stage.
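As a very small illustration of summarizing a data set's main characteristics, the sketch below computes a few descriptive statistics with Python's standard `statistics` module and counts the missing values along the way. The `visitors` numbers and the `summarize` helper are invented for the example; in practice this step is usually paired with plots.

```python
import statistics

# Hypothetical daily visitor counts; None marks a missing entry.
visitors = [120, None, 95, 110, None, 130, 101]


def summarize(values):
    """Return basic descriptive statistics, ignoring missing entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


summary = summarize(visitors)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Even a summary this simple shows how many entries are missing and whether the mean and median roughly agree, which hints at how skewed the data is.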
Build model or perform analysis
This part of the project is what all the hype is about and covering it in detail would be well outside the scope of this post, but let me talk about a few things to give you an idea.
Depending on the objectives of the project, different types of machine learning algorithms and analytics techniques are employed by a data scientist. Tools such as Hadoop, R, Python, etc. are used. Apart from deciding the tools and algorithms to be used on projects, data scientists are also responsible for continuously collecting, cleaning, and integrating new data.
Data is more than just numbers. In the presentation phase, the data scientists demonstrate the insights they gleaned from analyzing the data. Humans are more adept at noticing trends in charts, graphs, and other forms of pictorial representation than in raw numbers. The deliverables of this phase would be similar to the reports one can generate while using analytics software.
The trends and patterns depicted have to be relevant to the business objective of the project. This is why data scientists are expected to know how businesses run in general, in addition to their vast technical knowledge.
Insight-driven projects rely on how appealing the presentation is and how convincing the projected increases in business (if changes are made based on the insights) look. On the other hand, when a system is being built, only the results of its real-time deployment will eventually decide whether data science adds any real value to the business. Clearly defining the problem that needs to be solved and analyzed to find insights is crucial for any business owner to reap the full benefits of a data science project.
Lastly, it is important to understand that a data science project is an iterative process and that getting real-time results for your business takes time.
Latest posts by Manu Jeevan (see all)
- Anatomy of a data science project - November 10, 2017