What’s the relationship between big data and machine learning?
Since about 2010, “Big Data” has become the ubiquitous term to describe all the data that is generated by people from their smartphones, web browsing history, social media and purchasing behaviour, together with any other information that organizations hold about them.
Why is big data different from any other type of data? In one sense, there isn’t a difference; it’s all just zeros and ones at the end of the day. However, the term “Big Data” tends to be applied to large collections of different types of data which are often volatile and changeable, and which one would struggle to analyse using traditional computer hardware and software.
It’s also the case that big data often incorporates certain types of data that were not widely used for customer analysis until relatively recently. In particular, big data includes:
Text. What people write and say can be analysed to identify what they are talking about and the sentiments being expressed. If a product is being discussed in a positive or negative context, this is likely to be predictive of whether someone goes on to buy that product.
Images. This covers photos and video, as well as medical imaging. One application of machine learning is to use features identified in scans and x-rays to predict the likelihood that someone has a specific disease.
Social network data. This is information about people’s connections and who they know. Network data includes the number and type of connections that people have, as well as data about connected individuals. If all your friends are sci-fi geeks, that’s probably a good indication that you might be one too.
Geospatial. Information about people’s location and movements, provided by smartphones and other mobile devices.
Biometrics. Data about blood pressure, heart rate and so on, collected from fitness bands, smart watches and similar devices.
Product (machine) generated. Everyday devices from televisions to coffee makers are being designed to share information between themselves and over the internet. These days your heating, kettle, washing machine, and so on can all be controlled via your smartphone. The “Internet of Things” (IoT) concept is still developing, but will eventually provide lots of data that can be used to infer people’s behaviour using machine learning.
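To make the text example concrete, here is a minimal sketch of how sentiment might be scored by counting positive and negative words. The word lists and the scoring rule are illustrative assumptions, not a real sentiment lexicon; production systems use far richer models.

```python
# Toy sentiment scorer: counts positive words minus negative words.
# The word sets below are illustrative assumptions, not a real lexicon.
POSITIVE = {"love", "great", "excellent", "recommend"}
NEGATIVE = {"hate", "terrible", "broken", "refund"}

def sentiment_score(text):
    """Return a crude sentiment score: positive minus negative word counts."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("I love this phone and would recommend it"))  # 2
print(sentiment_score("terrible battery life, I want a refund"))    # -2
```

A positive score on posts mentioning a product could then be used as one input feature when predicting whether the author goes on to buy it.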
In the “good old days” back in the 1990s, smart devices didn’t exist. Few people even had a cell phone back then, and the internet was still in its infancy. Very little electronic data about people or their activities existed. What there was, was usually limited to a few geo-demographics such as address, age, income, gender and so on. This may then have been supplemented by data supplied from a direct marketing company or a credit reference agency if financial services products were involved (e.g. arrears status on loans and credit cards). Supermarkets had no idea what individual customers bought each week, insurance companies didn’t know how people drove, and health services held most of their patient records in paper files.
Life for a data scientist back then was pretty straightforward because all of this (very limited) electronic data was usually held in a nice neat format of rows and columns (like one would find in a spreadsheet). The data was also relatively static, usually only being updated very infrequently – typically at month or year end.
In today’s world of big data, data is being updated much more frequently, often in real time. In addition, a lot more of it is “free form” unstructured data such as speech, e-mails, tweets, blogs, and so on. Another factor is that much of this data is often generated independently of the organization that wants to use it. This is problematic, because if data is captured or generated by an organization itself, then it can control how that data is formatted, and put checks and controls in place to ensure that the data is accurate and complete. However, if data is being generated from external sources then there are no guarantees that the data is correct.
Externally sourced data is often “messy.” It requires a significant amount of work to tidy it up and get it into a usable format. In addition, there may be concerns over the stability and ongoing availability of that data, which presents a business risk if it becomes part of an organization’s core decision-making capability.
What this means is that the traditional computer architectures (hardware and software) that organizations use for things like processing sales transactions, maintaining customer account records, billing and debt collection are not well suited to storing and analyzing all of the new and different types of data that are now available. Consequently, over the last few years a whole host of new and interesting hardware and software solutions have been developed to deal with these new types of data.
In particular, modern big data computer systems are good at:
Storing massive amounts of data. Traditional databases are limited in the amount of data that they can hold at reasonable cost. New ways of storing data have allowed an almost limitless expansion in cheap storage capacity.
Data cleaning and formatting. Diverse and messy data needs to be transformed into a standard format before it can be used for machine learning, management reporting, or other data related tasks.
Processing data quickly. Big data is not just about there being more data. It needs to be processed and analysed quickly to be of greatest use.
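The cleaning and formatting task can be sketched with a small example: records arriving from external sources in inconsistent shapes, normalized into one standard format. The field names, date formats and currency handling here are assumptions made purely for illustration.

```python
# Sketch: normalizing messy externally-sourced records into a standard format.
# Field names and the input formats handled are illustrative assumptions.
from datetime import datetime

raw_records = [
    {"name": " alice smith ", "joined": "2021-03-05", "spend": "120.50"},
    {"name": "BOB JONES",     "joined": "05/03/2021", "spend": "£95"},
]

def parse_date(s):
    """Try each known date format in turn; return None if none match."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    return None

def clean(rec):
    """Standardize one record: trimmed title-case name, date object, float spend."""
    return {
        "name": rec["name"].strip().title(),
        "joined": parse_date(rec["joined"]),
        "spend": float(rec["spend"].lstrip("£")),
    }

cleaned = [clean(r) for r in raw_records]
```

Only after a step like this can the data be fed into machine learning, management reporting, or other downstream tasks.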
The issue with traditional computer systems wasn’t that there was any theoretical barrier to them undertaking the processing required to utilize big data, but in practice they were too slow, too cumbersome and too expensive to do so.
New data storage and processing paradigms such as Hadoop/MapReduce have enabled tasks that would once have taken weeks or months to be completed in just a few hours, and at a fraction of the cost of more traditional data processing approaches. The way that Hadoop does this is to allow data and data processing to be spread across networks of cheap desktop PCs. In theory, tens of thousands of PCs can be connected together to deliver massive computational capabilities that are comparable to the largest supercomputers in existence.
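The MapReduce idea itself is simple, and can be sketched with the classic word-count example. Real Hadoop distributes the map and reduce phases across many machines; this toy version runs both in a single process purely to show the shape of the computation.

```python
# Toy illustration of the MapReduce pattern: word count in three phases.
# In Hadoop, the mapped pairs and the reduce work would be spread across
# many machines; here everything runs locally for clarity.
from collections import defaultdict

documents = [
    "big data needs machine learning",
    "machine learning needs data",
]

# Map phase: each document independently emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
```

Because each document is mapped independently and each word is reduced independently, both phases parallelize naturally, which is what lets networks of cheap machines take on supercomputer-scale workloads.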
Data (whether “Big” or “Small”) has no intrinsic value in itself. A big mistake that an organization can make is to think that if it invests in a mass storage system, such as Hadoop, and collects every scrap of data it can about people, then that’s going to add value. The data has to be worked into something useful if it’s going to be of benefit. Machine learning is the key tool that does that – applying algorithms to all that data and producing predictive models that can tell you something about people’s behaviour, based on what has happened in the past.
A good way to think about the relationship between big data and machine learning is that the data is the raw material that feeds the machine learning process. The tangible benefit to a business is derived from the predictive model(s) that comes out at the end of the process, not the data used to construct it.
Machine learning and big data are therefore often talked about in the same breath, but it’s not a symmetrical relationship. You need machine learning to get the best out of big data, but you don’t need big data to be able to use machine learning effectively. If you have just a few items of information about a few hundred people, then that’s enough to begin building predictive models and making useful predictions.
The more (and better) data you have, the better your models will be at making predictions, but having gigabytes or terabytes of data is not a prerequisite for building useful models.
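To underline that point, here is a sketch of a predictive model built from just four records. The customer data and the 1-nearest-neighbour rule are illustrative assumptions chosen for brevity, not a recommendation of a particular algorithm.

```python
# Sketch: a predictive model from a handful of records.
# Each training example: (age, income in £k) and whether the customer
# bought the product (1) or not (0). Data is invented for illustration.
training = [
    ((25, 20), 0),
    ((30, 35), 0),
    ((45, 60), 1),
    ((50, 80), 1),
]

def predict(age, income):
    """Classify a new customer by the label of the nearest training example
    (1-nearest-neighbour, squared Euclidean distance)."""
    def dist(example):
        (a, i), _ = example
        return (a - age) ** 2 + (i - income) ** 2
    _, label = min(training, key=dist)
    return label

print(predict(48, 70))  # 1 – closest to the customers who bought
print(predict(26, 22))  # 0 – closest to the customers who didn't
```

With more and better data the boundaries such a model draws become more reliable, but as the snippet shows, nothing about the technique itself requires big data.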