Data preprocessing describes any type of processing performed on raw data to prepare it for another processing procedure. It was traditionally used as a preliminary step for a data mining process. More recently, these techniques have evolved for training machine learning and AI models and for running inferences against them. Also, these techniques can be used in combination with a variety of data sources, including data stored in files or databases, or being emitted by streaming data systems.
Data preprocessing transforms the data into a format that will be more easily and effectively processed for the purpose of the user -- for example, in a neural network. There are a number of different tools and methods used for preprocessing, including:
- sampling, which selects a representative subset from a large population of data;
- transformation, which manipulates raw data to produce a single input;
- denoising, which removes noise from data;
- normalization, which organizes data for more efficient access; and
- feature extraction, which pulls out specified data that is significant in some particular context.
In a customer relationship management (CRM) context, data preprocessing is a component of web mining. Web usage logs may be preprocessed to extract meaningful sets of data called user transactions, which consist of groups of URL references. User sessions may be tracked to identify the user, the websites requested and their order, and the length of time spent on each one. Once these have been pulled out of the raw data, they yield more useful information that can be put to the user's purposes, such as consumer research, marketing or personalization.
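As a rough illustration of how usage logs might be grouped into user sessions, here is a minimal Python sketch. The log layout (user ID, timestamp in seconds, URL), the 30-minute inactivity timeout and the function names are all assumptions made for this example; they are not taken from any particular web mining tool.

```python
from collections import defaultdict

# Assumed inactivity timeout; a real deployment would tune this value.
SESSION_GAP_SECONDS = 30 * 60


def summarize(user_id, hits):
    """Reduce a list of (timestamp, url) hits to one session record."""
    return {
        "user": user_id,
        "urls": [url for _, url in hits],  # pages in the order requested
        "duration_seconds": hits[-1][0] - hits[0][0],  # first to last request
    }


def build_sessions(log_entries, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp, url) tuples into per-user sessions.

    A new session starts whenever two consecutive requests from the
    same user are more than `gap` seconds apart.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, url in log_entries:
        by_user[user_id].append((timestamp, url))

    sessions = []
    for user_id, hits in by_user.items():
        hits.sort()  # order each user's requests by time
        current = [hits[0]]
        for hit in hits[1:]:
            if hit[0] - current[-1][0] > gap:
                sessions.append(summarize(user_id, current))
                current = []
            current.append(hit)
        sessions.append(summarize(user_id, current))
    return sessions


# Hypothetical log entries, for illustration only.
log = [
    ("u1", 0, "/home"),
    ("u1", 60, "/products"),
    ("u2", 10, "/home"),
    ("u1", 4000, "/home"),
]
sessions = build_sessions(log)
```

In this sketch, the duration covers only the time between the first and last request in a session, because a raw log does not reveal how long the user stayed on the final page.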
In an AI context, data preprocessing improves the way data is cleansed, transformed and structured, which can raise the accuracy of a new model while reducing the amount of compute required.
Why is data preprocessing important?
Virtually any type of data analytics, data science or AI development requires some type of data preprocessing to provide reliable, precise and robust results for enterprise applications. Good preprocessing can help align the way data is fed into various algorithms for building machine learning or deep learning models.
Real-world data is messy and is often created, processed and stored by a variety of humans, business processes and applications. While it may be suitable for the purpose at hand, a data set may be missing individual fields, contain manual input errors, or have duplicate data or different names to describe the same thing. Although humans can often identify and rectify these problems in the line of business, this data needs to be automatically preprocessed when it is used to train machine learning or deep learning algorithms.
Machine learning and deep learning algorithms work best when data is presented in a particular format that highlights the relevant aspects required to solve a problem. Feature engineering practices that involve data wrangling, data transformation, data reduction and feature scaling help restructure raw data into a form better suited for a particular type of algorithm. This can significantly reduce the processing and time required to train a new machine learning or AI algorithm, or run an inference against it.
One caution in preprocessing is the risk of reencoding bias into the data set. This is critical for applications that help make decisions that affect people, such as loan approvals. Although data scientists may deliberately ignore variables like gender, race or religion, these traits may be correlated with other variables, like zip codes or schools attended, which can then act as proxies for them.
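One simple way to look for such correlated variables is to measure how strongly each candidate feature tracks a protected attribute. The sketch below uses a plain Pearson correlation; the feature names, the sample values and the 0.8 threshold are hypothetical, and a real bias audit would need far more than a single correlation check.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def flag_proxies(protected, candidates, threshold=0.8):
    """Return names of features whose |correlation| meets the threshold."""
    return [
        name
        for name, values in candidates.items()
        if abs(pearson(protected, values)) >= threshold
    ]


# Hypothetical data: a protected attribute encoded as 0/1, plus two features.
protected = [0, 0, 0, 1, 1, 1]
candidates = {
    "zip_group": [0, 0, 0, 1, 1, 1],       # tracks the protected attribute
    "years_employed": [2, 7, 4, 3, 6, 5],  # largely unrelated
}
flagged = flag_proxies(protected, candidates)
```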
Most modern data science packages and services now include various preprocessing libraries that help to automate many of these tasks.
Data preprocessing steps
The steps used in data preprocessing include:
- Inventory data sources. Data scientists should survey the data sources to understand where the data came from, identify any quality issues and form a hypothesis about which features might be relevant for the analytics or machine learning task at hand. They should also consider which preprocessing libraries could be used for a given data set and goal.
- Fix quality issues. The next step lies in finding the easiest way to rectify quality issues, such as eliminating bad data, filling in missing data or otherwise ensuring the raw data is suitable for feature engineering.
- Identify important features. The data scientist needs to think about how different aspects of the data need to be organized to make the most sense for the goal. This could include things like structuring unstructured data, combining salient variables when it makes sense or identifying important ranges to focus on.
- Feature engineering. In this step, the data scientist applies the various feature engineering libraries to the data to effect the desired transformations. The result should be a data set organized to achieve the optimal balance between the training time for a new model and the required compute.
- Validate results. At this stage, the data scientist needs to split the data into two sets for training and testing. The first set is used to train a machine learning or deep learning model. The second set of testing data is used to test the accuracy and robustness of the resulting model. This step will help the data scientist assess any problems in their hypothesis about cleaning and feature engineering the data.
- Repeat or complete. If the data scientist is satisfied with the results, they can push the preprocessing task to a data engineer who can figure out how to scale it for production. If not, the data scientist can go back and make changes to the way they implemented the data cleansing and feature engineering steps. It's important to note that preprocessing, like other aspects of data science, is an iterative process for testing out various hypotheses about the best way to perform each step.
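The split described in the validation step can be sketched in a few lines of Python. The function name, the 20% holdout fraction and the fixed seed are assumptions chosen for this example; many machine learning libraries ship their own equivalent helper.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle records and hold out a fraction of them for testing.

    A fixed seed keeps the split reproducible between runs.
    """
    shuffled = list(records)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records; in practice these would be preprocessed rows.
records = list(range(10))
train, test = train_test_split(records)
```

Keeping the held-out set untouched during cleaning and feature engineering decisions is what lets it reveal problems with those decisions later.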
Data preprocessing techniques
There are two main categories of preprocessing, each of which includes a variety of techniques: data cleansing and feature engineering.
Data cleansing includes various approaches for cleaning up messy data, such as:
Identify and sort out missing data. There are a variety of reasons that a data set might be missing individual fields of data. Data scientists need to decide whether it is better to discard records with missing fields, ignore them or fill them in with a probable value. For example, in an IoT application that records temperature, it may be safe to add in the average temperature between the previous and subsequent record when required.
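The temperature example could look something like the following sketch, which fills a missing reading with the average of its immediate neighbors. Representing a missing value as None and the function name are assumptions for illustration.

```python
def fill_missing(readings):
    """Fill each None with the mean of the previous and next reading.

    Gaps at either end, or next to another gap, are left as None
    because there are not two known neighbors to average.
    """
    filled = list(readings)
    for i, value in enumerate(readings):
        if value is None:
            prev_value = readings[i - 1] if i > 0 else None
            next_value = readings[i + 1] if i + 1 < len(readings) else None
            if prev_value is not None and next_value is not None:
                filled[i] = (prev_value + next_value) / 2
    return filled


# Hypothetical sensor readings in degrees Fahrenheit.
temps = [71.0, None, 73.0, 74.0, None]
filled_temps = fill_missing(temps)
# filled_temps is [71.0, 72.0, 73.0, 74.0, None]
```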
Noisy data. Real-world data is often noisy, which can distort an analytic or AI model. For example, a temperature sensor might erroneously report a temperature as 250 degrees Fahrenheit, while previous and subsequent measurements might be about 75 degrees. A variety of statistical approaches can be used to reduce the noise, including binning, regression and clustering.
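As one possible illustration of binning, the sketch below replaces every reading with the median of its bin, a variant often called smoothing by bin medians. Forming bins from consecutive readings and using a bin size of three are assumptions that suit time-ordered sensor data; other binning schemes sort the values first.

```python
from statistics import median


def smooth_by_bin_median(values, bin_size=3):
    """Replace each value with the median of its bin of consecutive values."""
    smoothed = []
    for start in range(0, len(values), bin_size):
        bin_values = values[start:start + bin_size]
        # Every reading in the bin takes on the bin's median.
        smoothed.extend([median(bin_values)] * len(bin_values))
    return smoothed


# Hypothetical readings with one erroneous spike of 250 degrees.
readings = [74.0, 75.0, 250.0, 76.0, 75.0, 74.0]
smoothed = smooth_by_bin_median(readings)
# smoothed is [75.0, 75.0, 75.0, 75.0, 75.0, 75.0]
```

The median is used here rather than the mean because a single extreme value would still pull a bin's mean far from the typical reading.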
Identify and remove duplicates. When two records seem to repeat, an algorithm needs to determine if the same measurement was recorded twice or the records represent different events. In some cases, there may be slight differences in a record because one field was recorded incorrectly. In other cases, different records might represent a father and son living in the same house, which really do represent separate individuals. Techniques for identifying and removing or joining duplicates can help to automatically address these types of problems.
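A very simple form of duplicate detection compares records on a normalized key. In this sketch the field names and the choice of name plus address as the key are hypothetical; production systems typically use fuzzier matching.

```python
def record_key(record):
    """Build a comparison key that ignores case and stray whitespace."""
    return (
        record["name"].strip().lower(),
        record["address"].strip().lower(),
    )


def deduplicate(records):
    """Keep the first record seen for each normalized key."""
    seen = {}
    for record in records:
        seen.setdefault(record_key(record), record)
    return list(seen.values())


# Hypothetical records: the second repeats the first with sloppy formatting,
# while the third is a different person at the same address.
people = [
    {"name": "John Smith", "address": "12 Oak St"},
    {"name": " john smith", "address": "12 Oak St "},
    {"name": "John Smith Jr.", "address": "12 Oak St"},
]
unique_people = deduplicate(people)
```

Including the name in the key is what keeps the father and son apart; keying on address alone would wrongly merge them.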
Feature engineering refers to various techniques for organizing data in ways that make it more efficient to train models and run inferences against them. These techniques include:
Feature scaling or normalization. Often, multiple variables change over different scales, or one will change linearly while another will change exponentially. For example, salary might be measured in thousands of dollars, while age will be represented in double digits. Scaling transforms the data in a way that makes it easier for algorithms to tease apart a meaningful relationship between variables.
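Min-max scaling is one common way to put such variables on the same footing. The sketch below maps each column onto the range 0 to 1; the sample salaries and ages are made up for illustration, and other schemes, such as standardizing to zero mean and unit variance, are equally valid choices.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical columns on very different scales.
salaries = [40_000, 60_000, 100_000]
ages = [25, 40, 55]

scaled_salaries = min_max_scale(salaries)
scaled_ages = min_max_scale(ages)
# scaled_ages is [0.0, 0.5, 1.0]
```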
Data reduction. A data scientist may wish to combine a variety of data sources for creating a new AI or analytics model. Some of the variables may not be correlated with a given outcome -- such as the likelihood of loan repayment -- and may be safely discarded. Other variables might be relevant, but only in terms of relationship -- such as the ratio of debt to credit -- and may be combined into a single variable.
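The debt-to-credit example might be sketched as follows. The field names, and the idea that a column such as favorite_color has already been judged irrelevant, are assumptions for illustration only.

```python
def reduce_features(applicant, drop=("favorite_color",)):
    """Drop irrelevant fields and fold debt and credit into one ratio.

    Assumes credit_limit is non-zero; real data would need a guard.
    """
    combined = ("debt", "credit_limit")
    reduced = {
        key: value
        for key, value in applicant.items()
        if key not in drop and key not in combined
    }
    reduced["debt_to_credit"] = applicant["debt"] / applicant["credit_limit"]
    return reduced


# Hypothetical loan applicant record.
applicant = {
    "income": 52_000,
    "debt": 5_000,
    "credit_limit": 20_000,
    "favorite_color": "blue",
}
reduced = reduce_features(applicant)
# reduced is {"income": 52000, "debt_to_credit": 0.25}
```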
Discretization. It's often useful to lump raw numbers into discrete intervals. For example, income might be broken into five ranges that are representative of people who typically apply for a given type of loan. This can reduce the overhead of training a model or running inferences against it.
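A minimal sketch of the income example, using five ranges, is shown below. The bucket boundaries and labels are invented for illustration; in practice they would be chosen from the distribution of actual applicants.

```python
from bisect import bisect_right

# Hypothetical boundaries: four edges define five income ranges.
INCOME_EDGES = [25_000, 50_000, 75_000, 100_000]
INCOME_LABELS = ["under_25k", "25k_50k", "50k_75k", "75k_100k", "100k_plus"]


def income_bucket(income):
    """Map a raw income to the label of the range that contains it.

    An income exactly on an edge falls into the higher range.
    """
    return INCOME_LABELS[bisect_right(INCOME_EDGES, income)]


buckets = [income_bucket(x) for x in (18_000, 25_000, 62_000, 250_000)]
# buckets is ["under_25k", "25k_50k", "50k_75k", "100k_plus"]
```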
Feature encoding. Another aspect of feature engineering lies in organizing unstructured data into a structured format. Unstructured data formats can include text, audio and video. For example, the process of developing natural language processing algorithms typically starts by using data transformation algorithms like Word2vec to translate words into numerical vectors. This makes it easy to represent to the algorithm that words like "mail" and "parcel" are similar, while a word like "house" is completely different. Similarly, a facial recognition algorithm might reencode raw pixel data into vectors representing the distances between parts of the face.
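Once words are encoded as vectors, similarity can be measured numerically, for instance with cosine similarity. The three-dimensional vectors below are invented purely to illustrate the idea; they are not output from Word2vec, whose vectors are learned from text and typically have hundreds of dimensions.

```python
from math import sqrt

# Toy vectors made up for this example, NOT real Word2vec embeddings.
TOY_VECTORS = {
    "mail": [0.9, 0.8, 0.1],
    "parcel": [0.8, 0.9, 0.2],
    "house": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


mail_parcel = cosine_similarity(TOY_VECTORS["mail"], TOY_VECTORS["parcel"])
mail_house = cosine_similarity(TOY_VECTORS["mail"], TOY_VECTORS["house"])
# With these toy values, mail_parcel is much closer to 1.0 than mail_house.
```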