Data Preparation for ML: A Brief Guide

01/06/2021
6 minute read

Data preparation techniques to help your machine learning (ML) model yield better predictive power.

Perhaps the most pivotal step in any machine learning application is the data preparation phase. On average, data scientists spend more time preparing and transforming datasets than on any other task, including training the model itself. Each machine learning algorithm requires data in a particular format, and transforming your data into that format has a direct impact on both the performance and the predictive power of the model. Understanding and applying the proper data preparation methods will strengthen any machine learning pipeline. Below is a brief guide to common data preparation techniques. 

What is Data Preparation?

Data preparation is an essential step in the machine learning process, in which raw captured data is transformed into a format compatible with the given machine learning algorithm. It involves analyzing and transforming the data through data selection, data cleaning, and feature engineering techniques.

Data Selection 

Before getting into data cleaning techniques, it is important not to overlook data quality. No amount of data cleaning can fix “garbage in, garbage out” in machine learning, so being mindful of quality and integrity during data capture is a key step. Data should be captured from a reliable source. Once the source is known and trusted, statistical data sampling techniques can help you avoid high bias or high variance. There are many types of data bias to be wary of, such as selection, exclusion, observer, and measurement bias. In addition, the volume of your data should be sufficient for your model: a small training set can lead to higher error rates in your results. 
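
As a concrete illustration, here is a minimal sketch of stratified sampling with scikit-learn; the DataFrame and its column names ("feature", "label") are invented for the example. Keeping class proportions intact in your splits is one simple guard against selection bias.

```python
# A minimal stratified-sampling sketch; the data is invented for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "feature": range(100),
    "label": ["a"] * 80 + ["b"] * 20,  # imbalanced classes (80/20)
})

# stratify=df["label"] preserves the 80/20 class ratio in both splits,
# reducing the risk of a skewed sample from a naive random split
train, test = train_test_split(
    df, test_size=0.25, stratify=df["label"], random_state=42
)

print(train["label"].value_counts(normalize=True))  # ~0.80 / 0.20
print(test["label"].value_counts(normalize=True))   # ~0.80 / 0.20
```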

Data Cleaning 

Once you confirm that the data has been sourced reliably, you can further improve its quality through data cleaning techniques. Data cleaning is a vital process in machine learning. Oftentimes, a given machine learning algorithm will have specific data formatting requirements, and through data cleaning techniques you transform your data so that it is ready for training. 

If you’d like professional help to clean and fine-tune your data for your specific ML project, be sure to check out the data services provided by the TAUS NLP Team, which help make your datasets ready for training. Additionally, datasets published on the TAUS Data Marketplace automatically get cleaned and anonymized. Every data publisher on the TAUS Data Marketplace can download the clean version of their datasets free of charge.  

Formatting 

Data formatting ensures that your data has the proper types and structures. For example, dates may arrive as strings when you need them in datetime format, or the data may be in a proprietary format when you wish to load it into a relational database. These initial steps are crucial for all subsequent data cleaning steps. 
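
As an illustration, here is a minimal pandas sketch of these type fixes; the column names ("order_date", "amount") and values are invented for the example.

```python
# A minimal data-formatting sketch; column names and values are invented.
import pandas as pd

df = pd.DataFrame({
    "order_date": ["2021-01-06", "2021-02-14", "2021-03-01"],  # strings
    "amount": ["10.50", "20.00", "15.25"],                     # strings
})

# Convert string columns to proper dtypes before any further cleaning
df["order_date"] = pd.to_datetime(df["order_date"], format="%Y-%m-%d")
df["amount"] = df["amount"].astype(float)

print(df.dtypes)  # order_date: datetime64[ns], amount: float64
```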

Feature Engineering 

Feature engineering focuses specifically on the attributes of a dataset, which act as the direct input to the machine learning model. Hence, the features you use, and the steps you take to improve their quality, influence your output more than anything else. Some common and powerful feature engineering techniques are normalization, standardization, null handling, dealing with sparse features, and outlier detection.

Feature Normalization and Standardization

Features in a dataset are often on different scales. For example, you may have data samples with values well over 1 when you need everything to lie between 0 and 1. Feature normalization scales a data attribute so that its values fall into a particular range, namely 0 to 1 or some other pre-defined scale. A common normalization technique is min-max normalization, a linear transformation that uses the minimum and maximum values of the feature during scaling. Normalization can also be applied to string values when fixing syntax errors or naming conventions. 
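
Here is a minimal NumPy sketch of min-max normalization, using invented values:

```python
# Min-max normalization: (x - min) / (max - min) maps values into [0, 1].
import numpy as np

x = np.array([12.0, 48.0, 30.0, 95.0, 60.0])  # invented feature values

x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm)  # 12 maps to 0.0, 95 maps to 1.0, the rest fall in between
```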

Standardization (also known as z-score normalization) is similar to normalization, with the difference being that the mean and standard deviation of the feature are taken into account, rescaling values to have a mean of 0 and a standard deviation of 1. This is important when features have very different ranges of values, in which case min-max normalization would not capture the differentiation of values properly. Standardization is also handy because it reduces the effect of outliers compared to min-max scaling. 
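
And the corresponding z-score standardization on the same invented values:

```python
# Z-score standardization: subtract the mean and divide by the standard
# deviation, so the transformed feature has mean 0 and std 1.
import numpy as np

x = np.array([12.0, 48.0, 30.0, 95.0, 60.0])  # invented feature values

x_std = (x - x.mean()) / x.std()

print(round(x_std.mean(), 6), round(x_std.std(), 6))  # ~0.0 and 1.0
```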

Null Handling 

Null handling is another critical step in feature engineering for machine learning. Null values can throw off your model, so they need to be dealt with before training. Common techniques include dropping rows with nulls, imputing missing values for continuous or categorical variables, and using another modeling technique to fill in missing values. Imputation is generally preferred over dropping rows because it preserves the size of the dataset. One simple way to impute a numerical feature is to use a sensible default value for the dataset, such as 0, 1, or the median. Categorical features can likewise be imputed with a default value, or with the most frequently occurring value (the mode) of the feature column. 
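
A minimal pandas sketch of both imputation strategies; the columns ("age", "color") and their values are invented:

```python
# Median imputation for a numerical feature, mode imputation for a
# categorical one; the dataset size is preserved.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 33, np.nan],          # numerical feature
    "color": ["red", "blue", None, "red", None],  # categorical feature
})

df["age"] = df["age"].fillna(df["age"].median())         # median = 33
df["color"] = df["color"].fillna(df["color"].mode()[0])  # mode = "red"

print(df.isnull().sum())  # no nulls remain in either column
```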

Sparse Features 

Sparse features are features whose values are mostly zero. Unlike nulls, the values are known; however, they provide little value in many cases. When a dataset has many sparse features, the space and time complexity of your model will increase. Furthermore, your model may end up behaving in unexpected or unknown ways due to the noise that sparsity introduces. The simplest way to handle sparse data is to remove sparse features from your dataset. The alternative is to fill in those values with a pre-defined method, such as the mean or mode, when applicable. 
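
A minimal sketch of dropping sparse features; the 95% zero-fraction threshold and the column names are arbitrary choices for illustration:

```python
# Drop any column whose values are zero in more than 95% of rows.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dense": np.random.rand(100),            # mostly non-zero
    "sparse": [0.0] * 97 + [1.0, 2.0, 3.0],  # 97% zeros
})

threshold = 0.95  # arbitrary cutoff for this example
zero_fraction = (df == 0).mean()  # fraction of zeros per column
sparse_cols = zero_fraction[zero_fraction > threshold].index

df = df.drop(columns=sparse_cols)
print(df.columns.tolist())  # ["dense"]
```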

Outlier Detection 

Outliers have the potential to add noise to your dataset, introducing bias errors in your output. Certain machine learning algorithms are highly sensitive to outliers, such as linear and logistic regression and support vector machines. One way to deal with outliers is to use statistical methods based on the standard deviation or on percentiles. Taking a close look at the distribution of your data can help you understand where your outliers live. With the standard deviation method, for example, roughly 95% of normally distributed data falls within two standard deviations of the mean, so points outside that range can be flagged as outliers. Identifying outliers using percentiles is another simple approach: data points that lie outside a predefined percentile range, say below the 1st or above the 99th percentile, can be deemed outliers. Visualizations like boxplots and scatterplots are great for seeing where the noise occurs in your dataset. 
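
A minimal sketch of both filtering approaches on invented, roughly normal data:

```python
# Filter outliers by the standard deviation method and by percentiles.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(np.concatenate([rng.normal(50, 5, 500), [120.0, -30.0]]))

# Standard deviation method: keep points within mean +/- 2 standard deviations
mean, std = s.mean(), s.std()
kept_sd = s[(s >= mean - 2 * std) & (s <= mean + 2 * std)]

# Percentile method: keep points between the 1st and 99th percentiles
low, high = s.quantile(0.01), s.quantile(0.99)
kept_pct = s[(s >= low) & (s <= high)]

print(len(s), len(kept_sd), len(kept_pct))  # both drop the extreme points
```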

Summary

Implementing proper data collection, data cleaning, and feature engineering strategies should not be overlooked. Data preparation in machine learning may seem like a tedious process, but if done properly, this step can make or break model performance. 

 

Author
Husna Sayedi

Husna is a data scientist who studied Mathematical Sciences at the University of California, Santa Barbara. She also holds a master’s degree in Engineering, Data Science from the University of California, Riverside. She has experience in machine learning, data analytics, statistics, and big data. She enjoys technical writing when she is not working and is currently responsible for the data science-related content at TAUS.
