Machine Learning “Data Preprocessing”

Iqra Anwar
Jun 12, 2021

Why do we need preprocessing?

Real-world data is never clean: raw data contains dummy and duplicate values, and it needs to be properly cleaned so that the model understands the data and is not confused when making predictions. Preprocessing lets us purify raw data through data cleaning; as the saying goes, "the more the data is cleaned, the more useful insights we get." To avoid hurting a machine learning model's accuracy, we have to do data preprocessing first.

Machine Learning and Data

To ML, data is a pattern: ML learns patterns from the dataset. But if the data has not been through the full preprocessing pipeline, if it has not been cleaned, or if it is unscaled, the predictions will always be unacceptable.

Preprocessing Techniques

There are various techniques, each suited to different types of data and different problems.

1. StandardScaling

Why do we use it? It transforms continuous data into normally distributed data. But first, a quick glance at the data itself. There are mainly two types of data:

  1. Qualitative/Descriptive
  2. Quantitative

Quantitative data is further divided into two types:

  1. Discrete data
  2. Continuous data

So when we have data in continuous form (2.4, 5.3, 6.1, 6.3) with high variance, meaning the data is widely spread and not normally distributed, we use StandardScaler to transform the entire dataset into a normal distribution, so that the mean of the dataset is zero and the standard deviation is one (approximately).
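Under the hood, standardization is just z = (x − μ) / σ: subtract the mean, divide by the standard deviation. A minimal NumPy sketch using the sample values above (the variable names are mine, not from the original screenshots):

```python
import numpy as np

x = np.array([2.4, 5.3, 6.1, 6.3])

# Standardize by hand: subtract the mean, divide by the (population) std
z = (x - x.mean()) / x.std()

print(z)         # standardized values
print(z.mean())  # ~0 (up to floating-point error)
print(z.std())   # 1.0
```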

Mean

The mean is the average of the data. To find it, add up all the numbers and divide the sum by the number of data points. E.g.: 6 + 3 + 100 + 3 + 13 = 125, and 125 ÷ 5 = 25. The mean is 25.

Standard Deviation (STD)

It is a measure of how spread out the data points are. To find the standard deviation, we first need the mean; for the example above (6, 3, 100, 3, 13), the mean is 25. Then we calculate the variance: take each data point's difference from the mean, square it, and average the results.

Variance

σ² = ((6 − 25)² + (3 − 25)² + (100 − 25)² + (3 − 25)² + (13 − 25)²) / 5
= (361 + 484 + 5625 + 484 + 144) / 5
= 7098 / 5
= 1419.6

Standard Deviation

The standard deviation is the square root of the variance:

σ = √1419.6
= 37.677…
≈ 38 (to the nearest whole number)
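We can sanity-check these numbers with NumPy (note that np.var and np.std use the population formula, ddof=0, by default):

```python
import numpy as np

data = np.array([6, 3, 100, 3, 13])

print(data.mean())  # 25.0
print(data.var())   # 1419.6
print(data.std())   # 37.677...
```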

StandardScaler Example:

Here I have created a DataFrame and given it some values with different ranges. First of all, I imported the required modules with aliases.

Modules/Packages
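The screenshot showed the imports; a sketch of what they most likely were, assuming the usual aliases (the later snippets assume these):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```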

Then I created a DataFrame using pandas and NumPy; this is a randomly generated dataset.

randomly generated data
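A sketch of how such a dataset could be generated; the column names (x1 to x4) and the exact centers and spreads are my assumptions based on the description, not the original code:

```python
np.random.seed(42)  # assumed, for reproducibility

# Four normally distributed columns with different centers (0 to 10)
# and spreads, 500,000 values each
df = pd.DataFrame({
    "x1": np.random.normal(loc=0, scale=1, size=500_000),
    "x2": np.random.normal(loc=3, scale=2, size=500_000),
    "x3": np.random.normal(loc=6, scale=3, size=500_000),
    "x4": np.random.normal(loc=10, scale=4, size=500_000),
})
```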

Then I visualized it using matplotlib.

Normally Distributed random data
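The original plot appears to show one distribution curve per column; a simple stand-in using overlaid histograms:

```python
# Overlay each column's distribution to compare centers and spreads
for col in df.columns:
    plt.hist(df[col], bins=100, alpha=0.5, label=col)
plt.legend()
plt.show()
```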

This dataset is not real-world data; it is just randomly generated. Each column has 500,000 values, and the columns are distributed around different centers, ranging from 0 to 10.

If we want to train an ML model on this randomly generated data, its size matters: with 500,000 data points per column, the data is really big, and computing over it takes time. The more data we have, the more computational power we need, and training time grows accordingly.

StandardScaling helps reduce training time, because many models converge faster when the data is normally distributed with a mean of 0 and a standard deviation of (approximately) 1. At the same time, it does not compromise the features of the data: it keeps the data in a format that ML understands.

Scikit-learn

Scikit-learn is the most useful library for machine learning in Python. It contains a number of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction. Sklearn is the heart of ML. Here I have imported just the StandardScaler class from the sklearn package and assigned an instance of it to a variable "ss"; this variable can be reused wherever the scaler is required.
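In code, that is simply:

```python
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()  # reusable scaler instance
```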

StandardScaler()

A StandardScaler object has a method “.fit_transform()”, which fits the scaler to the widely spread dataset and transforms it; that is, it computes the scaling statistics from the training data and returns the transformed data. Earlier I created a DataFrame named “df” (see the image “randomly generated data”). A new variable “data_tf” is created, and the DataFrame is transformed using StandardScaler’s fit_transform method. To fit and transform the data, the following line of code is used.

fit_transform
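That line, as far as it can be reconstructed from the screenshot, is:

```python
# Fit the scaler on df and return the transformed values (a NumPy array)
data_tf = ss.fit_transform(df)
```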

To inspect the transformed data stored in the variable “data_tf”, I wrapped it back into a pandas DataFrame with the original column names and simply plotted the columns (x1, x2, x3, x4). All the data on the x-axis is now normally distributed, with a mean of zero (0) and a standard deviation of one (1) for every column. After applying StandardScaling, the magnitudes of the values have decreased as well.
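A sketch of that step (the column names are assumed to match the ones above):

```python
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
data_tf = pd.DataFrame(data_tf, columns=["x1", "x2", "x3", "x4"])

print(data_tf.mean().round(6))  # ~0 for every column
print(data_tf.std().round(6))   # ~1 for every column

# Re-plot: every column is now centered at 0 with unit spread
for col in data_tf.columns:
    plt.hist(data_tf[col], bins=100, alpha=0.5, label=col)
plt.legend()
plt.show()
```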

The next story will be about another preprocessing technique, which is “MinMaxScaler”.


Iqra Anwar

Data Engineer @ RC! BSCS graduate from BBSUL Karachi, Pakistan. Ex Technical writer at Omdena Pakistan Chapter | I write articles related to Computer Science