A Quick Guide to Data pre-processing for Machine Learning | Python | IMPUTATION | STANDARDISATION | Data Analysis | Data Science

Data Pre-processing | Imputation | Standardisation | Rescaling | Python




Before feeding your data to the machines, you have to prepare it for your machine learning algorithm. The quality of your data has to be good: it should not contain null values or out-of-constraint values, because the quality of your data is directly related to the quality of your trained model. For more info about the need for data pre-processing and its requirements, Click Here.



1. Imputation

Imputation simply means "filling in": this process helps you replace the missing values in your table. There are lots of algorithms that can't deal with null values and might give you errors or a badly trained model.

Let's have a look at the data below; it describes salaries across several domains...

THE DATASET

This data has some null values, which the machine learning algorithm cannot handle. This is not what we would call good quality data, so we have to use some library code to make it cleaner and more accurate.

What can we do?

First of all, we should replace all the missing values with suitable non-null values. So let's see how to do it...

Start by importing all the libraries...

import pandas as pd
import numpy as np
import sklearn

Line 1: We're importing the Pandas library because we have to make a data frame of our data, so that we can visualise the whole dataset as a table in the IDE.

Line 2: We're importing NumPy because we'll need np.nan later for finding the null values in the table.

Note: In Python (and pandas), a missing value is represented by NaN.

Line 3: Importing the scikit-learn library, the key library for today's whole tutorial. It provides all the data pre-processing classes and objects we need to prepare data for machine learning.

MyData = pd.read_csv('C://Users//Vicky//Downloads//MyData.txt')

Line 4: Load your dataset into a data frame with the help of pandas' read_csv() function.
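Before imputing anything, it's worth checking how many values are actually missing. A minimal sketch (assuming the data frame has the Age and Salary columns used below):

print(MyData.isnull().sum())  # number of NaN values in each column
print(MyData.head())          # peek at the first few rows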

from sklearn.impute import SimpleImputer
imputer1 = SimpleImputer(missing_values=np.nan,strategy='median')
imputer1.fit(MyData.iloc[:,1:2])
MyData.iloc[:,1:2]=imputer1.transform(MyData.iloc[:,1:2])

Line 5: We're importing the SimpleImputer class from the sklearn library so that we can perform the imputation on the table.

Line 6: We're making an object imputer1 from the class SimpleImputer. The SimpleImputer class takes these arguments:
  • missing_values --> The placeholder that marks an entry as missing. We can give it an integer, a string, or np.nan (for finding null values).

  • strategy --> The type of value we want to fill in the place of the missing values. We can set it to 'mean', 'median', 'most_frequent' or 'constant' (see the short sketch after this list).

  • Note: unlike the older Imputer class, SimpleImputer has no axis argument; it always imputes column by column, which is exactly what we want here.
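For instance, the 'constant' strategy pairs with a fill_value argument. A minimal sketch (the column choice here is only for illustration, not part of this tutorial's pipeline):

from sklearn.impute import SimpleImputer
import numpy as np

# Replace every NaN with a fixed value instead of a computed statistic
const_imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
filled = const_imputer.fit_transform(MyData.iloc[:, 1:2])  # e.g. the Age column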

Line 7: We're fitting our imputer1 object to the data with the fit() function, which computes the median of the selected column. With "MyData.iloc[:,1:2]" we select the column where we have to deal with missing data; in this case, the Age column. We slice 1:2 rather than index 1 so that the selection stays two-dimensional, which is the shape SimpleImputer expects.

It may take an extra minute to get comfortable with iloc, so play around with it until it clicks. You can refer to this link.
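To see why the 1:2 slice matters, compare the shapes of the two selections (a quick sketch you can run on MyData):

print(MyData.iloc[:, 1].shape)    # 1-D Series, e.g. (n,)
print(MyData.iloc[:, 1:2].shape)  # 2-D DataFrame, e.g. (n, 1) -- what SimpleImputer expects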

Line 8: We're transforming the column (filling in all the missing values) and assigning the result back to the old column, so that the old column is updated with the new data.

imputer2 = SimpleImputer(missing_values=np.nan,strategy='mean')
MyData.iloc[:,2:]=imputer2.fit_transform(MyData.iloc[:,2:])
print(MyData)

Line 9: This time we use the 'mean' strategy, for another column.

Line 10: fit_transform() performs in a single step what Line 7 and Line 8 did in two. Also, this time we have selected the other column location, "Salary".

Line 11: Printing the final result.

Output:

THE OUTPUT

See, in the output the NaN values are filled with the respective values we coded for.

Congratulations🎉, we're done with Imputation✨. It's so simple, na? So feel free to comment below; I'm waiting for your doubts and queries.

2. Standardisation

It is a data pre-processing technique that we use when there is a large difference between the scales (weightage) of multiple attributes. For example, in our data the "salary" column has a much larger scale than the "age" column. If you feed it to your algorithm as-is, the ML algorithm will put more weight on the "salary" part, and we don't want our algorithm to be biased towards one attribute. This is one reason your ML algorithm might not generate a good quality trained model.

So, we need to standardise all our data before we feed it to the algorithm, so that all the attributes have the same scale and weightage. For standardisation, we transform our data so that the mean of each attribute becomes 0 and its standard deviation becomes 1.

So, mathematically, we are transforming attributes with Gaussian distributions of differing means and standard deviations into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1: each value x becomes z = (x - mean) / standard deviation.

Let's have a look at python implementation...


from sklearn.preprocessing import StandardScaler

Line 12: We don't have to write code for the mean and standard deviation ourselves; there is an inbuilt class for this, StandardScaler.

std=StandardScaler()
stddata=std.fit_transform(MyData.iloc[: , 1:3])

Line 13: We're making an object of the class StandardScaler().

Line 14: We're fitting and transforming each value and storing the result in "stddata". It returns the values in the form of a NumPy array.
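You can sanity-check the result: after scaling, each column should have a mean of approximately 0 and a standard deviation of approximately 1. A quick check, assuming the code above has already run:

print(stddata.mean(axis=0))  # close to [0. 0.]
print(stddata.std(axis=0))   # close to [1. 1.]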

print(stddata)

Line 15: Print the OUTPUT:

OUTPUT: stddata

Congratulations🎉, we're done with Standardisation✨. It's so simple, na? So feel free to comment below; I'm waiting for your doubts and queries.

3. Rescale Method

In this type of data pre-processing, we rescale the data onto a common scale, within the range 0 to 1. Mathematically, we are performing normalisation here, whereas in the Standardisation method we standardised the data.

So, let's have a look at normalisation and standardisation:

Normalisation and standardisation are both part of feature scaling. Sometimes we face data in which one attribute is in kilograms and another attribute is in grams or litres. These features vary vastly in scale, and as I mentioned earlier, this is not the type of data we need to train a good quality model. That is where the concept of feature scaling comes in.

Normalisation: 

It is a feature scaling technique in which every value is scaled and shifted into the range 0 to 1; it is also called min-max scaling: x' = (x - min) / (max - min).
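As a tiny illustration of the formula (with hypothetical numbers, not from our dataset):

import numpy as np

x = np.array([10., 20., 40.])
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # [0. 0.333... 1.]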



Standardisation: 

In this type of feature scaling technique, the values are scaled so that their mean is 0 and their standard deviation is 1. This technique shifts the values to be centred around the mean with a unit standard deviation.


Let's have a look at the implementation...

from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler()
normdata = norm.fit_transform(MyData.iloc[: , 1:3])
print(normdata)

Line 16: Importing the class for normalisation, i.e. MinMaxScaler.

Line 17: Creating an object of the MinMaxScaler() class.

Line 18: Fitting and transforming the columns of the table "MyData". I have already explained "MyData.iloc[: , 1:3]" and fitting/transforming earlier in this post.

Line 19: Printing the OUTPUT.

The OUTPUT: normdata



You can observe in the output that all the values now range from 0 to 1.
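As with standardisation, you can verify this column by column (a quick check, assuming the code above has already run):

print(normdata.min(axis=0))  # [0. 0.]
print(normdata.max(axis=0))  # [1. 1.]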

Congratulations🎉, we're done with the Rescale Method✨. It's so simple, na? So feel free to comment below; I'm waiting for your doubts and queries.


Further, I will upload other methods for data pre-processing, so stay tuned with us.


If you love my work, then you can connect with me on LinkedIn and Github
