Preparing training data in Machine Learning

5 min read · Nov 21, 2018

Before proceeding with this article, I hope you have read my previous article, Starting in Machine Learning, to get a basic understanding of the complete ML workflow. If so, let’s continue with preparing the training data that we pass to ML algorithms to create a model, so that we can predict future outcomes.

For this preparation, we are going to use Python because it has well-supported libraries for ML, such as:

numpy for scientific computing.
pandas for data frames.
matplotlib for 2D plotting.
and many more.

So, we need to have Jupyter Notebook installed on our machine. Follow the link https://www.anaconda.com/download and install the latest version of Anaconda, a Python distribution that includes Jupyter Notebook as well.

After installation, launch Jupyter Notebook with the command:

jupyter notebook

It will open the notebook in our default web browser. There we will write the Python script for preparing the training data in cells of type “Code”, and use cells of type “Markdown” for simple headings.

In this article we focus on the data preparation stage of the ML workflow, which itself takes several steps to produce the training data.

Here, the question is: what about “Asking the right question”?

Well, the question we are going to ask here is: which people will develop diabetes in the future?

So, to predict that, we need some historical data. We will use a dataset based on research done in the 90s on the Pima Indian population. It lists people along with whether they have diabetes (true or false) and several other parameters. The whole dataset can be downloaded as a .csv file from the link below:

Pima Indians Diabetes Database: Predict the onset of diabetes based on diagnostic measures (www.kaggle.com)

Inspect and clean the data

Now that we have the data we need from the step above, here is a point worth keeping in mind:

“50–80% of a ML project is spent getting, cleaning and organising data”

So, let’s start with Jupyter Notebook. I assume it is open in the browser, the selected cell type is “Code”, and the downloaded .csv file containing the data is on our machine.

First, import the Python libraries that we need for this work:
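A minimal sketch of such an import cell, assuming the three libraries listed earlier are the ones needed (the aliases np, pd and plt are common conventions, not something the article specifies):

```python
import numpy as np                # numerical arrays
import pandas as pd               # data frames
import matplotlib.pyplot as plt   # 2D plotting

# In a notebook, the magic below renders plots inline. It is notebook-only
# syntax, so it is left as a comment here:
# %matplotlib inline
```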

Load the .csv file containing data:
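A sketch of the loading step using pandas. The file path and the helper name load_data are hypothetical; adjust the path to wherever the downloaded file was saved:

```python
import pandas as pd

# Hypothetical location of the downloaded file.
CSV_PATH = "./data/pima-data.csv"


def load_data(path=CSV_PATH):
    """Read the CSV file at the given path into a pandas DataFrame."""
    return pd.read_csv(path)

# In the notebook: df = load_data()
```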

We can then check the total number of rows and columns in the table:

Output: (768, 10), i.e. 768 rows and 10 columns.
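The shape attribute of a data frame gives this as a (rows, columns) tuple. The snippet below uses a tiny made-up frame so that it runs on its own; in the notebook, df is the frame loaded from the .csv file:

```python
import pandas as pd

# Made-up stand-in values, not the real dataset.
df = pd.DataFrame({
    "thickness": [35, 29, 0],
    "skin": [1.379, 1.1426, 0.0],
    "diabetes": [True, False, True],
})

rows, cols = df.shape   # shape is a (rows, columns) tuple
print(rows, cols)
```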

To check the first five rows in the table:
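A sketch using head, again on a made-up stand-in frame rather than the real data:

```python
import pandas as pd

# Made-up stand-in frame with 8 rows; in the notebook, df comes from the CSV.
df = pd.DataFrame({"thickness": [35, 29, 0, 23, 35, 0, 32, 45]})

first_rows = df.head(5)   # rows at positions 0 through 4
print(first_rows)
```

In a notebook cell, writing df.head(5) as the last line displays the rows without a print call.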

Output:

To check the last five rows:
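The counterpart of head is tail. A sketch on a made-up stand-in frame:

```python
import pandas as pd

# Made-up stand-in frame with 8 rows; in the notebook, df comes from the CSV.
df = pd.DataFrame({"thickness": [35, 29, 0, 23, 35, 0, 32, 45]})

last_rows = df.tail(5)   # rows at positions 3 through 7
print(last_rows)
```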

Output:

For detailed inspection, let’s define a function that plots the correlations between columns as a 2-D graph, and then call it:
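One way such a function could look is sketched below. Only the name plot_corr comes from this article; the size parameter, the use of matshow, and the stand-in frame (including the age column) are assumptions made for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_corr(df, size=11):
    """Plot the pairwise correlation of df's columns as a color grid.

    size is the figure width and height in inches (an assumed parameter).
    """
    corr = df.corr()                             # pairwise Pearson correlation
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr.values)                      # one colored cell per column pair
    ticks = list(range(len(corr.columns)))
    ax.set_xticks(ticks)
    ax.set_xticklabels(list(corr.columns))
    ax.set_yticks(ticks)
    ax.set_yticklabels(list(corr.columns))
    return ax


# Made-up stand-in frame; in the notebook, pass the frame loaded from the CSV.
df = pd.DataFrame({
    "thickness": [35, 29, 0, 23, 35],
    "skin": [1.379, 1.1426, 0.0, 0.9062, 1.379],
    "age": [50, 31, 32, 21, 33],
})
ax = plot_corr(df, size=4)
```

Cells on the diagonal always show the strongest color, because every column correlates perfectly with itself. An off-diagonal cell with the same color signals two columns that move together.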

Output:

To understand this graph better, let’s look at the numeric correlation between the columns on the x- and y-axes:
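The corr method of a data frame returns these numbers as a table. In the made-up stand-in frame below, skin is deliberately a scaled copy of thickness, to show what a redundant pair looks like (the age column is hypothetical):

```python
import pandas as pd

# Made-up stand-in values, not the real dataset.
df = pd.DataFrame({
    "thickness": [35, 29, 0, 23, 35],
    "skin": [1.379, 1.1426, 0.0, 0.9062, 1.379],   # thickness times a constant
    "age": [50, 31, 32, 21, 33],
})

corr = df.corr()   # pairwise Pearson correlation, values between -1 and 1
print(corr)
```

A value close to 1 or -1 between two different columns means they carry nearly the same information.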

Output:

If we look at the previous two outputs, especially the plot, we can see that skin and thickness are strongly correlated with each other, so the two columns carry essentially the same information and one of them is redundant for predicting the outcome. Hence, we can remove either skin or thickness with the following command:

Then run df.head(5) to confirm:
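A sketch of dropping the skin column with del, shown on a made-up stand-in frame:

```python
import pandas as pd

# Made-up stand-in values; in the notebook, df is the frame loaded from the CSV.
df = pd.DataFrame({
    "thickness": [35, 29, 0],
    "skin": [1.379, 1.1426, 0.0],
    "diabetes": [True, False, True],
})

del df["skin"]   # removes the column from df in place
print(df.columns.tolist())   # → ['thickness', 'diabetes']
```

An equivalent that returns a new frame instead of modifying df is df.drop(columns=["skin"]).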

Output:

Now we can see that the skin column has been removed. Let’s plot the graph again by calling the function plot_corr(df):

Hence, we have cleaned our data for the next step.

Mold the data

This step involves changing the data to a consistent format. Notice that the diabetes column currently holds boolean values, while all the other columns are numeric. So, let’s mold it to int format:

Now, run df.head(5):
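One way to do this is to map True to 1 and False to 0. The dictionary name diabetes_map is hypothetical, and the frame is a made-up stand-in:

```python
import pandas as pd

# Made-up stand-in values; in the notebook, df is the cleaned frame.
df = pd.DataFrame({
    "thickness": [35, 29, 0],
    "diabetes": [True, False, True],
})

diabetes_map = {True: 1, False: 0}
df["diabetes"] = df["diabetes"].map(diabetes_map)   # booleans become integers
print(df["diabetes"].tolist())   # → [1, 0, 1]
```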

No boolean in our data now :), which will make further analysis easier.

Before training, let’s check how many true and false cases of diabetes our data contains, since a model trained on heavily imbalanced data can predict poorly:
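A sketch of this check, assuming the diabetes column already holds 1 and 0. The five rows are made up for illustration:

```python
import pandas as pd

# Made-up stand-in values; in the notebook, df is the molded frame.
df = pd.DataFrame({"diabetes": [1, 0, 0, 1, 0]})

num_true = int((df["diabetes"] == 1).sum())    # rows marked as diabetic
num_false = int((df["diabetes"] == 0).sum())   # rows marked as not diabetic
total = num_true + num_false

print(f"True cases:  {num_true} ({num_true / total:.2%})")
print(f"False cases: {num_false} ({num_false / total:.2%})")
```

If one class makes up only a small fraction of the rows, that is worth knowing before choosing and evaluating an algorithm.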

Output:

That’s all we have to do to prepare our training data. In the upcoming article we will use this data as input to the selected ML algorithm, so that we can create a model to predict which people will develop diabetes in the future.

So, keep following us for further articles and if you have liked this article please give us a clap.

Thanks in advance. Happy Machine Learning :).

Written by Sachin Kumar
