Python Data Analysis Tutorial

In this tutorial, we'll use pandas, matplotlib, and scikit-learn to examine how the diameter and the weight of abalones relate to each other.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

Import Data

First, we want to import the data from the CSV file. For this, we'll use pandas.

In [2]:
data = pd.read_csv('abalone.csv')

Let's check whether the data was loaded correctly. We can access subsets of our data with []. For example, we can select the first five rows with [:5].

In [3]:
data[:5]
Out[3]:
   Diameter  Weight
0     0.280  0.2170
1     0.495  1.0805
2     0.345  0.4285
3     0.525  1.4290
4     0.510  1.5270
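
As an optional extra check (not strictly needed for the rest of the tutorial), we can also get a quick numeric overview of the dataset with shape and describe:

In [ ]:
# Number of rows and columns in the DataFrame
print(data.shape)
# Summary statistics (count, mean, std, min, quartiles, max) for each column
data.describe()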

Plot Data

We start by exploring the data. Let's create a plot of the abalones' diameter against their weight. For this, we use matplotlib. With the function scatter, we can create a scatter plot of our data points. We also add axis labels and a title to the plot.

In [4]:
plt.scatter(data.Diameter, data.Weight)
plt.xlabel('Diameter')
plt.ylabel('Weight')
plt.title('Abalone Diameter against Weight')
plt.show()

Linear Regression

From the plot above we can see that there is a strong correlation between the diameter and the weight of the abalones. Let's try to model this relationship using linear regression with the diameter as the feature (x) and the weight as the target (y). The relation between diameter and weight is arguably not exactly linear, but for our purposes here, it will be fine.
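
If we want to put a number on this correlation before fitting anything, we can compute the Pearson correlation coefficient with pandas (this is an optional step and not required for the regression below):

In [ ]:
# Pearson correlation between diameter and weight
# (a value close to 1 indicates a strong positive linear relation)
data.Diameter.corr(data.Weight)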

For the regression, we'll use the class LinearRegression from scikit-learn. We can fit a linear regression to our data using its fit function. This function expects the features to be a two-dimensional array (to support regression with multiple features), so we have to reshape the array we pass for our single feature (the diameter).

In [5]:
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()

x_reshaped = np.array(data.Diameter).reshape(-1, 1)
linear_model.fit(x_reshaped, data.Weight)
Out[5]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
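
After fitting, the learned slope and intercept are stored on the model as coef_ and intercept_. Printing them is a quick way to see which line was actually fitted (the exact values depend on the data, so none are shown here):

In [ ]:
# The fitted line is: weight ≈ intercept_ + coef_[0] * diameter
print('slope:', linear_model.coef_[0])
print('intercept:', linear_model.intercept_)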

Let's create a plot containing the data points and the regression line. To plot the regression line, we first need some x-values and their corresponding y-values lying on the line. We can create the x-values with np.linspace, which returns evenly spaced numbers in a given range. In our case, we get 100 evenly spaced x-values in the range from 0.25 to 0.6.

To find the y-values, we use the predict function of LinearRegression. As when fitting the data before, we have to reshape the x-values because we are working with a single feature.

In [6]:
regression_x = np.linspace(0.25, 0.6, 100).reshape(-1, 1)
regression_y = linear_model.predict(regression_x)
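
To convince ourselves of what predict does here, we could also compute the same values by hand from the fitted slope and intercept. This is just an optional sanity check:

In [ ]:
# Manually evaluate the fitted line: y = intercept + slope * x
manual_y = linear_model.intercept_ + linear_model.coef_[0] * regression_x.ravel()
# Should match the output of predict up to floating point error
np.allclose(manual_y, regression_y)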

We'll use the plotting code from above as the basis for plotting the data together with the regression line. To draw the line, we call the plot function and pass the x- and y-values on the regression line. We give the regression line a different color by passing a color of our choice to the argument c. Finally, we create a legend by passing labels to the calls of scatter and plot and then calling the function legend.

In [7]:
plt.scatter(data.Diameter, data.Weight, label='Abalone data')
plt.plot(regression_x, regression_y, c='red', label='Linear Regression')
plt.xlabel('Diameter')
plt.ylabel('Weight')
plt.title('Abalone Diameter against Weight')
plt.legend()
plt.show()
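
The line appears to follow the trend of the data reasonably well. If we want a simple numeric measure of the fit, we can compute the R² score on the data we trained on, using scikit-learn's built-in score function (an optional extra step):

In [ ]:
# R² score of the linear model on the training data
# (1.0 would be a perfect fit, values closer to 0 indicate a poorer fit)
linear_model.score(x_reshaped, data.Weight)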