Getting ready

We start by importing the required libraries. We suppress all warnings using the filterwarnings() function from the warnings module:

import warnings
warnings.filterwarnings('ignore')

import os
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.utils import resample

import matplotlib.pyplot as plt

Download the autompg.csv file from GitHub and copy it into your working folder. We then set our working folder as follows:

os.chdir('.../.../Chapter 5')
os.getcwd()

We read our data with read_csv() and prefix the name of the DataFrame with df_ so that it is easy to recognize as a DataFrame:

df_autodata = pd.read_csv("autompg.csv")

We check whether the dataset has any missing values as follows:

# The syntax below returns the names of the columns that have any missing values
columns_with_missing_values=df_autodata.columns[df_autodata.isnull().any()]

# We pass the column names with missing values to the dataframe to count the number
# of missing values
df_autodata[columns_with_missing_values].isnull().sum()
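
To see what each step of this check produces, here is a minimal sketch on a small, made-up DataFrame. The df_toy frame and its values are hypothetical stand-ins, not rows from autompg.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are illustrative only
df_toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 16.0, 17.0],
    "horsepower": [130.0, np.nan, 150.0, np.nan],
    "weight": [3504, 3693, 3436, 3433],
})

# Boolean Series indexed by column name: True where a column has at least one NaN
has_missing = df_toy.isnull().any()

# Indexing the column Index with that boolean Series keeps only the flagged names
columns_with_missing_values = df_toy.columns[has_missing]

# Count the NaN entries in each flagged column
missing_counts = df_toy[columns_with_missing_values].isnull().sum()
print(missing_counts)
```

Only horsepower is flagged here, with a count of two, because it is the only column in the toy frame that contains NaN.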

We notice that the horsepower variable has six missing values. We can fill in the missing values using the median of the horsepower variable's existing values with the following code:

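# Assign the result back; calling fillna(..., inplace=True) on a selected column
# may not update the original DataFrame in recent pandas versions
df_autodata['horsepower'] = df_autodata['horsepower'].fillna(df_autodata['horsepower'].median())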
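
As a small illustration of median imputation, the following sketch fills the gaps in a hypothetical five-row column. The df_toy frame and the median_hp name are assumptions made for this example only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries
df_toy = pd.DataFrame({"horsepower": [130.0, np.nan, 150.0, 100.0, np.nan]})

# median() skips NaN by default, so it is computed from 100, 130 and 150
median_hp = df_toy["horsepower"].median()

# fillna() returns a new Series; assigning it back updates the column
df_toy["horsepower"] = df_toy["horsepower"].fillna(median_hp)

print(df_toy)
```

The median is less sensitive to extreme values than the mean, which makes it a common default for a skewed variable.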

We notice that the carname variable is an identifier and is not useful in our model-building exercise, so we can drop it as follows:

df_autodata.drop(['carname'], axis=1, inplace=True)
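
If you would rather keep the original DataFrame untouched, one alternative is to let drop() return a new frame instead of using inplace=True. This is a sketch on hypothetical data; df_toy, df_model, and the placeholder car names are not part of the recipe:

```python
import pandas as pd

# Hypothetical toy data with an identifier column
df_toy = pd.DataFrame({
    "mpg": [18.0, 15.0],
    "horsepower": [130.0, 165.0],
    "carname": ["car_a", "car_b"],
})

# drop() returns a new DataFrame by default; df_toy itself is left unchanged
df_model = df_toy.drop(columns=["carname"])

print(df_model.columns.tolist())
```

Keeping both frames can be handy when the identifier is needed later, for example to label predictions.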

We can look at the first few rows of the data with the head() method:

df_autodata.head()