Getting ready

To start with, import os, pandas, and NumPy, along with the scikit-learn classes and functions used in this recipe, and set your working directory according to your requirements:

# import required packages
import os
import pandas as pd
import numpy as np

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import roc_auc_score, roc_curve, auc

# Set the working directory as required
os.chdir(".../.../Chapter 8")
os.getcwd()

Download the breastcancer.csv dataset from GitHub and copy it to your working directory. Read the dataset:

df_breastcancer = pd.read_csv("breastcancer.csv")

Take a look at the first few rows with the head() function:

df_breastcancer.head(5)

Notice that the diagnosis variable takes the values M and B, representing malignant and benign, respectively. We will label-encode the diagnosis variable to convert the M and B values into numeric values.

We use head() to see the changes:

# import LabelEncoder from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder

lb = LabelEncoder()
df_breastcancer['diagnosis'] = lb.fit_transform(df_breastcancer['diagnosis'])
df_breastcancer.head(5)
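
To make the encoding concrete, here is a minimal sketch on a toy list of labels (illustrative values, not rows from breastcancer.csv). LabelEncoder sorts the distinct labels and assigns codes by position, so B becomes 0 and M becomes 1:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the diagnosis column (illustrative values only)
labels = ["M", "B", "B", "M", "B"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the distinct labels in sorted order; each label's index is its code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)           # {'B': 0, 'M': 1}
print(encoded.tolist())  # [1, 0, 0, 1, 0]
```

If you need the original labels back later, encoder.inverse_transform() reverses the mapping.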

We then check whether the dataset has any null values:

df_breastcancer.isnull().sum()
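
As a small illustration of what this check returns, the sketch below runs it on a toy DataFrame with one missing value. The column names and values are hypothetical, chosen only for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a single missing value in the "radius" column
toy = pd.DataFrame({
    "radius": [14.2, np.nan, 11.8],
    "texture": [20.1, 18.4, 17.9],
})

# isnull() marks missing cells; sum() counts them per column
missing_per_column = toy.isnull().sum()

# Keep only the columns that actually contain missing values
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```

An empty result from the last step would mean no column needs imputation.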

We check the shape of the dataset with the shape attribute:

df_breastcancer.shape

We now separate our target and feature set. We also split our dataset into training and testing subsets:

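# Create feature & response variables
# Exclude the response variable and the id column from the feature set;
# the id carries no information for the analysis.
# Positional indexing assumes diagnosis is column 0 and id is column 1.
X = df_breastcancer.iloc[:, 2:31]

# Target
Y = df_breastcancer.iloc[:, 0]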

# Create train & test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=0, stratify=Y)
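
The stratify argument keeps the class proportions of the target the same in both subsets. The following sketch shows the effect on synthetic data (an assumed 70/30 class balance, not the breast cancer dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 70 samples of class 0 and 30 of class 1
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

# With stratification, the share of class 1 is preserved in both subsets
train_share = y_tr.mean()
test_share = y_te.mean()
print(len(y_tr), len(y_te))
print(train_share, test_share)
```

Without stratify, the two shares could drift apart, which matters most for small or imbalanced datasets.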

Now, we will move on to building our model using the AdaBoost algorithm.

Note that the accuracy and AUC scores you obtain may differ from those reported in this recipe because of the random train/test split and other sources of randomness.
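
One way to limit this variation between runs is to fix random_state wherever it is accepted. As a minimal sketch on synthetic data, two splits made with the same seed select the same test rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small synthetic dataset with two balanced classes
X = np.arange(20).reshape(-1, 1)
y = np.array([0, 1] * 10)

# Two independent calls with the same random_state
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.30, random_state=0, stratify=y)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.30, random_state=0, stratify=y)

# Identical seeds give identical splits, which makes results repeatable
same_seed_matches = np.array_equal(X_test_a, X_test_b)
print(same_seed_matches)  # True
```

Fixing the seed makes your own runs repeatable; it does not guarantee the same scores across different library versions or platforms.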
