To start, import the os, pandas, and NumPy packages along with the scikit-learn classes used in this recipe, then set your working directory:
# import required packages
import os
import pandas as pd
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, roc_curve, auc
# Set the working directory as needed
os.chdir(".../.../Chapter 8")
os.getcwd()
Download the breastcancer.csv dataset from GitHub and copy it to your working directory. Read the dataset:
df_breastcancer = pd.read_csv("breastcancer.csv")
Take a look at the first few rows with the head() function:
df_breastcancer.head(5)
Notice that the diagnosis variable takes the values M and B, representing malignant and benign, respectively. We will label-encode the diagnosis variable to convert these values into numbers. LabelEncoder assigns integer codes in sorted order of the class labels, so B becomes 0 and M becomes 1.
We apply the encoder and then use head() to see the changes:
# import LabelEncoder from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
df_breastcancer['diagnosis'] = lb.fit_transform(df_breastcancer['diagnosis'])
df_breastcancer.head(5)
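To confirm which integer each label received, the fitted encoder can be inspected through its classes_ attribute and reversed with inverse_transform(). The following is a minimal sketch on a small made-up Series; the values are illustrative stand-ins and are not taken from breastcancer.csv:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the diagnosis column (illustrative values only)
diagnosis = pd.Series(["M", "B", "B", "M", "B"])

lb = LabelEncoder()
encoded = lb.fit_transform(diagnosis)

# classes_ holds the original labels in sorted order; the position is the code
mapping = {label: int(code) for code, label in enumerate(lb.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform recovers the original labels from the integer codes
decoded = lb.inverse_transform(encoded)
print(decoded.tolist())
```

Keeping the fitted encoder around is useful later, when predictions need to be reported as M or B rather than as integers.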
We then check whether the dataset has any null values:
df_breastcancer.isnull().sum()
We check the dimensions of the dataset with the shape attribute:
df_breastcancer.shape
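The two checks above can be tried on a small frame to see what their output looks like. This is a sketch on made-up data with one deliberately missing value; the frame name df_toy and its column names and values are placeholders chosen for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (illustrative data only)
df_toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, np.nan],
    "texture_mean": [10.38, 17.77, 21.25],
})

# Per-column count of missing values
null_counts = df_toy.isnull().sum()
print(null_counts)

# Total number of missing cells across the whole frame
total_nulls = int(null_counts.sum())
print(total_nulls)

# shape is an attribute, not a method: it returns (rows, columns)
print(df_toy.shape)
```

A total of zero missing cells means no imputation step is needed before modeling.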
We now separate our target and feature set. We also split our dataset into training and testing subsets:
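# Create feature & response variables
# Features: the columns at positions 2 through 30. The id column is left out
# because it carries no information for the analysis, and the response is
# kept separately as the target
X = df_breastcancer.iloc[:,2:31]
# Target: select by name so the result does not depend on column order
Y = df_breastcancer['diagnosis']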
# Create train & test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=0, stratify=Y)
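The stratify argument asks train_test_split() to keep the class proportions of the target the same in both subsets, which matters when the classes are imbalanced. The sketch below illustrates this on synthetic data; the frame, the feature names f1 to f3, and the 30% positive rate are assumptions made for the example and do not describe breastcancer.csv:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30 of them in the positive class (illustrative only)
rng = np.random.RandomState(0)
X_toy = pd.DataFrame(rng.rand(100, 3), columns=["f1", "f2", "f3"])
Y_toy = pd.Series([1] * 30 + [0] * 70)

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size=0.30, random_state=0, stratify=Y_toy
)

# With stratify, each subset keeps the positive rate of the full data
print(Y_tr.mean(), Y_te.mean())

# test_size=0.30 sends 30 of the 100 rows to the test set
print(X_tr.shape, X_te.shape)
```

Without stratify, the split is purely random, so the positive rate in a small test set can drift away from the rate in the full dataset.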
Now, we will move on to building our model using the AdaBoost algorithm.