Machine-learning-based insights

Unlike the previous analysis methods, the techniques discussed in this subsection are based on more complex mathematical models and machine learning algorithms. Given the scope of this book, we will not go into the theoretical details of these models, but it is still worth seeing some of them in action by applying them to our dataset:

  1. First, let's consider the feature correlation matrix for our dataset. As the name suggests, this is a matrix (a 2D table) that contains the correlation between each pair of numerical attributes (or features) within our dataset. A correlation between two features is a real number between -1 and 1, indicating the magnitude and direction of the relationship. The closer the value is to -1 or 1, the more strongly correlated the two features are; values near 0 indicate little to no linear relationship.

    To obtain the feature correlation matrix from a Pandas DataFrame, we call the corr() method, like in our next code cell:
corr_matrix = combined_user_df.corr()
  2. We usually visualize a correlation matrix using a heat map, as implemented in the same code cell:
f, ax = plt.subplots(1, 1, figsize=(15, 10))
sns.heatmap(corr_matrix)

plt.show()

This code will produce the following visualization:

A feature correlation matrix heat map
  3. From this heat map, we can focus on the cells that are especially bright (indicating a strong positive correlation), as well as those that are especially dark (indicating a strong negative correlation). For example, we see high correlations between the time-based attributes in the lower-right corner of the heat map. This is reasonable, as they all describe statistics about a patient's typing speed. If you would like to list the strongest pairs programmatically, see the sketch right after this step.
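    The following is a minimal sketch, not part of the original code, of how we could surface the most strongly correlated feature pairs directly from corr_matrix; it assumes NumPy is already imported as np, and the 0.8 threshold is an arbitrary choice:

# Illustrative sketch: list the feature pairs whose absolute correlation
# exceeds an arbitrarily chosen threshold of 0.8
upper_triangle = np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
strong_pairs = (
    corr_matrix.where(upper_triangle)  # keep each pair only once
    .stack()                           # (feature_a, feature_b) -> correlation
    .loc[lambda s: s.abs() > 0.8]
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(strong_pairs)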
  4. Next, we will try applying a machine learning model to our dataset. Contrary to popular belief, many data science projects don't use machine learning models for predictive tasks, where we train a model to predict future data. Instead, we feed our dataset to a specific model so that we can extract more insights from the data we already have.

    Here, we are using the linear Support Vector Classifier (LinearSVC) model from scikit-learn to fit the data we have and derive a feature importance list from its coefficients:
#%%

import numpy as np
from sklearn.svm import LinearSVC


# Fill in the missing values in the two columns identified earlier,
# using the mode (the most common value) of each column
combined_user_df['BirthYear'] = combined_user_df['BirthYear'].fillna(
    combined_user_df['BirthYear'].mode(dropna=True)[0])
combined_user_df['DiagnosisYear'] = combined_user_df['DiagnosisYear'].fillna(
    combined_user_df['DiagnosisYear'].mode(dropna=True)[0])

# Separate the features from the target attribute
X_train = combined_user_df.drop(['Parkinsons'], axis=1)
y_train = combined_user_df['Parkinsons']

# Train a linear support vector classifier on the dataset
clf = LinearSVC()
clf.fit(X_train, y_train)


# Pick out the ten most positive and the ten most negative coefficients
nfeatures = 10

coef = clf.coef_.ravel()
top_positive_coefs = np.argsort(coef)[-nfeatures:]
top_negative_coefs = np.argsort(coef)[:nfeatures]
top_coefs = np.hstack([top_negative_coefs, top_positive_coefs])

Note that, before feeding our data to the machine learning model, we need to fill in the missing values in the two columns we identified earlier, BirthYear and DiagnosisYear. This is because some (if not most) machine learning models cannot handle missing values well, and it is up to the data engineers to decide how those values should be filled.

Here, we are using the mode (the most commonly occurring value) of these two columns to fill in the missing values. The mode tends to represent discrete/nominal attributes (which is what we have here) well. If you are working with numerical, continuous data such as length or area, it is also common practice to use the mean of the attribute instead, as sketched below. Finally, getting back to our current process, this code trains the model on our dataset and then reads the coef_ attribute of the trained model.
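As a minimal sketch of the mean-based alternative (the typing_speed column name here is hypothetical, not one from our dataset), the pattern looks like this:

# Hypothetical example: for a continuous attribute, fill missing values
# with the column mean instead of the mode
combined_user_df['typing_speed'] = combined_user_df['typing_speed'].fillna(
    combined_user_df['typing_speed'].mean())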

  5. This attribute contains the coefficients that serve as our feature importance list, which is visualized by the last section of the code:
plt.figure(figsize=(15, 5))

# Color negative coefficients red and positive coefficients blue
colors = ['red' if c < 0 else 'blue' for c in coef[top_coefs]]
plt.bar(np.arange(2 * nfeatures), coef[top_coefs], color=colors)

# Label each bar with the name of the corresponding feature
feature_names = np.array(X_train.columns)
plt.xticks(np.arange(2 * nfeatures), feature_names[top_coefs], rotation=60, ha='right')

plt.show()

This code produces the following graph:

Feature importance from SVC
  6. From the feature importance list, we can identify the features that the machine learning model relied on most heavily during training. A feature with a very high importance value could be correlated with the target attribute (whether someone has Parkinson's or not) in some interesting way. For example, we see that Tremors (which we know is strongly correlated with our target attribute) is the third most important feature for our current machine learning model. If you prefer a textual ranking over the bar chart, see the sketch after this step.
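    The following is a rough, optional sketch (not part of the original workflow) that prints the same ranking as text by sorting the coefficients by absolute value; it assumes pandas has already been imported as pd earlier in the chapter:

# Illustrative sketch: rank features by the absolute value of their coefficient
importance = pd.Series(coef, index=X_train.columns)
ranked = importance.reindex(importance.abs().sort_values(ascending=False).index)
print(ranked.head(nfeatures))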

That wraps up our discussion of analyzing our dataset. In the last section of the chapter, we will have a brief discussion on deciding how to write scripts in a Python data science project.
