Interaction features

Interaction features are obtained by performing mathematical operations on pairs of features, and they capture the effect of the relationship between variables. Here, we apply the basic arithmetic operations (multiplication, addition, subtraction, and division) to the numerical features:

# Constructing features manually based on the interaction between the individual features
import pandas as pd

numeric_features = df_titanic_data.loc[:,
    ['Age_scaled', 'Fare_scaled', 'Pclass_scaled', 'Parch_scaled', 'SibSp_scaled',
     'Names_scaled', 'CabinNumber_scaled', 'Age_bin_id_scaled', 'Fare_bin_id_scaled']]
print("\nUsing only numeric features for automated feature generation:\n", numeric_features.head(10))

new_fields_count = 0
for i in range(0, numeric_features.columns.size):
    for j in range(0, numeric_features.columns.size):
        # products (and sums below) are symmetric, so keep only pairs with i <= j
        if i <= j:
            name = str(numeric_features.columns.values[i]) + "*" + str(numeric_features.columns.values[j])
            df_titanic_data = pd.concat(
                [df_titanic_data, pd.Series(numeric_features.iloc[:, i] * numeric_features.iloc[:, j], name=name)],
                axis=1)
            new_fields_count += 1
        if i < j:
            name = str(numeric_features.columns.values[i]) + "+" + str(numeric_features.columns.values[j])
            df_titanic_data = pd.concat(
                [df_titanic_data, pd.Series(numeric_features.iloc[:, i] + numeric_features.iloc[:, j], name=name)],
                axis=1)
            new_fields_count += 1
        # quotients and differences are not symmetric, so keep both orderings
        if not i == j:
            name = str(numeric_features.columns.values[i]) + "/" + str(numeric_features.columns.values[j])
            df_titanic_data = pd.concat(
                [df_titanic_data, pd.Series(numeric_features.iloc[:, i] / numeric_features.iloc[:, j], name=name)],
                axis=1)
            name = str(numeric_features.columns.values[i]) + "-" + str(numeric_features.columns.values[j])
            df_titanic_data = pd.concat(
                [df_titanic_data, pd.Series(numeric_features.iloc[:, i] - numeric_features.iloc[:, j], name=name)],
                axis=1)
            new_fields_count += 2

print("\n", new_fields_count, "new features constructed")

This kind of feature engineering can produce a huge number of features very quickly. In the preceding code snippet, we used 9 features to generate 225 interaction features.
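
If the product terms are all you need, this step can also be automated. The snippet below is a minimal sketch using scikit-learn's PolynomialFeatures (assuming scikit-learn is available; the sums, differences, and quotients from the manual loop are not covered by it):

# A sketch of generating the pairwise product terms automatically, assuming
# the numeric_features DataFrame from the earlier snippet.
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interactions = poly.fit_transform(numeric_features)

# get_feature_names_out() labels each generated column, e.g. 'Age_scaled Fare_scaled'
interaction_df = pd.DataFrame(interactions,
                              columns=poly.get_feature_names_out(numeric_features.columns),
                              index=numeric_features.index)
print(interaction_df.shape[1], "columns, including the original features")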

We can also remove highly correlated features, as they add no extra information to the model. Spearman's rank correlation coefficient works well for this: feature pairs whose coefficient is close to ±1 carry essentially the same information, so one feature of each such pair can be dropped:

# using Spearman's correlation to remove features that have a high correlation
import numpy as np

# calculating the correlation matrix
df_titanic_data_cor = df_titanic_data.drop(['Survived', 'PassengerId'], axis=1).corr(method='spearman')

# creating a mask that zeroes out each feature's correlation with itself
mask_ignore = np.ones(df_titanic_data_cor.columns.size) - np.eye(df_titanic_data_cor.columns.size)
df_titanic_data_cor = mask_ignore * df_titanic_data_cor

features_to_drop = []

# collecting the highly correlated features
for column in df_titanic_data_cor.columns.values:

    # skip this feature if we have already decided to drop it
    if column in features_to_drop:
        continue

    # finding the variables that are highly correlated with the current one
    corr_vars = df_titanic_data_cor[abs(df_titanic_data_cor[column]) > 0.98].index
    features_to_drop = np.union1d(features_to_drop, corr_vars)

print("\nWe are going to drop", len(features_to_drop), "highly correlated features...")
df_titanic_data.drop(features_to_drop, axis=1, inplace=True)
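
For larger feature sets, the same pruning can be written more compactly by keeping only the upper triangle of the correlation matrix, so that each correlated pair is examined once. This is an alternative sketch of the loop above (not an additional step to run), under the same assumptions and threshold:

# A compact variant of the pruning loop: mask the lower triangle of the
# absolute Spearman correlation matrix and drop any column correlated above
# the threshold with an earlier column.
import numpy as np

corr = df_titanic_data.drop(['Survived', 'PassengerId'], axis=1).corr(method='spearman').abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
features_to_drop = [column for column in upper.columns if (upper[column] > 0.98).any()]
df_titanic_data = df_titanic_data.drop(columns=features_to_drop)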