In order to build a statistical model in an informed way, an intimate knowledge of the dataset is necessary. It is possible to build a successful model without knowing the data, but it is then a much more arduous task, or it would require far more technical resources to test all the possible combinations of features. Therefore, after spending the required 80% of the time cleaning the data, we spend the next 15% getting to know it!
I normally start with descriptive statistics. Even though DataFrames expose the .describe() method, since we are working with MLlib, we will use the .colStats(...) method.
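For comparison, the DataFrame route would be a one-liner; here is a sketch, using the births_transformed DataFrame we work with throughout this section:

# A sketch: DataFrame.describe() computes count, mean, stddev,
# min, and max for the listed columns.
births_transformed.describe(['MOTHER_AGE_YEARS', 'FATHER_COMBINED_AGE']).show()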
The .colStats(...) method takes an RDD of data to calculate the descriptive statistics for and returns a MultivariateStatisticalSummary object that contains the following descriptive statistics:
- count(): This holds a row count
- max(): This holds the maximum value in the column
- mean(): This holds the value of the mean for the values in the column
- min(): This holds the minimum value in the column
- normL1(): This holds the value of the L1-Norm for the values in the column
- normL2(): This holds the value of the L2-Norm for the values in the column
- numNonzeros(): This holds the number of nonzero values in the column
- variance(): This holds the value of the variance for the values in the column

You can read more about the L1- and L2-norms here: http://bit.ly/2jJJPJ0.
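If the norms are unfamiliar: for a column of values, the L1-norm is the sum of the absolute values, while the L2-norm is the Euclidean length of the column treated as a vector. A minimal NumPy sketch with made-up numbers:

import numpy as np

values = np.array([3.0, -4.0])
print(np.abs(values).sum())          # L1-norm: 7.0
print(np.sqrt((values ** 2).sum()))  # L2-norm: 5.0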
We recommend checking the documentation of Spark to learn more about these. The following is a snippet that calculates the descriptive statistics of the numeric features:
import pyspark.mllib.stat as st
import numpy as np

numeric_cols = ['MOTHER_AGE_YEARS', 'FATHER_COMBINED_AGE',
                'CIG_BEFORE', 'CIG_1_TRI', 'CIG_2_TRI', 'CIG_3_TRI',
                'MOTHER_HEIGHT_IN', 'MOTHER_PRE_WEIGHT',
                'MOTHER_DELIVERY_WEIGHT', 'MOTHER_WEIGHT_GAIN'
               ]

numeric_rdd = births_transformed \
    .select(numeric_cols) \
    .rdd \
    .map(lambda row: [e for e in row])

mllib_stats = st.Statistics.colStats(numeric_rdd)

for col, m, v in zip(numeric_cols,
                     mllib_stats.mean(),
                     mllib_stats.variance()):
    print('{0}: {1:.2f} {2:.2f}'.format(col, m, np.sqrt(v)))
The preceding code produces the following result:
As you can see, mothers are younger than fathers: the average age of mothers was 28, versus over 44 for fathers. A good indication (at least for some of the infants) is that many mothers quit smoking while pregnant; it is horrifying, though, that some continued smoking.
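The MultivariateStatisticalSummary object we stored in mllib_stats also exposes the remaining statistics from the list above; for example:

print(mllib_stats.count())        # number of rows summarized
print(mllib_stats.min())          # per-column minima
print(mllib_stats.max())          # per-column maxima
print(mllib_stats.numNonzeros())  # per-column counts of nonzero values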
For the categorical variables, we will calculate the frequencies of their values:
categorical_cols = [e for e in births_transformed.columns
                    if e not in numeric_cols]

categorical_rdd = births_transformed \
    .select(categorical_cols) \
    .rdd \
    .map(lambda row: [e for e in row])

for i, col in enumerate(categorical_cols):
    agg = categorical_rdd \
        .groupBy(lambda row: row[i]) \
        .map(lambda row: (row[0], len(row[1])))

    print(col, sorted(agg.collect(),
                      key=lambda el: el[1],
                      reverse=True))
Here is what the results look like:
Most of the deliveries happened in a hospital (BIRTH_PLACE equal to 1). Around 550 deliveries happened at home: some intentionally (BIRTH_PLACE equal to 3), and some not (BIRTH_PLACE equal to 4).
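For reference, the same frequencies can also be obtained directly through the DataFrame API; a sketch using the same births_transformed DataFrame:

# Count the occurrences of each distinct value, most frequent first.
for col in categorical_cols:
    births_transformed \
        .groupBy(col) \
        .count() \
        .orderBy('count', ascending=False) \
        .show()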
Correlations help to identify collinear numeric features and handle them appropriately. Let's check the correlations between our features:
corrs = st.Statistics.corr(numeric_rdd)

for i, el in enumerate(corrs > 0.5):
    correlated = [
        (numeric_cols[j], corrs[i][j])
        for j, e in enumerate(el)
        if e == 1.0 and j != i]

    if len(correlated) > 0:
        for e in correlated:
            print('{0}-to-{1}: {2:.2f}' \
                  .format(numeric_cols[i], e[0], e[1]))
The preceding code will calculate the correlation matrix and print only those features that have a correlation coefficient greater than 0.5: the corrs > 0.5 part takes care of that.
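Since st.Statistics.corr(...) returns the correlation matrix as a NumPy array, the comparison yields an element-wise Boolean mask. A toy illustration with made-up values:

import numpy as np

toy_corrs = np.array([[1.0, 0.8, 0.1],
                      [0.8, 1.0, 0.2],
                      [0.1, 0.2, 1.0]])
print(toy_corrs > 0.5)
# [[ True  True False]
#  [ True  True False]
#  [False False  True]]
# The diagonal is always True, which is why the loop skips j == i.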
Returning to our dataset, here's what we get:
As you can see, the 'CIG_...' features are highly correlated, so we can drop most of them. Since we want to predict the survival chances of an infant as soon as possible, we will keep only 'CIG_1_TRI'. Also, as expected, the weight features are highly correlated, so we will keep only 'MOTHER_PRE_WEIGHT':
features_to_keep = [
    'INFANT_ALIVE_AT_REPORT',
    'BIRTH_PLACE',
    'MOTHER_AGE_YEARS',
    'FATHER_COMBINED_AGE',
    'CIG_1_TRI',
    'MOTHER_HEIGHT_IN',
    'MOTHER_PRE_WEIGHT',
    'DIABETES_PRE',
    'DIABETES_GEST',
    'HYP_TENS_PRE',
    'HYP_TENS_GEST',
    'PREV_BIRTH_PRETERM'
]
births_transformed = births_transformed.select([e for e in features_to_keep])
We cannot calculate correlations for the categorical features. However, we can run a Chi-square test to determine whether the distribution of each categorical feature differs significantly between infants who survived and those who did not.
Here's how you can do it using the .chiSqTest(...) method of MLlib:
import pyspark.mllib.linalg as ln

for cat in categorical_cols[1:]:
    agg = births_transformed \
        .groupby('INFANT_ALIVE_AT_REPORT') \
        .pivot(cat) \
        .count()

    agg_rdd = agg \
        .rdd \
        .map(lambda row: (row[1:])) \
        .flatMap(lambda row: [0 if e is None else e for e in row]) \
        .collect()

    row_length = len(agg.collect()[0]) - 1
    agg = ln.Matrices.dense(row_length, 2, agg_rdd)

    test = st.Statistics.chiSqTest(agg)
    print(cat, round(test.pValue, 4))
We loop through all the categorical variables and pivot them by the 'INFANT_ALIVE_AT_REPORT' feature to get the counts. Next, we transform them into an RDD, so we can then convert them into a matrix using the pyspark.mllib.linalg module. The first parameter to the .Matrices.dense(...) method specifies the number of rows in the matrix; in our case, it is the number of distinct values of the categorical feature. The second parameter specifies the number of columns: we have two, as our 'INFANT_ALIVE_AT_REPORT' target variable has only two values. The last parameter is a list of values to be transformed into a matrix.
Here's an example that shows this more clearly:
print(ln.Matrices.dense(3,2, [1,2,3,4,5,6]))
The preceding code produces a 3x2 matrix that is filled column by column: the first column holds the values 1, 2, and 3, and the second column holds 4, 5, and 6.
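Equivalently, NumPy's column-major ('F', Fortran-style) ordering reproduces the same layout:

import numpy as np

print(np.array([1, 2, 3, 4, 5, 6]).reshape(3, 2, order='F'))
# [[1 4]
#  [2 5]
#  [3 6]]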
Once we have our counts in matrix form, we can use the .chiSqTest(...) method to calculate our test.
Our tests reveal that all the features differ significantly between the two classes, so all of them should help us predict the chance of survival of an infant.
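To make reading these p-values concrete, here is a minimal, self-contained sketch with hypothetical counts (it assumes an active SparkContext, since MLlib runs the test on the JVM):

import pyspark.mllib.linalg as ln
import pyspark.mllib.stat as st

# Hypothetical 2x2 contingency table, filled column by column:
# column 0 = counts for one target class, column 1 = counts for the other.
counts = ln.Matrices.dense(2, 2, [15.0, 85.0, 40.0, 60.0])

test = st.Statistics.chiSqTest(counts)
# A p-value below 0.05 suggests the feature and the target are not independent.
print(test.pValue)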