Getting to know your data

To build a statistical model in an informed way, an intimate knowledge of the dataset is necessary. It is possible to build a successful model without knowing the data, but it is then a much more arduous task, and it requires more technical resources to test all the possible combinations of features. Therefore, after spending the required 80% of the time cleaning the data, we spend the next 15% getting to know it!

Descriptive statistics

I normally start with descriptive statistics. Even though DataFrames expose the .describe() method, since we are working with MLlib we will use the .colStats(...) method instead.

Note

A word of warning: .colStats(...) calculates the descriptive statistics based on a sample. For real-world datasets this should not really matter, but if your dataset has fewer than 100 observations you might get some strange results.

The method takes an RDD of data to calculate the descriptive statistics of, and returns a MultivariateStatisticalSummary object that contains the following descriptive statistics (a small standalone example follows the list):

  • count(): This holds a row count
  • max(): This holds the maximum value in the column
  • mean(): This holds the value of the mean for the values in the column
  • min(): This holds the minimum value in the column
  • normL1(): This holds the value of the L1-Norm for the values in the column
  • normL2(): This holds the value of the L2-Norm for the values in the column
  • numNonzeros(): This holds the number of nonzero values in the column
  • variance(): This holds the value of the variance for the values in the column

Note

You can read more about the L1- and L2-norms here http://bit.ly/2jJJPJ0
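
To make these accessors concrete, here is a minimal, self-contained sketch of .colStats(...) run on a tiny, made-up RDD. It assumes an already-running SparkContext available as sc, and the numbers are purely illustrative:

import pyspark.mllib.stat as st

# a toy RDD with two numeric columns
toy_rdd = sc.parallelize([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0]
])
summary = st.Statistics.colStats(toy_rdd)

print(summary.count())     # 3
print(summary.mean())      # [ 2. 20.]
print(summary.min())       # [ 1. 10.]
print(summary.max())       # [ 3. 30.]
print(summary.variance())  # [  1. 100.]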

We recommend checking the documentation of Spark to learn more about these. The following is a snippet that calculates the descriptive statistics of the numeric features:

import pyspark.mllib.stat as st
import numpy as np

numeric_cols = ['MOTHER_AGE_YEARS','FATHER_COMBINED_AGE',
                'CIG_BEFORE','CIG_1_TRI','CIG_2_TRI','CIG_3_TRI',
                'MOTHER_HEIGHT_IN','MOTHER_PRE_WEIGHT',
                'MOTHER_DELIVERY_WEIGHT','MOTHER_WEIGHT_GAIN'
               ]

# extract the numeric columns as an RDD of lists of values
numeric_rdd = births_transformed \
    .select(numeric_cols) \
    .rdd \
    .map(lambda row: [e for e in row])

mllib_stats = st.Statistics.colStats(numeric_rdd)

# print the mean and standard deviation of each column
for col, m, v in zip(numeric_cols, 
                     mllib_stats.mean(), 
                     mllib_stats.variance()):
    print('{0}: \t{1:.2f} \t {2:.2f}'.format(col, m, np.sqrt(v)))

The preceding code produces the following result:

[Output: a table with the mean and standard deviation of each numeric feature]

As you can see, mothers are younger than fathers: the average age of mothers was 28, versus over 44 for fathers. A good indication (at least for some of the infants) was that many mothers quit smoking while pregnant; it is horrifying, though, that some continued smoking.

For the categorical variables, we will calculate the frequencies of their values:

categorical_cols = [e for e in births_transformed.columns 
                    if e not in numeric_cols]

# extract the categorical columns as an RDD of lists of values
categorical_rdd = births_transformed \
    .select(categorical_cols) \
    .rdd \
    .map(lambda row: [e for e in row])

# count the frequency of each value, column by column
for i, col in enumerate(categorical_cols):
    agg = categorical_rdd \
        .groupBy(lambda row: row[i]) \
        .map(lambda row: (row[0], len(row[1])))

    print(col, sorted(agg.collect(), 
                      key=lambda el: el[1], 
                      reverse=True))

Here is what the results look like:

[Output: the value frequencies of each categorical feature]

Most of the deliveries happened in a hospital ('BIRTH_PLACE' equal to 1). Around 550 deliveries happened at home: some intentionally ('BIRTH_PLACE' equal to 3) and some not ('BIRTH_PLACE' equal to 4).
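
As a side note, if you prefer to stay within the DataFrame API, the same frequencies can be obtained with .groupBy(...).count(); here is a minimal sketch for a single column:

# an equivalent frequency count using the DataFrame API
births_transformed \
    .groupBy('BIRTH_PLACE') \
    .count() \
    .orderBy('count', ascending=False) \
    .show()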

Correlations

Correlations help to identify collinear numeric features so that we can handle them appropriately. Let's check the correlations between our features:

# the correlation matrix as a NumPy array
corrs = st.Statistics.corr(numeric_rdd)

# for each feature, list the other features it is
# correlated with (coefficient above 0.5)
for i, el in enumerate(corrs > 0.5):
    correlated = [
        (numeric_cols[j], corrs[i][j]) 
        for j, e in enumerate(el) 
        if e == 1.0 and j != i]

    if len(correlated) > 0:
        for e in correlated:
            print('{0}-to-{1}: {2:.2f}'
                  .format(numeric_cols[i], e[0], e[1]))

The preceding code calculates the correlation matrix and prints only those features that have a correlation coefficient greater than 0.5: the corrs > 0.5 part takes care of that.
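
To see what the corrs > 0.5 comparison does on its own, here it is applied to a small, hand-made NumPy array; the values are purely illustrative:

import numpy as np

corrs_demo = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0]
])

# element-wise comparison returns a boolean matrix
print(corrs_demo > 0.5)
# [[ True  True False]
#  [ True  True False]
#  [False False  True]]

Looping over this boolean matrix row by row is what allows the code above to pick out the pairs of correlated features.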

Here's what we get for the births data:

[Output: the pairs of features with a correlation coefficient above 0.5]

As you can see, the 'CIG_...' features are highly correlated, so we can drop most of them. Since we want to predict the survival chances of an infant as soon as possible, we will keep only 'CIG_1_TRI'. As expected, the weight features are also highly correlated, and we will keep only 'MOTHER_PRE_WEIGHT':

features_to_keep = [
    'INFANT_ALIVE_AT_REPORT', 
    'BIRTH_PLACE', 
    'MOTHER_AGE_YEARS', 
    'FATHER_COMBINED_AGE', 
    'CIG_1_TRI', 
    'MOTHER_HEIGHT_IN', 
    'MOTHER_PRE_WEIGHT', 
    'DIABETES_PRE', 
    'DIABETES_GEST', 
    'HYP_TENS_PRE', 
    'HYP_TENS_GEST', 
    'PREV_BIRTH_PRETERM'
]
births_transformed = births_transformed.select(features_to_keep)

Statistical testing

We cannot calculate correlations for the categorical features. However, we can run a Chi-square test to determine whether the counts of each categorical feature differ significantly between the two values of 'INFANT_ALIVE_AT_REPORT'.

Here's how you can do it using the .chiSqTest(...) method of MLlib:

import pyspark.mllib.linalg as ln

for cat in categorical_cols[1:]:
    # pivot the counts: one row per value of the target,
    # one column per value of the categorical feature
    agg = births_transformed \
        .groupby('INFANT_ALIVE_AT_REPORT') \
        .pivot(cat) \
        .count()

    # flatten the counts into a plain list of numbers,
    # replacing missing combinations with 0
    agg_rdd = agg \
        .rdd \
        .map(lambda row: (row[1:])) \
        .flatMap(lambda row: 
                 [0 if e is None else e for e in row]) \
        .collect()

    row_length = len(agg.collect()[0]) - 1
    agg = ln.Matrices.dense(row_length, 2, agg_rdd)

    test = st.Statistics.chiSqTest(agg)
    print(cat, round(test.pValue, 4))

We loop through all the categorical variables and pivot them by the 'INFANT_ALIVE_AT_REPORT' feature to get the counts. Next, we transform the counts into an RDD, so we can convert them into a matrix using the pyspark.mllib.linalg module. The first parameter to the .Matrices.dense(...) method specifies the number of rows in the matrix; in our case, it is the number of distinct values of the categorical feature.

The second parameter specifies the number of columns: we have two as our 'INFANT_ALIVE_AT_REPORT' target variable has only two values.

The last parameter is a list of values to be transformed into a matrix.

Here's an example that shows this more clearly:

print(ln.Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6]))

The preceding code produces the following matrix; note that the values fill it column by column (column-major order):

DenseMatrix([[ 1.,  4.],
             [ 2.,  5.],
             [ 3.,  6.]])

Once we have our counts in matrix form, we can use .chiSqTest(...) to run the test.
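
If you want to see .chiSqTest(...) in isolation, here is a minimal sketch run on a hand-built 2x2 contingency matrix; the counts are invented purely for illustration:

import pyspark.mllib.linalg as ln
import pyspark.mllib.stat as st

# rows: the two levels of a feature,
# columns: the two values of the target
observed = ln.Matrices.dense(2, 2, [100, 50, 40, 110])
test = st.Statistics.chiSqTest(observed)

print(test.statistic)         # the Chi-square test statistic
print(test.degreesOfFreedom)  # (rows - 1) * (cols - 1) = 1
print(test.pValue)            # p-value under the independence hypothesis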

Here's what our loop prints for the births data:

[Output: each categorical feature followed by the p-value of its Chi-square test]

Our tests reveal that all the features differ significantly between the two outcomes, so all of them should help us predict an infant's chance of survival.
