Selecting out or dropping missing data

One technique for handling missing data is simply to remove it from your dataset. A typical scenario is data that is sampled at regular intervals, where a device was offline and so no reading was recorded.

The pandas library makes this possible using several techniques. One is Boolean selection, using the results of .isnull() or .notnull() to retrieve the NaN or non-NaN values from a Series object. The following example demonstrates selection of all non-NaN values from the c4 column of a DataFrame:
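
A minimal sketch of this selection follows. The sample DataFrame (rows a to g, columns c1 to c5) is an assumed stand-in, since the original data is not shown here; only the .notnull() call and the Boolean indexing are the point.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Assumed sample data (not from the original text):
# row g and column c5 are entirely NaN, c4 has values only in rows a and f
df = pd.DataFrame(
    {
        "c1": [0, 3, 6, 9, 12, 15, nan],
        "c2": [1, 4, 7, 10, 13, 16, nan],
        "c3": [2, 5, 8, 11, 14, 17, nan],
        "c4": [20, nan, nan, nan, nan, 18, nan],
        "c5": [nan] * 7,
    },
    index=list("abcdefg"),
)

# Boolean mask: True where c4 holds a value, False where it is NaN
has_value = df["c4"].notnull()

# Keep only the items of c4 where the mask is True
non_nan_c4 = df["c4"][has_value]
print(non_nan_c4)
```

Using .isnull() in place of .notnull() would invert the mask and select the NaN items instead.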

Pandas also provides a convenience function, .dropna(), which drops the items in a Series where the value is NaN:
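
As a sketch, and again assuming a hypothetical sample DataFrame in place of the original data, the same result can be obtained with a single call:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Assumed sample data (not from the original text)
df = pd.DataFrame(
    {
        "c1": [0, 3, 6, 9, 12, 15, nan],
        "c2": [1, 4, 7, 10, 13, 16, nan],
        "c3": [2, 5, 8, 11, 14, 17, nan],
        "c4": [20, nan, nan, nan, nan, 18, nan],
        "c5": [nan] * 7,
    },
    index=list("abcdefg"),
)

# Drop every item of the c4 Series whose value is NaN
c4_values = df["c4"].dropna()
print(c4_values)
```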

Note that .dropna() has actually returned a copy with the NaN items removed. The original DataFrame is not changed:
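
One way to check this, sketched on an assumed sample DataFrame, is to compare the column before and after the call:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Assumed sample data (not from the original text)
df = pd.DataFrame(
    {
        "c1": [0, 3, 6, 9, 12, 15, nan],
        "c2": [1, 4, 7, 10, 13, 16, nan],
        "c3": [2, 5, 8, 11, 14, 17, nan],
        "c4": [20, nan, nan, nan, nan, 18, nan],
        "c5": [nan] * 7,
    },
    index=list("abcdefg"),
)

# The result of .dropna() is a new, shorter Series
c4_values = df["c4"].dropna()

# The column in the original DataFrame still holds all seven items,
# including the NaN values
original_length = len(df["c4"])
original_nan_count = int(df["c4"].isnull().sum())
print(len(c4_values), original_length, original_nan_count)
```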

When .dropna() is applied to a DataFrame object, it drops all rows from the DataFrame object that have at least one NaN value. The following code demonstrates this in action, and since each row has at least one NaN value, there are zero rows in the result:
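
The sketch below illustrates this behavior. The DataFrame is an assumed example, built so that column c5 is entirely NaN and every row therefore contains at least one NaN value.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Assumed sample data (not from the original text); c5 is all NaN,
# so every row has at least one missing value
df = pd.DataFrame(
    {
        "c1": [0, 3, 6, 9, 12, 15, nan],
        "c2": [1, 4, 7, 10, 13, 16, nan],
        "c3": [2, 5, 8, 11, 14, 17, nan],
        "c4": [20, nan, nan, nan, nan, 18, nan],
        "c5": [nan] * 7,
    },
    index=list("abcdefg"),
)

# Default behavior: drop any row containing at least one NaN
no_nan_rows = df.dropna()
print(no_nan_rows.shape)
```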

If you want to drop only those rows where all values are NaN, you can use the how='all' parameter. The following sample drops only the g row, since it has all NaN values:
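
A short sketch of how='all', using an assumed sample DataFrame in which only row g consists entirely of NaN values:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Assumed sample data (not from the original text); only row g is all NaN
df = pd.DataFrame(
    {
        "c1": [0, 3, 6, 9, 12, 15, nan],
        "c2": [1, 4, 7, 10, 13, 16, nan],
        "c3": [2, 5, 8, 11, 14, 17, nan],
        "c4": [20, nan, nan, nan, nan, 18, nan],
        "c5": [nan] * 7,
    },
    index=list("abcdefg"),
)

# Drop a row only when every value in it is NaN
rows_kept = df.dropna(how="all")
print(rows_kept.index.tolist())
```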

This can also be applied to the columns instead of the rows, by changing the axis parameter to axis=1. The following drops the c5 column, as it is the only one with all NaN values:
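
Sketched on an assumed sample DataFrame in which c5 is the only column made up entirely of NaN values, the column-wise version looks like this:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Assumed sample data (not from the original text); only column c5 is all NaN
df = pd.DataFrame(
    {
        "c1": [0, 3, 6, 9, 12, 15, nan],
        "c2": [1, 4, 7, 10, 13, 16, nan],
        "c3": [2, 5, 8, 11, 14, 17, nan],
        "c4": [20, nan, nan, nan, nan, 18, nan],
        "c5": [nan] * 7,
    },
    index=list("abcdefg"),
)

# axis=1 applies the test to columns; how="all" requires every value to be NaN
cols_kept = df.dropna(how="all", axis=1)
print(cols_kept.columns.tolist())
```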

Now let's examine this process using a slightly different DataFrame object, in which columns c1 and c3 contain no NaN values. In this case, dropping every column that has any NaN value removes all columns except c1 and c3:
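
The following sketch builds such a variant from an assumed sample DataFrame by filling row g in columns c1 and c3. The name df2 and the fill value 0 are illustrative choices, not taken from the original.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Assumed sample data (not from the original text)
df = pd.DataFrame(
    {
        "c1": [0, 3, 6, 9, 12, 15, nan],
        "c2": [1, 4, 7, 10, 13, 16, nan],
        "c3": [2, 5, 8, 11, 14, 17, nan],
        "c4": [20, nan, nan, nan, nan, 18, nan],
        "c5": [nan] * 7,
    },
    index=list("abcdefg"),
)

# Hypothetical variant: fill row g of c1 and c3 so those columns have no NaN
df2 = df.copy()
df2.loc["g", ["c1", "c3"]] = 0

# Drop every column that contains at least one NaN
complete_cols = df2.dropna(how="any", axis=1)
print(complete_cols.columns.tolist())
```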


The .dropna() method also has a thresh parameter, which, when given an integer value, specifies the minimum number of non-NaN values that a row or column must contain in order to be kept. The following code uses thresh=5 with axis=1 to drop all the columns with fewer than five non-NaN values (in this case, these are the c4 and c5 columns):
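
A minimal sketch of thresh, assuming a sample DataFrame in which c1 to c3 each hold six non-NaN values, c4 holds two, and c5 holds none:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Assumed sample data (not from the original text)
df = pd.DataFrame(
    {
        "c1": [0, 3, 6, 9, 12, 15, nan],
        "c2": [1, 4, 7, 10, 13, 16, nan],
        "c3": [2, 5, 8, 11, 14, 17, nan],
        "c4": [20, nan, nan, nan, nan, 18, nan],
        "c5": [nan] * 7,
    },
    index=list("abcdefg"),
)

# Count of non-NaN values per column, for comparison with the threshold
non_nan_counts = df.notnull().sum()

# Keep only columns that have at least five non-NaN values
dense_cols = df.dropna(thresh=5, axis=1)
print(non_nan_counts.to_dict())
print(dense_cols.columns.tolist())
```

Because thresh counts the values that are present rather than the values that are missing, the same threshold can keep or drop a column depending on how many rows the DataFrame has.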

Again, note that the .dropna() method (and the Boolean selection) returns a copy of the DataFrame object, and the data is dropped from that copy. If you want to drop the data in the actual DataFrame, pass the inplace=True parameter to .dropna().
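
As a sketch on an assumed sample DataFrame, an in-place drop of the all-NaN rows might look like this:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Assumed sample data (not from the original text); only row g is all NaN
df = pd.DataFrame(
    {
        "c1": [0, 3, 6, 9, 12, 15, nan],
        "c2": [1, 4, 7, 10, 13, 16, nan],
        "c3": [2, 5, 8, 11, 14, 17, nan],
        "c4": [20, nan, nan, nan, nan, 18, nan],
        "c5": [nan] * 7,
    },
    index=list("abcdefg"),
)

rows_before = len(df)

# inplace=True modifies df itself, so the result is not assigned to a new name
df.dropna(how="all", inplace=True)

rows_after = len(df)
print(rows_before, rows_after)
```

Boolean selection has no in-place form; to get the same effect there, assign the selection back to the original name.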

