In this chapter, we will look at some code that analyzes the survey Kaggle ran in 2018, which asked Kaggle users about socio-economic information.
This section presents the survey data along with some code to analyze it. Kaggle's subtitle for the dataset is "the most comprehensive dataset available on the state of machine learning and data science". Let's dig into the data and see what it contains. It was originally available at https://www.kaggle.com/kaggle/kaggle-survey-2018.
>>> import pandas as pd
>>> import numpy as np
>>> import zipfile
>>> url = 'data/kaggle-survey-2018.zip'
>>> with zipfile.ZipFile(url) as z:
... print(z.namelist())
... kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
... df = kag.iloc[1:]
['multipleChoiceResponses.csv', 'freeFormResponses.csv', 'SurveySchema.csv']
>>> df.T
1 2 3 ... 23857
Time from... 710 434 718 ... 370
Q1 Female Male Female ... Male
Q1_OTHER_... -1 -1 -1 ... -1
Q2 45-49 30-34 30-34 ... 22-24
Q3 United S... Indonesia United S... ... Turkey
... ... ... ... ... ...
Q50_Part_5 NaN NaN NaN ... NaN
Q50_Part_6 NaN NaN NaN ... NaN
Q50_Part_7 NaN NaN NaN ... NaN
Q50_Part_8 NaN NaN NaN ... NaN
Q50_OTHER... -1 -1 -1 ... -1
>>> df.dtypes
Time from Start to Finish (seconds) object
Q1 object
Q1_OTHER_TEXT object
Q2 object
Q3 object
...
Q50_Part_5 object
Q50_Part_6 object
Q50_Part_7 object
Q50_Part_8 object
Q50_OTHER_TEXT object
Length: 395, dtype: object
Every column has come in as an object type. We could go through our standard process of exploring these values using the .value_counts method:
>>> df.Q1.value_counts(dropna=False)
Male 19430
Female 4010
Prefer not to say 340
Prefer to self-describe 79
Name: Q1, dtype: int64
We can use the .rename method to give each column a better name. Some of the columns, such as Q2, Q8, and Q9, have range answers. In the case of age (Q2), you have values like 55-59 and 60-69. I use the .str.slice method to pull out the first two characters and convert the type from string to integer. For the education column (Q4), I convert the values to ordinal numbers. Finally, after I have converted many of the columns to numbers and cleaned up some of the others, I put all of the Series back into a DataFrame with pd.concat.
I put all of this code into a function, tweak_kag:
>>> def tweak_kag(df):
... na_mask = df.Q9.isna()
... hide_mask = df.Q9.str.startswith('I do not').fillna(False)
... df = df[~na_mask & ~hide_mask]
...
... q1 = (df.Q1
... .replace({'Prefer not to say': 'Another',
... 'Prefer to self-describe': 'Another'})
... .rename('Gender')
... )
... q2 = df.Q2.str.slice(0,2).astype(int).rename('Age')
... def limit_countries(val):
... if val in {'United States of America', 'India', 'China'}:
... return val
... return 'Another'
... q3 = df.Q3.apply(limit_countries).rename('Country')
...
... q4 = (df.Q4
... .replace({"Master's degree": 18,
... "Bachelor's degree": 16,
... 'Doctoral degree': 20,
... "Some college/university study without earning a bachelor's degree": 13,
... 'Professional degree': 19,
... 'I prefer not to answer': None,
... 'No formal education past high school': 12})
... .fillna(11)
... .rename('Edu')
... )
...
... def only_cs_stat_val(val):
... if val not in {'cs', 'eng', 'stat'}:
... return 'another'
... return val
...
... q5 = (df.Q5
... .replace({
... 'Computer science (software engineering, etc.)': 'cs',
... 'Engineering (non-computer focused)': 'eng',
... 'Mathematics or statistics': 'stat'})
... .apply(only_cs_stat_val)
... .rename('Studies'))
... def limit_occupation(val):
... if val in {'Student', 'Data Scientist', 'Software Engineer', 'Not employed',
... 'Data Engineer'}:
... return val
... return 'Another'
...
... q6 = df.Q6.apply(limit_occupation).rename('Occupation')
...
... q8 = (df.Q8
... .str.replace('+', '')
... .str.split('-', expand=True)
... .iloc[:,0]
... .fillna(-1)
... .astype(int)
... .rename('Experience')
... )
...
... q9 = (df.Q9
... .str.replace('+','')
... .str.replace(',','')
... .str.replace('500000', '500')
... .str.replace('I do not wish to disclose my approximate yearly compensation','')
... .str.split('-', expand=True)
... .iloc[:,0]
... .astype(int)
... .mul(1000)
... .rename('Salary'))
... return pd.concat([q1, q2, q3, q4, q5, q6, q8, q9], axis=1)
>>> tweak_kag(df)
Gender Age Country ... Occupation Experience
2 Male 30 Another ... Another 5
3 Female 30 United S... ... Data Sci... 0
5 Male 22 India ... Another 0
7 Male 35 Another ... Another 10
8 Male 18 India ... Another 0
... ... ... ... ... ... ...
23844 Male 30 Another ... Software... 10
23845 Male 22 Another ... Student 0
23854 Male 30 Another ... Another 5
23855 Male 45 Another ... Another 5
23857 Male 22 Another ... Software... 0
>>> tweak_kag(df).dtypes
Gender object
Age int64
Country object
Edu float64
Studies object
Occupation object
Experience int64
Salary int64
dtype: object
The survey data is rich with information, but it is a little hard to analyze because all of the columns come in as objects. Our tweak_kag function filters out respondents who did not provide salary information. We also convert a few of the columns (Age, Edu, Experience, and Salary) to numeric values for easier quantification. The remaining categorical columns are pruned down to a lower cardinality.
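One quick way to confirm the lower cardinality is to count the unique values in the remaining object columns (output omitted, but each column should show only a handful of distinct values):
>>> kag = tweak_kag(df)
>>> kag.select_dtypes(object).nunique()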
Cleaning up our data makes it easier to analyze. For example, we can easily group by country and correlate salary and experience:
>>> kag = tweak_kag(df)
>>> (kag
... .groupby('Country')
... .apply(lambda g: g.Salary.corr(g.Experience))
... )
Country
Another 0.289827
China 0.252974
India 0.167335
United States of America 0.354125
dtype: float64
The .apply method on a Series or DataFrame is one of the slowest operations in pandas. In this recipe, we will explore its speed and dig into what is going on.
First, let's time how long the .apply method takes using the %%timeit cell magic in Jupyter. This is the code from the tweak_kag function that limits the cardinality of the country column (Q3):
>>> %%timeit
>>> def limit_countries(val):
... if val in {'United States of America', 'India', 'China'}:
... return val
... return 'Another'
>>> q3 = df.Q3.apply(limit_countries).rename('Country')
6.42 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now let's use the .replace method instead of .apply and see if that improves performance:
>>> %%timeit
>>> other_values = df.Q3.value_counts().iloc[3:].index
>>> q3_2 = df.Q3.replace(other_values, 'Another')
27.7 ms ± 535 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That was slower than the .apply method! Let's try again. If we recreate this code using the .isin method combined with .where, it runs nearly twice as fast as .apply:
>>> %%timeit
>>> values = {'United States of America', 'India', 'China'}
>>> q3_3 = df.Q3.where(df.Q3.isin(values), 'Another')
3.39 ms ± 570 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Finally, let's try the np.where function. This is not part of pandas, but pandas often works well with NumPy functions:
>>> %%timeit
>>> values = {'United States of America', 'India', 'China'}
>>> q3_4 = pd.Series(np.where(df.Q3.isin(values), df.Q3, 'Another'),
... index=df.index)
2.75 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We can confirm that all of these approaches produce the same result:
>>> q3.equals(q3_2)
True
>>> q3.equals(q3_3)
True
>>> q3.equals(q3_4)
True
This recipe benchmarked the .apply
, .replace
, and .where
methods. Of those three, the .where
method was the quickest. Finally, it showed the NumPy where
function, which is even faster than pandas. However, if we use the NumPy function, we need to convert the result back into a series (and give it the same index as the original DataFrame).
The documentation for the .apply
method states that if you pass in a NumPy function, it will run a fast path and pass the whole series to the function. However, if you pass in a Python function, that function will be called for each value in the Series. This can be confusing because the method behaves differently depending on the parameter that is passed into it.
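Here is a minimal illustration on a throwaway Series (using the pd and np aliases imported earlier): a ufunc such as np.log takes the fast path and receives the whole Series, while a plain Python function is invoked once per element:
>>> s = pd.Series([1.0, 2.0, 3.0])
>>> s.apply(np.log)            # ufunc: called once with the whole Series
0    0.000000
1    0.693147
2    1.098612
dtype: float64
>>> s.apply(lambda x: x + 1)   # Python function: called once per value
0    2.0
1    3.0
2    4.0
dtype: float64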
If you find yourself in a situation where you are passing in a function to .apply
(or have done a groupby operation and are calling .agg
, .transform
, or some other method that takes a function as a parameter) and cannot remember what arguments will be passed into the function, you can use the following code to help. (Of course, you can also look at the documentation or even look at the code for .apply
):
>>> def limit_countries(val):
... if val in {'United States of America', 'India', 'China'}:
... return val
... return 'Another'
>>> q3 = df.Q3.apply(limit_countries).rename('Country')
>>> def debug(something):
... # what is something? A cell, series, dataframe?
... print(type(something), something)
... 1/0
>>> q3.apply(debug)
<class 'str'> United States of America
Traceback (most recent call last)
...
ZeroDivisionError: division by zero
The output shows that a string (a scalar value from the series q3
) was passed into the debug
function.
If you do not want to throw an exception, you can set a global variable to hold the parameter passed into the function:
>>> the_item = None
>>> def debug(something):
... global the_item
... the_item = something
... return something
>>> _ = q3.apply(debug)
>>> the_item
'Another'
One thing to keep in mind is that the function we pass into the .apply
method is called once per item in the Series. Operating on single items is a slow path, and we should try to avoid it if possible. The next recipe will show another option for speeding calls to .apply
.
Sometimes .apply is the most convenient way to express an operation, but as we have seen, it can be slow. Various libraries enable parallelizing or otherwise speeding up such operations. The easiest option is to leverage vectorization. Math operations in pandas are vectorized: if you add a number (say 5) to a numeric Series, pandas does not loop over each value in Python. Rather, it hands the whole column to optimized native code that can take advantage of modern CPU features.
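A rough sketch of the difference (timings omitted since they vary by machine, but the vectorized version is typically orders of magnitude faster on a column of this size):
>>> ages = tweak_kag(df).Age
>>> %%timeit
>>> ages + 5                     # vectorized: one operation over the whole column
>>> %%timeit
>>> ages.apply(lambda x: x + 5)  # Python function called once per value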
If you cannot vectorize, as is the case with our limit_countries
function, you have other options. This section will show a few of them.
Note that you will need to install these libraries as they are not included with pandas.
The examples show limiting the values in the country column of the survey data to a few values. First, import the Pandarallel library and initialize it:
>>> from pandarallel import pandarallel
>>> pandarallel.initialize()
Now call the .parallel_apply method:
>>> def limit_countries(val):
... if val in {'United States of America', 'India', 'China'}:
... return val
... return 'Another'
>>> %%timeit
>>> res_p = df.Q3.parallel_apply(limit_countries).rename('Country')
133 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Next, let's try the swifter library. Import it:
>>> import swifter
Swifter adds a .swifter accessor to Series and DataFrames. Time an apply that goes through the accessor:
>>> %%timeit
>>> res_s = df.Q3.swifter.apply(limit_countries).rename('Country')
187 ms ± 31.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Next, let's try the Dask library:
>>> import dask.dataframe
Use the .map_partitions function to run the apply on each partition of a Dask DataFrame:
>>> %%timeit
>>> res_d = (dask.dataframe.from_pandas(
... df, npartitions=4)
... .map_partitions(lambda df: df.Q3.apply(limit_countries))
... .rename('Countries')
... )
29.1 s ± 1.75 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
We can also try the np.vectorize function to create a vectorized version of our Python function:
>>> np_fn = np.vectorize(limit_countries)
>>> %%timeit
>>> res_v = df.Q3.apply(np_fn).rename('Country')
643 ms ± 86.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Finally, let's try the numba library. Import it and decorate the function with the jit decorator:
>>> from numba import jit
>>> @jit
... def limit_countries2(val):
... if val in ['United States of America', 'India', 'China']:
... return val
... return 'Another'
Now time the Numba-compiled function:
>>> %%timeit
>>> res_n = df.Q3.apply(limit_countries2).rename('Country')
158 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that there is overhead to parallelizing code. In the examples above, all of the code ran faster in serial with normal pandas. There is a crossover point beyond which paying that overhead makes sense; the Pandarallel examples use at least a million samples. Our dataset is much smaller than that, so the vanilla .apply method is faster in our case.
In steps 1 and 2 we use the Pandarallel library. This library leverages the multiprocessing module from the standard library to try to run computations in parallel. When you initialize the library, you can specify an nb_workers parameter that indicates how many CPUs to use (by default it will use all of them). The example shows how to use the .parallel_apply method, which is analogous to the .apply method in pandas. The library also works with groupby objects and Series objects.
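For example, to limit the pool to four worker processes rather than all CPUs (a sketch; nb_workers is the parameter described above):
>>> from pandarallel import pandarallel
>>> pandarallel.initialize(nb_workers=4)
>>> res_p4 = df.Q3.parallel_apply(limit_countries).rename('Country')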
Steps 3 and 4 show the swifter library. This library adds a .swifter attribute to DataFrames and Series. It takes a different approach to speeding up code: it first checks whether the operation can be vectorized. If not, it estimates how long pandas will take (by running on a small sample) and then decides whether to hand the work to the Dask library or to just stick with pandas. Again, the logic to decide which path to use has its own overhead, so blindly using this library might not lead to the most efficient code.
The Swifter website has a notebook comparing Swifter, np.vectorize, Dask, and pandas, with extensive benchmarking on different types of functions. For what it calls non-vectorized functions (which our limit_countries is, since it contains ordinary Python logic), it is not until you get to almost a million rows that the vanilla pandas .apply method starts to lose out.
Steps 5 and 6 present the Dask library. Note that there is some overhead in loading the data into Dask and leveraging the parallelization the library affords. Many users of Dask forgo pandas completely and just use Dask, as it implements similar functionality but allows processing to scale out to big data (and to run on a cluster).
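Also note that Dask defers execution of .map_partitions until you ask for the result. A sketch of materializing the output as a pandas Series with .compute():
>>> import dask.dataframe
>>> ddf = dask.dataframe.from_pandas(df, npartitions=4)
>>> res_d = (ddf
...     .map_partitions(lambda part: part.Q3.apply(limit_countries))
...     .compute()     # execute the task graph and return a pandas Series
...     .rename('Country')
... )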
In step 7 we try the vectorize function from NumPy. It creates a ufunc-like function (one that operates on NumPy arrays) from an arbitrary Python function and tries to leverage NumPy broadcasting rules. In this case, there is no performance increase from using it.
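Note that np.vectorize is normally called directly on an array rather than passed to .apply (where it would still be invoked one element at a time). A sketch of the more typical usage:
>>> np_fn = np.vectorize(limit_countries)
>>> res_v2 = pd.Series(np_fn(df.Q3.to_numpy()), index=df.index, name='Country')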
Steps 8 and 9 demonstrate the Numba library. We use the jit decorator to create a new function, limit_countries2, which Numba compiles to native code. Again, this particular function is not amenable to a speed-up from the decorator.
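Numba shines on numeric code that loops over NumPy arrays rather than on string-handling logic. A hypothetical sketch (not part of the survey analysis) of the kind of function it compiles effectively:
>>> from numba import jit
>>> @jit(nopython=True)
... def clip_salary(arr, low, high):
...     out = arr.copy()
...     for i in range(len(out)):
...         if out[i] < low:
...             out[i] = low
...         elif out[i] > high:
...             out[i] = high
...     return out
>>> capped = clip_salary(tweak_kag(df).Salary.to_numpy(), 0, 200_000)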
Many of the options illustrated here may provide a performance boost with larger datasets. In our case, blindly applying them would slow down the code.
The Jupyter environment has built-in features for quickly pulling up the documentation or the source code of a class, method, or function. I strongly encourage you to get used to using them. If you can stay in the Jupyter environment to answer questions as they come up, you will increase your productivity.
In this section, we will show how to look at the source code for the .apply
method. It is easiest to look at the documentation for a DataFrame or Series method directly on the DataFrame or Series object. Throughout this book, we have heavily recommended chaining operations on pandas objects. Sadly, Jupyter (like most editor environments) cannot perform code completion or look up documentation on the intermediate object returned from a chained method call. Hence the recommendation to perform the lookup directly on a method that is not chained.
>>> import pandas as pd
>>> import zipfile
>>> url = 'data/kaggle-survey-2018.zip'
>>> with zipfile.ZipFile(url) as z:
... kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
... df = kag.iloc[1:]
Let's look at the documentation for .apply using the Jupyter ? feature. (We could also press Shift + Tab four times in Jupyter to get this):
>>> df.Q3.apply?
Signature: df.Q3.apply(func, convert_dtype=True, args=(), **kwds)
Docstring:
Invoke function on values of Series.
Can be ufunc (a NumPy function that applies to the entire Series)
or a Python function that only works on single values.
Parameters
----------
func : function
Python function or NumPy ufunc to apply.
convert_dtype : bool, default True
Try to find better dtype for elementwise function results. If
False, leave as dtype=object.
args : tuple
Positional arguments passed to func after the series value.
**kwds
Additional keyword arguments passed to func.
Returns
-------
Series or DataFrame
If func returns a Series object the result will be a DataFrame.
See Also
--------
Series.map: For element-wise operations.
Series.agg: Only perform aggregating type operations.
Series.transform: Only perform transforming type operations.
Examples
--------
...
File: ~/.env/364/lib/python3.6/site-packages/pandas/core/series.py
Type: method
To view the source code of the method, use two question marks, ??. (There is no Shift + Tab keyboard shortcut to get the source):
>>> df.Q3.apply??
Signature: df.Q3.apply(func, convert_dtype=True, args=(), **kwds)
Source:
def apply(self, func, convert_dtype=True, args=(), **kwds):
...
if len(self) == 0:
return self._constructor(dtype=self.dtype, index=self.index).__finalize__(
self
)
# dispatch to agg
if isinstance(func, (list, dict)):
return self.aggregate(func, *args, **kwds)
# if we are a string, try to dispatch
if isinstance(func, str):
return self._try_aggregate_string_function(func, *args, **kwds)
# handle ufuncs and lambdas
if kwds or args and not isinstance(func, np.ufunc):
def f(x):
return func(x, *args, **kwds)
else:
f = func
with np.errstate(all="ignore"):
if isinstance(f, np.ufunc):
return f(self)
# row-wise access
if is_extension_type(self.dtype):
mapped = self._values.map(f)
else:
values = self.astype(object).values
mapped = lib.map_infer(values, f, convert=convert_dtype)
if len(mapped) and isinstance(mapped[0], Series):
# GH 25959 use pd.array instead of tolist
# so extension arrays can be used
return self._constructor_expanddim(pd.array(mapped), index=self.index)
else:
return self._constructor(mapped, index=self.index).__finalize__(self)
File: ~/.env/364/lib/python3.6/site-packages/pandas/core/series.py
Type: method
The interesting work happens when the mapped variable is computed. Let's try to figure out what lib.map_infer does:
>>> import pandas.core.series
>>> pandas.core.series.lib
<module 'pandas._libs.lib' from '.env/364/lib/python3.6/site-packages/pandas/_libs/lib.cpython-36m-darwin.so'>
>>> pandas.core.series.lib.map_infer??
Docstring:
Substitute for np.vectorize with pandas-friendly dtype inference
Parameters
----------
arr : ndarray
f : function
Returns
-------
mapped : ndarray
Type: builtin_function_or_method
Jupyter has the ability to inspect both the docstrings and the source code for Python objects. The standard Python REPL can leverage the built-in help
function to view a docstring, but it cannot display the source code.
Jupyter, however, has some tricks up its sleeve. If you tack a single question mark (?) onto a function or method, it will show the documentation for that code. Note that this is not valid Python syntax; it is a feature of Jupyter. If you add two question marks (??), Jupyter will display the source code of the function or method.
This recipe showed tracing through the source code to see how the .apply
method in pandas works under the covers.
In the source from step 3, we can see a shortcut for the case where the Series is empty. We can also see how string functions (that is, passing in a string literal such as 'mean') work: the string is dispatched to an aggregation helper that uses getattr to pull the corresponding method off the object.
Next, the code checks if it is dealing with a NumPy function. Eventually, it will call the function if it is an instance of np.ufunc
, or it will call the .map
method on the underlying ._values
attribute, or it will call lib.map_infer
.
In step 4, we tried to inspect lib.map_infer
but saw that it was an so
file (pyd
on Windows). This is a compiled file that is usually the result of writing Python in C or using Cython. Jupyter cannot show us the source of compiled files.
When you view the source code for a function or method, Jupyter displays the file that it belongs to at the bottom of the pane. If I really need to dig into the source code, I will open that file in an editor outside of Jupyter. Then I can browse through that code and any related code with my editor (most editors have better code navigation capabilities than Jupyter).
The previous recipes have shown how to understand pandas code and inspect it from Jupyter. In this section, we will look at using the IPython debugger (ipdb) in Jupyter.
In this section, I will create a function that throws an error when I try to use it with the series .apply
method. I will use ipdb to debug it.
>>> import pandas as pd
>>> import zipfile
>>> url = 'data/kaggle-survey-2018.zip'
>>> with zipfile.ZipFile(url) as z:
... kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
... df = kag.iloc[1:]
>>> def add1(x):
... return x + 1
>>> df.Q3.apply(add1)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-6ce28d2fea57> in <module>
2 return x + 1
3
----> 4 df.Q3.apply(add1)
~/.env/364/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
4043 else:
4044 values = self.astype(object).values
-> 4045 mapped = lib.map_infer(values, f, convert=convert_dtype)
4046
4047 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-9-6ce28d2fea57> in add1(x)
1 def add1(x):
----> 2 return x + 1
3
4 df.Q3.apply(add1)
TypeError: must be str, not int
You can use the %debug magic immediately following an exception to drop into a debugger. (This might seem a little backward because you run it after the cell that raised the exception.) It opens the debugger at the point where the exception was thrown. You can then use the debugger commands to navigate the stack: the u command pops up the stack to the function that called the current line, and you can inspect objects using the print command (p):
(Figure: a Jupyter debugging session)
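A sketch of what such a session might look like (the exact prompts and frames depend on your environment):
>>> %debug
ipdb> p x      # print the local variable x in the failing frame
ipdb> u        # move up the stack to the calling frame
ipdb> q        # quit the debugger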
Alternatively, you can call the set_trace function from the IPython debugger inside your code. This will drop you into the debugger immediately after that line runs:
>>> from IPython.core.debugger import set_trace
>>> def add1(x):
... set_trace()
... return x + 1
>>> df.Q3.apply(add1)
(Figure: a Jupyter debugging session)
Jupyter (which is derived from IPython) ships with the IPython debugger. This replicates the functionality of the pdb
module in the standard library, but with niceties such as syntax highlighting. (It also has tab completion, but this does not work in Jupyter, only in the IPython console).
If you are unfamiliar with using the debugger, here is a lifejacket for you: The command h
will print out all of the commands that you can run from the debugger:
ipdb> h
Documented commands (type help <topic>):
========================================
EOF cl disable interact next psource rv unt
a clear display j p q s until
alias commands down jump pdef quit source up
args condition enable l pdoc r step w
b cont exit list pfile restart tbreak whatis
break continue h ll pinfo return u where
bt d help longlist pinfo2 retval unalias
c debug ignore n pp run undisplay
The most common commands that I use are s
, n
, l
, u
, d
, and c
. If you want to know what s
does, then type:
ipdb> h s
s(tep)
Execute the current line, stop at the first possible occasion
(either in a function that is called or in the current
function).
This tells the debugger to print the help
(h
) documentation for step
(s
). Because we are usually coding in small steps in Jupyter, a debugger is often overkill. But knowing how to use it can come in handy, especially if you want to jump into pandas source code and understand what is going on.
Great Expectations is a third-party tool that allows you to capture and define the properties of a dataset. You can save these properties and then use them to validate future data to ensure data integrity. This can be very useful when building machine learning models, as new categorical data values and numeric outliers tend to cause a model to perform poorly or error out.
In this section, we will look at the Kaggle dataset and make an expectation suite to test and validate the data.
We will use the tweak_kag function defined previously to clean up the survey data:
>>> kag = tweak_kag(df)
Next, use the from_pandas function to create a Great Expectations DataFrame (a subclass of a pandas DataFrame with some extra methods):
>>> import great_expectations as ge
>>> kag_ge = ge.from_pandas(kag)
Let's list the methods that the Great Expectations subclass adds:
>>> sorted([x for x in set(dir(kag_ge)) - set(dir(kag))
... if not x.startswith('_')])
['autoinspect',
'batch_fingerprint',
'batch_id',
'batch_kwargs',
'column_aggregate_expectation',
'column_map_expectation',
'column_pair_map_expectation',
'discard_failing_expectations',
'edit_expectation_suite',
'expect_column_bootstrapped_ks_test_p_value_to_be_greater_than',
'expect_column_chisquare_test_p_value_to_be_greater_than',
'expect_column_distinct_values_to_be_in_set',
'expect_column_distinct_values_to_contain_set',
'expect_column_distinct_values_to_equal_set',
'expect_column_kl_divergence_to_be_less_than',
'expect_column_max_to_be_between',
'expect_column_mean_to_be_between',
'expect_column_median_to_be_between',
'expect_column_min_to_be_between',
'expect_column_most_common_value_to_be_in_set',
'expect_column_pair_values_A_to_be_greater_than_B',
'expect_column_pair_values_to_be_equal',
'expect_column_pair_values_to_be_in_set',
'expect_column_parameterized_distribution_ks_test_p_value_to_be_greater_than',
'expect_column_proportion_of_unique_values_to_be_between',
'expect_column_quantile_values_to_be_between',
'expect_column_stdev_to_be_between',
'expect_column_sum_to_be_between',
'expect_column_to_exist',
'expect_column_unique_value_count_to_be_between',
'expect_column_value_lengths_to_be_between',
'expect_column_value_lengths_to_equal',
'expect_column_values_to_be_between',
'expect_column_values_to_be_dateutil_parseable',
'expect_column_values_to_be_decreasing',
'expect_column_values_to_be_in_set',
'expect_column_values_to_be_in_type_list',
'expect_column_values_to_be_increasing',
'expect_column_values_to_be_json_parseable',
'expect_column_values_to_be_null',
'expect_column_values_to_be_of_type',
'expect_column_values_to_be_unique',
'expect_column_values_to_match_json_schema',
'expect_column_values_to_match_regex',
'expect_column_values_to_match_regex_list',
'expect_column_values_to_match_strftime_format',
'expect_column_values_to_not_be_in_set',
'expect_column_values_to_not_be_null',
'expect_column_values_to_not_match_regex',
'expect_column_values_to_not_match_regex_list',
'expect_multicolumn_values_to_be_unique',
'expect_table_column_count_to_be_between',
'expect_table_column_count_to_equal',
'expect_table_columns_to_match_ordered_list',
'expect_table_row_count_to_be_between',
'expect_table_row_count_to_equal',
'expectation',
'find_expectation_indexes',
'find_expectations',
'from_dataset',
'get_column_count',
'get_column_count_in_range',
'get_column_hist',
'get_column_max',
'get_column_mean',
'get_column_median',
'get_column_min',
'get_column_modes',
'get_column_nonnull_count',
'get_column_partition',
'get_column_quantiles',
'get_column_stdev',
'get_column_sum',
'get_column_unique_count',
'get_column_value_counts',
'get_config_value',
'get_data_asset_name',
'get_default_expectation_arguments',
'get_evaluation_parameter',
'get_expectation_suite',
'get_expectation_suite_name',
'get_expectations_config',
'get_row_count',
'get_table_columns',
'hashable_getters',
'multicolumn_map_expectation',
'profile',
'remove_expectation',
'save_expectation_suite',
'save_expectation_suite_name',
'set_config_value',
'set_data_asset_name',
'set_default_expectation_argument',
'set_evaluation_parameter',
'test_column_aggregate_expectation_function',
'test_column_map_expectation_function',
'test_expectation_function',
'validate']
Now we can declare expectations about the data. Each expectation method returns a dictionary telling us whether the data meets the expectation. Start by expecting the Salary column to exist:
>>> kag_ge.expect_column_to_exist('Salary')
{'success': True}
Next, add expectations for the column mean, the range of values, missing values, the country names, and the Salary type:
>>> kag_ge.expect_column_mean_to_be_between(
... 'Salary', min_value=10_000, max_value=100_000)
{'success': True,
'result': {'observed_value': 43869.66102793441,
'element_count': 15429,
'missing_count': 0,
'missing_percent': 0.0}}
>>> kag_ge.expect_column_values_to_be_between(
... 'Salary', min_value=0, max_value=500_000)
{'success': True,
'result': {'element_count': 15429,
'missing_count': 0,
'missing_percent': 0.0,
'unexpected_count': 0,
'unexpected_percent': 0.0,
'unexpected_percent_nonmissing': 0.0,
'partial_unexpected_list': []}}
>>> kag_ge.expect_column_values_to_not_be_null('Salary')
{'success': True,
'result': {'element_count': 15429,
'unexpected_count': 0,
'unexpected_percent': 0.0,
'partial_unexpected_list': []}}
>>> kag_ge.expect_column_values_to_match_regex(
... 'Country', r'America|India|Another|China')
{'success': True,
'result': {'element_count': 15429,
'missing_count': 0,
'missing_percent': 0.0,
'unexpected_count': 0,
'unexpected_percent': 0.0,
'unexpected_percent_nonmissing': 0.0,
'partial_unexpected_list': []}}
>>> kag_ge.expect_column_values_to_be_of_type(
... 'Salary', type_='int')
{'success': True, 'result': {'observed_value': 'int64'}}
Once we are satisfied with our expectations, we can save the suite to a JSON file:
>>> kag_ge.save_expectation_suite('kaggle_expectations.json')
The file should look like this:
{
"data_asset_name": null,
"expectation_suite_name": "default",
"meta": {
"great_expectations.__version__": "0.8.6"
},
"expectations": [
{
"expectation_type": "expect_column_to_exist",
"kwargs": {
"column": "Salary"
}
},
{
"expectation_type": "expect_column_mean_to_be_between",
"kwargs": {
"column": "Salary",
"min_value": 10000,
"max_value": 100000
}
},
{
"expectation_type": "expect_column_values_to_be_between",
"kwargs": {
"column": "Salary",
"min_value": 0,
"max_value": 500000
}
},
{
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {
"column": "Salary"
}
},
{
"expectation_type": "expect_column_values_to_match_regex",
"kwargs": {
"column": "Country",
"regex": "America|India|Another|China"
}
},
{
"expectation_type": "expect_column_values_to_be_of_type",
"kwargs": {
"column": "Salary",
"type_": "int"
}
}
],
"data_asset_type": "Dataset"
}
To validate data against the saved suite, write the data out to a CSV file and use the validate function:
>>> kag_ge.to_csv('kag.csv')
>>> import json
>>> ge.validate(ge.read_csv('kag.csv'),
... expectation_suite=json.load(
... open('kaggle_expectations.json')))
{'results': [{'success': True,
'expectation_config': {'expectation_type': 'expect_column_to_exist',
'kwargs': {'column': 'Salary'}},
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None}},
{'success': True,
'result': {'observed_value': 43869.66102793441,
'element_count': 15429,
'missing_count': 0,
'missing_percent': 0.0},
'expectation_config': {'expectation_type': 'expect_column_mean_to_be_between',
'kwargs': {'column': 'Salary', 'min_value': 10000, 'max_value': 100000}},
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None}},
{'success': True,
'result': {'element_count': 15429,
'missing_count': 0,
'missing_percent': 0.0,
'unexpected_count': 0,
'unexpected_percent': 0.0,
'unexpected_percent_nonmissing': 0.0,
'partial_unexpected_list': []},
'expectation_config': {'expectation_type': 'expect_column_values_to_be_between',
'kwargs': {'column': 'Salary', 'min_value': 0, 'max_value': 500000}},
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None}},
{'success': True,
'result': {'element_count': 15429,
'unexpected_count': 0,
'unexpected_percent': 0.0,
'partial_unexpected_list': []},
'expectation_config': {'expectation_type': 'expect_column_values_to_not_be_null',
'kwargs': {'column': 'Salary'}},
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None}},
{'success': True,
'result': {'observed_value': 'int64'},
'expectation_config': {'expectation_type': 'expect_column_values_to_be_of_type',
'kwargs': {'column': 'Salary', 'type_': 'int'}},
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None}},
{'success': True,
'result': {'element_count': 15429,
'missing_count': 0,
'missing_percent': 0.0,
'unexpected_count': 0,
'unexpected_percent': 0.0,
'unexpected_percent_nonmissing': 0.0,
'partial_unexpected_list': []},
'expectation_config': {'expectation_type': 'expect_column_values_to_match_regex',
'kwargs': {'column': 'Country', 'regex': 'America|India|Another|China'}},
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None}}],
'success': True,
'statistics': {'evaluated_expectations': 6,
'successful_expectations': 6,
'unsuccessful_expectations': 0,
'success_percent': 100.0},
'meta': {'great_expectations.__version__': '0.8.6',
'data_asset_name': None,
'expectation_suite_name': 'default',
'run_id': '2020-01-08T214957.098199Z'}}
The Great Expectations library extends a pandas DataFrame. You can use it to validate raw data, or data that you have used pandas to tweak. In our example, we showed how to create expectations for a DataFrame.
There are numerous built-in expectations, listed in step 3. You can leverage those, or build a custom expectation if you desire. The result of validating the data is a dictionary with a "success" entry for each expectation as well as an overall "success" value. You can integrate these checks into a test suite to ensure that your data processing pipeline will work with new data.
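For example, a pipeline could load the saved suite and refuse to proceed when validation fails. A sketch built from the same calls used above:
>>> import json
>>> import great_expectations as ge
>>> suite = json.load(open('kaggle_expectations.json'))
>>> report = ge.validate(ge.read_csv('kag.csv'), expectation_suite=suite)
>>> if not report['success']:
...     raise ValueError('new data does not meet expectations')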
In this section, we will show how to test your pandas code by testing the artifacts it produces. We will use the third-party library pytest to do this testing.
For this recipe, we will not be using Jupyter, but rather the command line.
The pytest library supports projects laid out in a couple of different styles. We will create a folder structure that looks like this:
kag-demo-pytest/
├── data
│ └── kaggle-survey-2018.zip
├── kag.py
└── test
└── test_kag.py
The kag.py
file has code to load the raw data and code to tweak it. It looks like this:
import pandas as pd
import zipfile
def load_raw(zip_fname):
with zipfile.ZipFile(zip_fname) as z:
kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
df = kag.iloc[1:]
return df
def tweak_kag(df):
na_mask = df.Q9.isna()
hide_mask = df.Q9.str.startswith('I do not').fillna(False)
df = df[~na_mask & ~hide_mask]
q1 = (df.Q1
.replace({'Prefer not to say': 'Another',
'Prefer to self-describe': 'Another'})
.rename('Gender')
)
q2 = df.Q2.str.slice(0,2).astype(int).rename('Age')
def limit_countries(val):
if val in {'United States of America', 'India', 'China'}:
return val
return 'Another'
q3 = df.Q3.apply(limit_countries).rename('Country')
q4 = (df.Q4
.replace({"Master's degree": 18,
"Bachelor's degree": 16,
'Doctoral degree': 20,
"Some college/university study without earning a bachelor's degree": 13,
'Professional degree': 19,
'I prefer not to answer': None,
'No formal education past high school': 12})
.fillna(11)
.rename('Edu')
)
def only_cs_stat_val(val):
if val not in {'cs', 'eng', 'stat'}:
return 'another'
return val
q5 = (df.Q5
.replace({
'Computer science (software engineering, etc.)': 'cs',
'Engineering (non-computer focused)': 'eng',
'Mathematics or statistics': 'stat'})
.apply(only_cs_stat_val)
.rename('Studies'))
def limit_occupation(val):
if val in {'Student', 'Data Scientist', 'Software Engineer', 'Not employed',
'Data Engineer'}:
return val
return 'Another'
q6 = df.Q6.apply(limit_occupation).rename('Occupation')
q8 = (df.Q8
.str.replace('+', '')
.str.split('-', expand=True)
.iloc[:,0]
.fillna(-1)
.astype(int)
.rename('Experience')
)
q9 = (df.Q9
.str.replace('+','')
.str.replace(',','')
.str.replace('500000', '500')
.str.replace('I do not wish to disclose my approximate yearly compensation','')
.str.split('-', expand=True)
.iloc[:,0]
.astype(int)
.mul(1000)
.rename('Salary'))
return pd.concat([q1, q2, q3, q4, q5, q6, q8, q9], axis=1)
The test_kag.py
file looks like this:
import pytest
import kag
@pytest.fixture(scope='session')
def df():
df = kag.load_raw('data/kaggle-survey-2018.zip')
return kag.tweak_kag(df)
def test_salary_mean(df):
assert 10_000 < df.Salary.mean() < 100_000
def test_salary_between(df):
assert df.Salary.min() >= 0
assert df.Salary.max() <= 500_000
def test_salary_not_null(df):
assert not df.Salary.isna().any()
def test_country_values(df):
assert set(df.Country.unique()) == {'Another', 'United States of America', 'India', 'China'}
def test_salary_dtype(df):
assert df.Salary.dtype == int
To run the tests, go to the project directory (the one containing kag.py). If you have installed the pytest library, you will have a pytest executable. If you try to run that command, you will get an error:
(env)$ pytest
================== test session starts ==================
platform darwin -- Python 3.6.4, pytest-3.10.1, py-1.7.0, pluggy-0.8.0
rootdir: /Users/matt/pandas-cookbook/kag-demo, inifile:
plugins: asyncio-0.10.0
collected 0 items / 1 errors
======================== ERRORS =========================
___________ ERROR collecting test/test_kag.py ___________
ImportError while importing test module '/Users/matt/pandas-cookbook/kag
demo/test/test_kag.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
test/test_kag.py:3: in <module>
import kag
E ModuleNotFoundError: No module named 'kag'
!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!
================ 1 error in 0.15 seconds ================
This error occurs because pytest wants to use installed code to run the tests. Because I have not used pip (or another mechanism) to install kag.py, pytest complains that it cannot find the module in the locations where installed code lives.
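One way to satisfy pytest would be to make kag.py installable, for example with a minimal setup.py (a hypothetical sketch, not part of the layout above):
from setuptools import setup
setup(name='kag', version='0.1', py_modules=['kag'])
You could then install it in editable mode with pip install -e . so that the kag module becomes importable.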
An alternative to installing the kag.py file is to invoke pytest as a module. Run this command instead:
$ python -m pytest
=========================== test session starts
===========================
platform darwin -- Python 3.6.4, pytest-3.10.1, py-1.7.0, pluggy-0.8.0
rootdir: /Users/matt/pandas-cookbook/kag-demo, inifile:
collected 5 items
test/test_kag.py ..... [100%]
================== 5 passed, 1 warnings in 3.51 seconds ==================
Invoking pytest in this manner adds the current directory to Python's module search path, so the import of the kag module now succeeds.
Complete coverage of using the pytest
library is beyond the scope of this book. However, the test_kag.py
file contains tests specified so that pytest understands them. Any function name that begins with test_
will be recognized as a test. The parameter to these test functions, df
, is called a fixture.
Near the top of the file, I specified a function named df
that was decorated with @pytest.fixture(scope='session')
. This function will be called once when the test session begins. Any test function with the parameter named df
will get the output of this function. The scope is specified as session scope, so the data is only loaded once for the entire test session. If we did not specify a scope, the fixture would default to function scope, and the fixture would be executed once for every test function that uses it as a parameter; on my machine that makes the tests run in 12 seconds instead of three.
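For comparison, a function-scoped version of the fixture would look like this (a sketch; with the default scope the load and tweak run again for every test):
@pytest.fixture
def df():
    df = kag.load_raw('data/kaggle-survey-2018.zip')
    return kag.tweak_kag(df)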
You can run Great Expectations tests from pytest too. Add the following function to test_kag.py (you will need to update the path to the expectation suite):
def test_ge(df):
    import json
    import great_expectations as ge
    res = ge.validate(ge.from_pandas(df),
        expectation_suite=json.load(open('kaggle_expectations.json')))
    failures = []
    for exp in res['results']:
        if not exp['success']:
            failures.append(json.dumps(exp, indent=2))
    if failures:
        assert False, '\n'.join(failures)
    else:
        assert True
The Hypothesis library is a third-party library for generating tests, also known as property-based testing. You create a strategy (an object that generates samples of data) and then run your code against the strategy's output. You test an invariant: something about your data that you presume always holds true.
Again, a whole book could be written about this type of testing, but in this section we will show an example of using the library.
We will generate Kaggle survey data, run it through the tweak_kag function, and validate that the function works on new data.
We will leverage the testing code found in the previous section. The Hypothesis library works with pytest, so we can use the same layout.
We will add a test_hypot.py file and a conftest.py file to the layout:
kag-demo-hypo/
├── data
│ └── kaggle-survey-2018.zip
├── kag.py
└── test
├── conftest.py
├── test_hypot.py
└── test_kag.py
Notice the conftest.py file. This is a special file that pytest looks for when trying to find fixtures. We do not need to import it; any fixture defined in it can be used by the other test files. Move the fixture code from test_kag.py into conftest.py so that it contains the following code. We also do a little refactoring to create a raw_ function that is not a fixture, so we can call it outside of tests:
import pytest
import kag
@pytest.fixture(scope='session')
def raw():
return raw_()
def raw_():
return kag.load_raw('data/kaggle-survey-2018.zip')
@pytest.fixture(scope='session')
def df(raw):
return kag.tweak_kag(raw)
Put the following code in test_hypot.py
:
from hypothesis import given, strategies
from hypothesis.extra.pandas import column, data_frames
from conftest import raw_
import kag
def hypot_df_generator():
df = raw_()
cols = []
for col in ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q8', 'Q9']:
cols.append(column(col, elements=strategies.sampled_from(df[col].unique())))
return data_frames(columns=cols)
@given(hypot_df_generator())
def test_countries(gen_df):
if gen_df.shape[0] == 0:
return
kag_ = kag.tweak_kag(gen_df)
assert len(kag_.Country.unique()) <= 4
The hypot_df_generator function constructs a Hypothesis search strategy. A search strategy can generate data of different types, and we can build strategies by hand. In this case, I use the existing CSV file to populate the possible values for the columns I am interested in.
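If we did not want to read the CSV file, we could also hard-code a strategy (a sketch with a hypothetical subset of answers):
from hypothesis import strategies
from hypothesis.extra.pandas import column, data_frames
manual_strategy = data_frames(columns=[
    column('Q2', elements=strategies.sampled_from(['18-21', '22-24', '25-29'])),
    column('Q3', elements=strategies.sampled_from(
        ['United States of America', 'India', 'China', 'Turkey'])),
])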
The function test_countries is a pytest test decorated with @given(hypot_df_generator()). The decorator passes a gen_df object into the test function. This object is a DataFrame that complies with the specification of the search strategy. We can now test our invariants against that DataFrame. In this case, we run the tweak_kag function and ensure that the number of unique countries in the Country column is less than or equal to four.
Go to the project directory and run the test. Here is a command to run only the test_countries test:
$ python -m pytest -k test_countries
The output looks like this:
====================== test session starts ======================
platform darwin -- Python 3.6.4, pytest-5.3.2, py-1.7.0, pluggy-0.13.1
rootdir: /Users/matt/kag-demo
plugins: asyncio-0.10.0, hypothesis-5.1.2
collected 6 items / 5 deselected / 1 selected
test/test_hypot.py F [100%]
=========================== FAILURES ============================
________________________ test_countries _________________________
@given(hypot_df_generator())
> def test_countries(gen_df):
test/test_hypot.py:19:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/test_hypot.py:23: in test_countries
kag_ = kag.tweak_kag(gen_df)
kag.py:63: in tweak_kag
q8 = (df.Q8
/Users/matt/.env/364/lib/python3.6/site-packages/pandas/core/generic.py:5175: in
__getattr__
return object.__getattribute__(self, name)
/Users/matt/.env/364/lib/python3.6/site-packages/pandas/core/accessor.py:175: in
__get__
accessor_obj = self._accessor(obj)
/Users/matt/.env/364/lib/python3.6/site-packages/pandas/core/strings.py:1917: in __init__
self._inferred_dtype = self._validate(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data = Series([], Name: Q8, dtype: float64)
@staticmethod
def _validate(data):
"""
Auxiliary function for StringMethods, infers and checks dtype of data.
This is a "first line of defence" at the creation of the StringMethods-
object (see _make_accessor), and just checks that the dtype is in the
*union* of the allowed types over all string methods below; this
restriction is then refined on a per-method basis using the decorator
@forbid_nonstring_types (more info in the corresponding docstring).
This really should exclude all series/index with any non-string values,
but that isn't practical for performance reasons until we have a str
dtype (GH 9343 / 13877)
Parameters
----------
data : The content of the Series
Returns
-------
dtype : inferred dtype of data
"""
if isinstance(data, ABCMultiIndex):
raise AttributeError(
"Can only use .str accessor with Index, " "not MultiIndex"
)
# see _libs/lib.pyx for list of inferred types
allowed_types = ["string", "empty", "bytes", "mixed", "mixed-integer"]
values = getattr(data, "values", data) # Series / Index
values = getattr(values, "categories", values) # categorical / normal
try:
inferred_dtype = lib.infer_dtype(values, skipna=True)
except ValueError:
# GH#27571 mostly occurs with ExtensionArray
inferred_dtype = None
if inferred_dtype not in allowed_types:
> raise AttributeError("Can only use .str accessor with string " "values!")
E AttributeError: Can only use .str accessor with string values!
/Users/matt/.env/364/lib/python3.6/site-packages/pandas/core/strings.py:1967: AttributeError
-------------------------- Hypothesis ---------------------------
Falsifying example: test_countries(
gen_df= Q1 Q2 Q3 ... Q6 Q8 Q9
0 Female 45-49 United States of America ... Consultant NaN NaN
[1 rows x 8 columns],
)
========== 1 failed, 5 deselected, 1 warning in 2.23s ===========
There is a lot of noise in the output, but if you scan through it you will find that it is complaining about the code that processes the column Q8
. The reason for this is that it generated a single row with a NaN entry for Q8
. If we run tweak_kag
with this DataFrame, pandas infers that the Q8
column has a float type and errors out when trying to use the .str
accessor.
Is this a bug? It's hard to give a definitive answer on that. But this shows that if our raw data has only missing values then our code will not work.
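One possible fix (a sketch, not the only option) is to force Q8 to strings before using the .str accessor, so an all-missing column no longer breaks the code:
q8 = (df.Q8
    .astype(str)               # NaN becomes the string 'nan'
    .str.replace('+', '')
    .str.split('-', expand=True)
    .iloc[:, 0]
    .replace('nan', '-1')      # reuse the -1 sentinel for missing answers
    .astype(int)
    .rename('Experience')
)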
The Hypothesis library tries to generate a span of data that conforms to a specification. You can use this generated data to test that invariants hold. In our case, we saw that the survey data had missing data. When we generated a DataFrame with a single row of missing data, our tweak_kag
function did not work. The .str
accessor only works if there is at least one string value in a column, and our column only had missing data (a float value).
We could address these issues and continue to test other invariants. This illustrates a point that comes up often when programming: we get so focused on individual trees that we miss the forest. Sometimes we need to take a step back and look at things from a different perspective. Using Hypothesis is one way to do this.