In this chapter, we will look at some code that analyzes the survey Kaggle ran in 2018, which asked Kaggle users about socio-economic information.
This section presents the survey data along with some code to analyze it. Kaggle's subtitle for the dataset is "the most comprehensive dataset available on the state of machine learning and data science". Let's dig into the data and see what it contains. It was originally available at https://www.kaggle.com/kaggle/kaggle-survey-2018.
>>> import pandas as pd
>>> import numpy as np
>>> import zipfile
>>> url = 'data/kaggle-survey-2018.zip'
>>> with zipfile.ZipFile(url) as z:
... print(z.namelist())
... kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
... df = kag.iloc[1:]
['multipleChoiceResponses.csv', 'freeFormResponses.csv', 'SurveySchema.csv']
>>> df.T
1 2 3 ... 23857
Time from... 710 434 718 ... 370
Q1 Female Male Female ... Male
Q1_OTHER_... -1 -1 -1 ... -1
Q2 45-49 30-34 30-34 ... 22-24
Q3 United S... Indonesia United S... ... Turkey
... ... ... ... ... ...
Q50_Part_5 NaN NaN NaN ... NaN
Q50_Part_6 NaN NaN NaN ... NaN
Q50_Part_7 NaN NaN NaN ... NaN
Q50_Part_8 NaN NaN NaN ... NaN
Q50_OTHER... -1 -1 -1 ... -1
>>> df.dtypes
Time from Start to Finish (seconds) object
Q1 object
Q1_OTHER_TEXT object
Q2 object
Q3 object
...
Q50_Part_5 object
Q50_Part_6 object
Q50_Part_7 object
Q50_Part_8 object
Q50_OTHER_TEXT object
Length: 395, dtype: object
Every column has come in as an object type. We could go through our standard process of exploring these values using the .value_counts method:
>>> df.Q1.value_counts(dropna=False)
Male 19430
Female 4010
Prefer not to say 340
Prefer to self-describe 79
Name: Q1, dtype: int64
We can use the .rename method to give each column a better name. Some of the columns, such as Q2, Q8, and Q9, have range answers. In the case of age (Q2), you have values like 55-59 and 60-69. I use the .str.slice method to pull out the first two characters and convert the type from string to integer. For the education column (Q4), I convert the values to ordinal numbers. Finally, after I have converted many of the columns to numbers and cleaned up some of the others, I put all of the Series back into a DataFrame with pd.concat.
I put all of this code into a function, tweak_kag:
>>> def tweak_kag(df):
... na_mask = df.Q9.isna()
... hide_mask = df.Q9.str.startswith('I do not').fillna(False)
... df = df[~na_mask & ~hide_mask]
...
... q1 = (df.Q1
... .replace({'Prefer not to say': 'Another',
... 'Prefer to self-describe': 'Another'})
... .rename('Gender')
... )
... q2 = df.Q2.str.slice(0,2).astype(int).rename('Age')
... def limit_countries(val):
... if val in {'United States of America', 'India', 'China'}:
... return val
... return 'Another'
... q3 = df.Q3.apply(limit_countries).rename('Country')
...
... q4 = (df.Q4
... .replace({"Master's degree": 18,
... "Bachelor's degree": 16,
... 'Doctoral degree': 20,
... "Some college/university study without earning a bachelor's degree": 13,
... 'Professional degree': 19,
... 'I prefer not to answer': None,
... 'No formal education past high school': 12})
... .fillna(11)
... .rename('Edu')
... )
...
... def only_cs_stat_val(val):
... if val not in {'cs', 'eng', 'stat'}:
... return 'another'
... return val
...
... q5 = (df.Q5
... .replace({
... 'Computer science (software engineering, etc.)': 'cs',
... 'Engineering (non-computer focused)': 'eng',
... 'Mathematics or statistics': 'stat'})
... .apply(only_cs_stat_val)
... .rename('Studies'))
... def limit_occupation(val):
... if val in {'Student', 'Data Scientist', 'Software Engineer', 'Not employed',
... 'Data Engineer'}:
... return val
... return 'Another'
...
... q6 = df.Q6.apply(limit_occupation).rename('Occupation')
...
... q8 = (df.Q8
... .str.replace('+', '')
... .str.split('-', expand=True)
... .iloc[:,0]
... .fillna(-1)
... .astype(int)
... .rename('Experience')
... )
...
... q9 = (df.Q9
... .str.replace('+','')
... .str.replace(',','')
... .str.replace('500000', '500')
... .str.replace('I do not wish to disclose my approximate yearly compensation','')
... .str.split('-', expand=True)
... .iloc[:,0]
... .astype(int)
... .mul(1000)
... .rename('Salary'))
... return pd.concat([q1, q2, q3, q4, q5, q6, q8, q9], axis=1)
>>> tweak_kag(df)
Gender Age Country ... Occupation Experience
2 Male 30 Another ... Another 5
3 Female 30 United S... ... Data Sci... 0
5 Male 22 India ... Another 0
7 Male 35 Another ... Another 10
8 Male 18 India ... Another 0
... ... ... ... ... ... ...
23844 Male 30 Another ... Software... 10
23845 Male 22 Another ... Student 0
23854 Male 30 Another ... Another 5
23855 Male 45 Another ... Another 5
23857 Male 22 Another ... Software... 0
>>> tweak_kag(df).dtypes
Gender object
Age int64
Country object
Edu float64
Studies object
Occupation object
Experience int64
Salary int64
dtype: object
The survey data is rich with information, but it is a little hard to analyze because all of the columns come in as objects. Our tweak_kag function filters out respondents who did not provide salary information. We also convert a few of the columns (Age, Edu, Experience, and Salary) to numeric values for easier quantification. The remaining categorical columns are pruned down to a lower cardinality.
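One quick way to confirm the lower cardinality is to count the unique values in the remaining object columns (output omitted, but each column should show only a handful of distinct values):
>>> kag = tweak_kag(df)
>>> kag.select_dtypes(object).nunique()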
Cleaning up our data makes it easier to analyze. For example, we can easily group by country and correlate salary and experience:
>>> kag = tweak_kag(df)
>>> (kag
... .groupby('Country')
... .apply(lambda g: g.Salary.corr(g.Experience))
... )
Country
Another 0.289827
China 0.252974
India 0.167335
United States of America 0.354125
dtype: float64
The .apply method on a Series or DataFrame is one of the slowest operations in pandas. In this recipe, we will explore its speed and dig into what is going on.
First, let's time how long the .apply method takes using the %%timeit cell magic in Jupyter. This is the code from the tweak_kag function that limits the cardinality of the country column (Q3):
>>> %%timeit
>>> def limit_countries(val):
... if val in {'United States of America', 'India', 'China'}:
... return val
... return 'Another'
>>> q3 = df.Q3.apply(limit_countries).rename('Country')
6.42 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now let's use the .replace method instead of .apply and see if that improves performance:
>>> %%timeit
>>> other_values = df.Q3.value_counts().iloc[3:].index
>>> q3_2 = df.Q3.replace(other_values, 'Another')
27.7 ms ± 535 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That was slower than the .apply method! Let's try again. If we recreate this code using the .isin method combined with .where, it runs nearly twice as fast as .apply:
>>> %%timeit
>>> values = {'United States of America', 'India', 'China'}
>>> q3_3 = df.Q3.where(df.Q3.isin(values), 'Another')
3.39 ms ± 570 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Finally, let's try the np.where function. This is not part of pandas, but pandas often works well with NumPy functions:
>>> %%timeit
>>> values = {'United States of America', 'India', 'China'}
>>> q3_4 = pd.Series(np.where(df.Q3.isin(values), df.Q3, 'Another'),
... index=df.index)
2.75 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We can confirm that all of these approaches produce the same result:
>>> q3.equals(q3_2)
True
>>> q3.equals(q3_3)
True
>>> q3.equals(q3_4)
True
This recipe benchmarked the .apply
, .replace
, and .where
methods. Of those three, the .where
method was the quickest. Finally, it showed the NumPy where
function, which is even faster than pandas. However, if we use the NumPy function, we need to convert the result back into a series (and give it the same index as the original DataFrame).
The documentation for the .apply
method states that if you pass in a NumPy function, it will run a fast path and pass the whole series to the function. However, if you pass in a Python function, that function will be called for each value in the Series. This can be confusing because the method behaves differently depending on the parameter that is passed into it.
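Here is a minimal illustration on a throwaway Series (using the pd and np aliases imported earlier): a ufunc such as np.log takes the fast path and receives the whole Series, while a plain Python function is invoked once per element:
>>> s = pd.Series([1.0, 2.0, 3.0])
>>> s.apply(np.log)            # ufunc: called once with the whole Series
0    0.000000
1    0.693147
2    1.098612
dtype: float64
>>> s.apply(lambda x: x + 1)   # Python function: called once per value
0    2.0
1    3.0
2    4.0
dtype: float64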
If you find yourself in a situation where you are passing in a function to .apply
(or have done a groupby operation and are calling .agg
, .transform
, or some other method that takes a function as a parameter) and cannot remember what arguments will be passed into the function, you can use the following code to help. (Of course, you can also look at the documentation or even look at the code for .apply
):
>>> def limit_countries(val):
... if val in {'United States of America', 'India', 'China'}:
... return val
... return 'Another'
>>> q3 = df.Q3.apply(limit_countries).rename('Country')
>>> def debug(something):
... # what is something? A cell, series, dataframe?
... print(type(something), something)
... 1/0
>>> q3.apply(debug)
<class 'str'> United States of America
Traceback (most recent call last)
...
ZeroDivisionError: division by zero
The output shows that a string (a scalar value from the series q3
) was passed into the debug
function.
If you do not want to throw an exception, you can set a global variable to hold the parameter passed into the function:
>>> the_item = None
>>> def debug(something):
... global the_item
... the_item = something
... return something
>>> _ = q3.apply(debug)
>>> the_item
'Another'
One thing to keep in mind is that the function we pass into the .apply
method is called once per item in the Series. Operating on single items is a slow path, and we should try to avoid it if possible. The next recipe will show another option for speeding calls to .apply
.
Sometimes .apply is the most convenient way to express an operation, but as we have seen, it can be slow. Various libraries enable parallelizing or otherwise speeding up such operations. The easiest option is to leverage vectorization. Math operations in pandas are vectorized: if you add a number (say 5) to a numeric Series, pandas does not loop over each value in Python. Rather, it hands the whole column to optimized native code that can take advantage of modern CPU features.
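A rough sketch of the difference (timings omitted since they vary by machine, but the vectorized version is typically orders of magnitude faster on a column of this size):
>>> ages = tweak_kag(df).Age
>>> %%timeit
>>> ages + 5                     # vectorized: one operation over the whole column
>>> %%timeit
>>> ages.apply(lambda x: x + 5)  # Python function called once per value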
If you cannot vectorize, as is the case with our limit_countries
function, you have other options. This section will show a few of them.
Note that you will need to install these libraries as they are not included with pandas.
The examples show limiting the values in the country column of the survey data to a few values. First, import the Pandarallel library and initialize it:
>>> from pandarallel import pandarallel
>>> pandarallel.initialize()
Now call the .parallel_apply method:
>>> def limit_countries(val):
... if val in {'United States of America', 'India', 'China'}:
... return val
... return 'Another'
>>> %%timeit
>>> res_p = df.Q3.parallel_apply(limit_countries).rename('Country')
133 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Next, let's try the swifter library. Import it:
>>> import swifter
Swifter adds a .swifter accessor to Series and DataFrames. Time an apply that goes through the accessor:
>>> %%timeit
>>> res_s = df.Q3.swifter.apply(limit_countries).rename('Country')
187 ms ± 31.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Next, let's try the Dask library:
>>> import dask.dataframe
Use the .map_partitions function to run the apply on each partition of a Dask DataFrame:
>>> %%timeit
>>> res_d = (dask.dataframe.from_pandas(
... df, npartitions=4)
... .map_partitions(lambda df: df.Q3.apply(limit_countries))
... .rename('Countries')
... )
29.1 s ± 1.75 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
We can also try the np.vectorize function to create a vectorized version of our Python function:
>>> np_fn = np.vectorize(limit_countries)
>>> %%timeit
>>> res_v = df.Q3.apply(np_fn).rename('Country')
643 ms ± 86.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Finally, let's try the numba library. Import it and decorate the function with the jit decorator:
>>> from numba import jit
>>> @jit
... def limit_countries2(val):
... if val in ['United States of America', 'India', 'China']:
... return val
... return 'Another'
Now time the Numba-compiled function:
>>> %%timeit
>>> res_n = df.Q3.apply(limit_countries2).rename('Country')
158 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that there is overhead to parallelizing code. In the examples above, all of the code ran faster in serial with normal pandas. There is a crossover point beyond which paying that overhead makes sense; the Pandarallel examples use at least a million samples. Our dataset is much smaller than that, so the vanilla .apply method is faster in our case.
In steps 1 and 2 we use the Pandarallel library. This library leverages the multiprocessing module from the standard library to try to run computations in parallel. When you initialize the library, you can specify an nb_workers parameter that indicates how many CPUs to use (by default it will use all of them). The example shows how to use the .parallel_apply method, which is analogous to the .apply method in pandas. The library also works with groupby objects and Series objects.
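For example, to limit the pool to four worker processes rather than all CPUs (a sketch; nb_workers is the parameter described above):
>>> from pandarallel import pandarallel
>>> pandarallel.initialize(nb_workers=4)
>>> res_p4 = df.Q3.parallel_apply(limit_countries).rename('Country')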
Steps 3 and 4 show the swifter library. This library adds a .swifter attribute to DataFrames and Series. It takes a different approach to speeding up code: it first checks whether the operation can be vectorized. If not, it estimates how long pandas will take (by running on a small sample) and then decides whether to hand the work to the Dask library or to just stick with pandas. Again, the logic to decide which path to use has its own overhead, so blindly using this library might not lead to the most efficient code.
The Swifter website has a notebook comparing Swifter, np.vectorize, Dask, and pandas, with extensive benchmarking on different types of functions. For what it calls non-vectorized functions (which our limit_countries is, since it contains ordinary Python logic), it is not until you get to almost a million rows that the vanilla pandas .apply method starts to lose out.
Steps 5 and 6 present the Dask library. Note that there is some overhead in loading the data into Dask and leveraging the parallelization the library affords. Many users of Dask forgo pandas completely and just use Dask, as it implements similar functionality but allows processing to scale out to big data (and to run on a cluster).
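Also note that Dask defers execution of .map_partitions until you ask for the result. A sketch of materializing the output as a pandas Series with .compute():
>>> import dask.dataframe
>>> ddf = dask.dataframe.from_pandas(df, npartitions=4)
>>> res_d = (ddf
...     .map_partitions(lambda part: part.Q3.apply(limit_countries))
...     .compute()     # execute the task graph and return a pandas Series
...     .rename('Country')
... )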
In step 7 we try the vectorize function from NumPy. It creates a ufunc-like function (one that operates on NumPy arrays) from an arbitrary Python function and tries to leverage NumPy broadcasting rules. In this case, there is no performance increase from using it.
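Note that np.vectorize is normally called directly on an array rather than passed to .apply (where it would still be invoked one element at a time). A sketch of the more typical usage:
>>> np_fn = np.vectorize(limit_countries)
>>> res_v2 = pd.Series(np_fn(df.Q3.to_numpy()), index=df.index, name='Country')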
Steps 8 and 9 demonstrate the Numba library. We use the jit decorator to create a new function, limit_countries2, which Numba compiles to native code. Again, this particular function is not amenable to a speed-up from the decorator.
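Numba shines on numeric code that loops over NumPy arrays rather than on string-handling logic. A hypothetical sketch (not part of the survey analysis) of the kind of function it compiles effectively:
>>> from numba import jit
>>> @jit(nopython=True)
... def clip_salary(arr, low, high):
...     out = arr.copy()
...     for i in range(len(out)):
...         if out[i] < low:
...             out[i] = low
...         elif out[i] > high:
...             out[i] = high
...     return out
>>> capped = clip_salary(tweak_kag(df).Salary.to_numpy(), 0, 200_000)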
Many of the options illustrated here may provide a performance boost with larger datasets. In our case, blindly applying them would slow down the code.
The Jupyter environment has built-in features for quickly pulling up the documentation or the source code of a class, method, or function. I strongly encourage you to get used to using them. If you can stay in the Jupyter environment to answer questions as they come up, you will increase your productivity.
In this section, we will show how to look at the source code for the .apply
method. It is easiest to look at the documentation for a DataFrame or Series method directly on the DataFrame or Series object. Throughout this book, we have heavily recommended chaining operations on pandas objects. Sadly, Jupyter (like most editor environments) cannot perform code completion or look up documentation on the intermediate object returned from a chained method call. Hence the recommendation to perform the lookup directly on a method that is not chained.
>>> import pandas as pd
>>> import zipfile
>>> url = 'data/kaggle-survey-2018.zip'
>>> with zipfile.ZipFile(url) as z:
... kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
... df = kag.iloc[1:]
Let's look at the documentation for .apply using the Jupyter ? feature. (We could also press Shift + Tab four times in Jupyter to get this):
>>> df.Q3.apply?
Signature: df.Q3.apply(func, convert_dtype=True, args=(), **kwds)
Docstring:
Invoke function on values of Series.
Can be ufunc (a NumPy function that applies to the entire Series)
or a Python function that only works on single values.
Parameters
----------
func : function
Python function or NumPy ufunc to apply.
convert_dtype : bool, default True
Try to find better dtype for elementwise function results. If
False, leave as dtype=object.
args : tuple
Positional arguments passed to func after the series value.
**kwds
Additional keyword arguments passed to func.
Returns
-------
Series or DataFrame
If func returns a Series object the result will be a DataFrame.
See Also
--------
Series.map: For element-wise operations.
Series.agg: Only perform aggregating type operations.
Series.transform: Only perform transforming type operations.
Examples
--------
...
File: ~/.env/364/lib/python3.6/site-packages/pandas/core/series.py
Type: method
To view the source code of the method, use two question marks, ??. (There is no Shift + Tab keyboard shortcut to get the source):
>>> df.Q3.apply??
Signature: df.Q3.apply(func, convert_dtype=True, args=(), **kwds)
Source:
def apply(self, func, convert_dtype=True, args=(), **kwds):
...
if len(self) == 0:
return self._constructor(dtype=self.dtype, index=self.index).__finalize__(
self
)
# dispatch to agg
if isinstance(func, (list, dict)):
return self.aggregate(func, *args, **kwds)
# if we are a string, try to dispatch
if isinstance(func, str):
return self._try_aggregate_string_function(func, *args, **kwds)
# handle ufuncs and lambdas
if kwds or args and not isinstance(func, np.ufunc):
def f(x):
return func(x, *args, **kwds)
else:
f = func
with np.errstate(all="ignore"):
if isinstance(f, np.ufunc):
return f(self)
# row-wise access
if is_extension_type(self.dtype):
mapped = self._values.map(f)
else:
values = self.astype(object).values
mapped = lib.map_infer(values, f, convert=convert_dtype)
if len(mapped) and isinstance(mapped[0], Series):
# GH 25959 use pd.array instead of tolist
# so extension arrays can be used
return self._constructor_expanddim(pd.array(mapped), index=self.index)
else:
return self._constructor(mapped, index=self.index).__finalize__(self)
File: ~/.env/364/lib/python3.6/site-packages/pandas/core/series.py
Type: method
The interesting work happens when the mapped variable is computed. Let's try to figure out what lib.map_infer does:
>>> import pandas.core.series
>>> pandas.core.series.lib
<module 'pandas._libs.lib' from '.env/364/lib/python3.6/site-packages/pandas/_libs/lib.cpython-36m-darwin.so'>
>>> pandas.core.series.lib.map_infer??
Docstring:
Substitute for np.vectorize with pandas-friendly dtype inference
Parameters
----------
arr : ndarray
f : function
Returns
-------
mapped : ndarray
Type: builtin_function_or_method
Jupyter has the ability to inspect both the docstrings and the source code for Python objects. The standard Python REPL can leverage the built-in help
function to view a docstring, but it cannot display the source code.
Jupyter, however, has some tricks up its sleeve. If you tack a single question mark (?) onto a function or method, it will show the documentation for that code. Note that this is not valid Python syntax; it is a feature of Jupyter. If you add two question marks (??), Jupyter will display the source code of the function or method.
This recipe showed tracing through the source code to see how the .apply
method in pandas works under the covers.
In the source from step 3, we can see a shortcut for the case where the Series is empty. We can also see how string functions (that is, passing in a string literal such as 'mean') work: the string is dispatched to an aggregation helper that uses getattr to pull the corresponding method off the object.
Next, the code checks if it is dealing with a NumPy function. Eventually, it will call the function if it is an instance of np.ufunc
, or it will call the .map
method on the underlying ._values
attribute, or it will call lib.map_infer
.
In step 4, we tried to inspect lib.map_infer
but saw that it was an so
file (pyd
on Windows). This is a compiled file that is usually the result of writing Python in C or using Cython. Jupyter cannot show us the source of compiled files.
When you view the source code for a function or method, Jupyter displays the file that it belongs to at the bottom of the pane. If I really need to dig into the source code, I will open that file in an editor outside of Jupyter. Then I can browse through that code and any related code with my editor (most editors have better code navigation capabilities than Jupyter).
The previous recipes have shown how to understand pandas code and inspect it from Jupyter. In this section, we will look at using the IPython debugger (ipdb) in Jupyter.
In this section, I will create a function that throws an error when I try to use it with the series .apply
method. I will use ipdb to debug it.
>>> import pandas as pd
>>> import zipfile
>>> url = 'data/kaggle-survey-2018.zip'
>>> with zipfile.ZipFile(url) as z:
... kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
... df = kag.iloc[1:]
>>> def add1(x):
... return x + 1
>>> df.Q3.apply(add1)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-6ce28d2fea57> in <module>
2 return x + 1
3
----> 4 df.Q3.apply(add1)
~/.env/364/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
4043 else:
4044 values = self.astype(object).values
-> 4045 mapped = lib.map_infer(values, f, convert=convert_dtype)
4046
4047 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-9-6ce28d2fea57> in add1(x)
1 def add1(x):
----> 2 return x + 1
3
4 df.Q3.apply(add1)
TypeError: must be str, not int
You can use the %debug magic immediately following an exception to drop into a debugger. (This might seem a little backward because you run it after the cell that raised the exception.) It opens the debugger at the point where the exception was thrown. You can then use the debugger commands to navigate the stack: the u command pops up the stack to the function that called the current line, and you can inspect objects using the print command (p):
(Figure: a Jupyter debugging session)
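A sketch of what such a session might look like (the exact prompts and frames depend on your environment):
>>> %debug
ipdb> p x      # print the local variable x in the failing frame
ipdb> u        # move up the stack to the calling frame
ipdb> q        # quit the debugger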
Alternatively, you can call the set_trace function from the IPython debugger inside your code. This will drop you into the debugger immediately after that line runs:
>>> from IPython.core.debugger import set_trace
>>> def add1(x):
... set_trace()
... return x + 1
>>> df.Q3.apply(add1)
(Figure: a Jupyter debugging session)
Jupyter (which is derived from IPython) ships with the IPython debugger. This replicates the functionality of the pdb
module in the standard library, but with niceties such as syntax highlighting. (It also has tab completion, but this does not work in Jupyter, only in the IPython console).
If you are unfamiliar with using the debugger, here is a lifejacket for you: The command h
will print out all of the commands that you can run from the debugger:
ipdb> h
Documented commands (type help <topic>):
========================================
EOF cl disable interact next psource rv unt
a clear display j p q s until
alias commands down jump pdef quit source up
args condition enable l pdoc r step w
b cont exit list pfile restart tbreak whatis
break continue h ll pinfo return u where
bt d help longlist pinfo2 retval unalias
c debug ignore n pp run undisplay
The most common commands that I use are s
, n
, l
, u
, d
, and c
. If you want to know what s
does, then type:
ipdb> h s
s(tep)
Execute the current line, stop at the first possible occasion
(either in a function that is called or in the current
function).
This tells the debugger to print the help
(h
) documentation for step
(s
). Because we are usually coding in small steps in Jupyter, a debugger is often overkill. But knowing how to use it can come in handy, especially if you want to jump into pandas source code and understand what is going on.
Great Expectations is a third-party tool that allows you to capture and define the properties of a dataset. You can save these properties and then use them to validate future data to ensure data integrity. This can be very useful when building machine learning models, as new categorical data values and numeric outliers tend to cause a model to perform poorly or error out.
In this section, we will look at the Kaggle dataset and make an expectation suite to test and validate the data.
We will use the tweak_kag function defined previously to clean up the survey data:
>>> kag = tweak_kag(df)
Next, use the from_pandas function to create a Great Expectations DataFrame (a subclass of a pandas DataFrame with some extra methods):
>>> import great_expectations as ge
>>> kag_ge = ge.from_pandas(kag)
Let's list the methods that the Great Expectations subclass adds:
>>> sorted([x for x in set(dir(kag_ge)) - set(dir(kag))
... if not x.startswith('_')])
['autoinspect',
'batch_fingerprint',
'batch_id',
'batch_kwargs',
'column_aggregate_expectation',
'column_map_expectation',
'column_pair_map_expectation',
'discard_failing_expectations',
'edit_expectation_suite',
'expect_column_bootstrapped_ks_test_p_value_to_be_greater_than',
'expect_column_chisquare_test_p_value_to_be_greater_than',
'expect_column_distinct_values_to_be_in_set',
'expect_column_distinct_values_to_contain_set',
'expect_column_distinct_values_to_equal_set',
'expect_column_kl_divergence_to_be_less_than',
'expect_column_max_to_be_between',
'expect_column_mean_to_be_between',
'expect_column_median_to_be_between',
'expect_column_min_to_be_between',
'expect_column_most_common_value_to_be_in_set',
'expect_column_pair_values_A_to_be_greater_than_B',
'expect_column_pair_values_to_be_equal',
'expect_column_pair_values_to_be_in_set',
'expect_column_parameterized_distribution_ks_test_p_value_to_be_greater_than',
'expect_column_proportion_of_unique_values_to_be_between',
'expect_column_quantile_values_to_be_between',
'expect_column_stdev_to_be_between',
'expect_column_sum_to_be_between',
'expect_column_to_exist',
'expect_column_unique_value_count_to_be_between',
'expect_column_value_lengths_to_be_between',
'expect_column_value_lengths_to_equal',
'expect_column_values_to_be_between',
'expect_column_values_to_be_dateutil_parseable',
'expect_column_values_to_be_decreasing',
'expect_column_values_to_be_in_set',
'expect_column_values_to_be_in_type_list',
'expect_column_values_to_be_increasing',
'expect_column_values_to_be_json_parseable',
'expect_column_values_to_be_null',
'expect_column_values_to_be_of_type',
'expect_column_values_to_be_unique',
'expect_column_values_to_match_json_schema',
'expect_column_values_to_match_regex',
'expect_column_values_to_match_regex_list',
'expect_column_values_to_match_strftime_format',
'expect_column_values_to_not_be_in_set',
'expect_column_values_to_not_be_null',
'expect_column_values_to_not_match_regex',
'expect_column_values_to_not_match_regex_list',
'expect_multicolumn_values_to_be_unique',
'expect_table_column_count_to_be_between',
'expect_table_column_count_to_equal',
'expect_table_columns_to_match_ordered_list',
'expect_table_row_count_to_be_between',
'expect_table_row_count_to_equal',
'expectation',
'find_expectation_indexes',
'find_expectations',
'from_dataset',
'get_column_count',
'get_column_count_in_range',
'get_column_hist',
'get_column_max',
'get_column_mean',
'get_column_median',
'get_column_min',
'get_column_modes',
'get_column_nonnull_count',
'get_column_partition',
'get_column_quantiles',
'get_column_stdev',
'get_column_sum',
'get_column_unique_count',
'get_column_value_counts',
'get_config_value',
'get_data_asset_name',
'get_default_expectation_arguments',
'get_evaluation_parameter',
'get_expectation_suite',
'get_expectation_suite_name',
'get_expectations_config',
'get_row_count',
'get_table_columns',
'hashable_getters',
'multicolumn_map_expectation',
'profile',
'remove_expectation',
'save_expectation_suite',
'save_expectation_suite_name',
'set_config_value',
'set_data_asset_name',
'set_default_expectation_argument',
'set_evaluation_parameter',
'test_column_aggregate_expectation_function',
'test_column_map_expectation_function',
'test_expectation_function',
'validate']
Now we can declare expectations about the data. Each expectation method returns a dictionary telling us whether the data meets the expectation. Start by expecting the Salary column to exist:
>>> kag_ge.expect_column_to_exist('Salary')
{'success': True}
Next, add expectations for the column mean, the range of values, missing values, the country names, and the Salary type:
>>> kag_ge.expect_column_mean_to_be_between(
... 'Salary', min_value=10_000, max_value=100_000)
{'success': True,
'result': {'observed_value': 43869.66102793441,
'element_count': 15429,
'missing_count': 0,
'missing_percent': 0.0}}
>>> kag_ge.expect_column_values_to_be_between(
... 'Salary', min_value=0, max_value=500_000)
{'success': True,
'result': {'element_count': 15429,
'missing_count': 0,
'missing_percent': 0.0,
'unexpected_count': 0,
'unexpected_percent': 0.0,
'unexpected_percent_nonmissing': 0.0,
'partial_unexpected_list': []}}
>>> kag_ge.expect_column_values_to_not_be_null('Salary')
{'success': True,
'result': {'element_count': 15429,
'unexpected_count': 0,
'unexpected_percent': 0.0,
'partial_unexpected_list': []}}
>>> kag_ge.expect_column_values_to_match_regex(
... 'Country', r'America|India|Another|China')
{'success': True,
'result': {'element_count': 15429,
'missing_count': 0,
'missing_percent': 0.0,
'unexpected_count': 0,
'unexpected_percent': 0.0,
'unexpected_percent_nonmissing': 0.0,
'partial_unexpected_list': []}}
>>> kag_ge.expect_column_values_to_be_of_type(
... 'Salary', type_='int')
{'success': True, 'result': {'observed_value': 'int64'}}
Once we are satisfied with our expectations, we can save the suite to a JSON file:
>>> kag_ge.save_expectation_suite('kaggle_expectations.json')
The file should look like this:
{
"data_asset_name": null,
"expectation_suite_name": "default",
"meta": {
"great_expectations.__version__": "0.8.6"
},
"expectations": [
{
"expectation_type": "expect_column_to_exist",
"kwargs": {
"column": "Salary"
}
},
{
"expectation_type": "expect_column_mean_to_be_between",
"kwargs": {
"column": "Salary",
"min_value": 10000,
"max_value": 100000
}
},
{
"expectation_type": "expect_column_values_to_be_between",
"kwargs": {
"column": "Salary",
"min_value": 0,
"max_value": 500000
}
},
{
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {
"column": "Salary"
}
},
{
"expectation_type": "expect_column_values_to_match_regex",
"kwargs": {
"column": "Country",
"regex": "America|India|Another|China"
}
},
{
"expectation_type": "expect_column_values_to_be_of_type",
"kwargs": {
"column": "Salary",
"type_": "int"
}
}
],
"data_asset_type": "Dataset"
}
To validate data against the saved suite, write the data out to a CSV file and use the validate function:
>>> kag_ge.to_csv('kag.csv')
>>> import json
>>> ge.validate(ge.read_csv('kag.csv'),
... expectation_suite=json.load(
... open('kaggle_expectations.json')))
{'results': [{'success': True,
'expectation_config': {'expectation_type': 'expect_column_to_exist',
'kwargs': {'column': 'Salary'}},
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None}},
{'success': True,
'result': {'observed_value': 43869.66102793441,
'element_count': 15429,
'missing_count': 0,
'missing_percent': 0.0},
'expectation_config': {'expectation_type': 'expect_column_mean_to_be_between',
'kwargs': {'column': 'Salary', 'min_value': 10000, 'max_value': 100000}},
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None}},
{'success': True,
'result': {'element_count': 15429,
'missing_count': 0,
'missing_percent': 0.0,
'unexpected_count': 0,
'unexpected_percent': 0.0,
'unexpected_percent_nonmissing': 0.0,
'partial_unexpected_list': []},
'expectation_config': {'expectation_type': 'expect_column_values_to_be_between',
'kwargs': {'column': 'Salary', 'min_value': 0, 'max_value': 500000}},
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None}},
{'success': True,
'result': {'element_count': 15429,
'unexpected_count': 0,
'unexpected_percent': 0.0,
'partial_unexpected_list': []},
'expectation_config': {'expectation_type': 'expect_column_values_to_not_be_null',
'kwargs': {'column': 'Salary'}},
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None}},
{'success': True,
'result': {'observed_value': 'int64'},
'expectation_config': {'expectation_type': 'expect_column_values_to_be_of_type',
'kwargs': {'column': 'Salary', 'type_': 'int'}},
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None}},
{'success': True,
'result': {'element_count': 15429,
'missing_count': 0,
'missing_percent': 0.0,
'unexpected_count': 0,
'unexpected_percent': 0.0,
'unexpected_percent_nonmissing': 0.0,
'partial_unexpected_list': []},
'expectation_config': {'expectation_type': 'expect_column_values_to_match_regex',
'kwargs': {'column': 'Country', 'regex': 'America|India|Another|China'}},
'exception_info': {'raised_exception': False,
'exception_message': None,
'exception_traceback': None}}],
'success': True,
'statistics': {'evaluated_expectations': 6,
'successful_expectations': 6,
'unsuccessful_expectations': 0,
'success_percent': 100.0},
'meta': {'great_expectations.__version__': '0.8.6',
'data_asset_name': None,
'expectation_suite_name': 'default',
'run_id': '2020-01-08T214957.098199Z'}}
The Great Expectations library extends a pandas DataFrame. You can use it to validate raw data, or data that you have used pandas to tweak. In our example, we showed how to create expectations for a DataFrame.
There are numerous built-in expectations, listed in step 3. You can leverage those, or build a custom expectation if you desire. The result of validating the data is a dictionary with a "success" entry for each expectation as well as an overall "success" value. You can integrate these checks into a test suite to ensure that your data processing pipeline will work with new data.
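For example, a pipeline could load the saved suite and refuse to proceed when validation fails. A sketch built from the same calls used above:
>>> import json
>>> import great_expectations as ge
>>> suite = json.load(open('kaggle_expectations.json'))
>>> report = ge.validate(ge.read_csv('kag.csv'), expectation_suite=suite)
>>> if not report['success']:
...     raise ValueError('new data does not meet expectations')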
In this section, we will show how to test your pandas code by testing the artifacts it produces. We will use the third-party library pytest to do this testing.
For this recipe, we will not be using Jupyter, but rather the command line.
The pytest library supports projects laid out in a couple of different styles. We will create a folder structure that looks like this:
kag-demo-pytest/
├── data
│ └── kaggle-survey-2018.zip
├── kag.py
└── test
└── test_kag.py
The kag.py
file has code to load the raw data and code to tweak it. It looks like this:
import pandas as pd
import zipfile
def load_raw(zip_fname):
with zipfile.ZipFile(zip_fname) as z:
kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
df = kag.iloc[1:]
return df
def tweak_kag(df):
na_mask = df.Q9.isna()
hide_mask = df.Q9.str.startswith('I do not').fillna(False)
df = df[~na_mask & ~hide_mask]
q1 = (df.Q1
.replace({'Prefer not to say': 'Another',
'Prefer to self-describe': 'Another'})
.rename('Gender')
)
q2 = df.Q2.str.slice(0,2).astype(int).rename('Age')
def limit_countries(val):
if val in {'United States of America', 'India', 'China'}:
return val
return 'Another'
q3 = df.Q3.apply(limit_countries).rename('Country')
q4 = (df.Q4
.replace({"Master's degree": 18,
"Bachelor's degree": 16,
'Doctoral degree': 20,
"Some college/university study without earning a bachelor's degree": 13,
'Professional degree': 19,
'I prefer not to answer': None,
'No formal education past high school': 12})
.fillna(11)
.rename('Edu')
)
def only_cs_stat_val(val):
if val not in {'cs', 'eng', 'stat'}:
return 'another'
return val
q5 = (df.Q5
.replace({
'Computer science (software engineering, etc.)': 'cs',
'Engineering (non-computer focused)': 'eng',
'Mathematics or statistics': 'stat'})
.apply(only_cs_stat_val)
.rename('Studies'))
def limit_occupation(val):
if val in {'Student', 'Data Scientist', 'Software Engineer', 'Not employed',
'Data Engineer'}:
return val
return 'Another'
q6 = df.Q6.apply(limit_occupation).rename('Occupation')
q8 = (df.Q8
.str.replace('+', '')
.str.split('-', expand=True)
.iloc[:,0]
.fillna(-1)
.astype(int)
.rename('Experience')
)
q9 = (df.Q9
.str.replace('+','')
.str.replace(',','')
.str.replace('500000', '500')
.str.replace('I do not wish to disclose my approximate yearly compensation','')
.str.split('-', expand=True)
.iloc[:,0]
.astype(int)
.mul(1000)
.rename('Salary'))
return pd.concat([q1, q2, q3, q4, q5, q6, q8, q9], axis=1)
The test_kag.py
file looks like this:
import pytest
import kag
@pytest.fixture(scope='session')
def df():
df = kag.load_raw('data/kaggle-survey-2018.zip')
return kag.tweak_kag(df)
def test_salary_mean(df):
assert 10_000 < df.Salary.mean() < 100_000
def test_salary_between(df):
assert df.Salary.min() >= 0
assert df.Salary.max() <= 500_000
def test_salary_not_null(df):
assert not df.Salary.isna().any()
def test_country_values(df):
assert set(df.Country.unique()) == {'Another', 'United States of America', 'India', 'China'}
def test_salary_dtype(df):
assert df.Salary.dtype == int
To run the tests, go to the project directory (the one containing kag.py). If you have installed the pytest library, you will have a pytest executable. If you try to run that command, you will get an error:
(env)$ pytest
================== test session starts ==================
platform darwin -- Python 3.6.4, pytest-3.10.1, py-1.7.0, pluggy-0.8.0
rootdir: /Users/matt/pandas-cookbook/kag-demo, inifile:
plugins: asyncio-0.10.0
collected 0 items / 1 errors
======================== ERRORS =========================
___________ ERROR collecting test/test_kag.py ___________
ImportError while importing test module '/Users/matt/pandas-cookbook/kag
demo/test/test_kag.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
test/test_kag.py:3: in <module>
import kag
E ModuleNotFoundError: No module named 'kag'
!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!
================ 1 error in 0.15 seconds ================
This error occurs because pytest wants to use installed code to run the tests. Because I have not used pip (or another mechanism) to install kag.py, pytest complains that it cannot find the module in the locations where installed code lives.
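One way to satisfy pytest would be to make kag.py installable, for example with a minimal setup.py (a hypothetical sketch, not part of the layout above):
from setuptools import setup
setup(name='kag', version='0.1', py_modules=['kag'])
You could then install it in editable mode with pip install -e . so that the kag module becomes importable.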
An alternative to installing the kag.py file is to invoke pytest as a module. Run this command instead:
$ python -m pytest
=========================== test session starts
===========================
platform darwin -- Python 3.6.4, pytest-3.10.1, py-1.7.0, pluggy-0.8.0
rootdir: /Users/matt/pandas-cookbook/kag-demo, inifile:
collected 5 items
test/test_kag.py ..... [100%]
================== 5 passed, 1 warnings in 3.51 seconds ==================
Invoking pytest in this manner adds the current directory to Python's module search path, so the import of the kag module now succeeds.
Complete coverage of using the pytest
library is beyond the scope of this book. However, the test_kag.py
file contains tests specified so that pytest understands them. Any function name that begins with test_
will be recognized as a test. The parameter to these test functions, df
, is called a fixture.
Near the top of the file, I specified a function named df
that was decorated with @pytest.fixture(scope='session')
. This function will be called once when the test session begins. Any test function with the parameter named df
will get the output of this function. The scope is specified as session scope, so the data is only loaded once for the entire test session. If we did not specify a scope, the fixture would default to function scope, and the fixture would be executed once for every test function that uses it as a parameter; on my machine that makes the tests run in 12 seconds instead of three.
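For comparison, a function-scoped version of the fixture would look like this (a sketch; with the default scope the load and tweak run again for every test):
@pytest.fixture
def df():
    df = kag.load_raw('data/kaggle-survey-2018.zip')
    return kag.tweak_kag(df)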
You can run Great Expectations tests from pytest too. Add the following function to test_kag.py (you will need to update the path to the expectation suite):
def test_ge(df):
    import json
    import great_expectations as ge
    res = ge.validate(ge.from_pandas(df),
        expectation_suite=json.load(open('kaggle_expectations.json')))
    failures = []
    for exp in res['results']:
        if not exp['success']:
            failures.append(json.dumps(exp, indent=2))
    if failures:
        assert False, '\n'.join(failures)
    else:
        assert True
The Hypothesis library is a third-party library for generating tests, also known as property-based testing. You create a strategy (an object that generates samples of data) and then run your code against the strategy's output. You test an invariant: something about your data that you presume always holds true.
Again, a whole book could be written about this type of testing, but in this section we will show an example of using the library.
We will generate Kaggle survey data, run it through the tweak_kag function, and validate that the function works on new data.
We will leverage the testing code found in the previous section. The Hypothesis library works with pytest, so we can use the same layout.
We will add a test_hypot.py file and a conftest.py file to the layout:
kag-demo-hypo/
├── data
│ └── kaggle-survey-2018.zip
├── kag.py
└── test
├── conftest.py
├── test_hypot.py
└── test_kag.py
Notice the conftest.py file. This is a special file that pytest looks for when trying to find fixtures. We do not need to import it; any fixture defined in it can be used by the other test files. Move the fixture code from test_kag.py into conftest.py so that it contains the following code. We also do a little refactoring to create a raw_ function that is not a fixture, so we can call it outside of tests:
import pytest
import kag
@pytest.fixture(scope='session')
def raw():
return raw_()
def raw_():
return kag.load_raw('data/kaggle-survey-2018.zip')
@pytest.fixture(scope='session')
def df(raw):
return kag.tweak_kag(raw)
Put the following code in test_hypot.py
:
from hypothesis import given, strategies
from hypothesis.extra.pandas import column, data_frames
from conftest import raw_
import kag
def hypot_df_generator():
df = raw_()
cols = []
for col in ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q8', 'Q9']:
cols.append(column(col, elements=strategies.sampled_from(df[col].unique())))
return data_frames(columns=cols)
@given(hypot_df_generator())
def test_countries(gen_df):
if gen_df.shape[0] == 0:
return
kag_ = kag.tweak_kag(gen_df)
assert len(kag_.Country.unique()) <= 4
The hypot_df_generator function constructs a Hypothesis search strategy. A search strategy can generate data of different types, and we can build strategies by hand. In this case, I use the existing CSV file to populate the possible values for the columns I am interested in.
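If we did not want to read the CSV file, we could also hard-code a strategy (a sketch with a hypothetical subset of answers):
from hypothesis import strategies
from hypothesis.extra.pandas import column, data_frames
manual_strategy = data_frames(columns=[
    column('Q2', elements=strategies.sampled_from(['18-21', '22-24', '25-29'])),
    column('Q3', elements=strategies.sampled_from(
        ['United States of America', 'India', 'China', 'Turkey'])),
])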
The function test_countries is a pytest test decorated with @given(hypot_df_generator()). The decorator passes a gen_df object into the test function. This object is a DataFrame that complies with the specification of the search strategy. We can now test our invariants against that DataFrame. In this case, we run the tweak_kag function and ensure that the number of unique countries in the Country column is less than or equal to four.
Go to the project directory and run the test. Here is a command to run only the test_countries test:
$ python -m pytest -k test_countries
The output looks like this:
====================== test session starts ======================
platform darwin -- Python 3.6.4, pytest-5.3.2, py-1.7.0, pluggy-0.13.1
rootdir: /Users/matt/kag-demo
plugins: asyncio-0.10.0, hypothesis-5.1.2
collected 6 items / 5 deselected / 1 selected
test/test_hypot.py F [100%]
=========================== FAILURES ============================
________________________ test_countries _________________________
@given(hypot_df_generator())
> def test_countries(gen_df):
test/test_hypot.py:19:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/test_hypot.py:23: in test_countries
kag_ = kag.tweak_kag(gen_df)
kag.py:63: in tweak_kag
q8 = (df.Q8
/Users/matt/.env/364/lib/python3.6/site-packages/pandas/core/generic.py:5175: in
__getattr__
return object.__getattribute__(self, name)
/Users/matt/.env/364/lib/python3.6/site-packages/pandas/core/accessor.py:175: in
__get__
accessor_obj = self._accessor(obj)
/Users/matt/.env/364/lib/python3.6/site-packages/pandas/core/strings.py:1917: in __init__
self._inferred_dtype = self._validate(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data = Series([], Name: Q8, dtype: float64)
@staticmethod
def _validate(data):
"""
Auxiliary function for StringMethods, infers and checks dtype of data.
This is a "first line of defence" at the creation of the StringMethods-
object (see _make_accessor), and just checks that the dtype is in the
*union* of the allowed types over all string methods below; this
restriction is then refined on a per-method basis using the decorator
@forbid_nonstring_types (more info in the corresponding docstring).
This really should exclude all series/index with any non-string values,
but that isn't practical for performance reasons until we have a str
dtype (GH 9343 / 13877)
Parameters
----------
data : The content of the Series
Returns
-------
dtype : inferred dtype of data
"""
if isinstance(data, ABCMultiIndex):
raise AttributeError(
"Can only use .str accessor with Index, " "not MultiIndex"
)
# see _libs/lib.pyx for list of inferred types
allowed_types = ["string", "empty", "bytes", "mixed", "mixed-integer"]
values = getattr(data, "values", data) # Series / Index
values = getattr(values, "categories", values) # categorical / normal
try:
inferred_dtype = lib.infer_dtype(values, skipna=True)
except ValueError:
# GH#27571 mostly occurs with ExtensionArray
inferred_dtype = None
if inferred_dtype not in allowed_types:
> raise AttributeError("Can only use .str accessor with string " "values!")
E AttributeError: Can only use .str accessor with string values!
/Users/matt/.env/364/lib/python3.6/site-packages/pandas/core/strings.py:1967: AttributeError
-------------------------- Hypothesis ---------------------------
Falsifying example: test_countries(
gen_df= Q1 Q2 Q3 ... Q6 Q8 Q9
0 Female 45-49 United States of America ... Consultant NaN NaN
[1 rows x 8 columns],
)
========== 1 failed, 5 deselected, 1 warning in 2.23s ===========
There is a lot of noise in the output, but if you scan through it you will find that it is complaining about the code that processes the column Q8
. The reason for this is that it generated a single row with a NaN entry for Q8
. If we run tweak_kag
with this DataFrame, pandas infers that the Q8
column has a float type and errors out when trying to use the .str
accessor.
Is this a bug? It's hard to give a definitive answer on that. But this shows that if our raw data has only missing values then our code will not work.
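One possible fix (a sketch, not the only option) is to force Q8 to strings before using the .str accessor, so an all-missing column no longer breaks the code:
q8 = (df.Q8
    .astype(str)               # NaN becomes the string 'nan'
    .str.replace('+', '')
    .str.split('-', expand=True)
    .iloc[:, 0]
    .replace('nan', '-1')      # reuse the -1 sentinel for missing answers
    .astype(int)
    .rename('Experience')
)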
The Hypothesis library tries to generate a span of data that conforms to a specification. You can use this generated data to test that invariants hold. In our case, we saw that the survey data had missing data. When we generated a DataFrame with a single row of missing data, our tweak_kag
function did not work. The .str
accessor only works if there is at least one string value in a column, and our column only had missing data (a float value).
We could address these issues and continue to test other invariants. This illustrates a point that comes up often when programming: we get so focused on individual trees that we miss the forest. Sometimes we need to take a step back and look at things from a different perspective. Using Hypothesis is one way to do this.