How it works...

Finding streaks in the data is not a straightforward operation in pandas and requires methods that look ahead or behind, such as diff or shift, or those that remember their current state, such as cumsum. The final result from the first seven steps is a Series the same length as the original that keeps track of all consecutive ones. Throughout these steps, we use the mul and add methods instead of their operator equivalents (*) and (+). In my opinion, this allows for a slightly cleaner progression of calculations from left to right. You, of course, can replace these with the actual operators.
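As a minimal sketch of that equivalence, the snippet below compares the method form with the operator form. The sample Series of zeros and ones and the name s are assumptions chosen for illustration; s1 holds the cumulative sum.

```python
import pandas as pd

# Hypothetical sample data: 1 marks a value that belongs to a streak.
s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])
s1 = s.cumsum()

# Method form: reads left to right as a single chain.
chained = s.mul(s1).diff()

# Operator form: the product needs parentheses before diff can be chained.
with_operator = (s * s1).diff()

print(chained.equals(with_operator))  # True
```

Both forms produce the same Series; the choice between them is purely about readability.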

Ideally, we would like to tell pandas to apply the cumsum method to the start of each streak and reset itself after the end of each one. It takes many steps to convey this message to pandas. Step 2 accumulates all the ones in the Series as a whole. The rest of the steps slowly remove any excess accumulation. In order to identify this excess accumulation, we need to find the end of each streak and subtract the running total at that point from the values of the next streak.
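To make the excess accumulation concrete, here is a small sketch using an assumed sample Series (the data and the name s are illustrative, not taken from the flights dataset).

```python
import pandas as pd

# Hypothetical sample data with two streaks: lengths 2 and 3.
s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])

# cumsum never resets, so it keeps counting across streaks.
s1 = s.cumsum()
print(s1.tolist())  # [0, 1, 2, 2, 3, 4, 5, 5]
```

The second streak is counted as 3, 4, 5 rather than 1, 2, 3. The difference of 2 is the excess carried over from the first streak, which the remaining steps remove.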

To find the end of each streak, we cleverly make all values not part of the streak zero by multiplying s1 by the original Series of zeros and ones in step 3. The first zero following a non-zero marks the end of a streak. That's good, but again, we need to eliminate the excess accumulation. Knowing where the streak ends doesn't exactly get us there.
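The multiplication can be sketched as follows, again on an assumed sample Series named s.

```python
import pandas as pd

# Hypothetical sample data with two streaks.
s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])
s1 = s.cumsum()

# Multiplying by the original zeros and ones zeroes out every
# position that is not part of a streak.
masked = s.mul(s1)
print(masked.tolist())  # [0, 1, 2, 0, 3, 4, 5, 0]
```

The zeros at positions 3 and 7 each follow a non-zero value, so they mark where a streak has just ended.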

In step 4, we use the diff method to find this excess. The diff method takes the difference between the current value and the value located a set number of rows away from it. By default, it returns the difference between the current value and the immediately preceding one.
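A short sketch of diff on assumed sample values (the Series masked stands in for the output of step 3):

```python
import pandas as pd

# Hypothetical stand-in for the result of step 3.
masked = pd.Series([0, 1, 2, 0, 3, 4, 5, 0])

# Default: current value minus the immediately preceding value.
# The first row has no predecessor, so it is missing.
print(masked.diff().tolist())
# [nan, 1.0, 1.0, -2.0, 3.0, 1.0, 1.0, -5.0]

# periods=2: current value minus the value two rows above.
print(masked.diff(2).tolist())
# [nan, nan, 2.0, -1.0, 1.0, 4.0, 2.0, -4.0]
```

With the default setting, the negative entries (-2 and -5) appear exactly where a streak has just ended.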

Only negative values are meaningful in step 4. They sit on the first zero after each streak, and their magnitude is the excess accumulation. These values need to be propagated down until the end of the following streak. To eliminate (make missing) all the values we don't care about, we use the where method, which takes a boolean Series of the same size as the calling Series. By default, values where the condition is True remain the same, while those where it is False become missing. The where method also accepts a function as its first parameter, which lets you use the calling Series as part of the condition. We use an anonymous function, which gets passed the calling Series implicitly and checks whether each value is less than zero. The result of step 5 is a Series where only the negative values are preserved with the rest changed to missing.
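The following sketch shows where with an anonymous function, using an assumed sample Series named s to build the differences.

```python
import pandas as pd

# Hypothetical sample data with two streaks.
s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])
diffs = s.mul(s.cumsum()).diff()

# The lambda receives the calling Series, so no intermediate
# variable is needed inside a method chain.
negatives = diffs.where(lambda x: x < 0)
print(negatives.tolist())
# [nan, nan, nan, -2.0, nan, nan, nan, -5.0]

# Passing the boolean Series explicitly gives the same result.
print(negatives.equals(diffs.where(diffs < 0)))  # True
```

The function form is mainly useful in a chain, where the intermediate Series has no name to refer to.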

The ffill method in step 6 replaces missing values with the last non-missing value going forward/down a Series. As the first three values don't follow a non-missing value, they remain missing. We finally have our Series that removes the excess accumulation. We add our accumulation Series to the result of step 6 to get a count that restarts with each streak. The fill_value parameter of the add method lets us treat the remaining missing values as zero during the addition. This completes the process of finding streaks of ones in the dataset. When doing complex logic like this, it is a good idea to use a small dataset where you know what the final output will be. It would be quite a difficult task to start at step 8 and build this streak-finding logic while grouping.
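A sketch of these last two steps on an assumed sample Series (s and the intermediate names are illustrative):

```python
import pandas as pd

# Hypothetical sample data with two streaks: lengths 2 and 3.
s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])
s1 = s.cumsum()
negatives = s.mul(s1).diff().where(lambda x: x < 0)

# Carry each negative value down to the rows that follow it.
excess = negatives.ffill()
print(excess.tolist())
# [nan, nan, nan, -2.0, -2.0, -2.0, -2.0, -5.0]

# fill_value=0 treats the leading missing values as zero.
streaks = excess.add(s1, fill_value=0)
print(streaks.tolist())
# [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 3.0, 0.0]

# The plain operator would leave the first three rows missing.
print((excess + s1).tolist())
# [nan, nan, nan, 0.0, 1.0, 2.0, 3.0, 0.0]
```

Without fill_value, the first streak would be lost whenever it occurs before any negative value has been seen.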

In step 8, we create the ON_TIME column. One item of note is that canceled flights have missing values for ARR_DELAY, which do not pass the boolean condition and therefore result in a zero for the ON_TIME column. Canceled flights are thus treated the same as delayed flights.
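The effect of missing values on the comparison can be sketched with a tiny, made-up DataFrame. The delay values and the 15-minute threshold are assumptions for illustration; only the column names ARR_DELAY and ON_TIME come from the recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third flight is canceled (missing delay).
flights = pd.DataFrame({'ARR_DELAY': [-5.0, 40.0, np.nan, 10.0]})

# Assumed threshold: a flight is on time if it arrives less than
# 15 minutes late. Comparing a missing value yields False.
flights['ON_TIME'] = flights['ARR_DELAY'].lt(15).astype(int)
print(flights['ON_TIME'].tolist())  # [1, 0, 0, 1]
```

If canceled flights should be excluded rather than counted as delayed, they would need to be dropped before this step.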

Step 9 turns our logic from the first seven steps into a function and chains the max method to return the longest streak. As our function returns a single value, it is formally an aggregating function and can be passed to the agg method as done in step 10. To ensure that we are looking at actual consecutive flights, we use the sort_values method to sort by date and scheduled departure time.
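Putting the pieces together, a function along these lines could be used. This is a sketch under stated assumptions: the function name max_streak, the column names AIRLINE, DAY, and SCHED_DEP, and the data itself are hypothetical stand-ins for the real flights dataset.

```python
import pandas as pd

def max_streak(s):
    """Return the length of the longest run of consecutive ones in s."""
    s1 = s.cumsum()
    return (s.mul(s1)
             .diff()
             .where(lambda x: x < 0)
             .ffill()
             .add(s1, fill_value=0)
             .max())

# Hypothetical flights, deliberately out of chronological order.
flights = pd.DataFrame({
    'AIRLINE':   ['AA', 'AA', 'AA', 'AA', 'UA', 'UA', 'UA'],
    'DAY':       [2, 1, 1, 2, 1, 1, 2],
    'SCHED_DEP': [900, 800, 1200, 1500, 700, 1000, 600],
    'ON_TIME':   [1, 1, 1, 0, 0, 1, 1],
})

# Sort first so that rows within each group are truly consecutive.
longest = (flights
           .sort_values(['DAY', 'SCHED_DEP'])
           .groupby('AIRLINE')['ON_TIME']
           .agg(max_streak))

# Longest streaks: 3 for AA, 2 for UA.
print(longest)
```

Because sorting happens before grouping, each group keeps its chronological order, which is what makes the streak meaningful.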
