- Read in the flights dataset, and output the first five rows:
>>> flights = pd.read_csv('data/flights.csv')
>>> flights.head()
- If we want to find the distribution of airlines over a range of distances, we need to place the values of the DIST column into discrete bins. Let's use the pandas cut function to split the data into five bins:
>>> bins = [-np.inf, 200, 500, 1000, 2000, np.inf]
>>> cuts = pd.cut(flights['DIST'], bins=bins)
>>> cuts.head()
0 (500.0, 1000.0]
1 (1000.0, 2000.0]
2 (500.0, 1000.0]
3 (1000.0, 2000.0]
4 (1000.0, 2000.0]
Name: DIST, dtype: category
Categories (5, interval[float64]): [(-inf, 200.0] < (200.0, 500.0] < (500.0, 1000.0] < (1000.0, 2000.0] < (2000.0, inf]]
- An ordered categorical Series is created. To help get an idea of what happened, let's count the values of each category:
>>> cuts.value_counts()
(500.0, 1000.0] 20659
(200.0, 500.0] 15874
(1000.0, 2000.0] 14186
(2000.0, inf] 4054
(-inf, 200.0] 3719
Name: DIST, dtype: int64
- The cuts Series can now be used to form groups. Pandas allows you to form groups in any way you wish. Pass the cuts Series to the groupby method and then call the value_counts method on the AIRLINE column to find the distribution for each distance group. Notice that SkyWest (OO) makes up 33% of flights less than 200 miles but only 16% of those between 200 and 500 miles:
>>> flights.groupby(cuts)['AIRLINE'].value_counts(normalize=True)
.round(3).head(15)
DIST AIRLINE
(-inf, 200.0] OO 0.326
EV 0.289
MQ 0.211
DL 0.086
AA 0.052
UA 0.027
WN 0.009
(200.0, 500.0] WN 0.194
DL 0.189
OO 0.159
EV 0.156
MQ 0.100
AA 0.071
UA 0.062
VX 0.028
Name: AIRLINE, dtype: float64