How to do it...

  1. Read the college dataset with the institution name as the index:
>>> college = pd.read_csv('data/college.csv', index_col='INSTNM')
>>> college.dtypes
CITY                   object
STABBR                 object
HBCU                  float64
MENONLY               float64
                       ...
PCTFLOAN              float64
UG25ABV               float64
MD_EARN_WNE_P10        object
GRAD_DEBT_MDN_SUPP     object
Length: 26, dtype: object

  2. All the columns besides CITY and STABBR appear to be numeric. Examining the data types from the preceding step reveals, unexpectedly, that the MD_EARN_WNE_P10 and GRAD_DEBT_MDN_SUPP columns are of type object and not numeric. To get a better idea of what kinds of values these columns hold, let's examine their first values:
>>> college.MD_EARN_WNE_P10.iloc[0]
'30300'

>>> college.GRAD_DEBT_MDN_SUPP.iloc[0]
'33888'
  3. These values are strings, but we would like them to be numeric. This suggests that non-numeric characters appear elsewhere in the Series. One way to check is to sort these columns in descending order and examine the first few rows:
>>> college.MD_EARN_WNE_P10.sort_values(ascending=False).head()
INSTNM
Sharon Regional Health System School of Nursing    PrivacySuppressed
Northcoast Medical Training Academy                PrivacySuppressed
Success Schools                                    PrivacySuppressed
Louisiana Culinary Institute                       PrivacySuppressed
Bais Medrash Toras Chesed                          PrivacySuppressed
Name: MD_EARN_WNE_P10, dtype: object
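This check works because of lexicographic string ordering: in a column of dtype object, an alphabetic value such as 'PrivacySuppressed' sorts above any digit-leading string, so a descending sort surfaces the non-numeric entries first. A minimal sketch with made-up values:

```python
import pandas as pd

# Toy Series mimicking a column of numeric strings with a
# privacy-suppressed entry (values invented for illustration)
earnings = pd.Series(['30300', '21500', 'PrivacySuppressed', '18900'])

# Descending lexicographic sort: 'P' ranks above any digit,
# so the non-numeric entry floats to the top
top = earnings.sort_values(ascending=False).iloc[0]
print(top)  # PrivacySuppressed
```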
  4. The culprit appears to be that some schools have privacy concerns about these two columns of data. To force these columns to be numeric, use the pandas function to_numeric:
>>> cols = ['MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP']
>>> for col in cols:
...     college[col] = pd.to_numeric(college[col], errors='coerce')

>>> college.dtypes.loc[cols]
MD_EARN_WNE_P10       float64
GRAD_DEBT_MDN_SUPP    float64
dtype: object
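With errors='coerce', to_numeric turns any unparseable string into NaN instead of raising an exception, which is why the resulting columns become float64. A quick sketch with toy data:

```python
import pandas as pd

# Toy column mixing numeric strings with a sentinel string
col = pd.Series(['30300', 'PrivacySuppressed', '33888'])

converted = pd.to_numeric(col, errors='coerce')
print(converted.dtype)         # float64
print(converted.isna().sum())  # 1 -- the suppressed entry became NaN
```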

  5. Use the select_dtypes method to filter for only the numeric columns. This excludes the STABBR and CITY columns, where a maximum value doesn't make sense for this problem:
>>> college_n = college.select_dtypes(include=[np.number])
>>> college_n.head()
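select_dtypes filters columns by data type; passing np.number keeps every integer and float column and drops the object columns in one call. A sketch on a tiny frame (column names borrowed from the college data, values invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'CITY': ['Normal'],      # object  -> dropped
                   'STABBR': ['AL'],        # object  -> dropped
                   'UGDS': [4206.0],        # float64 -> kept
                   'SATVRMID': [424.0]})    # float64 -> kept

numeric = df.select_dtypes(include=[np.number])
print(numeric.columns.tolist())  # ['UGDS', 'SATVRMID']
```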
  6. The data dictionary shows that several columns have only binary (0/1) values, which will not provide useful information here. To find these columns programmatically, we can create a boolean Series identifying all the columns that have exactly two unique values with the nunique method:
>>> criteria = college_n.nunique() == 2
>>> criteria.head()
HBCU         True
MENONLY      True
WOMENONLY    True
RELAFFIL     True
SATVRMID    False
dtype: bool
  7. Pass this boolean Series to the indexing operator of the columns index object and create a list of the binary columns:
>>> binary_cols = college_n.columns[criteria].tolist()
>>> binary_cols
['HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL', 'DISTANCEONLY', 'CURROPER']
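Steps 6 and 7 combine nunique, which counts distinct values per column, with boolean indexing on the columns Index. A minimal sketch with invented data:

```python
import pandas as pd

# Toy frame: HBCU takes only two values (binary), SATVRMID does not
df = pd.DataFrame({'HBCU': [0.0, 1.0, 0.0, 1.0],
                   'SATVRMID': [510.0, 570.0, 430.0, 485.0]})

criteria = df.nunique() == 2           # boolean Series over columns
binary_cols = df.columns[criteria].tolist()
print(binary_cols)  # ['HBCU']
```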
  8. Remove the binary columns with the drop method:
>>> college_n2 = college_n.drop(labels=binary_cols, axis='columns')
>>> college_n2.head()
  9. Use the idxmax method to find the index label of the maximum value for each column:
>>> max_cols = college_n2.idxmax()
>>> max_cols
SATVRMID                      California Institute of Technology
SATMTMID                      California Institute of Technology
UGDS                               University of Phoenix-Arizona
UGDS_WHITE                Mr Leon's School of Hair Design-Moscow
                                         ...
PCTFLOAN                                  ABC Beauty College Inc
UG25ABV                           Dongguk University-Los Angeles
MD_EARN_WNE_P10                     Medical College of Wisconsin
GRAD_DEBT_MDN_SUPP    Southwest University of Visual Arts-Tucson
Length: 18, dtype: object
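idxmax returns, for each column, the index label of the row holding that column's maximum. A sketch on a two-row frame (the SAT and enrollment figures are invented; the institution names serve only as index labels):

```python
import pandas as pd

df = pd.DataFrame({'SATVRMID': [765.0, 570.0],
                   'UGDS': [983.0, 151558.0]},
                  index=['California Institute of Technology',
                         'University of Phoenix-Arizona'])

max_cols = df.idxmax()
print(max_cols['SATVRMID'])  # California Institute of Technology
print(max_cols['UGDS'])      # University of Phoenix-Arizona
```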
  10. Call the unique method on the max_cols Series. This returns an ndarray of the unique school names:
>>> unique_max_cols = max_cols.unique()
>>> unique_max_cols[:5]
array(['California Institute of Technology',
       'University of Phoenix-Arizona',
       "Mr Leon's School of Hair Design-Moscow",
       'Velvatex College of Beauty Culture',
       'Thunderbird School of Global Management'], dtype=object)
  11. Use these unique school names to select only the rows that contain a column's maximum value, and then use the style attribute to highlight these values:
>>> college_n2.loc[unique_max_cols].style.highlight_max()