Summary

The main learning outcomes of this chapter are summarized as follows:

  • Various methods and variations in importing a dataset using pandas: read_csv and its variations, reading a dataset using open method in Python, reading a file in chunks using the open method, reading directly from a URL, specifying the column names from a list, changing the delimiter of a dataset, and so on.
  • Basic exploratory analysis of data: observing a thumbnail of data, shape, column names, column types, and summary statistics for numerical variables
  • Handling missing values: The reason for incorporation of missing values, why it is important to treat them properly, how to treat them properly by deletion and imputation, and various methods of imputing data.
  • Creating dummy variables: creating dummy variables for categorical variables to be used in the predictive models.
  • Basic plotting: scatter plotting, histograms and boxplots; their meaning and relevance; and how they are plotted.

This chapter is a head start into our journey to explore our data and wrangle it to make it modelling-worthy. The next chapter will go deeper in this pursuit whereby we will learn to aggregate values for categorical variables, sub-set the dataset, merge two datasets, generate random numbers, and sample a dataset.

Cleaning, as we have seen in the last chapter takes about 80% of the modelling time, so it's of critical importance and the methods we are learning will come in handy in the pursuit of that goal.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset