Chapter 2. Managing and Understanding Data

A key early component of any machine learning project involves managing and understanding the data you have collected. Although you may not find it as gratifying as building and deploying models—the stages in which you begin to see the fruits of your labor—you cannot ignore the preparatory work.

Any learning algorithm is only as good as its input data, and in many cases, input data is complex, messy, and spread across multiple sources and formats. Because of this complexity, a large share of the effort in a machine learning project typically goes into data preparation and exploration.

This chapter is divided into three main sections. The first section discusses the basic data structures R uses to store data. You will become very familiar with these structures as you create and manipulate datasets. The second section is practical: it covers several functions that are useful for getting data in and out of R. The third section illustrates methods for understanding data by exploring a real-world dataset.

By the end of this chapter, you will understand:

  • The basic R data structures and how to use them to store and extract data
  • How to get data into R from a variety of source formats
  • Common methods for understanding and visualizing complex data

Since the way R thinks about data will define the way you think about data, it is helpful to understand the basic R data structures before jumping into data preparation. However, if you are already familiar with R data structures, feel free to skip ahead to the section on data preprocessing.

R data structures

There are numerous types of data structures across programming languages, each with strengths and weaknesses suited to particular tasks. Since R is a programming language used widely for statistical data analysis, the data structures it uses are designed to make it easy to manipulate data for this type of work. The R data structures used most frequently in machine learning are vectors, factors, lists, arrays, and data frames. Each of these structures is specialized for a specific data management task, which makes it important to understand how they will interact in your R project.
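
As a quick orientation, the following minimal sketch creates one small object of each of the five structures named above. The variable names and values (temperature, gender, patient, readings, patients) are hypothetical and chosen only for illustration; they are not taken from a dataset used in this chapter.

```r
# Vector: an ordered set of values that all share one type
temperature <- c(98.1, 98.6, 101.4)

# Factor: a categorical variable with a fixed set of levels
gender <- factor(c("MALE", "FEMALE", "MALE"))

# List: an ordered collection whose elements may differ in type
patient <- list(name = "John Doe", temperature = temperature[1], flu = FALSE)

# Array: values of one type arranged in two or more dimensions
readings <- array(1:6, dim = c(2, 3))

# Data frame: a table whose columns are equal-length vectors or factors
patients <- data.frame(temperature = temperature, gender = gender)

# Inspect the structure of the data frame
str(patients)
```

The call to str() prints a compact summary of each column's type, which is often a convenient first check after creating or loading a data frame.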
