Getting a sense of your data structure with R

By following the recipes given in the previous chapter, you got your data. Everything went smoothly, and you may also already have the data as a data frame object.

However, do you know what your data looks like?

Getting to know your data structure is a crucial step within a data analysis project. It will suggest the appropriate treatment and analysis, and will help you avoid error and redundancy in the coding activity that follows.

In this recipe, we will examine a dataset's structure by leveraging the describe() function from the Hmisc package. For further preliminary analysis of your data structure, you can also refer to the data visualization recipes in Chapter 3, Basic Visualization Techniques.

Getting ready

This example will be built around a dataset provided in the RStudio project related to this book.

You can download it by authenticating your account at http://packtpub.com.

This dataset is named world_gdp_data.csv and stores GDP values for 248 countries around the globe, from 1960 to 2015.

Before you begin with this recipe, you will need to load this data into R by leveraging the import function from the rio package:

install.packages("rio")
library(rio)
messy_gdp <- import("world_gdp_data.csv")

You can refer to the Loading your data into R with rio packages recipe in Chapter 1, Acquiring Data for Your Project, for details on this powerful tool's functionalities.
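Before building the data dictionary, it can help to confirm that the import produced what you expect. The following is a minimal sketch, not part of the original recipe: it writes a tiny, made-up CSV to a temporary file (the column names country, year, and gdp are hypothetical and may not match world_gdp_data.csv) and uses base R's read.csv() as a stand-in for rio's import(), so it runs without extra packages.

```r
# Hypothetical miniature of a GDP table; the real file's columns may differ.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "country,year,gdp",
  "Italy,2014,2.15e12",
  "Italy,2015,1.83e12",
  "Japan,2014,4.85e12",
  "Japan,2015,"
), csv_path)

# read.csv() stands in for rio::import() so the sketch needs no extra packages.
toy_gdp <- read.csv(csv_path, stringsAsFactors = FALSE)

class(toy_gdp)    # confirm we got a data frame
dim(toy_gdp)      # number of rows and columns
str(toy_gdp)      # column names, types, and first values
head(toy_gdp, 3)  # first three rows
```

The same four calls can be applied to messy_gdp once it has been imported.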

As mentioned earlier, we will employ functions from the Hmisc and e1071 packages.

Use the following code to install and load packages:

install.packages(c("Hmisc","e1071"))
library(Hmisc)
library(e1071)
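If you rerun this recipe often, you may prefer not to reinstall packages that are already available. The following is one possible sketch under that assumption; ensure_packages() is a hypothetical helper written for this example, not a function from any package. Note that requireNamespace() loads a package's namespace as a side effect when the package is found.

```r
# Hypothetical helper (not part of any package): report which packages are
# missing, and install them only when install = TRUE.
ensure_packages <- function(pkgs, install = FALSE) {
  # requireNamespace() returns TRUE when the package can be loaded
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!installed]
  if (install && length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  missing_pkgs
}

# Dry run: lists what would be installed without touching the library.
ensure_packages(c("Hmisc", "e1071"))
```

Calling ensure_packages(c("Hmisc", "e1071"), install = TRUE) would then install only what is missing, after which library() can be called as shown above.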

How to do it...

  1. Create a data dictionary:
    data_dictionary <- describe(messy_gdp)
  2. Save your data dictionary as a separate file to document it:
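    sink("data_dictionary.txt", append=TRUE)  # divert console output to the file
    print(data_dictionary)                    # explicit print() also works in sourced scripts
    sink()                                    # restore output to the console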
  3. Look at your data dictionary:
    file.show(file = "data_dictionary.txt",pager = "internal")

How it works...

Performing step 1 will produce a data_dictionary object, which is a list of as many lists as there are columns in your data frame, plus one, the contents of which we will discover shortly.

For each column, the following details are exposed:

  • Variability domain, showing the lowest and highest values
  • Number of non-missing values
  • Number of missing values
  • Number of unique values
  • For categorical variables (for instance, country names), a frequency table is produced, showing the number of occurrences for each possible value of the variable

The last list is populated only if columns consisting entirely of missing values are found, in which case it contains the names of those columns.
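If you want to check for such columns yourself, a short base R sketch along these lines may help. The toy data frame and its notes column are invented for the example.

```r
# Made-up data: the notes column contains only missing values.
toy <- data.frame(
  country = c("Italy", "Japan", "Chile"),
  gdp     = c(2.1, NA, 0.2),
  notes   = c(NA, NA, NA)
)

# TRUE for each column in which every value is missing
all_missing <- vapply(toy, function(x) all(is.na(x)), logical(1))

names(toy)[all_missing]  # "notes"
```

Columns flagged this way are usually candidates for removal before further analysis.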

Step 2 lets you create a document to which you will be able to refer, even outside R, mainly for documentation purposes. This step will produce a file named data_dictionary.txt in the current working directory of your R session. Because append=TRUE is set, rerunning the step adds to the existing file rather than overwriting it.

Since the data_dictionary object is a list object, we can't simply save it as a .txt file (we could easily do this with a function such as write.table() when dealing with a data frame). So, we used a workaround involving the sink() function.

This function diverts the output that R would normally print to the console to an external connection, such as a file.

The logical phases of this process are as follows:

  1. Establish a connection by running sink() for the first time
  2. Run the R code you are interested in
  3. Close the connection by running sink() again
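The three phases above can be sketched on a small list, writing to a temporary file so that nothing in your working directory is touched. The toy_report object and both file paths are hypothetical. The sketch also shows capture.output(), a base R function that wraps the same idea in a single call; it is offered as an alternative, not as the recipe's method.

```r
out_path <- tempfile(fileext = ".txt")  # hypothetical output location
toy_report <- list(rows = 3, columns = c("country", "gdp"))

sink(out_path)     # 1. open the connection: output is now diverted to the file
print(toy_report)  # 2. run the code whose output you want to keep
sink()             # 3. close the connection: output returns to the console

readLines(out_path)  # the file now holds what print() would have shown

# Alternative: capture.output() opens and closes the diversion for you.
alt_path <- tempfile(fileext = ".txt")
capture.output(print(toy_report), file = alt_path)
```

Forgetting the final sink() is a common slip: the console then appears to stay silent because output is still going to the file.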

Step 3 is the final step and involves calling the file.show() function to display the data dictionary you created earlier. Be aware that changing the pager argument to "console" would make the content of the .txt file show up in the R console instead.
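Pager names can behave differently from one platform to another, so if file.show() does not display the file as expected, a plain base R fallback is to read the file and echo it to the console. The sketch below assumes nothing about the real data dictionary; it uses a temporary file with placeholder lines.

```r
# Placeholder file standing in for data_dictionary.txt.
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line of the file", "second line of the file"), txt_path)

dictionary_lines <- readLines(txt_path)  # one character string per line
writeLines(dictionary_lines)             # echo the content to the console
length(dictionary_lines)                 # number of lines read
```

Replacing txt_path with "data_dictionary.txt" would apply the same approach to the file produced in step 2.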
