By following the recipes given in the previous chapter, you got your data. Everything went smoothly, and you may already have the data as a data frame object.
However, do you know what your data looks like?
Getting to know your data structure is a crucial step within a data analysis project. It will suggest the appropriate treatment and analysis, and will help you avoid error and redundancy in the coding activity that follows.
In this recipe, we will look at a dataset structure by leveraging the describe() function from the Hmisc package. For further preliminary analysis on your data structure, you can also refer to the data visualization recipes in Chapter 3, Basic Visualization Techniques.
This example will be built around a dataset provided in the RStudio project related to this book.
You can download it by authenticating your account at http://packtpub.com.
This dataset is named world_gdp_data.csv and stores GDP values for 248 countries around the globe, from 1960 to 2015.
Before you begin with this recipe, you will need to load this data into R by leveraging the import() function from the rio package:
install.packages("rio")
library(rio)
messy_gdp <- import("world_gdp_data.csv")
You can refer to the Loading your data into R with rio packages recipe in Chapter 1, Acquiring Data for Your Project, for details on this powerful tool's functionalities.
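Once the import has run, a quick look at the result helps confirm that the data arrived as expected. The following is a minimal sketch using base R only. Since the real file is not reproduced here, it works on toy_gdp, a hypothetical three-row stand-in invented for illustration; its column names and values are assumptions and the layout of world_gdp_data.csv may differ.

```r
# Hypothetical stand-in for messy_gdp, invented for illustration only;
# the real world_gdp_data.csv may be laid out differently.
toy_gdp <- data.frame(
  country = c("A", "B", "C"),
  `1960` = c(1.2, NA, 3.4),
  `1961` = c(1.3, 2.1, NA),
  check.names = FALSE,        # keep the year-like column names unchanged
  stringsAsFactors = FALSE
)

dim(toy_gdp)        # number of rows and columns
str(toy_gdp)        # type of each column and its first values
head(toy_gdp, 2)    # first two rows

# Count missing values in each column
na_per_column <- colSums(is.na(toy_gdp))
na_per_column
```

Running the same four calls on messy_gdp gives a first impression of the data before moving on to describe().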
As mentioned earlier, we will employ functions from the Hmisc and e1071 packages.
Use the following code to install and load packages:
install.packages(c("Hmisc", "e1071"))
library(Hmisc)
library(e1071)
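1. Run describe() on the data frame:
data_dictionary <- describe(messy_gdp)
2. Save the resulting object to a .txt file:
sink("data_dictionary.txt", append = TRUE)
data_dictionary
sink()
3. Show the data dictionary:
file.show(file = "data_dictionary.txt", pager = "internal")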
Performing step 1 will produce a data_dictionary object, which is a list containing one list for each column in your data frame plus one additional list, the contents of which we are going to discover shortly.
For each column, the following details are exposed:
The last list is populated only if columns consisting entirely of missing values are found, and it contains the names of those columns.
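To see this structure for yourself, you can inspect the object with a few base R functions. The sketch below assumes that Hmisc is installed and, to stay self-contained, uses toy_gdp, a hypothetical data frame invented for illustration in place of messy_gdp. The exact components printed by str() depend on the installed Hmisc version, so none are listed here.

```r
# Hypothetical toy data frame standing in for messy_gdp
toy_gdp <- data.frame(
  country = c("A", "B", "C", "D"),
  gdp_1960 = c(1.2, NA, 3.4, 2.2),
  stringsAsFactors = FALSE
)

if (requireNamespace("Hmisc", quietly = TRUE)) {
  toy_dictionary <- Hmisc::describe(toy_gdp)

  print(class(toy_dictionary))    # class of the returned object
  print(names(toy_dictionary))    # names of the per-column entries

  # Top-level components stored for the first described column
  str(toy_dictionary[[1]], max.level = 1)
} else {
  message("Hmisc is not installed; run install.packages(\"Hmisc\") first")
}
```

Replacing toy_gdp with messy_gdp applies the same inspection to the real data dictionary.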
Step 2 lets you create a document to which you will be able to refer, even outside R, mainly for documentation purposes. This step will produce a .txt file named data_dictionary placed within the current working directory of your R session.
Since the data_dictionary object is a list object, we can't simply save it as a .txt file (we could easily do this with the write.table() function when dealing with a data frame). So, we used a workaround involving the sink() function.
This function sends the output of R to an external connection.
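The logical phases of this process are as follows:
1. Call sink() for the first time to divert the output of R to the data_dictionary.txt file
2. Print the data_dictionary object, so that its output is written to that file
3. Call sink() again to stop the diversion and send output back to the console
Step 3 is the final step and involves calling the file.show() function to show you your previously created data dictionary. Be aware that changing the pager argument to console would make the .txt file content show up in the R console.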
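The sink() workflow can be tried in isolation before applying it to a real data dictionary. The following is a minimal sketch using base R only: toy_summary is a hypothetical object standing in for data_dictionary, and the output goes to a temporary file rather than to data_dictionary.txt.

```r
# Hypothetical object standing in for data_dictionary;
# any printable R object behaves the same way.
toy_summary <- list(country = "A", years = 1960:1962)

# Temporary output path, assumed here for illustration
out_file <- tempfile(fileext = ".txt")

sink(out_file)        # phase 1: divert console output to the file
print(toy_summary)    # phase 2: printed output is now written to the file
sink()                # phase 3: send output back to the console

# Read the file back to confirm what was written
written <- readLines(out_file)
written
```

The explicit print() call is a deliberate choice: typing an object's name prints it only at the top level of an interactive session, so print() keeps the sketch working inside functions and in scripts run with source().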