Reading data

Data comes in a wide variety of formats and structures. However, thanks to R's extensibility and its community of contributors, there is a package to load almost every data structure (at least the standard ones) into R. To do so, you use functions whose arguments vary according to the nature of the data.

Delimited data

All the delimited formats in R are read with the same base function, read.table(). This function takes many arguments, but most of them have default values. The following is a list of the most important ones; a short example combining them appears after the note below:

  • header: If it is set to TRUE, the first row is used as the column names of the data frame.
  • nrows: The maximum number of rows to read. If it is set to -1, all rows are read.
  • skip: The number of rows to skip before reading starts.
  • encoding: If the data source contains non-ASCII characters (for example, words in languages other than English), the encoding can be specified.

    Note

    For information about the rest of the arguments, visit https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html.
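
The following is a minimal sketch combining these arguments; the file name data.txt is hypothetical and only illustrates the call:

#Skip the first line, use the next one as column names, and
#read at most 10 data rows, assuming UTF-8 encoding
data <- read.table("data.txt", header = TRUE, nrows = 10,
                   skip = 1, encoding = "UTF-8")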

The only argument without a default value in this function and its derivatives (read.csv(), read.delim(), and so on) is file, the path to the data input file. The path can be local or a URL. However, it is usually useful (and safer) to specify the delimiter explicitly, since read.table() uses whitespace as its default delimiter:

#URL to Iris Dataset
path <- "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

#Load dataset with generic read.table()
data <- read.table(path, sep=",")

The output of this function is a data.frame, as shown here:

> class(data)
[1] "data.frame"

Reading line by line

The function to read text line by line is readLines(). As with read.table(), the only required argument is the file path or a connection object (connection objects will not be covered here; for further information, visit https://stat.ethz.ch/R-manual/R-devel/library/base/html/connections.html). The readLines() function basically reads the text and splits it at each newline character ("\n"), as follows:

#URL to Iris Dataset
path <- "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

#Load dataset with readLines()
data <- readLines(path)

The output of readLines() is a character vector whose elements correspond to the lines of the read file, as shown here:

> class(data)
[1] "character"
> length(data)
[1] 151
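
Each element of the vector is a raw line of text, so any parsing is left to you. As a minimal sketch, the lines can be split into their comma-separated fields with strsplit(); this step is one common option, not part of the original example:

#Split each line into its comma-separated fields
fields <- strsplit(data, split = ",")

#The first line as a character vector of five fields
fields[[1]]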

Reading a character set

The function to read characters is readChar(). In this case, not only the file path or a connection object is needed, but also the number of characters to read (the nchars argument). If nchars is greater than the total number of characters in the file, reading simply stops at the end of the file, as follows:

#URL to Iris Dataset
path <- "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

#Load dataset with readChar()
data <- readChar(path, nchars = 1e5)

Tip

A number in the XeY format is in scientific notation and equals X × 10^Y; for example, 1e5 is 100,000.

The output of readChar() is a single character string, that is, a character vector of length 1, as the following code shows:

> class(data)
[1] "character"
> length(data)
[1] 1
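
Since readChar() returns one long string, a common follow-up step is to split it back into lines. Here is a brief sketch using strsplit() on the newline character:

#Split the single string on newlines to recover the lines
lines <- strsplit(data, split = "\n")[[1]]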

Reading JSON

JSON is an acronym that stands for JavaScript Object Notation; it is a semi-structured data storage format, which will be discussed later in this book. As the functions to read JSON do not come in the default packages, installing new packages is required. The two packages commonly used for this purpose are RJSONIO and rjson. Although both packages return similar results, the main difference between them is that the first one can load data from connections directly, while the second one needs an intermediate step to load the data into R.

Here's an example with RJSONIO:

#Load RJSONIO
library(RJSONIO)

#URL Public API Worldbank Data Catalog in JSON format
url <- "http://api.worldbank.org/v2/datacatalog?format=json"

#Read data directly from url
json <- fromJSON(url)

Here's an example with rjson:

#Load rjson
library(rjson)

#URL Public API Worldbank Data Catalog in JSON format
url <- "http://api.worldbank.org/v2/datacatalog?format=json"

#Read the raw JSON string with readChar()
raw.json <- readChar(url, nchars = 1e6)

#Format into JSON
json <- fromJSON(raw.json)

As both packages export the same function names, the package loaded last masks the functions of the one loaded first. For instance, if rjson is loaded after RJSONIO, fromJSON() will work as defined in rjson, not as in RJSONIO. In such cases, you will receive this message:

library(RJSONIO)
library(rjson)
## 
## Attaching package: 'rjson'
## 
## The following objects are masked from 'package:RJSONIO':
## 
##     fromJSON, toJSON
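
To call a masked function without detaching a package, you can qualify the call with the :: operator, which is standard R behavior:

#Call each package's version of fromJSON() explicitly
json.a <- RJSONIO::fromJSON(url)
json.b <- rjson::fromJSON(readChar(url, nchars = 1e6))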

The output in both cases is a list.
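
The exact structure of the list depends on the API response, so a quick way to explore it is str() with a depth limit (a generic inspection sketch, not specific to this API):

#Inspect only the top two levels of the nested list
str(json, max.level = 2)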

Reading XML

XML stands for Extensible Markup Language and is another semi-structured data storage format. Although it has lately been displaced by JSON in many applications, XML is still frequently found, for example, in feeds. To read XML files, the XML package is recommended. This package has a large number of functions. The following is an example of how to load XML data into R:

#Load XML library
library(XML)

#URL Public API Worldbank Data Catalog in XML format
url <- "http://api.worldbank.org/v2/datacatalog?format=xml"

#Load XML document
xml.obj <- xmlTreeParse(url)

The object returned is of the XMLDocument class:

> class(xml.obj)
[1] "XMLDocument"     "XMLAbstractDocument"

Reading databases – SQL

The packages used to interface with relational databases are RODBC for ODBC connectivity and RJDBC for JDBC. For obvious reasons, a fully reproducible example is impossible here, as it would require access to a live database. To use these packages and understand their capabilities in depth, prior knowledge of ODBC/JDBC is required.

The documentation is available at http://cran.r-project.org/web/packages/RODBC/RODBC.pdf and http://cran.r-project.org/web/packages/RJDBC/RJDBC.pdf.
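
As a rough illustration only, a typical RODBC session looks like the following; the DSN, credentials, and table name are entirely hypothetical and must match an ODBC source configured on your system:

#Load RODBC
library(RODBC)

#Open a connection to a preconfigured ODBC data source
channel <- odbcConnect("my_dsn", uid = "user", pwd = "password")

#Run a query and retrieve the result as a data frame
data <- sqlQuery(channel, "SELECT * FROM my_table")

#Close the connection when done
odbcClose(channel)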

Reading data from external sources

For almost every tabular data file format, there is a package to import it into R; going further into this is out of the scope of this book. The most important ones are xlsx (for Excel files), Hmisc (for SAS and SPSS portable files), and foreign (for SAS, SPSS, Stata, Octave, and Weka, among others). However, when possible, it is always preferable to convert any of these files to a standard text format, such as .csv, to avoid unexpected (and sometimes very difficult to solve) problems.
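
If you do need to read such a file directly, the following is a hedged sketch using read.spss() from the foreign package; the file name is hypothetical:

#Load foreign
library(foreign)

#Read an SPSS file directly into a data frame
data <- read.spss("survey.sav", to.data.frame = TRUE)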
