Chapter 12. Specialized Machine Learning Topics

Congratulations on reaching this point in your machine learning journey! If you have not already started work on your own projects, you will do so soon. And in doing so, you may find that the task of turning data into action is more difficult than it first appeared.

As you gathered data, you may have realized that the information was trapped in a proprietary format or spread across pages on the Web. Making matters worse, after spending hours reformatting the data, maybe your computer slowed to a crawl after running out of memory. Perhaps R even crashed or froze your machine. Hopefully, you were undeterred, as these issues can be remedied with a bit more effort.

This chapter covers techniques that may not apply to every project, but will prove useful for working around such specialized issues. You might find the information particularly useful if you tend to work with data that is:

  • Stored in unstructured or proprietary formats such as web pages, web APIs, or spreadsheets
  • From a specialized domain such as bioinformatics or social network analysis
  • Too large to fit in memory, or requiring analyses that take a very long time to complete

You're not alone if you suffer from any of these problems. Although there is no panacea—these issues are the bane of the data scientist as well as the reason data skills are in high demand—through the dedicated efforts of the R community, a number of R packages provide a head start toward solving the problem.

This chapter provides a cookbook of such solutions. Even if you are an experienced R veteran, you may discover a package that simplifies your workflow. Or, perhaps one day, you will author a package that makes work easier for everybody else!

Working with proprietary files and databases

Unlike the examples in this book, real-world data is rarely packaged in a simple CSV form that can be downloaded from a website. Instead, significant effort is needed to prepare data for analysis. Data must be collected, merged, sorted, filtered, or reformatted to meet the requirements of the learning algorithm. This process is informally known as data munging or data wrangling.

Data preparation has become even more important, as the size of typical datasets has grown from megabytes to gigabytes, and data is gathered from unrelated and messy sources, many of which are stored in massive databases. Several packages and resources for retrieving and working with proprietary data formats and databases are listed in the following sections.

Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files

A frustrating aspect of data analysis is the large amount of work required to pull and combine data from various proprietary formats. Vast troves of data exist in files and databases that simply need to be unlocked for use in R. Thankfully, packages exist for exactly this purpose.

What used to be a tedious and time-consuming process, requiring knowledge of specific tricks and tools across multiple R packages, has been made trivial by a relatively new R package called rio (an acronym for R input and output). This package, by Chung-hong Chan, Geoffrey CH Chan, Thomas J. Leeper, and Christopher Gandrud, is described as a "Swiss-army knife for data". It is capable of importing and exporting a large variety of file formats, including but not limited to: tab-separated (.tsv), comma-separated (.csv), JSON (.json), Stata (.dta), SPSS (.sav and .por), Microsoft Excel (.xls and .xlsx), Weka (.arff), and SAS (.sas7bdat and .xpt).

Note

For the complete list of file types rio can import and export, as well as more detailed usage examples, see http://cran.r-project.org/web/packages/rio/vignettes/rio.html.

The rio package consists of three functions for working with proprietary data formats: import(), export(), and convert(). Each does exactly what you would expect, given its name. Consistent with the package's philosophy of keeping things simple, each function uses the filename extension to guess the type of file to import, export, or convert.

For example, to import the credit data from previous chapters, which is stored in CSV format, simply type:

> library(rio)
> credit <- import("credit.csv")

This creates the credit data frame as expected. As a bonus, not only did we not have to specify the CSV file type, but rio also set stringsAsFactors = FALSE along with other reasonable defaults.
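To verify this, you can examine the structure of the imported object using base R's str() function; columns containing text should appear as chr rather than Factor:

> str(credit)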

To export the credit data frame to Microsoft Excel (.xlsx) format, use the export() function while specifying the desired filename, as follows. For other formats, simply change the file extension to the desired output type:

> export(credit, "credit.xlsx")

It is also possible to convert the CSV file to another format directly, without an import step, using the convert() function. For example, this converts the credit.csv file to Stata (.dta) format:

> convert("credit.csv", "credit.dta")
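If a file's extension does not identify its format (for instance, comma-separated data saved with a .txt extension), the format can be stated explicitly. The following is a minimal sketch using rio's format argument and a hypothetical file named credit.txt:

> credit <- import("credit.txt", format = "csv")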

Though the rio package covers many common proprietary data formats, it does not do everything. The next section covers other ways to get data into R via database queries.

Querying data in SQL databases

Large datasets are often stored in Database Management Systems (DBMSs) such as Oracle, MySQL, PostgreSQL, Microsoft SQL Server, or SQLite. These systems allow the datasets to be accessed using Structured Query Language (SQL), a programming language designed to pull data from databases. If your DBMS is configured to allow Open Database Connectivity (ODBC), the RODBC package by Brian Ripley can be used to import this data directly into an R data frame.

Tip

If you have trouble using ODBC to connect to your database, you may try one of the DBMS-specific R packages. These include ROracle, RMySQL, RPostgreSQL, and RSQLite. Though they function much like the instructions here, refer to the package documentation on CRAN for instructions specific to each package.
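These packages follow R's DBI interface rather than ODBC. As an illustration, the following is a minimal sketch using RSQLite and a hypothetical database file named my_database.sqlite; the pattern is similar for the other packages, with only the connection details changing:

> library(DBI)
> library(RSQLite)
> # open a connection to the SQLite database file
> con <- dbConnect(RSQLite::SQLite(), dbname = "my_database.sqlite")
> # run a query and return the result as a data frame
> results_df <- dbGetQuery(con, "select * from my_table where my_value = 1")
> # close the connection when finished
> dbDisconnect(con)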

ODBC is a standard protocol for connecting to databases regardless of operating system or DBMS. If you have previously connected to an ODBC database, you most likely referred to it via its Data Source Name (DSN). You will need the DSN, plus a username and password (if your database requires it), to use RODBC.

Tip

The instructions to configure an ODBC connection are highly specific to the combination of the OS and DBMS. If you are having trouble setting up an ODBC connection, check with your database administrator. Another way to obtain help is via the RODBC package vignette, which can be accessed in R with the vignette("RODBC") command after the RODBC package has been installed.

To open a connection called my_db for the database with the my_dsn DSN, use the odbcConnect() function:

> library(RODBC)
> my_db <- odbcConnect("my_dsn")

Alternatively, if your ODBC connection requires a username and password, they should be specified when calling the odbcConnect() function:

> my_db <- odbcConnect("my_dsn",
    uid = "my_username",
    pwd = "my_password")

With an open database connection, we can use the sqlQuery() function to create an R data frame from the database rows pulled by an SQL query. This function, like many functions that create data frames, allows us to specify stringsAsFactors = FALSE to prevent R from automatically converting character data into factors.

The sqlQuery() function uses typical SQL queries, as shown in the following command:

> sql_query <- "select * from my_table where my_value = 1"
> results_df <- sqlQuery(channel = my_db, query = sql_query,
    stringsAsFactors = FALSE)

The resulting results_df object is a data frame containing all of the rows selected using the SQL query stored in sql_query.

Once you are done using the database, the connection can be closed using the following command:

> odbcClose(my_db)

Although R will automatically close ODBC connections at the end of an R session, it is a better practice to do so explicitly.
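If your database work happens inside a function, one way to make sure the connection is closed even when a query fails is R's on.exit() mechanism. The following is a minimal sketch under that assumption; the function name and query are hypothetical:

> run_query <- function(dsn, sql) {
    my_db <- odbcConnect(dsn)
    on.exit(odbcClose(my_db))  # closes the connection when the function exits, even on error
    sqlQuery(channel = my_db, query = sql, stringsAsFactors = FALSE)
  }
> results_df <- run_query("my_dsn", "select * from my_table where my_value = 1")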
