RapidMiner is able to read from databases with ease. Most databases—such as MySQL, PostgresSQL, SQL Server, Sybase, Oracle, and Access—are supported. A Java Database Connectivity (JDBC) driver is generally available. What this means is that it will generally be possible to read data from virtually any database, but it is often the case that some of the specific configuration details can become complex. These configuration details are beyond the scope of this book, because they stray into the specifics of databases rather than RapidMiner. But, generally speaking, it is straightforward to deploy and configure a new driver to connect to virtually any datasource.
The Read Database
operator is the main operator used to access databases. It is generally a good thing to have data in databases because it eases the exploration process. Databases allow larger datasets to be stored, provide tools and a query language to allow data to be retrieved and updated, and impose a type on attributes that can make validation easier.
Database connections are created from the RapidMiner Studio GUI via the Tools menu item. The dialog is straightforward to follow and the test connection button allows a quick confirmation that the connection is working correctly.
It is possible to construct SQL queries using macros. This is vital to allow data to be selected specifically using parameters for the task at hand. The setup of the Read Database
operator is slightly different from the norm but is straightforward once you have it working.
To illustrate this, a simple working example is given in the following points:
details
and the important columns are as follows:A small fragment of the data would look something like the following screenshot, where the column headings correspond to the fields in the database and the whole table has a name, for example, details
:
where
clause might be as follows:Select * from details where length > 50 and source like '%Desktop2%'
To configure the Read Database
operator to submit a query like this requires the following steps:
Select * from details where length > ? and source like ?
The question marks correspond to the parameters.
length
and source
match the names within the query.Note how the source macro value has percent signs around it. This ensures the syntax comes out correctly so that the final issued query is also correct.
Macros are an important part of RapidMiner and are used everywhere. Refer to the Using macros section in a moment to get an overview of these important entities.
When reading large databases using the Read Database
operator, all the data is read into memory. Clearly, this will cause problems as the data increases.
In my experience, large database imports should try to take advantage of the tools within the database rather than get RapidMiner to perform significant processing. Databases have many powerful tools to allow data to be transformed with views or summary tables.