Databases

RapidMiner is able to read from databases with ease. Most databases—such as MySQL, PostgresSQL, SQL Server, Sybase, Oracle, and Access—are supported. A Java Database Connectivity (JDBC) driver is generally available. What this means is that it will generally be possible to read data from virtually any database, but it is often the case that some of the specific configuration details can become complex. These configuration details are beyond the scope of this book, because they stray into the specifics of databases rather than RapidMiner. But, generally speaking, it is straightforward to deploy and configure a new driver to connect to virtually any datasource.

The Read Database operator

The Read Database operator is the main operator used to access databases. It is generally a good thing to have data in databases because it eases the exploration process. Databases allow larger datasets to be stored, provide tools and a query language to allow data to be retrieved and updated, and impose a type on attributes that can make validation easier.

Database connections are created from the RapidMiner Studio GUI via the Tools menu item. The dialog is straightforward to follow and the test connection button allows a quick confirmation that the connection is working correctly.

It is possible to construct SQL queries using macros. This is vital to allow data to be selected specifically using parameters for the task at hand. The setup of the Read Database operator is slightly different from the norm but is straightforward once you have it working.

To illustrate this, a simple working example is given in the following points:

  1. First, the database table that will be queried is a simple single table containing lines read from files. The name of the table is details and the important columns are as follows:
    • line: This corresponds to a line within a file
    • file: This corresponds to the file containing the line
    • source: This gives the source from which the file was copied
    • length: This gives the length of the line

    A small fragment of the data would look something like the following screenshot, where the column headings correspond to the fields in the database and the whole table has a name, for example, details:

    The Read Database operator
  2. To select data from this table, a SQL where clause might be as follows:
    Select * from details where length > 50 and source like '%Desktop2%'
  3. This query will return the fourth row of the previous table.

To configure the Read Database operator to submit a query like this requires the following steps:

  1. First, the Build SQL Query dialog is used to enter the following query in the SQL Query box:
    Select * from details where length > ? and source like ?

    The question marks correspond to the parameters.

  2. Next, the Edit Enumeration dialog is used to create a mapping between the question marks and some macros, shown in the following screenshot. Note how the macro names length and source match the names within the query.
    The Read Database operator
  3. The final thing to do is to create the two macros (macros are discussed in detail in an upcoming section). This can be done using operators or can be done in the process context. The following figure shows the macros defined in the context view:
    The Read Database operator

Note how the source macro value has percent signs around it. This ensures the syntax comes out correctly so that the final issued query is also correct.

Macros are an important part of RapidMiner and are used everywhere. Refer to the Using macros section in a moment to get an overview of these important entities.

Large datasets

When reading large databases using the Read Database operator, all the data is read into memory. Clearly, this will cause problems as the data increases.

In my experience, large database imports should try to take advantage of the tools within the database rather than get RapidMiner to perform significant processing. Databases have many powerful tools to allow data to be transformed with views or summary tables.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset