Pentaho Data Integration (PDI)

In the previous chapter, we briefly discussed Pentaho Data Integration (PDI), which is part of the Pentaho stack. The Pentaho ecosystem makes it easy to manage voluminous data while also handling high velocity and a wide variety of data, regardless of how many data sources or which data types are involved. PDI delivers "analytics-ready" data to end users much faster, with a choice of visual tools that reduce the time and complexity of the data analytics life cycle. PDI is available as a standalone Community Edition (CE) and also comes bundled with the Pentaho BA Server Enterprise Edition (EE).

PDI has some inherent advantages, such as orchestration and integration across all data stores through its powerful GUI. Its adaptive Big Data Layer supports almost any Big Data source with reduced complexity; by abstracting the data sources away from the analytics, it provides a competitive advantage. Its simple drag-and-drop design supports a rich set of mapping objects, including a GUI-based MapReduce designer for Hadoop, with support for custom plugins developed in Java.

Just a month back, Rackspace brought ETL to the cloud with help from Pentaho, so instead of tying up your local hardware, you can leverage this online service.

We will explore many of these capabilities throughout the remainder of this chapter. The latest stable version of PDI at the time of this writing was 4.4. You can obtain the distribution from SourceForge at http://goo.gl/95Ikgp.

Download pdi-ce-4.4.0-stable.zip or pdi-ce-4.4.0-stable.tar.gz and extract it to any location you prefer. We will refer to the full path of the extracted folder as [PDI_HOME].

To run PDI for the first time, follow these steps:

  1. Navigate to [PDI_HOME] and double-click on Spoon.bat. This will launch Spoon, a GUI application to design and execute a PDI script (job or transformation).
  2. The Repository Connection dialog appears; uncheck the Show this dialog at startup option and click on the Cancel button to close it.
  3. The Spoon Tips... dialog appears; click on Close.
  4. The Spoon application window appears, where you can create jobs and transformations.

The Pentaho Big Data plugin configuration

PDI 4.4 ships with a Big Data plugin that is not compatible with the Hortonworks distribution, so we need to download and configure a newer version of the plugin to make it work. Close Spoon if it is currently running.

To set up the Big Data plugin for PDI, follow these steps:

  1. Visit http://ci.pentaho.com.
  2. Click on the Big Data menu tab.
  3. Click on pentaho-big-data-plugin-1.3 or the latest project available.
  4. Download the ZIP version of the pentaho-big-data-plugin file. At the time of this writing, the latest version of the file was pentaho-big-data-plugin-1.3-SNAPSHOT.zip.

    The following screenshot shows the latest version of the Pentaho Big Data plugin:

  5. Delete the [PDI_HOME]/plugins/pentaho-big-data-plugin folder. Replace it with the contents of the ZIP file.
  6. Edit the [PDI_HOME]/plugins/pentaho-big-data-plugin/plugin.properties file and change the active.hadoop.configuration property from hadoop-20 to hdp12, which represents the Hortonworks Data Platform (see the properties snippet after this list).
  7. Copy the core-site.xml file from the VM's /etc/hadoop/conf.empty folder using secure FTP (see Appendix B, Hadoop Setup) into [PDI_HOME]/plugins/pentaho-big-data-plugin/hadoop-configurations/hdp12/. Edit the file, replacing the sandbox hostname with a working IP address (see the XML snippet after this list).
  8. Extract the [PDI_HOME]/plugins/pentaho-big-data-plugin/pentaho-mapreduce-libraries.zip file into any folder you prefer. We will refer to the full path of the extracted folder as [PDI_MR_LIB].
  9. Delete [PDI_HOME]/libext/JDBC/pentaho-hadoop-hive-jdbc-shim-1.3.0.jar and replace it with [PDI_MR_LIB]/lib/pentaho-hadoop-hive-jdbc-shim-1.3-SNAPSHOT.jar.
  10. Copy the [PDI_MR_LIB]/lib/pentaho-hadoop-shims-api-1.3-SNAPSHOT.jar file into the [PDI_HOME]/libext/ folder.
  11. Delete [PDI_HOME]/lib/kettle-core.jar and replace the file with [PDI_MR_LIB]/lib/kettle-core-4.4.2-SNAPSHOT.jar.
  12. Delete [PDI_HOME]/lib/kettle-db.jar and replace the file with [PDI_MR_LIB]/lib/kettle-db-4.4.2-SNAPSHOT.jar.
  13. Delete [PDI_HOME]/lib/kettle-engine.jar and replace the file with [PDI_MR_LIB]/lib/kettle-engine-4.4.2-SNAPSHOT.jar.
  14. Copy and replace all the remaining JARs from [PDI_MR_LIB]/lib into [PDI_HOME]/libext.
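
The following is a minimal sketch of the relevant portion of [PDI_HOME]/plugins/pentaho-big-data-plugin/plugin.properties after step 6. The property name and the hdp12 value are the ones used in this chapter; the comment lines are only illustrative:

    # The Hadoop configuration (shim) that PDI loads at startup.
    # Changed from the default hadoop-20 to the Hortonworks shim.
    active.hadoop.configuration=hdp12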
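
Similarly, after step 7 the core-site.xml copied into the hdp12 folder should point at your sandbox by IP address rather than by hostname. The property shown below (fs.default.name) and the sample address 192.168.1.100 are assumptions for illustration only; keep the property names already present in your file and substitute your VM's actual IP address:

    <configuration>
      <property>
        <!-- NameNode URI: the sandbox hostname replaced with a working IP address -->
        <name>fs.default.name</name>
        <value>hdfs://192.168.1.100:8020</value>
      </property>
    </configuration>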