In the previous chapter, we briefly discussed Pentaho Data Integration (PDI), a part of the Pentaho stack. The Pentaho ecosystem makes it easy to manage voluminous data while coping with high velocity and a wide variety of data sources and data types. PDI delivers analytics-ready data to end users much faster, with a choice of visual tools that reduce the time and complexity of the data analytics life cycle. PDI is available as a standalone Community Edition (CE) and also comes bundled with the Pentaho BA Server Enterprise Edition (EE).
PDI has some inherent advantages, such as orchestration and integration across all data stores through its powerful GUI. Its adaptive Big Data Layer supports almost any Big Data source with reduced complexity; in this way, the data layer is abstracted from analytics, giving a competitive advantage. Its simple drag-and-drop design supports a rich set of mapping objects, including a GUI-based MapReduce designer for Hadoop, with support for custom plugins developed in Java.
Just a month before this writing, Rackspace brought ETL to the cloud with help from Pentaho, so rather than straining your local hardware, you can leverage this online service.
We will explore many of these capabilities throughout the remainder of the chapter. The latest stable version of PDI at the time of this writing was 4.4. You can obtain the distribution from SourceForge at http://goo.gl/95Ikgp.
Download pdi-ce-4.4.0-stable.zip or 4.4.0-stable.tar.gz and extract it to any location you prefer. We will refer to the extraction's full path as [PDI_HOME].
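The extraction step can be sketched as a small shell function. This is a minimal sketch, assuming a POSIX shell; the `data-integration` folder name inside the archive and the target path are assumptions based on how the PDI 4.4 archives are laid out, so adjust them to what you actually see after extracting.

```shell
# Extract a downloaded PDI archive and record its location as PDI_HOME.
# Handles both the .zip and the .tar.gz download.
install_pdi() {
  archive="$1"   # path to the downloaded PDI archive
  target="$2"    # directory to extract into
  mkdir -p "$target"
  case "$archive" in
    *.zip)    unzip -q "$archive" -d "$target" ;;
    *.tar.gz) tar -xzf "$archive" -C "$target" ;;
  esac
  # The archive unpacks into a data-integration folder; that is [PDI_HOME].
  export PDI_HOME="$target/data-integration"
}
```

For example, `install_pdi pdi-ce-4.4.0-stable.zip /opt` would leave `PDI_HOME` pointing at `/opt/data-integration` (the `/opt` target here is just an illustration).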
To run PDI for the first time, follow these steps:

1. Go to [PDI_HOME] and double-click on Spoon.bat. This will launch Spoon, a GUI application used to design and execute PDI scripts (jobs and transformations).

PDI 4.4 comes with a Big Data plugin that is not compatible with the Hortonworks distribution. We need to download and configure PDI with the new version of the plugin to make it work. Close Spoon if it is running.
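On Linux or Mac, Spoon can also be started from a terminal via the spoon.sh launcher shipped next to Spoon.bat. A minimal sketch, assuming a PDI_HOME environment variable pointing at [PDI_HOME]:

```shell
# Launch Spoon from the PDI installation folder.
launch_spoon() {
  cd "$PDI_HOME" || return 1
  sh ./spoon.sh "$@"   # Unix equivalent of double-clicking Spoon.bat
}
```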
To set up the Big Data plugin for PDI, follow these steps:

1. Download pentaho-big-data-plugin-1.3, or the latest project available. At the time of this writing, the latest version of the pentaho-big-data-plugin file was pentaho-big-data-plugin-1.3-SNAPSHOT.zip. The following screenshot shows the latest version of the Pentaho Big Data plugin:
2. Delete the contents of the [PDI_HOME]/plugins/pentaho-big-data-plugin folder and replace them with the contents of the ZIP file.
3. Open the [PDI_HOME]/plugins/pentaho-big-data-plugin/plugin.properties file and change the active.hadoop.configuration property from hadoop-20 to hdp12, which represents the Hortonworks Data Platform.
4. Copy the core-site.xml file from the VM's /etc/hadoop/conf.empty folder using secure FTP (see Appendix B, Hadoop Setup) into [PDI_HOME]/plugins/pentaho-big-data-plugin/hadoop-configurations/hdp12/. Edit the file, replacing the sandbox host with a working IP address.
5. Extract the [PDI_HOME]/plugins/pentaho-big-data-plugin/pentaho-mapreduce-libraries.zip file into any folder you prefer. We will refer to the extraction's full path as [PDI_MR_LIB].
6. Delete [PDI_HOME]/libext/JDBC/pentaho-hadoop-hive-jdbc-shim-1.3.0.jar and replace it with [PDI_MR_LIB]/lib/pentaho-hadoop-hive-jdbc-shim-1.3-SNAPSHOT.jar.
7. Copy the [PDI_MR_LIB]/lib/pentaho-hadoop-shims-api-1.3-SNAPSHOT.jar file into the [PDI_HOME]/libext/ folder.
8. Delete [PDI_HOME]/lib/kettle-core.jar and replace it with [PDI_MR_LIB]/lib/kettle-core-4.4.2-SNAPSHOT.jar.
9. Delete [PDI_HOME]/lib/kettle-db.jar and replace it with [PDI_MR_LIB]/lib/kettle-db-4.4.2-SNAPSHOT.jar.
10. Delete [PDI_HOME]/lib/kettle-engine.jar and replace it with [PDI_MR_LIB]/lib/kettle-engine-4.4.2-SNAPSHOT.jar.
11. Copy the remaining files from [PDI_MR_LIB]/lib into [PDI_HOME]/libext.
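The property change and jar swaps above can be sketched as a single shell function. This is a minimal sketch, assuming GNU sed and the SNAPSHOT file names current at the time of writing; the step of copying core-site.xml from the VM over secure FTP and editing its host remains a manual task.

```shell
# Apply the Big Data plugin setup to an extracted PDI installation.
setup_big_data_plugin() {
  pdi_home="$1"     # the [PDI_HOME] folder
  pdi_mr_lib="$2"   # the [PDI_MR_LIB] folder

  # Point PDI at the Hortonworks (hdp12) Hadoop configuration.
  sed -i 's/^active\.hadoop\.configuration=.*/active.hadoop.configuration=hdp12/' \
    "$pdi_home/plugins/pentaho-big-data-plugin/plugin.properties"

  # Swap the Hive JDBC shim for the SNAPSHOT build.
  rm -f "$pdi_home/libext/JDBC/pentaho-hadoop-hive-jdbc-shim-1.3.0.jar"
  cp "$pdi_mr_lib/lib/pentaho-hadoop-hive-jdbc-shim-1.3-SNAPSHOT.jar" \
     "$pdi_home/libext/JDBC/"

  # Add the shims API jar.
  cp "$pdi_mr_lib/lib/pentaho-hadoop-shims-api-1.3-SNAPSHOT.jar" "$pdi_home/libext/"

  # Replace the three kettle jars with their 4.4.2-SNAPSHOT versions.
  for name in core db engine; do
    rm -f "$pdi_home/lib/kettle-$name.jar"
    cp "$pdi_mr_lib/lib/kettle-$name-4.4.2-SNAPSHOT.jar" "$pdi_home/lib/"
  done
}
```

A call such as `setup_big_data_plugin "$PDI_HOME" "$PDI_MR_LIB"` performs the swaps in one go; if your downloaded SNAPSHOT versions differ, edit the file names in the function to match.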