Data source preparation

Throughout the chapter, we will continue working with CTools to build a more interactive dashboard. We will use the nyse_stocks data (please refer to Appendix B, Hadoop Setup, for more details), but we need to change its structure. The data source for the dashboard will be a PDI transformation.

Repopulating the nyse_stocks Hive table

Execute the following steps:

  1. Launch Spoon.
  2. Open the nyse_stock_transfer.ktr file from the chapter's code folder.
  3. Move NYSE-2000-2001.tsv.gz into the same folder as the transformation file.
  4. Run the transformation and wait until it finishes. This process produces the NYSE-2000-2001-convert.tsv.gz file.
  5. Launch your web browser and open the Sandbox at http://192.168.1.122:8000.
  6. On the menu bar, choose the File Browser menu.
  7. When the File Browser window appears, click on the Upload button and choose Files. Navigate to your NYSE-2000-2001-convert.tsv.gz file and wait until the upload finishes.
  8. On the menu bar, choose the HCatalog menu.
  9. In the submenu bar, click on the Tables menu. From here, drop the existing nyse_stocks table.
  10. On the left-hand side pane, click on the Create a new table from a file link.
  11. In the Table Name textbox, type nyse_stocks.
  12. Click on the NYSE-2000-2001-convert.tsv.gz file. If the file is not listed, make sure you navigate to the correct user path.
  13. On the Create a new table from a file page, accept all the options and click on the Create Table button.
  14. Once it is finished, the page redirects to the HCatalog Table List. Click on the Browse Data button next to nyse_stocks and make sure the month and year columns are now available.
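
For reference, the HCatalog wizard performs roughly the equivalent of the following Hive statements. This is a hedged sketch: the column list is assumed from the NYSE dataset plus the month and year columns added by the transformation, and the HDFS path depends on the user you uploaded the file as.

    -- Recreate the table; column names and types are assumptions,
    -- not the wizard's exact output
    DROP TABLE IF EXISTS nyse_stocks;
    CREATE TABLE nyse_stocks (
      `exchange` STRING,
      stock_symbol STRING,
      `date` STRING,
      stock_price_open DOUBLE,
      stock_price_high DOUBLE,
      stock_price_low DOUBLE,
      stock_price_close DOUBLE,
      stock_volume BIGINT,
      stock_price_adj_close DOUBLE,
      month STRING,
      year STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Load the uploaded file; adjust the path to your own user directory
    LOAD DATA INPATH '/user/hue/NYSE-2000-2001-convert.tsv.gz' INTO TABLE nyse_stocks;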

In Chapter 2, Setting Up the Ground, we learned that an Action Sequence can execute any step in a PDI script. However, since it needs to list the step's metadata using the getMetaData method of the PreparedStatement class, this becomes problematic for a Hive connection: Hive's JDBC driver does not implement getMetaData. Therefore, we need to work around this by using Java code that utilizes the Statement class instead of PreparedStatement in PDI's User Defined Java Class step.
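
To make the distinction concrete, here is a minimal, self-contained sketch of the Statement-based approach. The driver class, connection URL, and query are illustrative assumptions; adapt them to your Hive version and Sandbox address (newer Hive versions use org.apache.hive.jdbc.HiveDriver and a jdbc:hive2:// URL).

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveStatementQuery {
      public static void main(String[] args) throws Exception {
        // Illustrative driver and URL for the Sandbox's HiveServer
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive://192.168.1.122:10000/default", "", "");
        // A plain Statement avoids PreparedStatement.getMetaData(),
        // which Hive's JDBC driver does not implement
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery(
            "SELECT stock_symbol, stock_price_close FROM nyse_stocks LIMIT 10");
        while (res.next()) {
          System.out.println(res.getString(1) + ": " + res.getDouble(2));
        }
        con.close();
      }
    }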

Pentaho's data source integration

Execute the following steps:

  1. Launch Spoon.
  2. Open hive_java_query.ktr from the chapter's code folder. This transformation acts as our data source.
  3. The transformation consists of several steps, but the most important are the three initial ones:
    • Generate Rows: This generates a single data row and triggers the execution of the next steps in the sequence, Get Variable and User Defined Java Class
    • Get Variable: This enables the transformation to read a variable and convert it into a row field holding the variable's value
    • User Defined Java Class: This contains the Java code that queries the Hive data
  4. Double-click on the User Defined Java Class step. The code begins by importing all the required Java packages, followed by the processRow() method. The code is essentially a query against the Hive database using JDBC objects. What makes it different is the following code:
    // Run the query through a Statement rather than a PreparedStatement
    ResultSet res = stmt.executeQuery(sql);
    while (res.next()) {
      // Concatenate result columns 3 and 4 (year and month) into the period field
      get(Fields.Out, "period").setValue(rowd, res.getString(3) + "-" + res.getString(4));
      // Column 1 of the result becomes stock_price_close
      get(Fields.Out, "stock_price_close").setValue(rowd, res.getDouble(1));
      // Emit the populated row to the step's output
      putRow(data.outputRowMeta, rowd);
    }

    The code executes a SQL query against Hive. The result is iterated over and written to PDI's output rows: column 1 of the result is emitted as stock_price_close, and the concatenation of columns 3 and 4 becomes period.
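
    For context, here is a hedged sketch of how the surrounding processRow() method might look in a User Defined Java Class step. The driver class, connection URL, and SQL query are illustrative assumptions rather than the exact code shipped in hive_java_query.ktr:

    import java.sql.*;

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
      // The single trigger row produced by Generate Rows and Get Variable
      Object[] r = getRow();
      if (r == null) {
        setOutputDone();
        return false;
      }
      if (first) {
        first = false;
        data.outputRowMeta = getInputRowMeta().clone();
        meta.getFields(data.outputRowMeta, getStepname(), null, null, this);
      }
      Object[] rowd = createOutputRow(r, data.outputRowMeta.size());

      try {
        // Illustrative driver and URL; adapt to your Hive version and host
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive://192.168.1.122:10000/default", "", "");
        Statement stmt = con.createStatement();
        // Hypothetical query; the real transformation builds it from the STOCK variable
        String sql = "SELECT avg(stock_price_close), stock_symbol, year, month "
                   + "FROM nyse_stocks GROUP BY stock_symbol, year, month";
        ResultSet res = stmt.executeQuery(sql);
        while (res.next()) {
          get(Fields.Out, "period").setValue(rowd, res.getString(3) + "-" + res.getString(4));
          get(Fields.Out, "stock_price_close").setValue(rowd, res.getDouble(1));
          putRow(data.outputRowMeta, rowd);
        }
        con.close();
      } catch (Exception e) {
        throw new KettleException(e);
      }
      setOutputDone();
      return false;
    }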

  5. In the User Defined Java Class step, click on the Preview this transformation menu. It may take a few minutes because of the MapReduce process and because this is a single-node Hadoop cluster; you will get better performance by adding more nodes to achieve an optimum cluster setup. You will see a data preview like the following screenshot:
    [Screenshot: data preview of the User Defined Java Class step]

Consuming PDI as a CDA data source

To consume data through CTools, use Community Data Access (CDA), as it is the standard data access layer. CDA can connect to several sources, including a Pentaho Data Integration transformation.
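
Under the hood, a CDA file is an XML document that pairs connections with data accesses. The following is a minimal sketch of what a PDI-backed CDA definition might look like; the element values mirror the data sources created in the steps below, but treat the exact layout as an assumption rather than the file shipped with the chapter:

    <?xml version="1.0" encoding="utf-8"?>
    <CDADescriptor>
      <DataSources>
        <!-- A Kettle transformation file used as the connection -->
        <Connection id="1" type="kettle.TransFromFile">
          <KtrFile>hive_java_query.ktr</KtrFile>
          <!-- Maps the stock_param_data parameter to the STOCK variable -->
          <variables datarow-name="stock_param_data" variable-name="STOCK"/>
        </Connection>
      </DataSources>
      <DataAccess id="line_chart_data" connection="1" type="kettle" access="public">
        <!-- The query names the transformation step whose output is returned -->
        <Query>line_chart_data</Query>
        <Parameters>
          <Parameter name="stock_param_data" type="String" default="ALLSTOCKS"/>
        </Parameters>
      </DataAccess>
    </CDADescriptor>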

The following steps will help you create CDA data sources that consume a PDI transformation:

  1. Copy the Chapter 5 folder from your book's code bundle folder into [BISERVER]/pentaho-solutions.
  2. Launch PUC.
  3. In the Browser Panel window, you should see a newly added folder, Chapter 5. If it does not appear, go to the Tools menu, click on Refresh, and select Repository Cache.
  4. In the PUC Browser Panel window, right-click on NYSE Stock Price – Hive and choose Edit.
  5. Using the same procedure described in Chapter 4, Pentaho Business Analytics Tools, create the following two data sources:

    Name                       | line_chart_data                            | pie_chart_data
    Kettle Transformation File | hive_java_query.ktr                        | hive_java_query.ktr
    Query (represents a step)  | line_chart_data                            | pie_chart_data
    Variables (1)              | stock_param_data (Arg), STOCK (Value)      | stock_param_data (Arg), STOCK (Value)
    Parameters (1)             | stock_param_data (Name), ALLSTOCKS (Value) | stock_param_data (Name), ALLSTOCKS (Value)

    The variables and parameters in the data sources will be used later to interact with the dashboard's filter. The Variables textbox allows more than one pair, and Variables (1) indicates the first indexed Arg and Value pair. The same explanation applies to Parameters (1).

  6. In the Browser Panel window, double-click on stock_price_dashboard_hive.cda inside Chapter 5 to open the CDA data browser. The listbox contains the data source names we created before; choose DataAccess ID: line_chart_data to preview its data. It shows a table with three columns (stock_symbol, period, and stock_price_close) and one parameter, stock_param_data, with the default value ALLSTOCKS. Explore all the other data sources to gain a better understanding before working through the next examples.
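
Once the data accesses preview correctly, the same data can be fetched over HTTP through CDA's doQuery endpoint, which is how dashboard components will consume it. The following URL is a hedged example, assuming a default local BI Server; adjust the host, port, and solution path to your installation:

    http://localhost:8080/pentaho/content/cda/doQuery?path=/Chapter%205/stock_price_dashboard_hive.cda&dataAccessId=line_chart_data&paramstock_param_data=ALLSTOCKS&outputType=json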