Loading data from HDFS into Hive (job orchestration)

Now we have a text file residing in HDFS that can be processed further by MapReduce. Behind every Hive SQL statement is a MapReduce job. We will use Pentaho Data Integration's SQL-related job steps to demonstrate this capability. Follow these steps:
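
If you want to see this compilation for yourself, Hive's EXPLAIN statement prints the execution plan of a query; on a classic Hive deployment such as the Hortonworks Sandbox used in this chapter, that plan is built from MapReduce stages. The following is only a sketch, and the table name is a placeholder for any table already defined in your Hive instance:

    -- Print the execution plan; on this Hive version the plan is a
    -- chain of MapReduce stages (the table name is a placeholder).
    EXPLAIN SELECT COUNT(*) FROM some_existing_table;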

  1. Launch Spoon if it is not running.
  2. Open hdfs_to_hive_product_price_history.kjb from the chapter's code bundle folder in Spoon. You should see a job flow similar to the one shown in the following screenshot:
  3. The Hadoop Copy Files step is responsible for copying the product-price-history.tsv.gz file from the local folder into HDFS.
  4. The TABLE EXIST step checks whether the product_price_history table exists in Hive. Depending on the result, the job continues to either the CREATE TABLE step (if the table does not exist) or the TRUNCATE TABLE step (if it does). The step editor dialog looks like the one shown in the following screenshot:
  5. The CREATE TABLE step executes the SQL command that creates the new product_price_history table and defines its structure. The editor looks like the one shown in the following screenshot:

    Tip

    Hive does not support the DATE data type, so in this case we use STRING for the date field.

  6. The TRUNCATE TABLE step is executed if the table exists. This step will remove all data from the table.
  7. Finally, the LOAD INFILE step loads the content of the uploaded file into the product_price_history table, reading it from an HDFS location (a sketch of the HiveQL behind these SQL-related steps follows this list). The step editor dialog looks like the one shown in the following screenshot:
  8. Run the job.
  9. Launch your browser and navigate to the Hortonworks Sandbox as shown in the following screenshot:
  10. From the menu bar, choose Beeswax (Hive UI), as shown in the following screenshot:
  11. Click on the Tables menu.
  12. From the list of table names, make sure product_price_history exists. Click on the table.
  13. The Table metadata page appears; click on the Columns tab to show the metadata. Click on the Samples tab to see a preview of the data that has already been uploaded using the PDI job.
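
As a rough guide to what the SQL-related steps in this job submit to Hive, the following HiveQL sketch mirrors the CREATE TABLE, TRUNCATE TABLE, and LOAD INFILE steps. The column names and the HDFS path are placeholders chosen for illustration; the actual statements live in the .kjb step editors:

    -- CREATE TABLE step (runs only when TABLE EXIST finds no table).
    -- Column names are assumptions; the date field is STRING (see the Tip).
    CREATE TABLE product_price_history (
      product_id STRING,
      price      DOUBLE,
      price_date STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- TRUNCATE TABLE step (runs only when the table already exists).
    TRUNCATE TABLE product_price_history;

    -- LOAD INFILE step: move the uploaded file from HDFS into the table.
    -- The path below is a placeholder, not the chapter's actual location.
    LOAD DATA INPATH '/user/pdi/product-price-history.tsv.gz'
    INTO TABLE product_price_history;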

Using the same process, the hdfs_to_hive_product_nyse_stocks.kjb sample job loads the larger NYSE-2000-2001.tsv.gz file into Hive. At this point, the job flow should be straightforward to follow.
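
If you would rather confirm the loads from the Hive shell instead of Beeswax, a quick row count is enough; the second table name below is an assumption, since the chapter does not spell out the table created by the NYSE job:

    -- Sanity checks after running both jobs.
    SELECT COUNT(*) FROM product_price_history;
    -- Assumed name for the NYSE job's target table:
    SELECT COUNT(*) FROM nyse_stocks;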
