Loading data from HDFS into Hive (job orchestration)

Now we have a text file residing in HDFS that can be processed further by MapReduce. Behind every Hive SQL statement is a MapReduce job. We will use Pentaho Data Integration's SQL-related job steps to demonstrate this capability. Follow these steps:
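
If you want to see this compilation for yourself, Hive's EXPLAIN statement prints the execution plan of a query; on a classic Hive deployment such as the Hortonworks Sandbox used in this chapter, that plan is built from MapReduce stages. The following is only a sketch, and the table name is a placeholder for any table already defined in your Hive instance:

    -- Print the execution plan; on this Hive version the plan is a
    -- chain of MapReduce stages (the table name is a placeholder).
    EXPLAIN SELECT COUNT(*) FROM some_existing_table;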

  1. Launch Spoon if it is not running.
  2. Open hdfs_to_hive_product_price_history.kjb from the chapter's code bundle folder in Spoon. You should see a job flow similar to the one shown in the following screenshot:
  3. The Hadoop Copy Files step is responsible for copying the product-price-history.tsv.gz file from the local folder into HDFS.
  4. The TABLE EXIST step checks whether the product_price_history table exists in Hive. Depending on the result, the job continues to either the CREATE TABLE step (if the table does not exist) or the TRUNCATE TABLE step (if it does). The step editor dialog looks like the one shown in the following screenshot:
  5. The CREATE TABLE step executes the SQL command that creates the new product_price_history table and defines its structure. The editor looks like the one shown in the following screenshot:

    Tip

    Hive does not support the DATE data type, so in this case we use STRING for the date field.

  6. The TRUNCATE TABLE step is executed if the table exists. This step will remove all data from the table.
  7. Finally, the LOAD INFILE step loads the content of the uploaded file into the product_price_history table, reading it from an HDFS location (a sketch of the HiveQL behind these SQL-related steps follows this list). The step editor dialog looks like the one shown in the following screenshot:
  8. Run the job.
  9. Launch your browser and navigate to the Hortonworks Sandbox as shown in the following screenshot:
  10. From the menu bar, choose Beeswax (Hive UI), as shown in the following screenshot:
  11. Click on the Tables menu.
  12. From the list of table names, make sure product_price_history exists. Click on the table.
  13. The Table metadata page appears; click on the Columns tab to show the metadata. Click on the Samples tab to see a preview of the data that has already been uploaded using the PDI job.
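
As a rough guide to what the SQL-related steps in this job submit to Hive, the following HiveQL sketch mirrors the CREATE TABLE, TRUNCATE TABLE, and LOAD INFILE steps. The column names and the HDFS path are placeholders chosen for illustration; the actual statements live in the .kjb step editors:

    -- CREATE TABLE step (runs only when TABLE EXIST finds no table).
    -- Column names are assumptions; the date field is STRING (see the Tip).
    CREATE TABLE product_price_history (
      product_id STRING,
      price      DOUBLE,
      price_date STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- TRUNCATE TABLE step (runs only when the table already exists).
    TRUNCATE TABLE product_price_history;

    -- LOAD INFILE step: move the uploaded file from HDFS into the table.
    -- The path below is a placeholder, not the chapter's actual location.
    LOAD DATA INPATH '/user/pdi/product-price-history.tsv.gz'
    INTO TABLE product_price_history;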

Using the same process, the hdfs_to_hive_product_nyse_stocks.kjb sample job loads the larger NYSE-2000-2001.tsv.gz file into Hive. At this point, the job flow should be straightforward to follow.
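
If you would rather confirm the loads from the Hive shell instead of Beeswax, a quick row count is enough; the second table name below is an assumption, since the chapter does not spell out the table created by the NYSE job:

    -- Sanity checks after running both jobs.
    SELECT COUNT(*) FROM product_price_history;
    -- Assumed name for the NYSE job's target table:
    SELECT COUNT(*) FROM nyse_stocks;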
