We now have a text file residing in HDFS that can be processed further by MapReduce jobs. Behind every piece of Hive SQL there is a MapReduce job doing the actual work. We will use the Pentaho Data Integration SQL-related steps to demonstrate this capability. Follow these steps:
1. Locate the hdfs_to_hive_product_price_history.kjb job in the chapter's code bundle folder and load the file into Spoon. You should see a job flow similar to the one shown in the following screenshot:
2. The first job entry copies the product-price-history.tsv.gz file from the local folder into HDFS.
3. The next job entry checks whether the product_price_history table exists in Hive or not. If it does not exist, the flow continues to the CREATE TABLE step; otherwise, it continues to the TRUNCATE TABLE step. The step editor dialog looks like the one shown in the following screenshot:
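Although this check is configured through the step's dialog, it is conceptually equivalent to running HiveQL along the following lines. This is only a sketch; the statements are standard HiveQL and the table name is the one used in this example:

    -- Returns the table name if it already exists, nothing otherwise
    SHOW TABLES LIKE 'product_price_history';

    -- If the table does exist, empty it before reloading
    TRUNCATE TABLE product_price_history;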
4. The CREATE TABLE step creates the product_price_history table with its structure. The editor looks like the one shown in the following screenshot:
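The table definition itself lives in the step editor shown above, but the statement it issues has the general shape of the HiveQL below. The column names and types here are purely illustrative and are not the actual structure used by the sample job:

    -- Hypothetical columns; the real definition comes from the step editor
    CREATE TABLE IF NOT EXISTS product_price_history (
        product_id  STRING,
        price       DOUBLE,
        price_date  STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'   -- the source is a tab-separated file
    STORED AS TEXTFILE;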
5. The next step loads the data into the product_price_history table. It reads from an HDFS location. The step editor dialog looks like the one shown in the following screenshot:
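If the step performs the load with a HiveQL statement, it would look roughly like the following sketch. The HDFS path is a placeholder, not the path used by the sample job; substitute the location the file was copied to in step 2:

    -- Placeholder path; Hive moves the file into the table's warehouse
    -- directory and reads gzip-compressed text files transparently
    LOAD DATA INPATH '/user/hive/staging/product-price-history.tsv.gz'
    INTO TABLE product_price_history;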
6. Verify that the product_price_history table exists. Click on the table to inspect it.
7. Using the same process, the hdfs_to_hive_product_nyse_stocks.kjb sample job loads the data of a bigger file, NYSE-2000-2001.tsv.gz, into Hive. At this point, you should have a straightforward understanding of the job.
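To see for yourself that a MapReduce job sits behind this SQL, as noted at the beginning of this section, you can ask Hive for a query plan. A minimal sketch:

    -- The plan output lists the map/reduce stages Hive generates for the query
    EXPLAIN
    SELECT COUNT(*) FROM product_price_history;

With the MapReduce execution engine, running the query itself also prints the progress of the launched MapReduce job in the console.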