Importing data to Hive

Before we begin the walkthrough, see Appendix A, Big Data Sets, to complete the Hive nyse_stocks data preparation and follow these steps:

  1. Launch Spoon if you have closed it.
  2. On the File menu, click on New and select Transformation.
  3. On the left-hand side panel, click on the View tab.
  4. Right-click on the Database connections node to show up a contextual menu and choose New.

The following screenshot shows you how to create a new database connection:

Importing data to Hive

When the Database Connection dialog appears, fill in the following configuration:

  • Connection Name: HIVE2
  • Connection Type: Hadoop Hive 2
  • Host Name: [your working IP address]
  • Database Name: default

Now follow these steps:

  1. Click on the Test button to verify the connection. If successful, click on the OK button to close it. The display window will look like the following screenshot:
    Importing data to Hive
  2. On the left-hand side panel, click on the Design tab.
  3. In the Input group, click on the Table input step and drag it into the working space. The following screenshot shows how the Table input step can be put into the working space:
    Importing data to Hive
  4. Double-click on the Table input step and the editor appears. On the connection listbox, select HIVE2. Type the following query into the SQL editor pane. Set the Limit Size parameter to 65535. We plan to export the data to an Excel file; the 65535 threshold is the limit up to which an Excel file is able to store data.
    SELECT * FROM nyse_stocks
  5. Click on the Preview button; the preview dialog appears, and then click on the OK button. Shortly, it will show a data preview of nyse_stocks, a Hive table.

    This process is actually a Hadoop MapReduce job; see the Preparing Hive data section in Appendix B, Hadoop Setup, which shows the job logs.

    The following screenshot shows a data preview of the Hive query:

    Importing data to Hive
  6. Click on the Close and OK buttons, respectively, to close all the open dialogs.
  7. On the File menu, choose Save and name the file export-hive-to-excel.ktr.
  8. In the Output group, choose and put the Microsoft Excel Output step into the workspace. The following screenshot shows the newly added Microsoft Excel Output step:
    Importing data to Hive
  9. Press Ctrl + click on the Table input step followed by pressing Ctrl + click on the Microsoft Excel Output step. Right-click on one of the steps—a contextual dialog will pop up—and choose New Hop. A hop represents data or control flow among steps. Make sure the configuration looks similar to the following screenshot; click on the OK button.
    Importing data to Hive

    Your transformation will look similar to the following screenshot:

    Importing data to Hive
  10. Double-click on Microsoft Excel Output; the editor dialog appears. On the File tab, specify your output filename in the Filename textbox. Click on the OK button to close the dialog. The following screenshot shows a Microsoft Excel Output step dialog:
    Importing data to Hive
  11. On the menu bar, right below the transformation tab, click on the Run this transformation or job button. On the Execute a transformation dialog, click on the Launch button. The following screenshot shows the running of a transformation or job menu in PDI:
    Importing data to Hive
  12. If the transformation runs successfully, explore the data transfer activity metrics. The following screenshot shows that the transformation runs in 58 seconds:
    Importing data to Hive
  13. Open the Excel file result. The displayed data will look similar to the following screenshot:
    Importing data to Hive
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset