Before starting to play with Hadoop and Hunk, we are going to download and run a VM. This section briefly describes how to get everything up and running and how to load some data for later processing.
We have decided to take the default Cloudera CDH 5.3.1 VM from the Cloudera site and fine-tune it for our needs. Please open this link to prepare a VM: http://www.bigdatapath.com/2015/08/learning-hunk-links-to-vm-with-all-stuff-you-need/.
This post may have been updated by the time you're reading this book.
You can run the terminal application by clicking the terminal icon on the top bar.
Your user is cloudera, and sudo is passwordless:
[cloudera@quickstart ~]$ whoami
cloudera
[cloudera@quickstart ~]$ sudo su
[root@quickstart cloudera]# whoami
root
[root@quickstart cloudera]#
MySQL is used as an example of the data ingestion process. The username is dwhuser and the password is dwhuser. You can get root access by using the root username and the cloudera password:
[cloudera@quickstart ~]$ mysql -u root -p
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| cdrdb              |
| cm                 |
| firehose           |
| hue                |
| metastore          |
| mysql              |
| oozie              |
| retail_db          |
| sentry             |
+--------------------+
10 rows in set (0.00 sec)
We import data from MySQL to Hadoop from the database named cdrdb. The other databases are used by Cloudera Manager services and Hadoop features such as Hive Metastore, Oozie, and so on.
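To see what cdrdb actually contains before importing anything, you could list its tables from the terminal. The following is a minimal sketch that reuses the dwhuser credentials given above; the variable names and the list_tables helper are illustrative only, and passing a password on the command line is acceptable only because this is a throwaway VM.

```shell
#!/usr/bin/env bash
# Sketch: list the tables in cdrdb as the unprivileged dwhuser account.
# The credentials and database name come from the text; the variable and
# function names are made up for this example.

DB_USER="dwhuser"
DB_PASS="dwhuser"
DB_NAME="cdrdb"

list_tables() {
    # -e runs a single statement and exits instead of opening a prompt.
    mysql -u "$DB_USER" -p"$DB_PASS" -e 'SHOW TABLES;' "$DB_NAME"
}

# Only talk to MySQL when the client is installed (it should be on the VM).
if command -v mysql >/dev/null 2>&1; then
    list_tables || echo "could not query $DB_NAME" >&2
else
    echo "mysql client not found; run this inside the VM" >&2
fi
```

Replacing the SHOW TABLES statement with any other query gives a quick way to inspect the source data without opening an interactive mysql session.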
Hive Metastore is a service designed to centralize metadata management. It plays a role similar to Teradata's DBC.Tables and DBC.Columns, or IBM DB2's syscat.Tables and syscat.Columns. The idea is to create a strict schema description over the bytes stored in Hadoop and then get access to this data using SQL.
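The idea of laying a schema over stored bytes can be sketched with a Hive external table. In the example below, the table name example_calls, its columns, and the HDFS path are hypothetical placeholders, not objects shipped with the VM; the script writes the HiveQL to a temporary file and runs it only if the hive command is available.

```shell
#!/usr/bin/env bash
# Sketch: describe comma-separated files in HDFS as a Hive table.
# The table name, columns, and HDFS path are hypothetical placeholders.

DDL_FILE="$(mktemp)"

cat > "$DDL_FILE" <<'EOF'
-- Hypothetical schema laid over files that already sit in HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS example_calls (
    call_id   BIGINT,
    caller    STRING,
    duration  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/cloudera/example_calls';

-- Once the schema is registered, the bytes can be queried with SQL.
SELECT caller, SUM(duration) AS total_duration
FROM example_calls
GROUP BY caller;
EOF

# Run the statements only where the Hive CLI is available.
if command -v hive >/dev/null 2>&1; then
    hive -f "$DDL_FILE" || echo "hive reported an error" >&2
else
    echo "hive not found; statements saved to $DDL_FILE" >&2
fi
```

Because the table is declared EXTERNAL, dropping it removes only the schema entry in the Metastore and leaves the underlying files in place.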
Oozie is a kind of Hadoop cron without a Single Point of Failure (SPOF). Think it through: is it easy to create a reliable distributed cron with failover functionality? Oozie uses an RDBMS to persist metadata about planned, running, and finished tasks. This VM doesn't provide an Oozie HA configuration.