Greenplum is a parallel database that distributes data and queries to one or more PostgreSQL instances. It complements Hadoop by providing real-time or near real-time access to large amounts of data. It supports using HDFS files as external tables. External tables are a good solution for working with data that lives outside of the Greenplum cluster. Since data in external tables must first travel over the network, they should be infrequently used in queries with other data that lives inside of the Greenplum cluster. This recipe will cover creating read-only and read/write external tables.
This recipe assumes that you are using the CDH3 distribution of Hadoop.
Run an instance of Greenplum that must be able to reach the Hadoop cluster found at http://www.greenplum.com/products/greenplum-database.
Configure Greenplum with the following:
gp_hadoop_target_version
set to cdh3u2
gp_hadoop_home
set to the full path of $HADOOP_HOME
Java 1.6 or above must be installed on each node in the Greenplum cluster.
Create an external table from the weblogs file in HDFS:
CREATE EXTERNAL TABLE weblogs( md5 text, url text, request_date date, request_time time, ip inet ) LOCATION ('gphdfs://<NAMENODE_HOST>:<NAMENODE_PORT>/data/weblogs/weblog_entries.txt') FORMAT 'TEXT' (DELIMITER ' '),
Greenplum has native support for loading data from HDFS in parallel. When a query is run against the weblog_entries.txt
table, the weblog_entries.txt
file is loaded into a temporary Greenplum table. The query then
executes against this table. After the query finishes the table is discarded.
Greenplum external tables also support writing of data. This requires the WRITABLE
keyword while creating the table:
CREATE WRITABLE EXTERNAL TABLE weblogs( md5 text, url text, request_date date, request_time time, ip inet ) LOCATION ('gphdfs://<NAMENODE_HOST>:<NAMENODE_PORT>/data/weblogs/weblog_entries.txt') FORMAT 'TEXT' (DELIMITER ' '),
More information can be found in the Greenplum administrator's handbook at http://media.gpadmin.me/wp-content/uploads/2011/05/GP-4100-AdminGuide.pdf