Using HDFS Java API

The HDFS Java API can be used to interact with HDFS from any Java program. This API gives us the ability to utilize the data stored in HDFS from other Java programs, as well as to process that data with non-Hadoop computational frameworks. Occasionally, you may also come across a use case where you want to access HDFS directly from inside a MapReduce application. However, if you are writing to or modifying files in HDFS directly from a Map or Reduce task, be aware that you are violating the side-effect-free nature of MapReduce, which might lead to data consistency issues depending on your use case.

Getting ready

Set the HADOOP_HOME environment variable to point to your Hadoop installation root directory.
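    
For example, on a Unix-like system you can set it in your shell as follows (the /opt/hadoop path is only a placeholder; substitute your actual Hadoop installation directory):

    >export HADOOP_HOME=/opt/hadoop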

How to do it...

The following steps show you how to use the HDFS Java API to perform filesystem operations on an HDFS installation using a Java program:

  1. The following sample program creates a new file in HDFS, writes some text to the newly created file, and reads the file back from HDFS:
    import java.io.IOException;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    
    public class HDFSJavaAPIDemo {
    
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.getUri());
    
        Path file = new Path("demo.txt");
    
        if (fs.exists(file)) {
          System.out.println("File exists.");
        } else {
          // Writing to file
          FSDataOutputStream outStream = fs.create(file);
          outStream.writeUTF("Welcome to HDFS Java API!!!");
          outStream.close();
        }
    
        // Reading from file
        FSDataInputStream inStream = fs.open(file);
        String data = inStream.readUTF();
        System.out.println(data);
        inStream.close();
    
        fs.close();
      }
    }
  2. Compile and package the above program into a JAR file. Unzip the source package for this chapter, go to the HDFS_Java_API folder, and run the Ant build. The HDFSJavaAPI.jar file will be created in the build folder.
    >cd HDFS_Java_API
    >ant
    

    You can use the following Ant build file to compile the above sample program:

    <project name="HDFSJavaAPI" default="compile" basedir=".">
      <property name="build" location="build"/>
      <property environment="env"/>
    
      <path id="hadoop-classpath">
        <fileset dir="${env.HADOOP_HOME}/lib">
          <include name="**/*.jar"/>
        </fileset>
        <fileset dir="${env.HADOOP_HOME}">
          <include name="**/*.jar"/>
        </fileset>
      </path>
    
      <target name="compile">
        <mkdir dir="${build}"/>
        <javac srcdir="src" destdir="${build}">
          <classpath refid="hadoop-classpath"/>
        </javac>
        <jar jarfile="HDFSJavaAPI.jar" basedir="${build}"/>
      </target>    
    
      <target name="clean">
        <delete dir="${build}"/>
      </target>
    </project>
  3. You can execute the above sample with Hadoop using the following command. Running the sample through the hadoop script ensures that it uses the currently configured HDFS and picks up the necessary dependencies from the Hadoop classpath.
    >bin/hadoop jar HDFSJavaAPI.jar HDFSJavaAPIDemo
    hdfs://yourhost:9000
    Welcome to HDFS Java API!!!
    
  4. Use the ls command to list the newly created file:
    >bin/hadoop fs -ls
    Found 1 items
    -rw-r--r--   3 foo supergroup         20 2012-04-27 16:57 /user/foo/demo.txt
    

How it works...

In order to interact with HDFS programmatically, we first obtain a handle to the currently configured filesystem. Instantiating a Configuration object and obtaining a FileSystem handle from within a Hadoop environment points it to the HDFS NameNode of that environment. Several alternative methods to configure a FileSystem object are discussed in the Configuring the FileSystem object section.

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

The FileSystem.create(filePath) method creates a new file in the given path and provides us with an FSDataOutputStream object to the newly created file. FSDataOutputStream wraps java.io.DataOutputStream and allows the program to write primitive Java data types to the file. The FileSystem.create() method overwrites the file if it already exists. In this example, the file is created relative to the user's home directory in HDFS, which results in a path similar to /user/<user_name>/demo.txt.

Path file = new Path("demo.txt");
FSDataOutputStream outStream = fs.create(file);
outStream.writeUTF("Welcome to HDFS Java API!!!");
outStream.close();
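
If overwriting is not desirable, you can use the overload of create() that takes an explicit overwrite flag, as in the following brief sketch (file and fs are the variables from the sample above):

// With the overwrite flag set to false, create() fails with an
// IOException if the file already exists, instead of replacing it
FSDataOutputStream outStream = fs.create(file, false);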

FileSystem.open(filePath) opens an FSDataInputStream to the given file. FSDataInputStream wraps java.io.DataInputStream and allows the program to read primitive Java data types from the file.

FSDataInputStream inStream = fs.open(file);
String data = inStream.readUTF();
System.out.println(data);
inStream.close();

There's more...

The HDFS Java API supports many more filesystem operations than the ones we have used in the above sample. The full API documentation can be found at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html.
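
For example, the following sketch demonstrates a few of the other common operations, such as creating directories, renaming, listing, and deleting files. It reuses the fs handle from the earlier sample and additionally imports org.apache.hadoop.fs.FileStatus; the demo-dir path is hypothetical and used only for illustration.

// Create a directory relative to the user's home directory in HDFS
fs.mkdirs(new Path("demo-dir"));

// Move (rename) the demo.txt file into the new directory
fs.rename(new Path("demo.txt"), new Path("demo-dir/demo.txt"));

// List the entries of the directory
FileStatus[] entries = fs.listStatus(new Path("demo-dir"));
for (FileStatus entry : entries) {
  System.out.println(entry.getPath() + " : " + entry.getLen() + " bytes");
}

// Delete the directory and all of its contents
fs.delete(new Path("demo-dir"), true);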

Configuring the FileSystem object

We can use the HDFS Java API from outside the Hadoop environment as well. When doing so, we have to explicitly configure the HDFS NameNode and the port. The following are a couple of ways to perform that configuration:

  • You can load the configuration files into the Configuration object before retrieving the FileSystem object, as follows. Make sure to add all the Hadoop and dependency libraries to the classpath.
    Configuration conf = new Configuration();
    conf.addResource(new Path("…/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("…/hadoop/conf/hdfs-site.xml"));
    
    FileSystem fileSystem = FileSystem.get(conf);
  • You can also specify the NameNode and the port as follows. Replace NAMENODE_HOSTNAME and PORT with the hostname and the port of the NameNode of your HDFS installation.
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://NAMENODE_HOSTNAME:PORT");
    FileSystem fileSystem = FileSystem.get(conf);

The HDFS filesystem API is an abstraction that supports several filesystems. If the above program cannot find a valid HDFS configuration, it will point to the local filesystem instead of HDFS. You can identify the current filesystem of a FileSystem object using the getUri() function as follows. It returns hdfs://your_namenode:port if it is using a properly configured HDFS, and file:/// if it is using the local filesystem.

fileSystem.getUri();

Retrieving the list of data blocks of a file

The getFileBlockLocations() function of the FileSystem object allows you to retrieve the list of data blocks of a file stored in HDFS, together with the hostnames where the blocks are stored and the block offsets. This information is very useful if you are planning to perform any data-local operations on the file's data using a framework other than Hadoop MapReduce.

FileStatus fileStatus = fs.getFileStatus(file);
BlockLocation[] blocks = fs.getFileBlockLocations(
        fileStatus, 0, fileStatus.getLen());
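
For example, the following sketch prints the offset, the length, and the hosts storing each block. It reuses the blocks array obtained above and additionally imports org.apache.hadoop.fs.BlockLocation.

for (BlockLocation block : blocks) {
  // Offset of the block within the file, and the block's length
  System.out.println("Offset: " + block.getOffset()
      + ", Length: " + block.getLength());
  // Hostnames of the DataNodes storing replicas of this block
  for (String host : block.getHosts()) {
    System.out.println("  Host: " + host);
  }
}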

See also

  • The Using HDFS C API recipe in this chapter.