The HDFS Java API can be used to interact with HDFS from any Java program. This API gives us the ability to utilize the data stored in HDFS from other Java programs, as well as to process that data with non-Hadoop computational frameworks. Occasionally, you may also come across a use case where you want to access HDFS directly from inside a MapReduce application. However, if you are writing or modifying files in HDFS directly from a Map or Reduce task, be aware that you are violating the side-effect-free nature of MapReduce, which might lead to data consistency issues depending on your use case.
Set the HADOOP_HOME environment variable to point to your Hadoop installation root directory.
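For example, on a Linux system where Hadoop is installed under /opt/hadoop (a hypothetical path; adjust it to match your installation):

>export HADOOP_HOME=/opt/hadoop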
The following steps show you how to use the HDFS Java API to perform filesystem operations on an HDFS installation using a Java program. The sample program below creates a new file in HDFS if it does not already exist, writes a short message to it, and reads the message back:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSJavaAPIDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.getUri());

        Path file = new Path("demo.txt");

        if (fs.exists(file)) {
            System.out.println("File exists.");
        } else {
            // Writing to file
            FSDataOutputStream outStream = fs.create(file);
            outStream.writeUTF("Welcome to HDFS Java API!!!");
            outStream.close();
        }

        // Reading from file
        FSDataInputStream inStream = fs.open(file);
        String data = inStream.readUTF();
        System.out.println(data);
        inStream.close();

        fs.close();
    }
}
Compile and package the sample by navigating to the HDFS_Java_API folder and running the Ant build. The HDFSJavaAPI.jar file will be created in the build folder:

>cd HDFS_Java_API
>ant
You can use the following Ant build file to compile the above sample program:
<project name="HDFSJavaAPI" default="compile" basedir=".">
  <property name="build" location="build"/>
  <property environment="env"/>
  <path id="hadoop-classpath">
    <fileset dir="${env.HADOOP_HOME}/lib">
      <include name="**/*.jar"/>
    </fileset>
    <fileset dir="${env.HADOOP_HOME}">
      <include name="**/*.jar"/>
    </fileset>
  </path>
  <target name="compile">
    <mkdir dir="${build}"/>
    <javac srcdir="src" destdir="${build}">
      <classpath refid="hadoop-classpath"/>
    </javac>
    <jar jarfile="HDFSJavaAPI.jar" basedir="${build}"/>
  </target>
  <target name="clean">
    <delete dir="${build}"/>
  </target>
</project>
Execute the sample using the following command. Running the sample through the hadoop script ensures that it uses the currently configured HDFS and the necessary dependencies from the Hadoop classpath:

>bin/hadoop jar HDFSJavaAPI.jar HDFSJavaAPIDemo
hdfs://yourhost:9000
Welcome to HDFS Java API!!!
Use the ls command to list the newly created file:

>bin/hadoop fs -ls
Found 1 items
-rw-r--r--   3 foo supergroup         20 2012-04-27 16:57 /user/foo/demo.txt
In order to interact with HDFS programmatically, we first obtain a handle to the currently configured filesystem. Instantiating a Configuration object and obtaining a FileSystem handle within a Hadoop environment will point it to the HDFS NameNode of that environment. Several alternative methods to configure a FileSystem object are discussed in the Configuring the FileSystem object section.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
The FileSystem.create(filePath) method creates a new file at the given path and provides us with an FSDataOutputStream object to the newly created file. FSDataOutputStream wraps java.io.DataOutputStream and allows the program to write primitive Java data types to the file. The FileSystem.create() method overwrites the file if it already exists. In this example, the file is created relative to the user's home directory in HDFS, which results in a path similar to /user/<user_name>/demo.txt.
Path file = new Path("demo.txt");
FSDataOutputStream outStream = fs.create(file);
outStream.writeUTF("Welcome to HDFS Java API!!!");
outStream.close();
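If overwriting is not desirable, FileSystem also provides an overloaded create(Path, boolean) method that takes an overwrite flag. The following is a minimal sketch reusing the fs and file objects from the above sample; the failure handling shown is illustrative:

try {
    // Passing false for the overwrite flag makes create() fail with an
    // IOException when the file already exists, instead of replacing it.
    FSDataOutputStream outStream = fs.create(file, false);
    outStream.writeUTF("Welcome to HDFS Java API!!!");
    outStream.close();
} catch (IOException e) {
    System.out.println("demo.txt already exists; not overwriting.");
}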
The FileSystem.open(filePath) method opens an FSDataInputStream to the given file. FSDataInputStream wraps java.io.DataInputStream and allows the program to read primitive Java data types from the file.
FSDataInputStream inStream = fs.open(file);
String data = inStream.readUTF();
System.out.println(data);
inStream.close();
The HDFS Java API supports many more filesystem operations than the ones we have used in the above sample. The full API documentation can be found at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html.
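For instance, the following minimal sketch demonstrates a few of those operations, reusing the fs and file objects from the above sample. The demo_dir directory name is hypothetical, and the listing loop additionally requires importing org.apache.hadoop.fs.FileStatus:

// Create a directory, move the file into it, list the directory
// contents, and finally remove everything recursively.
Path dir = new Path("demo_dir");
fs.mkdirs(dir);
fs.rename(file, new Path(dir, "renamed.txt"));
for (FileStatus status : fs.listStatus(dir)) {
    System.out.println(status.getPath() + " : " + status.getLen() + " bytes");
}
fs.delete(dir, true);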
We can use the HDFS Java API from outside the Hadoop environment as well. When doing so, we have to explicitly configure the HDFS NameNode and the port. The following are a couple of ways to perform that configuration:
You can load the Hadoop configuration files to the Configuration object before retrieving the FileSystem object as follows. Make sure to add all the Hadoop and dependency libraries to the classpath.

Configuration conf = new Configuration();
conf.addResource(new Path("…/hadoop/conf/core-site.xml"));
conf.addResource(new Path("…/hadoop/conf/hdfs-site.xml"));
FileSystem fileSystem = FileSystem.get(conf);
Alternatively, you can specify the NameNode and the port explicitly, replacing NAMENODE_HOSTNAME and PORT with the hostname and the port of the NameNode of your HDFS installation:

Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://NAMENODE_HOSTNAME:PORT");
FileSystem fileSystem = FileSystem.get(conf);
The HDFS filesystem API is an abstraction that supports several filesystems. If the above program cannot find a valid HDFS configuration, it will point to the local filesystem instead of HDFS. You can identify the current filesystem of the FileSystem object using the getUri() function as follows. It would result in hdfs://your_namenode:port in case it is using a properly configured HDFS, and file:/// in case it is using the local filesystem.
fileSystem.getUri();
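For example, the following minimal sketch uses the scheme of the returned URI to verify that the program is actually talking to HDFS rather than to the local filesystem; the error handling shown is illustrative:

// The URI scheme is "hdfs" for an HDFS-backed FileSystem object
// and "file" for the local filesystem.
String scheme = fileSystem.getUri().getScheme();
if (!"hdfs".equals(scheme)) {
    throw new IllegalStateException(
        "Expected HDFS, but the FileSystem points to " + fileSystem.getUri());
}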
The getFileBlockLocations() function of the FileSystem object allows you to retrieve the list of data blocks of a file stored in HDFS, together with the hostnames where the blocks are stored and the block offsets. This information is very useful if you are planning to perform any data-local operations on the file's data using a framework other than Hadoop MapReduce.
FileStatus fileStatus = fs.getFileStatus(file);
BlockLocation[] blocks = fs.getFileBlockLocations(
    fileStatus, 0, fileStatus.getLen());
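As an illustration, the following sketch walks through the retrieved blocks and prints the offset, the length, and the DataNode hostnames of each block. It assumes the fs and file objects from the earlier sample and an import of org.apache.hadoop.fs.BlockLocation:

// Print where each data block of the file is physically stored.
for (BlockLocation block : blocks) {
    System.out.println("Block at offset " + block.getOffset()
        + ", length " + block.getLength());
    for (String host : block.getHosts()) {
        // Hostnames of the DataNodes holding a replica of this block
        System.out.println("  stored on " + host);
    }
}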