Using HDFS C API (libhdfs)

libhdfs, a native shared library, provides a C API that enables non-Java programs to interact with HDFS. libhdfs uses JNI to interact with HDFS through Java.

Getting ready

Current Hadoop distributions contain pre-compiled libhdfs libraries for 32-bit and 64-bit Linux operating systems. If your operating system is not compatible with the pre-compiled libraries, you may have to download the Hadoop standard distribution and compile the libhdfs library from the source code. Refer to the Mounting HDFS (Fuse-DFS) recipe for information on compiling the libhdfs library.

How to do it...

The following steps show you how to perform operations on an HDFS installation using the HDFS C API:

  1. The following sample program creates a new file in HDFS, writes some text to the newly created file, and reads the file back from HDFS. Replace NAMENODE_HOSTNAME and PORT with the relevant values corresponding to the NameNode of your HDFS cluster. The hdfs_cpp_demo.c source file is provided in the HDFS_C_API directory of the source code bundle for this chapter.
    #include "hdfs.h"
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    int main(int argc, char **argv) {
    
        hdfsFS fs = hdfsConnect("NAMENODE_HOSTNAME", PORT);
        if (!fs) {
            fprintf(stderr, "Cannot connect to HDFS.\n");
            exit(-1);
        }
    
        char* fileName = "demo_c.txt";
        char* message = "Welcome to HDFS C API!!!";
        int size = strlen(message);
    
        int exists = hdfsExists(fs, fileName);
    
        if (exists > -1) {
            fprintf(stdout, "File %s exists!\n", fileName);
        } else {
            // Create and open the file for writing
            hdfsFile outFile = hdfsOpenFile(fs, fileName, O_WRONLY|O_CREAT, 0, 0, 0);
            if (!outFile) {
                fprintf(stderr, "Failed to open %s for writing!\n", fileName);
                exit(-2);
            }
    
            // Write the message to the file
            hdfsWrite(fs, outFile, (void*)message, size);
            hdfsCloseFile(fs, outFile);
        }
    
        // Open the file for reading
        hdfsFile inFile = hdfsOpenFile(fs, fileName, O_RDONLY, 0, 0, 0);
        if (!inFile) {
            fprintf(stderr, "Failed to open %s for reading!\n", fileName);
            exit(-2);
        }
    
        char* data = malloc(sizeof(char) * (size + 1));
        // Read from the file
        tSize readSize = hdfsRead(fs, inFile, (void*)data, size);
        if (readSize > 0) {
            data[readSize] = '\0'; // null-terminate the data before printing
            fprintf(stdout, "%s\n", data);
        }
        free(data);
    
        hdfsCloseFile(fs, inFile);
        hdfsDisconnect(fs);
        return 0;
    }
  2. Compile the preceding program using gcc as follows. When compiling, you have to link with the libhdfs and JVM libraries, and include the JNI header files of your Java installation. An example compile command looks like the following. Replace ARCH and the architecture-dependent paths with the paths relevant to your system.
    >gcc hdfs_cpp_demo.c \
    -I $HADOOP_HOME/src/c++/libhdfs \
    -I $JAVA_HOME/include \
    -I $JAVA_HOME/include/linux/ \
    -L $HADOOP_HOME/c++/ARCH/lib/ \
    -L $JAVA_HOME/jre/lib/ARCH/server \
    -lhdfs -ljvm -o hdfs_cpp_demo
    
  3. Export an environment variable named CLASSPATH containing the Hadoop dependencies. A safe approach is to include all the JAR files in $HADOOP_HOME and in $HADOOP_HOME/lib.
    export CLASSPATH=$HADOOP_HOME/hadoop-core-xx.jar:....
    

    Tip

    Ant build script to generate the classpath

    Add the following Ant target to the build file given in step 2 of the HDFS Java API recipe. The modified build.xml script is provided in the HDFS_C_API folder of the source package for this chapter.

    <target name="print-cp">
        <property name="classpath" refid="hadoop-classpath"/>
        <echo message="classpath= ${classpath}"/>
    </target>

    Execute the Ant build using ant print-cp to generate a string containing all the JARs in $HADOOP_HOME and $HADOOP_HOME/lib. Copy and export this string as the CLASSPATH environment variable.

  4. Execute the program.
    >LD_LIBRARY_PATH=$HADOOP_HOME/c++/ARCH/lib:$JAVA_HOME/jre/lib/ARCH/server ./hdfs_cpp_demo
    Welcome to HDFS C API!!!
    

How it works...

First, we connect to an HDFS cluster using the hdfsConnect command by providing the hostname (or the IP address) and port of the NameNode of the HDFS cluster. The hdfsConnectAsUser command can be used to connect to an HDFS cluster as a specific user.

hdfsFS fs = hdfsConnect("NAMENODE_HOSTNAME", PORT);
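
A minimal sketch of connecting as a specific user, assuming the same NAMENODE_HOSTNAME and PORT placeholders and a hypothetical username, would look like the following; check hdfs.h for the exact hdfsConnectAsUser signature in your Hadoop version:

// "hdfs_user" is a placeholder username, not part of the original sample
hdfsFS fs = hdfsConnectAsUser("NAMENODE_HOSTNAME", PORT, "hdfs_user");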

We create a new file and obtain a handle to it using the hdfsOpenFile command. The O_WRONLY|O_CREAT flags create a new file (or overwrite an existing file) and open it in write-only mode. Other supported flags are O_RDONLY and O_APPEND. The fourth, fifth, and sixth parameters of the hdfsOpenFile command are the buffer size for read/write operations, the block replication factor, and the block size for the newly created file. Specify 0 if you want to use the default values for these three parameters.

hdfsFile outFile = hdfsOpenFile(fs, fileName, flags, 0, 0, 0);

The hdfsWrite command writes the provided data into the file specified by the outFile handle. The data size needs to be given in bytes.

hdfsWrite(fs, outFile, (void*)message, size);
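
hdfsWrite returns the number of bytes it actually wrote, or -1 on an error, so checking the return value is good practice. A minimal sketch of such a check, reusing the variables from the preceding sample:

tSize written = hdfsWrite(fs, outFile, (void*)message, size);
if (written != size) {
    // The write failed or was incomplete
    fprintf(stderr, "Write to %s failed or was incomplete.\n", fileName);
}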

The hdfsRead command reads data from the file specified by the inFile handle. The size of the buffer in bytes needs to be provided as the fourth parameter. hdfsRead returns the actual number of bytes read, which might be less than the buffer size. If you need to read a specific number of bytes from the file, it is advisable to call hdfsRead in a loop until that number of bytes has been read, as shown in the sketch after the following snippet.

char* data = malloc(sizeof(char) * (size + 1));
tSize readSize = hdfsRead(fs, inFile, (void*)data, size);
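
The following sketch shows such a read loop, reusing the fs, inFile, data, and size variables from the preceding sample; it stops when the end of the file is reached or an error occurs:

tSize total = 0;
while (total < size) {
    tSize ret = hdfsRead(fs, inFile, (void*)(data + total), size - total);
    if (ret <= 0) {
        break; // end of file (0) or read error (-1)
    }
    total += ret;
}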

There's more...

The HDFS C API (libhdfs) supports many more filesystem operations than the functions used in the preceding sample. Refer to the $HADOOP_HOME/src/c++/libhdfs/hdfs.h header file for more information.
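
For example, the following minimal sketch queries the raw capacity and the current usage of the filesystem using the hdfsGetCapacity and hdfsGetUsed functions. Replace NAMENODE_HOSTNAME and PORT as before and compile it the same way as the earlier sample; the exact set of available functions may vary with your Hadoop version, so verify them against hdfs.h.

#include "hdfs.h"
#include <stdio.h>

int main(int argc, char **argv) {
    hdfsFS fs = hdfsConnect("NAMENODE_HOSTNAME", PORT);
    if (!fs) {
        fprintf(stderr, "Cannot connect to HDFS.\n");
        return -1;
    }

    // Query the raw capacity and the current usage of the filesystem
    tOffset capacity = hdfsGetCapacity(fs);
    tOffset used = hdfsGetUsed(fs);
    fprintf(stdout, "Capacity: %lld bytes, Used: %lld bytes\n",
            (long long) capacity, (long long) used);

    hdfsDisconnect(fs);
    return 0;
}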

Configuring using HDFS configuration files

You can also use the HDFS configuration files to point libhdfs to your HDFS NameNode, instead of specifying the NameNode hostname and the port number in the hdfsConnect command.

  1. Change the NameNode hostname and the port of the hdfsConnect command to 'default' and 0. (Setting the host to NULL would make libhdfs use the local filesystem.)
    hdfsFS fs = hdfsConnect("default",0);
  2. Add the conf directory of your HDFS installation to the CLASSPATH environment variable.
    export CLASSPATH=$HADOOP_HOME/hadoop-core-xx.jar:....:$HADOOP_HOME/conf

See also

  • The HDFS Java API and Mounting HDFS recipes in this chapter.