There are many ways to read data from and write data to HDFS. We will start by using the FileSystem API to create and write to a file in HDFS, followed by an application to read a file from HDFS and write it back to the local filesystem.
You will need to download the weblog_entries.txt dataset from the Packt website, http://www.packtpub.com/support.
Carry out the following steps to read and write data to HDFS:
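import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Copies a local file into HDFS. Arguments: <localInputPath> <hdfsOutputPath>
public class HdfsWriter extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        String localInputPath = args[0];
        Path outputPath = new Path(args[1]);

        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);

        // create() overwrites the target if it already exists
        OutputStream os = fs.create(outputPath);
        InputStream is = new BufferedInputStream(
            new FileInputStream(localInputPath));

        // This overload of copyBytes() closes both streams when it finishes
        IOUtils.copyBytes(is, os, conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int returnCode = ToolRunner.run(new HdfsWriter(), args);
        System.exit(returnCode);
    }
}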
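import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Copies a file from HDFS to the local filesystem. Arguments: <hdfsInputPath> <localOutputPath>
public class HdfsReader extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Path inputPath = new Path(args[0]);
        String localOutputPath = args[1];

        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);

        InputStream is = fs.open(inputPath);
        OutputStream os = new BufferedOutputStream(
            new FileOutputStream(localOutputPath));

        // This overload of copyBytes() closes both streams when it finishes
        IOUtils.copyBytes(is, os, conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int returnCode = ToolRunner.run(new HdfsReader(), args);
        System.exit(returnCode);
    }
}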
FileSystem is an abstract class that represents a generic filesystem. Most Hadoop filesystem implementations can be accessed and manipulated through a FileSystem object. To create an instance for the Hadoop Distributed File System, you call the method FileSystem.get(). The FileSystem.get() method looks at the URI assigned to the fs.default.name parameter in the Hadoop configuration files on your classpath and chooses the correct implementation of the FileSystem class to instantiate. For HDFS, the fs.default.name parameter holds a URI with the hdfs:// scheme.
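The lookup described above can be pictured with a small stand-alone model. This is only a sketch that uses the JDK and not Hadoop itself: the class SchemeResolver, its registry map, and the host name namenode.example are hypothetical names invented for illustration, and Hadoop's real resolution logic is more involved. It shows the core idea that the scheme of the configured URI decides which FileSystem implementation is chosen.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of scheme-based FileSystem selection.
// It is NOT Hadoop code; it only mimics the idea of mapping a URI scheme
// to the name of an implementation class.
public class SchemeResolver {

    private final Map<String, String> implementations = new HashMap<>();

    public SchemeResolver() {
        // Class names of two well-known Hadoop FileSystem implementations
        implementations.put("hdfs", "org.apache.hadoop.hdfs.DistributedFileSystem");
        implementations.put("file", "org.apache.hadoop.fs.LocalFileSystem");
    }

    // Returns the implementation class name registered for the URI's scheme
    public String resolve(String defaultFsUri) {
        String scheme = URI.create(defaultFsUri).getScheme();
        String impl = implementations.get(scheme);
        if (impl == null) {
            throw new IllegalArgumentException("No FileSystem for scheme: " + scheme);
        }
        return impl;
    }

    public static void main(String[] args) {
        SchemeResolver resolver = new SchemeResolver();
        // namenode.example is a placeholder host, not a real cluster address
        System.out.println(resolver.resolve("hdfs://namenode.example:8020"));
        System.out.println(resolver.resolve("file:///"));
    }
}
```

In this model, changing only the configured URI changes which implementation is selected, which is why the HdfsWriter and HdfsReader code never names a concrete FileSystem subclass.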
Once an instance of the FileSystem class has been created, the HdfsWriter class calls the create() method to create a file (or overwrite if it already exists) in HDFS. The create() method returns an OutputStream object, which can be manipulated using normal Java I/O methods. Similarly, HdfsReader calls the method open() to open a file in HDFS, which returns an InputStream object that can be used to read the contents of the file.
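Because create() and open() hand back ordinary Java streams, the copy step itself is plain Java I/O. The following sketch shows a buffered copy loop comparable in spirit to what IOUtils.copyBytes() does for the two classes: it reads in chunks, writes each chunk out, and closes both streams at the end. The class StreamCopy, its copy() method, and the 4096-byte buffer are assumptions made for this illustration, and in-memory streams stand in for the HDFS and local file streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating a buffered stream-to-stream copy.
public class StreamCopy {

    // Copies everything from in to out, closes both streams,
    // and returns the number of bytes copied.
    public static long copy(InputStream in, OutputStream out, int bufferSize)
            throws IOException {
        // try-with-resources closes both streams even if the copy fails
        try (InputStream input = in; OutputStream output = out) {
            byte[] buffer = new byte[bufferSize];
            long total = 0;
            int read;
            while ((read = input.read(buffer)) != -1) {
                output.write(buffer, 0, read);
                total += read;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for fs.open() / fs.create() results
        byte[] data = "example log line\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink, 4096);
        System.out.println(copied + " bytes copied");
    }
}
```

Any InputStream and OutputStream pair can be passed to copy(), so the same loop would serve both the local-to-HDFS and the HDFS-to-local direction.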
The FileSystem API is extensive. To demonstrate some of the other methods available in the API, we can add some error checking to the HdfsWriter and HdfsReader classes we created.
To check whether the file exists before we call create(), use:
boolean exists = fs.exists(inputPath);
To check whether the path is a file, use:
boolean isFile = fs.isFile(inputPath);
To rename a file that already exists, use:
boolean renamed = fs.rename(inputPath, new Path("old_file.txt"));