Day 2: Working with Big Data

With Day 1’s table creation and manipulation under our belts, it’s time to start adding some serious data to our wiki table. Today, you’ll script against the HBase APIs, ultimately streaming Wikipedia content right into our wiki! Along the way, you’ll pick up some performance tricks for making faster import jobs. Finally, you’ll poke around in HBase’s internals to see how it partitions data into regions, serving both performance and disaster recovery goals.

Importing Data, Invoking Scripts

One common problem people face when trying a new database system is how to migrate data into it. Handcrafting Put operations with static strings, as you did in Day 1, is all well and good, but you can do better.

Fortunately, pasting commands into the shell is not the only way to execute them. When you start the HBase shell from the command line, you can specify the name of a JRuby script to run. HBase will execute that script as though it were entered directly into the shell. The syntax looks like this:

 $ ${HBASE_HOME}/bin/hbase shell <your_script> [<optional_arguments> ...]

Because we’re interested specifically in “Big Data,” let’s create a script for importing Wikipedia articles into our wiki table. The WikiMedia Foundation, which oversees Wikipedia, Wiktionary, and other projects, periodically publishes data dumps we can use. These dumps are in the form of enormous XML files. Here’s an example record from the English Wikipedia:

 <page>
   <title>Anarchism</title>
   <id>12</id>
   <revision>
     <id>408067712</id>
     <timestamp>2011-01-15T19:28:25Z</timestamp>
     <contributor>
       <username>RepublicanJacobite</username>
       <id>5223685</id>
     </contributor>
     <comment>Undid revision 408057615 by [[Special:Contributions...</comment>
     <text xml:space="preserve">{{Redirect|Anarchist|the fictional character|
 ...
 [[bat-smg:Anarkėzmos]]
     </text>
   </revision>
 </page>

Because we have such incredible foresight, the individual items in these XML files contain all the information we’ve already accounted for in our schema: title (row key), text, timestamp, and author. We ought to be able to write a script to import revisions without too much trouble.

Streaming XML

First things first: We’ll need to parse the huge XML files in a streaming fashion, so let’s start with that. The basic outline for parsing an XML file in JRuby, record by record, looks like this:

 import 'javax.xml.stream.XMLStreamConstants'

 factory = javax.xml.stream.XMLInputFactory.newInstance
 reader = factory.createXMLStreamReader(java.lang.System.in)

 while reader.has_next
   type = reader.next

   if type == XMLStreamConstants::START_ELEMENT
     tag = reader.local_name
     # do something with tag
   elsif type == XMLStreamConstants::CHARACTERS
     text = reader.text
     # do something with text
   elsif type == XMLStreamConstants::END_ELEMENT
     # same as START_ELEMENT
   end
 end

Breaking this down, there are a few parts worth mentioning. First, we produce an XMLStreamReader and wire it up to java.lang.System.in, which means it will be reading from standard input.

Next, we set up a while loop, which will continuously pull out tokens from the XML stream until there are none left. Inside the while loop, we process the current token. What happens then depends on whether the token is the start of an XML tag, the end of a tag, or the text in between.

Streaming Wikipedia

Now we can combine this basic XML processing framework with the HTable and Put interfaces you explored previously. Here is the resultant script. Most of it should look familiar, and we’ll discuss the few novel parts.

 require 'time'

 import 'org.apache.hadoop.hbase.client.HTable'
 import 'org.apache.hadoop.hbase.client.Put'
 import 'javax.xml.stream.XMLStreamConstants'

 def jbytes(*args)
   args.map { |arg| arg.to_s.to_java_bytes }
 end

 factory = javax.xml.stream.XMLInputFactory.newInstance
 reader = factory.createXMLStreamReader(java.lang.System.in)

 document = nil
 buffer = nil
 count = 0

 table = HTable.new(@hbase.configuration, 'wiki')
 table.setAutoFlush(false)

 while reader.has_next
   type = reader.next

   if type == XMLStreamConstants::START_ELEMENT

     case reader.local_name
     when 'page' then document = {}
     when /title|timestamp|username|comment|text/ then buffer = []
     end

   elsif type == XMLStreamConstants::CHARACTERS

     buffer << reader.text unless buffer.nil?

   elsif type == XMLStreamConstants::END_ELEMENT

     case reader.local_name
     when /title|timestamp|username|comment|text/
       document[reader.local_name] = buffer.join
     when 'revision'
       key = document['title'].to_java_bytes
       ts = (Time.parse document['timestamp']).to_i

       p = Put.new(key, ts)
       p.add(*jbytes("text", "", document['text']))
       p.add(*jbytes("revision", "author", document['username']))
       p.add(*jbytes("revision", "comment", document['comment']))
       table.put(p)

       count += 1
       table.flushCommits() if count % 10 == 0
       if count % 500 == 0
         puts "#{count} records inserted (#{document['title']})"
       end
     end
   end
 end

 table.flushCommits()
 exit

A few things to note in the preceding snippet:

  • Several new variables were introduced:

    • document holds the current article and revision data.

    • buffer holds character data for the current field within the document (text, title, author, and so on).

    • count keeps track of how many articles you’ve imported so far.

  • Pay special attention to the use of table.setAutoFlush(false). With autoflush enabled (the default), each put operation triggers an immediate round-trip to HBase. By disabling autoflush in our script, put operations are buffered on the client until you call table.flushCommits. This lets you batch writes together and send them when it’s convenient for you.

  • If the start tag is a <page>, then reset document to an empty hash. Otherwise, if it’s another tag you care about, reset buffer for storing its text.

  • We handle character data by appending it to the buffer.

  • For most closing tags, you just stash the buffered contents into the document. If the closing tag is a </revision>, however, you create a new Put instance, fill it with the document’s fields, and submit it to the table. After that, you use flushCommits if you haven’t done so in a while and report progress to stdout. A sketch of that closing-tag handoff follows this list.
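
To make that last step concrete, here’s a sketch of what the document hash might hold when a </revision> tag closes, and the Put it turns into. The values come from the sample <page> record shown earlier (truncated where the record was truncated), and the snippet assumes the jbytes helper and wiki table handle defined in the script above.

 # Hypothetical state of the document hash at a closing </revision> tag,
 # based on the sample Anarchism record from earlier.
 document = {
   'title'     => 'Anarchism',
   'timestamp' => '2011-01-15T19:28:25Z',
   'username'  => 'RepublicanJacobite',
   'comment'   => 'Undid revision 408057615 ...',
   'text'      => '{{Redirect|Anarchist|the fictional character| ...'
 }

 key = document['title'].to_java_bytes          # row key is the article title
 ts  = (Time.parse document['timestamp']).to_i  # revision time becomes the cell timestamp

 p = Put.new(key, ts)
 p.add(*jbytes('text', '', document['text']))
 p.add(*jbytes('revision', 'author', document['username']))
 p.add(*jbytes('revision', 'comment', document['comment']))
 table.put(p)    # buffered client-side until the next table.flushCommits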

Compression and Bloom Filters

We’re almost ready to run the script; we just have one more bit of housecleaning to do first. The text column family is going to contain big blobs of text content. Reading those values will take much longer than reading short values like “Hello world” or “Welcome to the wiki!” from Day 1. HBase lets us compress that data to speed up reads:

 hbase> alter 'wiki', {NAME=>'text', COMPRESSION=>'GZ', BLOOMFILTER=>'ROW'}
 0 row(s) in 0.0510 seconds

HBase supports two compression algorithms: Gzip (GZ) and Lempel-Ziv-Oberhumer (LZO). The HBase community recommends LZO over Gzip almost universally, but here we’re using Gzip. Why is that?

The problem with LZO for our purposes here is the implementation’s license. While open source, LZO is not compatible with Apache’s licensing philosophy, so LZO can’t be bundled with HBase. Detailed instructions are available online for installing and configuring LZO support. If you want high-performance compression, use LZO in your own projects.

A Bloom filter is a really cool data structure that efficiently answers the question “Have I ever seen this thing before?” and is used to prevent expensive queries that are doomed to fail (that is, to return no results). Originally developed by Burton Howard Bloom in 1970 for use in spell-checking applications, Bloom filters have become popular in data storage applications for determining quickly whether a key exists.

HBase supports using Bloom filters to determine whether a particular column exists for a given row key (BLOOMFILTER=>'ROWCOL') or just whether a given row key exists at all (BLOOMFILTER=>'ROW'). The number of columns within a column family and the number of rows are both potentially unbounded. Bloom filters offer a fast way of determining whether data exists before incurring an expensive disk read.
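
If you’re curious how a Bloom filter can answer “probably yes” or “definitely no” without storing the keys themselves, here’s a toy JRuby sketch of the idea. It’s purely illustrative and bears no resemblance to HBase’s tuned implementation: each key flips a few bit positions, and a lookup reports a possible hit only when all of its positions are set.

 # Toy Bloom filter: illustrative only, not HBase's implementation.
 class ToyBloomFilter
   def initialize(bits = 1024, hashes = 3)
     @bits = Array.new(bits, false)
     @hashes = hashes
   end

   def positions(key)
     (1..@hashes).map { |i| "#{i}:#{key}".hash.abs % @bits.size }
   end

   def add(key)
     positions(key).each { |i| @bits[i] = true }
   end

   def maybe_include?(key)
     positions(key).all? { |i| @bits[i] }  # false positives possible, false negatives not
   end
 end

 bf = ToyBloomFilter.new
 bf.add('Anarchism')
 puts bf.maybe_include?('Anarchism')    # true
 puts bf.maybe_include?('Not a page')   # almost certainly false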

Engage!

Now that we’ve dissected the script a bit and added some powerful capabilities to our table, we’re ready to kick off the script. Remember that these files are enormous, so downloading and unzipping them is pretty much out of the question. So, what are we going to do?

Fortunately, through the magic of *nix pipes, we can download, extract, and feed the XML into the script all at once. The command looks like this:

 $ curl https://url-for-the-data-dump.com | bzcat | \
   ${HBASE_HOME}/bin/hbase shell import_from_wikipedia.rb

Note that you should replace the preceding dummy URL with the URL of a WikiMedia Foundation dump file of some kind.[14] You should use [project]-latest-pages-articles.xml.bz2 for either the English Wikipedia (~12.7 GB)[15] or the English Wiktionary (~566 MB).[16] These files contain all of the most recent revisions of pages in the Main namespace. That is, they omit user pages, discussion pages, and so on.

Plug in the URL and run it! You should start seeing output like this shortly:

 500 records inserted (Ashmore and Cartier Islands)
 1000 records inserted (Annealing)
 1500 records inserted (Ajanta Caves)

The script will happily chug along as long as you let it or until it encounters an error, but you’ll probably want to shut it off after a while. When you’re ready to kill the script, press Ctrl+C. For now, though, let’s leave it running so we can take a peek under the hood and learn about how HBase achieves its horizontal scalability.

Introduction to Regions and Monitoring Disk Usage

In HBase, rows are kept in order, sorted by the row key. A region is a chunk of rows, identified by the starting key (inclusive) and ending key (exclusive). Regions never overlap, and each is assigned to a specific region server in the cluster. In our simplistic standalone server, there is only one region server, which will always be responsible for all regions. A fully distributed cluster would consist of many region servers.
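
Conceptually, routing a row key to a region is just a lookup over sorted, non-overlapping key ranges. Here’s a small sketch of that mental model with made-up region boundaries and server names; it is not HBase’s actual lookup code.

 # Hypothetical regions: each covers [start, stop), sorted by start key.
 # An empty string means "beginning of the keyspace" (start) or "end of it" (stop).
 regions = [
   { start: '',  stop: 'm', server: 'region-server-1' },
   { start: 'm', stop: '',  server: 'region-server-2' }
 ]

 def region_for(regions, row_key)
   regions.find do |r|
     row_key >= r[:start] && (r[:stop].empty? || row_key < r[:stop])
   end
 end

 puts region_for(regions, 'Anarchism')[:server]  # => region-server-1
 puts region_for(regions, 'Zanzibar')[:server]   # => region-server-2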

So, let’s take a look at your HBase server’s disk usage, which will give us insight into how the data is laid out. You can inspect HBase’s disk usage by opening a command prompt to the data/default directory in the hbase.rootdir location you specified earlier and executing the du command. du is a standard *nix command-line utility that tells you how much space is used by a directory and its children, recursively. The -h option tells du to report numbers in human-readable form.

Here’s what ours looked like after about 68 MB worth of pages (out of over 560 MB total or about 160,000 pages) had been inserted and the import was still running:

 $ ​​du​​ ​​-h
 4.0K ./wiki/.tabledesc
  0B ./wiki/.tmp
  0B ./wiki/1e157605a0e5a1493e4cc91d7e368b05/.tmp
  0B ./wiki/1e157605a0e5a1493e4cc91d7e368b05/recovered.edits
  11M ./wiki/1e157605a0e5a1493e4cc91d7e368b05/revision
  64M ./wiki/1e157605a0e5a1493e4cc91d7e368b05/text
  75M ./wiki/1e157605a0e5a1493e4cc91d7e368b05
  75M ./wiki
  75M .

This output tells us a lot about how much space HBase is using and how it’s allocated. The lines starting with ./wiki describe the space usage for the wiki table. The long-named subdirectory 1e157605a0e5a1493e4cc91d7e368b05 represents an individual region (the only region so far). Under that, the text and revision subdirectories correspond to the text and revision column families, respectively. Finally, the last line sums up all these values, telling us that HBase is using 75 MB of disk space. We can safely ignore the .tmp and recovered.edits entries for now.

One more thing. In the directory that you specified using the hbase.rootdir variable, you’ll find three folders named MasterProcWALs, WALs, and oldWALs. These folders hold write-ahead log (WAL) files. HBase uses write-ahead logging to provide protection against node failures. This is a fairly typical disaster recovery technique. For instance, write-ahead logging in file systems is called journaling. In HBase, logs are appended to the WAL before any edit operations (put and increment) are persisted to disk.

For performance reasons, edits are not necessarily written to disk immediately. The system does much better when I/O is buffered and written to disk in chunks. If the region server responsible for the affected region were to crash during this limbo period, HBase would use the WAL to determine which operations were successful and take corrective action. Without a WAL, a region server crash would mean that the not-yet-written data was simply lost.

Writing to the WAL is optional and enabled by default. Edit classes such as Put and Increment have a setter method called setWriteToWAL that can be used to exclude the operation from being written to the WAL. Generally you’ll want to keep the default option, but in some instances it might make sense to change it. For example, if you’re running an import job that you can rerun any time, such as our Wikipedia import script, you might prioritize the performance benefit of disabling WAL writes over disaster recovery protection.
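
For instance, here’s a minimal sketch of a Put that skips the WAL, reusing the jbytes helper pattern from our import script; the row key and text are made up, and you’d only do this for data you can easily re-create.

 import 'org.apache.hadoop.hbase.client.HTable'
 import 'org.apache.hadoop.hbase.client.Put'

 def jbytes(*args)
   args.map { |arg| arg.to_s.to_java_bytes }
 end

 table = HTable.new(@hbase.configuration, 'wiki')

 p = Put.new(*jbytes('Some Expendable Article'))   # hypothetical row key
 p.setWriteToWAL(false)   # skip the WAL: faster, but lost if the region server crashes first
 p.add(*jbytes('text', '', 'content we could easily re-import'))
 table.put(p)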

Regional Interrogation

If you let the script run long enough, you’ll see HBase split the table into multiple regions. Here’s our du output again, after about 280 MB worth of data (roughly 2.1 million pages) has been written to the wiki table:

 $ ​​du​​ ​​-h
 4.0K ./wiki/.tabledesc
  0B ./wiki/.tmp
  0B ./wiki/48576fdcfcb9b29257fb93d33dbeda90/.tmp
  0B ./wiki/48576fdcfcb9b29257fb93d33dbeda90/recovered.edits
 132M ./wiki/48576fdcfcb9b29257fb93d33dbeda90/revision
 145M ./wiki/48576fdcfcb9b29257fb93d33dbeda90/text
 277M ./wiki/48576fdcfcb9b29257fb93d33dbeda90
  0B ./wiki/bd36620826a14025a35f1fe5e928c6c9/.tmp
  0B ./wiki/bd36620826a14025a35f1fe5e928c6c9/recovered.edits
 134M ./wiki/bd36620826a14025a35f1fe5e928c6c9/revision
 113M ./wiki/bd36620826a14025a35f1fe5e928c6c9/text
 247M ./wiki/bd36620826a14025a35f1fe5e928c6c9
 1.0G ./wiki
 1.0G .

The biggest change is that the old region (1e157605a0e5a1493e4cc91d7e368b05) is now gone, replaced by two new regions (48576fd... and bd36620...). In our standalone server, all the regions are served by our single server, but in a distributed environment these would be parceled out across multiple region servers.

This raises a few questions, such as “How do the region servers know which regions they’re responsible for serving?” and “How can you find which region (and, by extension, which region server) is serving a given row?”

If we drop back into the HBase shell, we can query the hbase:meta table to find out more about the current regions. hbase:meta is a special table whose sole purpose is to keep track of all the user tables and which region servers are responsible for serving the regions of those tables.

 hbase> scan 'hbase:meta', { COLUMNS => ['info:server', 'info:regioninfo'] }

Even for a small number of regions, you should get a lot of output. Let’s just focus on the rows that begin with wiki for now. Here’s a fragment of ours, formatted and truncated for readability:

 ROW
  wiki,,1487399747176.48576fdcfcb9b29257fb93d33dbeda90.
 
 COLUMN+CELL
  column=info:server, timestamp=..., value=localhost.localdomain:35552
  column=info:regioninfo, timestamp=1487399747533, value={
  ENCODED => 48576fdcfcb9b29257fb93d33dbeda90,
  NAME => 'wiki,,1487399747176.48576fdcfcb9b29257fb93d33dbeda90.',
  STARTKEY => '', ENDKEY => 'lacrimamj'}
 
 ROW
  wiki,lacrimamj,1487399747176.bd36620826a14025a35f1fe5e928c6c9.
 
 COLUMN+CELL
  column=info:server, timestamp=..., value=localhost.localdomain:35552
  column=info:regioninfo, timestamp=1487399747533, value={
  ENCODED => bd36620826a14025a35f1fe5e928c6c9,
  NAME => 'wiki,lacrimamj,1487399747176.bd36620826a14025a35f1fe5e928c6c9.',
  STARTKEY => 'lacrimamj',
  ENDKEY => ''}

Both of the regions listed previously are served by the same server, localhost.localdomain:35552. The first region starts at the empty string row ('') and ends at 'lacrimamj'. The second region starts at 'lacrimamj' and goes to '' (that is, to the end of the available keyspace).

STARTKEY is inclusive, while ENDKEY is exclusive. So, if you were looking for the 'Demographics of Macedonia' row, you’d find it in the first region.
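
Because region boundaries are compared as raw bytes, you can sanity-check that placement yourself. Here’s a quick sketch using HBase’s Bytes utility (the same class the scanner script imports later); a negative result means the key sorts before the first region’s ENDKEY.

 import 'org.apache.hadoop.hbase.util.Bytes'

 key     = 'Demographics of Macedonia'.to_java_bytes
 end_key = 'lacrimamj'.to_java_bytes

 # A negative value means the row sorts before 'lacrimamj', so it lives in
 # the region spanning ['', 'lacrimamj').
 puts Bytes.compareTo(key, end_key)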

Because rows are kept in sorted order, we can use the information stored in hbase:meta to look up the region and server where any given row should be found. But where is the hbase:meta table stored?

It turns out that the hbase:meta table can also be split into regions and served by region servers just like any other table would be. If you run the load script as long as we have, this may or may not happen on your machine; you may have this table stored in only one region. To find out which servers have which parts of the hbase:meta table, look at the results of the preceding scan query but pay attention to the rows that begin with hbase:namespace.

 ROW
  hbase:namespace,,1486625601612.aa5b4cfb7204bfc50824dee1886103c5.
 COLUMN+CELL
  column=info:server, timestamp=..., value=localhost.localdomain:35552
  column=info:regioninfo, timestamp=1486625602069, value={
  ENCODED => aa5b4cfb7204bfc50824dee1886103c5,
  NAME => 'hbase:namespace,,1486625601612.aa5b4cfb7204bfc50824dee1886103c5.',
  STARTKEY => '',
  ENDKEY => ''}

In this case, the entire keyspace (beginning with '' and ending with '') is stored in the aa5b4cfb7204bfc50824dee1886103c5 region, which lives on disk on our machine in the data/hbase/namespace/aa5b4cfb7204bfc50824dee1886103c5 subdirectory of our HBase data folder (your region name will vary).

The assignment of regions to region servers, including hbase:meta regions, is handled by the master node, often referred to as HBaseMaster. The master server can also be a region server, performing both duties simultaneously.

When a region server fails, the master server steps in and reassigns responsibility for the regions previously assigned to the failed node. The new stewards of those regions look to the WAL to see what, if any, recovery steps are needed. If the master server fails, responsibility falls to whichever of the other region servers steps up to become the new master.

Scanning One Table to Build Another

Once you’ve stopped the import script from running, we can move on to the next task: extracting information from the imported wiki contents.

Wiki syntax is filled with links, some of which link internally to other articles and some of which link to external resources. This interlinking contains a wealth of topological data. Let’s capture it!

Our goal is to capture the relationships between articles as directional links, pointing one article to another or receiving a link from another. An internal article link in wikitext looks like this: [[<target name>|<alt text>]], where <target name> is the article to link to, and <alt text> is the alternative text to display (optional).

For example, if the text of the article on Star Wars contains the string "[[Yoda|jedi master]]", we want to store that relationship twice—once as an outgoing link from Star Wars and once as an incoming link to Yoda. Storing the relationship twice means that it’s fast to look up both a page’s outgoing links and its incoming links.

To store this additional link data, we’ll create a new table. Head over to the shell and enter this:

 hbase> create 'links', {
   NAME => 'to', VERSIONS => 1, BLOOMFILTER => 'ROWCOL'
 },{
   NAME => 'from', VERSIONS => 1, BLOOMFILTER => 'ROWCOL'
 }
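
Given that table, the Star Wars/Yoda example from before maps onto two Put operations, one per direction. Here’s a sketch you could run through the HBase shell; the article names are hypothetical, and it reuses the jbytes helper from earlier.

 import 'org.apache.hadoop.hbase.client.HTable'
 import 'org.apache.hadoop.hbase.client.Put'

 def jbytes(*args)
   args.map { |arg| arg.to_s.to_java_bytes }
 end

 links_table = HTable.new(@hbase.configuration, 'links')

 # Outgoing link: row key is the source article; family 'to',
 # qualifier is the target, value is the link label.
 put_to = Put.new(*jbytes('Star Wars'))
 put_to.add(*jbytes('to', 'Yoda', 'jedi master'))
 links_table.put(put_to)

 # Incoming link: row key is the target article; family 'from',
 # qualifier is the source, value is the same label.
 put_from = Put.new(*jbytes('Yoda'))
 put_from.add(*jbytes('from', 'Star Wars', 'jedi master'))
 links_table.put(put_from)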

In principle, we could have shoved the link data into an existing column family, or added one or more new column families to the wiki table, rather than creating a new table. Creating a separate table has the advantage that the tables have separate regions, which in turn means the cluster can more effectively split regions as necessary.

The general guidance for column families in the HBase community is to try to keep the number of families per table down. You can do this either by combining more columns into the same families or by putting families in different tables entirely. The choice is largely decided by whether and how often clients will need to get an entire row of data (as opposed to needing just a few column values).

For the wiki application we’ve been developing, the text and revision column families need to be in the same table so that when you put in new revisions, the revision metadata and the text share the same timestamp. The links content, by contrast, will never have the same timestamp as the article from which the data came. Further, most client actions will be interested either in the article text or in the extracted information about article links, but probably not in both at the same time. So, splitting the to and from column families out into a separate table makes sense.

Constructing the Scanner

With the links table created, we’re ready to implement a script that’ll scan all the rows of the wiki table. Then, for each row, it’ll retrieve the wikitext and parse out the links. Finally, for each link found, it’ll create incoming and outgoing link table records. The bulk of this script should be pretty familiar to you by now. Most of the pieces are recycled, and we’ll discuss the few novel bits.

 import 'org.apache.hadoop.hbase.client.HTable'
 import 'org.apache.hadoop.hbase.client.Put'
 import 'org.apache.hadoop.hbase.client.Scan'
 import 'org.apache.hadoop.hbase.util.Bytes'

 def jbytes(*args)
   return args.map { |arg| arg.to_s.to_java_bytes }
 end

 wiki_table = HTable.new(@hbase.configuration, 'wiki')
 links_table = HTable.new(@hbase.configuration, 'links')
 links_table.setAutoFlush(false)

 scanner = wiki_table.getScanner(Scan.new)

 linkpattern = /\[\[([^\[\]\|\:\#][^\[\]\|:]*)(?:\|([^\[\]\|]+))?\]\]/
 count = 0

 while (result = scanner.next())
   title = Bytes.toString(result.getRow())
   text = Bytes.toString(result.getValue(*jbytes('text', '')))
   if text
     put_to = nil
     text.scan(linkpattern) do |target, label|
       unless put_to
         put_to = Put.new(*jbytes(title))
         put_to.setWriteToWAL(false)
       end

       target.strip!
       target.capitalize!

       label = '' unless label
       label.strip!

       put_to.add(*jbytes("to", target, label))
       put_from = Put.new(*jbytes(target))
       put_from.add(*jbytes("from", title, label))
       put_from.setWriteToWAL(false)
       links_table.put(put_from)
     end
     links_table.put(put_to) if put_to
     links_table.flushCommits()
   end

   count += 1
   puts "#{count} pages processed (#{title})" if count % 500 == 0
 end

 links_table.flushCommits()
 exit

A few things to note in this script:

  • First, we grab a Scan object, which we’ll use to scan through the wiki table.

  • Extracting row and column data requires some byte wrangling but generally isn’t too bad either.

  • Each time the linkpattern matches in the page text, we extract the target article and the link’s label text and then use those values to add to our Put instances (see the short demo after this list).

  • Finally, we tell the table to execute our accumulated Put operations. It’s possible (though unlikely) for an article to contain no links at all, which is the reason for the if put_to clause.

  • Using setWriteToWAL(false) for these puts is a judgment call. Because this exercise is for educational purposes and because you could simply rerun the script if anything went wrong, we’ll take the speed bonus and accept our fate should the node fail.
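
To see what that pattern actually captures, here’s a quick demo you can run in plain JRuby, applying the script’s linkpattern to a made-up snippet of wikitext.

 linkpattern = /\[\[([^\[\]\|\:\#][^\[\]\|:]*)(?:\|([^\[\]\|]+))?\]\]/

 sample = 'The [[Yoda|jedi master]] trained [[Luke Skywalker]] on [[Dagobah]].'
 sample.scan(linkpattern) do |target, label|
   puts "target=#{target.inspect} label=#{label.inspect}"
 end

 # target="Yoda"            label="jedi master"
 # target="Luke Skywalker"  label=nil
 # target="Dagobah"         label=nil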

Running the Script

If you’re ready to throw caution to the wind, run the script.

 $ ${HBASE_HOME}/bin/hbase shell generate_wiki_links.rb

It should produce output like this:

 500 pages processed (10 petametres)
 1000 pages processed (1259)
 1500 pages processed (1471 BC)
 2000 pages processed (1683)
 ...

As with the previous script, you can let it run as long as you like, even to completion. If you want to stop it, press Ctrl+C.

You can monitor the disk usage of the script using du as we’ve done before. You’ll see new entries for the links table we just created, and the size counts will increase as the script runs.

Examining the Output

We just created a scanner programmatically to perform a sophisticated task. Now we’ll use the shell’s scan command to simply dump part of a table’s contents to the console. For each link the script finds in a text: blob, it will indiscriminately create both to and from entries in the links table. To see the kinds of links being created, head over to the shell and scan the table.

 hbase> scan 'links', STARTROW => "Admiral Ackbar", ENDROW => "It's a Trap!"

You should get a whole bunch of output. Of course, you can use the get command to see the links for just a single article.

 hbase> get 'links', 'Addition'
 COLUMN CELL
  from:+ timestamp=1487402389072, value=
  from:0 timestamp=1487402390828, value=
  from:Addition timestamp=1487402391595, value=
  from:Appendix:Basic English word list timestamp=1487402393334, value=
  ...

The structure of the wiki table is highly regular, with each row consisting of the same columns. As you recall, each row has text:, revision:author, and revision:comment columns. The links table has no such regularity. Each row may have one column or hundreds. And the variety of column names is as diverse as the row keys themselves (titles of Wikipedia articles). That’s okay! HBase is a so-called sparse data store for exactly this reason.

To find out just how many rows are now in your table, you can use the count command.

 hbase> count 'wiki', INTERVAL => 100000, CACHE => 10000
 Current count: 100000, row: Nov-Zelandon
 Current count: 200000, row: adiudicamur
 Current count: 300000, row: aquatores
 Current count: 500000, row: coiso
 ...
 Current count: 1300000, row: occludesti
 Current count: 1400000, row: plendonta
 Current count: 1500000, row: receptarum
 Current count: 1900000, row: ventilators
 2179230 row(s) in 17.3440 seconds

Because of its distributed architecture, HBase doesn’t immediately know how many rows are in each table. To find out, it has to count them (by performing a table scan). Fortunately, HBase’s region-based storage architecture lends itself to fast distributed scanning. So, even if the operation at hand requires a table scan, we don’t have to worry quite as much as we would with other databases.

Day 2 Wrap-Up

Whew, that was a pretty big day! You learned how to write an import script for HBase that parses data out of a stream of XML. Then you used those techniques to stream Wikipedia dumps directly into your wiki table.

You learned more of the HBase API, including some client-controllable performance levers such as setAutoFlush, flushCommits, and setWriteToWAL. Along those lines, we discussed some HBase architectural features such as disaster recovery, provided via the write-ahead log.

Speaking of architecture, you discovered table regions and how HBase divvies up responsibility for them among the region servers in the cluster. We scanned the hbase:meta table to get a feel for HBase internals.

Finally, we discussed some of the performance implications of HBase’s sparse design. In so doing, we touched on some community best practices regarding the use of columns, families, and tables.

Day 2 Homework

Find

  1. Find a discussion or article describing the pros and cons of compression in HBase.

  2. Find an article explaining how Bloom filters work in general and how they benefit HBase.

  3. Aside from the algorithm, what other column family options relate to compression?

  4. How does the type of data and expected usage patterns inform column family compression options?

Do

Expanding on the idea of data import, let’s build a database containing nutrition facts.

Download the MyPyramid Raw Food Data set from Data.gov[17] and extract the zipped contents to Food_Display_Table.xml.

This data consists of many pairs of <Food_Display_Row> tags. Inside these, each row has a <Food_Code> (integer value), <Display_Name> (string), and other facts about the food in appropriately named tags.

  1. Create a new table called foods with a single column family to store the facts. What should you use for the row key? What column family options make sense for this data?

  2. Create a new JRuby script for importing the food data. Use the streaming XML parsing style we used earlier for the Wikipedia import script and tailor it to the food data. Pipe the food data into your import script on the command line to populate the table.

  3. Using the HBase shell, query the foods table for information about your favorite foods.
