Day 1: CRUD and Table Administration

Today’s goal is to learn the nuts and bolts of working with HBase. You’ll get a local instance of HBase running in standalone mode (rather than in distributed mode), and then you’ll use the HBase shell to create and alter tables and to insert and modify data using basic CRUD-style commands. After that, you’ll explore how to perform some of those operations programmatically by using the HBase Java API in JRuby. Along the way, you’ll uncover some HBase architectural concepts, such as the relationship between rows, column families, columns, and values in a table. Just bear in mind that these concepts in HBase are subtly different from their counterparts in relational databases.

According to most HBase admins out there, a fully operational, production-quality HBase cluster should really consist of no fewer than five nodes. But a setup that bulky would be overkill for our needs (and for our laptops). Fortunately, HBase supports three running modes:

  • Standalone mode is a single machine acting alone.
  • Pseudo-distributed mode is a single node pretending to be a cluster.
  • Fully distributed mode is a cluster of nodes working together.

For most of this chapter, you’ll be running HBase in standalone mode. Yet even that can be a bit of a challenge, especially compared to other databases in this book, such as Redis or MongoDB. So although we won’t cover every aspect of installation and administration, we’ll give some relevant troubleshooting tips where appropriate.

Configuring HBase

Before you can use HBase, you need to provide it with some configuration, as HBase doesn’t really have an “out-of-the-box” mode. Configuration settings for HBase are kept in a file called hbase-site.xml, which can be found in the ${HBASE_HOME}/conf directory. Note that HBASE_HOME is an environment variable pointing to the directory where you’ve installed HBase. Make sure to set this variable now, preferably in your .bash_profile or similar file so that it persists across shell sessions.

Initially, this hbase-site.xml file contains just an empty <configuration> tag. You can add any number of property definitions to your configuration using this format:

 <property>
  <name>some.property.name</name>
  <value>A property value</value>
 </property>

By default, HBase uses a temporary directory to store its data files. This means you’ll lose all your data whenever the operating system decides to reclaim the disk space. To keep your data around, you should specify a non-ephemeral storage location. Set the hbase.rootdir property to an appropriate path like so:

 <property>
  <name>hbase.rootdir</name>
  <value>file:///path/to/hbase</value>
 </property>

Here’s an example configuration:

 <property>
  <name>hbase.rootdir</name>
  <value>file://</value>
 </property>
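For reference, a complete hbase-site.xml wraps each property in the top-level <configuration> element. The following Python sketch builds such a file's content. The function name build_hbase_site is made up for illustration, and the path is the placeholder from the property format shown earlier, so substitute a real directory on your machine.

```python
import xml.etree.ElementTree as ET


def build_hbase_site(properties):
    """Return hbase-site.xml content for a dict of property names to values."""
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(configuration, space="  ")  # pretty-print; needs Python 3.9+
    return ET.tostring(configuration, encoding="unicode")


# Placeholder path (an assumption, not a real location); use your own.
site_xml = build_hbase_site({"hbase.rootdir": "file:///path/to/hbase"})
print(site_xml)
```

Writing the returned string to ${HBASE_HOME}/conf/hbase-site.xml should have the same effect as editing the file by hand.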

To start HBase, open a terminal (command prompt) and run this command:

 $ ${HBASE_HOME}/bin/start-hbase.sh

To shut down HBase at any time, use the stop-hbase.sh command in the same directory.

If anything goes wrong, take a look at the most recently modified files in the ${HBASE_HOME}/logs directory. On *nix-based systems, the following command will pipe the latest log data to the console as it’s written:

 $ cd ${HBASE_HOME}
 $ find ./logs -name "hbase-*.log" -exec tail -f {} \;

The HBase Shell

The HBase shell is a JRuby-based command-line program you can use to interact with HBase. In the shell, you can add and remove tables, alter table schemas, add or delete data, and perform a bunch of other tasks. Later, we’ll explore other means of connecting to HBase, but for now the shell will be our home.

With HBase running, open a terminal and fire up the HBase shell:

 $ ${HBASE_HOME}/bin/hbase shell

To confirm that it’s working properly, try asking it for version information. That should output a version number and hash, and a timestamp for when the version was released.

 hbase> version
 1.2.1, r8d8a7107dc4ccbf36a92f64675dc60392f85c015, Wed Mar 30 11:19:21 CDT 2016

You can enter help at any time to see a list of available commands or to get usage information about a particular command.

Next, execute the status command to see how your HBase server is holding up.

 hbase> status
 1 active master, 0 backup masters, 1 servers, 0 dead, 2.0000 average load

If an error occurs for any of these commands or if the shell hangs, a connection problem could be to blame. HBase does its best to automatically configure its services based on your network setup, but sometimes it gets it wrong. If you’re seeing these symptoms, check the HBase network settings.

Creating a Table

Most programming languages have some concept of a key/value map. JavaScript has objects, Ruby has hashes, Go has maps, Python has dictionaries, Java has hashmaps, and so on. A table in HBase is basically a big map—well, more accurately, a map of maps.

In an HBase table, keys are arbitrary strings that each map to a row of data. A row is itself a map in which keys are called columns and values are stored as uninterpreted arrays of bytes. Columns are grouped into column families, so a column’s full name consists of two parts: the column family name and the column qualifier. Often these are concatenated together using a colon (for example, family:qualifier).

Here’s what a simple HBase table might look like if it were a Python dictionary:

 hbase_table = {                # Table
   'row1': {                    # Row key
     'cf1:col1': 'value1',      # Column family, column, and value
     'cf1:col2': 'value2',
     'cf2:col1': 'value3'
   },
   'row2': {
     # More row data
   }
 }

 queried_value = hbase_table['row1']['cf1:col1'] # 'value1'

For a more visual illustration, take a look at the following diagram.

images/hbase-simple-architecture.png

In this figure, we have a hypothetical table with two column families: color and shape. The table has two rows—denoted by dashed boxes—identified by their row keys: first and second. Looking at just the first row, you see that it has three columns in the color column family (with qualifiers red, blue, and yellow) and one column in the shape column family (square). The combination of row key and column name (including both family and qualifier) creates an address for locating data. In this example, the tuple first/color:red points us to the value ’#F00’.
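To make the addressing scheme concrete, here is a small Python sketch that nests the maps one level deeper (row, then family, then qualifier) and resolves a family:qualifier name. Only the value at first/color:red comes from the description above. The other values, the contents of the second row, and the lookup helper are assumptions made up for illustration.

```python
# Only first/color:red -> '#F00' is taken from the text;
# every other value below is a made-up placeholder.
table = {
    "first": {
        "color": {"red": "#F00", "blue": "#00F", "yellow": "#FF0"},
        "shape": {"square": "4"},
    },
    "second": {
        "shape": {"triangle": "3"},
    },
}


def lookup(table, row_key, column):
    """Resolve a 'family:qualifier' column name within a row."""
    family, qualifier = column.split(":", 1)
    return table[row_key][family][qualifier]


print(lookup(table, "first", "color:red"))  # prints #F00
```

Splitting on the first colon only means a qualifier may itself contain colons, or be empty.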

Now let’s take what you’ve learned about table structure and use it to do something fun—you’re going to make a wiki! There are lots of juicy info bits you might want to associate with a wiki, but you’ll start with the bare minimum. A wiki contains pages, each of which has a unique title string and contains some article text.

Use the create command to make our wiki table in the HBase shell:

 hbase> create 'wiki', 'text'
 0 row(s) in 1.2160 seconds

Here, we’re creating a table called wiki with a single column family called text. The table is currently empty; it has no rows and thus no columns. Unlike a relational database, in HBase a column is specific to the row that contains it. Columns don’t have to be predefined in something like a CREATE TABLE declaration in SQL. For our purposes here, though, we’ll stick to a schema, even though it isn’t predefined. When we start adding rows, we’ll add columns to store data at the same time.

Visualizing our table architecture, we arrive at something like the following figure.

images/hbase-simple-wiki-architecture.png

By our own convention, we expect each row to have exactly one column within the text family, qualified by the empty string (''). So, the full column name containing the text of a page will be ’text:’.

For our wiki table to be useful, it’s of course going to need content, so let’s add some!

Inserting, Updating, and Retrieving Data

Our wiki needs a Home page, so we’ll start with that. To add data to an HBase table, use the put command:

 hbase> put 'wiki', 'Home', 'text:', 'Welcome to the wiki!'

This command inserts a new row into the wiki table with the key ’Home’, adding ’Welcome to the wiki!’ to the column called ’text:’. Note the colon at the end of the column name. The colon separates the column family from the column qualifier, so keep it even when the qualifier is empty, as it is here: you’re naming the text column family and leaving the qualifier blank.

We can query the data for the ’Home’ row using get, which requires two parameters: the table name and the row key. You can optionally specify a list of columns to return. Here, we’ll fetch the value of the text: column:

 hbase> get 'wiki', 'Home', 'text:'
 COLUMN CELL
  text: timestamp=1295774833226, value=Welcome to the wiki!
 1 row(s) in 0.0590 seconds

Notice the timestamp field in the output. HBase stores an integer timestamp for all data values, representing time in milliseconds since the epoch (00:00:00 UTC on January 1, 1970). When a new value is written to the same cell, the old value hangs around, indexed by its timestamp. This is a pretty awesome feature, and one that is unique to HBase amongst the databases in this book. Most databases require you to specifically handle historical data yourself, but in HBase, versioning is baked right in!
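If you want to read one of these timestamps as a calendar date, the conversion is simple arithmetic. This Python sketch decodes the timestamp from the get output above; the helper name hbase_timestamp_to_datetime is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hbase_timestamp_to_datetime(millis):
    """Convert milliseconds since the epoch to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


# Timestamp copied from the get output above.
written_at = hbase_timestamp_to_datetime(1295774833226)
print(written_at.isoformat())  # 2011-01-23T09:27:13.226000+00:00
```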

Finally, let’s perform a scan operation:

 hbase> scan 'wiki'

Scan operations simply return all rows in the entire table. Scans are powerful and great for development purposes but they are also a very blunt instrument, so use them with care. We don’t have much data in our wiki table so it’s perfectly fine, but if you’re running HBase in production, stick to more precise reads or you’ll put a lot of undue strain on your tables.

Putting and Getting

The put and get commands allow you to specify a timestamp explicitly. If using milliseconds since the epoch doesn’t strike your fancy, you can specify another integer value of your choice. This gives you an extra dimension to work with if you need it. If you don’t specify a timestamp, HBase will use the current time when inserting, and it will return the most recent version when reading.
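One way to picture this extra dimension is to treat each cell as its own small map from timestamp to value. The following Python sketch is a toy model under that assumption, not HBase's implementation; the class and method names are invented for illustration.

```python
import time


class VersionedCell:
    """Toy model of one HBase cell: a map from timestamp to value."""

    def __init__(self):
        self.versions = {}

    def put(self, value, timestamp=None):
        # With no explicit timestamp, fall back to the current time in ms.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self.versions[timestamp] = value

    def get(self, timestamp=None):
        # With no explicit timestamp, return the most recent version.
        if timestamp is None:
            timestamp = max(self.versions)
        return self.versions[timestamp]


cell = VersionedCell()
cell.put("first draft", timestamp=1)   # any integer can serve as the version
cell.put("second draft", timestamp=2)
print(cell.get())             # second draft
print(cell.get(timestamp=1))  # first draft
```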

Luc says:
Rows Are Like Mini Databases

Rows in HBase are a bit tough to fully understand at first because rows tend to be much more “shallow” in other databases. In relational databases, for example, rows contain any number of column values but not metadata such as timestamps, and they don’t contain the kind of depth that HBase rows do (like the Python dictionary in the previous example).

I recommend thinking of HBase rows as being a tiny database in their own right. Each cell in the database can have many different values associated with it (like a mini timeseries database). When you fetch a row in HBase, you’re not fetching a set of values; you’re fetching a small world.

Altering Tables

So far, our wiki schema has pages with titles, text, and an integrated version history but nothing else. Let’s expand our requirements to include the following:

  • In our wiki, a page is uniquely identified by its title.
  • A page can have unlimited revisions.
  • A revision is identified by its timestamp.
  • A revision contains text and optionally a commit comment.
  • A revision was made by an author, identified by name.

Visually, our requirements can be sketched as you see in the following figure.

images/hbase-wiki-requirements.png

In this abstract representation of our requirements for a page, we see that each revision has an author, a commit comment, some article text, and a timestamp. The title of a page is not part of a revision because it’s the identifier we use to denote revisions belonging to the same page and thus cannot change. If you did want to change the title of a page, you’d need to write a whole new row.

Mapping our vision to an HBase table takes a somewhat different form, as illustrated in the figure that follows.

images/hbase-wiki-architecture.png

Our wiki table uses the title as the row key and will group other page data into two column families called text and revision. The text column family is the same as before; we expect each row to have exactly one column, qualified by the empty string (''), to hold the article contents. The job of the revision column family is to hold other revision-specific data, such as the author and commit comment.

Defaults

We created the wiki table with no special options, so all the HBase default values were used. One such default is the number of VERSIONS kept for each column value (three in older HBase releases, one since version 0.96), so let’s increase that. To make schema changes, first we have to take the table offline with the disable command.

 hbase> disable 'wiki'
 0 row(s) in 1.0930 seconds

Now we can modify column family characteristics using the alter command.

 hbase> alter 'wiki', { NAME => 'text', VERSIONS =>
 hbase* org.apache.hadoop.hbase.HConstants::ALL_VERSIONS }
 0 row(s) in 0.0430 seconds

Here, we’re instructing HBase to alter the text column family’s VERSIONS attribute. There are a number of other attributes we could have set, some of which we’ll discuss in Day 2. The hbase* line means that it’s a continuation of the previous line.
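To see what a VERSIONS limit means for stored data, consider this toy Python model, which compares a cell capped at three versions with an uncapped one. The put_version helper is invented for illustration, and it trims eagerly for simplicity; as far as we know, real HBase discards surplus versions lazily, during compaction.

```python
def put_version(versions, timestamp, value, max_versions=None):
    """Store a value under a timestamp, then drop the oldest surplus entries."""
    versions[timestamp] = value
    if max_versions is not None:
        # Everything except the newest max_versions timestamps is surplus.
        for old in sorted(versions)[:-max_versions]:
            del versions[old]
    return versions


capped = {}
unlimited = {}
for ts, text in enumerate(["v1", "v2", "v3", "v4"], start=1):
    put_version(capped, ts, text, max_versions=3)
    put_version(unlimited, ts, text)

print(sorted(capped))     # [2, 3, 4]
print(sorted(unlimited))  # [1, 2, 3, 4]
```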

Altering a Table

Operations that alter column family characteristics can be very expensive because HBase has to create a new column family with the chosen specifications and then copy all the data over. In a production system, this may incur significant downtime. For this reason, the sooner you settle on column family options the better.

With the wiki table still disabled, let’s add the revision column family, again using the alter command:

 hbase> alter 'wiki', { NAME => 'revision', VERSIONS =>
 hbase* org.apache.hadoop.hbase.HConstants::ALL_VERSIONS }
 0 row(s) in 0.0660 seconds

Just as before, with the text family, we’re only adding a revision column family to the table schema, not individual columns. Though we expect each row to eventually contain a revision:author and revision:comment, it’s up to the client to honor this expectation; it’s not written into any formal schema. If someone wants to add a revision:foo for a page, HBase won’t stop them.
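The split between what the schema enforces and what it leaves to clients can be sketched in a few lines of Python. This is a toy model with invented names (ToyTable, and the metadata family used to trigger the failure), and the KeyError merely stands in for whatever error HBase reports for an unknown column family.

```python
class ToyTable:
    """Toy model: column families are fixed by the schema, qualifiers are not."""

    def __init__(self, *families):
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, column, value):
        family, qualifier = column.split(":", 1)
        if family not in self.families:
            # Stand-in for the error HBase reports for an unknown family.
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[column] = value


wiki = ToyTable("text", "revision")
wiki.put("Home", "revision:author", "jimbo")
wiki.put("Home", "revision:foo", "anything goes")  # unexpected qualifier, accepted

try:
    wiki.put("Home", "metadata:tags", "nope")      # family not in the schema
except KeyError as err:
    print("rejected:", err)
```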

Moving On

With these additions in place, let’s reenable our wiki:

 hbase> enable 'wiki'
 0 row(s) in 0.0550 seconds

Now that our wiki table has been modified to support our growing requirements list, we can start adding data to columns in the revision column family.

Adding Data Programmatically

As you’ve seen, the HBase shell is great for tasks such as manipulating tables. Sadly, the shell’s data insertion support isn’t the best. The put command allows you to set only one column value at a time, and in our newly updated schema, we need to add multiple column values simultaneously so they all share the same timestamp. We’re going to need to start scripting.

The following script can be executed directly in the HBase shell because the shell is also a JRuby interpreter. When run, it adds a new version of the text for the Home page, setting the author and comment fields at the same time. JRuby runs on the Java virtual machine (JVM), giving it access to the HBase Java code. These examples will not work with non-JVM Ruby.

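 import 'org.apache.hadoop.hbase.client.HTable'
 import 'org.apache.hadoop.hbase.client.Put'

 # Convert each argument to a Java byte array
 def jbytes(*args)
   args.map {|arg| arg.to_s.to_java_bytes}
 end

 # Open the wiki table using the shell's configuration
 table = HTable.new(@hbase.configuration, "wiki")

 # Stage a write to the row with key "Home"
 p = Put.new(*jbytes("Home"))

 # add(column_family, column_qualifier, value)
 p.add(*jbytes("text", "", "Hello world"))
 p.add(*jbytes("revision", "author", "jimbo"))
 p.add(*jbytes("revision", "comment", "my first edit"))

 # Send all three column values in one operation
 table.put(p)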

The import lines bring references to useful HBase classes into the shell. This saves us from having to write out the full namespace later. Next, the jbytes function takes any number of arguments and returns them as an array of Java byte arrays, as the HBase API methods demand.

After that, we create a local variable (table) pointing to our wiki table, using the @hbase administration object for configuration information.

Next, we stage a commit operation by creating and preparing a new instance of a Put object, which takes the row to be modified. In this case, we’re sticking with the Home page we’ve been working with thus far. Finally, we add properties to our Put instance and then call on the table object to execute the put operation we’ve prepared. The add method has several forms; in our case, we used the three-argument version: add(column_family, column_qualifier, value).

Why Column Families?

You may be tempted to build your whole structure without column families. Why not just store all of a row’s data in a single column family? That solution would be simpler to implement. But there are downsides to avoiding column families. One of them is that you’d miss out on fine-grained performance tuning. Each column family’s performance options are configured independently. These settings affect things such as read and write speed and disk space consumption.

The other advantage to keep in mind is that column families are stored in different directories. When reading row data in HBase, you can potentially target your reads to specific column families within the row and thus avoid unnecessary cross-directory lookups, which can provide a speed boost, especially in read-heavy workloads.

All operations in HBase are atomic at the row level. No matter how many columns are affected, the operation will have a consistent view of the particular row being accessed or modified. This design decision helps clients reason intelligently about the data.

Our put operation affects several columns and doesn’t specify a timestamp, so all column values will have the same timestamp (the current time in milliseconds). Let’s verify by invoking get.

 hbase> get 'wiki', 'Home'
 COLUMN CELL
  revision:author timestamp=1296462042029, value=jimbo
  revision:comment timestamp=1296462042029, value=my first edit
  text: timestamp=1296462042029, value=Hello world
 3 row(s) in 0.0300 seconds

As you can see, each column value listed previously has the same timestamp.
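The reason the timestamps match can be illustrated with a toy Python model of a staged write: the timestamp is chosen once, when the whole operation is applied, rather than once per column. ToyPut and apply_put are invented names for this sketch and are not the HBase Put class.

```python
import time


class ToyPut:
    """Toy stand-in for a staged write; not the HBase Put class."""

    def __init__(self, row_key):
        self.row_key = row_key
        self.cells = []

    def add(self, family, qualifier, value):
        self.cells.append((f"{family}:{qualifier}", value))


def apply_put(table, put, timestamp=None):
    """Write every staged cell under a single timestamp."""
    if timestamp is None:
        timestamp = int(time.time() * 1000)  # current time in milliseconds
    row = table.setdefault(put.row_key, {})
    for column, value in put.cells:
        row.setdefault(column, {})[timestamp] = value
    return timestamp


wiki = {}
edit = ToyPut("Home")
edit.add("text", "", "Hello world")
edit.add("revision", "author", "jimbo")
edit.add("revision", "comment", "my first edit")
apply_put(wiki, edit)

timestamps = {ts for versions in wiki["Home"].values() for ts in versions}
print(len(timestamps))  # 1: all three columns share one timestamp
```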

Day 1 Wrap-Up

Today, you got a firsthand look at a running HBase server. You learned how to configure it and monitor log files for troubleshooting. And using the HBase shell you performed basic administration and data manipulation tasks.

In providing a basic data model for a wiki storage engine, you explored schema design in HBase. You learned how to create tables and manipulate column families. Designing an HBase schema means making choices about column family options and, just as important, our semantic interpretation of features such as timestamps and row keys.

You also started poking around in the HBase Java API by executing JRuby code in the shell. In Day 2, you’ll take this a step further, using the shell to run custom scripts for big jobs such as data import.

At this point, we hope you’ve been able to uncouple your thinking from relational database terms such as table, row, and column. By all means, don’t forget those terms; just suspend their meaning in your head for a while longer, as the difference between how HBase uses these terms and what they mean in other systems will become even starker as we delve deeper into HBase’s features.

Day 1 Homework

HBase documentation online generally comes in two flavors: extremely technical and nonexistent. There are some decent “getting started” guides out there, but there’s a chance you may need to spend some time trawling through Javadoc or source code to find answers.

Find

  1. Figure out how to use the shell to do the following:

    • Delete individual column values in a row
    • Delete an entire row
  2. Bookmark the HBase API documentation for the version of HBase you’re using.

Do

  1. Create a function called put_many that creates a Put instance, adds any number of column-value pairs to it, and commits it to a table. The signature should look like this:

     def put_many(table_name, row, column_values)
       # your code here
     end
  2. Define your put_many function by pasting it in the HBase shell, and then call it like so:

     hbase> put_many 'wiki', 'Some title', {
     hbase* "text:" => "Some article text",
     hbase* "revision:author" => "jschmoe",
     hbase* "revision:comment" => "no comment" }