Day 3: Playing with Other Databases

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Day 3: Playing with Other Databases

Today we’re wrapping up our final database chapter by inviting some other databases from the book to the Redis party. Amongst the databases we’ve gone over in this book, Redis is the most lightweight and modular, capable of enhancing just about any complex data setup without imposing a ton of engineering and ops overhead. Redis will hold a starring role in today’s exercise by making our interaction with other databases faster and easier.

We’ve learned throughout this book that different databases have different strengths, so many modern systems have moved toward a polyglot persistence model, where many databases each play a role in the system. You will learn how to build one of these projects using CouchDB as the system of record (the canonical data source), Neo4j to handle data relationships, and Redis to help with data population and caching. Consider this your final exam for the book.

Note that our choice of databases for this last project should not be seen as the authors’ endorsement of any specific set of databases, languages, or frameworks over another but rather a showcase of how multiple databases can work together, leveraging the capabilities of each in pursuit of a single goal.

A Polyglot, Persistent Service

Our polyglot persistence service will act as the backend for a web interface that provides information about bands. We want to store a list of musical band names, the artists who performed in those bands, and any number of roles each artist played in the band, from lead singer to backup keytar player to people who just dance (like the guy in Mighty Mighty Bosstones). Each of the three databases—Redis, CouchDB, and Neo4j—will handle a different aspect of our band data management system.

Like the growing phenomenon of polyglot programming, polyglot persistence is now well established in most corners of the database universe.

If you are unfamiliar with the practice, polyglot programming is what happens when a team uses more than one programming language in a single project. Contrast this with the convention of using one general-purpose language throughout a project or even an entire company (you may have heard of “Java shops” or “Ruby shops”). Polyglot programming is useful because each language has different inherent strengths. A language like Scala may be better suited for large “Big Data” processing tasks, whereas a language like Ruby may be friendlier for shell scripts. Used together, they create a synergy. The rise of so-called microservice architectures, in which services written in different languages interact with each other to form complex, unified systems, is a logical outcome of polyglot programming.

Similar to its language-centric cousin, polyglot persistence happens when you use the strengths of many kinds of database in the same system, as opposed to relying on a single database (often a relational database). A basic variant of this is already quite common: using a key-value store (such as Redis) that acts as a cache for queries against a relatively slower relational database (such as PostgreSQL). Relational models, as we’ve seen in previous chapters, are ill-suited for a growing host of problems, such as graph traversal. But even these new databases shine only as a few stars in the full galaxy of requirements.

Why the sudden interest in polyglot everything? Nearly a decade ago, Martin Fowler noted^[62] that having a single central database that multiple applications could use as a source of truth was a common pattern in software design. This once-popular database integration pattern has given way to a middleware layer pattern, where multiple applications instead communicate to a service layer over HTTP or a lighter protocol, such as Thrift or Protocol Buffers. This frees up the middleware service itself to rely on any number of databases or, in the case of polyglot persistence, many types of database.

Redis plays three important roles in our system: It assists in populating data to CouchDB, it acts as a cache for recent Neo4j changes, and it enables quick lookup for partial value searches. Its speed and ability to store multiple data formats make it well suited for data population, and its built-in expiry policies are perfect for handling cached data.

CouchDB is our system of record (SOR), or authoritative data source. CouchDB’s document structure is an easy way to store band data because it allows for nested artist and role information, and we will take advantage of the changes API^[63] in CouchDB to keep our third data source in sync.

Neo4j is our relationship store. Although querying the CouchDB SOR directly is perfectly reasonable, a graph database affords us a simplicity and speed in walking node relationships that other databases have a difficult time matching. We’ll store relationships between bands, band members, and the roles the members play.

Each database has a specific role to play in our system, but they don’t natively communicate with another, which means that we need to build a translation layer between them. We’ll use Node.js to populate the databases, communicate between them, and act as a simple front-end server. Because gluing multiple databases together requires a bit of code, this last day will include much more code than we have seen so far in this book.

Population

The first item of business is to populate our databases with the necessary data. We take a two-phased approach here, by first populating Redis and then populating our CouchDB SOR. We’ll be using a large dataset holding information about bands, including the names of those bands, band members, which role each band member played, and more. This dataset contains a lot of information, but we’re interested only in extracting the member or artist name, the group or band name, and their roles in that band stored as a comma-separated list. For example, John Cooper played in the band Skillet as the Lead vocalist, Acoustic guitar player, and Bassist.

First, download the file holding the dataset:

$ curl -O [data-file-url]

Replace [data-file-url] with the URL in the footnote below.^[64]

This file contains a lot of information, but we are interested only in extracting the member or artist name, the group or band name, and their roles in that band stored as a comma-separated list. For example, John Cooper played in the band Skillet as the Lead vocalist, Acoustic guitar player, and Bassist.

/m/0654bxy John Cooper Skillet Lead vocalist,Acoustic guitar,Bass 1996

Ultimately we want to structure John Cooper and the other members of Skillet into a single CouchDB document like the following, stored at the URL http://localhost:5984/bands/Skillet:

	{
	"_id": "Skillet",
	"name": "Skillet"
	"artists": [
	{
	"name": "John Cooper",
	"role": [
	"Acoustic guitar",
	"Lead vocalist",
	"Bass"
	]
	},
	...
	{
	"name": "Korey Cooper",
	"role": [
	"backing vocals",
	"Synthesizer",
	"Guitar",
	"Keyboard instrument"
	]
	}
	]
	}

This file contains well over 100,000 band members and more than 30,000 bands. That’s not many, but it’s a good starting point to build your own system. Note that not all roles are documented for each artist. This is an incomplete dataset, but we can deal with that later.

Phase 1: Data Transformation

You may wonder why we bother populating Redis and don’t just dive right into populating CouchDB. Acting as an intermediary, Redis adds structure to the flat TSV data so that subsequent insertion into another database is fast. Because our plan is to create a single record per band name, Redis allows us to make a single pass through our TSV file (which lists the same band for each band member—each band member is represented in a line). Adding single members directly to CouchDB for each line in the file can cause update thrashing, where two band member lines attempt to create/update the same band document at the same time, forcing the system to reinsert when one of them fails CouchDB’s version check.

The catch with this strategy is that you’re limited to the constraints of Redis to hold an entire dataset in RAM—though this limit could be overcome by the simple consistent-hashing cluster you saw on Day 2.

With our data file in hand, ensure you have Node.js installed as well as the Node Package Manager (npm). Once that’s all done, we need to install three NPM projects: redis, csv, and hiredis (the optional Redis C-driver you learned about on Day 2 that can greatly speed up Redis interactions).

$ npm install hiredis redis csv-parser cradle

Then, check that your Redis server is running on the default port 6379, or alter each script’s createClient function to point to your Redis port.

You can populate Redis by running the following Node.js script in the same directory as your TSV file, which we assume is named group_membership.tsv. (All of the JavaScript files we’ll look at are fairly verbose, so we don’t show them in their entirety. All of the code can be downloaded from the Pragmatic Bookshelf website. Here we’ll just stick to the meat of each file.) Download and run the following file:

$ node prePopulate.js

This script basically iterates through each line of the TSV and extracts the artist name, the band name, and the roles they play in that band. Then it adds those values to Redis (skipping any blank values).

The format of each Redis band key is "band:Band Name". The script will add this artist name to the set of artist names. So, the key "band:Beatles" will contain the set of values ["John Lennon", "Paul McCartney", "George Harrison", "Ringo Starr"]. The artist keys will also contain the band name and similarly contain a set of roles. "artist:Beatles:Ringo Starr" will contain the set ["Drums"].

The other code just keeps track of how many lines we’ve processed and outputs the results to the screen.

redis/prePopulate.js

	var stream = csv({
	separator: 't',
	newline: 'n'
	});

	fs.createReadStream(tsvFilename)
	.pipe(stream)
	.on('data', function(data) {
	var
	artist = data['member'],
	band = data['group'],
	roles = buildRoles(data['role']);

	if (artist === '' \|\| band === '') {
	trackLineCount();
	return true;
	}

	redisClient.sadd('band:' + band, artist);

	if (roles.length > 0) {
	roles.forEach(function(role) {
	redisClient.sadd(`artist:${band}:${artist}`, role);
	});
	}

	trackLineCount();
	})
	.on('end', function(totalLines) {
	console.log(`Total lines processed: ${processedLines}`);
	redisClient.quit();
	});

Eric says:

Nonblocking Code

Before starting this book, we were only passingly familiar with writing event-driven nonblocking applications. Nonblocking means precisely that: rather than waiting for a long-running process to complete, the main code will continue executing. Whatever you need to do in response to a blocking event you put inside a function or code block to be executed later. This can be by spawning a separate thread or, in our case, implementing a reactor pattern event-driven approach.

In a blocking program, you can write code that queries a database, waits, and loops through the results.

	results = database.some_query()
	for value in results
	# do something with each value
	end
	# this is not executed until after the results are looped...

In an event-driven program, you would pass in the loop as a function/code block. While the database is doing its thing, the rest of the program can continue running. Only after the database returns the result does the function/code block get executed.

	database.some_query do \|results\|
	for value in results
	# do something with each value
	end
	end
	# this continues running while the database performs its query...

It took us quite some time to realize the benefits here. The rest of the program can run rather than sitting idle while it waits on the database, sure, but is this common? Apparently so, because when we began coding in this style, we noticed an order-of-magnitude decrease in latency.

We try to keep the code as simple as we can, but interacting with databases in a nonblocking way is an inherently complex process. But as we learned, it’s generally a very good method when dealing with databases. Nearly every popular programming language has some sort of nonblocking library. Ruby has EventMachine, Python has Twisted, Java has the NIO library, C# has Interlace, and, of course, JavaScript has Node.js.

You can test that the code has been populating Redis by launching redis-cli and executing RANDOMKEY. We should expect a key prefixed by band: or artist: … any value but (nil) is good.

Now that Redis is populated, proceed immediately to the next section. You could lose data if you turn Redis off, unless you chose to set a higher durability than the default or initiated a SAVE command.

Phase 2: SOR Insertion

CouchDB will play the role of our system of record (SOR). If any data conflicts arise between Redis, CouchDB, or Neo4j, CouchDB wins. A good SOR should contain all of the data necessary to rebuild any other data source in its domain.

Ensure CouchDB is running on the default port 5984, or change the require(’http’).createClient(5984, ’localhost’) line in the following code to the port number you require. Redis should also still be running from the previous section. Download and run the following file:

$ node populateCouch.js

Because phase 1 was all about pulling data from a TSV and populating Redis, this phase is all about pulling data from Redis and populating CouchDB. You don’t use any special drivers for CouchDB because it’s a simple REST interface and Node.js has a simple built-in HTTP library.

In the following block of code, we perform a Redis KEYS bands:* to get a list of all band names in our system. If we had a really big dataset, we could add more scoping (for example, bands:A* to get only band names starting with a, and so on). Then, for each of those bands we fetch the set of artists and extract the band name from the key by removing the prefix bands: from the key string.

redis/populateCouch.js

	redisClient.keys('band:', function*(err, bandKeys) {
	totalBands = bandKeys.length;
	var
	readBands = 0,
	bandsBatch = [];

	bandKeys.forEach(function(bandKey) {
	// substring of 'band:'.length gives us the band name
	var bandName = bandKey.substring(5);
	redisClient.smembers(bandKey, function(err, artists) {

Next, we get all of the roles for every artist in this band, which Redis returns as an array of arrays (each artists role is its own array). You can do this by batching up Redis SMEMBERS commands into an array called roleBatch and executing them in a single MULTI batch. Effectively, that would be executing a single pipelined request like this:

	MULTI
	SMEMBERS "artist:Beatles:John Lennon"
	SMEMBERS "artist:Beatles:Ringo Starr"
	EXEC

From there, a batch of 50 CouchDB documents is made. We build a batch of 50 because we then send the entire set to CouchDB’s /_bulk_docs command, allowing us very, very fast insertion.

redis/populateCouch.js

	redisClient.
	multi(roleBatch).
	exec(function(err, roles) {
	var
	i = 0,
	artistDocs = [];

	// build the artists sub-documents
	artists.forEach(function(artistName) {
	artistDocs.push({ name: artistName, role : roles[i++] });
	});

	// add this new band document to the batch to be executed later
	bandsBatch.push({
	_id: couchKeyify(bandName),
	name: bandName,
	artists: artistDocs
	});

With the population of the bands database, we now have in a single location all of the data our system requires. We know the names of many bands, the artists who performed in them, and the roles they played in those bands.

Now would be a good time to take a break and play around with our newly populated bands system of record in CouchDB at http://localhost:5984/_utils/database.html?bands.

Relationship Store

Next on the docket is our Neo4j service that we’ll use to track relationships between artists and the roles they play. We could certainly query CouchDB outright by creating views, but we are rather limited on complex queries based on relationships. If Wayne Coyne from the Flaming Lips loses his theremin before a show, he could ask Charlie Clouser from Nine Inch Nails, who also plays a theremin. Or we could discover artists who have many overlapping talents, even if they performed different roles in different bands—all with a simple node walk.

With our initial data in place, now we need to keep Neo4j in sync with CouchDB should any data ever change on our system of record. So, we’ll kill two birds by crafting a service that populates Neo4j with any changes to CouchDB since the database was created.

We also want to populate Redis with keys for our bands, artists, and roles so we can quickly access this data later. Happily, this includes all data that we’ve already populated in CouchDB, thus saving us a separate initial Neo4j and Redis population step.

Ensure that Neo4j is running on port 7474, or change the appropriate createClient function to use your correct port. CouchDB and Redis should still be running. Download and run the following file. This file will continue running until you shut it down.

$ node graphSync.js

This server just uses the continuous polling example we saw in the CouchDB chapter to track all CouchDB changes. Whenever a change is detected, we do two things: populate Redis and populate Neo4j. This code populates Redis by cascading callback functions. First it populates the band as "band-name:Band Name". It follows this pattern for artist name and roles.

This way, we can search with partial strings. For example, KEYS band-name:Bea* could return this: Beach Boys, Beastie Boys, Beatles, and so on.

redis/graphSync.js

	function feedBandToRedis(band) {
	redisClient.set(`band-name:${band.name}`, 1);
	band.artists.forEach(function(artist) {
	redisClient.set(`artist-name:${artist.name}`, 1);
	artist.role.forEach(function(role){
	redisClient.set(`role-name:${role}`, 1);

The next block is how we populate Neo4j. We created a driver that you can download as part of this book’s code, named neo4jCachingClient.js. It just uses Node.js’s HTTP library to connect to the Neo4j REST interface with a bit of rate-limiting built in so the client doesn’t open too many connections at once. Our driver also uses Redis to keep track of changes made to the Neo4j graph without having to initiate a separate query. This is our third separate use for Redis—the first being as a data transformation step for populating CouchDB, and the second we just saw earlier, to quickly search for band values.

This code creates band nodes (if they need to be created), then artist nodes (if they need to be created), and then roles. Each step along the way creates a new relationship, so The Beatles node will relate to John, Paul, George, and Ringo nodes, who in turn each relate to the roles they play.

redis/graphSync.js

	function feedBandToNeo4j(band, progress) {
	var
	lookup = neo4jClient.lookupOrCreateNode,
	relate = neo4jClient.createRelationship;

	lookup('bands', 'name', band.name, 'Band', function(bandNode) {
	progress.emit('progress', 'band');
	band.artists.forEach(function(artist) {
	lookup('artists', 'name', artist.name, 'Artist', function(artistNode) {
	progress.emit('progress', 'artist');
	relate(bandNode.metadata.id, artistNode.self, 'member', function() {
	progress.emit('progress', 'member');
	});
	artist.role.forEach(function(role){
	lookup('roles', 'role', role, 'Role', function(roleNode) {
	progress.emit('progress', 'role');
	relate(artistNode.metadata.id, roleNode.self, 'plays', function() {
	progress.emit('progress', 'plays');

Let this service keep running in its own window. Every update to CouchDB that adds a new artist or role to an existing artist will trigger a new relationship in Neo4j and potentially new keys in Redis. As long as this service runs, they should be in sync.

Open your CouchDB web console and open a band. Make any data change you want to the database: add a new band member (make yourself a member of the Beatles!), or add a new role to an artist. Keep an eye on the graphSync output. Then fire up the Neo4j console and try finding any new connections in the graph. If you added a new band member, they should now have a relationship with the band node or new role if that was altered. The current implementation does not remove relationships—though it would not be a complete modification to add a Neo4j DELETE operation to the script.

The Service

This is the part we’ve been building up to. We’re going to create a simple web application that allows users to search for a band. Any band in the system will list all of the band members as links, and any clicked band member link will list some information about the artist—namely, the roles they play. In addition, each role the artist plays will list every other artist in the system who also plays that role.

For example, searching for Led Zeppelin would give you Jimmy Page, John Paul Jones, John Bonham, and Robert Plant. Clicking Jimmy Page will list that he plays guitar and also many other artists who play guitar, like The Edge from U2.

To simplify our web app creation a bit, we’ll need two more node packages: bricks (a simple web framework) and mustache (a templating library).

$ npm install bricks mustache neo4j-driver

As in the previous sections, ensure you have all of the databases running, and then start up the server. Download and run the following code:

$ node band.js

The server is set to run on port 8080, so if you point your browser to http://localhost:8080/, you should see a simple search form.

Let’s take a look at the code that will build a web page that lists band information. Each URL performs a separate function in our little HTTP server. The first is at http://localhost:8080/band and accepts any band name as a parameter.

redis/bands.js

	appServer.addRoute("^/band$", function(req, res) {
	var
	bandName = req.param('name'),
	bandNodePath = couchUtil.couchKeyify(bandName),
	membersCypherQuery =
	`MATCH (Band {name: "${bandName}"})-[:member1..3]-(b:Band)`* +
	`RETURN DISTINCT b LIMIT 10`;

	getCouchDoc(bandNodePath, res, function(couchDoc) {
	var artists = couchDoc && couchDoc['artists'];

	cypher(membersCypherQuery, function(bandsGraphData) {
	var bands = [];
	bandsGraphData.data.forEach(function(band) {
	bands.push(band[0].data.name);
	});

	var values = { band: bandName, artists: artists, bands: bands };

	var template = `
	<h2>{{band}} Band Members</h2>
	<ul>
	{{#artists}}
	<li><a href="/artist?name={{name}}">{{name}}</a></li>
	{{/artists}}
	</ul>
	<h3>You may also like</h3>
	<ul>
	{{#bands}}
	<li><a href="/band?name={{.}}">{{.}}</a></li>
	{{/bands}}
	</ul>
	`;

	writeTemplate(res, template, values);
	});

If you enter the band Nirvana in the search form, your URL request will be http://localhost:8080/band?name=Nirvana. This function will render an HTML page (the overall template is in an external file named template.html). This web page lists all artists in a band, which it pulls directly from CouchDB. It also lists some suggested bands, which it retrieves from a Gremlin query against the Neo4j graph. The Gremlin query is like this for Nirvana:

g.V.filter{it.name=="Nirvana"}.out("member").in("member").dedup.name

Or in other words, from the Nirvana node, get all unique names whose members are connected to Nirvana members. For example, Dave Grohl played in Nirvana and the Foo Fighters, so Foo Fighters will be returned in this list.

The next action is the http://localhost:8080/artist URL. This page will output information about an artist.

redis/bands.js

	appServer.addRoute("^/artist$", function(req, res) {
	var
	artistName = req.param('name'),
	rolesCypherQuery = `MATCH (Artist {name: "${artistName}"})` +
	`-[:plays]-(r:Role) RETURN r`,
	bandsCypherQuery = `MATCH (Artist {name: "${artistName}"})` +
	`-[:member]-(b:Band) RETURN b`;

	cypher(rolesCypherQuery, function(rolesGraphData) {
	cypher(bandsCypherQuery, function(bandsGraphData) {
	var
	roles = [],
	bands = [];

	rolesGraphData.data.forEach(function(role) {
	roles.push(role[0].data.role);
	});

	bandsGraphData.data.forEach(function(band) {
	bands.push(band[0].data.name);
	});

	var values = { artist: artistName, roles: roles, bands: bands };

	var template = `
	<h3>{{artist}} Performs these Roles</h3>
	<ul>
	{{#roles}}
	<li>{{.}}</li>
	{{/roles}}
	</ul>
	<h3>Play in Bands</h3>
	<ul>
	{{#bands}}
	<li><a href="/band?name={{.}}">{{.}}</a></li>
	{{/bands}}
	</ul>
	`;
	writeTemplate(res, template, values);
	});
	});

Two Gremlin queries are executed here. This first outputs all roles a member plays, and the second is a list of bands that person played in. For example, Jeff Ward (http://localhost:8080/artist?name=Jeff%20Ward) would be listed as playing the role Drummer and in the bands Nine Inch Nails and Ministry.

A cool feature of the previous two pages is that we render links between these values. The artist list in the /bands page links to the chosen /artist page, and vice versa. But we could make searching a bit easier.

redis/bands.js

	appServer.addRoute("^/search$", function(req, res) {
	var query = req.param('term');

	redisClient.keys(`band-name:${query}`, function*(error, keys) {
	var bands = [];
	keys.forEach(function(key){
	bands.push(key.replace("band-name:", ''));
	});
	res.write(JSON.stringify(bands));
	res.end();

Here we just pull all keys from Redis that match the first part of the string, such as "Bea*" as described previously. The function then outputs the data as JSON. The template.html file links to the jQuery code necessary to make this function as an autocomplete feature on the rendered search box.

Expanding the Service

This is a fairly little script for all of the bare-bones work we’re doing here. You may find many places you want to extend. Notice that the band suggestion is only first-order bands (bands the current band’s members have performed in); you can get interesting results by writing a query to traverse second-order bands, like this: g.V.filter{it.name==’Nine Inch Nails’}.out(’member’).in(’member’).dedup.loop(3){ it.loops <= 2 }.name.

You may also note that we do not have a form where someone can update band information. Adding this functionality could be fairly simple because we already wrote CouchDB population code in the populateCouch.js script, and populating CouchDB will automatically keep Neo4j and Redis eventually consistent as long as the graph_sync.js service is running.

If you enjoy playing with this kind of polyglot persistence, you could take this even further. You could add a PostgreSQL data warehouse^[65] to transform this data into a star schema—allowing for different dimensions of analysis, such as most commonly played instrument or average numbers of total members in a band vs. total instruments. You could add a CouchDB server to store information about the music associated with each band, an HBase server to build a messaging system that enables users to keep track of their historical likes/dislikes, or a MongoDB extension to add a geographic element to this service.

Or, redesign this project entirely with a different language, web framework, or dataset. There are as many opportunities to extend this project as there are combinations of databases and technologies to create it—a Cartesian product of all open source.

Day 3 Wrap-Up

Today was a big day—so big, in fact, we wouldn’t be surprised if it took several days to complete. But this is a little taste of the future of data management systems, as the world strolls away from the one large relational database model to a several specialized databases model. We also glued these databases together with some nonblocking code, which, though not a focus of this book, also seems to be where database interaction is headed in the development space.

The importance of Redis in this model should not be missed. Redis certainly doesn’t provide any functionality these databases don’t supply individually, but it does supply speedy data structures. We were able to organize a flat file into a series of meaningful data structures, which is an integral part of both data population and transportation. And it did this in a fast and simple-to-use way.

Even if you’re not sold on the whole polyglot persistence model, you should certainly consider Redis for any system.

Day 3 Homework

Alter the importer steps to also track a band member’s start and end dates with the band. Track that data in the artist’s CouchDB subdocument. Display this information on the artist’s page.
Add MongoDB into the mix by storing a few music samples into GridFS, whereby users can hear a song or two related to a band. If any song exists for a band, add a link to the web app.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Day 3: Playing with Other Databases

Create new playlist

Sign In

Sign Up

Day 3: Playing with Other Databases

A Polyglot, Persistent Service

Population

Phase 1: Data Transformation

Phase 2: SOR Insertion

Relationship Store

The Service

Expanding the Service

Day 3 Wrap-Up

Day 3 Homework

Table of Contents for
Day 3: Playing with Other Databases