Processing collections concurrently

As mentioned in the introduction, this chapter's recipes rely on the features of the GPars framework.

In this recipe, we take a look at the Parallelizer, which is the common GPars term that refers to Parallel Collections. These are a number of additional methods added by GPars to the Groovy collection framework, which enable data parallelism techniques.

Getting ready

We will start with setting up the Gradle build (see the Integrating Groovy into the build process using Gradle recipe in Chapter 2, Using Groovy Ecosystem) and the folder structure that we will reuse across the recipes of this chapter. In a new folder aptly called parallel, create a build.gradle file having the following content:

apply plugin: 'groovy'

repositories {
  mavenCentral()
}

dependencies {
  compile 'org.codehaus.groovy:groovy-all:2.1.6'
  compile 'org.codehaus.gpars:gpars:1.0.0'
  compile 'com.google.guava:guava:14.0.1'
  compile group: 'org.codehaus.groovy.modules.http-builder',
    name: 'http-builder', version: '0.6'
  compile('org.multiverse:multiverse-beta:0.7-RC-1') {
    transitive = false
  }
  testCompile 'junit:junit:4.+'
  testCompile 'edu.stanford.nlp:stanford-corenlp:1.3.5'
}

Some dependencies may appear obscure, but each one is explained in the recipe that uses it. The GPars dependency is declared right after the Groovy one (note that the Groovy binary distribution already ships with GPars 1.0.0 in its lib folder).

Before delving into the code, we also need to create a folder structure to hold the classes and the tests. Create the following structure in the same folder where the build file resides:

src/main/groovy/org/groovy/cookbook

src/test/groovy/org/groovy/cookbook

How to do it...

In the following steps, we are going to fill our sample project structure with code.

Let's create a unit test named ParallelizerTest.groovy in the new src/test/groovy/org/groovy/cookbook directory. The unit test class will contain tests in which we sample the various parallel methods available from the Parallelizer framework:

package org.groovy.cookbook

import static groovyx.gpars.GParsPool.*

import org.junit.*

import edu.stanford.nlp.process.PTBTokenizer
import edu.stanford.nlp.process.CoreLabelTokenFactory
import edu.stanford.nlp.ling.CoreLabel

class ParallelizerTest {

  static words = []

  ...

}

Now we add a couple of test setup methods that generate a large collection of test data:

@BeforeClass
static void loadDict() {
  def libraryUrl = 'http://www.gutenberg.org/cache/epub/'
  def bookFile = '17405/pg17405.txt'
  def bigText = "${libraryUrl}${bookFile}".toURL()
  words = tokenize(bigText.text)
}

static tokenize(String txt) {
  List<String> words = []
  PTBTokenizer ptbt = new PTBTokenizer(
    new StringReader(txt),
    new CoreLabelTokenFactory(),
    ''
  )
  ptbt.each { entry ->
    words << entry.value()
  }
  words
}

And finally, add some tests:

@Test
void testParallelEach() {
  withPool {
    words.eachParallel { token ->
      if (token.length() > 10 &&
      !token.startsWith('http')) {
        println token
      }
    }
  }
}

@Test
void testEveryParallel() {
  withPool {
    assert !(words.everyParallel { token ->
      token.length() > 20
    })
  }
}

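@Test
void combinedParallel() {
  withPool {
    // Keep long tokens that are not URLs
    def longWords = words.findAllParallel {
      it.length() > 10 && !it.startsWith('http')
    }
    // Group the remaining tokens by their length
    def byLength = longWords.groupByParallel { it.length() }
    // Turn every group into one line of text
    def summary = byLength.collectParallel {
      "WORD LENGTH ${it.key}: " + it.value*.toLowerCase().unique()
    }
    println summary
  }
}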

How it works...

In this test, we sample some of the methods available through the GParsPool class. This class uses a fork/join-based thread pool (see http://en.wikipedia.org/wiki/Fork%E2%80%93join_queue) to provide parallel variants of the common Groovy iteration methods such as each, collect, findAll, and others.
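
As a minimal sketch of what such a parallel variant looks like outside the test class, the following script compares collect with collectParallel. The sample list and the @Grab line are assumptions made for illustration (the version simply mirrors the one in the build file):

```groovy
// Assumption: GPars is resolved through Grape; the version mirrors build.gradle.
@Grab('org.codehaus.gpars:gpars:1.0.0')
import groovyx.gpars.GParsPool

// Hypothetical sample data standing in for the tokens of the recipe.
def sample = ['groovy', 'gpars', 'parallel', 'collections']

// Sequential baseline using the standard Groovy method.
def sequential = sample.collect { it.toUpperCase() }

def inParallel
GParsPool.withPool {
  // Same transformation, executed on the threads of the pool.
  inParallel = sample.collectParallel { it.toUpperCase() }
}

// Both variants are expected to produce the same elements in the same order.
assert inParallel == sequential
println inParallel
```

The closure passed to collectParallel has no side effects, which is what makes it safe to run on several threads at once.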

The tokenize method, added in the second step, splits a large text downloaded from the Internet into a list of "tokens". To perform this operation, we use the NLP (Natural Language Processing) library from Stanford University, which tokenizes English text quickly and reliably. What really counts here is that we can quickly create a large List of values on which to test some parallel methods. The downloaded text comes from the Gutenberg project website, a large repository of literary works stored in plain text. We have already used files from the Gutenberg project in the Defining data structures as code in Groovy recipe in Chapter 3, Using Groovy Language Features, and in the Processing every word in a text file recipe in Chapter 4, Working with Files in Groovy.

All the tests require the GParsPool class. The withPool method is statically imported for brevity.
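
The parallel methods are only added to collections inside the withPool block. The following sketch, which is not part of the recipe's test class, also passes an explicit pool size to withPool; the size of 4 and the sample numbers are arbitrary assumptions:

```groovy
// Assumption: GPars is resolved through Grape; the version mirrors build.gradle.
@Grab('org.codehaus.gpars:gpars:1.0.0')
import groovyx.gpars.GParsPool

// Hypothetical sample data.
def numbers = (1..100).toList()

def allPositive
def big
// Assumption: four worker threads; pick a size that suits the hardware.
GParsPool.withPool(4) {
  allPositive = numbers.everyParallel { it > 0 }
  big = numbers.findAllParallel { it > 95 }
}
assert allPositive
assert (big as Set) == ([96, 97, 98, 99, 100] as Set)

// Outside the block the *Parallel methods are no longer available.
def failedOutside = false
try {
  numbers.everyParallel { it > 0 }
} catch (MissingMethodException expected) {
  failedOutside = true
}
assert failedOutside
```

Calling withPool without arguments, as the recipe does, lets GPars choose the pool size.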

The first test uses eachParallel to traverse the List and print the tokens if a certain condition is met. On an 8-core processor, this method is between 35 percent and 45 percent faster than the sequential equivalent.
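
To get a feel for such a comparison on your own machine, a rough timing sketch could look as follows. The generated tokens, the time helper, and the counters are hypothetical and only serve the illustration; the measured numbers will vary with the hardware and the JVM warm-up, so no particular speedup should be expected from this script:

```groovy
// Assumption: GPars is resolved through Grape; the version mirrors build.gradle.
@Grab('org.codehaus.gpars:gpars:1.0.0')
import groovyx.gpars.GParsPool
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in data; the recipe uses tokens from a downloaded book.
def tokens = (1..200000).collect { "token${it}".toString() }

// Hypothetical helper: runs a closure and returns the elapsed milliseconds.
def time = { Closure work ->
  long start = System.nanoTime()
  work()
  (System.nanoTime() - start).intdiv(1000000)
}

// Thread-safe counters, because eachParallel calls the closure from many threads.
def seqCount = new AtomicInteger()
def parCount = new AtomicInteger()

def seqMs = time {
  tokens.each { if (it.length() > 10) seqCount.incrementAndGet() }
}
def parMs = time {
  GParsPool.withPool {
    tokens.eachParallel { if (it.length() > 10) parCount.incrementAndGet() }
  }
}

// Both traversals must visit the same tokens, whatever the timing.
assert seqCount.get() == parCount.get()
println "sequential: ${seqMs} ms, parallel: ${parMs} ms"
```

Note that eachParallel gives no guarantee about the order in which the closure is invoked, so any shared state touched inside it has to be thread-safe.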

The second test uses everyParallel to check, in parallel, whether every token is longer than 20 characters; the assertion expects this not to be the case.

The third test shows a slightly more complex usage of the Parallelizer API and demonstrates how to combine several methods to aggregate data. The list is first filtered by word length (tokens starting with http are also discarded), then grouped by token length with groupByParallel, and finally collectParallel turns each group into a line of text. The test prints something similar to the following:

[WORD LENGTH 22: [-LRB-801-RRB- 596-1887],
 WORD LENGTH 20: [trademark/copyright],
 WORD LENGTH 19: [straightforwardness],
 WORD LENGTH 18: [commander-in-chief,
                  [email protected]],...

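The groupByParallel call aggregates the filtered tokens into a Map, where the key is the word length and the value is a List of the words of that length found in the text. The collectParallel call then formats each entry of that Map as a string, with the words lowercased and deduplicated.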
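
The shape of the intermediate Map and of the final result is easier to see on a tiny input. The following sketch uses the sequential Groovy methods (findAll, groupBy, and collect) rather than their parallel variants, so it runs without GPars; the sample tokens are made up for illustration:

```groovy
// Hypothetical sample tokens; the recipe reads them from a downloaded book.
def tokens = ['Straightforwardness', 'commander-in-chief',
              'Commander-in-Chief', 'short', 'http://example.org/path']

// Step 1: keep long tokens that are not URLs.
def longWords = tokens.findAll { it.length() > 10 && !it.startsWith('http') }

// Step 2: groupBy yields a Map of word length to the words of that length.
def byLength = longWords.groupBy { it.length() }
assert byLength instanceof Map
assert byLength[18] == ['commander-in-chief', 'Commander-in-Chief']

// Step 3: collect turns every Map entry into one line of text,
// lowercasing the words and removing duplicates.
def summary = byLength.collect {
  "WORD LENGTH ${it.key}: " + it.value*.toLowerCase().unique()
}
summary.each { println it }
```

Swapping in the parallel variants inside a withPool block, as the recipe's third test does, keeps the same three-step structure.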

There's more...

In this short recipe, we only tried out a few of the many "parallel" methods. In the following recipes, we will see more examples of the Parallelizer in action. For a complete list of parallel operations, refer to the Javadoc page of GParsPoolUtil: http://gpars.org/1.0.0/javadoc/groovyx/gpars/GParsPoolUtil.html.
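
As a small taste of those other operations, the sketch below tries anyParallel, maxParallel, and sumParallel on a hypothetical list of word lengths. It assumes these three methods are available in the GPars version in use; check the Javadoc page mentioned above before relying on them:

```groovy
// Assumption: GPars is resolved through Grape; the version mirrors build.gradle.
@Grab('org.codehaus.gpars:gpars:1.0.0')
import groovyx.gpars.GParsPool

// Hypothetical sample data: the lengths of four words, that is [4, 4, 12, 4].
def lengths = ['fork', 'join', 'parallelizer', 'pool'].collect { it.length() }

def hasLong
def longest
def total
GParsPool.withPool {
  hasLong = lengths.anyParallel { it > 10 }   // is at least one word long?
  longest = lengths.maxParallel()             // largest length
  total = lengths.sumParallel()               // sum of all lengths
}

assert hasLong
assert longest == 12
assert total == 24
```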
