As mentioned in the introduction, the recipes in this chapter rely on the features of the GPars framework.
In this recipe, we take a look at the Parallelizer, which is the common GPars term for Parallel Collections. These are a number of additional methods that GPars adds to the Groovy collection framework to enable data parallelism techniques.
We will start by setting up the Gradle build (see the Integrating Groovy into the build process using Gradle recipe in Chapter 2, Using Groovy Ecosystem) and the folder structure that we will reuse across the recipes of this chapter. In a new folder aptly called parallel, create a build.gradle file with the following content:
apply plugin: 'groovy'

repositories {
  mavenCentral()
}

dependencies {
  compile 'org.codehaus.groovy:groovy-all:2.1.6'
  compile 'org.codehaus.gpars:gpars:1.0.0'
  compile 'com.google.guava:guava:14.0.1'
  compile group: 'org.codehaus.groovy.modules.http-builder',
          name: 'http-builder', version: '0.6'
  compile('org.multiverse:multiverse-beta:0.7-RC-1') {
    transitive = false
  }
  testCompile 'junit:junit:4.+'
  testCompile 'edu.stanford.nlp:stanford-corenlp:1.3.5'
}
Some dependencies may appear obscure, but each one is explained in the recipe that uses it. The GPars dependency is declared right after the Groovy one (note that the Groovy distribution is already packaged with GPars 1.0.0, located in the lib folder of Groovy's binary distribution).
Before delving into the code, we also need to create a folder structure to hold the classes and the tests. Create the following structure in the same folder where the build file resides:
src/main/groovy/org/groovy/cookbook
src/test/groovy/org/groovy/cookbook
In the following steps, we are going to fill our sample project structure with code.
Create a new unit test file, ParallelTest.groovy, in the new src/test/groovy/org/groovy/cookbook directory. The unit test class will contain tests in which we sample the various parallel methods available from the Parallelizer framework:

package org.groovy.cookbook

import static groovyx.gpars.GParsPool.*

import org.junit.*
import edu.stanford.nlp.process.PTBTokenizer
import edu.stanford.nlp.process.CoreLabelTokenFactory
import edu.stanford.nlp.ling.CoreLabel

class ParallelizerTest {

  static words = []

  ...

}
Next, add to the test class a setup method that downloads a large text, together with the tokenize helper that splits it into words:

@BeforeClass
static void loadDict() {
  def libraryUrl = 'http://www.gutenberg.org/cache/epub/'
  def bookFile = '17405/pg17405.txt'
  def bigText = "${libraryUrl}${bookFile}".toURL()
  words = tokenize(bigText.text)
}

static tokenize(String txt) {
  List<String> words = []
  PTBTokenizer ptbt = new PTBTokenizer(
    new StringReader(txt),
    new CoreLabelTokenFactory(),
    ''
  )
  ptbt.each { entry ->
    words << entry.value()
  }
  words
}
Finally, add the tests themselves:

@Test
void testParallelEach() {
  withPool {
    words.eachParallel { token ->
      if (token.length() > 10 &&
          !token.startsWith('http')) {
        println token
      }
    }
  }
}

@Test
void testEveryParallel() {
  withPool {
    assert !(words.everyParallel { token ->
      token.length() > 20
    })
  }
}

@Test
void combinedParallel() {
  withPool {
    println words
      .findAllParallel {
        it.length() > 10 && !it.startsWith('http')
      }
      .groupByParallel { it.length() }
      .collectParallel {
        "WORD LENGTH ${it.key}: " +
          it.value*.toLowerCase().unique()
      }
  }
}
In this test, we sample some of the methods available through the GParsPool class. This class uses a "fork/join" based pool (see http://en.wikipedia.org/wiki/Fork%E2%80%93join_queue) to provide parallel variants of the common Groovy iteration methods such as each, collect, findAll, and others.
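To make the relationship between a standard method and its parallel variant concrete, here is a minimal sketch that is not part of the recipe's test class. It assumes that Grape can resolve the same GPars coordinate declared in build.gradle; the numbers list is made up for illustration.

```groovy
// Assumption: Grape can download GPars 1.0.0 (same coordinate as in build.gradle)
@Grab('org.codehaus.gpars:gpars:1.0.0')
import static groovyx.gpars.GParsPool.withPool

// Illustrative data, not taken from the recipe
def numbers = [1, 2, 3, 4, 5]

// Sequential version: the standard Groovy collect
def sequential = numbers.collect { it * it }

// Parallel version: collectParallel is called inside a withPool block,
// and withPool returns the value of its closure
def parallel = withPool {
    numbers.collectParallel { it * it }
}

// Copy the result into a plain list before comparing
def parallelList = parallel.toList()
assert sequential == [1, 4, 9, 16, 25]
assert parallelList == sequential
```

For a list this small the parallel version is unlikely to be faster; the point is only that both calls have the same shape and produce the same values.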
The tokenize method, in step 2, is used to split a large text downloaded from the Internet into a list of "tokens". To perform this operation, we use the NLP (Natural Language Processing) library from Stanford University, which provides fast and reliable tokenizing of English text. What really counts here is that we are able to quickly create a large List of values on which we can test some parallel methods. The downloaded text comes from the Project Gutenberg website, a large repository of literary works stored in plain text. We have already used files from Project Gutenberg in the Defining data structures as code in Groovy recipe in Chapter 3, Using Groovy Language Features, and in the Processing every word in a text file recipe in Chapter 4, Working with Files in Groovy.
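If the Stanford library or the network is not available while experimenting, any large list of strings will do. The following sketch is a hypothetical stand-in (simpleTokenize is not part of the recipe) that splits on whitespace only. It is not PTB tokenization: punctuation stays attached to the words, so the resulting tokens differ from those produced by PTBTokenizer.

```groovy
// Hypothetical stand-in for the recipe's tokenize method.
// It only splits on whitespace and does NOT reproduce PTBTokenizer's output.
List<String> simpleTokenize(String txt) {
    // Split on runs of whitespace and drop empty fragments
    txt.split(/\s+/).findAll { it } as List<String>
}

// Illustrative sentence, not taken from the downloaded book
def sample = 'The commander-in-chief spoke with straightforwardness.'
def sampleTokens = simpleTokenize(sample)

// Note the trailing period that remains attached to the last token
assert sampleTokens == ['The', 'commander-in-chief', 'spoke', 'with', 'straightforwardness.']
```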
All the tests require the GParsPool class. The withPool method is statically imported for brevity.
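For comparison, the same method can be called without the static import. The sketch below also passes an explicit pool size to withPool; the value 2 and the sample word list are arbitrary choices for illustration, and the @Grab coordinate is assumed to be resolvable.

```groovy
// Assumption: Grape can download GPars 1.0.0 (same coordinate as in build.gradle)
@Grab('org.codehaus.gpars:gpars:1.0.0')
import groovyx.gpars.GParsPool

// Illustrative data, not taken from the recipe
def sampleWords = ['alpha', 'beta', 'gamma', 'delta']

// Qualified call with an explicit pool size of 2 threads (arbitrary value)
def lengths = GParsPool.withPool(2) {
    sampleWords.collectParallel { it.length() }
}

// collectParallel keeps the results in the order of the source list
def lengthList = lengths.toList()
assert lengthList == [5, 4, 5, 5]
```

Calling withPool without an argument, as the tests do, lets GPars choose a default pool size.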
The first test uses eachParallel to traverse the List and print the tokens if a certain condition is met. On an 8-core processor, this method runs between 35 and 45 percent faster than the sequential equivalent. The second test uses everyParallel to assert that not every token in the text is longer than 20 characters.
The third test shows a slightly more complex usage of the Parallelizer API and demonstrates how to combine several methods to aggregate data. The list is first filtered by word length, the remaining tokens are then grouped by their length, and finally the collectParallel method transforms each group into a formatted string. This test prints output similar to the following:
[WORD LENGTH 22: [-LRB-801-RRB- 596-1887], WORD LENGTH 20: [trademark/copyright], WORD LENGTH 19: [straightforwardness], WORD LENGTH 18: [commander-in-chief, [email protected]],...
The groupByParallel step aggregates the original list of tokens into a Map, where the key is the word length and the value is a List of the words of that length found in the text.
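The same pipeline can be tried on a handful of words to see the shape of the result. This is a sketch under stated assumptions: the token list is made up, the @Grab coordinate is assumed to be resolvable, and a sort() call is added (it is not in the recipe's test) because the order of groups and of words within a group is not guaranteed when they are built in parallel.

```groovy
// Assumption: Grape can download GPars 1.0.0 (same coordinate as in build.gradle)
@Grab('org.codehaus.gpars:gpars:1.0.0')
import static groovyx.gpars.GParsPool.withPool

// Illustrative tokens, not taken from the downloaded book
def sampleTokens = ['Parallelism', 'parallelism', 'Concurrency',
                    'straightforwardness', 'data', 'http://example.org/x']

def summary = withPool {
    sampleTokens
        // Keep long words that are not URLs
        .findAllParallel { it.length() > 10 && !it.startsWith('http') }
        // Map of word length -> list of words with that length
        .groupByParallel { it.length() }
        // One formatted string per map entry; sort() is added here for stable output
        .collectParallel {
            "WORD LENGTH ${it.key}: " + it.value*.toLowerCase().unique().sort()
        }
}

// Sort the lines as well, since the order of the groups is not guaranteed
def lines = summary.collect { it.toString() }.sort()
assert lines == ['WORD LENGTH 11: [concurrency, parallelism]',
                 'WORD LENGTH 19: [straightforwardness]']
```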
In this short recipe, we tried out only a few of the many "parallel" methods. In the following recipes, we will see more examples of the Parallelizer in action. For a complete list of parallel operations, refer to the Javadoc page of GParsPoolUtil: http://gpars.org/1.0.0/javadoc/groovyx/gpars/GParsPoolUtil.html.