This recipe uses Accumulo's built-in RegExFilter class to return only key-value pairs whose value under the src qualifier matches a particular source. The filtering is distributed across the TabletServers that host the acled table.
This recipe is easiest to test on a pseudo-distributed Hadoop cluster with Accumulo 1.4.1 and ZooKeeper 3.3.3 installed. The shell script in this recipe assumes that ZooKeeper is running on the host localhost and on port 2181; you can change this to suit your environment. The Accumulo installation's bin folder needs to be on your environment path.
For this recipe you'll need to create an Accumulo instance named test with the user root and the password password.
To see the filtered results from this recipe, you will need to complete the Using MapReduce to bulk import geographic event data into Accumulo recipe listed earlier in this chapter. This will give you some sample data to experiment with.
Follow these steps to use the Regex filtering iterator:
Create a build that produces a JAR file named accumulo-examples.jar.
Create the package examples.accumulo and add the class SourceFilterMain.java with the following content:

    package examples.accumulo;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.user.RegExFilter;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    import java.util.HashMap;
    import java.util.Map;

    public class SourceFilterMain {

        // Table, column family, and qualifier that hold the source of each event
        public static final String TEST_TABLE = "acled";
        public static final Text COLUMN_FAMILY = new Text("cf");
        public static final Text SRC_QUAL = new Text("src");
The main() method handles argument parsing and querying with the filter:

        public static void main(String[] args) throws Exception {
            if (args.length < 5) {
                System.err.println("usage: <src> <instance name> <user> <password> <zookeepers>");
                System.exit(1); // non-zero exit status signals the usage error
            }
            String src = args[0];
            String instanceName = args[1];
            String user = args[2];
            String pass = args[3];
            String zooQuorum = args[4];

            // Connect to the Accumulo instance through ZooKeeper
            ZooKeeperInstance ins = new ZooKeeperInstance(instanceName, zooQuorum);
            Connector connector = ins.getConnector(user, pass);

            // Scan only the cf:src column, with blank authorizations
            Scanner scan = connector.createScanner(TEST_TABLE, new Authorizations());
            scan.fetchColumn(COLUMN_FAMILY, SRC_QUAL);

            // Configure the filter that the TabletServers apply to each value
            IteratorSetting iter = new IteratorSetting(15, "regexfilter", RegExFilter.class);
            iter.addOption(RegExFilter.VALUE_REGEX, src);
            scan.addScanIterator(iter);

            // Print the row ID of every matching entry and keep a tally
            int count = 0;
            for (Map.Entry<Key, Value> row : scan) {
                System.out.println("row: " + row.getKey().getRow().toString());
                count++;
            }
            System.out.println("total rows: " + count);
        }
    }
Save and build the JAR file accumulo-examples.jar.
In the folder where accumulo-examples.jar is located, create a new shell script named run_src_filter.sh with the following commands. Be sure to change ACCUMULO_LIB, HADOOP_LIB, and ZOOKEEPER_LIB to match your local paths:

    ACCUMULO_LIB=/opt/cloud/accumulo-1.4.1/lib/*
    HADOOP_LIB=/Applications/hadoop-0.20.2-cdh3u1/*:/Applications/hadoop-0.20.2-cdh3u1/lib/*
    ZOOKEEPER_LIB=/opt/cloud/zookeeper-3.4.2/*
    java -cp "$ACCUMULO_LIB:$HADOOP_LIB:$ZOOKEEPER_LIB:accumulo-examples.jar" \
      examples.accumulo.SourceFilterMain 'Panafrican News Agency' test root password localhost:2181
Save and run the script. The output should list the row IDs of the entries whose source is Panafrican News Agency, followed by the total count.
The script takes the parameters required to connect to the Accumulo table acled, plus an additional parameter for the source value to filter on. We set up a Scanner instance with blank authorizations and configure an IteratorSetting of type RegExFilter to do the regex comparison on the TabletServer. Our regex is a very simple direct match on the supplied source argument.
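Because the source argument is handed to the filter as a regular expression, a source name that contains regex metacharacters would be read as a pattern rather than as literal text. The following sketch uses only java.util.regex to illustrate the difference. It assumes the filter applies whole-string matching, as Matcher.matches() does; the class SourceRegexDemo, its helper methods, and the sample name AFP (wire) are hypothetical and not part of the recipe.

```java
import java.util.regex.Pattern;

public class SourceRegexDemo {

    // Whole-string match: the pattern must cover the entire value.
    static boolean matchesWhole(String regex, String value) {
        return Pattern.compile(regex).matcher(value).matches();
    }

    // Escape metacharacters so the source name is treated as literal text.
    static String literal(String source) {
        return Pattern.quote(source);
    }

    public static void main(String[] args) {
        String value = "Panafrican News Agency";
        System.out.println(matchesWhole("Panafrican News Agency", value)); // true: exact match
        System.out.println(matchesWhole("Panafrican", value));             // false: prefix only
        System.out.println(matchesWhole("Panafrican.*", value));           // true: wildcard suffix

        // Hypothetical source name with parentheses, which regex reads as a group.
        String odd = "AFP (wire)";
        System.out.println(matchesWhole(odd, odd));          // false: "(wire)" is a group
        System.out.println(matchesWhole(literal(odd), odd)); // true: quoted as a literal
    }
}
```

If source names in your data can contain such characters, one option is to pass Pattern.quote(src) rather than src when setting the VALUE_REGEX option.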
We then iterate over the result set and print out the row ID for any matching key-value pairs. At the end, we print a tally of how many key-value pairs were found matching that source.
The responsibility of filtering key-value pairs based on the value is distributed across the various TabletServers that hold tablets for the acled table. The client only sees rows that match the filter, and can immediately begin processing.
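The division of work can be pictured with a toy model in plain Java. This is not Accumulo code: each inner list stands in for the values held by one tablet, scanTablet plays the part of a TabletServer applying the filter locally, and scanTable plays the part of the client, which only ever receives the matches. The class name, the method names, and the sample values other than Panafrican News Agency are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class DistributedFilterSketch {

    // Stand-in for one TabletServer: filters its own values and returns only the matches.
    static List<String> scanTablet(List<String> tabletValues, Pattern regex) {
        List<String> hits = new ArrayList<String>();
        for (String value : tabletValues) {
            if (regex.matcher(value).matches()) {
                hits.add(value);
            }
        }
        return hits;
    }

    // Stand-in for the client: collects the already-filtered results from each tablet.
    static List<String> scanTable(List<List<String>> tablets, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> results = new ArrayList<String>();
        for (List<String> tablet : tablets) {
            results.addAll(scanTablet(tablet, pattern));
        }
        return results;
    }

    public static void main(String[] args) {
        // Two pretend tablets with made-up source values.
        List<List<String>> tablets = Arrays.asList(
                Arrays.asList("Panafrican News Agency", "Reuters"),
                Arrays.asList("BBC", "Panafrican News Agency"));

        List<String> matches = scanTable(tablets, "Panafrican News Agency");
        System.out.println("total rows: " + matches.size()); // total rows: 2
    }
}
```

In the model, as in the recipe, the non-matching values never reach the client loop, so the client does no filtering of its own.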