Advertisements have become a major source of revenue for the Web. It is a billion-dollar business and the source of most of Google's revenue. Furthermore, it has made it possible for companies such as Google to offer their main services free of charge, while collecting their revenue through advertisements.
Let us explore how we can implement a simple "Adwords"-style algorithm using MapReduce.
Adwords lets advertisers bid on keywords. For example, advertiser "A" can bid on the keywords "Hadoop Support" at $2 per click with a maximum budget of $100, while advertiser "B" bids on the same keywords at $1.50 per click with a maximum budget of $200. When a user searches for documents with the given keywords, the system chooses one or more advertisements from among the bids for those keywords. Advertisers pay only if a user clicks on their advertisement.
The Adwords problem is to show advertisements such that the revenue is maximized. Several factors are in play when designing such a solution: advertisers pay only when their advertisement is clicked, so the click-through rate of a keyword matters as much as the bid value, and each advertiser's remaining budget limits how many advertisements can be assigned to them.
In this recipe, we will implement a simplified version of the Adwords balance algorithm that can be used in such situations. For simplicity, we will assume that advertisers bid only on single words. Also, as a real bid dataset is not publicly available, we will generate a sample bid dataset.
Assume that you are to support keyword-based advertisements using the Amazon dataset. The recipe works as follows: the first MapReduce job approximates a click-through rate for each keyword from the Amazon sales ranks; we then generate a bid dataset, merge it with the click-through rates using a second MapReduce job, and finally, for each keyword, select the advertisement whose bid maximizes the following measure:
Measure = bid value * click-through rate * (1 - e^(-current budget / initial budget))
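For instance, with an assumed bid of $2, a click-through rate of 100, and $50 left of an initial $100 budget, the measure works out to 2 * 100 * (1 - e^(-0.5)) ≈ 78.7. As an advertiser's remaining budget approaches zero, the exponential term drives the measure towards zero, so advertisers that still have budget left are favoured.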
The following steps describe how to prepare to run the Adwords sample:
This recipe assumes that you have a working Hadoop setup. We will use HADOOP_HOME to refer to the Hadoop installation directory.

The following steps describe how to run the Adwords sample:

1. Unzip the Amazon dataset; we will refer to the directory that contains amazon-meta.txt as DATA_DIR.
2. Upload the Amazon dataset to HDFS by running the following commands from HADOOP_HOME. If the /data directory is already there, clean it up. This dataset is large and might take a long time to process on a single computer; you might want to upload only the first 50,000 lines or so of the dataset if you need the sample to run quickly.
$ bin/hadoop dfs -mkdir /data
$ bin/hadoop dfs -mkdir /data/input1
$ bin/hadoop dfs -put <DATA_DIR>/amazon-meta.txt /data/input1
3. Unzip the source code of this chapter (chapter8.zip). We will call that folder CHAPTER_8_SRC.
4. Change the hadoop.home property in the CHAPTER_8_SRC/build.xml file to point to your Hadoop installation directory.
5. Compile the source by running the ant build command from the CHAPTER_8_SRC directory.
6. Copy the build/lib/hadoop-cookbook-chapter8.jar file to your HADOOP_HOME.
7. Run the first MapReduce job, which approximates the click-through rates, by running the following command from HADOOP_HOME:
$ bin/hadoop jar hadoop-cookbook-chapter8.jar chapter8.adwords.ClickRateApproximator /data/input1 /data/output6
8. Download the results to your computer by running the following command:
$ bin/hadoop dfs -get /data/output6/part-r-00000 clickrate.data
9. The file will contain the approximated click-through rate for each keyword, and it will look like the following:
keyword:(Annals     74
keyword:(Audio      153
keyword:(BET        95
keyword:(Beat       98
keyword:(Beginners) 429
keyword:(Beginning  110
10. Generate a bid dataset by running the following command from HADOOP_HOME. You can find the results in a biddata.data file.
$ java -cp build/lib/hadoop-cookbook-chapter8.jar chapter8.adwords.AdwordsBidGenerator clickrate.data
11. Create a directory /data/input2 and upload the bid dataset and the results from the earlier MapReduce task to the /data/input2 directory of HDFS:
$ bin/hadoop dfs -put clickrate.data /data/input2
$ bin/hadoop dfs -put biddata.data /data/input2
12. Generate the dataset for the Adwords assigner by running the second MapReduce job:
$ bin/hadoop jar hadoop-cookbook-chapter8.jar chapter8.adwords.AdwordsBalanceAlgorithmDataGenerator /data/input2 /data/output7
13. Download the results to your computer by running the following command:
$ bin/hadoop dfs -get /data/output7/part-r-00000 adwords.data
14. The resulting file lists, against each keyword, the bids placed for it, and it will look like the following:
(Animated   client23,773.0,5.0,97.0|
(Animated)  client33,310.0,8.0,90.0|
(Annals     client76,443.0,13.0,74.0|client51,1951.0,4.0,74.0|
(Beginners) client86,210.0,6.0,429.0|client6,236.0,5.0,429.0|
(Beginning  client31,23.0,10.0,110.0|
15. Run the Adwords assigner with the generated adwords.data file:
$ java -cp hadoop-cookbook-chapter8.jar chapter8.adwords.AdwordsAssigner adwords.data
As we discussed in the How to do it... section, the recipe consists of two MapReduce tasks. You can find the source code for the first MapReduce task in the src/chapter8/adwords/ClickRateApproximator.java file.
The mapper function looks like the following. It parses the Amazon dataset using the Amazon data format, and for each word in a product's title, it emits the word together with that product's sales rank.
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Parse one item entry of the Amazon dataset
    ItemData itemData = null;
    List<AmazonCustomer> customerList =
        AmazonCustomer.parseAItemLine(value.toString());
    if (customerList.size() == 0) {
        return;
    }
    // All customers in this entry refer to the same item; take it from the first one
    for (AmazonCustomer customer : customerList) {
        itemData = customer.itemsBrought.iterator().next();
        break;
    }
    // Emit (word, sales rank) for every sufficiently long word in the product title
    String[] tokens = itemData.title.split("\\s");
    for (String token : tokens) {
        if (token.length() > 3) {
            context.write(new Text(token), new IntWritable(itemData.salesrank));
        }
    }
}
Hadoop then sorts the emitted key-value pairs by key and invokes the reducer once for each key, passing it the values emitted against that key. As shown in the following code, the reducer calculates an approximation of the click rate for the keyword using the sales ranks emitted against it.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    double clickrate = 0;
    // Each value is the sales rank of a product whose title contains this keyword.
    // Better (lower) sales ranks contribute more to the approximated click rate.
    for (IntWritable val : values) {
        if (val.get() > 1) {
            clickrate = clickrate + 1000 / Math.log(val.get());
        } else {
            clickrate = clickrate + 1000;
        }
    }
    context.write(new Text("keyword:" + key.toString()),
        new IntWritable((int) clickrate));
}
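For example, under this approximation, a product with a sales rank of 100 contributes about 1000/ln(100) ≈ 217 to the click rate of each keyword in its title, while a product ranked 10,000 contributes only about 109.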
There is no publicly available bid dataset. Therefore, we will generate a random bid dataset for our recipe using AdwordsBidGenerator. It reads the keywords generated by the earlier MapReduce task and generates a random bid dataset.
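The AdwordsBidGenerator source is not listed here, but judging from the format the next mapper parses, each line of biddata.data associates a client with a budget and a set of single-keyword bids. The following is a minimal sketch of how such a file could be produced; the class name, the keyword list, the output file name, and the value ranges are illustrative assumptions, not the recipe's actual parameters.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

public class SampleBidDataGenerator {
    // Illustrative keywords; the real generator reads them from clickrate.data
    private static final String[] KEYWORDS = {"Hadoop", "Support", "Beginners"};

    public static void main(String[] args) throws IOException {
        Random random = new Random();
        try (PrintWriter out = new PrintWriter(new FileWriter("biddata.data"))) {
            for (int client = 0; client < 100; client++) {
                // Each line: "client<N> budget=<amount>,bid=<keyword>|<value>,..."
                StringBuilder line = new StringBuilder();
                line.append("client").append(client)
                    .append(" budget=").append(100 + random.nextInt(900)).append(".0");
                for (String keyword : KEYWORDS) {
                    if (random.nextBoolean()) {
                        line.append(",bid=").append(keyword)
                            .append("|").append(1 + random.nextInt(10)).append(".0");
                    }
                }
                out.println(line);
            }
        }
    }
}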
Then we use the second MapReduce task to merge the bid dataset with the click-through rates and generate a dataset that lists the bid information against each keyword. You can find the source for the second MapReduce task in the src/chapter8/adwords/AdwordsBalanceAlgorithmDataGenerator.java file. The mapper function looks like the following:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] keyVal = value.toString().split("\\s");
    if (keyVal[0].startsWith("keyword:")) {
        // A line from clickrate.data: emit (keyword, click rate)
        context.write(
            new Text(keyVal[0].replace("keyword:", "")),
            new Text(keyVal[1]));
    } else if (keyVal[0].startsWith("client")) {
        // A line from biddata.data: extract the budget and the list of bids
        List<String[]> bids = new ArrayList<String[]>();
        double budget = 0;
        String clientid = keyVal[0];
        String[] tokens = keyVal[1].split(",");
        for (String token : tokens) {
            String[] kp = token.split("=");
            if (kp[0].equals("budget")) {
                budget = Double.parseDouble(kp[1]);
            } else if (kp[0].equals("bid")) {
                String[] bidData = kp[1].split("\\|");
                bids.add(bidData);
            }
        }
        // Emit (keyword, "clientid,budget,bidValue") for every bid of this client
        for (String[] bid : bids) {
            String keyword = bid[0];
            String bidValue = bid[1];
            context.write(new Text(keyword),
                new Text(new StringBuffer()
                    .append(clientid).append(",")
                    .append(budget).append(",")
                    .append(bidValue).toString()));
        }
    }
}
The mapper function reads both the bid dataset and the click-through rate dataset and emits both types of data keyed by keyword. Each reducer therefore receives all the bids and the associated click-through rate for one keyword; it merges them and emits the list of bids against that keyword.
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String clientid = null;
    String budget = null;
    String bid = null;
    String clickRate = null;
    List<String> bids = new ArrayList<String>();
    // Separate the bid records (comma-separated) from the click rate value
    for (Text val : values) {
        if (val.toString().indexOf(",") > 0) {
            bids.add(val.toString());
        } else {
            clickRate = val.toString();
        }
    }
    // Merge each bid with the click rate into "clientid,budget,bid,clickRate|..."
    StringBuffer buf = new StringBuffer();
    for (String bidData : bids) {
        String[] vals = bidData.split(",");
        clientid = vals[0];
        budget = vals[1];
        bid = vals[2];
        buf.append(clientid).append(",")
            .append(budget).append(",")
            .append(Double.valueOf(bid)).append(",")
            .append(Math.max(1, Double.valueOf(clickRate)));
        buf.append("|");
    }
    if (bids.size() > 0) {
        context.write(key, new Text(buf.toString()));
    }
}
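Each line emitted by this reducer therefore consists of a keyword followed by one clientid,budget,bid,clickRate entry per bidder, with entries separated by the | character; this is the adwords.data format shown earlier in the How to do it... section.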
Finally, the Adwords assigner loads the bid data and stores it against the keywords in memory. Given a keyword, the Adwords assigner selects, from among all the bids for that keyword, the bid that maximizes the following measure:
Measure = bid value * click-through rate * (1 - e^(-current budget / initial budget))
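The assigner's source is not listed here; the following is a minimal sketch of this selection step, assuming a hypothetical Bid holder for the fields stored in adwords.data, rather than the recipe's actual implementation.

import java.util.List;

public class BidSelector {
    /** Hypothetical holder for one bid entry loaded from adwords.data. */
    public static class Bid {
        String clientid;
        double initialBudget;
        double currentBudget;
        double bidValue;
        double clickRate;
    }

    /** Pick the bid with the highest measure for a keyword, or null if no bidder can pay. */
    public static Bid selectBid(List<Bid> bidsForKeyword) {
        Bid best = null;
        double bestMeasure = 0;
        for (Bid bid : bidsForKeyword) {
            if (bid.currentBudget < bid.bidValue) {
                continue; // this advertiser can no longer afford a click
            }
            // Measure = bid value * click-through rate * (1 - e^(-current/initial))
            double measure = bid.bidValue * bid.clickRate
                * (1 - Math.exp(-bid.currentBudget / bid.initialBudget));
            if (measure > bestMeasure) {
                bestMeasure = measure;
                best = bid;
            }
        }
        return best;
    }

    /** Charge the winning bid when its advertisement is clicked. */
    public static void recordClick(Bid bid) {
        bid.currentBudget -= bid.bidValue;
    }
}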
The preceding recipe assumes that the Adwords assigner can load all the data into memory to make advertisement assignment decisions. However, if the dataset is too big to fit in memory, we can partition it among multiple computers by keyword (for example, assign the keywords that start with the letters "A-D" to the first computer, and so on).
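As a rough illustration of this idea (the bucketing scheme and the number of nodes are assumptions, not part of the recipe), a keyword could be routed to a node by its first letter:

public class KeywordPartitioner {
    // Assumed number of partitions; the recipe does not prescribe one
    private static final int NUM_NODES = 7;

    /** Map a keyword to a node by bucketing its first letter ('A'-'D' -> 0, 'E'-'H' -> 1, ...). */
    public static int nodeFor(String keyword) {
        char first = Character.toUpperCase(keyword.charAt(0));
        if (first < 'A' || first > 'Z') {
            return NUM_NODES - 1; // punctuation and digits go to the last node
        }
        return Math.min((first - 'A') / 4, NUM_NODES - 1);
    }
}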
This recipe assumes that advertisers bid only on single words. However, to support bids on multiple keywords, we would need to combine the click-through rates of the individual keywords; the rest of the algorithm can then proceed as before.
More information about online advertising can be found in the book Mining of Massive Datasets, by Anand Rajaraman and Jeffrey D. Ullman, available at http://infolab.stanford.edu/~ullman/mmds.html.