Apache Mahout is a general machine learning library that was originally built on top of Hadoop. It started out primarily as a Java MapReduce package for running machine learning algorithms. Because machine learning algorithms are iterative in nature, and each MapReduce pass writes its results to disk before the next one starts, MapReduce had major performance and scalability issues. Mahout therefore stopped developing MapReduce-based algorithms and started supporting new platforms, such as Spark, H2O, and Flink, through a new package called Samsara.
Let's install Mahout, explore the Mahout shell with Scala bindings, and then build a recommendation system.
The latest version of Spark does not work well with Mahout yet, so I used Spark 1.4.1 with Mahout 0.12.2. Download the prebuilt Spark binary from the following location and unpack it:
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1-bin-hadoop2.6.tgz
tar xzvf spark-1.4.1-bin-hadoop2.6.tgz
cd spark-1.4.1-bin-hadoop2.6
Now, let's download the Mahout binaries and unpack them as shown here:
wget http://mirrors.sonic.net/apache/mahout/0.12.2/apache-mahout-distribution-0.12.2.tar.gz
tar xzvf apache-mahout-distribution-0.12.2.tar.gz
Now, export the following environment variables and start the Mahout shell:
export MAHOUT_HOME=/home/cloudera/apache-mahout-distribution-0.12.2
export SPARK_HOME=/home/cloudera/spark-1.4.1-bin-hadoop2.6
export MAHOUT_LOCAL=true
export MASTER=yarn-client
export JAVA_TOOL_OPTIONS="-Xmx2048m -XX:MaxPermSize=1024m -Xms1024m"
cd ~/apache-mahout-distribution-0.12.2
bin/mahout spark-shell
The following Scala imports are typically used to enable the Mahout Scala DSL bindings for linear algebra:
import org.apache.mahout.math._
import scalabindings._
import MatlabLikeOps._
The Mahout shell supports two types of vectors, dense and sparse:
mahout> val denseVector1: Vector = (3.0, 4.1, 6.2)
denseVector1: org.apache.mahout.math.Vector = {0:3.0,1:4.1,2:6.2}
mahout> val sparseVector1 = svec((6 -> 1) :: (9 -> 2.0) :: Nil)
sparseVector1: org.apache.mahout.math.RandomAccessSparseVector = {9:2.0,6:1.0}
Access the elements of a vector:
mahout> denseVector1(2)
res0: Double = 6.2
Set a value in a vector:
mahout> denseVector1(2)=8.2
mahout> denseVector1
res2: org.apache.mahout.math.Vector = {0:3.0,1:4.1,2:8.2}
The following are the vector arithmetic operations:
mahout> val denseVector2: Vector = (1.0, 1.0, 1.0)
denseVector2: org.apache.mahout.math.Vector = {0:1.0,1:1.0,2:1.0}
mahout> val addVec=denseVector1 + denseVector2
addVec: org.apache.mahout.math.Vector = {0:4.0,1:5.1,2:9.2}
mahout> val subVec=denseVector1 - denseVector2
subVec: org.apache.mahout.math.Vector = {0:2.0,1:3.0999999999999996,2:7.199999999999999}
Multiplication and division work the same way.
Adding a scalar to a vector increments every element by the value of the scalar, and subtracting a scalar decrements every element. For example, the following commands add 10 to, and then subtract 2 from, all the elements of the vector:
mahout> val addScalr=denseVector1+10
addScalr: org.apache.mahout.math.Vector = {0:13.0,1:14.1,2:18.2}
mahout> val addScalr=denseVector1-2
addScalr: org.apache.mahout.math.Vector = {0:1.0,1:2.0999999999999996,2:6.199999999999999}
Now let's see how to initialize a matrix. Inline initialization of a matrix, either dense or sparse, is always performed row-wise:
mahout> val denseMatrix = dense((10, 20, 30), (30, 40, 50))
denseMatrix: org.apache.mahout.math.DenseMatrix =
{
 0 => {0:10.0,1:20.0,2:30.0}
 1 => {0:30.0,1:40.0,2:50.0}
}
mahout> val sparseMatrix = sparse((1, 30) :: Nil, (0, 20) :: (1, 20.5) :: Nil)
sparseMatrix: org.apache.mahout.math.SparseRowMatrix =
{
 0 => {1:30.0}
 1 => {0:20.0,1:20.5}
}
mahout> val diagonalMatrix=diag(20, 4)
diagonalMatrix: org.apache.mahout.math.DiagonalMatrix =
{
 0 => {0:20.0}
 1 => {1:20.0}
 2 => {2:20.0}
 3 => {3:20.0}
}
mahout> val identityMatrix = eye(4)
identityMatrix: org.apache.mahout.math.DiagonalMatrix =
{
 0 => {0:1.0}
 1 => {1:1.0}
 2 => {2:1.0}
 3 => {3:1.0}
}
Access the elements of a matrix as follows:
mahout> denseMatrix(1,1)
res5: Double = 40.0
mahout> sparseMatrix(0,1)
res18: Double = 30.0
Fetch a row of a matrix as follows:
mahout> denseMatrix(1,::)
res21: org.apache.mahout.math.Vector = {0:30.0,1:40.0,2:50.0}
Fetch a column of a matrix as follows:
mahout> denseMatrix(::,1)
res22: org.apache.mahout.math.Vector = {0:20.0,1:40.0}
Set a matrix row as follows:
mahout> denseMatrix(1,::)=(99,99,99)
res23: org.apache.mahout.math.Vector = {0:99.0,1:99.0,2:99.0}
mahout> denseMatrix
res24: org.apache.mahout.math.DenseMatrix =
{
 0 => {0:10.0,1:20.0,2:30.0}
 1 => {0:99.0,1:99.0,2:99.0}
}
Matrices are assigned by reference and not as a copy. See the following example:
mahout> val newREF = denseMatrix
newREF: org.apache.mahout.math.DenseMatrix =
{
 0 => {0:10.0,1:20.0,2:30.0}
 1 => {0:99.0,1:99.0,2:99.0}
}
mahout> newREF += 10.0
res25: org.apache.mahout.math.Matrix =
{
 0 => {0:20.0,1:30.0,2:40.0}
 1 => {0:109.0,1:109.0,2:109.0}
}
mahout> denseMatrix
res26: org.apache.mahout.math.DenseMatrix =
{
 0 => {0:20.0,1:30.0,2:40.0}
 1 => {0:109.0,1:109.0,2:109.0}
}
If you want a separate copy, you can clone it:
mahout> val newClone = denseMatrix clone
newClone: org.apache.mahout.math.Matrix =
{
 0 => {0:20.0,1:30.0,2:40.0}
 1 => {0:109.0,1:109.0,2:109.0}
}
mahout> newClone += 10
res27: org.apache.mahout.math.Matrix =
{
 0 => {0:30.0,1:40.0,2:50.0}
 1 => {0:119.0,1:119.0,2:119.0}
}
mahout> newClone
res28: org.apache.mahout.math.Matrix =
{
 0 => {0:30.0,1:40.0,2:50.0}
 1 => {0:119.0,1:119.0,2:119.0}
}
mahout> denseMatrix
res29: org.apache.mahout.math.DenseMatrix =
{
 0 => {0:20.0,1:30.0,2:40.0}
 1 => {0:109.0,1:109.0,2:109.0}
}
Mahout provides recommendation algorithms, such as spark-itemsimilarity and spark-rowsimilarity, to create recommendations. When these recommendations are combined with a search tool, such as Solr or Elasticsearch, they can be personalized for individual users.
Figure 8.6 shows a recommendation application built on a lambda architecture: the model is created and updated in batch mode, and a search engine plays the real-time serving role. All user interactions are collected in real time and stored in HBase. In Solr, two collections are created, one for user history and one for item indicators. Indicators are derived from user interactions by Mahout's Spark-based correlation algorithms.
The recommender application queries Solr directly to get recommendations. User actions fall into two groups: a primary action, such as a purchase, and secondary actions, such as product detail views or additions to a wishlist. Recommendations are usually based on the primary action in the user history, together with its co-occurrence and cross-co-occurrence indicators. However, recommendations can be customized with secondary actions.
Let's learn how to create a similarity matrix and a cross-similarity matrix using the spark-itemsimilarity algorithm. Create a file called infile with the following content. Note that you can also provide a log file directly as input by specifying its delimiters:
[cloudera@quickstart ~]$ cat infile
u1,purchase,iphone
u1,purchase,ipad
u2,purchase,nexus
u2,purchase,galaxy
u3,purchase,surface
u4,purchase,iphone
u4,purchase,galaxy
u1,view,iphone
u1,view,ipad
u1,view,nexus
u1,view,galaxy
u2,view,iphone
u2,view,ipad
u2,view,nexus
u2,view,galaxy
u3,view,surface
u3,view,nexus
u4,view,iphone
u4,view,ipad
u4,view,galaxy
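Before running the job, it can help to confirm that the file has the shape you expect. The following is a minimal sketch that uses only standard shell tools and is not part of Mahout. It recreates the sample data in a temporary file (the variable names are illustrative) and counts the rows for each action in the second column:

```shell
# Recreate the sample interaction log from the text in a scratch file.
infile=$(mktemp)
cat > "$infile" <<'EOF'
u1,purchase,iphone
u1,purchase,ipad
u2,purchase,nexus
u2,purchase,galaxy
u3,purchase,surface
u4,purchase,iphone
u4,purchase,galaxy
u1,view,iphone
u1,view,ipad
u1,view,nexus
u1,view,galaxy
u2,view,iphone
u2,view,ipad
u2,view,nexus
u2,view,galaxy
u3,view,surface
u3,view,nexus
u4,view,iphone
u4,view,ipad
u4,view,galaxy
EOF

# Count the rows per action (second comma-separated column), sorted by name.
action_counts=$(awk -F, '{ n[$2]++ } END { for (a in n) print a, n[a] }' "$infile" | sort)
echo "$action_counts"

rm -f "$infile"
```

For the sample data, this reports 7 purchase rows and 13 view rows, so both the primary and the secondary action are present in the input.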
Then, run the spark-itemsimilarity job with the following command:
[cloudera@quickstart apache-mahout-distribution-0.12.2]$ bin/mahout spark-itemsimilarity --input /home/cloudera/infile --output /home/cloudera/outdir --master local[*] --filter1 purchase --filter2 view -ic 2 -rc 0 -fc 1
The previous command line options are explained as follows:
-f1 or --filter1: Datum for the primary item set
-f2 or --filter2: Datum for the secondary item set
-ic or --itemIDColumn: Column number for item ID. Default is 1.
-rc or --rowIDColumn: Column number for row ID. Default is 0.
-fc or --filterColumn: Column number for filter string. Default is -1 for no filter.
Note that the --master parameter can be changed to a standalone master or yarn.
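To make the column options concrete, the following sketch mimics in plain awk what the driver is asked to do with --filter1 purchase -rc 0 -ic 2 -fc 1: keep the rows whose filter column holds the filter string, then read the row ID and item ID from the given zero-based columns. The select_pairs function is hypothetical and only illustrates the option semantics described above. It assumes an exact match on the filter column, which may differ from Mahout's own matching rule, and it is not how Mahout parses input internally.

```shell
# select_pairs FILTER ROW_COL ITEM_COL FILTER_COL FILE
# Hypothetical helper: prints "rowID itemID" for every matching row.
# Column numbers are zero-based, as in the driver options; awk fields are
# one-based, hence the "+ 1". A FILTER_COL of -1 means "no filter".
select_pairs() {
  awk -F, -v filter="$1" -v rc="$2" -v ic="$3" -v fc="$4" '
    fc < 0 || $(fc + 1) == filter { print $(rc + 1), $(ic + 1) }
  ' "$5"
}

# A four-row excerpt of the sample file from the text.
sample=$(mktemp)
cat > "$sample" <<'EOF'
u1,purchase,iphone
u1,view,nexus
u3,purchase,surface
u3,view,nexus
EOF

purchases=$(select_pairs purchase 0 2 1 "$sample")   # primary action
views=$(select_pairs view 0 2 1 "$sample")           # secondary action
echo "$purchases"
echo "$views"

rm -f "$sample"
```

The two result sets correspond to the two inputs of the job: the purchase pairs feed the similarity matrix, and the purchase and view pairs together feed the cross-similarity matrix.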
The job produces the following output:
[cloudera@quickstart apache-mahout-distribution-0.12.2]$ cd ~/outdir/
[cloudera@quickstart outdir]$ ls -R
.:
cross-similarity-matrix  similarity-matrix
./cross-similarity-matrix:
part-00000  part-00001  part-00002  _SUCCESS
./similarity-matrix:
part-00000  _SUCCESS
[cloudera@quickstart outdir]$ cat similarity-matrix/part-00000
galaxy nexus:1.7260924347106847
ipad iphone:1.7260924347106847
surface
iphone ipad:1.7260924347106847
nexus galaxy:1.7260924347106847
[cloudera@quickstart outdir]$ cat cross-similarity-matrix/part-0000*
galaxy galaxy:1.7260924347106847 ipad:1.7260924347106847 iphone:1.7260924347106847 nexus:1.7260924347106847
ipad galaxy:0.6795961471815897 ipad:0.6795961471815897 iphone:0.6795961471815897 nexus:0.6795961471815897
surface surface:4.498681156950466 nexus:0.6795961471815897
iphone galaxy:1.7260924347106847 ipad:1.7260924347106847 iphone:1.7260924347106847 nexus:1.7260924347106847
nexus galaxy:0.6795961471815897 ipad:0.6795961471815897 iphone:0.6795961471815897 nexus:0.6795961471815897
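The scores in these files are consistent with the log-likelihood ratio (LLR) test that Mahout's similarity drivers use to weigh co-occurrences. The following sketch, which is an illustration in awk and not Mahout's code, computes the LLR from a 2x2 table of user counts: k11 users with both interactions, k12 with only the first, k21 with only the second, and k22 with neither. The llr function name and its argument order are assumptions made for this example.

```shell
# llr K11 K12 K21 K22
# Hypothetical helper: prints the log-likelihood ratio for a 2x2 table of
# counts, computed as 2 * (rowEntropy + colEntropy - matrixEntropy), where
# each "entropy" is the unnormalized form xlogx(total) - sum of xlogx(parts).
llr() {
  LC_ALL=C awk -v k11="$1" -v k12="$2" -v k21="$3" -v k22="$4" '
    function xlogx(x) { return (x == 0) ? 0 : x * log(x) }
    function entropy2(a, b) { return xlogx(a + b) - xlogx(a) - xlogx(b) }
    function entropy4(a, b, c, d) {
      return xlogx(a + b + c + d) - xlogx(a) - xlogx(b) - xlogx(c) - xlogx(d)
    }
    BEGIN {
      # Force numeric values before doing arithmetic on the arguments.
      a = k11 + 0; b = k12 + 0; c = k21 + 0; d = k22 + 0
      row = entropy2(a + b, c + d)
      col = entropy2(a + c, b + d)
      mat = entropy4(a, b, c, d)
      score = 2 * (row + col - mat)
      if (score < 0) score = 0   # guard against tiny negative rounding errors
      printf "%.6f\n", score
    }'
}

# iphone vs. ipad purchases: u1 bought both, u4 bought only the iphone,
# nobody bought only the ipad, and u2 and u3 bought neither.
llr 1 1 0 2
# surface purchases vs. surface views: u3 did both, the other three neither.
llr 1 0 0 3
# ipad purchases vs. galaxy views: u1 did both, u2 and u4 only viewed the
# galaxy, and u3 did neither.
llr 1 0 2 1
```

Rounded to six decimals, the three calls print 1.726092, 4.498681, and 0.679596, which agree with the iphone/ipad entry in the similarity matrix and with the surface/surface and ipad/galaxy entries in the cross-similarity matrix.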
Based on the previous result, the page that displays the iPhone can now show the iPad as a recommendation, because it was purchased by similar people. The current user's purchase history will be used to personalize the recommendations on Solr.
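As a rough illustration of that lookup, the following sketch reads rows in the similarity-matrix format and prints the items listed as similar to a given one. It assumes that each row is an item ID followed by whitespace-separated item:score pairs, as in the output shown earlier. The similar_to function and the temporary file are hypothetical, and a real deployment would index these rows in Solr rather than scan files.

```shell
# similar_to ITEM FILE
# Hypothetical helper: prints the IDs of the items listed as similar to ITEM.
similar_to() {
  awk -v item="$1" '
    $1 == item {
      for (i = 2; i <= NF; i++) {
        split($i, pair, ":")   # pair[1] = item ID, pair[2] = score
        print pair[1]
      }
    }
  ' "$2"
}

# The similarity-matrix rows from the job output in the text.
matrix=$(mktemp)
cat > "$matrix" <<'EOF'
galaxy nexus:1.7260924347106847
ipad iphone:1.7260924347106847
surface
iphone ipad:1.7260924347106847
nexus galaxy:1.7260924347106847
EOF

iphone_recs=$(similar_to iphone "$matrix")     # the ipad row entry
surface_recs=$(similar_to surface "$matrix")   # empty: no similar items
echo "iphone -> ${iphone_recs:-none}"
echo "surface -> ${surface_recs:-none}"

rm -f "$matrix"
```

The surface row has no entries because only one user purchased it and that user purchased nothing else, so there is no co-occurrence to recommend from.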