Cosine similarity measures how similar two vectors are. In this recipe, it is used to find similar artists based on the number of times Audioscrobbler users have added each artist to their playlist. The idea is to show how often users play both artist 1 and artist 2.
Download the Audioscrobbler dataset from http://www.packtpub.com/support.
Perform the following steps to calculate cosine similarity using Pig:
Copy the artist_data.txt and user_artist_data.txt files into HDFS:
    hadoop fs -put artist_data.txt user_artist_data.txt /data/audioscrobbler/
Load the data into Pig:
    plays = load '/data/audioscrobbler/user_artist_data.txt' using PigStorage(' ') as (user_id:long, artist_id:long, playcount:long);
    artist = load '/data/audioscrobbler/artist_data.txt' as (artist_id:long, artist_name:chararray);
Sample the user_artist_data.txt file:
    plays = sample plays .01;
Normalize the play counts to 100:
    user_total_grp = group plays by user_id;
    user_total = foreach user_total_grp generate group as user_id, SUM(plays.playcount) as totalplays;
    plays_user_total = join plays by user_id, user_total by user_id using 'replicated';
    norm_plays = foreach plays_user_total generate user_total::user_id as user_id, artist_id, ((double)playcount/(double)totalplays) * 100.0 as norm_play_cnt;
Get the artist pairs for each user:
    norm_plays2 = foreach norm_plays generate *;
    play_pairs = join norm_plays by user_id, norm_plays2 by user_id using 'replicated';
    play_pairs = filter play_pairs by norm_plays::plays::artist_id != norm_plays2::plays::artist_id;
Calculate the cosine similarity:
    -- per-user products and squares for each artist pair
    cos_sim_step1 = foreach play_pairs generate
        norm_plays::plays::artist_id as artist_id1,
        norm_plays2::plays::artist_id as artist_id2,
        ((double)norm_plays::norm_play_cnt * (double)norm_plays2::norm_play_cnt) as dot_product_step1,
        ((double)norm_plays::norm_play_cnt * (double)norm_plays::norm_play_cnt) as play1_sq,
        ((double)norm_plays2::norm_play_cnt * (double)norm_plays2::norm_play_cnt) as play2_sq;
    cos_sim_grp = group cos_sim_step1 by (artist_id1, artist_id2);
    -- sum the products and the squares over all users in each pair
    cos_sim_step2 = foreach cos_sim_grp generate
        flatten(group) as (artist_id1, artist_id2),
        COUNT(cos_sim_step1.dot_product_step1) as cnt,
        SUM(cos_sim_step1.dot_product_step1) as dot_product,
        SUM(cos_sim_step1.play1_sq) as tot_play_sq1,
        SUM(cos_sim_step1.play2_sq) as tot_play_sq2;
    -- dot product over the product of the two vector lengths
    cos_sim = foreach cos_sim_step2 generate
        artist_id1, artist_id2,
        dot_product / (SQRT(tot_play_sq1) * SQRT(tot_play_sq2)) as cosine_similarity;
Get the artists' names:
    art1 = join cos_sim by artist_id1, artist by artist_id using 'replicated';
    art2 = join art1 by artist_id2, artist by artist_id using 'replicated';
    art3 = foreach art2 generate artist_id1, art1::artist::artist_name as artist_name1, artist_id2, artist::artist_name as artist_name2, cosine_similarity;
Output the top 25 results:
    top = order art3 by cosine_similarity DESC;
    top_25 = limit top 25;
    dump top_25;
The output has the following form; the exact artists and scores depend on the random sample:
(1000157,AC/DC,3418,Hole,0.9115799166673817)
(829,Nas,1002216,The Darkness,0.9110152004952198)
(1022845,Jessica Simpson,1002325,Mandy Moore,0.9097097460071537)
(53,Wu-Tang Clan,78,Sublime,0.9096468367168238)
(1001180,Godsmack,1234871,Devildriver,0.9093019011575069)
(1001594,Adema,1007903,Maroon 5,0.909297052154195)
(689,Bette Midler,1003904,Better Than Ezra,0.9089467492461345)
(949,Ben Folds Five,2745,Ladytron,0.908736095810886)
(1000388,Ben Folds,930,Eminem,0.9085664586931873)
(1013654,Who Da Funk,5672,Nancy Sinatra,0.9084521262343653)
(1005386,Stabbing Westward,30,Jane's Addiction,0.9075360259222892)
(1252,Travis,1275996,R.E.M.,0.9071980963712077)
(100,Phoenix,1278,Ryan Adams,0.9071754511713067)
(2247,Four Tet,1009898,A Silver Mt. Zion,0.9069623744896833)
(1037970,Kanye West,1000991,Alison Krauss,0.9058717234023009)
(352,Beck,5672,Nancy Sinatra,0.9056851798338253)
(831,Nine Inch Nails,1251,Morcheeba,0.9051453756031981)
(1007004,Journey,1005479,Mr. Mister,0.9041311825160151)
(1002470,Elton John,1000416,Ramones,0.9040551837635081)
(1200,Faith No More,1007903,Maroon 5,0.9038274644717641)
(1002850,Glassjaw,1016435,Senses Fail,0.9034604126636377)
(1004294,Thursday,2439,HiM,0.902728300518356)
(1003259,ABBA,1057704,Readymade,0.9026955950032872)
(1001590,Hybrid,791,Beenie Man,0.9020872203833108)
(1501,Wolfgang Amadeus Mozart,4569,Simon & Garfunkel,0.9018860912385024)
The load statements tell Pig about the format and datatypes of the data being loaded. Pig loads data lazily. This means that the load statements at the beginning of this script will not do any work until another statement is entered that asks for output.
The user_artist_data.txt file is sampled so that a replicated join can be used when it is joined with itself. This significantly reduces the processing time at the cost of accuracy. The sample value of .01 means that roughly one in a hundred rows of data will be kept.
A user selecting to play an artist is treated as a vote for that artist. The play counts are normalized so that each user's plays sum to 100. This ensures that each user is given the same number of votes.
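The normalization can be checked on a handful of rows outside Pig. The following is a minimal Python sketch, not part of the recipe; the rows and the normalize_plays helper are made up for illustration and only mirror what the group, join, and foreach statements do.

```python
from collections import defaultdict

# Hypothetical (user_id, artist_id, playcount) rows; not taken from the dataset.
plays = [
    (1, 10, 30),
    (1, 20, 10),
    (2, 10, 5),
    (2, 30, 15),
]


def normalize_plays(rows):
    """Scale each user's play counts so that they sum to 100."""
    # Total plays per user (the user_total relation in the Pig script).
    totals = defaultdict(int)
    for user_id, _artist_id, playcount in rows:
        totals[user_id] += playcount
    # Each play count as a share of that user's total, times 100.
    return [
        (user_id, artist_id, playcount / totals[user_id] * 100.0)
        for user_id, artist_id, playcount in rows
    ]


norm_plays = normalize_plays(plays)
for row in norm_plays:
    print(row)
```

User 1 played 40 tracks in total and user 2 only 20, yet after normalization both have exactly 100 votes to distribute.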
A self join of the user_artist_data.txt file by user_id will generate all pairs of artists that users have added to their playlist. The self join also matches every artist with itself, so the filter removes the pairs in which both artist IDs are the same.
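To see what the self join and the filter leave behind, here is a small Python sketch under the same assumptions as the Pig script. The sample rows and the artist_pairs helper are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical (user_id, artist_id, norm_play_cnt) rows.
norm_plays = [
    (1, 10, 75.0),
    (1, 20, 25.0),
    (2, 10, 25.0),
    (2, 30, 75.0),
]


def artist_pairs(rows):
    """Pair up the artists played by the same user, skipping self-pairs."""
    by_user = defaultdict(list)
    for user_id, artist_id, cnt in rows:
        by_user[user_id].append((artist_id, cnt))
    pairs = []
    for user_id, artists in by_user.items():
        # The nested loop plays the role of the self join on user_id.
        for artist1, cnt1 in artists:
            for artist2, cnt2 in artists:
                # Same condition as the Pig filter: drop artist-with-itself pairs.
                if artist1 != artist2:
                    pairs.append((user_id, artist1, cnt1, artist2, cnt2))
    return pairs


play_pairs = artist_pairs(norm_plays)
for row in play_pairs:
    print(row)
```

Each pair appears in both orders, for example (10, 20) and (20, 10) for user 1, because the filter only checks that the two artist IDs differ.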
The next few statements calculate the cosine similarity. For each pair of artists that a user has played, the first statement multiplies the normalized play count for artist 1 by the normalized play count for artist 2, and also outputs the square of each of the two play counts. The result is then grouped by artist pair. Within each group, the sum of the per-user products is the dot product, and the sums of the squared play counts for artist 1 and for artist 2 are the squared lengths of the two play-count vectors. The cosine similarity is the dot product divided by the product of the square roots of those two sums.
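The arithmetic is easier to follow on a tiny example. The Python sketch below is an illustration only, assuming the standard definition of cosine similarity (dot product divided by the product of the two vector lengths); the input rows and the cosine_similarities helper are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical (user_id, artist1, plays1, artist2, plays2) rows,
# shaped like the output of the self join.
play_pairs = [
    (1, 10, 3.0, 20, 4.0),
    (2, 10, 4.0, 20, 3.0),
    (3, 10, 1.0, 30, 2.0),
]


def cosine_similarities(pairs):
    """Return {(artist1, artist2): cosine similarity} over the users who played both."""
    # Per artist pair: [dot product, sum of squares 1, sum of squares 2].
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    for _user_id, artist1, cnt1, artist2, cnt2 in pairs:
        acc = sums[(artist1, artist2)]
        acc[0] += cnt1 * cnt2
        acc[1] += cnt1 * cnt1
        acc[2] += cnt2 * cnt2
    return {
        pair: dot / (math.sqrt(sq1) * math.sqrt(sq2))
        for pair, (dot, sq1, sq2) in sums.items()
    }


cos_sim = cosine_similarities(play_pairs)
for pair, score in cos_sim.items():
    print(pair, score)
```

For the pair (10, 20) the dot product is 3*4 + 4*3 = 24 and both vector lengths are 5, which gives 24 / 25 = 0.96. The pair (10, 30) is shared by a single user and therefore scores 1.0, which is worth keeping in mind when reading the top results from a small sample.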