Unsupervised machine learning does not use annotated data; that is, the dataset does not contain anticipated results. While there are several unsupervised learning algorithms, we will demonstrate association rule learning to illustrate this learning approach.
Association rule learning is a technique that identifies relationships between data items. It is part of what is called market basket analysis. When a shopper makes purchases, the purchase is likely to consist of more than one item, and when it does, there are certain items that tend to be bought together. Association rule learning is one approach for identifying these related items. When an association is found, a rule can be formulated for it.
For example, if a customer buys diapers and lotion, they are also likely to buy baby wipes. An analysis can find these associations, and a rule stating the observation can be formed. The rule would be expressed as {diapers, lotion} => {wipes}. Being able to identify these purchasing patterns allows a store to offer special coupons, arrange its products so related items are easier to find, or carry out any number of other marketing activities.
One of the problems with this technique is that the number of possible associations grows rapidly with the number of items. One efficient and commonly used method is the Apriori algorithm. This algorithm works on a collection of transactions, each defined by a set of items. These items can be thought of as purchases, and a transaction as a set of items bought together. The collection is often referred to as a database.
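The algorithm's efficiency comes from the observation that every subset of a frequent itemset must itself be frequent, so once an itemset fails to meet a minimum support count, none of its supersets need to be generated or counted. The following is a minimal sketch of this level-wise pruning in plain Java, using the baby-product transactions as sample data; the class and method names are ours, for illustration only, and this is not Weka's implementation:

```java
import java.util.*;

// Minimal sketch of Apriori's level-wise candidate pruning.
// Itemsets below the minimum support count are discarded, and no
// superset of a discarded itemset is ever counted.
public class AprioriSketch {

    // Number of transactions containing every item in the candidate.
    static int support(List<Set<String>> transactions, Set<String> candidate) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(candidate)) {
                count++;
            }
        }
        return count;
    }

    // One level-wise pass: extend each frequent itemset by one item,
    // keeping only candidates that meet the minimum support count.
    static Set<Set<String>> nextLevel(List<Set<String>> transactions,
                                      Set<Set<String>> frequent,
                                      Set<String> items,
                                      int minSupport) {
        Set<Set<String>> result = new HashSet<>();
        for (Set<String> itemset : frequent) {
            for (String item : items) {
                if (!itemset.contains(item)) {
                    Set<String> candidate = new HashSet<>(itemset);
                    candidate.add(item);
                    if (support(transactions, candidate) >= minSupport) {
                        result.add(candidate);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("diapers", "lotion", "wipes"),
                Set.of("diapers", "lotion", "wipes", "formula"),
                Set.of("lotion", "wipes"),
                Set.of("diapers"),
                Set.of("lotion", "wipes", "formula"));
        Set<String> items = Set.of("diapers", "lotion", "wipes", "formula");

        // Level 1: single items meeting the support threshold.
        Set<Set<String>> frequent = new HashSet<>();
        for (String item : items) {
            if (support(transactions, Set.of(item)) >= 2) {
                frequent.add(Set.of(item));
            }
        }
        // Level 2: pairs are built only from frequent single items.
        System.out.println(nextLevel(transactions, frequent, items, 2));
    }
}
```

Because every pair is built from frequent single items only, an item that fails the support threshold at level 1 never contributes to any later candidate, which is what keeps the search tractable.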
Consider the following set of transactions where a 1 indicates that the item was purchased as part of a transaction and 0 means that it was not purchased:
Transaction ID | Diapers | Lotion | Wipes | Formula
1 | 1 | 1 | 1 | 0
2 | 1 | 1 | 1 | 1
3 | 0 | 1 | 1 | 0
4 | 1 | 0 | 0 | 0
5 | 0 | 1 | 1 | 1
There are several analysis terms used with the Apriori model:
Support: the proportion of transactions in the database that contain the itemset.
Confidence: for a rule X => Y, the proportion of transactions containing X that also contain Y.
Lift: the ratio of the observed support of X and Y together to the support expected if X and Y were independent; a lift greater than 1 indicates a positive association.
Leverage: the difference between the observed frequency of X and Y together and the frequency expected under independence.
Conviction: a measure of how often the rule would make an incorrect prediction compared to what would be expected by chance.
These definitions and sample values can be found at https://en.wikipedia.org/wiki/Association_rule_learning.
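These metrics can be computed directly from the transaction table above. The following sketch derives support, confidence, and lift for the rule {Diapers, Lotion} => {Wipes}; the class and method names are ours, for illustration only:

```java
// Support, confidence, and lift for the rule {Diapers, Lotion} => {Wipes},
// computed from the five transactions in the table above.
public class RuleMetrics {

    // Each row holds the {Diapers, Lotion, Wipes, Formula} flags
    // for one transaction, exactly as in the table.
    static final int[][] TRANSACTIONS = {
            {1, 1, 1, 0},
            {1, 1, 1, 1},
            {0, 1, 1, 0},
            {1, 0, 0, 0},
            {0, 1, 1, 1}
    };

    // Fraction of transactions in which every listed column is 1.
    static double support(int... columns) {
        int count = 0;
        for (int[] t : TRANSACTIONS) {
            boolean all = true;
            for (int c : columns) {
                if (t[c] == 0) {
                    all = false;
                    break;
                }
            }
            if (all) count++;
        }
        return (double) count / TRANSACTIONS.length;
    }

    public static void main(String[] args) {
        // Rule: {Diapers (column 0), Lotion (1)} => {Wipes (2)}
        double suppXY = support(0, 1, 2);           // 2 of 5 = 0.4
        double confidence = suppXY / support(0, 1); // 0.4 / 0.4 = 1.0
        double lift = confidence / support(2);      // 1.0 / 0.8 = 1.25
        System.out.printf("support=%.2f confidence=%.2f lift=%.2f%n",
                suppXY, confidence, lift);
    }
}
```

A confidence of 1.0 means every transaction containing both diapers and lotion also contained wipes, and the lift of 1.25 means this is more often than the overall frequency of wipes alone would predict.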
We will be using Weka's Apriori class to demonstrate Java support for the algorithm using two datasets. The first is the data discussed previously, and the second deals with what a person may take on a hike.
The following is the data file, babies.arff, for baby information:
@relation TEST_ITEM_TRANS

@attribute Diapers {1, 0}
@attribute Lotion {1, 0}
@attribute Wipes {1, 0}
@attribute Formula {1, 0}

@data
1,1,1,0
1,1,1,1
0,1,1,0
1,0,0,0
0,1,1,1
We start by reading in the file using a BufferedReader instance. This object is passed to the constructor of the Instances class, which will hold the data:
try {
    BufferedReader br = new BufferedReader(new FileReader("babies.arff"));
    Instances data = new Instances(br);
    br.close();
    ...
} catch (Exception ex) {
    // Handle exceptions
}
Next, an Apriori instance is created. We set the number of rules to be generated and a minimum confidence for the rules:
Apriori apriori = new Apriori();
apriori.setNumRules(100);
apriori.setMinMetric(0.5);
The buildAssociations method generates the associations using the Instances variable. The associations are then displayed:
apriori.buildAssociations(data);
System.out.println(apriori);
There will be 100 rules displayed. The following is the abbreviated output. Each rule is followed by various measures of the rule:
Apriori
=======

Minimum support: 0.3 (1 instances)
Minimum metric <confidence>: 0.5
Number of cycles performed: 14

Generated sets of large itemsets:

Size of set of large itemsets L(1): 8

Size of set of large itemsets L(2): 18

Size of set of large itemsets L(3): 16

Size of set of large itemsets L(4): 5

Best rules found:

  1. Wipes=1 4 ==> Lotion=1 4    <conf:(1)> lift:(1.25) lev:(0.16) [0] conv:(0.8)
  2. Lotion=1 4 ==> Wipes=1 4    <conf:(1)> lift:(1.25) lev:(0.16) [0] conv:(0.8)
  3. Diapers=0 2 ==> Lotion=1 2    <conf:(1)> lift:(1.25) lev:(0.08) [0] conv:(0.4)
  4. Diapers=0 2 ==> Wipes=1 2    <conf:(1)> lift:(1.25) lev:(0.08) [0] conv:(0.4)
  5. Formula=1 2 ==> Lotion=1 2    <conf:(1)> lift:(1.25) lev:(0.08) [0] conv:(0.4)
  6. Formula=1 2 ==> Wipes=1 2    <conf:(1)> lift:(1.25) lev:(0.08) [0] conv:(0.4)
  7. Diapers=1 Wipes=1 2 ==> Lotion=1 2    <conf:(1)> lift:(1.25) lev:(0.08) [0] conv:(0.4)
  8. Diapers=1 Lotion=1 2 ==> Wipes=1 2    <conf:(1)> lift:(1.25) lev:(0.08) [0] conv:(0.4)
...
 62. Diapers=0 Lotion=1 Formula=1 1 ==> Wipes=1 1    <conf:(1)> lift:(1.25) lev:(0.04) [0] conv:(0.2)
...
 99. Lotion=1 Formula=1 2 ==> Diapers=1 1    <conf:(0.5)> lift:(0.83) lev:(-0.04) [0] conv:(0.4)
100. Diapers=1 Lotion=1 2 ==> Formula=1 1    <conf:(0.5)> lift:(1.25) lev:(0.04) [0] conv:(0.6)
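The reported measures can be checked by hand against the transaction table. For rule 1, Wipes=1 ==> Lotion=1, wipes appear in four of the five transactions, and lotion appears in all four of those, giving a confidence of 4/4 = 1. The following sketch reproduces the confidence, lift, and leverage values Weka reports for this rule from the raw counts (conviction is omitted here, since Weka applies its own correction to the textbook formula; the class and method names are ours):

```java
// Recomputing the measures Weka reports for rule 1:
//   Wipes=1 4 ==> Lotion=1 4  <conf:(1)> lift:(1.25) lev:(0.16)
public class Rule1Check {

    // Lift of X => Y: observed joint support over the support expected
    // if X and Y were independent.
    static double lift(double both, double x, double y, double n) {
        return (both / n) / ((x / n) * (y / n));
    }

    // Leverage of X => Y: observed joint support minus expected support.
    static double leverage(double both, double x, double y, double n) {
        return both / n - (x / n) * (y / n);
    }

    public static void main(String[] args) {
        double n = 5;       // total transactions
        double wipes = 4;   // transactions with Wipes=1
        double lotion = 4;  // transactions with Lotion=1
        double both = 4;    // transactions with both

        double confidence = both / wipes;
        System.out.printf("conf=%.2f lift=%.2f lev=%.2f%n",
                confidence, lift(both, wipes, lotion, n),
                leverage(both, wipes, lotion, n));
        // prints conf=1.00 lift=1.25 lev=0.16
    }
}
```

Working one rule back to the raw counts in this way is a useful sanity check that the data file was read as intended.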
This provides us with a list of relationships, which we can use to identify patterns in activities such as purchasing behavior.