Supervised learning

In a supervised learning scenario, we need to provide the algorithm with a set of training tuples. Each tuple holds the features computed for a record pair and a label classifying that pair as either a match or a non-match. Our record pairs do not carry such labels yet.

The RecordLinkage package provides a numeric vector called identity.RLdata500, which holds an identity value for every record in RLdata500; two records represent the same entity when their identity values are equal. We can pass this vector to compare.dedup through the identity parameter:

> str(identity.RLdata500)
 num [1:500] 34 51 115 189 72 142 162 48 133 190 ...
> rec.pairs <- compare.dedup(RLdata500
+                            ,identity = identity.RLdata500
+                            ,blockfld = list(1, 5:7)
+ )
> head(rec.pairs$pairs)
  id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match
1   1 174        1       NA        0       NA  0  0  0        0
2   1 204        1       NA        0       NA  0  0  0        0
3   2   7        1       NA        0       NA  0  0  0        0
4   2  43        1       NA        0       NA  1  1  1        1
5   2 169        1       NA        0       NA  0  0  0        0
6   4  19        1       NA        0       NA  0  0  0        0
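
The way the is_match column follows from the identity vector can be illustrated with a minimal base R sketch. It uses a small hypothetical identity vector and hypothetical candidate pairs (not RLdata500), and it assumes that a pair is a match exactly when both records share the same identity value:

```r
# Hypothetical identity vector for six records: equal values mark duplicates
identity <- c(1, 2, 3, 2, 4, 1)

# Hypothetical candidate pairs, given as record indices
pairs <- data.frame(id1 = c(1, 1, 2, 3),
                    id2 = c(6, 2, 4, 5))

# A pair is labeled 1 when both records carry the same identity value
pairs$is_match <- as.integer(identity[pairs$id1] == identity[pairs$id2])
pairs
```

Here records 1 and 6 share an identity value, as do records 2 and 4, so those two pairs are labeled 1 and the others 0.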

In the output of rec.pairs$pairs, the is_match column now states whether each record pair is a match (1) or not (0). Previously, when we did not provide the identity parameter, this column was set to NA. We will use this labeled output to train our classification model:

> train <- getMinimalTrain(rec.pairs)
> model <- trainSupv(train, method ="bagging")
> train.pred <- classifySupv(model, newdata = train)
> test.pred  <- classifySupv(model, newdata = rec.pairs)
> 
> summary(train.pred)

Deduplication Data Set

500 records 
17 record pairs 

9 matches
8 non-matches
0 pairs with unknown status


9 links detected 
0 possible links detected 
8 non-links detected 

alpha error: 0.000000
beta error: 0.000000
accuracy: 1.000000


Classification table:

           classification
true status N P L
      FALSE 8 0 0
      TRUE  0 0 9
> summary(test.pred)

Deduplication Data Set

500 records 
1221 record pairs 

49 matches
1172 non-matches
0 pairs with unknown status


52 links detected 
0 possible links detected 
1169 non-links detected 

alpha error: 0.020408
beta error: 0.003413
accuracy: 0.995905


Classification table:

           classification
true status    N    P    L
      FALSE 1168    0    4
      TRUE     1    0   48
>
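
The error rates printed by summary can be reproduced from the classification table. The base R sketch below rebuilds the table from the test.pred output above. It assumes that the alpha error is the share of true matches classified as non-links and the beta error is the share of true non-matches classified as links; these definitions are inferred from the printed figures rather than taken from the package source:

```r
# Classification table copied from summary(test.pred):
# rows = true status, columns = classification (N, P, L)
tab <- matrix(c(1168, 0,  4,
                   1, 0, 48),
              nrow = 2, byrow = TRUE,
              dimnames = list(true_status = c("FALSE", "TRUE"),
                              classification = c("N", "P", "L")))

# Assumed definitions, chosen because they reproduce the printed values
alpha_error <- tab["TRUE", "N"] / sum(tab["TRUE", ])    # missed matches / true matches
beta_error  <- tab["FALSE", "L"] / sum(tab["FALSE", ])  # false links / true non-matches
accuracy    <- (tab["FALSE", "N"] + tab["TRUE", "L"]) / sum(tab)

round(c(alpha = alpha_error, beta = beta_error, accuracy = accuracy), 6)
```

Rounded to six decimals, the three values agree with the 0.020408, 0.003413, and 0.995905 reported in the summary.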

The getMinimalTrain function draws a small subset of pairs from rec.pairs to serve as training data (17 of the 1,221 pairs here). The trainSupv function then trains a bagging model on that subset, and classifySupv applies the model to both the training data and the test data, which in this case is the whole of rec.pairs. The summary function reports the accuracy of each set of predictions: the model classifies the training set with 100% accuracy and the whole dataset with about 99.6% accuracy (5 of 1,221 pairs misclassified). RecordLinkage supports several classification models, including neural networks, SVM, bagging, and trees. Because our training set is very small, bagging or SVM is advisable.
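
To give an idea of why the training set is so small, here is a minimal base R sketch of a pattern-based selection. It assumes, following the package documentation as we understand it, that getMinimalTrain keeps one example pair per distinct comparison pattern. The data frame and column selection are hypothetical, and the sketch keeps the first pair of each pattern, whereas the package function samples its examples:

```r
# Hypothetical binary comparison patterns (toy data, not RLdata500)
pairs <- data.frame(
  fname_c1 = c(1, 1, 0, 1, 0, 1),
  lname_c1 = c(0, 0, 1, 1, 1, 0),
  by       = c(0, 0, 1, 1, 1, 0),
  is_match = c(0, 0, 0, 1, 0, 0)
)

# Columns that make up the comparison pattern
feature_cols <- c("fname_c1", "lname_c1", "by")

# Keep the first pair seen for each distinct comparison pattern
minimal_train <- pairs[!duplicated(pairs[, feature_cols]), ]
minimal_train
```

With binary comparison values only a handful of distinct patterns can occur, so the selection stays small. Under the same assumption, string similarity values produce many more distinct patterns and therefore a larger training set.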

Alternatively, we can use the output of our unsupervised clustering as an initial training set to build our first supervised learning model:

> rec.pairs <- compare.dedup(RLdata500
+                            ,blockfld = list(1, 5:7)
+                            ,strcmp =   c(2,3,4)
+                            ,strcmpfun = levenshteinSim)
> 
> # Run K-Means Model
> kmeans.model <- classifyUnsup(rec.pairs, method = "kmeans")
> 
> # Change the original rec.pairs with rec.pairs from K-Means
> pairs <- kmeans.model$pairs
> pairs$prediction <- kmeans.model$prediction
> head(pairs)
  id1 id2 fname_c1 fname_c2  lname_c1 lname_c2 by bm bd is_match prediction
1   1 174        1       NA 0.1428571       NA  0  0  0       NA          N
2   1 204        1       NA 0.0000000       NA  0  0  0       NA          N
3   2   7        1       NA 0.3750000       NA  0  0  0       NA          N
4   2  43        1       NA 0.8333333       NA  1  1  1       NA          L
5   2 169        1       NA 0.0000000       NA  0  0  0       NA          N
6   4  19        1       NA 0.1428571       NA  0  0  0       NA          N
>

We pass our rec.pairs to the clustering method and extract the pairs with their predictions. We want to replace the is_match column with our predictions. However, the values should be 0 or 1 in the is_match column instead of N or L:

> pairs$is_match <- NULL
> pairs$is_match <- ifelse(pairs$prediction == 'N', 0,1)
> pairs$prediction <- NULL
> pairs[is.na(pairs)] <- 0
> head(pairs)
  id1 id2 fname_c1 fname_c2  lname_c1 lname_c2 by bm bd is_match
1   1 174        1        0 0.1428571        0  0  0  0        0
2   1 204        1        0 0.0000000        0  0  0  0        0
3   2   7        1        0 0.3750000        0  0  0  0        0
4   2  43        1        0 0.8333333        0  1  1  1        1
5   2 169        1        0 0.0000000        0  0  0  0        0
6   4  19        1        0 0.1428571        0  0  0  0        0
> 
> rec.pairs$pairs <- pairs
> head(rec.pairs$pairs)
  id1 id2 fname_c1 fname_c2  lname_c1 lname_c2 by bm bd is_match
1   1 174        1        0 0.1428571        0  0  0  0        0
2   1 204        1        0 0.0000000        0  0  0  0        0
3   2   7        1        0 0.3750000        0  0  0  0        0
4   2  43        1        0 0.8333333        0  1  1  1        1
5   2 169        1        0 0.0000000        0  0  0  0        0
6   4  19        1        0 0.1428571        0  0  0  0        0
> 
>
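
The relabeling step can be tried in isolation on a toy data frame. The sketch below uses hypothetical values and maps only the L level to 1, a slightly stricter variant of the ifelse call above that would also send a possible link (P) to 0 rather than 1. This choice is an assumption made for illustration; K-Means produces no P predictions here, so both variants give the same result on our data:

```r
# Hypothetical pairs with clustering predictions (toy data)
pairs <- data.frame(
  id1        = c(1, 1, 2),
  id2        = c(174, 204, 43),
  fname_c2   = c(NA, NA, NA),
  is_match   = c(NA, NA, NA),
  prediction = factor(c("N", "N", "L"), levels = c("N", "P", "L"))
)

# Only links (L) become matches; N and P are mapped to 0
pairs$is_match <- as.integer(pairs$prediction == "L")

# Drop the helper column, then replace the remaining NA features with 0
pairs$prediction <- NULL
pairs[is.na(pairs)] <- 0
pairs
```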

Having changed the is_match column to binary values, we replace the original pairs data frame in rec.pairs with our modified pairs data frame. We can then follow the same process as in the previous section: generate a small training set, build a model, and verify its accuracy:

> train <- getMinimalTrain(rec.pairs)
Warning message:
In getMinimalTrain(rec.pairs) :
  Comparison patterns in rpairs contain string comparison values!
> model <- trainSupv(train, method ="bagging")
> train.pred <- classifySupv(model, newdata = train)
> test.pred  <- classifySupv(model, newdata = rec.pairs)
> 
> summary(train.pred)

Deduplication Data Set

500 records 
82 record pairs 

38 matches
44 non-matches
0 pairs with unknown status


38 links detected 
0 possible links detected 
44 non-links detected 

alpha error: 0.000000
beta error: 0.000000
accuracy: 1.000000


Classification table:

           classification
true status  N  P  L
      FALSE 44  0  0
      TRUE   0  0 38
> summary(test.pred)

Deduplication Data Set

500 records 
1221 record pairs 

150 matches
1071 non-matches
0 pairs with unknown status


150 links detected 
0 possible links detected 
1071 non-links detected 

alpha error: 0.000000
beta error: 0.000000
accuracy: 1.000000


Classification table:

           classification
true status    N    P    L
      FALSE 1071    0    0
      TRUE     0    0  150

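We have achieved 100% accuracy on both the training set and the whole dataset. Note, however, that is_match now holds the K-Means predictions rather than the true match status (150 pairs are labeled as matches here, against 49 in the identity-based run), so this accuracy measures how well the bagging model reproduces the clustering labels, not how well it recovers the true duplicates.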
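
Since the labels in this run come from K-Means, a check against identity-based labels would need a separate cross-tabulation. The base R sketch below shows the idea on hypothetical vectors: truth stands in for an is_match column built with the identity parameter, and predicted stands in for the model's predictions on the same pairs in the same order. Both vectors are invented for illustration:

```r
# Hypothetical true labels and model predictions for eight pairs
truth     <- c(0, 0, 1, 1, 0, 0, 1, 0)
predicted <- factor(c("N", "L", "L", "L", "N", "N", "N", "N"),
                    levels = c("N", "P", "L"))

# Cross-tabulate true status against the predicted class
tab <- table(true_status = truth == 1, classification = predicted)
tab

# Share of pairs whose prediction agrees with the true status
accuracy <- (tab["FALSE", "N"] + tab["TRUE", "L"]) / sum(tab)
accuracy
```

In this toy case one non-match is classified as a link and one match is missed, so six of the eight pairs are classified correctly.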
