The emWeights function uses the expectation-maximization (EM) algorithm to derive weights that measure how close two entities are. The method requires two conditional probabilities, one for a match and another for a non-match.
P (features | match = 0) and P (features | match = 1) are estimated using the expectation-maximization algorithm. The weight of a comparison pattern is the base-2 logarithm of the ratio of these two probabilities, log2(P (features | match = 1) / P (features | match = 0)). This approach is called the Fellegi-Sunter model.
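To make the weight calculation concrete, here is a minimal base R sketch of a Fellegi-Sunter style weight for a single comparison pattern. The per-field probabilities m and u are hypothetical values chosen for illustration, not estimates from RLdata500, and the helper pattern_weight is not part of the RecordLinkage package. The sketch also assumes that the fields are conditionally independent.

```r
# Hypothetical per-field probabilities (illustration only).
# m[i] = P(field i agrees | match), u[i] = P(field i agrees | non-match)
m <- c(fname = 0.95, lname = 0.90, by = 0.98)
u <- c(fname = 0.05, lname = 0.02, by = 0.01)

# Weight of one binary comparison pattern (1 = agree, 0 = disagree),
# assuming conditional independence between the fields.
pattern_weight <- function(pattern, m, u) {
  p_match    <- prod(ifelse(pattern == 1, m, 1 - m))
  p_nonmatch <- prod(ifelse(pattern == 1, u, 1 - u))
  log2(p_match / p_nonmatch)
}

w_all_agree    <- pattern_weight(c(1, 1, 1), m, u)
w_all_disagree <- pattern_weight(c(0, 0, 0), m, u)
print(c(all_agree = w_all_agree, all_disagree = w_all_disagree))
```

With these made-up probabilities, full agreement yields a positive weight and full disagreement a negative one, which is the behavior the classification step relies on.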
> library(RecordLinkage)
> data("RLdata500")
> rec.pairs <- compare.dedup(RLdata500
+ ,blockfld = list(1, 5:7)
+ ,strcmp = c(2,3,4)
+ ,strcmpfun = levenshteinSim)
> pairs.weights <- emWeights(rec.pairs)
> hist(pairs.weights$Wdata)
>
As seen in the feature generation section, we use the compare.dedup function to generate string comparison features. With these features, we invoke the emWeights function to get the Fellegi-Sunter weights. The output of emWeights is a list:
> str(pairs.weights)
List of 8
$ data :'data.frame': 500 obs. of 7 variables:
..$ fname_c1: Factor w/ 146 levels "ALEXANDER","ANDRE",..: 19 42 114 128 112 77 42 139 26 99 ...
..$ fname_c2: Factor w/ 23 levels "ALEXANDER","ANDREAS",..: NA NA NA NA NA NA NA NA NA NA ...
..$ lname_c1: Factor w/ 108 levels "ALBRECHT","BAUER",..: 61 2 31 106 50 23 76 61 77 30 ...
..$ lname_c2: Factor w/ 8 levels "ENGEL","FISCHER",..: NA NA NA NA NA NA NA NA NA NA ...
..$ by : int [1:500] 1949 1968 1930 1957 1966 1929 1967 1942 1978 1971 ...
..$ bm : int [1:500] 7 7 4 9 1 7 8 9 3 2 ...
..$ bd : int [1:500] 22 27 30 2 13 4 1 20 4 27 ...
$ pairs :'data.frame': 1221 obs. of 10 variables:
..$ id1 : num [1:1221] 1 1 2 2 2 4 4 4 4 4 ...
..$ id2 : num [1:1221] 174 204 7 43 169 19 50 78 83 133 ...
..$ fname_c1: num [1:1221] 1 1 1 1 1 1 1 1 1 1 ...
..$ fname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
..$ lname_c1: num [1:1221] 0.143 0 0.375 0.833 0 ...
..$ lname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
..$ by : num [1:1221] 0 0 0 1 0 0 1 0 0 0 ...
..$ bm : num [1:1221] 0 0 0 1 0 0 0 0 1 0 ...
..$ bd : num [1:1221] 0 0 0 1 0 0 0 0 0 0 ...
..$ is_match: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
$ frequencies: Named num [1:7] 0.00685 0.04167 0.00926 0.11111 0.01163 ...
..- attr(*, "names")= chr [1:7] "fname_c1" "fname_c2" "lname_c1" "lname_c2" ...
$ type : chr "deduplication"
$ M : num [1:128] 0.000355 0.001427 0.004512 0.01815 0.001504 ...
$ U : num [1:128] 2.84e-04 8.01e-06 2.52e-05 7.10e-07 2.83e-06 ...
$ W : num [1:128] 0.322 7.477 7.486 14.641 9.053 ...
$ Wdata : num [1:1221] -10.3 -10.3 -10.3 12.8 -10.3 ...
- attr(*, "class")= chr "RecLinkData"
>
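The M, U, and W components in this listing can be related to each other with a quick check. The sketch below copies the first five rounded values of each vector from the str output and recomputes the weights as log2(M / U). Because the printed inputs are rounded, only approximate agreement should be expected.

```r
# First five entries of M, U and W as printed (rounded) by str(pairs.weights)
M <- c(0.000355, 0.001427, 0.004512, 0.01815, 0.001504)
U <- c(2.84e-04, 8.01e-06, 2.52e-05, 7.10e-07, 2.83e-06)
W_printed <- c(0.322, 7.477, 7.486, 14.641, 9.053)

# Recompute the weights as the base-2 log of the ratio M / U
W <- log2(M / U)
print(round(W, 3))

# Largest absolute difference from the printed weights
print(max(abs(W - W_printed)))
```

The recomputed values agree with the printed W entries to within the rounding of the inputs.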
The Wdata vector stores the EM-based record linkage weights, one per record pair; higher values indicate better matches. The hist(pairs.weights$Wdata) call shown earlier plots these weights as a histogram so that we can look at their distribution.
The histogram is skewed with a lot of negative weights and very few positive weights. This gives us a hint that we have very few matches in our dataset. Alternatively, we can view the weight distribution as follows:
> summary(pairs.weights)
Deduplication Data Set
500 records
1221 record pairs
0 matches
0 non-matches
1221 pairs with unknown status
Weight distribution:
[-15,-10]  (-10,-5]    (-5,0]     (0,5]    (5,10]   (10,15]   (15,20]   (20,25]   (25,30]
     1011         0       148         9         2        29         0         5        17
>
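One way to read this table when thinking about cut-offs is to count how many pairs lie above each bin edge. The sketch below does this in base R with the counts copied from the summary output. The variable names are illustrative. Because the bins are right-closed, the sketch assumes that no weight falls exactly on a bin edge.

```r
# Bin counts copied from the weight distribution printed by summary()
counts     <- c(1011, 0, 148, 9, 2, 29, 0, 5, 17)
lower_edge <- c(-15, -10, -5, 0, 5, 10, 15, 20, 25)

# Number of pairs whose weight lies above each bin's lower edge
pairs_above <- rev(cumsum(rev(counts)))
names(pairs_above) <- lower_edge
print(pairs_above)
```

For example, the entry named "10" gives the number of pairs with a weight above 10, and the entry named "5" gives the number above 5.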
The getPairs function conveniently lists the record pairs along with the weight of each pair:
> weights.df<-getPairs(pairs.weights)
> head(weights.df)
   id fname_c1 fname_c2 lname_c1 lname_c2   by bm bd    Weight
1  48   WERNER     <NA>  KOERTIG     <NA> 1965 11 28
2 238  WERNIER     <NA>  KOERTIG     <NA> 1965 11 28 29.628078
3
4  68   PETEVR     <NA>    FUCHS     <NA> 1972  9 12
5 190    PETER     <NA>    FUCHS     <NA> 1972  9 12 29.628078
6
For record IDs 48 and 238, the weight is 29.63. The higher the weight, the more likely the pair is a match. With the weights in hand, we can use a threshold-based classification model. We can derive the thresholds from either the histogram or the weight distribution. We set the upper threshold to 10: a pair with a weight of 10 or more is tagged as a match. We set the lower threshold to 5: a pair with a weight below 5 is tagged as a no match, and a pair with a weight between the two thresholds is tagged as a possible match. The emClassify function classifies the entity pairs accordingly:
> pairs.classify <- emClassify(pairs.weights, threshold.upper = 10, threshold.lower = 5)
> summary(pairs.classify)
Deduplication Data Set
500 records
1221 record pairs
0 matches
0 non-matches
1221 pairs with unknown status
Weight distribution:
[-15,-10]  (-10,-5]    (-5,0]     (0,5]    (5,10]   (10,15]   (15,20]   (20,25]   (25,30]
     1011         0       148         9         2        29         0         5        17
51 links detected
2 possible links detected
1168 non-links detected
Classification table:
           classification
true status    N    P    L
       <NA> 1168    2   51
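The two-threshold rule behind these counts can be sketched in a few lines of base R. The function classify_by_threshold below is a hypothetical helper, not the emClassify implementation. It assumes that a weight equal to a threshold falls into the higher class. Most of the sample weights are taken from the output in this section; the value 7.5 is made up to illustrate a possible link.

```r
# Hypothetical helper: label weights as link (L), possible link (P) or non-link (N).
# Assumption: weights equal to a threshold fall into the higher class.
classify_by_threshold <- function(w, lower, upper) {
  ifelse(w >= upper, "L", ifelse(w >= lower, "P", "N"))
}

# Weights seen in this section, plus one made-up value (7.5)
w <- c(-10.28161, 12.76895, 10.29360, 23.37222, 29.628078, 7.5)

labels <- classify_by_threshold(w, lower = 5, upper = 10)
print(data.frame(weight = w, label = labels))
print(table(factor(labels, levels = c("N", "P", "L"))))
```

Only the made-up weight of 7.5 lands between the two thresholds, so it is the only pair labeled P here.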
The label N stands for no match (no link found), P for possible matches, and L for matches, also known as links. With the given thresholds, 51 matches were found. Let's collate all our results in a single data frame:
> final.results <- pairs.classify$pairs
> final.results$weight <- pairs.classify$Wdata
> final.results$links <- pairs.classify$prediction
> head(final.results)
id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match weight links
1 1 174 1 NA 0.1428571 NA 0 0 0 NA -10.28161 N
2 1 204 1 NA 0.0000000 NA 0 0 0 NA -10.28161 N
3 2 7 1 NA 0.3750000 NA 0 0 0 NA -10.28161 N
4 2 43 1 NA 0.8333333 NA 1 1 1 NA 12.76895 L
5 2 169 1 NA 0.0000000 NA 0 0 0 NA -10.28161 N
6 4 19 1 NA 0.1428571 NA 0 0 0 NA -10.28161 N
>
Let us use the final.results data frame to plot a bar chart of the predicted link types:
> counts <- table(final.results$links)
> barplot(counts, main="Link Distribution",
+ xlab="Link Types")
The bar graph of the links column shows the distribution of our predictions.
Finally, we can give the list of matches to our customer:
> weights.df.srow <-getPairs( pairs.weights, single.rows = TRUE)
> final.matches <- final.results[final.results$links == 'L',]
>
> final <- merge(final.matches, weights.df.srow)
> final <- subset(final, select = -c(fname_c1.2, fname_c2.2, lname_c1.2, lname_c2.2, by.2, bm.2, bd.2, weight))
> head(final)
id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match links fname_c1.1 fname_c2.1 lname_c1.1
1 106 175 1 NA 1.0000000 NA 1 0 1 NA L ANDRE <NA> MUELLER
2 108 203 1 NA 1.0000000 NA 0 1 1 NA L GERHARD <NA> FRIEDRICH
3 112 116 1 NA 0.8000000 NA 1 1 1 NA L GERHARD <NA> ERNSR
4 119 131 0 NA 0.1111111 NA 1 1 1 NA L ALEXANDER <NA> FRIEDRICH
5 120 165 1 NA 0.8750000 NA 1 1 1 NA L FRANK <NA> BERGMANN
6 125 193 1 NA 0.8750000 NA 1 1 1 NA L CHRISTIAN <NA> MUELLEPR
lname_c2.1 by.1 bm.1 bd.1 Weight
1 <NA> 1976 2 25 11.86047
2 <NA> 1987 2 10 10.29360
3 <NA> 1980 12 16 12.76895
4 <NA> 1968 8 14 23.37222
5 <NA> 1998 11 8 12.76895
6 <NA> 1974 8 9 12.76895
>
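The fractional lname_c1 values in this table, such as 0.875 and 0.8, are string similarities. The sketch below shows how such values can arise from a normalized Levenshtein similarity, computed here as 1 minus the edit distance divided by the length of the longer string, using adist from base R. The helper lev_sim is illustrative and is not the package's levenshteinSim. The partner surnames MUELLER and ERNST are assumptions, since the output above shows only one record of each pair.

```r
# Illustrative normalized Levenshtein similarity built on base R's adist()
lev_sim <- function(a, b) {
  d <- as.numeric(utils::adist(a, b))   # edit distance between the two strings
  1 - d / max(nchar(a), nchar(b))       # scale to [0, 1] by the longer string
}

# Partner names are assumed; only MUELLEPR and ERNSR appear in the output
sim_mueller <- lev_sim("MUELLER", "MUELLEPR")
sim_ernst   <- lev_sim("ERNST", "ERNSR")
print(c(MUELLER_MUELLEPR = sim_mueller, ERNST_ERNSR = sim_ernst))
```

Under these assumptions, a single inserted or substituted character gives similarities consistent with the lname_c1 values reported for those pairs.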