Expectation maximization method

The method, emWeights, is based on the expectation maximization algorithm to derive from the weights, a measure of the closeness of two entities. According to this method, two conditional probabilities, one for match and an other for no match, has to be derived.

P (features | match = 0) and P (features | match = 1) are estimated using the expectation maximization algorithm. The weights are calculated as the ratio of these two probabilities. This approach is called the Fellegi-Sunter model.

> library(RecordLinkage)
> data("RLdata500")
> rec.pairs <- compare.dedup(RLdata500
+ ,blockfld = list(1, 5:7)
+ ,strcmp = c(2,3,4)
+ ,strcmpfun = levenshteinSim)
> pairs.weights <- emWeights(rec.pairs)

> hist(pairs.weights$Wdata)
>
Using the EM Algorithm for weight computation in the Fellegi-Sunter model of record linkage - William E. Winkler.

As seen in the feature generation section, we use the dedup function to generate string comparison features. With the features, we invoke the emWeights function to get the Fellegi-Sunter weights. The output of emWeights is a list:

> str(pairs.weights)
List of 8
$ data :'data.frame': 500 obs. of 7 variables:
..$ fname_c1: Factor w/ 146 levels "ALEXANDER","ANDRE",..: 19 42 114 128 112 77 42 139 26 99 ...
..$ fname_c2: Factor w/ 23 levels "ALEXANDER","ANDREAS",..: NA NA NA NA NA NA NA NA NA NA ...
..$ lname_c1: Factor w/ 108 levels "ALBRECHT","BAUER",..: 61 2 31 106 50 23 76 61 77 30 ...
..$ lname_c2: Factor w/ 8 levels "ENGEL","FISCHER",..: NA NA NA NA NA NA NA NA NA NA ...
..$ by : int [1:500] 1949 1968 1930 1957 1966 1929 1967 1942 1978 1971 ...
..$ bm : int [1:500] 7 7 4 9 1 7 8 9 3 2 ...
..$ bd : int [1:500] 22 27 30 2 13 4 1 20 4 27 ...
$ pairs :'data.frame': 1221 obs. of 10 variables:
..$ id1 : num [1:1221] 1 1 2 2 2 4 4 4 4 4 ...
..$ id2 : num [1:1221] 174 204 7 43 169 19 50 78 83 133 ...
..$ fname_c1: num [1:1221] 1 1 1 1 1 1 1 1 1 1 ...
..$ fname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
..$ lname_c1: num [1:1221] 0.143 0 0.375 0.833 0 ...
..$ lname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
..$ by : num [1:1221] 0 0 0 1 0 0 1 0 0 0 ...
..$ bm : num [1:1221] 0 0 0 1 0 0 0 0 1 0 ...
..$ bd : num [1:1221] 0 0 0 1 0 0 0 0 0 0 ...
..$ is_match: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
$ frequencies: Named num [1:7] 0.00685 0.04167 0.00926 0.11111 0.01163 ...
..- attr(*, "names")= chr [1:7] "fname_c1" "fname_c2" "lname_c1" "lname_c2" ...
$ type : chr "deduplication"
$ M : num [1:128] 0.000355 0.001427 0.004512 0.01815 0.001504 ...
$ U : num [1:128] 2.84e-04 8.01e-06 2.52e-05 7.10e-07 2.83e-06 ...
$ W : num [1:128] 0.322 7.477 7.486 14.641 9.053 ...
$ Wdata : num [1:1221] -10.3 -10.3 -10.3 12.8 -10.3 ...
- attr(*, "class")= chr "RecLinkData"
>

The Wdata vector stores the weights for Record Linkage based on an EM algorithm, higher values indicate better matches.  Let's plot this data as a histogram to look at the weights distribution:

The histogram is skewed with a lot of negative weights and very few positive weights. This gives us a hint that we have very few matches in our dataset. Alternatively, we can view the weight distribution as follows:

> summary(pairs.weights)

Deduplication Data Set

500 records
1221 record pairs

0 matches
0 non-matches
1221 pairs with unknown status


Weight distribution:

[-15,-10] (-10,-5] (-5,0] (0,5] (5,10] (10,15] (15,20] (20,25] (25,30]
1011 0 148 9 2 29 0 5 17
>

The getPairs function conveniently gives the weights for the pair:

> weights.df<-getPairs(pairs.weights)
> head(weights.df)
id fname_c1 fname_c2 lname_c1 lname_c2 by bm bd Weight
1 48 WERNER <NA> KOERTIG <NA> 1965 11 28
2 238 WERNIER <NA> KOERTIG <NA> 1965 11 28 29.628078
3
4 68 PETEVR <NA> FUCHS <NA> 1972 9 12
5 190 PETER <NA> FUCHS <NA> 1972 9 12 29.628078
6

For record IDs 48 and 238, the weight is 29.62. The higher the weight is, the more probability there is  of a match. With the weights, now we can use a threshold-based classification model. We can derive the thresholds from either the histogram or the weight distribution. We are going to choose the upper threshold, that is, for a match, we need a weight of 10 or more. For a no match, we set the lower threshold as 5, and any entity pairs with less than 5 will be tagged as a no match. The emClassify function is used to classify the entities as match and no match:

> pairs.classify <- emClassify(pairs.weights, threshold.upper = 10, threshold.lower = 5)
> summary(pairs.classify)

Deduplication Data Set

500 records
1221 record pairs

0 matches
0 non-matches
1221 pairs with unknown status


Weight distribution:

[-15,-10] (-10,-5] (-5,0] (0,5] (5,10] (10,15] (15,20] (20,25] (25,30]
1011 0 148 9 2 29 0 5 17

51 links detected
2 possible links detected
1168 non-links detected

Classification table:

classification
true status N P L
<NA> 1168 2 51

The label N stands for no match or no links found. Label P stands for possible matches and label L for matches aka links founds. We see that with our given threshold, 51 matches were found. Let's make a single data frame to collate all our results:

> final.results <- pairs.classify$pairs
> final.results$weight <- pairs.classify$Wdata
> final.results$links <- pairs.classify$prediction
> head(final.results)
id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match weight links
1 1 174 1 NA 0.1428571 NA 0 0 0 NA -10.28161 N
2 1 204 1 NA 0.0000000 NA 0 0 0 NA -10.28161 N
3 2 7 1 NA 0.3750000 NA 0 0 0 NA -10.28161 N
4 2 43 1 NA 0.8333333 NA 1 1 1 NA 12.76895 L
5 2 169 1 NA 0.0000000 NA 0 0 0 NA -10.28161 N
6 4 19 1 NA 0.1428571 NA 0 0 0 NA -10.28161 N
>

Let us use the data frame final.results to plot a histogram:

counts <- table(final.results$links)
barplot(counts, main="Link Distribution",
xlab="Link Types")

A bar graph of links columns to look at our prediction distribution is as follows:

Finally, we can give the list of matches to our customer:

> weights.df.srow <-getPairs( pairs.weights, single.rows = TRUE)
> final.matches <- final.results[final.results$links == 'L',]
>
> final <- merge(final.matches, weights.df.srow)
> final <- subset(final, select = -c(fname_c1.2, fname_c2.2, lname_c1.2, lname_c2.2, by.2, bm.2, bd.2, weight))
> head(final)
id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match links fname_c1.1 fname_c2.1 lname_c1.1
1 106 175 1 NA 1.0000000 NA 1 0 1 NA L ANDRE <NA> MUELLER
2 108 203 1 NA 1.0000000 NA 0 1 1 NA L GERHARD <NA> FRIEDRICH
3 112 116 1 NA 0.8000000 NA 1 1 1 NA L GERHARD <NA> ERNSR
4 119 131 0 NA 0.1111111 NA 1 1 1 NA L ALEXANDER <NA> FRIEDRICH
5 120 165 1 NA 0.8750000 NA 1 1 1 NA L FRANK <NA> BERGMANN
6 125 193 1 NA 0.8750000 NA 1 1 1 NA L CHRISTIAN <NA> MUELLEPR
lname_c2.1 by.1 bm.1 bd.1 Weight
1 <NA> 1976 2 25 11.86047
2 <NA> 1987 2 10 10.29360
3 <NA> 1980 12 16 12.76895
4 <NA> 1968 8 14 23.37222
5 <NA> 1998 11 8 12.76895
6 <NA> 1974 8 9 12.76895
>
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset