Expectation maximization method

The method, emWeights, is based on the expectation maximization algorithm to derive from the weights, a measure of the closeness of two entities. According to this method, two conditional probabilities, one for match and an other for no match, has to be derived.

P (features | match = 0) and P (features | match = 1) are estimated using the expectation maximization algorithm. The weights are calculated as the ratio of these two probabilities. This approach is called the Fellegi-Sunter model.

> library(RecordLinkage)
> data("RLdata500")
> rec.pairs <- compare.dedup(RLdata500
+                            ,blockfld = list(1, 5:7)
+                            ,strcmp =   c(2,3,4)
+                            ,strcmpfun = levenshteinSim)
> pairs.weights <- emWeights(rec.pairs)

> hist(pairs.weights$Wdata)
>

Using the EM Algorithm for weight computation in the Fellegi-Sunter model of record linkage - William E. Winkler.

As seen in the feature generation section, we use the dedup function to generate string comparison features. With the features, we invoke the emWeights function to get the Fellegi-Sunter weights. The output of emWeights is a list:

> str(pairs.weights)
List of 8
 $ data       :'data.frame':    500 obs. of  7 variables:
  ..$ fname_c1: Factor w/ 146 levels "ALEXANDER","ANDRE",..: 19 42 114 128 112 77 42 139 26 99 ...
  ..$ fname_c2: Factor w/ 23 levels "ALEXANDER","ANDREAS",..: NA NA NA NA NA NA NA NA NA NA ...
  ..$ lname_c1: Factor w/ 108 levels "ALBRECHT","BAUER",..: 61 2 31 106 50 23 76 61 77 30 ...
  ..$ lname_c2: Factor w/ 8 levels "ENGEL","FISCHER",..: NA NA NA NA NA NA NA NA NA NA ...
  ..$ by      : int [1:500] 1949 1968 1930 1957 1966 1929 1967 1942 1978 1971 ...
  ..$ bm      : int [1:500] 7 7 4 9 1 7 8 9 3 2 ...
  ..$ bd      : int [1:500] 22 27 30 2 13 4 1 20 4 27 ...
 $ pairs      :'data.frame':    1221 obs. of  10 variables:
  ..$ id1     : num [1:1221] 1 1 2 2 2 4 4 4 4 4 ...
  ..$ id2     : num [1:1221] 174 204 7 43 169 19 50 78 83 133 ...
  ..$ fname_c1: num [1:1221] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ fname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
  ..$ lname_c1: num [1:1221] 0.143 0 0.375 0.833 0 ...
  ..$ lname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
  ..$ by      : num [1:1221] 0 0 0 1 0 0 1 0 0 0 ...
  ..$ bm      : num [1:1221] 0 0 0 1 0 0 0 0 1 0 ...
  ..$ bd      : num [1:1221] 0 0 0 1 0 0 0 0 0 0 ...
  ..$ is_match: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
 $ frequencies: Named num [1:7] 0.00685 0.04167 0.00926 0.11111 0.01163 ...
  ..- attr(*, "names")= chr [1:7] "fname_c1" "fname_c2" "lname_c1" "lname_c2" ...
 $ type       : chr "deduplication"
 $ M          : num [1:128] 0.000355 0.001427 0.004512 0.01815 0.001504 ...
 $ U          : num [1:128] 2.84e-04 8.01e-06 2.52e-05 7.10e-07 2.83e-06 ...
 $ W          : num [1:128] 0.322 7.477 7.486 14.641 9.053 ...
 $ Wdata      : num [1:1221] -10.3 -10.3 -10.3 12.8 -10.3 ...
 - attr(*, "class")= chr "RecLinkData"
>

The Wdata vector stores the weights for Record Linkage based on an EM algorithm, higher values indicate better matches. Let's plot this data as a histogram to look at the weights distribution:

The histogram is skewed with a lot of negative weights and very few positive weights. This gives us a hint that we have very few matches in our dataset. Alternatively, we can view the weight distribution as follows:

> summary(pairs.weights)

Deduplication Data Set

500 records 
1221 record pairs 

0 matches
0 non-matches
1221 pairs with unknown status


Weight distribution:

[-15,-10]  (-10,-5]    (-5,0]     (0,5]    (5,10]   (10,15]   (15,20]   (20,25]   (25,30] 
     1011         0       148         9         2        29         0         5        17 
>

The getPairs function conveniently gives the weights for the pair:

> weights.df<-getPairs(pairs.weights)
> head(weights.df)
   id fname_c1 fname_c2 lname_c1 lname_c2   by bm bd     Weight
1  48   WERNER     <NA>  KOERTIG     <NA> 1965 11 28           
2 238  WERNIER     <NA>  KOERTIG     <NA> 1965 11 28  29.628078
3                                                              
4  68   PETEVR     <NA>    FUCHS     <NA> 1972  9 12           
5 190    PETER     <NA>    FUCHS     <NA> 1972  9 12  29.628078
6

For record IDs 48 and 238, the weight is 29.62. The higher the weight is, the more probability there is of a match. With the weights, now we can use a threshold-based classification model. We can derive the thresholds from either the histogram or the weight distribution. We are going to choose the upper threshold, that is, for a match, we need a weight of 10 or more. For a no match, we set the lower threshold as 5, and any entity pairs with less than 5 will be tagged as a no match. The emClassify function is used to classify the entities as match and no match:

> pairs.classify <- emClassify(pairs.weights, threshold.upper = 10, threshold.lower = 5)
> summary(pairs.classify)

Deduplication Data Set

500 records 
1221 record pairs 

0 matches
0 non-matches
1221 pairs with unknown status


Weight distribution:

[-15,-10]  (-10,-5]    (-5,0]     (0,5]    (5,10]   (10,15]   (15,20]   (20,25]   (25,30] 
     1011         0       148         9         2        29         0         5        17 

51 links detected 
2 possible links detected 
1168 non-links detected 

Classification table:

           classification
true status    N    P    L
       <NA> 1168    2   51

The label N stands for no match or no links found. Label P stands for possible matches and label L for matches aka links founds. We see that with our given threshold, 51 matches were found. Let's make a single data frame to collate all our results:

> final.results <- pairs.classify$pairs
> final.results$weight <- pairs.classify$Wdata
> final.results$links <- pairs.classify$prediction
> head(final.results)
  id1 id2 fname_c1 fname_c2  lname_c1 lname_c2 by bm bd is_match    weight links
1   1 174        1       NA 0.1428571       NA  0  0  0       NA -10.28161     N
2   1 204        1       NA 0.0000000       NA  0  0  0       NA -10.28161     N
3   2   7        1       NA 0.3750000       NA  0  0  0       NA -10.28161     N
4   2  43        1       NA 0.8333333       NA  1  1  1       NA  12.76895     L
5   2 169        1       NA 0.0000000       NA  0  0  0       NA -10.28161     N
6   4  19        1       NA 0.1428571       NA  0  0  0       NA -10.28161     N
>

Let us use the data frame final.results to plot a histogram:

counts <- table(final.results$links)
barplot(counts, main="Link Distribution",
        xlab="Link Types")

A bar graph of links columns to look at our prediction distribution is as follows:

Finally, we can give the list of matches to our customer:

> weights.df.srow <-getPairs( pairs.weights, single.rows = TRUE)
> final.matches <- final.results[final.results$links == 'L',]
> 
> final <- merge(final.matches, weights.df.srow)
> final <- subset(final, select = -c(fname_c1.2, fname_c2.2, lname_c1.2, lname_c2.2, by.2, bm.2, bd.2, weight))
> head(final)
  id1 id2 fname_c1 fname_c2  lname_c1 lname_c2 by bm bd is_match links fname_c1.1 fname_c2.1 lname_c1.1
1 106 175        1       NA 1.0000000       NA  1  0  1       NA     L      ANDRE       <NA>    MUELLER
2 108 203        1       NA 1.0000000       NA  0  1  1       NA     L    GERHARD       <NA>  FRIEDRICH
3 112 116        1       NA 0.8000000       NA  1  1  1       NA     L    GERHARD       <NA>      ERNSR
4 119 131        0       NA 0.1111111       NA  1  1  1       NA     L  ALEXANDER       <NA>  FRIEDRICH
5 120 165        1       NA 0.8750000       NA  1  1  1       NA     L      FRANK       <NA>   BERGMANN
6 125 193        1       NA 0.8750000       NA  1  1  1       NA     L  CHRISTIAN       <NA>   MUELLEPR
  lname_c2.1 by.1 bm.1 bd.1   Weight
1       <NA> 1976    2   25 11.86047
2       <NA> 1987    2   10 10.29360
3       <NA> 1980   12   16 12.76895
4       <NA> 1968    8   14 23.37222
5       <NA> 1998   11    8 12.76895
6       <NA> 1974    8    9 12.76895
>

Table of Contents for Expectation maximization method

Create new playlist

Sign In

Sign Up

Table of Contents for
Expectation maximization method