Weights-based method

The epiWeights function implements the weights-based method. The R documentation gives a good introduction to it:

> help("epiWeights")
For more details about the weights method, refer to P. Contiero et al., The Epilink record linkage software, Methods Inf Med., 44(1):66–71, 2005.

Invoking the method and generating the results works much as it did with emWeights:

library(RecordLinkage)
data("RLdata500")

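# Build candidate pairs: block on field 1, or on fields 5 to 7 together,
# and compare fields 2 to 4 with a Levenshtein-based string similarity
rec.pairs <- compare.dedup(RLdata500,
                           blockfld = list(1, 5:7),
                           strcmp = c(2, 3, 4),
                           strcmpfun = levenshteinSim)

# Weight calculation
pairs.weights <- epiWeights(rec.pairs)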
hist(pairs.weights$Wdata)
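
To make the weight itself more concrete, here is a minimal base-R sketch of the score as the epiWeights documentation describes it: a weighted mean of the per-field similarity values, where each field weight is log2((1 - e) / f), f being the average frequency of values in the field and e the estimated error rate. The helper name epi_weight, the frequencies, and the similarity values are hypothetical illustration values, not part of RecordLinkage or of RLdata500.

```r
# Hypothetical helper (not a RecordLinkage function): weighted mean of
# per-field similarities, with field weights log2((1 - e) / f).
epi_weight <- function(sims, f, e = 0.01) {
  w <- log2((1 - e) / f)   # fields with rarer values (small f) weigh more
  sum(w * sims) / sum(w)
}

# Assumed average value frequencies for three fields (made-up numbers)
f <- c(fname = 0.01, lname = 0.005, by = 0.02)

# A pair that agrees on every field, and one that agrees only partially
w_equal   <- epi_weight(c(1, 1, 1), f)
w_partial <- epi_weight(c(1, 0.8, 0), f)
```

Under these assumptions, full agreement gives a weight of 1, and any disagreement pulls the weight toward 0, which is why the histogram is bounded between 0 and 1.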

Once again, the distribution is similar to our histogram from the emWeights method. Let's look at an alternate view of the weight distribution:

> summary(pairs.weights)

Deduplication Data Set

500 records
1221 record pairs

0 matches
0 non-matches
1221 pairs with unknown status


Weight distribution:

[0.15,0.2] (0.2,0.25] (0.25,0.3] (0.3,0.35] (0.35,0.4] (0.4,0.45] (0.45,0.5] (0.5,0.55] (0.55,0.6]
       371        445        186         83         66         14          7          1         15
(0.6,0.65] (0.65,0.7] (0.7,0.75] (0.75,0.8] (0.8,0.85] (0.85,0.9]
         8         10          0         13          0          2
>
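
The binned view in this summary can be approximated in base R with cut and table. This is only a sketch of how to read the output: the weights below are made up, and the breaks are chosen to mirror the interval labels shown above, not taken from the package's internals.

```r
# Made-up weights, standing in for pairs.weights$Wdata
w <- c(0.17, 0.22, 0.24, 0.31, 0.58, 0.77, 0.88)

# Bins of width 0.05 from 0.15 to 0.9; the first bin is closed on the left
bins <- cut(w, breaks = seq(0.15, 0.9, by = 0.05), include.lowest = TRUE)

# Number of weights falling into each bin
bin_counts <- table(bins)
```

Most pairs in the real output sit in the lowest bins, which suggests that the bulk of the candidate pairs are unlikely to be duplicates.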

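Once again, we use getPairs to inspect the pairs and emClassify to classify them, as we did with emWeights (the package also provides epiClassify for weights computed by epiWeights):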

weights.df <- getPairs(pairs.weights)
head(weights.df)

# Classification
pairs.classify <- emClassify(pairs.weights, threshold.upper = 0.5, threshold.lower = 0.3)

# View the matches
final.results <- pairs.classify$pairs
final.results$weight <- pairs.classify$Wdata
final.results$links <- pairs.classify$prediction
head(final.results)

counts <- table(final.results$links)
barplot(counts, main = "Link Distribution",
        xlab = "Link Types")
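
The two thresholds split the pairs into links (L), possible links (P), and non-links (N). The following base-R sketch shows that rule on a handful of made-up weights. classify_weights is a hypothetical helper, and the boundary handling (at or above the upper threshold is a link, strictly below the lower threshold is a non-link) is an assumption about how the package treats ties.

```r
# Hypothetical helper (not a RecordLinkage function): threshold-based
# classification of weights into N, P, and L.
classify_weights <- function(w, upper, lower = upper) {
  stopifnot(lower <= upper)
  pred <- rep("P", length(w))   # default: possible link, needs review
  pred[w >= upper] <- "L"       # assumed: at or above upper is a link
  pred[w < lower]  <- "N"       # assumed: strictly below lower is a non-link
  factor(pred, levels = c("N", "P", "L"))
}

# Made-up weights with the same thresholds as in the text
w <- c(0.18, 0.30, 0.42, 0.50, 0.77)
pred <- classify_weights(w, upper = 0.5, lower = 0.3)
link_counts <- table(pred)
```

Raising the upper threshold moves pairs from L to P, trading fewer false links for more manual review.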

Finally, we generate the list of matches for our customer:

> weights.df.srow <- getPairs(pairs.weights, single.rows = TRUE)
> final.matches <- final.results[final.results$links == 'L', ]
>
> final <- merge(final.matches, weights.df.srow)
> final <- subset(final, select = -c(fname_c1.2, fname_c2.2, lname_c1.2, lname_c2.2, by.2, bm.2, bd.2, weight))
> head(final)
  id1 id2 fname_c1 fname_c2  lname_c1 lname_c2 by bm bd is_match links fname_c1.1 fname_c2.1 lname_c1.1
1 106 175        1       NA 1.0000000       NA  1  0  1       NA     L      ANDRE       <NA>    MUELLER
2 108 203        1       NA 1.0000000       NA  0  1  1       NA     L    GERHARD       <NA>  FRIEDRICH
3 112 116        1       NA 0.8000000       NA  1  1  1       NA     L    GERHARD       <NA>      ERNSR
4 120 165        1       NA 0.8750000       NA  1  1  1       NA     L      FRANK       <NA>   BERGMANN
5 125 193        1       NA 0.8750000       NA  1  1  1       NA     L  CHRISTIAN       <NA>   MUELLEPR
6 127 142        1       NA 0.8333333       NA  1  1  0       NA     L       KARL       <NA>      KLEIN
  lname_c2.1 by.1 bm.1 bd.1    Weight
1       <NA> 1976    2   25 0.6910486
2       <NA> 1987    2   10 0.6133400
3       <NA> 1980   12   16 0.7518301
4       <NA> 1998   11    8 0.7656562
5       <NA> 1974    8    9 0.7656562
6       <NA> 2002    6   20 0.6228760
>