We will use compare.dedup to generate the features:
> rec.pairs <- compare.dedup(RLdata500
+ ,blockfld = list(1, 5:7)
+ ,strcmp = c(2,3,4)
+ ,strcmpfun = levenshteinSim)
> summary(rec.pairs)
Deduplication Data Set
500 records
1221 record pairs
0 matches
0 non-matches
1221 pairs with unknown status
> matches <- rec.pairs$pairs
> matches[c(1:3, 1203:1204), ]
     id1 id2 fname_c1 fname_c2  lname_c1 lname_c2 by bm bd is_match
1      1 174        1       NA 0.1428571       NA  0  0  0       NA
2      1 204        1       NA 0.0000000       NA  0  0  0       NA
3      2   7        1       NA 0.3750000       NA  0  0  0       NA
1203 448 497        1       NA 0.0000000       NA  0  0  0       NA
1204 450 477        1       NA 0.0000000       NA  0  0  0       NA
We have 500 records, so comparing every record with every other record would require 500*(500-1)/2 = 124,750 pairs; in general, n records produce n(n-1)/2 pairs. For a large dataset, generating and comparing all of these pairs quickly becomes computationally expensive. The blockfld parameter alleviates this problem. It reduces the number of pairs by generating only those that satisfy certain constraints:
blockfld = list(1, 5:7)
Our constraints are represented in a list, and each list element is a separate blocking criterion. Two records qualify as a pair if they agree on the first column, or if they agree on all of columns 5 through 7. You can see in the results that we are left with only 1,221 pairs:
Deduplication Data Set
500 records
1221 record pairs
0 matches
0 non-matches
1221 pairs with unknown status
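The effect of blocking can be illustrated with a small base R sketch. The data frame records, the helper agree_on, and the object names below are hypothetical and chosen only for illustration. The sketch enumerates every pair and then filters, which is easy to read but is not necessarily how compare.dedup builds its pairs internally:

```r
# Toy records (hypothetical data, not taken from RLdata500)
records <- data.frame(
  fname = c("ANNA", "ANNA", "PETER", "KARL"),
  lname = c("MEIER", "MEYER", "KOCH", "KOCH"),
  by    = c(1980, 1981, 1975, 1975),
  bm    = c(5, 5, 3, 3),
  bd    = c(12, 12, 9, 9),
  stringsAsFactors = FALSE
)

# Every unordered pair of row indices: n(n-1)/2 rows, one pair per row
all_pairs <- t(combn(nrow(records), 2))

# TRUE when records i and j agree on every column named in cols
agree_on <- function(i, j, cols) {
  all(vapply(cols,
             function(col) records[[col]][i] == records[[col]][j],
             logical(1)))
}

# Blocking criteria: same first name, OR same full date of birth
blocks <- list("fname", c("by", "bm", "bd"))

# Keep a pair when it satisfies at least one criterion
keep <- apply(all_pairs, 1, function(p) {
  any(vapply(blocks, function(cols) agree_on(p[1], p[2], cols), logical(1)))
})
candidate_pairs <- all_pairs[keep, , drop = FALSE]

nrow(all_pairs)        # number of pairs without blocking
nrow(candidate_pairs)  # number of pairs that survive blocking
```

In this toy example only the pairs (1, 2) and (3, 4) survive: the first agrees on the first name, and the second agrees on all three date-of-birth columns. The toy data contains no missing values; handling NA would need extra care.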
The compare.dedup function returns a list of class RecLinkData:
> str(rec.pairs)
List of 4
$ data :'data.frame': 500 obs. of 7 variables:
..$ fname_c1: Factor w/ 146 levels "ALEXANDER","ANDRE",..: 19 42 114 128 112 77 42 139 26 99 ...
..$ fname_c2: Factor w/ 23 levels "ALEXANDER","ANDREAS",..: NA NA NA NA NA NA NA NA NA NA ...
..$ lname_c1: Factor w/ 108 levels "ALBRECHT","BAUER",..: 61 2 31 106 50 23 76 61 77 30 ...
..$ lname_c2: Factor w/ 8 levels "ENGEL","FISCHER",..: NA NA NA NA NA NA NA NA NA NA ...
..$ by : int [1:500] 1949 1968 1930 1957 1966 1929 1967 1942 1978 1971 ...
..$ bm : int [1:500] 7 7 4 9 1 7 8 9 3 2 ...
..$ bd : int [1:500] 22 27 30 2 13 4 1 20 4 27 ...
$ pairs :'data.frame': 1221 obs. of 10 variables:
..$ id1 : num [1:1221] 1 1 2 2 2 4 4 4 4 4 ...
..$ id2 : num [1:1221] 174 204 7 43 169 19 50 78 83 133 ...
..$ fname_c1: num [1:1221] 1 1 1 1 1 1 1 1 1 1 ...
..$ fname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
..$ lname_c1: num [1:1221] 0.143 0 0.375 0.833 0 ...
..$ lname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
..$ by : num [1:1221] 0 0 0 1 0 0 1 0 0 0 ...
..$ bm : num [1:1221] 0 0 0 1 0 0 0 0 1 0 ...
..$ bd : num [1:1221] 0 0 0 1 0 0 0 0 0 0 ...
..$ is_match: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
$ frequencies: Named num [1:7] 0.00685 0.04167 0.00926 0.11111 0.01163 ...
..- attr(*, "names")= chr [1:7] "fname_c1" "fname_c2" "lname_c1" "lname_c2" ...
$ type : chr "deduplication"
- attr(*, "class")= chr "RecLinkData"
The pairs entry in the list is a data frame that holds all the generated features. We capture this data frame under the name matches:
matches <- rec.pairs$pairs
Let's look at the first two rows of this data frame. Each row compares two records:
  id1 id2 fname_c1 fname_c2  lname_c1 lname_c2 by bm bd is_match
1   1 174        1       NA 0.1428571       NA  0  0  0       NA
2   1 204        1       NA 0.0000000       NA  0  0  0       NA
The first row compares records 1 and 174. There is a perfect match in the first component of the first name (fname_c1 is 1). Neither record has a second component for the first name (fname_c2 is NA). The first component of the last name holds a floating-point number, which is the output of a string comparison. There is no match in any of the date-of-birth fields (by, bm, and bd are all 0). The final column, is_match, indicates whether the pair is a match. We will get to this last column later. Let's start with the string comparison.
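As a rough illustration of how a score between 0 and 1 can arise from comparing two strings, the sketch below normalizes an edit distance by the length of the longer string. The function lev_sim is a hypothetical helper built on base R's adist. It is an assumption about the general shape of such a measure, not a copy of the package's levenshteinSim:

```r
# Normalized Levenshtein similarity (illustrative helper, not the package code):
# 1 means the strings are identical, values near 0 mean they share very little.
lev_sim <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]        # edit distance between the two strings
  1 - d / max(nchar(a), nchar(b))      # scale by the longer string's length
}

lev_sim("MEIER", "MEYER")  # one substitution across five characters
lev_sim("KOCH", "KOCH")    # identical strings
```

Under this formula, a score of 1/7 (about 0.143) would arise, for example, from six edits on a seven-character name. The helper assumes non-empty strings and does not handle NA.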