Feature generation

We will use compare.dedup to generate the features:

> rec.pairs <- compare.dedup(RLdata500
+ ,blockfld = list(1, 5:7)
+ ,strcmp = c(2,3,4)
+ ,strcmpfun = levenshteinSim)
> summary(rec.pairs)

Deduplication Data Set

500 records
1221 record pairs

0 matches
0 non-matches
1221 pairs with unknown status

> matches <- rec.pairs$pairs
> matches[c(1:3, 1203:1204), ]
     id1 id2 fname_c1 fname_c2  lname_c1 lname_c2 by bm bd is_match
1      1 174        1       NA 0.1428571       NA  0  0  0       NA
2      1 204        1       NA 0.0000000       NA  0  0  0       NA
3      2   7        1       NA 0.3750000       NA  0  0  0       NA
1203 448 497        1       NA 0.0000000       NA  0  0  0       NA
1204 450 477        1       NA 0.0000000       NA  0  0  0       NA

We have 500 records, so comparing every record with every other record would generate 500*(500-1)/2 = 124,750 pairs; in general, n records yield n(n-1)/2 pairs. For a large dataset, this quickly becomes computationally expensive. The blockfld parameter alleviates this problem: it reduces the number of pairs by generating only those that satisfy certain constraints:
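The pair count above can be checked with a few lines of base R. This is only a sketch of the arithmetic; the variable names are chosen for illustration, and the figure of 1,221 is taken from the summary output shown earlier.

```r
# Number of records in RLdata500, as reported by summary(rec.pairs)
n <- 500

# Every unordered pair of distinct records: n * (n - 1) / 2
n_pairs <- n * (n - 1) / 2
n_pairs                      # 124750

# choose() gives the same count
choose(n, 2)

# Fraction of all possible pairs that survive blocking (1221 from the summary)
kept_fraction <- 1221 / n_pairs
kept_fraction                # just under 1 percent
```

Blocking therefore leaves fewer than one in a hundred of the possible comparisons.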

blockfld = list(1, 5:7)

Our constraints are represented in a list, and a pair is generated if the two records satisfy any one of them. Here, two records qualify as a pair if they agree either on the first column (fname_c1) or on all of columns 5 to 7 (by, bm, and bd). You can see in the results that we are left with only 1,221 pairs:
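To make the blocking rule concrete, here is a minimal base R sketch of the same "agree on the first name, or agree on all three date fields" logic. It does not use compare.dedup; the toy data frame, the agrees_on helper, and all values are hypothetical and exist only to illustrate the idea.

```r
# Toy data (hypothetical, not RLdata500): first name plus birth year, month, day
toy <- data.frame(
  fname = c("ANNA", "ANNA", "PETER", "PETRA"),
  by    = c(1970, 1982, 1965, 1965),
  bm    = c(5, 11, 3, 3),
  bd    = c(12, 30, 7, 7),
  stringsAsFactors = FALSE
)

# All candidate pairs with id1 < id2: n * (n - 1) / 2 rows
all_pairs <- t(combn(nrow(toy), 2))
colnames(all_pairs) <- c("id1", "id2")

# Hypothetical helper: TRUE for each pair whose two records agree on
# every column named in cols
agrees_on <- function(cols) {
  Reduce(`&`, lapply(cols, function(col) {
    toy[[col]][all_pairs[, "id1"]] == toy[[col]][all_pairs[, "id2"]]
  }))
}

# Keep a pair if it satisfies the first rule OR the second rule
keep <- agrees_on("fname") | agrees_on(c("by", "bm", "bd"))
blocked_pairs <- all_pairs[keep, , drop = FALSE]
blocked_pairs
```

Of the six possible pairs, only two survive: records 1 and 2 share a first name, and records 3 and 4 share a full date of birth. Note that this sketch still enumerates every pair before filtering, so it shows the rule rather than the performance benefit.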

Deduplication Data Set

500 records
1221 record pairs

0 matches
0 non-matches
1221 pairs with unknown status

The compare.dedup function returns a list of class RecLinkData with four entries:

> str(rec.pairs)
List of 4
$ data :'data.frame': 500 obs. of 7 variables:
..$ fname_c1: Factor w/ 146 levels "ALEXANDER","ANDRE",..: 19 42 114 128 112 77 42 139 26 99 ...
..$ fname_c2: Factor w/ 23 levels "ALEXANDER","ANDREAS",..: NA NA NA NA NA NA NA NA NA NA ...
..$ lname_c1: Factor w/ 108 levels "ALBRECHT","BAUER",..: 61 2 31 106 50 23 76 61 77 30 ...
..$ lname_c2: Factor w/ 8 levels "ENGEL","FISCHER",..: NA NA NA NA NA NA NA NA NA NA ...
..$ by : int [1:500] 1949 1968 1930 1957 1966 1929 1967 1942 1978 1971 ...
..$ bm : int [1:500] 7 7 4 9 1 7 8 9 3 2 ...
..$ bd : int [1:500] 22 27 30 2 13 4 1 20 4 27 ...
$ pairs :'data.frame': 1221 obs. of 10 variables:
..$ id1 : num [1:1221] 1 1 2 2 2 4 4 4 4 4 ...
..$ id2 : num [1:1221] 174 204 7 43 169 19 50 78 83 133 ...
..$ fname_c1: num [1:1221] 1 1 1 1 1 1 1 1 1 1 ...
..$ fname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
..$ lname_c1: num [1:1221] 0.143 0 0.375 0.833 0 ...
..$ lname_c2: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
..$ by : num [1:1221] 0 0 0 1 0 0 1 0 0 0 ...
..$ bm : num [1:1221] 0 0 0 1 0 0 0 0 1 0 ...
..$ bd : num [1:1221] 0 0 0 1 0 0 0 0 0 0 ...
..$ is_match: num [1:1221] NA NA NA NA NA NA NA NA NA NA ...
$ frequencies: Named num [1:7] 0.00685 0.04167 0.00926 0.11111 0.01163 ...
..- attr(*, "names")= chr [1:7] "fname_c1" "fname_c2" "lname_c1" "lname_c2" ...
$ type : chr "deduplication"
- attr(*, "class")= chr "RecLinkData"

The pairs entry of the list is a data frame that holds all the generated features. We capture this data frame under the name matches:

matches <- rec.pairs$pairs

Let's look at the first two rows of this data frame. Each row compares two records:

  id1 id2 fname_c1 fname_c2  lname_c1 lname_c2 by bm bd is_match
1   1 174        1       NA 0.1428571       NA  0  0  0       NA
2   1 204        1       NA 0.0000000       NA  0  0  0       NA

The first row compares records 1 and 174. The first component of the first name matches exactly (fname_c1 is 1). Neither record has a second component for the first name, so fname_c2 is NA. The first component of the last name holds a floating-point number, 0.1428571, which is the output of a string comparison. None of the date of birth fields (by, bm, and bd) match. The final column, is_match, indicates whether the pair is a match. We will get to the last column later. Let's start with the string comparison.
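As a rough illustration of where a value such as 0.1428571 can come from, the sketch below computes a normalized Levenshtein similarity in base R. It assumes that levenshteinSim is defined as one minus the edit distance divided by the length of the longer string; lev_sim is a hypothetical stand-in written for this example, not the package function, and the names compared are made up.

```r
# Hypothetical stand-in for a Levenshtein similarity:
# 1 - (edit distance) / (length of the longer string).
# Assumes at least one of the two strings is non-empty.
lev_sim <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]   # Levenshtein edit distance
  1 - d / max(nchar(a), nchar(b))
}

# One substitution across five characters
sim_close <- lev_sim("MEIER", "MEYER")
sim_close                          # 0.8

# Identical strings score 1
sim_same <- lev_sim("BAUER", "BAUER")
sim_same                           # 1
```

Under this assumption, a score of 0.1428571 (that is, 1/7) would correspond to two last names whose longer member has seven characters and that are six edits apart.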
