String features

Let's look at the dedup function invocation once again:

> rec.pairs <- compare.dedup(RLdata500
+ ,blockfld = list(1, 5:7)
+ ,strcmp = c(2,3,4)
+ ,strcmpfun = levenshteinSim)

The strcmp and strcmpfun parameters specify which fields are compared as strings and which string comparison function is applied to them. We pass strcmp a vector of column indices, here columns 2, 3, and 4. We pass levenshteinSim as strcmpfun, so the similarity between two strings is derived from their Levenshtein distance.

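Levenshtein distance (LD) is a measure of how different two strings are, which we will refer to as the source string (s) and the target string (t). The distance is the minimum number of single-character deletions, insertions, or substitutions required to transform s into t (https://en.wikipedia.org/wiki/Levenshtein_distance). The levenshteinSim function rescales this distance into a similarity between 0 and 1, where 1 means the two strings are identical.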
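To make the definition concrete, here is a minimal sketch of the classic dynamic-programming computation in base R. The helper name lev_dist is hypothetical and is not part of the RecordLinkage package; it only illustrates how deletions, insertions, and substitutions are counted.

```r
# Hypothetical helper (not from RecordLinkage): Levenshtein distance
# between two single strings, computed by dynamic programming.
lev_dist <- function(s, t) {
  s_chars <- strsplit(s, "")[[1]]
  t_chars <- strsplit(t, "")[[1]]
  n <- length(s_chars)
  m <- length(t_chars)
  # d[i + 1, j + 1] is the distance between the first i characters of s
  # and the first j characters of t.
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n  # turn a prefix of s into "" by deleting every character
  d[1, ] <- 0:m  # turn "" into a prefix of t by inserting every character
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (s_chars[i] == t_chars[j]) 0L else 1L
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # deletion
                             d[i + 1, j] + 1L,  # insertion
                             d[i, j] + cost)    # substitution or match
    }
  }
  d[n + 1, m + 1]
}

print(lev_dist("kitten", "sitting"))
print(lev_dist("MEIER", "SCHMITT"))
```

For everyday use, base R's adist function computes the same distance without a hand-written loop.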

Let's look at records 1 and 174. The first names match, but none of the other fields do. The value of 0.142857 for the last names is a Levenshtein similarity rather than a distance, and being close to 0 it shows how far apart the two last names are:

> RLdata500[1,]
  fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
1  CARSTEN     <NA>    MEIER     <NA> 1949  7 22
> RLdata500[174,]
    fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
174  CARSTEN     <NA>  SCHMITT     <NA> 2001  6 27
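The 0.142857 figure for the two last names can be reproduced by hand. The sketch below uses base R's adist for the edit distance and assumes that the similarity is one minus the distance divided by the length of the longer string; that normalization is an assumption here, although it agrees with the value shown.

```r
last1 <- "MEIER"
last2 <- "SCHMITT"

# adist() from base R (utils) returns the Levenshtein distance as a 1 x 1 matrix.
edits <- drop(adist(last1, last2))

# Assumed normalization: one minus the distance over the longer string length.
similarity <- 1 - edits / max(nchar(last1), nchar(last2))

print(edits)                 # 6 edits
print(round(similarity, 7))  # 0.1428571
```

Six edits against a longest length of seven characters leaves a similarity of 1/7, which is why the two names count as almost entirely different.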

We need to state the exact columns to which string comparison should be applied. Setting strcmp = TRUE instead of listing the columns may lead to unexpected results:

> rec.pairs.matches <- compare.dedup(RLdata500
+ ,blockfld = list(1, 5:7)
+ ,strcmp = TRUE
+ ,strcmpfun = levenshteinSim)
> head(rec.pairs.matches$pairs)
  id1 id2 fname_c1 fname_c2  lname_c1 lname_c2   by  bm  bd is_match
1   1 174        1       NA 0.1428571       NA 0.00 0.5 0.5       NA
2   1 204        1       NA 0.0000000       NA 0.50 0.5 0.0       NA
3   2   7        1       NA 0.3750000       NA 0.75 0.5 0.0       NA
4   2  43        1       NA 0.8333333       NA 1.00 1.0 1.0       NA
5   2 169        1       NA 0.0000000       NA 0.50 0.0 0.5       NA
6   4  19        1       NA 0.1428571       NA 0.50 0.5 0.0       NA

You can see that the function has also calculated string similarities for the date of birth fields (by, bm, and bd), which now hold fractional values instead of a plain 0 or 1!
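To see why string comparison on numeric fields is misleading, consider the birth day and birth year of records 1 and 174. The sketch below is an illustration only: lev_sim is a hypothetical helper built on base R's adist, and its normalization (one minus the distance over the longer length) is an assumption about how the similarity is scaled.

```r
# Hypothetical helper: Levenshtein-based similarity of two values,
# compared as character strings.
lev_sim <- function(a, b) {
  a <- as.character(a)
  b <- as.character(b)
  1 - drop(adist(a, b)) / max(nchar(a), nchar(b))
}

# Birth day: 22 (record 1) versus 27 (record 174).
exact_bd  <- as.numeric(22 == 27)  # exact comparison: the days differ
string_bd <- lev_sim(22, 27)       # string comparison: one of two digits differs

# Birth year: 1949 versus 2001, where every digit differs.
string_by <- lev_sim(1949, 2001)

print(c(exact_bd = exact_bd, string_bd = string_bd, string_by = string_by))
```

Under this assumption, the days 22 and 27 come out as half similar simply because they share a digit, which says nothing meaningful about how close the two dates are.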

We excluded the first column from string comparison, since it was used as a blocking field. However, string comparison can also be applied to the first column. In that case, the string comparison is performed before the blocking.