Phonetic features

The RecordLinkage package includes the soundex and pho_h phonetic algorithms for comparing string columns. In our example, we want to compare columns 2, 3, and 4 phonetically: we pass their indices as a vector to the phonetic parameter and pass the pho_h function to the phonfun parameter.

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. (Source: https://en.wikipedia.org/wiki/Soundex)
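To make the encoding idea concrete, here is a minimal base R sketch of the classic American Soundex rules. It is an illustration only: the function name soundex_sketch is hypothetical, and this is not the implementation shipped with RecordLinkage, whose soundex and pho_h functions may differ in detail.

```r
# Hypothetical helper: a simplified American Soundex for a single name
soundex_sketch <- function(name) {
  # Digit assigned to each coded consonant; vowels, H, W and Y have no digit
  codes <- c(B = "1", F = "1", P = "1", V = "1",
             C = "2", G = "2", J = "2", K = "2",
             Q = "2", S = "2", X = "2", Z = "2",
             D = "3", T = "3", L = "4",
             M = "5", N = "5", R = "6")

  # Upper-case the input and keep only the letters A-Z
  chars <- strsplit(gsub("[^A-Z]", "", toupper(name)), "")[[1]]
  if (length(chars) == 0) return(NA_character_)

  out <- chars[1]                    # the first letter is kept as-is
  prev <- unname(codes[chars[1]])    # digit of the first letter, if any
  if (is.na(prev)) prev <- ""

  for (ch in chars[-1]) {
    if (ch %in% c("H", "W")) next    # H and W do not separate equal digits
    digit <- unname(codes[ch])
    if (is.na(digit)) {              # vowel or Y: the next equal digit counts again
      prev <- ""
      next
    }
    if (digit != prev) out <- c(out, digit)
    prev <- digit
  }

  # Pad with zeros and truncate to the usual four characters
  substr(paste(c(out, "0", "0", "0"), collapse = ""), 1, 4)
}

soundex_sketch("Robert")   # "R163"
soundex_sketch("Rupert")   # "R163"
```

Both names collapse to the same code, which is what allows a phonetic comparison to treat them as equal despite the different spelling.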

Let us generate some phonetic-based features:

> rec.pairs.matches <- compare.dedup(RLdata500,
+                                    blockfld = list(1, 5:7),
+                                    phonetic = c(2, 3, 4),
+                                    phonfun = pho_h)
> head(rec.pairs.matches$pairs)
  id1 id2 fname_c1 fname_c2 lname_c1 lname_c2 by bm bd is_match
1   1 174        1       NA        0       NA  0  0  0       NA
2   1 204        1       NA        0       NA  0  0  0       NA
3   2   7        1       NA        0       NA  0  0  0       NA
4   2  43        1       NA        1       NA  1  1  1       NA
5   2 169        1       NA        0       NA  0  0  0       NA
6   4  19        1       NA        0       NA  0  0  0       NA

If we compare these results with the string matching output, we see that we find no match between records 1 and 174. Now let's look at row 4, where record 2 (id1) is compared with record 43 (id2):

> RLdata500[2,]
  fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
2     GERD     <NA>    BAUER     <NA> 1968  7 27
> RLdata500[43,]
   fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
43     GERD     <NA>   BAUERH     <NA> 1968  7 27

The last names in these two records sound similar, so the algorithm treats the records as similar.
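One way to check this directly is to apply the phonetic function to the two last names and compare the resulting codes. This is a minimal sketch, assuming the RecordLinkage package is installed and that pho_h accepts a character vector; the exact codes it prints are not shown here.

```r
library(RecordLinkage)

# Phonetic codes for the last names of records 2 and 43
codes <- pho_h(c("BAUER", "BAUERH"))
print(codes)

# TRUE when both spellings map to the same phonetic code
codes[1] == codes[2]
```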

The string and phonetic comparisons cannot be used simultaneously for the same column.
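The two kinds of comparison can still be combined in one call as long as each column is assigned to only one of them. The following is a sketch under assumptions: the column split (string comparison on column 2, phonetic comparison on columns 3 and 4) and the name rec.pairs.mixed are chosen purely for illustration, and jarowinkler is simply one of the string comparators the package provides.

```r
library(RecordLinkage)
data(RLdata500)

# String comparison (Jaro-Winkler) on column 2,
# phonetic comparison (pho_h) on columns 3 and 4.
# No column index appears in both strcmp and phonetic.
rec.pairs.mixed <- compare.dedup(RLdata500,
                                 blockfld  = list(1, 5:7),
                                 strcmp    = 2,
                                 strcmpfun = jarowinkler,
                                 phonetic  = c(3, 4),
                                 phonfun   = pho_h)

head(rec.pairs.mixed$pairs)
```

Keeping the strcmp and phonetic index sets disjoint avoids asking for two different comparison results in the same output column.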