Demonstrating the use of the RecordLinkage package

We will leverage the RecordLinkage package in R. The data shown in the previous section is available with the package:

RecordLinkage: Record linkage in R provides functions to link and deduplicate datasets. Methods based on a stochastic approach are implemented, as well as classification algorithms from the machine learning domain. Authors: Andreas Borg and Murat Sariyar.
> library(RecordLinkage, quietly = TRUE)
> data(RLdata500)
> str(RLdata500)
'data.frame': 500 obs. of 7 variables:
$ fname_c1: Factor w/ 146 levels "ALEXANDER","ANDRE",..: 19 42 114 128 112 77 42 139 26 99 ...
$ fname_c2: Factor w/ 23 levels "ALEXANDER","ANDREAS",..: NA NA NA NA NA NA NA NA NA NA ...
$ lname_c1: Factor w/ 108 levels "ALBRECHT","BAUER",..: 61 2 31 106 50 23 76 61 77 30 ...
$ lname_c2: Factor w/ 8 levels "ENGEL","FISCHER",..: NA NA NA NA NA NA NA NA NA NA ...
$ by : int 1949 1968 1930 1957 1966 1929 1967 1942 1978 1971 ...
$ bm : int 7 7 4 9 1 7 8 9 3 2 ...
$ bd : int 22 27 30 2 13 4 1 20 4 27 ...
> head(RLdata500)
fname_c1 fname_c2 lname_c1 lname_c2 by bm bd
1 CARSTEN <NA> MEIER <NA> 1949 7 22
2 GERD <NA> BAUER <NA> 1968 7 27
3 ROBERT <NA> HARTMANN <NA> 1930 4 30
4 STEFAN <NA> WOLFF <NA> 1957 9 2
5 RALF <NA> KRUEGER <NA> 1966 1 13
6 JUERGEN <NA> FRANKE <NA> 1929 7 4

Our data, RLdata500, has 500 records and 7 variables, which include first name, last name, and date of birth. The first and last names are each split into two components, denoted by the suffixes _c1 and _c2. The date of birth is split into year (by), month (bm), and day (bd). Let's look at the steps that we are going to follow to implement record linkage:

Feature generation is the first step in record linkage. Once we have the desired features, we can solve the record linkage problem either using a stochastic approach or by fitting a machine learning model to the generated features.
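
As a minimal sketch of what feature generation can look like, the snippet below uses the package's compare.dedup function to compare every pair of records in RLdata500 field by field. It assumes exact string comparison and no blocking, which are illustrative choices rather than necessarily the settings used later in this chapter. The identity.RLdata500 vector, loaded together with the data, is passed only so that the true match status is recorded for later evaluation.

```r
library(RecordLinkage)
data(RLdata500)  # also loads identity.RLdata500, the true record ids

# Compare all record pairs field by field using exact matching.
# Assumption for this sketch: no blocking and no string similarity.
rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500)

# Each row describes one record pair: the two record ids (id1, id2),
# one agreement feature per field, and the known match status (is_match).
head(rpairs$pairs)

# Without blocking, every one of the choose(500, 2) pairs is generated.
nrow(rpairs$pairs)
```

The resulting comparison patterns are the features referred to above. They could then be handed to one of the package's weighting functions, such as epiWeights or emWeights, for the stochastic approach, or used as training data for a classifier.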
