Record Linkage - Stochastic and Machine Learning Approaches

In a large database, synonymous records pose a significant problem. Two records are considered synonymous when they refer to the same entity. In the absence of a common identifier, such as a primary key or foreign key, joining such records by entity is a difficult task. Let's illustrate this with a quick example. Consider the following two records:

Sno	First name	Middle name	Last name	Address	City	State	Zip
1	John	NULL	NULL	312 Delray Ave	Deer Field	FL	33433
2	John	NULL	Sanders	312 Delray Beach Ave	Deer Field	FL	33433

Both records refer to the same entity, one Mr. John. Record linkage is an umbrella term for algorithms designed to solve exactly this problem. It plays a key role today in applications such as customer relationship management (CRM) and loyalty programs, to name a few, and is an integral part of today's sophisticated business intelligence and master data management systems.
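To make the idea concrete, here is a minimal sketch in base R that compares the two example records field by field. It deliberately does not use the RecordLinkage package introduced later in the chapter; `field_similarity` is a hypothetical helper written for this illustration, and the choice of a normalized Levenshtein similarity (via `adist` from the utils package) is an assumption, not the chapter's method.

```r
# The two example records from the table above, as one-row data frames.
# NULL values in the table are represented as NA.
rec1 <- data.frame(first_name = "John", middle_name = NA_character_,
                   last_name = NA_character_, address = "312 Delray Ave",
                   city = "Deer Field", state = "FL", zip = "33433",
                   stringsAsFactors = FALSE)
rec2 <- data.frame(first_name = "John", middle_name = NA_character_,
                   last_name = "Sanders", address = "312 Delray Beach Ave",
                   city = "Deer Field", state = "FL", zip = "33433",
                   stringsAsFactors = FALSE)

# Hypothetical helper: similarity in [0, 1] derived from the Levenshtein
# edit distance. Returns NA when either value is missing, because a missing
# field gives no evidence for or against a match.
field_similarity <- function(a, b) {
  if (is.na(a) || is.na(b)) return(NA_real_)
  d <- as.numeric(adist(a, b))
  1 - d / max(nchar(a), nchar(b))
}

# Compare the records field by field.
fields <- names(rec1)
scores <- sapply(fields, function(f) field_similarity(rec1[[f]], rec2[[f]]))
print(round(scores, 2))

# A naive overall score: the average over the fields that could be compared.
overall <- mean(scores, na.rm = TRUE)
print(overall)
```

An exact join on name and address would miss this pair, because the last name is missing in one record and the addresses differ. A field-wise similarity score still ranks the pair as a likely match, which is the intuition behind the stochastic and machine learning approaches covered in this chapter.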

Disabled Airplane Pilots, a successful application of record linkage: A database consisting of records of 40,000 airplane pilots licensed by the U.S. Federal Aviation Administration (FAA) and residing in Northern California was matched to a database of individuals receiving disability payments from the Social Security Administration. Forty pilots whose records turned up in both databases were arrested (https://www.soa.org/library/newsletters/the-actuary-magazine/2007/february/link2007feb.aspx).

In this chapter, we will cover the following topics:

Introducing a use case that can be solved by record linkage algorithms
Demonstrating the use of the R package, RecordLinkage
Covering stochastic record linkage algorithms
Implementing machine learning-based record linkage algorithms
Building an RShiny application

The code for this chapter was written in RStudio version 0.99.491 using R version 3.3.1. As we work through our example, we will introduce the R package RecordLinkage. While describing the code, we will refer to some of the output printed in the console. That output is included immediately after the statement that prints it, so as not to disturb the flow of the code.
