Record Linkage - Stochastic and Machine Learning Approaches

In a large database, synonymous records pose a significant problem. Two records are considered synonymous if they refer to the same entity. In the absence of a common identifier, such as a primary key or foreign key, joining such records on the entity they describe is a difficult task. Let's illustrate this with a quick example. Consider the following two records:

Sno  First name  Middle name  Last name  Address               City        State  Zip
1    John        NULL         NULL       312 Delray Ave        Deer Field  FL     33433
2    John        NULL         Sanders    312 Delray Beach Ave  Deer Field  FL     33433

 

Both records refer to the same entity, one Mr. John. Record linkage is an umbrella term for algorithms designed to solve exactly this problem. It plays a key role today in applications such as customer relationship management (CRM) and loyalty programs, to name a few, and it is an integral part of today's sophisticated business intelligence and master data management systems.
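To make the idea concrete, here is a minimal sketch in base R that compares the two records above field by field. It does not use the RecordLinkage package introduced later in the chapter. The helper `field_similarity` and the simple averaging of field scores are assumptions made purely for illustration, with `NA` standing in for the NULL values.

```r
# The two example records; NA stands in for NULL.
records <- data.frame(
  first_name = c("John", "John"),
  last_name  = c(NA, "Sanders"),
  address    = c("312 Delray Ave", "312 Delray Beach Ave"),
  city       = c("Deer Field", "Deer Field"),
  state      = c("FL", "FL"),
  zip        = c("33433", "33433"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: Levenshtein distance (utils::adist) rescaled to a
# similarity between 0 and 1. Returns NA when either value is missing,
# so a missing field is skipped rather than counted as a mismatch.
field_similarity <- function(a, b) {
  if (is.na(a) || is.na(b)) return(NA_real_)
  1 - as.numeric(utils::adist(a, b)) / max(nchar(a), nchar(b))
}

# Compare record 1 with record 2, one field at a time.
sims <- vapply(
  names(records),
  function(f) field_similarity(records[1, f], records[2, f]),
  numeric(1)
)

# Illustrative overall score: the mean similarity over the comparable fields.
match_score <- mean(sims, na.rm = TRUE)

print(round(sims, 2))
print(round(match_score, 2))
```

A high score across the comparable fields suggests that the two rows describe the same person even though no shared key exists. Where to draw the line between a match and a non-match is the question that the stochastic and machine learning approaches in this chapter address.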

Disabled airplane pilots, a successful application of record linkage: A database of 40,000 airplane pilots licensed by the U.S. Federal Aviation Administration (FAA) and residing in Northern California was matched against a database of individuals receiving disability payments from the Social Security Administration. Forty pilots whose records turned up in both databases were arrested (https://www.soa.org/library/newsletters/the-actuary-magazine/2007/february/link2007feb.aspx).

In this chapter, we will cover the following topics:

  • Introducing a use case that can be solved by record linkage algorithms
  • Demonstrating the use of the R package, RecordLinkage
  • Covering stochastic record linkage algorithms
  • Implementing machine learning-based record linkage algorithms
  • Building an RShiny application

The code for this chapter was written in RStudio Version 0.99.491, using R version 3.3.1. As we work through our example, we will introduce the R package RecordLinkage. While describing the code, we will refer to some of the output printed in the console. To avoid disturbing the flow of the code, that output is shown immediately after the statement that prints it.
