In a large database, synonymous records pose a serious problem. Two records are considered synonymous when they refer to the same entity. In the absence of a common identifier, such as a primary key or foreign key, joining such records on the entity they describe is a difficult task. Let's illustrate this with a quick example. Consider the following two records:
| Sno | First name | Middle name | Last name | Address | City | State | Zip |
|---|---|---|---|---|---|---|---|
| 1 | John | NULL | NULL | 312 Delray Ave | Deer Field | FL | 33433 |
| 2 | John | NULL | Sanders | 312 Delray Beach Ave | Deer Field | FL | 33433 |
Both records refer to the same entity, one Mr. John. Record linkage is an umbrella term for the algorithms designed to solve exactly this problem. It plays a key role today in applications such as customer relationship management (CRM) and loyalty programs, to name a few. These algorithms are an integral part of today's sophisticated business intelligence and master data management systems.
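To make the problem concrete, here is a minimal sketch that compares the two records field by field. It uses only base R (`adist` from the `utils` package) rather than the RecordLinkage package, and it is not the method developed in this chapter. The column names, the `field_similarity` helper, and the normalized edit-distance score are assumptions introduced purely for illustration.

```r
# The two example records; NULL entries are represented as NA.
# Column names are illustrative assumptions, not a fixed schema.
records <- data.frame(
  sno         = c(1, 2),
  first_name  = c("John", "John"),
  middle_name = c(NA_character_, NA_character_),
  last_name   = c(NA, "Sanders"),
  address     = c("312 Delray Ave", "312 Delray Beach Ave"),
  city        = c("Deer Field", "Deer Field"),
  state       = c("FL", "FL"),
  zip         = c("33433", "33433"),
  stringsAsFactors = FALSE
)

# An exact comparison on the address treats the records as different.
exact_address_match <- records$address[1] == records$address[2]

# Hypothetical helper: edit distance scaled to a similarity in [0, 1].
# Returns NA when either value is missing, since nothing can be compared.
field_similarity <- function(a, b) {
  if (is.na(a) || is.na(b)) return(NA_real_)
  1 - as.numeric(adist(a, b)) / max(nchar(a), nchar(b))
}

fields <- c("first_name", "middle_name", "last_name",
            "address", "city", "state", "zip")

# Compare record 1 with record 2 on every field.
similarity <- sapply(fields, function(f) {
  field_similarity(records[1, f], records[2, f])
})

print(exact_address_match)
print(round(similarity, 2))
```

The exact comparison on the address returns `FALSE`, while the approximate comparison scores the two addresses at 0.7 and every other populated field at 1. The missing name fields yield `NA`, which is exactly the kind of ambiguity a record linkage algorithm has to weigh when deciding whether two records match.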
In this chapter, we will cover the following topics:
- Introducing a use case that can be solved by record linkage algorithms
- Demonstrating the use of the R package, RecordLinkage
- Covering stochastic record linkage algorithms
- Implementing machine learning-based record linkage algorithms
- Building an RShiny application
The code for this chapter was written in RStudio version 0.99.491, using R version 3.3.1. As we work through our example, we will introduce the R package RecordLinkage that we will be using. While describing the code, we will refer to some of the output printed in the console. To avoid disturbing the flow of the code, we have included that output immediately after the statement that prints it.