Summary

This chapter introduced the problem of record linkage and explained why it matters. We used the RecordLinkage package in R to solve record linkage problems. We started by generating string-based and phonetic features for record pairs so that later stages of the pipeline could deduplicate the records. We then covered expectation maximization and weights-based methods for deduplicating the record pairs. Finally, we wrapped up the chapter with machine learning methods for deduplication: K-means clustering as the unsupervised method, and a supervised model trained on the output of the K-means algorithm.

In the next chapter, we turn to streaming data and its challenges, and build a stream clustering algorithm for a given data stream.
