Searching and DeDuplicating Using CNNs

Deep neural networks have been proven to work extremely well when provided with large amounts of data. However, a dearth of training data has been a primary obstacle to building large search engines. Traditional approaches to searching text data relied on domain understanding and keyword mapping. These provided the search engine with a knowledge graph containing sufficient information about the topics, so that the engine could find answers. The search engine could then expand to new topics by exploiting the relationships between them.

In this chapter, we will explore how to build a search engine using machine learning (ML), and we will use it for tasks such as matching and deduplication. To understand how deep learning can improve our searches, we will first build baselines using traditional methods such as TF-IDF and latent semantic indexing. We will then develop a Convolutional Neural Network (CNN) that learns to identify duplicate texts, and we will compare the traditional approaches with the CNN to understand the pros and cons of each.
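To make the baseline concrete, here is a minimal sketch of the TF-IDF and latent semantic indexing pipeline using scikit-learn (an assumed dependency; the corpus, component count, and query are illustrative, not from the chapter). TF-IDF weights terms by how distinctive they are across documents, and LSI then projects the sparse TF-IDF space onto a low-rank dense space where documents can be ranked against a query by cosine similarity:

```python
# Baseline search sketch: TF-IDF vectors reduced with truncated SVD,
# which is the core of latent semantic indexing (LSI).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# A toy corpus for illustration only.
corpus = [
    "deep neural networks learn from large data",
    "convolutional networks for text matching",
    "keyword search with a knowledge graph",
]

# TF-IDF: weight each term by frequency, discounted by document frequency.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# LSI: project the sparse TF-IDF matrix onto a low-rank dense space.
lsi = TruncatedSVD(n_components=2, random_state=0)
X_lsi = lsi.fit_transform(X)

# Rank documents against a query by cosine similarity in the LSI space.
query = lsi.transform(tfidf.transform(["neural networks for matching text"]))
scores = cosine_similarity(query, X_lsi)[0]
best = int(scores.argmax())  # index of the highest-scoring document
```

The number of SVD components controls the trade-off between compression and fidelity; on a real corpus it would typically be in the hundreds rather than 2.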

We will approach searching as the problem of understanding the semantics of sentences: finding text in a large corpus requires an appropriate representation of that text. We will then tackle deduplication as an extension of that classifier.
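The framing of deduplication as a pairwise decision can be sketched with a deliberately simple stand-in for the learned classifier: score every pair of texts by cosine similarity over bag-of-words counts, and flag pairs above a threshold as duplicates. This is a hypothetical illustration using only the standard library; the 0.8 threshold and the helper names are illustrative choices, not from the chapter:

```python
# Deduplication sketch: pairwise cosine similarity over word counts,
# with a threshold acting as a stand-in for a learned classifier.
import math
from collections import Counter
from itertools import combinations

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def find_duplicates(texts, threshold=0.8):
    """Return index pairs whose similarity meets the threshold."""
    vectors = [Counter(t.lower().split()) for t in texts]
    return [
        (i, j)
        for (i, vi), (j, vj) in combinations(enumerate(vectors), 2)
        if cosine(vi, vj) >= threshold
    ]

docs = [
    "how to reset my password",
    "how to reset my password ?",
    "best pizza places nearby",
]
pairs = find_duplicates(docs)  # → [(0, 1)]
```

A CNN replaces the hand-picked threshold and count vectors with a learned representation and decision boundary, which is what lets it catch duplicates that share meaning but few surface words.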
