front matter

preface

Thank you for choosing Getting Started with Natural Language Processing. I am very excited that you decided to learn about natural language processing (NLP) with the help of this book, and I hope that you’ll enjoy getting started with NLP following this material and the examples.

Natural language processing addresses various types of tasks related to language and processing of information expressed in human language. The field and techniques have been around for quite a long time, and they are well integrated into our everyday lives; in fact, you are probably benefiting from NLP on a daily basis without realizing it. Therefore, I can’t really overemphasize the importance and the impact that this technology has on our lives. The first chapter of this book will give you an overview of the wide scope of NLP applications that you might be using regularly, from internet search engines to spam filters to predictive keyboards (and many more!), and the rest of the book will help you to implement many of these applications from scratch yourself.

In recent years, the field has been gaining more and more interest and attention. There are several reasons for this: on the one hand, thanks to the internet, we now have access to increasingly large amounts of data. On the other hand, thanks to the recent developments in computer hardware and software, we have more powerful technology to process this data. The recent advances in machine learning and deep learning have also contributed to the increasing importance of NLP. These days, large tech companies are realizing the potential of using NLP, and businesses in legal tech, finance, insurance, health care, and many other sectors are investing in it. The reason for that is clear: language is the primary means of communication in all spheres of life, so being able to efficiently process the information expressed in the form of human language is always an advantage. This makes a book on NLP very timely. My goal with this book is to introduce you to a wide variety of topics related to natural language and its processing, and to show how and why these things matter in practical applications, be that your own small project or a company-level project that could benefit from extracting and using information from texts.

I have been working in NLP for over a decade now, and before switching to NLP, I primarily focused on linguistics and theoretical studies of language. Looking back, what motivated and excited me the most about turning to the more technical field of NLP were the incredible new opportunities opened up to me by technology and the ease of working with data and getting the information you need from texts, whether in the context of academic studies about the language itself or in the context of practical applications in any other domain. This book aims to produce the same effect. It is highly practice oriented, and each language-related concept, each technique, and each task is explained with the help of real-life examples.

acknowledgments

Writing a book is a long process that takes a lot of time and effort. I truly enjoyed working on this book, and I sincerely hope that you will enjoy reading it, too. Nevertheless, it would be impossible to enjoy this process, or even to finish the book, were it not for the tremendous support, inspiration, and encouragement provided to me by my family, my partner Ted, and my dear friends Eugene, Alex, and Natalia. Thank you for believing in me!

I am also extremely grateful to the Manning team and all the people at Manning who took time to review my book with such care and who gave me valuable feedback along the way. I’d like to acknowledge my development editor, Dustin Archibald, who was always there for me with his patience and support, especially when I needed those the most. I am also grateful to Michael Lund, my technical development editor, and Al Krinker, my technical proofreader, for carefully checking the content and the code for this book and providing me with valuable feedback. I would also like to extend my gratitude to Kathleen Rossland, my production editor; Carrie Andrews, my copyeditor; and Susan Honeywell and Azra Dedic, members of the graphics editing team, whose valuable help at the final stages of editing of this book improved it tremendously. Thanks as well to the rest of the Manning team who worked on the production and promotion of this book.

I would also like to thank all the reviewers who took the time out of their busy schedules to read my manuscript at various stages of its development. Thanks to their invaluable feedback and advice, this book kept improving from earlier stages until it went into production. I would like to acknowledge Alessandro Buggin, Cage Slagel, Christian Bridge-Harrington, Christian Thoudahl, Douglas Sparling, Elmer C. Peramo, Erik Hansson, Francisco Rivas, Ian D. Miller, James Richard Woodruff, Jason Hales, Jérôme Baton, Jonathan Wood, Joseph Perenia, Kelly Hair, Lewis Van Winkle, Luis Fernando Fontoura de Oliveira, Monica Guimaraes, Najeeb Arif, Patrick Regan, Rees Morrison, Robert Diana, Samantha Berk, Sumit K. Singh, Tanya Wilke, Walter Alexander Mata López, and Werner Nindl.

about this book

The primary goal that I have for this book is to help you appreciate how truly exciting the field of NLP is, how limitless the possibilities of working in this area are, and how low the barrier to entry is now. My goal is to help you get started in this field easily and to show what a wide range of different applications you can implement yourself within a matter of days even if you have never worked in this field before. This book can be used both as a comprehensive cover-to-cover guide through a range of practical applications and as a reference book if you are interested in only some of the practical tasks. By the time you finish reading this book, you will have acquired

  • Knowledge about the essential NLP tasks and the ability to recognize any particular task when you encounter it in a real-life scenario. We will cover such popular tasks as sentiment analysis, text classification, information search, and many more.

  • A whole arsenal of NLP algorithms and techniques, including stemming, lemmatization, part-of-speech tagging, and many more. You will learn how to apply a range of practical approaches to text, such as vectorization, feature extraction, and supervised and unsupervised machine learning, among others.

  • An ability to structure an NLP project and an understanding of which steps need to be involved in a practical project.

  • Comprehensive knowledge of the key NLP, as well as machine-learning, terminology.

  • Comprehensive knowledge of the available resources and tools for NLP.

Who should read this book

I have written this book to be accessible to software developers and beginners in data science and machine learning. If you have done some programming in Python before and are familiar with high school math and algebra (e.g., matrices, vectors, and basic operations involving them), you should be good to go! Most importantly, the book does not assume any prior knowledge of linguistics or NLP, as it will help you learn what you need along the way.

How this book is organized: A road map

The first two chapters of this book introduce you to the field of natural language processing and the variety of NLP applications available. They also show you how to build your own small application with a minimal amount of specialized knowledge and skills in NLP. If you are interested in having a quick start in the field, I would recommend reading these two chapters. Each subsequent chapter looks more closely into a specific NLP application, so if you are interested in any such specific application, you can just focus on a particular chapter. For a comprehensive overview of the field, techniques, and applications, I would suggest reading the book cover to cover:

  • Chapter 1—Introduces the field of NLP with its various tasks and applications. It also briefly overviews the history of the field and shows how NLP applications are used in our everyday lives.

  • Chapter 2—Explains how you can build your own practical NLP application (spam filtering) from scratch, walking you through all the essential steps in the application pipeline. While doing so, it introduces a number of fundamental NLP techniques, including tokenization and text normalization, and shows how to use them in practice via a popular NLP toolkit called NLTK.

  • Chapter 3—Focuses on the task of information retrieval. It introduces several key NLP techniques, such as stemming and stopword removal, and shows how you can implement your own information-retrieval algorithm. It also explains how such an algorithm can be evaluated.

  • Chapter 4—Looks into information extraction and introduces further fundamental techniques, such as part-of-speech tagging, lemmatization, and dependency parsing. Moreover, it shows how to build an information-extraction application using another popular NLP toolkit called spaCy.

  • Chapter 5—Shows how to implement your own author (or user) profiling algorithm, providing you with further examples and practice in NLTK and spaCy. Moreover, it presents the task as a text classification problem and shows how to implement a machine-learning classifier using a popular machine learning library called scikit-learn.

  • Chapter 6—Follows up on the topic of author (user) profiling started in chapter 5. It investigates closely the task of linguistic feature engineering, which is an essential step in any NLP project. It shows how to perform linguistic feature engineering using NLTK and spaCy, and how to evaluate the results of a text classification algorithm.

  • Chapter 7—Starts the topic of sentiment analysis, which is a very popular NLP task. It applies a lexicon-based approach to the task. The sentiment analyzer is built using a linguistic pipeline with spaCy.

  • Chapter 8—Follows up on sentiment analysis, but unlike chapter 7, it takes a data-driven approach to this task. Several machine-learning techniques are applied using scikit-learn, and further linguistic concepts are introduced with the use of spaCy and NLTK language resources.

  • Chapter 9—Overviews the task of topic classification. In contrast to the previous text classification tasks, it is a multiclass classification problem, so the chapter discusses the intricacies of this task and shows how to implement a topic classifier with scikit-learn. In addition, it also takes an unsupervised machine-learning perspective and shows how to approach this task as a clustering problem.

  • Chapter 10—Introduces the task of topic modeling with latent Dirichlet allocation (LDA). In addition, it introduces a popular toolkit called gensim, which is particularly suitable for working with topic modeling algorithms. Motivation for the LDA approach, implementation details, and techniques for the results evaluation are discussed.

  • Chapter 11—Concludes this book with another key NLP task called named-entity recognition (NER). While introducing this task, this chapter also introduces a powerful family of sequence labeling approaches widely used for NLP tasks and shows how NER integrates into further, downstream NLP applications.

About the code

This book contains many examples of source code both in numbered listings and in line with normal text. In both cases, source code is formatted in a fixed-width font like this to separate it from ordinary text. Sometimes code is also in bold to highlight code that has changed from previous steps in the chapter, such as when a new feature adds to an existing line of code.

In many cases, the original source code has been reformatted; we’ve added line breaks and reworked indentation to accommodate the available page space in the book. In rare cases, even this was not enough, and listings include line-continuation markers. Additionally, comments in the source code have often been removed from the listings when the code is described in the text. Code annotations accompany many of the listings, highlighting important concepts.

You can get executable snippets of code from the liveBook (online) version of this book at http://livebook.manning.com/book/getting-started-with-natural-language-processing and from the book’s GitHub page at https://github.com/ekochmar/Essential-NLP. The appendix provides you with installation instructions. Please note that if you use a different version of the tools than specified in the instructions, you may get slightly different results to those discussed in the book: such differences are to be expected as the tools are constantly updated; however, the main points made will still hold.

liveBook discussion forum

Purchase of Getting Started with Natural Language Processing includes free access to liveBook, Manning’s online reading platform. Using liveBook’s exclusive discussion features, you can attach comments to the book globally or to specific sections or paragraphs. It’s a snap to make notes for yourself, ask and answer technical questions, and receive help from the author and other users. To access the forum, go to https://livebook.manning.com/book/getting-started-with-natural-language-processing/discussion. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/discussion.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the author some challenging questions lest their interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

Other online resources

I hope that this book will give you a start in the exciting field of NLP and will motivate you to learn more about NLP techniques and applications. Even though the book covers a range of different applications, being a single resource, it cannot possibly cover all topics. At the same time, you might also find yourself wanting to know more about some of the topics overviewed in the book and dig deeper. Here are other online resources that will help you on this journey:

  • One of the popular NLP toolkits that this book uses a lot is NLTK. If you want to learn more about particular techniques and implementation details, you can always check the documentation at www.nltk.org/. NLTK also comes with a useful book, available at www.nltk.org/book/, which provides further examples with the toolkit.

  • Another popular NLP toolkit that you will be using a lot in the course of working with this book’s material is spaCy (https://spacy.io). SpaCy aims to provide you with industrial-strength NLP functionalities, its models are constantly updated using state-of-the-art techniques and approaches, and the toolkit is used in a wide variety of educational and industrial projects (see an overview at https://spacy.io/universe). Therefore, I recommend keeping an eye on the updates and checking the documentation and tutorials available on spaCy’s website to learn more about its rich functionality.

  • The third NLP library that you will be using is gensim (https://radimrehurek.com/gensim/), which is particularly suitable for topic modeling and semantics-oriented tasks. Just like the previous two toolkits, it comes with extensive documentation and a variety of examples and tutorials. I recommend looking into those if you’d like to learn more about this toolkit.

  • Finally, if you want to learn more about the theoretical side of things and the developments on various NLP tasks, I’d recommend an excellent comprehensive textbook called Speech and Language Processing by Dan Jurafsky and James H. Martin. The book is in its third edition, and a substantial part of it is available at https://web.stanford.edu/~jurafsky/slp3/.

Finally, no book is ever perfect, but if you find this book helpful, I would love to get your feedback. You can share it with me via LinkedIn: www.linkedin.com/in/ekaterina-kochmar-0a655b14/. Updates and corrections will be made available on the book’s GitHub page at https://github.com/ekochmar/Essential-NLP.

about the author

Ekaterina Kochmar is a lecturer (assistant professor) at the Department of Computer Science of the University of Bath, where she is part of the AI research group. Her research lies at the intersection of artificial intelligence, natural language processing, and intelligent tutoring systems. She holds a PhD in natural language processing, an MPhil in advanced computer science from the University of Cambridge, and an MA in computational linguistics from the University of Tuebingen. She is also a cofounder and the chief scientific officer of Korbit AI, focusing on building an AI-powered dialogue-based tutoring system capable of providing learners with high-quality, interactive, and personalized education. Ekaterina has extensive experience in teaching both within and outside of academia.

about the cover illustration

The figure on the cover of Getting Started with Natural Language Processing is "Femme de l'Isle de Santorin," or "Woman of the Island of Santorini," taken from a collection by Jacques Grasset de Saint-Sauveur, published in 1788. Each illustration is finely drawn and colored by hand.

In those days, it was easy to identify where people lived and what their trade or station in life was just by their dress. Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional culture centuries ago, brought back to life by pictures from collections such as this one.
