Data

The dataset most commonly used to test and benchmark NER systems is the CoNLL2003 dataset, which was released for a shared task on language-independent NER. The dataset contains a training file, a development file, and a test file, along with a large file of unannotated data. The development file is used for tuning the parameters of the learning method. The model is then trained on the training data with the tuned parameters and evaluated on the test dataset.

The development data is kept separate from the test data so that systems are not tuned on the test data. The English data is taken from Reuters Corpus news stories published between August 1996 and August 1997. A sample sentence from the CoNLL dataset, with its accompanying annotations, is shown as follows:

Only RB B-NP O
France NNP I-NP B-LOC
and CC I-NP O
Britain NNP I-NP B-LOC
backed VBD B-VP O
Fischler NNP B-NP B-PER
's POS B-NP O
proposal NN I-NP O
. . O O

Each line of the CoNLL data contains four fields: the word, its part-of-speech (POS) tag, its chunk tag, and its named entity tag. The tag O is given to words that are outside of any named entity.
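The four-field layout can be read with a few lines of Python. The following is a minimal sketch, not the loader used later in the chapter: the names Token and parse_conll are hypothetical, and it assumes that fields are separated by whitespace, that sentences are separated by blank lines, and that lines starting with -DOCSTART- mark document boundaries.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One line of CoNLL data (hypothetical helper type)."""
    word: str
    pos: str
    chunk: str
    ner: str


def parse_conll(text: str) -> List[List[Token]]:
    """Split CoNLL-style text into sentences, each a list of Token tuples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and -DOCSTART- lines end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: exactly four whitespace-separated fields per line.
        word, pos, chunk, ner = line.split()
        current.append(Token(word, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences


SAMPLE = """\
Only RB B-NP O
France NNP I-NP B-LOC
and CC I-NP O
Britain NNP I-NP B-LOC
backed VBD B-VP O
Fischler NNP B-NP B-PER
's POS B-NP O
proposal NN I-NP O
. . O O
"""

sentences = parse_conll(SAMPLE)
# Keep only the words that carry a named entity tag other than O.
entity_words = [(t.word, t.ner) for t in sentences[0] if t.ner != "O"]
print(entity_words)
```

Returning one list per sentence keeps the word and tag sequences aligned, which is the shape that sequence-labeling models typically expect.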

To handle entities that span more than one token (for example, New York), a tagging scheme with B- and I- prefixes is used: B-<entity> marks the token that begins an entity, and I-<entity> marks a token that continues it. As a result, when two entities of the same <entity> type are next to one another, the first word of the second entity is tagged as B-<entity>, to show that it starts another entity. The entity types provided by the CoNLL2003 task are LOC, PER, ORG, and MISC, which stand for locations, persons, organizations, and miscellaneous, respectively.
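To see how the prefixes separate neighboring entities, the tags can be decoded back into entity spans. This is a minimal sketch under stated assumptions: extract_entities is a hypothetical helper, and the example words and tags are made up for illustration rather than taken from the CoNLL data.

```python
from typing import List, Optional, Tuple


def extract_entities(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tagged tokens into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_type: Optional[str] = None
    for word, tag in zip(words, tags):
        # "B-LOC" -> ("B", "LOC"); "O" -> ("O", "")
        prefix, _, etype = tag.partition("-")
        # An I- tag of the same type extends the entity that is currently open.
        if prefix == "I" and etype == current_type:
            current_words.append(word)
            continue
        # Anything else closes the open entity, if there is one.
        if current_words:
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        # A B- tag (or an I- tag with no matching open entity) starts a new one.
        if prefix in ("B", "I"):
            current_words, current_type = [word], etype
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Illustrative input: two LOC entities sit next to one another.
words = ["New", "York", "Boston", "welcomed", "Fischler"]
tags = ["B-LOC", "I-LOC", "B-LOC", "O", "B-PER"]
spans = extract_entities(words, tags)
print(spans)  # [('New York', 'LOC'), ('Boston', 'LOC'), ('Fischler', 'PER')]
```

Without the B- tag on Boston, the three LOC tokens would be indistinguishable from a single three-word location.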

Another dataset that is commonly used for building NER systems is the Groningen Meaning Bank (GMB), which provides many more annotations that are useful for building an NER system. In this chapter, however, we will be using the CoNLL2003 data for our experiments and evaluation.

A couple of widely used open source frameworks for off-the-shelf NER are NLTK and Explosion AI's spaCy. While both of these frameworks provide excellent off-the-shelf performance for various tasks, we are interested in developing a flexible, state-of-the-art deep learning model for NER.
