Multi-language NLP

spaCy includes trained language models for English, German, Spanish, Portuguese, French, Italian, and Dutch, as well as a multi-language model for NER. Cross-language usage is straightforward since the API does not change.

We will illustrate the Spanish language model using a parallel corpus of TED Talk subtitles (see the GitHub repo for data source references). For this purpose, we instantiate both language models:

import spacy

# The 'en'/'es' shorthand works with spaCy 2.x; spaCy 3 requires the full
# pipeline names, e.g. 'en_core_web_sm' and 'es_core_news_sm'.
model = {}
for language in ['en', 'es']:
    model[language] = spacy.load(language)

We then read small corresponding text samples for each language:

from pathlib import Path

text = {}
path = Path('../data/TED')
for language in ['en', 'es']:
    file_name = path / 'TED2013_sample.{}'.format(language)
    text[language] = file_name.read_text()

Sentence boundary detection uses the same logic but finds a different breakdown:

parsed, sentences = {}, {}
for language in ['en', 'es']:
    parsed[language] = model[language](text[language])
    sentences[language] = list(parsed[language].sents)
    print('Sentences:', language, len(sentences[language]))

Sentences: en 19
Sentences: es 22
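The same `.sents` interface can be sketched without downloading the trained models by attaching a rule-based sentencizer to blank pipelines (spaCy 3 syntax); the sample sentences below are made up for illustration:

```python
# Minimal sketch: blank pipelines plus a punctuation-based sentencizer
# stand in for the trained models used in the text.
import spacy

sample = {'en': 'This is one sentence. Here is another.',
          'es': 'Esta es una frase. Aqui hay otra.'}

sentences = {}
for language in ['en', 'es']:
    nlp = spacy.blank(language)          # empty pipeline for that language
    nlp.add_pipe('sentencizer')          # rule-based boundary detection
    sentences[language] = list(nlp(sample[language]).sents)
    print('Sentences:', language, len(sentences[language]))
```

The rule-based component only splits on sentence-final punctuation, so counts can differ from the statistical models used above.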

POS tagging also works in the same way:

import pandas as pd

pos = {}
for language in ['en', 'es']:
    pos[language] = pd.DataFrame([[t.text, t.pos_, spacy.explain(t.pos_)]
                                  for t in sentences[language][0]],
                                 columns=['Token', 'POS Tag', 'Meaning'])
pd.concat([pos['en'], pos['es']], axis=1).head()

The result shows the token annotations for the English and Spanish documents side by side:

Token   POS Tag   Meaning                    Token          POS Tag   Meaning
There   ADV       adverb                     Existe         VERB      verb
's      VERB      verb                       una            DET       determiner
a       DET       determiner                 estrecha       ADJ       adjective
tight   ADJ       adjective                  y              CONJ      conjunction
and     CCONJ     coordinating conjunction   sorprendente   ADJ       adjective
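The side-by-side layout is produced by `pd.concat` with `axis=1`, which aligns the two DataFrames on their shared row index. A minimal sketch with the first two rows of each annotation frame written out by hand:

```python
# Sketch of the pd.concat(axis=1) alignment used above; the rows are
# copied from the table for illustration.
import pandas as pd

cols = ['Token', 'POS Tag', 'Meaning']
en = pd.DataFrame([['There', 'ADV', 'adverb'],
                   ["'s", 'VERB', 'verb']], columns=cols)
es = pd.DataFrame([['Existe', 'VERB', 'verb'],
                   ['una', 'DET', 'determiner']], columns=cols)

# axis=1 places the frames next to each other, matching rows 0 and 1
side_by_side = pd.concat([en, es], axis=1)
print(side_by_side)
```

Because alignment is by index, positions rather than meanings are paired: row 0 holds the first token of each sentence, which need not be a translation of its neighbor.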

The next section illustrates how to use parsed and annotated tokens to build a document-term matrix that can be used for text classification.
