The nltk.corpus package consists of a number of corpus reader classes that can be used to obtain the contents of various corpora. The Treebank corpus can also be accessed from nltk.corpus. Identifiers for its files can be obtained using fileids():
>>> import nltk
>>> import nltk.corpus
>>> # The backslash must be escaped inside the string literal
>>> print(str(nltk.corpus.treebank).replace('\\','/'))
<BracketParseCorpusReader in 'C:/nltk_data/corpora/treebank/combined'>
>>> nltk.corpus.treebank.fileids()
['wsj_0001.mrg', 'wsj_0002.mrg', 'wsj_0003.mrg', 'wsj_0004.mrg', 'wsj_0005.mrg', 'wsj_0006.mrg',
 'wsj_0007.mrg', 'wsj_0008.mrg', 'wsj_0009.mrg', 'wsj_0010.mrg', 'wsj_0011.mrg', 'wsj_0012.mrg',
 'wsj_0013.mrg', 'wsj_0014.mrg', 'wsj_0015.mrg', 'wsj_0016.mrg', 'wsj_0017.mrg', 'wsj_0018.mrg',
 'wsj_0019.mrg', 'wsj_0020.mrg', 'wsj_0021.mrg', 'wsj_0022.mrg', 'wsj_0023.mrg', 'wsj_0024.mrg',
 'wsj_0025.mrg', 'wsj_0026.mrg', 'wsj_0027.mrg', 'wsj_0028.mrg', 'wsj_0029.mrg', 'wsj_0030.mrg',
 'wsj_0031.mrg', 'wsj_0032.mrg', 'wsj_0033.mrg', 'wsj_0034.mrg', 'wsj_0035.mrg', 'wsj_0036.mrg',
 'wsj_0037.mrg', 'wsj_0038.mrg', 'wsj_0039.mrg', 'wsj_0040.mrg', 'wsj_0041.mrg', 'wsj_0042.mrg',
 'wsj_0043.mrg', 'wsj_0044.mrg', 'wsj_0045.mrg', 'wsj_0046.mrg', 'wsj_0047.mrg', 'wsj_0048.mrg',
 'wsj_0049.mrg', 'wsj_0050.mrg', 'wsj_0051.mrg', 'wsj_0052.mrg', 'wsj_0053.mrg', 'wsj_0054.mrg',
 'wsj_0055.mrg', 'wsj_0056.mrg', 'wsj_0057.mrg', 'wsj_0058.mrg', 'wsj_0059.mrg', 'wsj_0060.mrg',
 'wsj_0061.mrg', 'wsj_0062.mrg', 'wsj_0063.mrg', 'wsj_0064.mrg', 'wsj_0065.mrg', 'wsj_0066.mrg',
 'wsj_0067.mrg', 'wsj_0068.mrg', 'wsj_0069.mrg', 'wsj_0070.mrg', 'wsj_0071.mrg', 'wsj_0072.mrg',
 'wsj_0073.mrg', 'wsj_0074.mrg', 'wsj_0075.mrg', 'wsj_0076.mrg', 'wsj_0077.mrg', 'wsj_0078.mrg',
 'wsj_0079.mrg', 'wsj_0080.mrg', 'wsj_0081.mrg', 'wsj_0082.mrg', 'wsj_0083.mrg', 'wsj_0084.mrg',
 'wsj_0085.mrg', 'wsj_0086.mrg', 'wsj_0087.mrg', 'wsj_0088.mrg', 'wsj_0089.mrg', 'wsj_0090.mrg',
 'wsj_0091.mrg', 'wsj_0092.mrg', 'wsj_0093.mrg', 'wsj_0094.mrg', 'wsj_0095.mrg', 'wsj_0096.mrg',
 'wsj_0097.mrg', 'wsj_0098.mrg', 'wsj_0099.mrg', 'wsj_0100.mrg', 'wsj_0101.mrg', 'wsj_0102.mrg',
 'wsj_0103.mrg', 'wsj_0104.mrg', 'wsj_0105.mrg', 'wsj_0106.mrg', 'wsj_0107.mrg', 'wsj_0108.mrg',
 'wsj_0109.mrg', 'wsj_0110.mrg', 'wsj_0111.mrg', 'wsj_0112.mrg', 'wsj_0113.mrg', 'wsj_0114.mrg',
 'wsj_0115.mrg', 'wsj_0116.mrg', 'wsj_0117.mrg', 'wsj_0118.mrg', 'wsj_0119.mrg', 'wsj_0120.mrg',
 'wsj_0121.mrg', 'wsj_0122.mrg', 'wsj_0123.mrg', 'wsj_0124.mrg', 'wsj_0125.mrg', 'wsj_0126.mrg',
 'wsj_0127.mrg', 'wsj_0128.mrg', 'wsj_0129.mrg', 'wsj_0130.mrg', 'wsj_0131.mrg', 'wsj_0132.mrg',
 'wsj_0133.mrg', 'wsj_0134.mrg', 'wsj_0135.mrg', 'wsj_0136.mrg', 'wsj_0137.mrg', 'wsj_0138.mrg',
 'wsj_0139.mrg', 'wsj_0140.mrg', 'wsj_0141.mrg', 'wsj_0142.mrg', 'wsj_0143.mrg', 'wsj_0144.mrg',
 'wsj_0145.mrg', 'wsj_0146.mrg', 'wsj_0147.mrg', 'wsj_0148.mrg', 'wsj_0149.mrg', 'wsj_0150.mrg',
 'wsj_0151.mrg', 'wsj_0152.mrg', 'wsj_0153.mrg', 'wsj_0154.mrg', 'wsj_0155.mrg', 'wsj_0156.mrg',
 'wsj_0157.mrg', 'wsj_0158.mrg', 'wsj_0159.mrg', 'wsj_0160.mrg', 'wsj_0161.mrg', 'wsj_0162.mrg',
 'wsj_0163.mrg', 'wsj_0164.mrg', 'wsj_0165.mrg', 'wsj_0166.mrg', 'wsj_0167.mrg', 'wsj_0168.mrg',
 'wsj_0169.mrg', 'wsj_0170.mrg', 'wsj_0171.mrg', 'wsj_0172.mrg', 'wsj_0173.mrg', 'wsj_0174.mrg',
 'wsj_0175.mrg', 'wsj_0176.mrg', 'wsj_0177.mrg', 'wsj_0178.mrg', 'wsj_0179.mrg', 'wsj_0180.mrg',
 'wsj_0181.mrg', 'wsj_0182.mrg', 'wsj_0183.mrg', 'wsj_0184.mrg', 'wsj_0185.mrg', 'wsj_0186.mrg',
 'wsj_0187.mrg', 'wsj_0188.mrg', 'wsj_0189.mrg', 'wsj_0190.mrg', 'wsj_0191.mrg', 'wsj_0192.mrg',
 'wsj_0193.mrg', 'wsj_0194.mrg', 'wsj_0195.mrg', 'wsj_0196.mrg', 'wsj_0197.mrg', 'wsj_0198.mrg',
 'wsj_0199.mrg']
>>> from nltk.corpus import treebank
>>> print(treebank.words('wsj_0007.mrg'))
['McDermott', 'International', 'Inc.', 'said', '0', ...]
>>> print(treebank.tagged_words('wsj_0007.mrg'))
[('McDermott', 'NNP'), ('International', 'NNP'), ...]
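Because fileids() returns an ordinary list of strings, it can be sliced to select a subset of the corpus. The following is a minimal sketch under stated assumptions: the list is rebuilt by hand so that the example runs without the corpus installed, and split_fileids and its 90/10 split are hypothetical names and values chosen for illustration, not part of NLTK.

```python
# Hypothetical stand-in for treebank.fileids(): the listing shown above
# runs from wsj_0001.mrg to wsj_0199.mrg.
fileids = [f"wsj_{n:04d}.mrg" for n in range(1, 200)]


def split_fileids(ids, train_fraction=0.9):
    """Split a list of file identifiers into (train, test) by position."""
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


train_ids, test_ids = split_fileids(fileids)
print(len(train_ids), len(test_ids))   # 179 20
print(train_ids[0], test_ids[0])       # wsj_0001.mrg wsj_0180.mrg
```

With the corpus installed, either list could then be passed to a reader method such as treebank.words() in place of a single file identifier.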
Let's see the code in NLTK for accessing the Penn Treebank Corpus, which uses the Treebank Corpus Reader contained in the corpus module:
>>> import nltk
>>> from nltk.corpus import treebank
>>> print(treebank.parsed_sents('wsj_0007.mrg')[2])
(S
  (NP-SBJ
    (NP (NNP Bailey) (NNP Controls))
    (, ,)
    (VP
      (VBN based)
      (NP (-NONE- *))
      (PP-LOC-CLR
        (IN in)
        (NP (NP (NNP Wickliffe)) (, ,) (NP (NNP Ohio)))))
    (, ,))
  (VP
    (VBZ makes)
    (NP
      (JJ computerized)
      (JJ industrial)
      (NNS controls)
      (NNS systems)))
  (. .))

>>> import nltk
>>> from nltk.corpus import treebank_chunk
>>> treebank_chunk.chunked_sents()[1]
Tree('S', [Tree('NP', [('Mr.', 'NNP'), ('Vinken', 'NNP')]), ('is', 'VBZ'),
 Tree('NP', [('chairman', 'NN')]), ('of', 'IN'),
 Tree('NP', [('Elsevier', 'NNP'), ('N.V.', 'NNP')]), (',', ','),
 Tree('NP', [('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN')]),
 ('.', '.')])
>>> treebank_chunk.chunked_sents()[1].draw()
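The preceding call to draw() opens a window that displays the chunk tree for the sentence. The leaves of the same tree, the leaves paired with the label of their parent node, and the productions of the tree can be obtained as follows: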
>>> import nltk
>>> from nltk.corpus import treebank_chunk
>>> treebank_chunk.chunked_sents()[1].leaves()
[('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'),
 ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'),
 ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')]
>>> treebank_chunk.chunked_sents()[1].pos()
[(('Mr.', 'NNP'), 'NP'), (('Vinken', 'NNP'), 'NP'), (('is', 'VBZ'), 'S'),
 (('chairman', 'NN'), 'NP'), (('of', 'IN'), 'S'), (('Elsevier', 'NNP'), 'NP'),
 (('N.V.', 'NNP'), 'NP'), ((',', ','), 'S'), (('the', 'DT'), 'NP'),
 (('Dutch', 'NNP'), 'NP'), (('publishing', 'VBG'), 'NP'), (('group', 'NN'), 'NP'),
 (('.', '.'), 'S')]
>>> treebank_chunk.chunked_sents()[1].productions()
[S -> NP ('is', 'VBZ') NP ('of', 'IN') NP (',', ',') NP ('.', '.'),
 NP -> ('Mr.', 'NNP') ('Vinken', 'NNP'),
 NP -> ('chairman', 'NN'),
 NP -> ('Elsevier', 'NNP') ('N.V.', 'NNP'),
 NP -> ('the', 'DT') ('Dutch', 'NNP') ('publishing', 'VBG') ('group', 'NN')]
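To make the bracket notation printed by parsed_sents() and the idea behind leaves() more concrete, here is a rough pure-Python sketch that reads one bracketed sentence into nested tuples and walks it to collect the tagged words. It is a simplified stand-in, not NLTK's own reader or Tree class: parse_bracketed and tagged_leaves are hypothetical helpers, and dropping the -NONE- trace token is a choice made only for this illustration.

```python
import re

# One parsed sentence in Penn Treebank bracket notation, copied from the
# parsed_sents() output shown earlier.
BRACKETED = (
    "(S (NP-SBJ (NP (NNP Bailey) (NNP Controls)) (, ,) "
    "(VP (VBN based) (NP (-NONE- *)) (PP-LOC-CLR (IN in) "
    "(NP (NP (NNP Wickliffe)) (, ,) (NP (NNP Ohio))))) (, ,)) "
    "(VP (VBZ makes) (NP (JJ computerized) (JJ industrial) "
    "(NNS controls) (NNS systems))) (. .))"
)


def parse_bracketed(text):
    """Parse bracket notation into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return parse_node()


def tagged_leaves(node):
    """Yield (word, tag) pairs from the preterminal nodes, left to right."""
    label, children = node
    for child in children:
        if isinstance(child, str):
            yield (child, label)
        else:
            yield from tagged_leaves(child)


tree = parse_bracketed(BRACKETED)
pairs = list(tagged_leaves(tree))
# Illustration-only choice: skip the empty trace element tagged -NONE-.
words = [word for word, tag in pairs if tag != "-NONE-"]
print(pairs[:3])
print(words)
```

The recursion mirrors the nesting of the brackets: every opening parenthesis starts a node whose first token is its label, and a bare token inside a node is a word.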
Part-of-speech annotations are returned by the tagged_words() method:
>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
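Since each item is a (word, tag) pair, the words can be grouped by their tag with plain Python. The sketch below uses a short hand-copied sample rather than the corpus itself, so it runs without NLTK data; the variable names are illustrative only.

```python
from collections import defaultdict

# (word, tag) pairs copied from the chunked sentence shown earlier; they have
# the same shape as the items returned by tagged_words().
tagged = [
    ("Mr.", "NNP"), ("Vinken", "NNP"), ("is", "VBZ"), ("chairman", "NN"),
    ("of", "IN"), ("Elsevier", "NNP"), ("N.V.", "NNP"), (",", ","),
    ("the", "DT"), ("Dutch", "NNP"), ("publishing", "VBG"), ("group", "NN"),
    (".", "."),
]

# Map each tag to the list of words that carry it, in order of appearance.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

proper_nouns = words_by_tag["NNP"]
print(proper_nouns)  # ['Mr.', 'Vinken', 'Elsevier', 'N.V.', 'Dutch']
```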
The tags used in the Penn Treebank Corpus and the count of each tag are shown here:
| Tag | Count |
|---|---|
| # | 16 |
| $ | 724 |
| '' | |
| , | 4,886 |
| -LRB- | 120 |
| -NONE- | 6,592 |
| -RRB- | 126 |
| . | 384 |
| : | 563 |
| `` | 712 |
| CC | 2,265 |
| CD | 3,546 |
| DT | 8,165 |
| EX | 88 |
| FW | 4 |
| IN | 9,857 |
| JJ | 5,834 |
| JJR | 381 |
| JJS | 182 |
| LS | 13 |
| MD | 927 |
| NN | 13,166 |
| NNP | 9,410 |
| NNPS | 244 |
| NNS | 6,047 |
| PDT | 27 |
| POS | 824 |
| PRP | 1,716 |
| PRP$ | 766 |
| RB | 2,822 |
| RBR | 136 |
| RBS | 35 |
| RP | 216 |
| SYM | 1 |
| TO | 2,179 |
| UH | 3 |
| VB | 2,554 |
| VBD | 3,043 |
| VBG | 1,460 |
| VBN | 2,134 |
| VBP | 1,321 |
| VBZ | 2,125 |
| WDT | 445 |
| WP | 241 |
| WP$ | 14 |
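The tags and their frequencies can be obtained using the following code:
>>> import nltk
>>> from nltk.probability import FreqDist
>>> from nltk.corpus import treebank
>>> # Count every tag; an empty FreqDist() would only give dict_items([])
>>> fd = FreqDist(tag for (word, tag) in treebank.tagged_words())
>>> fd.items()  # (tag, count) pairs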
The preceding code obtains a list of tags and the frequency of each tag in the Treebank corpus.
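The counting step can also be tried without the corpus. In NLTK 3, FreqDist builds on Python's collections.Counter, so the following sketch uses Counter directly on the tags of the parsed sentence shown earlier. The sample is hand-copied and the variable names are illustrative assumptions.

```python
from collections import Counter

# (word, tag) pairs read off the parsed sentence from wsj_0007.mrg shown
# earlier, including the empty trace element tagged -NONE-.
tagged = [
    ("Bailey", "NNP"), ("Controls", "NNP"), (",", ","), ("based", "VBN"),
    ("*", "-NONE-"), ("in", "IN"), ("Wickliffe", "NNP"), (",", ","),
    ("Ohio", "NNP"), (",", ","), ("makes", "VBZ"), ("computerized", "JJ"),
    ("industrial", "JJ"), ("controls", "NNS"), ("systems", "NNS"), (".", "."),
]

# Count how often each tag occurs, ignoring the words themselves.
tag_counts = Counter(tag for _, tag in tagged)

for tag, count in tag_counts.most_common(2):
    print(tag, count)
# NNP 4
# , 3
```

The same generator expression, applied to treebank.tagged_words(), is what feeds the frequency distribution over the whole corpus.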
Let's see the code in NLTK for accessing the Sinica Treebank Corpus:
>>> import nltk
>>> from nltk.corpus import sinica_treebank
>>> print(sinica_treebank.sents())
[['一'], ['友情'], ['嘉珍', '和', '我', '住在', '同一條', '巷子'], ...]
>>> sinica_treebank.parsed_sents()[27]
Tree('S', [Tree('NP', [Tree('NP', [Tree('N‧的', [Tree('Nhaa', ['我']), Tree('DE', ['的'])]),
 Tree('Ncb', ['腦海'])]), Tree('Ncda', ['中'])]), Tree('Dd', ['頓時']),
 Tree('DM', ['一片']), Tree('VH11', ['空白'])])