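1. Recommended POS tagger
NLTK's recommended tagger is pos_tag(), which is trained on the Penn Treebank (PTB). Passing tagset='universal', as below, maps its PTB tags onto the coarser universal tagset. The following code shows how to get the POS tags of a sentence with NLTK:

    import nltk

    sentence = 'The brown fox is quick and he is jumping over the lazy dog'

    # recommended tagger based on PTB
    # (needs the NLTK tokenizer and tagger data, fetched once with nltk.download())
    tokens = nltk.word_tokenize(sentence)
    tagged_sent = nltk.pos_tag(tokens, tagset='universal')
    print(tagged_sent)

Output: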
[('The', 'DET'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('is', 'VERB'), ('quick', 'ADJ'), ('and', 'CONJ'), ('he', 'PRON'), ('is', 'VERB'), ('jumping', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN')]
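To make the effect of tagset='universal' concrete, the sketch below collapses fine-grained PTB tags into universal tags with a small hand-written table. The names PTB_TO_UNIVERSAL and to_universal are hypothetical and the table covers only a handful of tags. NLTK ships its own complete mapping, so treat this as an illustration of the idea rather than the library's implementation.

```python
# Partial, hand-written PTB -> universal mapping (illustrative only).
PTB_TO_UNIVERSAL = {
    'DT': 'DET', 'JJ': 'ADJ', 'NN': 'NOUN', 'NNS': 'NOUN',
    'VBZ': 'VERB', 'VBG': 'VERB', 'IN': 'ADP', 'CC': 'CONJ', 'PRP': 'PRON',
}


def to_universal(tagged_tokens, fallback='X'):
    """Map (word, ptb_tag) pairs to (word, universal_tag) pairs.

    Tags missing from the table fall back to 'X', the universal
    tagset's catch-all category.
    """
    return [(word, PTB_TO_UNIVERSAL.get(tag, fallback))
            for word, tag in tagged_tokens]


ptb_tagged = [('The', 'DT'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ')]
universal_tagged = to_universal(ptb_tagged)
print(universal_tagged)
# [('The', 'DET'), ('fox', 'NOUN'), ('is', 'VERB'), ('quick', 'ADJ')]
```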
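2. Building your own POS tagger
Prepare the data:

    # preparing the data
    from nltk.corpus import treebank

    data = treebank.tagged_sents()
    train_data = data[:3500]   # first 3,500 sentences for training
    test_data = data[3500:]    # remaining sentences for testing
    print(train_data[0])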
Output:
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
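2.1 DefaultTagger
First, try DefaultTagger, which inherits from the SequentialBackoffTagger base class and assigns the same user-specified POS tag to every word.

    # default tagger
    from nltk.tag import DefaultTagger

    dt = DefaultTagger('NN')
    # evaluate() is deprecated in newer NLTK releases in favor of accuracy()
    print(dt.evaluate(test_data))
    print(dt.tag(tokens))

Output: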
0.1454158195372253
[('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'NN'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('is', 'NN'), ('jumping', 'NN'), ('over', 'NN'), ('the', 'NN'), ('lazy', 'NN'), ('dog', 'NN')]
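The default tagger reaches only about 14.5% accuracy. Because it assigns the same tag ('NN') to every token, every word in the output is tagged as a noun, and the score is simply the share of test tokens whose gold tag is NN.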
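The accuracy figure above is a per-token match rate between predicted and gold tags. The sketch below shows that calculation in plain Python on a toy gold sentence. The helpers tag_accuracy and always_nn are hypothetical stand-ins, not NLTK functions, and the toy data is invented for illustration.

```python
def tag_accuracy(gold_sents, tag_fn):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0


def always_nn(words):
    """Mimic a default tagger: every token gets 'NN'."""
    return [(word, 'NN') for word in words]


# Toy gold data: only one of the four tokens is really a noun.
gold = [[('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB')]]
default_accuracy = tag_accuracy(gold, always_nn)
print(default_accuracy)  # 0.25
```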
2.2 RegexpTagger
RegexpTagger assigns tags by matching each token against a list of regular expressions.

    # regex tagger
    from nltk.tag import RegexpTagger

    # define regex tag patterns; they are tried in order and the first match wins
    patterns = [
        (r'.*ing$', 'VBG'),                # gerunds
        (r'.*ed$', 'VBD'),                 # simple past
        (r'.*es$', 'VBZ'),                 # 3rd singular present
        (r'.*ould$', 'MD'),                # modals
        (r'.*\'s$', 'NN$'),                # possessive nouns
        (r'.*s$', 'NNS'),                  # plural nouns
        (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers ('.' is unescaped, so '1,000' also matches)
        (r'.*', 'NN'),                     # nouns (default)
    ]
    rt = RegexpTagger(patterns)
    print(rt.evaluate(test_data))
    print(rt.tag(tokens))

Output:
0.24039113176493368
[('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'NNS'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('is', 'NNS'), ('jumping', 'VBG'), ('over', 'NN'), ('the', 'NN'), ('lazy', 'NN'), ('dog', 'NN')]
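Accuracy rises to about 24%, a real improvement, although the suffix rules still misfire: 'is' is tagged NNS simply because it ends in 's'.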
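The behavior of an ordered pattern list can be sketched in plain Python: each token is tested against the patterns from top to bottom and takes the tag of the first one that matches. The function regex_tag and the shortened pattern list below are hypothetical, written only to show why 'is' ends up as NNS. This is a simplified sketch, not RegexpTagger's actual code.

```python
import re

# Shortened pattern list; order matters because the first match wins.
PATTERNS = [
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*s$', 'NNS'),     # plural nouns (also catches 'is')
    (r'.*', 'NN'),        # default: noun
]


def regex_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern it matches."""
    tagged = []
    for token in tokens:
        for pattern, tag in patterns:
            if re.match(pattern, token):
                tagged.append((token, tag))
                break
        else:
            # No pattern matched (cannot happen here because of r'.*').
            tagged.append((token, None))
    return tagged


regex_result = regex_tag(['fox', 'is', 'jumping'])
print(regex_result)
# [('fox', 'NN'), ('is', 'NNS'), ('jumping', 'VBG')]
```

Putting more specific patterns first and the catch-all last is what keeps the default rule from shadowing everything else.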
2.3 Unigram, bigram, and trigram taggers
These taggers are trained on the tagged sentences and look up each token's tag by its n-gram context.

    ## N-gram taggers
    from nltk.tag import UnigramTagger
    from nltk.tag import BigramTagger
    from nltk.tag import TrigramTagger

    ut = UnigramTagger(train_data)
    bt = BigramTagger(train_data)
    tt = TrigramTagger(train_data)

    print(ut.evaluate(test_data))
    print(ut.tag(tokens))

    print(bt.evaluate(test_data))
    print(bt.tag(tokens))

    print(tt.evaluate(test_data))
    print(tt.tag(tokens))

Output:
0.8607803272340013
[('The', 'DT'), ('brown', None), ('fox', None), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', None), ('dog', None)]
0.13466937748087907
[('The', 'DT'), ('brown', None), ('fox', None), ('is', None), ('quick', None), ('and', None), ('he', None), ('is', None), ('jumping', None), ('over', None), ('the', None), ('lazy', None), ('dog', None)]
0.08064672281924679
[('The', 'DT'), ('brown', None), ('fox', None), ('is', None), ('quick', None), ('and', None), ('he', None), ('is', None), ('jumping', None), ('over', None), ('the', None), ('lazy', None), ('dog', None)]
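The unigram tagger scores highest, at about 86%. The bigram and trigram taggers do far worse (about 13% and 8%), most likely because the bigrams and trigrams observed in the training data do not necessarily recur in the same way in the test data. The sample output also shows a knock-on effect: once 'brown' is tagged None, every later token in the sentence stays untagged, presumably because its context now contains a tag that never occurred in training.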
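The difference between word-only and context-dependent lookup can be reproduced on toy data. The sketch below is a simplified, hypothetical model (the names train_ngram_sketch, tag_ngram_sketch, and START are invented here, and this is not NLTK's implementation): the unigram variant keys on the word alone, while the bigram variant keys on the pair (previous tag, word), so a single unseen word leaves the rest of the sentence untagged.

```python
from collections import Counter, defaultdict

START = '<s>'  # sentinel "previous tag" at the start of a sentence


def train_ngram_sketch(tagged_sents, use_prev_tag):
    """Learn the most frequent tag per context from tagged sentences."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            context = (prev, word) if use_prev_tag else word
            counts[context][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def tag_ngram_sketch(model, tokens, use_prev_tag):
    """Tag tokens by context lookup; unseen contexts yield None."""
    tagged, prev = [], START
    for word in tokens:
        context = (prev, word) if use_prev_tag else word
        tag = model.get(context)
        tagged.append((word, tag))
        prev = tag  # a None here poisons the next bigram context
    return tagged


train = [[('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]]
sample_tokens = ['the', 'cat', 'barks']  # 'cat' never appears in training

unigram_result = tag_ngram_sketch(
    train_ngram_sketch(train, use_prev_tag=False), sample_tokens, False)
bigram_result = tag_ngram_sketch(
    train_ngram_sketch(train, use_prev_tag=True), sample_tokens, True)

print(unigram_result)  # [('the', 'DT'), ('cat', None), ('barks', 'VBZ')]
print(bigram_result)   # [('the', 'DT'), ('cat', None), ('barks', None)]
```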
2.4 Combining a list of taggers with a backoff tagger
Essentially, we build a chain of taggers: whenever a tagger cannot tag an input token, it falls back to its backoff tagger:

    def combined_tagger(train_data, taggers, backoff=None):
        # each new tagger uses the previously built one as its backoff
        for tagger in taggers:
            backoff = tagger(train_data, backoff=backoff)
        return backoff

    # back off to the regex tagger
    ct = combined_tagger(train_data=train_data,
                         taggers=[UnigramTagger, BigramTagger, TrigramTagger],
                         backoff=rt)
    print(ct.evaluate(test_data))
    print(ct.tag(tokens))

Output:
0.9094781682641108
[('The', 'DT'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]
Accuracy reaches about 91%.
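The backoff idea itself fits in a few lines of plain Python: try each tagger in priority order and keep the first answer that is not None. In the chain built above, that order is trigram, bigram, unigram, then the regex tagger. The helpers below (make_lookup_tagger, make_default_tagger, tag_with_backoff) are hypothetical and work on single words only, so this is a sketch of the control flow rather than of NLTK's SequentialBackoffTagger.

```python
def make_lookup_tagger(table):
    """Return a tagger that knows only the words in `table`."""
    return lambda word: table.get(word)


def make_default_tagger(tag):
    """Return a tagger that always answers with `tag`."""
    return lambda word: tag


def tag_with_backoff(tokens, taggers):
    """Tag each token with the first tagger that returns a non-None tag."""
    tagged = []
    for word in tokens:
        tag = None
        for tagger in taggers:
            tag = tagger(word)
            if tag is not None:
                break  # a more specific tagger answered; stop backing off
        tagged.append((word, tag))
    return tagged


chain = [
    make_lookup_tagger({'the': 'DT', 'is': 'VBZ'}),  # most specific first
    make_default_tagger('NN'),                       # last resort
]
backoff_result = tag_with_backoff(['the', 'fox', 'is'], chain)
print(backoff_result)
# [('the', 'DT'), ('fox', 'NN'), ('is', 'VBZ')]
```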
2.5 ClassifierBasedPOSTagger (supervised classification)
ClassifierBasedPOSTagger trains a tagger with a supervised machine learning algorithm, which is supplied through its classifier_builder parameter.

    from nltk.classify import NaiveBayesClassifier, MaxentClassifier
    from nltk.tag.sequential import ClassifierBasedPOSTagger

    # train a Naive Bayes classifier as the tagging model
    nbt = ClassifierBasedPOSTagger(train=train_data,
                                   classifier_builder=NaiveBayesClassifier.train)
    print(nbt.evaluate(test_data))
    print(nbt.tag(tokens))

Output:
0.9306806079969019
[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'VBG')]
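The supervised tagger reaches about 93% accuracy on the test data, although on the sample sentence it mislabels 'dog' as VBG.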
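A classifier-based tagger does not look tokens up directly. It turns each token into a dictionary of features and lets the classifier predict a tag from them. The sketch below shows what such a feature function might look like. The function token_features and its feature names are assumptions made for illustration, and NLTK's built-in feature detector differs in its details.

```python
def token_features(tokens, index, history):
    """Build a feature dict for tokens[index].

    `history` holds the tags already predicted for tokens[:index].
    """
    word = tokens[index]
    return {
        'word': word.lower(),
        'suffix3': word[-3:].lower(),           # crude morphology cue
        'is_capitalized': word[:1].isupper(),
        'prev_word': tokens[index - 1].lower() if index > 0 else '<s>',
        'prev_tag': history[index - 1] if index > 0 else '<s>',
    }


features = token_features(['The', 'fox', 'jumping'], 2, ['DT', 'NN'])
print(features)
# {'word': 'jumping', 'suffix3': 'ing', 'is_capitalized': False,
#  'prev_word': 'fox', 'prev_tag': 'NN'}
```

Because suffixes and neighboring tags are features rather than exact lookup keys, a classifier can still make a guess for words it never saw in training.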
2.6 Just for fun: MaxentClassifier

    # try this out for fun!
    met = ClassifierBasedPOSTagger(train=train_data,
                                   classifier_builder=MaxentClassifier.train)
    print(met.evaluate(test_data))
    print(met.tag(tokens))

Output:
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -3.82864        0.007
             2          -0.76176        0.957
         Final               nan        0.984
0.9269048310581857
[('The', 'DT'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]
The MaxentClassifier tagger reaches about 92.7% accuracy, slightly below the Naive Bayes tagger.