nltk---词性标注 | CinKate's Blogs

1.POS标签器推荐

使用nltk推荐的pos_tag()函数，基于Penn Treebank，以下代码展示了使用nltk获取句子POS标签的方法：

sentence = 'The brown fox is quick and he is jumping over the lazy dog'
# recommended tagger based on PTB
import nltk
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens, tagset='universal')
print (tagged_sent)

输出：

[('The', 'DET'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('is', 'VERB'), ('quick', 'ADJ'), ('and', 'CONJ'), ('he', 'PRON'), ('is', 'VERB'), ('jumping', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN')]

2.建立自己的POS标签器

准备数据：

# preparing the data
from nltk.corpus import treebank
data = treebank.tagged_sents()
train_data = data[:3500]
test_data = data[3500:]
print (train_data[0])

输出：

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

2.1DefaultTagger默认标签器

首先我们试下从SequentialBackoffTagger基类继承的DefaultTagger，并为每个单词分配相同的用户输入POS标签。

# default tagger
from nltk.tag import DefaultTagger
dt = DefaultTagger('NN')
print(dt.evaluate(test_data))
print(dt.tag(tokens))

输出：

0.1454158195372253
[('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'NN'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('is', 'NN'), ('jumping', 'NN'), ('over', 'NN'), ('the', 'NN'), ('lazy', 'NN'), ('dog', 'NN')]

默认获得了14%的准确率，由于给标签器输入的都是相同的标签（‘NN’），因此输出标签获得的都是名词。

2.2RegexpTagger正则表达式标签器

# regex tagger
from nltk.tag import RegexpTagger
# define regex tag patterns
patterns = [
        (r'.*ing$', 'VBG'),               # gerunds
        (r'.*ed$', 'VBD'),                # simple past
        (r'.*es$', 'VBZ'),                # 3rd singular present
        (r'.*ould$', 'MD'),               # modals
        (r'.*\'s$', 'NN$'),               # possessive nouns
        (r'.*s$', 'NNS'),                 # plural nouns
        (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
        (r'.*', 'NN')                     # nouns (default) ... 
]
rt = RegexpTagger(patterns)

print(rt.evaluate(test_data))
print(rt.tag(tokens))

输出：

0.24039113176493368
[('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'NNS'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('is', 'NNS'), ('jumping', 'VBG'), ('over', 'NN'), ('the', 'NN'), ('lazy', 'NN'), ('dog', 'NN')]

准确率提高到了24%，还是有效果的~

2.3一、二、三元标签器

## N gram taggers
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger

ut = UnigramTagger(train_data)
bt = BigramTagger(train_data)
tt = TrigramTagger(train_data)

print(ut.evaluate(test_data))
print(ut.tag(tokens))

print(bt.evaluate(test_data))
print(bt.tag(tokens))

print (tt.evaluate(test_data))
print(tt.tag(tokens))

输出：

0.8607803272340013
[('The', 'DT'), ('brown', None), ('fox', None), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', None), ('dog', None)]
0.13466937748087907
[('The', 'DT'), ('brown', None), ('fox', None), ('is', None), ('quick', None), ('and', None), ('he', None), ('is', None), ('jumping', None), ('over', None), ('the', None), ('lazy', None), ('dog', None)]
0.08064672281924679
[('The', 'DT'), ('brown', None), ('fox', None), ('is', None), ('quick', None), ('and', None), ('he', None), ('is', None), ('jumping', None), ('over', None), ('the', None), ('lazy', None), ('dog', None)]

发现一元的准确率最高，达到了86%,二、三元准确率低的原因可能是在训练数据中观察到的二元词组和三元词组不一定会在测试数据中以相同的方式出现。

2.4包含标签列表的组合标签器及使用backoff标签器

本质上，我们将创建一个标签器链，对于每一个标签器，吐过他不能标记输入的标识，则标签器的下一步将会回退到backoff标签器：

def combined_tagger(train_data, taggers, backoff=None):
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    return backoff
#backoff to regtagger
ct = combined_tagger(train_data=train_data, 
                     taggers=[UnigramTagger, BigramTagger, TrigramTagger],
                     backoff=rt)

print(ct.evaluate(test_data))        
print(ct.tag(tokens))

输出：

0.9094781682641108
[('The', 'DT'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

准确率到了90%

2.5ClassifierBasedPOSTagger标签器(有监督分类算法)

使用ClassifierBasedPOSTagger类中的classifier_builder参数中的有监督机器学习算法来训练标签器。

from nltk.classify import NaiveBayesClassifier, MaxentClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger

nbt = ClassifierBasedPOSTagger(train=train_data,
                               classifier_builder=NaiveBayesClassifier.train)

print(nbt.evaluate(test_data))
print(nbt.tag(tokens))

输出：

0.9306806079969019
[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'VBG')]

有监督准确率达到了0.93

2.6Just For Fun!(MaxentClassifier)

# try this out for fun!
met = ClassifierBasedPOSTagger(train=train_data,
                               classifier_builder=MaxentClassifier.train)
print(met.evaluate(test_data))                           
print(met.tag(tokens))

输出：


==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -3.82864        0.007
             2          -0.76176        0.957
         Final               nan        0.984
0.9269048310581857
[('The', 'DT'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

MaxentClassifier准确率达到了0.92