NLTK: Tokenization and Text Preprocessing

Based on Chapter 3, "Processing and Understanding Text", of *Text Analytics with Python*, this post summarizes NLTK and other commonly used NLP packages for later review and use.

1. Tokenization (sentences and words)

First of all, a token is the smallest independent unit of text that carries syntactic and semantic meaning; tokenization is the process of splitting raw text into such tokens.

1.1 Sentence Tokenization

The basic technique for sentence tokenization is to look for specific delimiters between sentences, such as a period ('.'), a newline ('\n'), or a semicolon (';').
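As a quick illustration (my own sketch, not from the book), a naive regex split on '.', '!', '?' already works for simple text but breaks on abbreviations such as 'Mr.' or 'a.m.', which is why the trained tokenizers listed below are preferred:

import re

naive_sentences = re.split(r'[.!?]\s+', 'Mr. Smith arrived at 10 a.m. sharp! He left early.')
print(naive_sentences)
# -> ['Mr', 'Smith arrived at 10 a.m', 'sharp', 'He left early.']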
In NLTK, we will mainly look at the following sentence tokenizers:

  • nltk.sent_tokenize (the default sentence tokenizer)
  • nltk.tokenize.PunktSentenceTokenizer()
  • nltk.tokenize.RegexpTokenizer()

The code below demonstrates each of them:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint

# load the corpus
alice=gutenberg.raw(fileids='carroll-alice.txt')
sample_text = 'We will discuss briefly about the basic syntax,\
structure and design philosophies. \
There is a defined hierarchical syntax for Python code which you should remember \
when writing code! Python is a really powerful programming language!'
# Total characters in Alice in Wonderland
print(len(alice))
# First 100 characters in the corpus
print(alice[0:100])

Output:

144395
[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was

1.1.1 Default sentence tokenizer: nltk.sent_tokenize

# default sentence tokenizer
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)
print('Total sentences in sample_text:', len(sample_sentences))
print('Sample text sentences :-')
pprint(sample_sentences)
print('\nTotal sentences in alice:', len(alice_sentences))
print('First 5 sentences in alice:-')
pprint(alice_sentences[0:5])

Output:

Total sentences in sample_text: 3
Sample text sentences :-
['We will discuss briefly about the basic syntax, structure and design '
'philosophies.',
'There is a defined hierarchical syntax for Python code which you should '
'remember when writing code!',
'Python is a really powerful programming language!']

Total sentences in alice: 1625
First 5 sentences in alice:-
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
'Down the Rabbit-Hole\n'
'\n'
'Alice was beginning to get very tired of sitting by her sister on the\n'
'bank, and of having nothing to do: once or twice she had peeped into the\n'
'book her sister was reading, but it had no pictures or conversations in\n'
"it, 'and what is the use of a book,' thought Alice 'without pictures or\n"
"conversation?'",
'So she was considering in her own mind (as well as she could, for the\n'
'hot day made her feel very sleepy and stupid), whether the pleasure\n'
'of making a daisy-chain would be worth the trouble of getting up and\n'
'picking the daisies, when suddenly a White Rabbit with pink eyes ran\n'
'close by her.',
'There was nothing so VERY remarkable in that; nor did Alice think it so\n'
"VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!",
'Oh dear!']

Sentence tokenization for German text

## sentence tokenizers for other languages
from nltk.corpus import europarl_raw
# German corpus
german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
# total characters in the corpus
print(len(german_text))
# first 100 characters
print(german_text[0:100])

Output:

157171

Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit

Tokenizing the German text with the default sentence tokenizer

# default sentence tokenizer
german_sentences_def = default_st(text=german_text, language='german')

# loading german text tokenizer into a PunktSentenceTokenizer instance
german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)

# verify the type of german_tokenizer
# should be PunktSentenceTokenizer
print(type(german_tokenizer))

# check if results of both tokenizers match
# should be True
print(german_sentences_def == german_sentences)
# print first 5 sentences of the corpus
for sent in german_sentences[0:5]:
    print(sent)

Output:

<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>
True

Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .
Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .
Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .
Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .
Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .

1.1.2 PunktSentenceTokenizer

## using PunktSentenceTokenizer for sentence tokenization
punkt_st = nltk.tokenize.PunktSentenceTokenizer()
sample_sentences = punkt_st.tokenize(sample_text)
pprint(sample_sentences)

Output:

['We will discuss briefly about the basic syntax, structure and design '
'philosophies.',
'There is a defined hierarchical syntax for Python code which you should '
'remember when writing code!',
'Python is a really powerful programming language!']

1.1.3 RegexpTokenizer

## sentence tokenization with RegexpTokenizer and a regular expression
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(
        pattern=SENTENCE_TOKENS_PATTERN,
        gaps=True)
sample_sentences = regex_st.tokenize(sample_text)
pprint(sample_sentences)

Output:

['We will discuss briefly about the basic syntax, structure and design '
'philosophies.',
' There is a defined hierarchical syntax for Python code which you should '
'remember when writing code!',
'Python is a really powerful programming language!']

1.2 Word Tokenization

1.2.1 Default word tokenizer: nltk.word_tokenize

## word tokenization
sentence = "The brown fox wasn't that quick and he couldn't win the race"
# default word tokenizer
default_wt = nltk.word_tokenize
words = default_wt(sentence)
print(words)

Output:

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']

1.2.2 TreebankWordTokenizer

# treebank word tokenizer
treebank_wt = nltk.TreebankWordTokenizer()
words = treebank_wt.tokenize(sentence)
print(words)

Output:

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']

The result matches nltk.word_tokenize above, since the default word tokenizer uses the Treebank tokenizer under the hood.

1.2.3 Regex tokenizer: RegexpTokenizer

# regex-based word tokenization
TOKEN_PATTERN = r'\w+'
regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN,
                                gaps=False)
words = regex_wt.tokenize(sentence)
print(words)

Output:

['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']

Using a gap pattern instead (with gaps=True the tokenizer splits on matches of the pattern rather than returning them as tokens):

GAP_PATTERN = r'\s+'
regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN,
                                gaps=True)
words = regex_wt.tokenize(sentence)
print(words)

Output:

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']

Getting token spans (character start/end indices):

word_indices = list(regex_wt.span_tokenize(sentence))
print(word_indices)
print([sentence[start:end] for start, end in word_indices])

Output:

[(0, 3), (4, 9), (10, 13), (14, 20), (21, 25), (26, 31), (32, 35), (36, 38), (39, 47), (48, 51), (52, 55), (56, 60)]
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']

1.2.4 WordPunctTokenizer

# WordPunctTokenizer, a tokenizer derived from RegexpTokenizer, splits text into alphabetic and non-alphabetic tokens
wordpunkt_wt = nltk.WordPunctTokenizer()
words = wordpunkt_wt.tokenize(sentence)
print(words)

Output:

['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'race']

1.2.5 WhitespaceTokenizer

# WhitespaceTokenizer splits sentences into words on whitespace characters such as tabs, newlines, and spaces
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sentence)
print(words)

Output:

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']

2. Text Normalization

Text normalization is a process consisting of a series of steps for transforming, cleaning, and standardizing text data into a form that NLP and analytics systems and applications can consume. Tokenization itself is usually also part of text normalization. Besides tokenization, it covers various other techniques, including text cleaning, case conversion, word correction, stopword removal, stemming, and lemmatization. Text normalization is also frequently called text cleaning or wrangling.

This section covers the various techniques used in text normalization. Before exploring them, load the basic dependencies and the corpus that will be used with the following snippet:

import nltk
import re
import string
from pprint import pprint

corpus = ["The brown fox wasn't that quick and he couldn't win the race",
"Hey that's a great deal! I just bought a phone for $199",
"@@You'll (learn) a **lot** in the book. Python is an amazing language!@@"]

2.2 Text Cleaning

HTML data can be parsed and cleaned with NLTK's clean_html() function (removed in recent NLTK releases) or, preferably, with the BeautifulSoup library; XML data can be parsed with custom logic built on regular expressions, XPath, and the lxml library.
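The post includes no snippet here, so below is a minimal sketch (assuming BeautifulSoup 4 is installed as bs4; the sample HTML string is made up) of stripping markup to keep only the visible text:

from bs4 import BeautifulSoup

# hypothetical sample document with markup to remove
html_doc = '<div><p>Python is an <b>amazing</b> language!</p></div>'

# parse the HTML and keep only the visible text
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())  # -> Python is an amazing language!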

2.3 Tokenizing Text

def tokenize_text(text):
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
    return word_tokens

token_list = [tokenize_text(text)
              for text in corpus]
print(token_list)

Output:

[[['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']], [['Hey', 'that', "'s", 'a', 'great', 'deal', '!'], ['I', 'just', 'bought', 'a', 'phone', 'for', '$', '199']], [['@', '@', 'You', "'ll", '(', 'learn', ')', 'a', '**lot**', 'in', 'the', 'book', '.'], ['Python', 'is', 'an', 'amazing', 'language', '!'], ['@', '@']]]

2.4 Removing Special Characters

Removing special characters after tokenization:

def remove_characters_after_tokenization(tokens):
    # strip every punctuation character (string.punctuation) from each token
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = [pattern.sub('', token) for token in tokens]
    return filtered_tokens

filtered_list_1 = [[remove_characters_after_tokenization(tokens) for tokens in sentence_tokens]
                   for sentence_tokens in token_list]
pprint(filtered_list_1)

Output:

[[['The',
'brown',
'fox',
'was',
'nt',
'that',
'quick',
'and',
'he',
'could',
'nt',
'win',
'the',
'race']],
[['Hey', 'that', 's', 'a', 'great', 'deal', ''],
['I', 'just', 'bought', 'a', 'phone', 'for', '', '199']],
[['', '', 'You', 'll', '', 'learn', '', 'a', 'lot', 'in', 'the', 'book', ''],
['Python', 'is', 'an', 'amazing', 'language', ''],
['', '']]]

Removing special characters before tokenization:

def remove_characters_before_tokenization(sentence,
                                          keep_apostrophes=False):
    sentence = sentence.strip()
    if keep_apostrophes:
        # keep apostrophes and sentence punctuation, drop the other special characters
        PATTERN = r'[?|$|&|*|%|@|(|)|~]'
        filtered_sentence = re.sub(PATTERN, r'', sentence)
    else:
        # keep only letters, digits, and spaces
        PATTERN = r'[^a-zA-Z0-9 ]'
        filtered_sentence = re.sub(PATTERN, r'', sentence)
    return filtered_sentence

filtered_list_2 = [remove_characters_before_tokenization(sentence)
                   for sentence in corpus]
print(filtered_list_2)
cleaned_corpus = [remove_characters_before_tokenization(sentence, keep_apostrophes=True)
                  for sentence in corpus]
print(cleaned_corpus)

Output:

['The brown fox wasnt that quick and he couldnt win the race', 'Hey thats a great deal I just bought a phone for 199', 'Youll learn a lot in the book Python is an amazing language']
["The brown fox wasn't that quick and he couldn't win the race", "Hey that's a great deal! I just bought a phone for 199", "You'll learn a lot in the book. Python is an amazing language!"]

2.5 Expanding Contractions

This restores contractions such as isn't to is not, and so on.

from contractions import contractions_dict

def expand_contractions(sentence, contraction_mapping):
    # build one regex that matches any contraction in the mapping
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE|re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match) \
                               if contraction_mapping.get(match) \
                               else contraction_mapping.get(match.lower())
        # preserve the original casing of the first character
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_sentence = contractions_pattern.sub(expand_match, sentence)
    return expanded_sentence

expanded_corpus = [expand_contractions(sentence, contractions_dict)
                   for sentence in cleaned_corpus]
print(expanded_corpus)

Output:

['The brown fox was not that quick and he could not win the race', 'Hey that is a great deal! I just bought a phone for 199', 'You will learn a lot in the book. Python is an amazing language!']
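contractions_dict is imported from the contractions.py module that ships with the book's companion code. If that file is not available, a small hand-written mapping (the subset below is purely illustrative) works with the same expand_contractions function:

# a minimal, hand-written subset of a contraction mapping (illustrative only)
mini_contraction_map = {
    "isn't": "is not",
    "wasn't": "was not",
    "couldn't": "could not",
    "that's": "that is",
    "you'll": "you will"
}

expanded = expand_contractions("You'll see that it wasn't hard", mini_contraction_map)
print(expanded)  # -> You will see that it was not hard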

2.6 Case Conversion

# case conversion    
print(corpus[0].lower())
print(corpus[0].upper())

Output:

the brown fox wasn't that quick and he couldn't win the race
THE BROWN FOX WASN'T THAT QUICK AND HE COULDN'T WIN THE RACE

2.7 Removing Stopwords

# removing stopwords
def remove_stopwords(tokens):
    stopword_list = nltk.corpus.stopwords.words('english')
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    return filtered_tokens

expanded_corpus_tokens = [tokenize_text(text)
                          for text in expanded_corpus]
filtered_list_3 = [[remove_stopwords(tokens)
                    for tokens in sentence_tokens]
                   for sentence_tokens in expanded_corpus_tokens]
print(filtered_list_3)

Output:

[[['The', 'brown', 'fox', 'quick', 'could', 'win', 'race']], [['Hey', 'great', 'deal', '!'], ['I', 'bought', 'phone', '199']], [['You', 'learn', 'lot', 'book', '.'], ['Python', 'amazing', 'language', '!']]]
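Note that the stopword list is all lowercase, so capitalized tokens such as 'The', 'I', and 'You' survive in the output above. One option (a small sketch, not part of the book's code) is to lowercase the tokens before filtering:

# lowercase tokens before filtering so 'The', 'I', 'You' are removed as well
filtered_lower = [[remove_stopwords([token.lower() for token in tokens])
                   for tokens in sentence_tokens]
                  for sentence_tokens in expanded_corpus_tokens]
print(filtered_lower)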

2.8 Correcting Words

Removing repeated characters:

# removing repeated characters
sample_sentence = 'My schooool is realllllyyy amaaazingggg'
sample_sentence_tokens = tokenize_text(sample_sentence)[0]

from nltk.corpus import wordnet

def remove_repeated_characters(tokens):
    # back-reference \2 matches a character repeated immediately after itself
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
    match_substitution = r'\1\2\3'

    def replace(old_word):
        # stop as soon as the word is a valid WordNet entry
        if wordnet.synsets(old_word):
            return old_word
        # drop one repeated character and recurse until nothing changes
        new_word = repeat_pattern.sub(match_substitution, old_word)
        return replace(new_word) if new_word != old_word else new_word

    correct_tokens = [replace(word) for word in tokens]
    return correct_tokens

print(remove_repeated_characters(sample_sentence_tokens))

Output:

['My', 'school', 'is', 'really', 'amazing']

2.9 Stemming

2.9.1 PorterStemmer

# porter stemmer
from nltk.stem import PorterStemmer
ps = PorterStemmer()

print(ps.stem('jumping'), ps.stem('jumps'), ps.stem('jumped'))

print(ps.stem('lying'))

print(ps.stem('strange'))

Output:

jump jump jump
lie
strang

2.9.2 LancasterStemmer

# lancaster stemmer
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()

print(ls.stem('jumping'), ls.stem('jumps'), ls.stem('jumped'))

print (ls.stem('lying'))

print (ls.stem('strange'))

Output:

jump jump jump
lying
strange

2.9.3 RegexpStemmer

# regex stemmer
from nltk.stem import RegexpStemmer
rs = RegexpStemmer('ing$|s$|ed$', min=4)

print( rs.stem('jumping'), rs.stem('jumps'), rs.stem('jumped'))

print (rs.stem('lying'))

print (rs.stem('strange'))

Output:

jump jump jump
ly
strange

2.9.4 SnowballStemmer

# snowball stemmer
from nltk.stem import SnowballStemmer
ss = SnowballStemmer("german")

print('Supported Languages:', SnowballStemmer.languages)

# stemming German words: autobahnen -> autobahn ('highways' -> 'highway')
print(ss.stem('autobahnen'))

# springen -> spring ('to jump' -> 'jump')
print(ss.stem('springen'))

Output:

Supported Languages: ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
autobahn
spring

2.10 Lemmatization

# lemmatization
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

# lemmatize nouns
print( wnl.lemmatize('cars', 'n'))
print (wnl.lemmatize('men', 'n'))

# lemmatize verbs
print (wnl.lemmatize('running', 'v'))
print (wnl.lemmatize('ate', 'v'))

# lemmatize adjectives
print (wnl.lemmatize('saddest', 'a'))
print (wnl.lemmatize('fancier', 'a'))

# ineffective lemmatization
print (wnl.lemmatize('ate', 'n'))
print (wnl.lemmatize('fancier', 'v'))

Output:

car
men
run
eat
sad
fancy
ate
fancier
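
The last two calls show that lemmatization is only effective when the correct part of speech is supplied. A common follow-up, sketched below (not from the book excerpt; it additionally requires NLTK's POS tagger models, and the helper penn_to_wordnet is my own name), is to derive the tag automatically with nltk.pos_tag and map it to a WordNet POS:

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet

# map Penn Treebank tags to WordNet POS constants, defaulting to noun
def penn_to_wordnet(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

tokens = word_tokenize('The striped bats were hanging on their feet')
lemmas = [wnl.lemmatize(token, penn_to_wordnet(tag))
          for token, tag in pos_tag(tokens)]
print(lemmas)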
Author: CinKate
Link: http://renxingkai.github.io/2019/03/28/nltk-tokenize/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stated otherwise.