=====================
Sentence Tokenization
=====================
>>> para = "Hello. My name is Jacob. Today you'll be learning NLTK."
>>> from nltk import tokenize
>>> tokenize.sent_tokenize(para)
['Hello.', 'My name is Jacob.', "Today you'll be learning NLTK."]
>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sents = tokenizer.tokenize(para)
>>> sents
['Hello.', 'My name is Jacob.', "Today you'll be learning NLTK."]
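The punkt data ships with models for several other languages as well. A
minimal sketch, assuming the spanish.pickle resource is installed and using
a made-up sample sentence:
>>> spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
>>> spanish_tokenizer.tokenize('Hola. Me llamo Jacob.')
['Hola.', 'Me llamo Jacob.']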
=================
Word Tokenization
=================
>>> sent = sents[2]
>>> tokenize.word_tokenize(sent)
['Today', 'you', "'ll", 'be', 'learning', 'NLTK', '.']
>>> tokenize.wordpunct_tokenize(sent)
['Today', 'you', "'", 'll', 'be', 'learning', 'NLTK', '.']
>>> from nltk.tokenize import WordPunctTokenizer
>>> tokenizer = WordPunctTokenizer()
>>> tokenizer.tokenize(sent)
['Today', 'you', "'", 'll', 'be', 'learning', 'NLTK', '.']
>>> tokenizer = tokenize.PunktWordTokenizer()
>>> tokenizer.tokenize(sent)
['Today', 'you', "'ll", 'be', 'learning', 'NLTK.']
>>> tokenizer = tokenize.SpaceTokenizer()
>>> tokenizer.tokenize(sent)
['Today', "you'll", 'be', 'learning', 'NLTK.']
Choosing a Word Tokenizer
-------------------------
Your choice of word tokenizer depends on the steps further down the pipeline.
There's no one right answer; it's context- and pipeline-dependent.
Do you need a normalized/canonical form?
How much does punctuation matter, and in what way?
What do the POS tagger and/or classifier expect?
Are you doing further transformations?
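As an illustration of why the choice matters downstream, compare what a
simple case-normalization step sees from two of the tokenizers above
(outputs assume the same sent as before):
>>> [w.lower() for w in tokenize.word_tokenize(sent)]
['today', 'you', "'ll", 'be', 'learning', 'nltk', '.']
>>> [w.lower() for w in tokenize.SpaceTokenizer().tokenize(sent)]
['today', "you'll", 'be', 'learning', 'nltk.']
A vocabulary lookup for 'nltk' succeeds with the first result but would miss
'nltk.' in the second.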
==========================================
Tokenizing Words using Regular Expressions
==========================================
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer("[\w']+")
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']
>>> from nltk.tokenize import regexp_tokenize
>>> regexp_tokenize("Can't is a contraction.", r"[\w']+")
["Can't", 'is', 'a', 'contraction']
>>> tokenizer = RegexpTokenizer(r'\s+', gaps=True)
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction.']