============
postagger_zh
============

Since Chinese sentences are not separated by spaces, the sentences should be tokenized before POS tagging. However, most Chinese POS taggers do not provide word segmentation functionality, which makes them hard to use directly. This Python library solves the problem by wrapping segmentation and tagging tools together:

- Chinese word segmentation by LingPipe: http://alias-i.com/lingpipe/
- Chinese part-of-speech tagger trained by nltk-trainer: https://github.com/japerk/nltk-trainer

Both the segmentation and POS tagging tools are trained on the dataset released by Academia Sinica in Taiwan. Before tagging, the sentences are tokenized by the segmenter. The tagging results are stored in a list whose elements are tuples of word tokens and their associated POS tags. The full list of Chinese POS tags is available at http://db1x.sinica.edu.tw/kiwi/mkiwi/modern_e_wordtype.html

==========
Dependency
==========

- NLTK: http://www.nltk.org/

============
Installation
============

python setup.py install

=============
Sample script
=============

# -*- coding: utf-8 -*-
from postagger_zh.postagger import POSTagger

tagger = POSTagger()
text = u'英國明年也開放台灣青年前往打工度假,且可打工兩年,\
不少在學的年輕人認為,因為英國物價及生活費太高,\
擔心打工甚至無法維持生活費,興趣缺缺;\
但幾位三十歲以下的上班族則表示心動,有機會不排除前往。'
for token, tag in tagger.tag(text):
    print token, tag

=============
Sample output
=============

英國 Nca
明年 Ndaba
也 Dbb
開放 VH11
台灣 Nca
...
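
=======================
Working with the output
=======================

Because tagger.tag() returns a plain list of (token, tag) tuples, the output can be filtered with ordinary list operations. The sketch below is only an illustration: it assumes, as the Nca tags in the sample output above suggest, that noun tags in the Academia Sinica tagset begin with "N", and it keeps just the nominal tokens of a short sentence.

# -*- coding: utf-8 -*-
from postagger_zh.postagger import POSTagger

tagger = POSTagger()
text = u'英國明年也開放台灣青年前往打工度假。'
# tag() returns [(token, tag), ...], so a list comprehension is enough.
# Assumption: noun tags in the Academia Sinica tagset start with 'N',
# as with the 'Nca' tags shown in the sample output above.
nouns = [token for token, tag in tagger.tag(text) if tag.startswith('N')]
print ' / '.join(nouns)

The same pattern works for any other tag prefix, for example the "V" prefix seen on 開放 (VH11) in the sample output.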