GitHub - tchaikov/open-gram: collect lexicon and build n-gram dataset for NLP in Chinese

open-gram

open-gram is a project that aims to collect a lexicon and build an n-gram dataset for Chinese NLP. It tries to leverage existing open source resources such as crfpp and CC-CEDICT.

open-gram consists of four parts:

corpus collection
segmentation
(new) word extraction
n-gram info counting
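
As a rough illustration of the last part, n-gram counting over already-segmented sentences could look like the sketch below. This is not code from the repository: the function name `count_ngrams`, the `<s>` / `</s>` boundary markers, and the sample sentences are all assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List


def count_ngrams(sentences: Iterable[List[str]], n: int = 2) -> Counter:
    """Count word n-grams over sentences that are already split into words.

    Hypothetical helper for illustration; not part of open-gram itself.
    """
    counts: Counter = Counter()
    for words in sentences:
        # Pad with boundary markers (an assumption of this sketch) so that
        # words at the start and end of a sentence also get context.
        padded = ["<s>"] * (n - 1) + list(words) + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


# Made-up sample input: two segmented sentences.
segmented = [["我", "喜欢", "自然", "语言"], ["我", "喜欢", "中文"]]
bigrams = count_ngrams(segmented, n=2)
```

Here `bigrams[("我", "喜欢")]` counts how often the two words occur next to each other; the same function handles unigrams or trigrams by changing `n`.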

corpus collection

crawl Chinese web sites using scrapy and grab the HTML body of their pages
preprocess the pages: detect the encoding, remove HTML tags and other content we are not interested in, and split the text into sentences
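
The preprocessing steps above can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not the project's actual code: the trial-decoding order in `CANDIDATE_ENCODINGS`, the helper names, and the set of sentence-ending punctuation are all hypothetical choices, and a dedicated encoding detector would likely be more robust than trial decoding.

```python
import re
from html.parser import HTMLParser
from typing import List

# Assumption: try UTF-8 first, then GB18030 (a superset of GBK/GB2312).
# This is a crude heuristic, not real encoding detection.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")


def decode_page(raw: bytes) -> str:
    """Decode raw page bytes by trying each candidate encoding in turn."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what we can and mark undecodable bytes.
    return raw.decode("utf-8", errors="replace")


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html: str) -> str:
    """Remove tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def split_sentences(text: str) -> List[str]:
    """Split after common Chinese and Western sentence-ending punctuation."""
    pieces = re.findall(r"[^。！？!?]+[。！？!?]?", text)
    return [p.strip() for p in pieces if p.strip()]


# Made-up sample page, encoded as GB18030 to exercise the fallback.
raw_page = ("<html><body><p>今天天气很好。我们去公园吧！</p>"
            "<script>var x = 1;</script></body></html>").encode("gb18030")
sentences = split_sentences(strip_html(decode_page(raw_page)))
```

Each stage is a separate function so that any one of them can be replaced, for example by swapping in a proper charset detector, without touching the others.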

segmentation

there are two ways to segment tokens into words:

tagging
matching
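
To make the two approaches concrete, here is a hedged sketch of each. For matching, it shows forward maximum matching against a lexicon, which is one common dictionary-based method and not necessarily the one open-gram uses. For tagging, it shows only how per-character labels can be turned back into words; the B/M/E/S label scheme is an assumption, and the labels themselves would come from a trained sequence tagger such as a CRF model. All function names and the sample lexicon are hypothetical.

```python
from typing import Iterable, List, Set


def forward_max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy dictionary matching: at each position take the longest
    lexicon entry, falling back to a single character."""
    words: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def tags_to_words(chars: str, tags: Iterable[str]) -> List[str]:
    """Rebuild words from per-character tags.

    Assumed scheme: B = begin, M = middle, E = end, S = single-character word.
    """
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Flush a trailing partial word if the tag sequence ends mid-word.
        words.append(current)
    return words


# Made-up lexicon and sentence for illustration.
lexicon = {"研究", "研究生", "生命", "起源"}
matched = forward_max_match("研究生命起源", lexicon)
tagged = tags_to_words("研究生命起源", "BEBEBE")
```

On this input the greedy matcher picks the longer entry 研究生 first and so yields 研究生 / 命 / 起源, while the tag sequence shown encodes 研究 / 生命 / 起源. The contrast illustrates why a statistical tagger can be a useful complement to pure dictionary matching.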

