jieba-hs

TODO:

There are serious performance issues. See Main.hs for the benchmarking program; profiling still needs to be done.
The most likely cause is my own clumsy handling of lists, in particular O(n^2) concatenation.
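
For illustration, here is how O(n^2) concatenation typically arises in Haskell and two common ways around it. This is a generic sketch, not code from jieba-hs; the names concatSlow, concatFast, DList, fromList and concatDL are made up for the example, and whether this pattern is what actually slows the library down is something only profiling can confirm.

```haskell
import Data.List (foldl')

-- Left-nested (++): every step walks the whole accumulated list again,
-- so the total work grows quadratically with the length of the result.
concatSlow :: [[a]] -> [a]
concatSlow = foldl' (++) []

-- Right-nested (++): each chunk is traversed once, so the work is linear.
concatFast :: [[a]] -> [a]
concatFast = foldr (++) []

-- A difference list represents a list as "the function that prepends it".
-- Appending on the right is function composition, which is O(1).
type DList a = [a] -> [a]

fromList :: [a] -> DList a
fromList xs = (xs ++)

-- Accumulate left to right (as a segmenter loop naturally would) and only
-- materialise the list at the end by applying the builder to [].
concatDL :: [[a]] -> [a]
concatDL chunks = foldl' (\acc c -> acc . fromList c) id chunks []
```

All three functions return the same list; only the cost differs. If results are accumulated left to right, the difference-list form keeps that shape without the quadratic blow-up.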

In Beta!

Missing features include TF-IDF, modification of the dictionary at runtime, and async. Those are only the glaringly obvious ones; a lot more is missing.

"Jieba-hs" (結巴-hs) is a Haskell implementation of "Jieba" (結巴), the word segmentation library for Chinese.

Usage

The dictionary format is exactly the same as that of the original jieba. The dictionaries are in data/*. The HMM model is in hmm.model; it is borrowed from cppjieba with a few slight modifications.
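
To make the dictionary format concrete: in the upstream jieba dictionaries each line holds a word, a frequency, and usually a part-of-speech tag, separated by spaces. The sketch below parses such lines. It is not the parser jieba-hs uses (dictFromContents may work differently); DictEntry, parseDictLine and parseDict are hypothetical names, and the frequencies in the usage note are invented.

```haskell
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

-- One dictionary line: word, frequency, optional part-of-speech tag.
data DictEntry = DictEntry
  { entryWord :: String
  , entryFreq :: Int
  , entryTag  :: Maybe String
  } deriving (Show, Eq)

-- Parse a single line; malformed lines yield Nothing.
parseDictLine :: String -> Maybe DictEntry
parseDictLine line = case words line of
  [w, f]    -> DictEntry w <$> readMaybe f <*> pure Nothing
  [w, f, t] -> DictEntry w <$> readMaybe f <*> pure (Just t)
  _         -> Nothing

-- Parse a whole file's contents, silently skipping malformed lines.
parseDict :: String -> [DictEntry]
parseDict = mapMaybe parseDictLine . lines
```

For example, parseDictLine "大厦 3 n" gives Just (DictEntry "大厦" 3 (Just "n")), while a line whose second field is not a number gives Nothing.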

import System.IO
import Dictionary
import Jieba
import Data.List (intercalate)

main :: IO ()
main = do
    -- Load the word dictionary from a file
    contents <- hGetContents =<< openFile "dict.txt.small" ReadMode
    let dict = dictFromContents contents
    -- Load the HMM model (an IO action, so bind it with <- rather than let)
    hmmd <- readHmmDict "data/hmm.model"
    let snt = "他来到了网易杭研大厦"
    let result = cutNoHMM dict snt         -- dictionary only
    let result' = cutHMM dict hmmd snt     -- dictionary plus HMM
    let result'' = cutAll dict snt         -- every possible cut
    putStrLn $ intercalate "/" result
    putStrLn $ intercalate "/" result'
    putStrLn $ intercalate "/" result''

*Main> main
他/来到/了/网易/杭/研/大厦 -- No HMM
他/来到/了/网易/杭研/大厦 -- With HMM
他/来/来到/到/了/网/网易/易/杭/研/大/大厦/厦 -- All possible cuts
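
The "all possible cuts" line can be read as: at every position in the sentence, list every dictionary word that starts there. The sketch below reproduces that behaviour with a toy word list. It is only an illustration of the idea, not the cutAll implementation from jieba-hs; cutAllSketch, prefixWords and toyDict are hypothetical names, and the linear search through a list is deliberately naive.

```haskell
import Data.List (inits, tails)

-- Toy dictionary containing just the words needed for the sample sentence.
toyDict :: [String]
toyDict =
  [ "他", "来", "来到", "到", "了", "网", "网易"
  , "易", "杭", "研", "大", "大厦", "厦" ]

-- Every dictionary word that is a prefix of the input, shortest first.
prefixWords :: [String] -> String -> [String]
prefixWords dict s = [p | p <- drop 1 (inits s), p `elem` dict]

-- For each start position, emit all dictionary words beginning there.
-- A character that starts no dictionary word is emitted on its own.
cutAllSketch :: [String] -> String -> [String]
cutAllSketch dict s = concatMap wordsAt (init (tails s))
  where
    wordsAt suffix = case prefixWords dict suffix of
      [] -> [take 1 suffix]
      ws -> ws
```

With this toy dictionary, cutAllSketch toyDict "他来到了网易杭研大厦" yields the same thirteen pieces as the third output line above. "杭研" does not appear because it is not in the dictionary; in the HMM output it comes from the model rather than from a dictionary lookup.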

TODO

TF-IDF
Use a Trie?
QuickCheck unit tests
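
As a rough idea of what the Trie item could look like, here is a minimal character trie with a lookup that returns every dictionary word that is a prefix of the input. This is a sketch under assumptions, not a design decision for jieba-hs: Trie, emptyTrie, insertWord, fromWords and prefixesIn are hypothetical names, and it relies on Data.Map.Strict from the containers package.

```haskell
import qualified Data.Map.Strict as Map

-- A node records whether the path to it spells a word, plus its children.
data Trie = Trie
  { isWord   :: Bool
  , children :: Map.Map Char Trie
  }

emptyTrie :: Trie
emptyTrie = Trie False Map.empty

-- Insert one word, creating nodes along its path as needed.
insertWord :: String -> Trie -> Trie
insertWord []     t = t { isWord = True }
insertWord (c:cs) t =
  let child = Map.findWithDefault emptyTrie c (children t)
  in t { children = Map.insert c (insertWord cs child) (children t) }

fromWords :: [String] -> Trie
fromWords = foldr insertWord emptyTrie

-- All dictionary words that are prefixes of the input, shortest first.
-- The walk stops as soon as the next character has no child node.
prefixesIn :: Trie -> String -> [String]
prefixesIn trie = go trie []
  where
    go _ _ [] = []
    go t acc (c:cs) = case Map.lookup c (children t) of
      Nothing -> []
      Just t' ->
        let acc' = c : acc
            rest = go t' acc' cs
        in if isWord t' then reverse acc' : rest else rest
```

For example, prefixesIn (fromWords ["来", "来到", "大厦"]) "来到了" returns ["来", "来到"]. One walk down the trie finds all words starting at a position, instead of testing each candidate substring against the dictionary separately.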
