Stop-word setting has no effect! #58

Open
chenggang0815 opened this issue Aug 14, 2017 · 3 comments

@chenggang0815

1. Environment:
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] devtools_1.13.3 jiebaR_0.9.1 jiebaRD_0.1 ggplot2_2.2.1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 withr_2.0.0 digest_0.6.12 assertthat_0.1 R6_2.2.0 grid_3.4.1
[7] plyr_1.8.4 gtable_0.2.0 git2r_0.19.0 scales_0.4.1 httr_1.2.1 curl_2.8.1
[13] lazyeval_0.2.0 tools_3.4.1 munsell_0.4.3 compiler_3.4.1 colorspace_1.3-2 memoise_1.1.0
[19] tibble_1.2

2. I ran into a problem when reproducing the following part of the Chinese word-segmentation documentation (https://qinwenfeng.com/jiebaR/section-3.html#-workerstop_word):

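3.0.5 Adding stop words: worker(stop_word = "…")

!!!! For segmentation, do not modify the stop-word file that is loaded by default, i.e. jiebaR::STOPPATH. Use a custom stop-word path instead.

The directory contains a stop.txt file with the following content: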

readLines("stop.txt")
#> [1] "停止"

分词器 = worker(stop_word = "stop.txt")
segment("这是一个停止词", 分词器)
#> [1] "这是" "一个" "词"

3. Below is my code. stop.txt contains a single word, was saved as UTF-8, and sits in my working directory.

getwd()
[1] "D:/R"
readLines('D:/R/stop.txt',encoding = "UTF-8")
[1] "停止"
分词器 <- worker(stop_word = 'D:/R/stop.txt')
segment("这是一个停止词",分词器)
[1] "这是" "一个" "停止" "词"

The stop word was not removed, though. Why is that?

@chenggang0815
Author

Update:
I tried putting several words in stop.txt:

getwd()
[1] "D:/R"
readLines('D:/R/stop.txt',encoding = 'UTF-8')
[1] "停止" "停止" "这是" "一个"
分词器 <- worker(stop_word = 'stop.txt')
segment("这是一个停止词",分词器)
[1] "词"
The stop words took effect, so next I wrote "停止" twice in the txt file:
getwd()
[1] "D:/R"
readLines('D:/R/stop.txt',encoding = 'UTF-8')
[1] "停止" "停止"
分词器 <- worker(stop_word = 'stop.txt')
segment("这是一个停止词",分词器)
[1] "这是" "一个" "词"
That worked. After deleting one of the two "停止" entries, it stopped working again:
getwd()
[1] "D:/R"
readLines('D:/R/stop.txt',encoding = 'UTF-8')
[1] "停止"
分词器 <- worker(stop_word = 'stop.txt')
segment("这是一个停止词",分词器)
[1] "这是" "一个" "停止" "词"

@qinwf
Owner

qinwf commented Aug 15, 2017

Which line endings does the file use? I recommend LF line endings; you can check this with a text editor.
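If a text editor is not at hand, the line endings can also be inspected from R by looking at the raw bytes, since readLines() strips the terminators and so hides the difference. The following is only a minimal sketch in base R; detect_line_endings is a hypothetical helper name, not a jiebaR function, and the demo file is a temporary stand-in for stop.txt.

```r
# Hypothetical helper (not part of jiebaR): count the kinds of line
# endings found in a file by inspecting its raw bytes.
detect_line_endings <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  cr <- which(bytes == as.raw(0x0d))  # positions of carriage returns
  lf <- which(bytes == as.raw(0x0a))  # positions of line feeds
  crlf <- sum((cr + 1L) %in% lf)      # CR immediately followed by LF
  list(
    crlf    = crlf,                   # Windows-style endings
    lf_only = length(lf) - crlf,      # Unix-style endings
    cr_only = length(cr) - crlf       # lone carriage returns
  )
}

# Demo on a temporary file that mimics a one-word stop.txt saved with
# CRLF endings ("\u505c\u6b62" is the word used in the issue).
demo_path <- tempfile(fileext = ".txt")
writeBin(charToRaw("\u505c\u6b62\r\n"), demo_path)
str(detect_line_endings(demo_path))
```

A non-zero crlf count means the file uses Windows-style line endings, which is what the recommendation above suggests avoiding.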

@chenggang0815
Author

It was a line-ending problem. Thanks!
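For anyone hitting the same problem, one way to normalise a stop-word file to LF without leaving R is sketched below. It assumes the file is UTF-8 (where the byte 0x0d can only ever be a carriage return); convert_to_lf is a hypothetical helper name, not part of jiebaR, and the thread does not say how the original poster actually converted the file.

```r
# Hypothetical helper (not part of jiebaR): rewrite a file in place so
# that every CRLF pair becomes a single LF. Assumes UTF-8 content.
convert_to_lf <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  cr <- which(bytes == as.raw(0x0d))
  lf <- which(bytes == as.raw(0x0a))
  # Drop only the CR bytes that are immediately followed by an LF.
  drop <- cr[(cr + 1L) %in% lf]
  if (length(drop) > 0) {
    bytes <- bytes[-drop]
  }
  writeBin(bytes, path)
  invisible(path)
}

# Demo on a temporary file standing in for stop.txt.
demo_path <- tempfile(fileext = ".txt")
writeBin(charToRaw("\u505c\u6b62\r\n"), demo_path)
convert_to_lf(demo_path)
# Inspect the bytes after conversion: no 0d should remain.
print(readBin(demo_path, what = "raw", n = file.info(demo_path)$size))
```

After converting the real file, the worker has to be created again, as in worker(stop_word = 'D:/R/stop.txt') above, so that the stop-word list is reloaded.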
