Stop word setting has no effect! #58

chenggang0815 · 2017-08-14T05:52:32Z

1. Environment:
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] devtools_1.13.3 jiebaR_0.9.1 jiebaRD_0.1 ggplot2_2.2.1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 withr_2.0.0 digest_0.6.12 assertthat_0.1 R6_2.2.0 grid_3.4.1
[7] plyr_1.8.4 gtable_0.2.0 git2r_0.19.0 scales_0.4.1 httr_1.2.1 curl_2.8.1
[13] lazyeval_0.2.0 tools_3.4.1 munsell_0.4.3 compiler_3.4.1 colorspace_1.3-2 memoise_1.1.0
[19] tibble_1.2

2. I ran into a problem while reproducing the following part of the Chinese word segmentation documentation (https://qinwenfeng.com/jiebaR/section-3.html#-workerstop_word):

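3.0.5 Adding stop words worker(stop_word = “…”)

!!!! For segmentation, do not modify the stop word file that is loaded by default, i.e. jiebaR::STOPPATH. Use a custom stop word path instead.

The directory contains a stop.txt file with the following content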

readLines("stop.txt")
#> [1] "停止"

分词器 = worker(stop_word = "stop.txt")
segment("这是一个停止词", 分词器)
#> [1] "这是" "一个" "词"

3. Below is my code. stop.txt contains a single word, is saved as UTF-8, and sits in my working directory.

getwd()
[1] "D:/R"
readLines('D:/R/stop.txt',encoding = "UTF-8")
[1] "停止"
分词器 <- worker(stop_word = 'D:/R/stop.txt')
segment("这是一个停止词",分词器)
[1] "这是" "一个" "停止" "词"

However, the stop word was not removed. Why is that?

chenggang0815 · 2017-08-14T06:02:37Z

Update:
I tried putting several words in stop.txt:

getwd()
[1] "D:/R"
readLines('D:/R/stop.txt',encoding = 'UTF-8')
[1] "停止" "停止" "这是" "一个"
分词器 <- worker(stop_word = 'stop.txt')
segment("这是一个停止词",分词器)
[1] "词"
The stop words took effect, so I then wrote "停止" twice in the txt file:
getwd()
[1] "D:/R"
readLines('D:/R/stop.txt',encoding = 'UTF-8')
[1] "停止" "停止"
分词器 <- worker(stop_word = 'stop.txt')
segment("这是一个停止词",分词器)
[1] "这是" "一个" "词"
That worked. After deleting one of the two "停止" entries, it stopped working again:
getwd()
[1] "D:/R"
readLines('D:/R/stop.txt',encoding = 'UTF-8')
[1] "停止"
分词器 <- worker(stop_word = 'stop.txt')
segment("这是一个停止词",分词器)
[1] "这是" "一个" "停止" "词"

qinwf · 2017-08-15T00:45:21Z

Which line endings does the file use? I suggest LF line endings; you can check this with a text editor.
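
Besides a text editor, the line endings can also be checked from R itself. The following is a minimal sketch using base R only; `detect_line_endings` is a hypothetical helper name, not a jiebaR function, and the demo file is a throwaway temp file rather than the real stop.txt.

```r
# Hypothetical helper (not part of jiebaR): count the line endings in a file.
detect_line_endings <- function(path) {
  # Read raw bytes so that R does not normalise the line endings for us.
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  cr <- which(bytes == as.raw(0x0d))   # positions of "\r"
  lf <- which(bytes == as.raw(0x0a))   # positions of "\n"
  crlf <- sum((cr + 1L) %in% lf)       # "\r" immediately followed by "\n"
  c(CRLF = crlf, LF = length(lf) - crlf, CR = length(cr) - crlf)
}

# Demo on a temp file with one CRLF line and one LF line.
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("stop\r\nword\n"), tmp)
counts <- detect_line_endings(tmp)
print(counts)
```

A non-zero CRLF count for the stop word file would point to Windows-style line endings, e.g. `detect_line_endings("D:/R/stop.txt")`.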

chenggang0815 · 2017-08-16T01:25:59Z

It was a line-ending problem, thanks.
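
For anyone hitting the same thing, one possible way to rewrite a stop word file with LF line endings from R is sketched below. `to_lf` is a hypothetical helper name, not part of jiebaR, and the sketch assumes the file is UTF-8 text; the thread does not say how the file was actually fixed.

```r
# Hypothetical helper (not part of jiebaR): rewrite a text file with LF endings.
to_lf <- function(path, out = path) {
  # readLines() accepts LF, CRLF and CR as end-of-line markers and drops them.
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  # Open in binary mode so that Windows does not turn "\n" back into "\r\n".
  con <- file(out, open = "wb")
  on.exit(close(con))
  writeLines(enc2utf8(lines), con, sep = "\n", useBytes = TRUE)
  invisible(out)
}

# Demo on a temp file that starts out with CRLF line endings.
tmp_crlf <- tempfile(fileext = ".txt")
writeBin(charToRaw("stop\r\nword\r\n"), tmp_crlf)
to_lf(tmp_crlf)
fixed <- readBin(tmp_crlf, what = "raw", n = file.info(tmp_crlf)$size)
```

After the conversion, the worker would need to be created again, e.g. `worker(stop_word = "D:/R/stop.txt")`, so that the rewritten file is loaded.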

chenggang0815 closed this as completed Aug 16, 2017

chenggang0815 reopened this Aug 16, 2017
