Custom tokenization setting not taking effect #135

Open
sam6513 opened this issue May 9, 2022 · 1 comment


sam6513 commented May 9, 2022

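Problem description: after turning off number and quantifier recognition with "enable_number_quantifier_recognize": false, the custom analyzer still splits numbers out as separate tokens.

Environment: elasticsearch and hanlp are both version 7.10.2

Test procedure:

# Created my own analyzer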
PUT hanlp_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "52dzhp_hanlp_analyzer": {
          "type": "hanlp",
          "enable_custom_config": true,
          "enable_stop_dictionary": true,
          "enable_number_quantifier_recognize": false,
          "enable_custom_dictionary": true,
          "enable_place_recognize": false
        }
      }
    }
  }
}

# POSTed text that exactly matches the keyword as a test; no problem

# Test condition 1
POST hanlp_index/_analyze
{
  "text": "声环境质量标准GB30962008",
  "analyzer": "52dzhp_hanlp_analyzer"
}

# Test result 1
{
  "tokens": [
    { "token": "声环境质量标准GB30962008", "start_offset": 0, "end_offset": 17, "type": "eswi", "position": 0 }
  ]
}
--------- test passed

# Appended the digit 1 to the end of the text as a test

# Test condition 2
POST hanlp_index/_analyze
{
  "text": "声环境质量标准GB309620081",
  "analyzer": "52dzhp_hanlp_analyzer"
}

# In the result the number is split out as its own token, even though I have already set "enable_number_quantifier_recognize": false
# Test result 2
{
  "tokens": [
    { "token": "声环境质量标准", "start_offset": 0, "end_offset": 7, "type": "esw", "position": 0 },
    { "token": "GB", "start_offset": 7, "end_offset": 9, "type": "nx", "position": 1 },
    { "token": "309620081", "start_offset": 9, "end_offset": 18, "type": "m", "position": 2 }
  ]
}
--------- test failed
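For anyone trying to reproduce this, the two responses above can be compared offline. The following is only a rough sketch: the helper names (`summarize_tokens`, `numeral_tokens`, `is_contiguous`) are made up for illustration, and it assumes that the token type "m" is HanLP's numeral tag. It parses the two `_analyze` response bodies copied from this report, lists the tokens tagged "m", and checks that the token offsets tile the input text without gaps.

```python
import json

# Response bodies copied from test results 1 and 2 in this report.
RESPONSE_1 = """{ "tokens" : [
  { "token" : "声环境质量标准GB30962008", "start_offset" : 0, "end_offset" : 17, "type" : "eswi", "position" : 0 }
] }"""

RESPONSE_2 = """{ "tokens" : [
  { "token" : "声环境质量标准", "start_offset" : 0, "end_offset" : 7, "type" : "esw", "position" : 0 },
  { "token" : "GB", "start_offset" : 7, "end_offset" : 9, "type" : "nx", "position" : 1 },
  { "token" : "309620081", "start_offset" : 9, "end_offset" : 18, "type" : "m", "position" : 2 }
] }"""


def summarize_tokens(response_text):
    """Return (token, type, start_offset, end_offset) tuples from an _analyze response body."""
    body = json.loads(response_text)
    return [
        (t["token"], t["type"], t["start_offset"], t["end_offset"])
        for t in body["tokens"]
    ]


def numeral_tokens(response_text, numeral_type="m"):
    """Return tokens whose type equals numeral_type (assumption: "m" marks numerals)."""
    return [tok for tok, typ, _, _ in summarize_tokens(response_text) if typ == numeral_type]


def is_contiguous(response_text, text):
    """Check that the token offsets cover the input text with no gaps or overlaps."""
    pos = 0
    for tok, _, start, end in summarize_tokens(response_text):
        if start != pos or text[start:end] != tok:
            return False
        pos = end
    return pos == len(text)


if __name__ == "__main__":
    for name, resp in (("test 1", RESPONSE_1), ("test 2", RESPONSE_2)):
        print(name, "tokens tagged m:", numeral_tokens(resp))
```

If the assumption about "m" holds, a non-empty list for test 2 only confirms that the number was emitted as a separate numeral token; it does not show which setting or dictionary entry caused the split.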

@KennFalcon I would appreciate your advice on this. Thanks.


chunpat commented May 18, 2023

Did you add 声环境质量标准GB30962008 as a custom dictionary entry? Did you manage to solve the problem in the end?
