Asking the experts for help: my newly added custom_text_splitter.py has no effect. What could be the cause? #2504

278038551 · 2023-12-28T14:38:15Z

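Asking the experts for help: I added custom_text_splitter.py. When I debug that file on its own, the custom document-splitting method works, but when I run the whole project the documents are not split as expected. What could be the cause?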

I have already configured kb_config as follows:
"CustomTextSplitter": {
    "source": "huggingface",  # choose tiktoken to use the OpenAI method
    "tokenizer_name_or_path": "",
},

@278038551 · 2023-12-28T14:46:03Z

dosubot[bot]
bot Dec 28, 2023

🤖

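Hello, @278038551! I'm Dosu, a bot that can help you solve problems, answer questions, and even become a contributor. I'm here to help until a human maintainer arrives. Let's work through your problem together!

Your problem may come from the way the make_text_splitter function imports custom text splitters. The function first tries to import the splitter from the user-defined module 'text_splitter'. If that fails, it falls back to the 'langchain.text_splitter' module.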

try:  ## prefer the user-defined text_splitter
    text_splitter_module = importlib.import_module('text_splitter')
    TextSplitter = getattr(text_splitter_module, splitter_name)
except:  ## otherwise use langchain's text_splitter
    text_splitter_module = importlib.import_module('langchain.text_splitter')
    TextSplitter = getattr(text_splitter_module, splitter_name)
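
Because the `except` clause in the snippet discards the original error, it can be hard to tell whether the custom class was really picked up. The following is a minimal sketch of the same two-step lookup that also reports which module supplied the class. The function name `resolve_splitter_class` and its parameters are hypothetical and are not part of Langchain-Chatchat.

```python
import importlib


def resolve_splitter_class(splitter_name,
                           preferred_module="text_splitter",
                           fallback_module="langchain.text_splitter"):
    """Return (class, module_name), preferring the user-defined module."""
    try:
        module = importlib.import_module(preferred_module)
        return getattr(module, splitter_name), preferred_module
    except (ImportError, AttributeError) as exc:
        # Report why the preferred module was skipped instead of hiding it.
        print(f"{preferred_module}.{splitter_name} unavailable: {exc!r}")
    # Same fallback as the snippet above: look the name up in the second module.
    module = importlib.import_module(fallback_module)
    return getattr(module, splitter_name), fallback_module
```

Calling `resolve_splitter_class("CustomTextSplitter")` from the same working directory the project is started in should show either `text_splitter` as the source or the reason the lookup failed.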

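If the custom splitter cannot be found in the 'text_splitter' module when the whole project runs, the module may not be on the Python path. This can happen when the project is started from a different directory during debugging than during normal execution.

To fix this, make sure the 'text_splitter' module is on the Python path when the project runs: add the directory that contains the 'text_splitter' module to the Python path before starting the project, either by setting the PYTHONPATH environment variable or by modifying sys.path in a Python script.

If the problem persists, check the configuration in 'kb_config' to make sure the custom splitter is specified correctly.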
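
As a rough illustration of the sys.path option, the sketch below adds a directory to the front of the module search path only if it is not already there. The helper name `ensure_on_sys_path` is hypothetical, and the directory you pass should be the one that contains the 'text_splitter' module in your checkout.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(directory):
    """Put `directory` at the front of sys.path if it is not already listed."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        # Front of the list, so this directory wins over same-named modules elsewhere.
        sys.path.insert(0, resolved)
    return resolved
```

The call has to run before the first import of 'text_splitter', for example at the top of the startup script.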

Source code reference:

server/knowledge_base/utils.py

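Separately, regarding the implementation of the 'ChineseRecursiveTextSplitter' class and the '_split_text_with_regex_from_end' function: I could not find the implementation of the 'make_text_splitter' function in the Langchain-Chatchat codebase context I was given, so I cannot explain or provide code for 'make_text_splitter' based on that context.

Source code reference: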

text_splitter/chinese_recursive_text_splitter.py

Sources

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

2 replies

278038551 Dec 29, 2023
Author

Thanks, Dosu bot. How should a custom text_splitter be configured in the kb_config file?

dosubot[bot] bot Dec 29, 2023

🤖

Based on the information you've provided, you have already taken some of the right steps towards integrating your custom text splitter into the Langchain-Chatchat project. A few more steps are needed for custom_text_splitter.py to work across the whole project.

Firstly, make sure custom_text_splitter.py lives somewhere the project can import it. The snippet quoted earlier in this thread loads splitters with importlib.import_module('text_splitter'), which is an absolute import by module name rather than a relative import, and then looks the class up with getattr. The class therefore has to be reachable as an attribute of the 'text_splitter' module, or the loader has to be changed to import your own module.
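
One thing worth checking here, assuming 'text_splitter' is a package directory (the thread cites text_splitter/chinese_recursive_text_splitter.py): a `getattr` on a package only finds names that its `__init__.py` binds, so a class that sits in a submodule is not visible until it is re-exported. The sketch below demonstrates this general Python behaviour with a throwaway package named `demo_splitters`; that package name and both helper functions are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root, reexport):
    """Create a throwaway package `demo_splitters` with one submodule."""
    pkg = Path(root) / "demo_splitters"
    pkg.mkdir()
    (pkg / "custom_text_splitter.py").write_text(
        "class CustomTextSplitter:\n    pass\n"
    )
    # Only the re-exporting variant binds the class at package level.
    init = "from .custom_text_splitter import CustomTextSplitter\n" if reexport else ""
    (pkg / "__init__.py").write_text(init)


def package_exposes(root, class_name):
    """Import `demo_splitters` from `root` and test for a package-level name."""
    sys.path.insert(0, str(root))
    try:
        # Forget any earlier copy so the package is imported fresh from `root`.
        sys.modules.pop("demo_splitters", None)
        sys.modules.pop("demo_splitters.custom_text_splitter", None)
        importlib.invalidate_caches()
        module = importlib.import_module("demo_splitters")
        return hasattr(module, class_name)
    finally:
        sys.path.remove(str(root))


with tempfile.TemporaryDirectory() as bare, tempfile.TemporaryDirectory() as full:
    build_demo_package(bare, reexport=False)
    build_demo_package(full, reexport=True)
    without_reexport = package_exposes(bare, "CustomTextSplitter")
    with_reexport = package_exposes(full, "CustomTextSplitter")
    print(without_reexport, with_reexport)  # prints: False True
```

If your project follows the same layout, adding a matching import line to the package's `__init__.py` may be all that is missing, but that depends on how your checkout is organised.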

Next, in your kb_config file, you need to specify the name of your custom text splitter in the TEXT_SPLITTER_NAME variable. In the text_splitter_dict dictionary, you need to add a new entry for your custom text splitter. The key should be the name of your custom text splitter and the value should be a dictionary containing the configuration for your custom text splitter. Here is an example of how you can do this:

TEXT_SPLITTER_NAME = "CustomTextSplitter"

text_splitter_dict = {
    "CustomTextSplitter": {
        "source": "custom",  # specify the source of the text splitter
        "tokenizer_name_or_path": "",  # specify the tokenizer name or path if needed
    },
    # other text splitters...
}
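
A quick sanity check along these lines can catch a mismatch between TEXT_SPLITTER_NAME and text_splitter_dict before the project starts. This is only a sketch: the function name `check_splitter_config` is hypothetical, and treating "source" and "tokenizer_name_or_path" as required keys is an assumption drawn from the entries shown in this thread. Note also that the only "source" values mentioned in the original poster's config comment are "huggingface" and "tiktoken"; whether "custom" is recognised is not confirmed here.

```python
def check_splitter_config(splitter_name, splitter_dict):
    """Return a list of human-readable problems; empty means none were found."""
    problems = []
    entry = splitter_dict.get(splitter_name)
    if entry is None:
        problems.append(f"{splitter_name!r} has no entry in text_splitter_dict")
        return problems
    # Assumed required keys, taken from the config entries quoted in this thread.
    for key in ("source", "tokenizer_name_or_path"):
        if key not in entry:
            problems.append(f"{splitter_name!r} entry is missing {key!r}")
    return problems
```

For example, `check_splitter_config(TEXT_SPLITTER_NAME, text_splitter_dict)` returning an empty list means the name and its entry at least line up.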

Finally, in the make_text_splitter function, you need to add a new condition to handle your custom text splitter. Here is an example of how you can do this:

def make_text_splitter(
        splitter_name: str = TEXT_SPLITTER_NAME,
        chunk_size: int = CHUNK_SIZE,
        chunk_overlap: int = OVERLAP_SIZE,
        llm_model: str = LLM_MODELS[0],
):
    """
    Get the specific text splitter according to the parameters.
    """
    splitter_name = splitter_name or "SpacyTextSplitter"
    try:
        if splitter_name == "CustomTextSplitter":  # add this condition
            text_splitter_module = importlib.import_module('custom_text_splitter')
            TextSplitter = getattr(text_splitter_module, splitter_name)
            text_splitter = TextSplitter(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap
            )
        # other conditions...
    except Exception as e:
        print(e)
        text_splitter_module = importlib.import_module('langchain.text_splitter')
        TextSplitter = getattr(text_splitter_module, "RecursiveCharacterTextSplitter")
        text_splitter = TextSplitter(chunk_size=250, chunk_overlap=50)
    return text_splitter

Please replace "CustomTextSplitter" with the actual name of your custom text splitter and custom_text_splitter with the actual module name of your custom text splitter.

You can refer to the server/knowledge_base/utils.py file for more details on how to configure the kb_config file and the make_text_splitter function.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

Sources

server/knowledge_base/utils.py

