
Custom models registered in cluster mode report "not found" on launch #2645

Closed
1 of 3 tasks
DawnOf1996 opened this issue Dec 10, 2024 · 7 comments
@DawnOf1996

System Info

CUDA==12.0
torch==2.4.1
transformers==4.47.0

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

Both 1.0.1 and 0.16.1 are affected.

The command used to start Xinference

Cluster mode: xinference-supervisor / xinference-worker
Standalone mode: xinference-local

Reproduction

1) Deploy a cluster with 1 master and 1 worker using xinference-supervisor / xinference-worker.

2) After registering a custom embedding model in cluster mode, clicking launch reports that the custom model is not among the available models:

Embedding model paraphrase-multilingual-MiniLM-L12-v2 not found, available
Huggingface: dict_keys(['bge-large-en', 'bge-base-en', 'gte-large', 'gte-base', 'e5-large-v2', 'bge-large-zh', 'bge-large-zh-noinstruct', 'bge-base-zh', 'multilingual-e5-large', 'bge-small-zh', 'bge-small-zh-v1.5', 'bge-base-zh-v1.5', 'bge-large-zh-v1.5', 'bge-small-en-v1.5', 'bge-base-en-v1.5', 'bge-large-en-v1.5', 'jina-embeddings-v2-small-en', 'jina-embeddings-v2-base-en', 'jina-embeddings-v2-base-zh', 'text2vec-large-chinese', 'text2vec-base-chinese', 'text2vec-base-chinese-paraphrase', 'text2vec-base-chinese-sentence', 'text2vec-base-multilingual', 'bge-m3', 'bce-embedding-base_v1', 'm3e-small', 'm3e-base', 'm3e-large', 'gte-Qwen2', 'jina-embeddings-v3'])
ModelScope: dict_keys(['bge-large-en', 'bge-base-en', 'gte-large', 'gte-base', 'e5-large-v2', 'bge-large-zh', 'bge-large-zh-noinstruct', 'bge-base-zh', 'multilingual-e5-large', 'bge-small-zh', 'bge-small-zh-v1.5', 'bge-base-zh-v1.5', 'bge-large-zh-v1.5', 'bge-small-en-v1.5', 'bge-base-en-v1.5', 'bge-large-en-v1.5', 'jina-embeddings-v2-small-en', 'jina-embeddings-v2-base-en', 'jina-embeddings-v2-base-zh', 'text2vec-large-chinese', 'text2vec-base-chinese', 'text2vec-base-chinese-paraphrase', 'bge-m3', 'bce-embedding-base_v1', 'm3e-small', 'm3e-base', 'm3e-large', 'gte-Qwen2', 'jina-embeddings-v3'])
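The failing lookup matches the requested model name against the built-in registries plus any user-defined specs loaded on that node. A minimal sketch of that kind of lookup, assuming a merged built-in/user-defined search order; the registry contents and function name here are illustrative, not Xinference's actual code:

```python
# Illustrative "model not found" lookup over model registries.
# BUILTIN_EMBEDDING_MODELS and match_embedding are stand-ins for
# Xinference's internals, not its real API.

BUILTIN_EMBEDDING_MODELS = {
    "bge-m3": {"dimensions": 1024},
    "bge-large-zh-v1.5": {"dimensions": 1024},
}

def match_embedding(model_name, user_defined=()):
    # User-registered specs are checked first, then the built-in table.
    for spec in user_defined:
        if spec["model_name"] == model_name:
            return spec
    if model_name in BUILTIN_EMBEDDING_MODELS:
        return {"model_name": model_name, **BUILTIN_EMBEDDING_MODELS[model_name]}
    raise ValueError(
        f"Embedding model {model_name} not found, "
        f"available: {list(BUILTIN_EMBEDDING_MODELS)}"
    )
```

If the node performing the launch never loaded the custom spec, the user-defined list is empty and the lookup fails exactly as in the log above, listing only the built-in models.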

3) Delete the registered model.

4) Restart in standalone mode with xinference-local and re-register the model from the previous step (it does not need to be launched).

5) Shut down standalone mode and start in cluster mode again (1 master, 1 worker). The custom model list now shows two identical entries:
[screenshot: two identical custom model entries]

6) Launching the custom model now succeeds, without the "model xxx not found" error.

Expected behavior

Given the behavior above, I'd like to confirm:
What is the correct way to register a custom model in cluster mode, or is this behavior a bug in custom model registration under cluster mode?

@XprobeBot XprobeBot added the gpu label Dec 10, 2024
@XprobeBot XprobeBot added this to the v1.x milestone Dec 10, 2024
@douyk

douyk commented Dec 12, 2024

Ran into exactly the same problem.

@Crystalxd

same.


This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Dec 25, 2024

This issue was closed because it has been inactive for 5 days since being marked as stale.

@github-actions github-actions bot closed this as not planned (stale) Dec 31, 2024
@amumu96 amumu96 reopened this Jan 14, 2025
@mddmzl

mddmzl commented Jan 14, 2025

Ran into the same problem.

With supervisor and worker deployed separately, there is no fully working flow through the web UI for registering, listing, and running a custom model.

With xinference-supervisor and xinference-worker started separately, take registering a custom LLM with model_name custom-llm as an example.

Custom model lifecycle
A custom model is persisted on the supervisor/worker node as XINFERENCE_MODEL_DIR/llm/custom-llm.json.
1. On node startup, persisted models are enumerated and loaded into the process's UD_LLM_FAMILIES list.
2. Registration appends to UD_LLM_FAMILIES.
3. Unregistration removes the entry from UD_LLM_FAMILIES.
4. At launch time, the model is looked up in UD_LLM_FAMILIES.
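The four lifecycle steps above can be sketched as follows. UD_LLM_FAMILIES, the directory layout, and the function names are simplifications for illustration, not Xinference's actual internals:

```python
# Sketch of the custom-model lifecycle: persist as JSON under
# XINFERENCE_MODEL_DIR/llm/, mirror into an in-process list.
import json
import os

UD_LLM_FAMILIES = []  # stand-in for the per-process registry

def load_persisted_models(model_dir):
    # 1. On startup: enumerate persisted specs into the in-process list.
    llm_dir = os.path.join(model_dir, "llm")
    if not os.path.isdir(llm_dir):
        return
    for name in sorted(os.listdir(llm_dir)):
        if name.endswith(".json"):
            with open(os.path.join(llm_dir, name)) as f:
                UD_LLM_FAMILIES.append(json.load(f))

def register(spec, model_dir):
    # 2. Register: append to the list and persist to disk.
    UD_LLM_FAMILIES.append(spec)
    path = os.path.join(model_dir, "llm", spec["model_name"] + ".json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(spec, f)

def unregister(model_name, model_dir):
    # 3. Unregister: drop from the list and delete the persisted file.
    UD_LLM_FAMILIES[:] = [s for s in UD_LLM_FAMILIES
                          if s["model_name"] != model_name]
    path = os.path.join(model_dir, "llm", model_name + ".json")
    if os.path.exists(path):
        os.remove(path)

def find(model_name):
    # 4. Launch: look the model up in the list.
    return next((s for s in UD_LLM_FAMILIES
                 if s["model_name"] == model_name), None)
```

The key point for this issue is that the list is per process: each supervisor or worker only sees what it loaded or registered itself.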

Registration logic
https://github.com/xorbitsai/inference/blob/d0dff35c6e9479881042505ef62ea92a3890b2c1/xinference/core/supervisor.py#L792
register_model currently registers and persists the model on the supervisor node by default, and supports a worker_ip parameter to register and persist it on a specific worker instead.
The web UI currently has no way to specify a worker, so models are only ever registered on the supervisor. This is why registration succeeds but launching on the worker reports "model not found": the worker's UD_LLM_FAMILIES does not contain the model.
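A hedged sketch of that dispatch, a simplification of the linked register_model; the Supervisor/Node classes here are illustrative stand-ins for the actual actors:

```python
# Sketch of register_model routing: default path persists on the
# supervisor; only an explicit worker_ip sends the spec to a worker.

class Node:
    def __init__(self, name):
        self.name = name
        self.registry = []  # stands in for this node's UD_LLM_FAMILIES

    def register_local(self, spec):
        self.registry.append(spec)

class Supervisor(Node):
    def __init__(self, workers):
        super().__init__("supervisor")
        self.workers = workers  # mapping of worker_ip -> Node

    def register_model(self, spec, worker_ip=None):
        if worker_ip is None:
            # Default path (what the web UI triggers): supervisor only.
            self.register_local(spec)
        else:
            # Explicit worker_ip: the spec lands on that worker instead.
            self.workers[worker_ip].register_local(spec)
```

Since the web UI never passes worker_ip, the worker's registry stays empty, and a launch routed to that worker cannot find the model.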

List logic
https://github.com/xorbitsai/inference/blob/d0dff35c6e9479881042505ef62ea92a3890b2c1/xinference/core/supervisor.py#L553
It iterates over the worker nodes and additionally lists the supervisor's own models. This is why multiple identical entries appear when several nodes are started sharing the same UD_LLM_FAMILIES persistence directory.
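The aggregation can be sketched like this (the function name and registry shapes are illustrative):

```python
# Sketch of cluster-wide listing: concatenate each worker's registry
# with the supervisor's own, with no deduplication by model_name.

def list_cluster_models(supervisor_registry, worker_registries):
    result = []
    for reg in worker_registries:       # iterate over worker nodes
        result.extend(reg)
    result.extend(supervisor_registry)  # plus the supervisor's own models
    return result
```

When supervisor and worker load specs from the same shared persistence directory, both registries contain the same spec, so the aggregated list shows it twice, matching step 5 of the reproduction.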

Questions
Is it really necessary for each supervisor/worker to hold the same model_name with a different configuration? Would it be enough for the cluster's model list to come from the supervisor alone, with the deployment ensuring that every worker node can access the custom model's URI?

Expected behavior
In supervisor/worker mode, as long as the worker nodes can access the model path, custom models can be registered, listed, and run on the worker nodes normally.

Once the approach for the fix is settled, I'm happy to contribute a PR.

@github-actions github-actions bot removed the stale label Jan 14, 2025

This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Jan 21, 2025

This issue was closed because it has been inactive for 5 days since being marked as stale.

@github-actions github-actions bot closed this as not planned (stale) Jan 26, 2025