
Custom models registered in cluster mode report "not found" on launch #2645

Closed
1 of 3 tasks
DawnOf1996 opened this issue Dec 10, 2024 · 7 comments
@DawnOf1996

System Info

CUDA==12.0
torch==2.4.1
transformers==4.47.0

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

Both 1.0.1 and 0.16.1 are affected.

The command used to start Xinference

Cluster mode: xinference-supervisor / xinference-worker
Standalone mode: xinference-local

Reproduction

1) Deploy a cluster with 1 master and 1 worker using xinference-supervisor / xinference-worker.

2) After registering a custom embedding model in cluster mode, clicking launch reports that the custom model is not among the available models:

Embedding model paraphrase-multilingual-MiniLM-L12-v2 not found, available
Huggingface: dict_keys(['bge-large-en', 'bge-base-en', 'gte-large', 'gte-base', 'e5-large-v2', 'bge-large-zh', 'bge-large-zh-noinstruct', 'bge-base-zh', 'multilingual-e5-large', 'bge-small-zh', 'bge-small-zh-v1.5', 'bge-base-zh-v1.5', 'bge-large-zh-v1.5', 'bge-small-en-v1.5', 'bge-base-en-v1.5', 'bge-large-en-v1.5', 'jina-embeddings-v2-small-en', 'jina-embeddings-v2-base-en', 'jina-embeddings-v2-base-zh', 'text2vec-large-chinese', 'text2vec-base-chinese', 'text2vec-base-chinese-paraphrase', 'text2vec-base-chinese-sentence', 'text2vec-base-multilingual', 'bge-m3', 'bce-embedding-base_v1', 'm3e-small', 'm3e-base', 'm3e-large', 'gte-Qwen2', 'jina-embeddings-v3'])
ModelScope: dict_keys(['bge-large-en', 'bge-base-en', 'gte-large', 'gte-base', 'e5-large-v2', 'bge-large-zh', 'bge-large-zh-noinstruct', 'bge-base-zh', 'multilingual-e5-large', 'bge-small-zh', 'bge-small-zh-v1.5', 'bge-base-zh-v1.5', 'bge-large-zh-v1.5', 'bge-small-en-v1.5', 'bge-base-en-v1.5', 'bge-large-en-v1.5', 'jina-embeddings-v2-small-en', 'jina-embeddings-v2-base-en', 'jina-embeddings-v2-base-zh', 'text2vec-large-chinese', 'text2vec-base-chinese', 'text2vec-base-chinese-paraphrase', 'bge-m3', 'bce-embedding-base_v1', 'm3e-small', 'm3e-base', 'm3e-large', 'gte-Qwen2', 'jina-embeddings-v3'])
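The failing lookup matches the requested model name against the built-in registries plus any user-defined specs loaded on that node. A minimal sketch of that kind of lookup, assuming a merged built-in/user-defined search order; the registry contents and function name here are illustrative, not Xinference's actual code:

```python
# Illustrative "model not found" lookup over model registries.
# BUILTIN_EMBEDDING_MODELS and match_embedding are stand-ins for
# Xinference's internals, not its real API.

BUILTIN_EMBEDDING_MODELS = {
    "bge-m3": {"dimensions": 1024},
    "bge-large-zh-v1.5": {"dimensions": 1024},
}

def match_embedding(model_name, user_defined=()):
    # User-registered specs are checked first, then the built-in table.
    for spec in user_defined:
        if spec["model_name"] == model_name:
            return spec
    if model_name in BUILTIN_EMBEDDING_MODELS:
        return {"model_name": model_name, **BUILTIN_EMBEDDING_MODELS[model_name]}
    raise ValueError(
        f"Embedding model {model_name} not found, "
        f"available: {list(BUILTIN_EMBEDDING_MODELS)}"
    )
```

If the node performing the launch never loaded the custom spec, the user-defined list is empty and the lookup fails exactly as in the log above, listing only the built-in models.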

3) Delete the registered model.

4) Restart in standalone mode with xinference-local and re-register the model from the previous step (it does not need to be launched).

5) Shut down standalone mode and start in cluster mode again (1 master, 1 worker). The custom model list now shows two identical entries:
[screenshot: two identical custom model entries]

6) Launching the custom model now succeeds, without the "model xxx not found" error.

Expected behavior

Given the behavior above, I'd like to confirm:
What is the correct way to register a custom model in cluster mode, or is this behavior a bug in custom model registration under cluster mode?

@XprobeBot XprobeBot added the gpu label Dec 10, 2024
@XprobeBot XprobeBot added this to the v1.x milestone Dec 10, 2024
@douyk

douyk commented Dec 12, 2024

Ran into exactly the same problem.

@Crystalxd

same.


This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Dec 25, 2024

This issue was closed because it has been inactive for 5 days since being marked as stale.

@github-actions github-actions bot closed this as not planned (stale) Dec 31, 2024
@amumu96 amumu96 reopened this Jan 14, 2025
@mddmzl

mddmzl commented Jan 14, 2025

Ran into the same problem.

With supervisor and worker deployed separately, there is no fully working flow through the web UI for registering, listing, and running a custom model.

With xinference-supervisor and xinference-worker started separately, take registering a custom LLM with model_name custom-llm as an example.

Custom model lifecycle
A custom model is persisted on the supervisor/worker node as XINFERENCE_MODEL_DIR/llm/custom-llm.json.
1. On node startup, persisted models are enumerated and loaded into the process's UD_LLM_FAMILIES list.
2. Registration appends to UD_LLM_FAMILIES.
3. Unregistration removes the entry from UD_LLM_FAMILIES.
4. At launch time, the model is looked up in UD_LLM_FAMILIES.
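The four lifecycle steps above can be sketched as follows. UD_LLM_FAMILIES, the directory layout, and the function names are simplifications for illustration, not Xinference's actual internals:

```python
# Sketch of the custom-model lifecycle: persist as JSON under
# XINFERENCE_MODEL_DIR/llm/, mirror into an in-process list.
import json
import os

UD_LLM_FAMILIES = []  # stand-in for the per-process registry

def load_persisted_models(model_dir):
    # 1. On startup: enumerate persisted specs into the in-process list.
    llm_dir = os.path.join(model_dir, "llm")
    if not os.path.isdir(llm_dir):
        return
    for name in sorted(os.listdir(llm_dir)):
        if name.endswith(".json"):
            with open(os.path.join(llm_dir, name)) as f:
                UD_LLM_FAMILIES.append(json.load(f))

def register(spec, model_dir):
    # 2. Register: append to the list and persist to disk.
    UD_LLM_FAMILIES.append(spec)
    path = os.path.join(model_dir, "llm", spec["model_name"] + ".json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(spec, f)

def unregister(model_name, model_dir):
    # 3. Unregister: drop from the list and delete the persisted file.
    UD_LLM_FAMILIES[:] = [s for s in UD_LLM_FAMILIES
                          if s["model_name"] != model_name]
    path = os.path.join(model_dir, "llm", model_name + ".json")
    if os.path.exists(path):
        os.remove(path)

def find(model_name):
    # 4. Launch: look the model up in the list.
    return next((s for s in UD_LLM_FAMILIES
                 if s["model_name"] == model_name), None)
```

The key point for this issue is that the list is per process: each supervisor or worker only sees what it loaded or registered itself.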

Registration logic
https://github.com/xorbitsai/inference/blob/d0dff35c6e9479881042505ef62ea92a3890b2c1/xinference/core/supervisor.py#L792
register_model currently registers and persists the model on the supervisor node by default, and supports a worker_ip parameter to register and persist it on a specific worker instead.
The web UI currently has no way to specify a worker, so models are only ever registered on the supervisor. This is why registration succeeds but launching on the worker reports "model not found": the worker's UD_LLM_FAMILIES does not contain the model.
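A hedged sketch of that dispatch, a simplification of the linked register_model; the Supervisor/Node classes here are illustrative stand-ins for the actual actors:

```python
# Sketch of register_model routing: default path persists on the
# supervisor; only an explicit worker_ip sends the spec to a worker.

class Node:
    def __init__(self, name):
        self.name = name
        self.registry = []  # stands in for this node's UD_LLM_FAMILIES

    def register_local(self, spec):
        self.registry.append(spec)

class Supervisor(Node):
    def __init__(self, workers):
        super().__init__("supervisor")
        self.workers = workers  # mapping of worker_ip -> Node

    def register_model(self, spec, worker_ip=None):
        if worker_ip is None:
            # Default path (what the web UI triggers): supervisor only.
            self.register_local(spec)
        else:
            # Explicit worker_ip: the spec lands on that worker instead.
            self.workers[worker_ip].register_local(spec)
```

Since the web UI never passes worker_ip, the worker's registry stays empty, and a launch routed to that worker cannot find the model.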

List logic
https://github.com/xorbitsai/inference/blob/d0dff35c6e9479881042505ef62ea92a3890b2c1/xinference/core/supervisor.py#L553
It iterates over the worker nodes and additionally lists the supervisor's own models. This is why multiple identical entries appear when several nodes are started sharing the same UD_LLM_FAMILIES persistence directory.
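The aggregation can be sketched like this (the function name and registry shapes are illustrative):

```python
# Sketch of cluster-wide listing: concatenate each worker's registry
# with the supervisor's own, with no deduplication by model_name.

def list_cluster_models(supervisor_registry, worker_registries):
    result = []
    for reg in worker_registries:       # iterate over worker nodes
        result.extend(reg)
    result.extend(supervisor_registry)  # plus the supervisor's own models
    return result
```

When supervisor and worker load specs from the same shared persistence directory, both registries contain the same spec, so the aggregated list shows it twice, matching step 5 of the reproduction.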

Questions
Is it really necessary for each supervisor/worker to hold the same model_name with a different configuration? Would it be enough for the cluster's model list to come from the supervisor alone, with the deployment ensuring that every worker node can access the custom model's URI?

Expected behavior
In supervisor/worker mode, as long as the worker nodes can access the model path, custom models can be registered, listed, and run on the worker nodes normally.

Once the approach for the fix is settled, I'm happy to contribute a PR.

@github-actions github-actions bot removed the stale label Jan 14, 2025

This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Jan 21, 2025

This issue was closed because it has been inactive for 5 days since being marked as stale.

@github-actions github-actions bot closed this as not planned (stale) Jan 26, 2025