Missing ModelMeta for Chinese models #1803

Open
x-tabdeveloping opened this issue Jan 14, 2025 · 10 comments
Labels
good first issue (Good for newcomers), help wanted (Extra attention is needed), leaderboard (issues related to the leaderboard)

Comments

@x-tabdeveloping
Collaborator

x-tabdeveloping commented Jan 14, 2025

Many of the models that were run on the original C-MTEB, and for which we have results, are currently missing ModelMeta objects in the library.

Here's a list of Chinese-specific models that have yet to be added to MTEB:

missing_meta_chinese = [
    "BAAI/bge-base-zh", # not planned, outdated
    "BAAI/bge-base-zh-v1.5",
    "BAAI/bge-large-zh", # not planned, outdated
    "BAAI/bge-large-zh-noinstruct",  # not planned, outdated
    "BAAI/bge-large-zh-v1.5",
    "BAAI/bge-small-zh",  # not planned, outdated
    "BAAI/bge-small-zh-v1.5",
    "DMetaSoul/Dmeta-embedding-zh-small",
    "DMetaSoul/sbert-chinese-general-v1",
    "Erin/IYun-large-zh",
    "Erin/mist-zh",
    "Pristinenlp/alime-embedding-large-zh",
    "Pristinenlp/alime-reranker-large-zh",
    "RookieHX/bge_m3e_stella",
    "akarum/cloudy-large-zh",
    "arkohut/jina-embeddings-v2-base-zh",
    "dunzhang/stella-large-zh-v3-1792d",
    "dunzhang/stella-mrl-large-zh-v3.5-1792d",
    "fangxq/XYZ-embedding-zh",
    "fangxq/XYZ-embedding-zh-v2",
    "iampanda/zpoint_large_embedding_zh",
    "infgrad/stella-base-zh",
    "infgrad/stella-base-zh-v2",
    "infgrad/stella-base-zh-v3-1792d",
    "infgrad/stella-large-zh",
    "infgrad/stella-large-zh-v2",
    "jinaai/jina-embeddings-v2-base-zh",
    "moka-ai/m3e-base",
    "moka-ai/m3e-large",
    "neofung/m3e-ernie-xbase-zh",
    "sensenova/piccolo-base-zh",
    "sensenova/piccolo-large-zh",
    "sensenova/piccolo-large-zh-v2",
    "shanghung/stella-base-zh-v3-1792d",
    "shibing624/text2vec-base-chinese",  # Needs custom implementation
    "shibing624/text2vec-large-chinese",  # Needs custom implementation
    "silverjam/jina-embeddings-v2-base-zh",
    "thenlper/gte-base-zh",
    "thenlper/gte-large-zh",
    "thenlper/gte-small-zh",
    "towing/gte-small-zh", # not planned
    "lier007/xiaobu-embedding",
    "Classical/Yinka",
    "TencentBAC/Conan-embedding-v1",
    "lier007/xiaobu-embedding-v2",
]

Here is also a list of multilingual models that are currently missing metadata:

missing_meta_multilingual = [
    "Alibaba-NLP/gte-multilingual-base",
    "BAAI/bge-multilingual-gemma2",
    "EdwardBurgin/paraphrase-multilingual-mpnet-base-v2",
    "HIT-TMG/KaLM-embedding-multilingual-max-instruct-v1",
    "barisaydin/text2vec-base-multilingual", # Needs custom implementation
    "beademiguelperez/sentence-transformers-multilingual-e5-small",
    "bedrock/cohere-embed-multilingual-v3",
    "gizmo-ai/Cohere-embed-multilingual-v3.0",
    "sentence-transformers/distiluse-base-multilingual-cased-v2",
    "sentence-transformers/use-cmlm-multilingual",
    "vprelovac/universal-sentence-encoder-multilingual-3",
    "vprelovac/universal-sentence-encoder-multilingual-large-3",
]

Most of these should be pretty trivial to add.
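
Since the work is easiest to split up by model family, here is a minimal sketch of grouping names like the ones above by their organization prefix. The helper `group_by_org` and the shortened `sample` list are illustrative assumptions and not part of MTEB.

```python
from collections import defaultdict


def group_by_org(model_names):
    """Group 'org/model' names by their organization prefix (hypothetical helper)."""
    groups = defaultdict(list)
    for name in model_names:
        # partition splits on the first "/" only, so model names keep any later slashes
        org, _, model = name.partition("/")
        groups[org].append(model)
    return dict(groups)


# Shortened sample taken from the lists above, for illustration only
sample = [
    "BAAI/bge-base-zh-v1.5",
    "BAAI/bge-large-zh-v1.5",
    "thenlper/gte-base-zh",
    "moka-ai/m3e-base",
]

grouped = group_by_org(sample)
for org, models in grouped.items():
    print(org, models)
```

Each group can then be picked up in a separate PR.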

x-tabdeveloping added the good first issue, help wanted, and leaderboard labels on Jan 14, 2025
@x-tabdeveloping
Collaborator Author

I will add BGE and GTE to start with

@x-tabdeveloping
Collaborator Author

Added BGE and GTE in #1805

@x-tabdeveloping
Collaborator Author

I have implemented some of them and checked all of the models.
I excluded models that were dubious finetunes, distillations, or quantizations, as well as models without any real documentation (no README, a README of only one or two lines, the unmodified sentence-transformers template with no extra information on the model, or the same README as the parent model).

Here's a curated and updated list:

model_metas_to_implement = [
    "DMetaSoul/Dmeta-embedding-zh-small",
    "DMetaSoul/sbert-chinese-general-v1",
    # Stella Models
    "dunzhang/stella-large-zh-v3-1792d",
    "dunzhang/stella-mrl-large-zh-v3.5-1792d",
    "iampanda/zpoint_large_embedding_zh",
    "infgrad/stella-base-zh",
    "infgrad/stella-base-zh-v2",
    "infgrad/stella-base-zh-v3-1792d",
    "infgrad/stella-large-zh",
    "infgrad/stella-large-zh-v2",
    # Jina models
    "jinaai/jina-embeddings-v2-base-zh",
    # Moka models
    "moka-ai/m3e-base",
    "moka-ai/m3e-large",
    # Piccolo models
    "sensenova/piccolo-base-zh",
    "sensenova/piccolo-large-zh",
    "sensenova/piccolo-large-zh-v2",
    # Text2Vec models
    "shibing624/text2vec-base-chinese",  # Needs custom implementation
    "shibing624/text2vec-large-chinese",  # Needs custom implementation
    "barisaydin/text2vec-base-multilingual", # Needs custom implementation
    # Multilingual SBERT
    "sentence-transformers/distiluse-base-multilingual-cased-v2",
    "sentence-transformers/use-cmlm-multilingual",
    # Miscellaneous
    "lier007/xiaobu-embedding",
    "Classical/Yinka",
    "TencentBAC/Conan-embedding-v1",
    "lier007/xiaobu-embedding-v2",
]

@x-tabdeveloping
Collaborator Author

I'm adding the missing SBERT models right now

@x-tabdeveloping
Collaborator Author

Then Moka

@x-tabdeveloping
Collaborator Author

x-tabdeveloping commented Jan 15, 2025

Updating the list again after #1814:

model_metas_to_implement = [
    # Stella Models
    "dunzhang/stella-large-zh-v3-1792d",
    "dunzhang/stella-mrl-large-zh-v3.5-1792d",
    "iampanda/zpoint_large_embedding_zh",
    "infgrad/stella-base-zh",  # outdated, not planned
    "infgrad/stella-base-zh-v2",  # outdated, not planned
    "infgrad/stella-base-zh-v3-1792d",
    "infgrad/stella-large-zh",  # outdated, not planned
    "infgrad/stella-large-zh-v2",  # outdated, not planned
    # Text2Vec models
    "shibing624/text2vec-base-chinese",  # Needs custom implementation
    "shibing624/text2vec-large-chinese",  # Needs custom implementation
    "barisaydin/text2vec-base-multilingual", # Needs custom implementation
    # Miscellaneous
    "lier007/xiaobu-embedding",
    "Classical/Yinka",
    "TencentBAC/Conan-embedding-v1",
    "lier007/xiaobu-embedding-v2",
]
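
An update like this one amounts to dropping the already-implemented names from the previous list while keeping the order. A minimal sketch, assuming a hypothetical helper `remaining` and shortened sample lists (the `implemented` entries are illustrative and do not claim which PR covered which model):

```python
def remaining(todo, done):
    """Return the entries of `todo` that are not in `done`, preserving order."""
    done_set = set(done)
    return [name for name in todo if name not in done_set]


# Shortened sample of the previous to-do list, for illustration only
previous = [
    "DMetaSoul/Dmeta-embedding-zh-small",
    "moka-ai/m3e-base",
    "sensenova/piccolo-base-zh",
    "dunzhang/stella-large-zh-v3-1792d",
    "Classical/Yinka",
]

# Assumption: names treated as already implemented in this example
implemented = [
    "DMetaSoul/Dmeta-embedding-zh-small",
    "moka-ai/m3e-base",
    "sensenova/piccolo-base-zh",
]

still_missing = remaining(previous, implemented)
for name in still_missing:
    print(name)
```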

@KennethEnevoldsen
Contributor

re.:

    "shibing624/text2vec-base-chinese",  # Needs custom implementation
    "shibing624/text2vec-large-chinese",  # Needs custom implementation
    "barisaydin/text2vec-base-multilingual", # Needs custom implementation

If you want these to appear in the leaderboard but don't want to implement them, you can simply set the loader to None (or add a "NotImplemented" class).
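
To illustrate the idea, here is a rough sketch using a stand-in dataclass. `ModelMetaSketch` and its `load_model` method are hypothetical and only mimic the pattern; they are not MTEB's actual ModelMeta, whose fields and behavior may differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ModelMetaSketch:
    """Hypothetical stand-in for illustration; NOT mteb's actual ModelMeta."""

    name: str
    # None means "metadata only": the model can be listed but not loaded
    loader: Optional[Callable[[], object]] = None

    def load_model(self):
        if self.loader is None:
            raise NotImplementedError(f"No loader registered for {self.name}")
        return self.loader()


# Metadata-only entry: enough to list the model, but loading it fails loudly
meta = ModelMetaSketch(name="shibing624/text2vec-base-chinese", loader=None)
print(meta.name)

try:
    meta.load_model()
except NotImplementedError as err:
    print(err)
```

The entry still carries a name that a leaderboard could display, while any attempt to actually run the model raises a clear error.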

@x-tabdeveloping
Collaborator Author

Sure, but it shouldn't take too long to implement. @KennethEnevoldsen do you have time for the Stella models? Then I'll take the rest.

@x-tabdeveloping
Collaborator Author

Oh, I seem to have been wrong: it looks like you can use text2vec with SentenceTransformers.

@x-tabdeveloping
Collaborator Author

Added Stella in #1824. This issue will be ready to close once it's merged.


2 participants