embed

A stable, blazing fast and easy-to-use inference library with a focus on a sync-to-async API

Installation

pip install embed

Why embed?

Embed makes it easy to load any embedding, classification, or reranking model from Hugging Face. It leverages Infinity as a backend for async computation, batching, and Flash-Attention-2.

[CPU benchmark diagram] Benchmarked on an Nvidia L4 instance. Note: the CPU run uses bert-small, the CUDA run uses bert-large. Methodology.

from embed import BatchedInference
from concurrent.futures import Future

# Run any model
register = BatchedInference(
    model_id=[
        # sentence-embeddings
        "michaelfeil/bge-small-en-v1.5",
        # sentence-embeddings and image-embeddings
        "jinaai/jina-clip-v1",
        # classification models
        "philschmid/tiny-bert-sst2-distilled",
        # rerankers
        "mixedbread-ai/mxbai-rerank-xsmall-v1",
    ],
    # engine: `torch` or `optimum`
    engine="torch",
    # device: `cuda` (Nvidia/AMD) or `cpu`
    device="cpu",
)

sentences = ["Paris is in France.", "Berlin is in Germany.", "An image of two cats."]
images = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
question = "Where is Paris?"

future: "Future" = register.embed(
    sentences=sentences, model_id="michaelfeil/bge-small-en-v1.5"
)
future.result()
register.rerank(
    query=question, docs=sentences, model_id="mixedbread-ai/mxbai-rerank-xsmall-v1"
)
register.classify(model_id="philschmid/tiny-bert-sst2-distilled", sentences=sentences)
register.image_embed(model_id="jinaai/jina-clip-v1", images=images)

# manually stop the register upon termination to free model memory.
register.stop()
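
If the process can exit with an error, it is safer to guarantee the cleanup. A minimal sketch using only the calls shown above, with a single model for brevity:

from embed import BatchedInference

register = BatchedInference(
    model_id=["michaelfeil/bge-small-en-v1.5"],
    engine="torch",
    device="cpu",
)
try:
    future = register.embed(
        sentences=["Paris is in France."], model_id="michaelfeil/bge-small-en-v1.5"
    )
    embeddings, token_usage = future.result()
finally:
    # free model memory even if an exception was raised above
    register.stop()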

All functions return a Future that resolves to (vector_embedding, token_usage). This lets you decide when to wait for the result and keeps batching logic out of your code.

>>> embedding_fut = register.embed(sentences=sentences, model_id="michaelfeil/bge-small-en-v1.5")
>>> print(embedding_fut)
<Future at 0x7fa0e97e8a60 state=pending>
>>> import time; time.sleep(1); print(embedding_fut)
<Future at 0x7fa0e97e8a60 state=finished returned tuple>
>>> embedding_fut.result()
([array([-3.35943862e-03, ..., -3.22808176e-02], dtype=float32)], 19)
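
Because register.embed returns a standard concurrent.futures.Future (as the import in the example above suggests), it also bridges into asyncio with the standard library alone. A minimal sketch, assuming register is the BatchedInference instance created earlier:

import asyncio

async def main():
    fut = register.embed(
        sentences=["Paris is in France."], model_id="michaelfeil/bge-small-en-v1.5"
    )
    # wrap the concurrent.futures.Future so it can be awaited in the event loop
    embeddings, token_usage = await asyncio.wrap_future(fut)
    print(token_usage)

asyncio.run(main())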

License and Contributions

embed is licensed under the MIT License. All contributions must adhere to the MIT License. Contributions are welcome.
