# UForm Model Benchmarks

## Accuracy

### Embedding Models

Few retrieval benchmarks exist for multimodal embeddings. The most famous ones for English are "MS-COCO" and "Flickr30k". Evaluating the uform-vl-english model, one can expect the following numbers for search quality.

| Dataset   | Recall @ 1 | Recall @ 5 | Recall @ 10 |
| :-------- | ---------: | ---------: | ----------: |
| Flickr    |      0.727 |      0.915 |       0.949 |
| MS-COCO ¹ |      0.510 |      0.761 |       0.838 |
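
Here, Recall @ K is the fraction of text queries whose matching image appears among the top-K retrieved results. A minimal sketch of that computation over precomputed embeddings follows; the random vectors are only placeholders for real encoder outputs, not part of any benchmark.

```python
# Minimal sketch: text-to-image Recall@K from precomputed embeddings.
import numpy as np

def recall_at_k(image_embs: np.ndarray, text_embs: np.ndarray, k: int) -> float:
    # L2-normalize so that dot products equal cosine similarities.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Row i scores text query i against every image in the gallery.
    similarity = text_embs @ image_embs.T
    # Ground truth for paired datasets: text i matches image i.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = (top_k == np.arange(len(text_embs))[:, None]).any(axis=1)
    return float(hits.mean())

# Random stand-ins; a real evaluation feeds encoder outputs for COCO / Flickr here.
rng = np.random.default_rng(seed=42)
image_embs = rng.normal(size=(1_000, 256))
text_embs = rng.normal(size=(1_000, 256))
print(recall_at_k(image_embs, text_embs, k=10))
```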

For multilingual benchmarks, we've created the unum-cloud/coco-sm repository². Evaluating the unum-cloud/uform-vl-multilingual-v2 model, one can expect the following metrics for text-to-image search, compared against the xlm-roberta-base-ViT-B-32 OpenCLIP model.

| Language   | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
| :--------- | -----------: | --------: | -----------: | --------: | ------------: | ---------: | -------: |
| English 🇺🇸 | 37.8 | 37.7 | 63.5 | 65.0 | 73.5 | 75.9 | 1'452 M |
| Chinese 🇨🇳 | 27.3 | 32.2 | 51.3 | 59.0 | 62.1 | 70.5 | 1'118 M |
| Hindi 🇮🇳   | 20.7 | 31.3 | 42.5 | 57.9 | 53.7 | 69.6 |   602 M |
| Spanish 🇪🇸 | 32.6 | 35.6 | 58.0 | 62.8 | 68.8 | 73.7 |   548 M |
| Arabic 🇸🇦  | 22.7 | 31.7 | 44.9 | 57.8 | 55.8 | 69.2 |   274 M |
| French 🇫🇷  | 31.3 | 35.4 | 56.5 | 62.6 | 67.4 | 73.3 |   274 M |

All languages:

| Language      | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
| :------------ | -----------: | --------: | -----------: | --------: | ------------: | ---------: | -------: |
| Arabic 🇸🇦     | 22.7 | 31.7 | 44.9 | 57.8 | 55.8 | 69.2 |   274 M |
| Armenian 🇦🇲   |  5.6 | 22.0 | 14.3 | 44.7 | 20.2 | 56.0 |     4 M |
| Chinese 🇨🇳    | 27.3 | 32.2 | 51.3 | 59.0 | 62.1 | 70.5 | 1'118 M |
| English 🇺🇸    | 37.8 | 37.7 | 63.5 | 65.0 | 73.5 | 75.9 | 1'452 M |
| French 🇫🇷     | 31.3 | 35.4 | 56.5 | 62.6 | 67.4 | 73.3 |   274 M |
| German 🇩🇪     | 31.7 | 35.1 | 56.9 | 62.2 | 67.4 | 73.3 |   134 M |
| Hebrew 🇮🇱     | 23.7 | 26.7 | 46.3 | 51.8 | 57.0 | 63.5 |     9 M |
| Hindi 🇮🇳      | 20.7 | 31.3 | 42.5 | 57.9 | 53.7 | 69.6 |   602 M |
| Indonesian 🇮🇩 | 26.9 | 30.7 | 51.4 | 57.0 | 62.7 | 68.6 |   199 M |
| Italian 🇮🇹    | 31.3 | 34.9 | 56.7 | 62.1 | 67.1 | 73.1 |    67 M |
| Japanese 🇯🇵   | 27.4 | 32.6 | 51.5 | 59.2 | 62.6 | 70.6 |   125 M |
| Korean 🇰🇷     | 24.4 | 31.5 | 48.1 | 57.8 | 59.2 | 69.2 |    81 M |
| Persian 🇮🇷    | 24.0 | 28.8 | 47.0 | 54.6 | 57.8 | 66.2 |    77 M |
| Polish 🇵🇱     | 29.2 | 33.6 | 53.9 | 60.1 | 64.7 | 71.3 |    41 M |
| Portuguese 🇵🇹 | 31.6 | 32.7 | 57.1 | 59.6 | 67.9 | 71.0 |   257 M |
| Russian 🇷🇺    | 29.9 | 33.9 | 54.8 | 60.9 | 65.8 | 72.0 |   258 M |
| Spanish 🇪🇸    | 32.6 | 35.6 | 58.0 | 62.8 | 68.8 | 73.7 |   548 M |
| Thai 🇹🇭       | 21.5 | 28.7 | 43.0 | 54.6 | 53.7 | 66.0 |    61 M |
| Turkish 🇹🇷    | 25.5 | 33.0 | 49.1 | 59.6 | 60.3 | 70.8 |    88 M |
| Ukrainian 🇺🇦  | 26.0 | 30.6 | 49.9 | 56.7 | 60.9 | 68.1 |    41 M |
| Vietnamese 🇻🇳 | 25.4 | 28.3 | 49.2 | 53.9 | 60.3 | 65.5 |    85 M |
| Mean                 | 26.5±6.4 | 31.8±3.5 | 49.8±9.8  | 58.1±4.5 | 60.4±10.6 | 69.4±4.3 | - |
| Google Translate     | 27.4±6.3 | 31.5±3.5 | 51.1±9.5  | 57.8±4.4 | 61.7±10.3 | 69.1±4.3 | - |
| Microsoft Translator | 27.2±6.4 | 31.4±3.6 | 50.8±9.8  | 57.7±4.7 | 61.4±10.6 | 68.9±4.6 | - |
| Meta NLLB            | 24.9±6.7 | 32.4±3.5 | 47.5±10.3 | 58.9±4.5 | 58.2±11.2 | 70.2±4.3 | - |

### Generative Models

| Model                | LLM Size |  SQA |    MME | MMBench | Average¹ |
| :------------------- | -------: | ---: | -----: | ------: | -------: |
| UForm-Gen2-Qwen-500m |     0.5B | 45.5 |  880.1 |    42.0 |    29.31 |
| MobileVLM v2         |     1.4B | 52.1 | 1302.8 |    57.7 |    36.81 |
| LLaVA-Phi            |     2.7B | 68.4 | 1335.1 |    59.8 |    42.95 |

For captioning evaluation we measure CLIPScore and RefCLIPScore³.

| Model                             | Size | Caption Length | CLIPScore | RefCLIPScore |
| :-------------------------------- | ---: | :------------- | --------: | -----------: |
| llava-hf/llava-1.5-7b-hf          |   7B | Long           |     0.878 |        0.529 |
| llava-hf/llava-1.5-7b-hf          |   7B | Short          |     0.886 |        0.531 |
| Salesforce/instructblip-vicuna-7b |   7B | Long           |     0.902 |        0.534 |
| Salesforce/instructblip-vicuna-7b |   7B | Short          |     0.848 |        0.523 |
| unum-cloud/uform-gen              | 1.5B | Long           |     0.847 |        0.523 |
| unum-cloud/uform-gen              | 1.5B | Short          |     0.842 |        0.522 |
| unum-cloud/uform-gen-chat         | 1.5B | Long           |     0.860 |        0.525 |
| unum-cloud/uform-gen-chat         | 1.5B | Short          |     0.858 |        0.525 |
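
For reference, CLIPScore is a rescaled, clipped cosine similarity between the CLIP embeddings of an image and its generated caption, and RefCLIPScore combines it with the best agreement against human reference captions through a harmonic mean. The sketch below shows one common formulation over precomputed embeddings; the exact preprocessing and scaling behind the numbers above may differ.

```python
# Sketch of CLIPScore / RefCLIPScore over precomputed CLIP embeddings.
# The w = 2.5 rescaling follows the original CLIPScore paper; treat the rest
# as illustrative rather than the exact evaluation script used here.
import numpy as np

def _cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_score(image_emb: np.ndarray, caption_emb: np.ndarray, w: float = 2.5) -> float:
    return w * max(_cos(image_emb, caption_emb), 0.0)

def ref_clip_score(image_emb: np.ndarray, caption_emb: np.ndarray, reference_embs: np.ndarray) -> float:
    s_image = clip_score(image_emb, caption_emb)
    s_refs = max(max(_cos(caption_emb, ref) for ref in reference_embs), 0.0)
    # Harmonic mean of image-caption and caption-reference agreement.
    return 0.0 if s_image + s_refs == 0 else 2 * s_image * s_refs / (s_image + s_refs)
```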

Results for VQAv2 evaluation.

| Model                    | Size | Accuracy |
| :----------------------- | ---: | -------: |
| llava-hf/llava-1.5-7b-hf |   7B |     78.5 |
| unum-cloud/uform-gen     | 1.5B |     66.5 |

¹ Train split was in training data.
² Lacking a broad enough evaluation dataset, we translated the COCO Karpathy test split with multiple public and proprietary translation services, averaging the scores across all sets, and breaking them down in the bottom section.
³ We used apple/DFN5B-CLIP-ViT-H-14-378 CLIP model.

## Speed

### Embedding Models

UForm comes pre-packaged with speed benchmarks for the models.

```sh
$ python python/scripts/bench_encoders.py --help
usage: bench_encoders.py [-h] [--filter-out FILTER_OUT] [--batch-size BATCH_SIZE]

options:
  -h, --help            show this help message and exit
  --filter-out FILTER_OUT
                        Filter out models, backends, or devices with a Regular Expression.
  --batch-size BATCH_SIZE
                        Batch size for the benchmark. Batch size 1 measures latency. Large batch sizes may not fit on every GPU.
```
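
The `--filter-out` flag takes a regular expression over model names, backends, and devices, so a CPU-only run at the batch size used below might look like this (the pattern itself is just an example):

```sh
$ python python/scripts/bench_encoders.py --batch-size 50 --filter-out cuda
```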

Running that script with a fairly small batch size of 50 on an Nvidia H100 GPU, one can expect the following numbers:

| Model Name | Device | Backend | Images Preprocessed/s | Images Encoded/s | Texts Preprocessed/s | Texts Encoded/s |
| :--------- | :----- | :------ | --------------------: | ---------------: | -------------------: | --------------: |
| unum-cloud/uform3-image-text-english-base      | cpu  | torch | 23.03 |    76.57 | 15,978.03 |    562.28 |
| unum-cloud/uform3-image-text-english-base      | cpu  | onnx  | 23.11 |    77.75 | 13,880.27 |  1,067.40 |
| unum-cloud/uform3-image-text-english-base      | cuda | torch | 22.87 | 1,060.40 | 12,348.94 | 13,242.83 |
| unum-cloud/uform3-image-text-english-large     | cpu  | torch | 22.41 |    10.84 | 13,350.45 |    145.12 |
| unum-cloud/uform3-image-text-english-large     | cpu  | onnx  | 23.13 |    19.60 | 18,031.85 |    960.09 |
| unum-cloud/uform3-image-text-english-large     | cuda | torch | 22.78 |   244.86 | 13,226.40 | 10,204.04 |
| unum-cloud/uform3-image-text-english-small     | cpu  | torch | 20.08 |    71.68 | 12,147.05 |    249.63 |
| unum-cloud/uform3-image-text-english-small     | cpu  | onnx  | 22.84 |   195.27 | 13,636.99 |  1,385.25 |
| unum-cloud/uform3-image-text-english-small     | cuda | torch | 22.63 | 2,662.16 | 14,731.18 | 14,694.87 |
| unum-cloud/uform3-image-text-multilingual-base | cpu  | torch | 22.98 |    64.28 | 10,129.27 |    209.76 |
| unum-cloud/uform3-image-text-multilingual-base | cpu  | onnx  | 23.06 |    66.81 |  8,963.13 |  1,104.32 |
| unum-cloud/uform3-image-text-multilingual-base | cuda | torch | 22.88 | 1,051.95 | 15,639.72 | 12,416.12 |

If you are interested in performance on consumer-grade hardware compared to third-party models, here are some rough estimates for an Nvidia RTX 3090:

| Model                                          | Multilingual | Speed                  | Speedup |
| :--------------------------------------------- | :----------- | :--------------------- | ------: |
| bert-base-uncased                               | No           | 1'612 sequences/second |         |
| distilbert-base-uncased                         | No           | 3'174 sequences/second |  x 1.96 |
| sentence-transformers/all-MiniLM-L12-v2         | Yes          | 3'604 sequences/second |  x 2.24 |
| unum-cloud/uform3-image-text-multilingual-base  | Yes          | 6'809 sequences/second |  x 4.22 |
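
Such sequences-per-second figures can be approximated with a simple timing loop like the one below; the `bert-base-uncased` stand-in, the batch size, and the repeated sentence are illustrative assumptions rather than the exact benchmark configuration.

```python
# Rough sketch of measuring text-encoding throughput (sequences/second).
# Model name, batch size, and input text are placeholders, not the benchmark setup.
import time
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).to(device).eval()

sentences = ["a photo of a dog playing in the park"] * 1024
batch_size = 128

start = time.perf_counter()
with torch.inference_mode():
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i : i + batch_size], padding=True, return_tensors="pt").to(device)
        model(**batch)
elapsed = time.perf_counter() - start
print(f"{len(sentences) / elapsed:.0f} sequences/second")
```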

Given the small size of the model, it also works well on mobile devices. On Apple M2 Arm chips, the energy efficiency of inference can exceed that of the RTX 3090 GPU and other Ampere-generation cards.

| Device                 | Speed               | Device TDP | Efficiency        |
| :--------------------- | :------------------ | ---------: | ----------------: |
| Nvidia RTX 3090        | ~ 140 tokens/second |    < 350 W | 0.40 tokens/joule |
| Apple M2 Pro unplugged | ~ 19 tokens/second  |     < 20 W | 0.95 tokens/joule |
| Apple M2 Max unplugged | ~ 38 tokens/second  |     < 36 W | 1.06 tokens/joule |
| Apple M2 Max plugged   | ~ 56 tokens/second  |     < 89 W | 0.63 tokens/joule |
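
The efficiency column is simply the decoding speed divided by the power ceiling: on the RTX 3090, 140 tokens/s ÷ 350 W ≈ 0.40 tokens/joule, while the unplugged M2 Max reaches 38 tokens/s ÷ 36 W ≈ 1.06 tokens/joule.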

### Generative Models

```sh
$ python python/scripts/bench_decoders.py --help
usage: bench_decoders.py [-h] [--filter-out FILTER_OUT] [--batch-size BATCH_SIZE]

options:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        Batch size for the benchmark. Batch size 1 measures latency. Large batch sizes may not fit on every GPU.
  --max-length MAX_LENGTH
                        Maximum length of the generated text in tokens.
```

On an Nvidia H100 GPU, the following performance is expected for text token generation using float16, equivalent PyTorch settings, and greedy decoding.

| Model                             |  Size | Decoding Speed | Decoding Parallel Streams    |
| :-------------------------------- | ----: | -------------: | ---------------------------: |
| llava-hf/llava-1.5-7b-hf          |   7 B | ~ 141 tokens/s | ~ 4 K tokens/s (32 streams)  |
| Salesforce/instructblip-vicuna-7b |   7 B | ~ 211 tokens/s | ~ 2 K tokens/s (32 streams)  |
| unum-cloud/uform-gen              | 1.5 B | ~ 252 tokens/s | ~ 3 K tokens/s (128 streams) |
| unum-cloud/uform-gen2-dpo         | 1.2 B | ~ 372 tokens/s | ~ 10 K tokens/s (64 streams) |
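
The per-stream numbers can be approximated with a simple timing harness around greedy generation. The snippet below uses a plain text-only model as a stand-in (the vision-language models above load through their own processor and model classes), so it only illustrates the measurement, not the exact benchmark.

```python
# Rough sketch of measuring greedy-decoding throughput in tokens/s.
# "gpt2" is only a stand-in model; the prompt and token budget are arbitrary.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=dtype).to(device)

inputs = tokenizer("Describe the picture in detail:", return_tensors="pt").to(device)
max_new_tokens = 256

start = time.perf_counter()
with torch.inference_mode():
    output = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
elapsed = time.perf_counter() - start

generated = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{generated / elapsed:.1f} tokens/s")
```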

On an Nvidia RTX 3090, the following performance is expected for text token generation using float16, equivalent PyTorch settings, and greedy decoding.

| Model                             |  Size | Decoding Speed | Speedup |
| :-------------------------------- | ----: | -------------: | ------: |
| llava-hf/llava-1.5-7b-hf          |   7 B |  ~ 40 tokens/s |         |
| Salesforce/instructblip-vicuna-7b |   7 B |  ~ 40 tokens/s |         |
| unum-cloud/uform-gen              | 1.5 B | ~ 140 tokens/s |   x 3.5 |