Preferred Generation Benchmark

pfgen-benchmark is a benchmark for evaluating Japanese text generation, designed specifically for pretrained models. Unlike conventional benchmarks, which use templates containing explicit instructions, this benchmark relies solely on providing numerous examples. By conveying expectations purely through examples, such as the question-answering nature of the task, answers of approximately 100 characters, and output resembling formal public documents, it minimizes the influence of differences in instructions or templates. In addition, outputs are evaluated with n-gram-based methods, which, unlike the LLM-as-a-Judge approach, enable fast, low-cost, and deterministic evaluation.
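
The repository's actual scorer produces the Fluency, Truthfulness, and Helpfulness numbers shown on the leaderboard; as a minimal sketch of why n-gram matching is fast and deterministic (illustrative only, not pfgen-bench's metric), a clipped character-trigram precision can be computed like this:

from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    # Count overlapping character n-grams in a string.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_overlap(candidate: str, reference: str, n: int = 3) -> float:
    # Fraction of the candidate's n-grams that also occur in the reference
    # (clipped counts). Deterministic and cheap: no model calls are needed.
    cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
    if not cand:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

# Identical strings score 1.0; unrelated strings score near 0.0.
print(ngram_overlap("富士山は日本一高い山です。", "富士山は日本で最も高い山です。"))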

To enable comparisons of as many models as possible on the same axis, the leaderboard deliberately includes a wide range of models: openly accessible models, models mentioned in academic papers, and models announced by companies through press releases. Contributions of model outputs are encouraged, and results can be submitted via pull requests. For detailed instructions, please refer to the "How to contribute" section.

See more details: Jxiv preprint (doi: 10.51094/jxiv.1008); an arXiv version is TBD.

License of LLM output

Everything in this repository other than the LLM outputs is licensed under the Apache License, Version 2.0. The license of each LLM output depends on the license of the model that produced it.

How to evaluate a model

You can evaluate a model using run-hf.py (which uses transformers) or run-vllm.py (which uses vLLM). For the full list of parameters, see --help. The --num-trials parameter sets the number of prompt patterns for which the model generates answers; choose it by weighing execution time against the required accuracy.

# Run a model using the Hugging Face transformers library.
python ./run-hf.py --model=pfnet/plamo-13b --num-trials=5

# Evaluate output and update leaderboard.
make
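
run-vllm.py is invoked the same way; the lines below assume it accepts the same --model and --num-trials flags as run-hf.py (check --help to confirm):

# Run the same evaluation with vLLM instead of transformers.
python ./run-vllm.py --model=pfnet/plamo-13b --num-trials=5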

How to contribute

Follow the instructions in the "How to evaluate a model" section to run the evaluation. This generates config.json and trials.jsonl.xz under the result directory. Please create a pull request containing only these two files.
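
As an illustration, a contribution pull request would add just the following two files (the model directory name here is a hypothetical placeholder; use the path the run scripts actually create):

result/<model-name>/config.json
result/<model-name>/trials.jsonl.xz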

To ensure more accurate ranking among models, the number of trials (--num-trials) should be as large as possible, up to a limit of 100.

Leaderboard

Rank  Score (±spread/√num-trials)  Model  Length (chars)  Fluency  Truthfulness  Helpfulness
N/A 1.0501 (±0.0000/√1) 👑 system/ground-truth 100.0 (±0.0) 1.155 0.996 1.000
1 0.9303 (±0.0083/√10) 💬 anthropic/claude-3-5-sonnet-20240620 102.2 (±10.4) 0.949 0.959 0.883
2 0.9144 (±0.0037/√2) 💬 deepseek-ai/DeepSeek-V3 87.4 (±14.9) 0.960 0.983 0.800
3 0.8615 (±0.0092/√10) 💬 openai/gpt-4o 84.5 (±18.6) 0.919 0.980 0.686
N/A 0.8494 (±0.0253/√1000) 🎯 system/criteria 100.0 (±3.4) 0.936 0.978 0.505
4 0.8270 (±0.0229/√10) 💬 anthropic/claude-3-opus-20240229 102.3 (±9.5) 0.911 0.944 0.627
5 0.8059 (±0.0169/√5) 💬 google/gemini-2.0-flash-exp 68.0 (±17.7) 0.834 0.984 0.600
6 0.8036 (±0.0133/√10) 💬 openai/gpt-4-turbo 86.5 (±17.4) 0.820 0.959 0.632
7 0.7916 (±0.0146/√10) 💬 openai/gpt-4 107.2 (±11.6) 0.888 0.951 0.536
8 0.7821 (±0.0166/√5) 💬 Qwen/Qwen2.5-72B-Instruct 98.3 (±14.9) 0.871 0.933 0.542
9 0.7789 (±0.0213/√100) 🟢 weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 109.1 (±36.8) 0.890 0.941 0.506
10 0.7773 (±0.0168/√100) 💬 pfnet/plamo-1.0-prime 178.2 (±114.5) 0.874 0.942 0.516
11 0.7768 (±0.0113/√5) 💬 mlx-community/Qwen2.5-72B-Instruct-4bit 100.8 (±17.7) 0.860 0.933 0.538
12 0.7766 (±0.0276/√100) 🟢 tokyotech-llm/Swallow-70b-NVE-hf 104.1 (±17.9) 0.884 0.938 0.507
13 0.7756 (±0.0264/√100) 🟢 tokyotech-llm/Swallow-70b-NVE-instruc... 104.1 (±18.5) 0.878 0.938 0.510
14 0.7748 (±0.0000/√1) 💬 openai/chatgpt-o1 76.3 (±17.7) 0.755 0.960 0.610
15 0.7650 (±0.0263/√100) 🟢 tokyotech-llm/Swallow-70b-instruct-hf 102.5 (±14.4) 0.872 0.929 0.494
16 0.7643 (±0.0000/√1) 💬 openai/chatgpt-o1-pro 79.5 (±17.3) 0.748 0.955 0.590
17 0.7628 (±0.0275/√100) 🟢 tokyotech-llm/Swallow-70b-hf 103.5 (±16.1) 0.876 0.930 0.483
18 0.7601 (±0.0289/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-v0.1 106.3 (±21.0) 0.864 0.925 0.492
19 0.7538 (±0.0251/√100) 🟢 turing-motors/Llama-3-heron-brain-70B... 101.1 (±16.9) 0.857 0.925 0.479
20 0.7483 (±0.0215/√50) 💬 weblab-GENIAC/Tanuki-8x8B-dpo-v1.0 182.6 (±88.9) 0.847 0.922 0.475
21 0.7469 (±0.0270/√100) 🟢 pfnet/plamo-100b-base 115.2 (±64.0) 0.861 0.920 0.460
22 0.7444 (±0.0260/√100) 🟢 sbintuitions/sarashina2-70b 120.0 (±49.4) 0.825 0.923 0.485
23 0.7423 (±0.0302/√100) 💬 cyberagent/Llama-3.1-70B-Japanese-Ins... 199.2 (±110.3) 0.817 0.905 0.505
24 0.7392 (±0.0232/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-I... 93.6 (±23.5) 0.847 0.941 0.429
25 0.7370 (±0.0217/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-70B-I... 97.5 (±19.8) 0.846 0.932 0.433
26 0.7365 (±0.0218/√100) 🟢 CohereForAI/c4ai-command-r-plus 107.5 (±42.3) 0.818 0.913 0.478
27 0.7336 (±0.0254/√100) 🟢 tokyotech-llm/Llama-3-Swallow-70B-v0.1 108.2 (±24.7) 0.837 0.908 0.456
28 0.7320 (±0.0201/√10) 💬 anthropic/claude-3-sonnet-20240229 114.3 (±18.9) 0.810 0.910 0.476
29 0.7249 (±0.0247/√100) 💬 cyberagent/calm3-22b-chat 136.8 (±46.7) 0.813 0.907 0.455
30 0.7246 (±0.0250/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-70B-I... 89.8 (±33.9) 0.812 0.940 0.422
31 0.7217 (±0.0219/√100) 🟢 cyberagent/calm3-22b-chat 105.0 (±13.1) 0.824 0.916 0.425
32 0.7194 (±0.0321/√10) 💬 google/text-bison 77.6 (±31.9) 0.790 0.968 0.401
33 0.7185 (±0.0000/√1) 💬 elyza/Llama-3-ELYZA-JP-70B 98.6 (±33.8) 0.837 0.931 0.388
34 0.7175 (±0.0257/√100) 🟢 nvidia/nemotron-4-340b-instruct 107.3 (±28.4) 0.816 0.908 0.429
35 0.7084 (±0.0207/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 95.9 (±19.7) 0.835 0.930 0.360
36 0.7046 (±0.0248/√100) 💬 nvidia/nemotron-4-340b-instruct 94.5 (±39.1) 0.768 0.910 0.435
37 0.7024 (±0.0238/√100) 🟢 rinna/nekomata-14b 104.3 (±18.0) 0.812 0.912 0.383
38 0.7023 (±0.0271/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-v0.2 112.6 (±33.2) 0.818 0.901 0.388
39 0.7008 (±0.0318/√100) 🟢 tokyotech-llm/Swallow-13b-instruct-hf 104.5 (±13.0) 0.812 0.898 0.392
40 0.6990 (±0.0288/√100) 🟢 tokyotech-llm/Swallow-13b-NVE-hf 106.2 (±19.2) 0.820 0.906 0.371
41 0.6980 (±0.0252/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 98.7 (±50.0) 0.798 0.927 0.369
42 0.6958 (±0.0236/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 92.9 (±20.0) 0.814 0.931 0.343
43 0.6945 (±0.0300/√100) 🟢 sbintuitions/sarashina2-13b 107.8 (±28.3) 0.794 0.900 0.390
44 0.6938 (±0.0217/√100) 🟢 weblab-GENIAC/Tanuki-8B-dpo-v1.0 111.5 (±22.8) 0.800 0.893 0.389
45 0.6924 (±0.0232/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-70B-I... 74.1 (±31.4) 0.755 0.948 0.373
46 0.6891 (±0.0255/√100) 🟢 tokyotech-llm/Swallow-13b-hf 104.8 (±17.7) 0.811 0.901 0.355
47 0.6853 (±0.0201/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-In... 96.6 (±18.8) 0.815 0.919 0.322
48 0.6794 (±0.0243/√100) 🟢 cyberagent/Llama-3.1-70B-Japanese-Ins... 128.8 (±72.2) 0.764 0.883 0.391
49 0.6759 (±0.0232/√10) 🟢 meta-llama/Meta-Llama-3.1-405B 101.2 (±15.1) 0.767 0.892 0.368
50 0.6745 (±0.0152/√10) 💬 google/gemini-1.5-pro-001 52.4 (±15.0) 0.666 0.980 0.377
51 0.6737 (±0.0276/√100) 🟢 sbintuitions/sarashina1-13b 105.4 (±23.4) 0.775 0.882 0.364
52 0.6715 (±0.0284/√100) 🟢 tokyotech-llm/Llama-3.1-Swallow-8B-v0.1 107.5 (±22.2) 0.787 0.881 0.347
53 0.6697 (±0.0277/√100) 🟢 nvidia/nemotron-4-340b-base 106.9 (±26.5) 0.768 0.884 0.357
54 0.6677 (±0.0250/√100) 🟢 llm-jp/llm-jp-3-13b 101.1 (±9.7) 0.770 0.884 0.349
55 0.6673 (±0.0225/√100) 🟢 sbintuitions/sarashina1-65b 104.2 (±20.0) 0.776 0.894 0.332
56 0.6663 (±0.0262/√100) 🟢 tokyotech-llm/Swallow-7b-plus-hf 106.1 (±18.1) 0.780 0.880 0.339
57 0.6656 (±0.0169/√10) 💬 google/gemini-1.5-flash-001 55.1 (±21.7) 0.687 0.967 0.342
58 0.6625 (±0.0140/√10) 💬 anthropic/claude-3-haiku-20240307 81.9 (±31.0) 0.747 0.943 0.298
59 0.6590 (±0.0133/√10) 💬 google/gemini-2.0-flash-thinking-exp-... 49.8 (±11.0) 0.639 0.984 0.354
60 0.6572 (±0.0518/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 108.9 (±63.7) 0.764 0.895 0.313
61 0.6473 (±0.0182/√100) 💬 Qwen/Qwen2-72B-Instruct 108.7 (±24.8) 0.703 0.853 0.386
62 0.6456 (±0.0255/√100) 🟢 sbintuitions/sarashina2-7b 105.6 (±22.8) 0.746 0.874 0.316
63 0.6447 (±0.0251/√100) 💬 tokyotech-llm/Llama-3.1-Swallow-8B-In... 74.3 (±31.3) 0.706 0.934 0.294
64 0.6445 (±0.0241/√100) 🟢 tokyotech-llm/Llama-3-Swallow-8B-v0.1 110.3 (±28.4) 0.748 0.867 0.319
65 0.6399 (±0.1763/√100) 💬 turing-motors/Llama-3-heron-brain-70B... 155.4 (±101.8) 0.718 0.805 0.397
66 0.6368 (±0.0207/√100) 🟢 tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1 105.5 (±21.0) 0.753 0.870 0.287
67 0.6350 (±0.0260/√100) 🟢 karakuri-ai/karakuri-lm-8x7b-instruct... 104.0 (±16.9) 0.755 0.863 0.287
68 0.6337 (±0.0265/√100) 🟢 tokyotech-llm/Swallow-7b-hf 106.5 (±18.7) 0.746 0.866 0.289
69 0.6335 (±0.0252/√100) 🟢 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 103.2 (±16.6) 0.766 0.872 0.263
70 0.6318 (±0.0264/√100) 🟢 tokyotech-llm/Llama-3-Swallow-70B-Ins... 119.2 (±74.3) 0.724 0.861 0.311
71 0.6303 (±0.0252/√100) 🟢 cyberagent/calm2-7b-chat-dpo-experime... 110.0 (±24.3) 0.735 0.863 0.293
72 0.6285 (±0.0239/√100) 🟢 pfnet/nekomata-14b-pfn-qfin-inst-merge 124.7 (±47.2) 0.725 0.866 0.295
73 0.6279 (±0.0252/√100) 🟢 tokyotech-llm/Swallow-7b-NVE-hf 108.1 (±24.5) 0.747 0.870 0.267
74 0.6274 (±0.0772/√100) 🟢 rinna/nekomata-14b-instruction 98.3 (±24.2) 0.732 0.855 0.295
75 0.6267 (±0.0263/√100) 🟢 sbintuitions/sarashina1-7b 106.7 (±25.1) 0.737 0.866 0.276
76 0.6252 (±0.0246/√100) 🟢 karakuri-ai/karakuri-lm-70b-v0.1 106.0 (±27.0) 0.713 0.852 0.310
77 0.6214 (±0.0063/√10) 💬 google/gemini-1.0-pro-001 47.4 (±15.2) 0.635 0.976 0.254
78 0.6202 (±0.0251/√100) 🟢 stabilityai/japanese-stablelm-base-be... 107.3 (±19.2) 0.733 0.848 0.280
79 0.6197 (±0.0258/√100) 🟢 stockmark/stockmark-13b 108.9 (±49.3) 0.727 0.860 0.272
80 0.6191 (±0.0284/√100) 🟢 stockmark/stockmark-13b-instruct 108.0 (±46.8) 0.720 0.859 0.278
81 0.6178 (±0.0230/√100) 🟢 karakuri-ai/karakuri-lm-70b-chat-v0.1 104.7 (±27.5) 0.706 0.842 0.306
82 0.6176 (±0.0249/√100) 🟢 tokyotech-llm/Swallow-7b-instruct-hf 106.3 (±17.8) 0.716 0.851 0.285
83 0.6136 (±0.0143/√10) 💬 openai/gpt-35-turbo 64.0 (±22.2) 0.658 0.944 0.239
84 0.6095 (±0.0225/√100) 💬 rinna/llama-3-youko-70b-instruct 135.3 (±46.8) 0.683 0.817 0.328
85 0.6091 (±0.0277/√100) 🟢 pfnet/nekomata-14b-pfn-qfin 85.1 (±28.4) 0.672 0.893 0.262
86 0.6087 (±0.1545/√100) 💬 tokyotech-llm/Swallow-70b-NVE-instruc... 135.7 (±74.0) 0.678 0.804 0.344
87 0.6060 (±0.0238/√100) 🟢 Qwen/Qwen2-72B 105.5 (±23.5) 0.703 0.836 0.279
88 0.6037 (±0.0239/√100) 🟢 tokyotech-llm/Swallow-7b-NVE-instruct-hf 105.7 (±16.4) 0.719 0.847 0.245
89 0.6030 (±0.0287/√100) 💬 karakuri-ai/karakuri-lm-8x7b-instruct... 197.4 (±72.1) 0.703 0.832 0.274
90 0.6029 (±0.0223/√100) 🟢 Qwen/Qwen2-72B-Instruct 106.0 (±26.7) 0.684 0.825 0.299
91 0.5987 (±0.0264/√100) 🟢 cyberagent/calm2-7b-chat 107.5 (±20.8) 0.701 0.843 0.253
92 0.5971 (±0.0235/√100) 🟢 stockmark/stockmark-100b 107.2 (±24.7) 0.709 0.842 0.240
93 0.5945 (±0.1370/√100) 💬 tokyotech-llm/Swallow-13b-instruct-hf 167.3 (±116.4) 0.670 0.790 0.323
94 0.5921 (±0.0211/√100) 🟢 elyza/Llama-3-ELYZA-JP-8B 115.6 (±44.8) 0.685 0.831 0.260
95 0.5832 (±0.0220/√100) 🟢 augmxnt/shisa-gamma-7b-v1 106.7 (±21.8) 0.706 0.831 0.213
96 0.5825 (±0.0249/√100) 🟢 tokyotech-llm/Swallow-MS-7b-v0.1 106.4 (±25.9) 0.702 0.828 0.218
97 0.5811 (±0.0218/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... 103.6 (±15.6) 0.675 0.816 0.252
98 0.5808 (±0.0220/√100) 🟢 stabilityai/japanese-stablelm-base-ga... 106.9 (±17.2) 0.690 0.822 0.230
99 0.5783 (±0.0217/√100) 🟢 microsoft/Phi-3-medium-4k-instruct 105.9 (±20.0) 0.675 0.826 0.234
100 0.5777 (±0.0228/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 105.2 (±14.5) 0.675 0.811 0.247
101 0.5754 (±0.0182/√100) 🟢 Xwin-LM/Xwin-LM-70B-V0.1 105.4 (±26.8) 0.681 0.833 0.213
102 0.5737 (±0.0209/√100) 🟢 microsoft/Phi-3-medium-128k-instruct 107.7 (±24.7) 0.674 0.825 0.223
103 0.5735 (±0.0216/√100) 🟢 google/gemma-2-9b-it 95.9 (±22.0) 0.674 0.837 0.209
104 0.5734 (±0.1980/√100) 💬 tokyotech-llm/Swallow-70b-instruct-hf 130.9 (±105.0) 0.636 0.758 0.326
105 0.5724 (±0.0209/√100) 🟢 rinna/llama-3-youko-70b 104.6 (±20.6) 0.681 0.826 0.210
106 0.5716 (±0.0230/√100) 🟢 sbintuitions/sarashina2.1-1b 116.9 (±41.3) 0.668 0.821 0.226
107 0.5712 (±0.0194/√100) 💬 karakuri-ai/karakuri-lm-8x7b-chat-v0.1 244.4 (±49.3) 0.678 0.816 0.220
108 0.5710 (±0.0226/√100) 🟢 rinna/llama-3-youko-8b-instruct 111.6 (±23.4) 0.672 0.809 0.232
109 0.5659 (±0.0234/√100) 🟢 meta-llama/Meta-Llama-3.1-70B 103.7 (±20.1) 0.665 0.822 0.211
110 0.5656 (±0.0226/√100) 💬 meta-llama/Meta-Llama-3-70B-Instruct 110.2 (±36.4) 0.665 0.777 0.254
111 0.5646 (±0.0240/√100) 💬 microsoft/Phi-3-medium-4k-instruct 131.3 (±50.6) 0.633 0.807 0.253
112 0.5642 (±0.0261/√100) 🟢 stabilityai/japanese-stablelm-instruc... 105.1 (±19.5) 0.646 0.799 0.247
113 0.5620 (±0.0254/√100) 🟢 meta-llama/Meta-Llama-3-70B 102.0 (±17.2) 0.664 0.809 0.213
114 0.5588 (±0.0230/√100) 🟢 stabilityai/japanese-stablelm-instruc... 105.6 (±17.0) 0.673 0.812 0.191
115 0.5574 (±0.0216/√100) 🟢 rinna/nekomata-7b 108.4 (±18.0) 0.678 0.816 0.178
116 0.5569 (±0.0244/√100) 🟢 rinna/llama-3-youko-8b 104.9 (±17.0) 0.670 0.813 0.188
117 0.5568 (±0.0200/√100) 🟢 meta-llama/Meta-Llama-3-70B-Instruct 111.8 (±55.9) 0.655 0.780 0.236
118 0.5562 (±0.0952/√100) 💬 stockmark/stockmark-13b-instruct 137.2 (±89.6) 0.633 0.798 0.238
119 0.5537 (±0.0204/√100) 🟢 tokyotech-llm/Llama-3-Swallow-8B-Inst... 114.4 (±48.5) 0.657 0.812 0.192
120 0.5516 (±0.1016/√100) 💬 cyberagent/calm2-7b-chat-dpo-experime... 181.1 (±120.1) 0.644 0.775 0.236
121 0.5511 (±0.0203/√100) 🟢 google/gemma-2-27b-it 110.3 (±56.8) 0.599 0.836 0.218
122 0.5500 (±0.0605/√100) 💬 tokyotech-llm/Llama-3-Swallow-70B-Ins... 156.5 (±106.5) 0.633 0.780 0.237
123 0.5500 (±0.0467/√100) 💬 tokyotech-llm/Swallow-7b-instruct-hf 121.9 (±77.3) 0.612 0.812 0.225
124 0.5437 (±0.0218/√100) 💬 Xwin-LM/Xwin-LM-70B-V0.1 200.7 (±63.1) 0.652 0.782 0.198
125 0.5436 (±0.0246/√100) 🟢 llm-jp/llm-jp-3-3.7b 101.3 (±10.4) 0.646 0.795 0.189
126 0.5432 (±0.0208/√100) 💬 CohereForAI/c4ai-command-r-plus 48.9 (±16.5) 0.505 0.931 0.194
127 0.5429 (±0.0238/√100) 🟢 meta-llama/Meta-Llama-3.1-70B-Instruct 157.6 (±221.7) 0.636 0.770 0.222
128 0.5387 (±0.0269/√100) 💬 rinna/llama-3-youko-8b-instruct 265.4 (±104.1) 0.635 0.771 0.210
129 0.5386 (±0.0215/√100) 💬 microsoft/Phi-3-medium-128k-instruct 91.9 (±44.7) 0.589 0.834 0.193
130 0.5377 (±0.0481/√100) 💬 meta-llama/Meta-Llama-3.1-70B-Instruct 135.8 (±194.8) 0.617 0.779 0.218
131 0.5349 (±0.0203/√100) 💬 google/gemma-2-27b-it 74.7 (±42.7) 0.545 0.874 0.186
132 0.5347 (±0.0188/√100) 🟢 rinna/youri-7b 107.6 (±16.3) 0.654 0.802 0.148
133 0.5316 (±0.0273/√100) 💬 lightblue/karasu-7B-chat 111.8 (±46.5) 0.621 0.800 0.174
134 0.5301 (±0.0476/√100) 💬 lightblue/karasu-7B-chat-plus 107.1 (±46.7) 0.615 0.798 0.178
135 0.5283 (±0.0585/√100) 💬 lightblue/karasu-7B-chat-plus-unleashed 104.6 (±45.3) 0.614 0.794 0.177
136 0.5179 (±0.0264/√100) 🟢 cyberagent/calm2-7b 106.0 (±26.2) 0.601 0.770 0.182
137 0.5164 (±0.0209/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... 109.3 (±33.5) 0.606 0.788 0.155
138 0.5143 (±0.0212/√100) 🟢 llm-jp/llm-jp-13b-v2.0 104.1 (±11.2) 0.604 0.760 0.180
139 0.5143 (±0.0170/√100) 🟢 moneyforward/houou-instruction-7b-v3 112.2 (±37.8) 0.629 0.778 0.135
140 0.5085 (±0.0160/√100) 🟢 moneyforward/houou-instruction-7b-v1 105.9 (±41.0) 0.617 0.781 0.128
141 0.5080 (±0.0306/√100) 💬 stabilityai/japanese-stablelm-instruc... 111.3 (±58.3) 0.548 0.782 0.195
142 0.5073 (±0.0208/√100) 💬 Qwen/Qwen2-57B-A14B-Instruct 154.8 (±89.5) 0.615 0.734 0.173
143 0.5045 (±0.0208/√100) 🟢 Qwen/Qwen2-57B-A14B 106.7 (±22.5) 0.617 0.757 0.139
144 0.5041 (±0.0225/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 106.2 (±29.3) 0.579 0.778 0.155
145 0.5022 (±0.0221/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-jaste... 95.0 (±36.2) 0.579 0.795 0.132
146 0.5013 (±0.0196/√100) 🟢 google/gemma-2-9b 107.3 (±26.0) 0.595 0.761 0.148
147 0.5013 (±0.0375/√100) 💬 karakuri-ai/karakuri-lm-70b-chat-v0.1 427.4 (±151.5) 0.579 0.723 0.202
148 0.5002 (±0.0218/√100) 🟢 Qwen/Qwen-72B-Chat 223.0 (±258.3) 0.614 0.716 0.171
149 0.4995 (±0.0211/√100) 💬 Qwen/Qwen1.5-72B-Chat 119.3 (±58.1) 0.582 0.708 0.208
150 0.4963 (±0.0189/√100) 🟢 Qwen/Qwen1.5-72B-Chat 128.1 (±77.7) 0.586 0.698 0.206
151 0.4959 (±0.0235/√100) 🟢 llm-jp/llm-jp-13b-v1.0 115.0 (±40.9) 0.576 0.756 0.156
152 0.4953 (±0.0203/√100) 🟢 meta-llama/Llama-2-70b-hf 110.4 (±25.8) 0.596 0.745 0.145
153 0.4949 (±0.0177/√100) 💬 moneyforward/houou-instruction-7b-v1 180.5 (±66.6) 0.604 0.734 0.146
154 0.4931 (±0.0247/√100) 🟢 Rakuten/RakutenAI-7B-instruct 105.6 (±33.1) 0.598 0.750 0.132
155 0.4921 (±0.0219/√100) 🟢 Rakuten/RakutenAI-7B-chat 114.9 (±44.7) 0.592 0.760 0.124
156 0.4916 (±0.0201/√100) 🟢 moneyforward/houou-instruction-7b-v2 104.7 (±41.2) 0.588 0.770 0.116
157 0.4895 (±0.0440/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 268.1 (±133.1) 0.548 0.722 0.199
158 0.4872 (±0.0237/√100) 🟢 lightblue/karasu-7B 110.1 (±19.0) 0.586 0.739 0.137
159 0.4870 (±0.0215/√100) 🟢 Qwen/Qwen-72B 134.6 (±114.6) 0.593 0.715 0.152
160 0.4868 (±0.0163/√100) 💬 google/gemma-2-9b-it 47.6 (±14.6) 0.477 0.880 0.104
161 0.4863 (±0.1167/√100) 💬 pfnet/nekomata-14b-pfn-qfin-inst-merge 93.4 (±55.0) 0.544 0.721 0.194
162 0.4862 (±0.0221/√100) 🟢 Qwen/Qwen2-57B-A14B-Instruct 116.9 (±82.5) 0.601 0.734 0.124
163 0.4857 (±0.0168/√100) 💬 moneyforward/houou-instruction-7b-v2 207.0 (±57.3) 0.591 0.719 0.147
164 0.4829 (±0.0211/√100) 🟢 Qwen/Qwen1.5-72B 136.2 (±85.6) 0.591 0.705 0.153
165 0.4827 (±0.0464/√100) 💬 llm-jp/llm-jp-13b-instruct-full-ac_00... 269.1 (±131.5) 0.542 0.716 0.191
166 0.4762 (±0.0810/√100) 💬 stabilityai/japanese-stablelm-instruc... 126.2 (±67.4) 0.545 0.726 0.158
167 0.4746 (±0.0210/√100) 🟢 rinna/youri-7b-chat 102.1 (±16.4) 0.571 0.752 0.100
168 0.4744 (±0.0227/√100) 🟢 pfnet/plamo-13b 108.2 (±28.5) 0.558 0.749 0.116
169 0.4743 (±0.0987/√100) 💬 tokyotech-llm/Swallow-7b-NVE-instruct-hf 129.0 (±72.8) 0.535 0.725 0.163
170 0.4730 (±0.0166/√100) 🟢 Xwin-LM/Xwin-LM-13B-V0.2 109.7 (±27.4) 0.582 0.723 0.114
171 0.4723 (±0.0204/√100) 💬 Rakuten/RakutenAI-7B-chat 233.0 (±133.0) 0.565 0.734 0.118
172 0.4723 (±0.0808/√100) 💬 tokyotech-llm/Llama-3-Swallow-8B-Inst... 199.3 (±155.6) 0.563 0.699 0.154
173 0.4698 (±0.0200/√100) 🟢 Rakuten/RakutenAI-7B 105.4 (±25.6) 0.576 0.721 0.113
174 0.4692 (±0.0161/√100) 🟢 shisa-ai/shisa-v1-qwen2-7b 109.0 (±23.9) 0.563 0.712 0.133
175 0.4661 (±0.0210/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-dolly... 111.6 (±44.2) 0.536 0.756 0.106
176 0.4659 (±0.0438/√100) 💬 deepseek-ai/deepseek-llm-67b-chat 146.0 (±62.1) 0.555 0.703 0.139
177 0.4659 (±0.0202/√100) 🟢 llm-jp/llm-jp-3-1.8b 105.0 (±16.9) 0.568 0.725 0.105
178 0.4648 (±0.1659/√100) 💬 cyberagent/calm2-7b-chat 124.7 (±95.9) 0.536 0.688 0.171
179 0.4622 (±0.0195/√100) 🟢 Qwen/Qwen-14B-Chat 135.5 (±84.3) 0.572 0.718 0.097
180 0.4619 (±0.0162/√100) 💬 lmsys/vicuna-13b-v1.5-16k 126.5 (±48.4) 0.574 0.715 0.097
181 0.4609 (±0.0113/√10) 🟢 google/gemma-2-2b-jpn-it 69.4 (±24.1) 0.509 0.805 0.069
182 0.4607 (±0.0165/√100) 🟢 SakanaAI/EvoLLM-JP-v1-7B 111.2 (±30.4) 0.579 0.708 0.095
183 0.4601 (±0.0184/√100) 🟢 shisa-ai/shisa-v1-llama3-8b 112.9 (±31.4) 0.557 0.703 0.120
184 0.4597 (±0.0268/√100) 🟢 CohereForAI/c4ai-command-r-v01 179.2 (±166.3) 0.590 0.592 0.197
185 0.4586 (±0.0141/√100) 🟢 google/gemma-2-2b-it 88.2 (±30.8) 0.536 0.761 0.079
186 0.4561 (±0.0202/√100) 🟢 pfnet/plamo-13b-instruct 144.0 (±147.7) 0.532 0.763 0.073
187 0.4559 (±0.0201/√100) 🟢 pfnet/plamo-13b-instruct-nc 156.0 (±183.1) 0.523 0.768 0.077
188 0.4558 (±0.0156/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 75.3 (±26.6) 0.488 0.804 0.076
189 0.4543 (±0.0217/√100) 🟢 rinna/youri-7b-instruction 96.2 (±29.5) 0.530 0.743 0.090
190 0.4535 (±0.0348/√100) 💬 Rakuten/RakutenAI-7B-instruct 128.6 (±83.2) 0.527 0.726 0.108
191 0.4535 (±0.0183/√100) 🟢 THUDM/glm-4-9b 110.3 (±36.9) 0.554 0.689 0.118
192 0.4527 (±0.0146/√100) 🟢 lmsys/vicuna-13b-v1.5-16k 107.9 (±25.9) 0.576 0.708 0.075
193 0.4504 (±0.0224/√100) 🟢 rinna/nekomata-7b-instruction 96.4 (±23.7) 0.528 0.734 0.089
194 0.4486 (±0.0161/√100) 💬 Qwen/Qwen2-7B-Instruct 163.6 (±61.4) 0.547 0.688 0.111
195 0.4484 (±0.0191/√100) 💬 SakanaAI/EvoLLM-JP-v1-7B 123.9 (±68.1) 0.545 0.706 0.094
196 0.4477 (±0.0205/√100) 🟢 rinna/llama-3-youko-70b-instruct 130.7 (±95.3) 0.527 0.670 0.146
197 0.4426 (±0.0204/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-inst... 111.1 (±28.2) 0.544 0.687 0.097
198 0.4409 (±0.1064/√100) 💬 lightblue/karasu-7B 138.1 (±92.9) 0.512 0.679 0.131
199 0.4404 (±0.0146/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 75.9 (±22.7) 0.493 0.773 0.056
200 0.4387 (±0.0655/√100) 💬 Qwen/Qwen-72B-Chat 117.7 (±137.1) 0.541 0.632 0.143
201 0.4385 (±0.0285/√100) 💬 rinna/youri-7b-chat 95.4 (±41.1) 0.500 0.733 0.083
202 0.4377 (±0.0107/√100) 🟢 google/gemma-1.1-7b-it 86.8 (±21.4) 0.509 0.732 0.072
203 0.4374 (±0.0217/√100) 🟢 Qwen/Qwen1.5-32B-Chat 127.0 (±57.0) 0.538 0.642 0.133
204 0.4336 (±0.0168/√100) 🟢 stabilityai/japanese-stablelm-base-be... 107.1 (±17.2) 0.539 0.689 0.073
205 0.4335 (±0.0221/√100) 🟢 Qwen/Qwen-14B 118.1 (±71.6) 0.530 0.675 0.096
206 0.4332 (±0.0164/√100) 🟢 Qwen/Qwen2-7B-Instruct 119.1 (±45.7) 0.531 0.670 0.098
207 0.4330 (±0.0149/√100) 💬 google/gemma-2-2b-it 56.0 (±27.8) 0.445 0.788 0.066
208 0.4320 (±0.0171/√100) 🟢 Qwen/Qwen2-7B 109.1 (±40.1) 0.532 0.671 0.093
209 0.4296 (±0.0322/√100) 💬 Qwen/Qwen-14B-Chat 159.0 (±69.7) 0.522 0.675 0.092
210 0.4295 (±0.0157/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-instruct 111.5 (±31.4) 0.530 0.676 0.083
211 0.4292 (±0.0181/√100) 💬 Xwin-LM/Xwin-LM-13B-V0.2 240.7 (±48.4) 0.533 0.670 0.085
212 0.4282 (±0.0193/√100) 🟢 stabilityai/japanese-stablelm-3b-4e1t... 110.8 (±26.0) 0.518 0.688 0.078
213 0.4272 (±0.0273/√100) 🟢 mistralai/Mistral-Nemo-Instruct-2407 155.8 (±132.8) 0.548 0.611 0.122
214 0.4265 (±0.0115/√100) 💬 google/gemma-1.1-7b-it 78.7 (±28.4) 0.475 0.739 0.066
215 0.4256 (±0.0270/√100) 🟢 rinna/japanese-gpt-neox-3.6b 129.8 (±73.4) 0.485 0.685 0.106
216 0.4228 (±0.0185/√100) 🟢 stabilityai/japanese-stablelm-base-ja... 110.4 (±28.6) 0.528 0.668 0.073
217 0.4222 (±0.0138/√100) 🟢 Xwin-LM/Xwin-LM-7B-V0.2 110.6 (±29.3) 0.520 0.677 0.070
218 0.4220 (±0.0185/√100) 🟢 lmsys/vicuna-7b-v1.5-16k 111.8 (±31.8) 0.522 0.670 0.074
219 0.4207 (±0.0189/√100) 🟢 stabilityai/japanese-stablelm-3b-4e1t... 112.8 (±27.0) 0.507 0.683 0.072
220 0.4201 (±0.0177/√100) 💬 lmsys/vicuna-7b-v1.5-16k 128.1 (±52.5) 0.514 0.668 0.078
221 0.4164 (±0.0244/√100) 🟢 google/gemma-7b 135.5 (±132.3) 0.533 0.631 0.085
222 0.4150 (±0.0212/√100) 💬 Qwen/Qwen1.5-32B-Chat 125.7 (±250.5) 0.496 0.620 0.130
223 0.4149 (±0.0375/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 186.6 (±108.4) 0.469 0.685 0.090
224 0.4144 (±0.0149/√100) 💬 01-ai/Yi-1.5-34B-Chat 170.6 (±47.1) 0.514 0.628 0.101
225 0.4140 (±0.0208/√100) 🟢 meta-llama/Meta-Llama-3-8B-Instruct 116.8 (±44.3) 0.523 0.637 0.082
226 0.4125 (±0.0303/√100) 💬 CohereForAI/c4ai-command-r-v01 137.7 (±324.6) 0.519 0.562 0.157
227 0.4122 (±0.0199/√100) 🟢 rinna/bilingual-gpt-neox-4b 121.0 (±43.6) 0.485 0.660 0.092
228 0.4097 (±0.0187/√100) 🟢 meta-llama/Meta-Llama-3.1-8B 108.7 (±35.4) 0.512 0.650 0.068
229 0.4087 (±0.0201/√100) 🟢 meta-llama/Llama-2-70b-chat-hf 161.3 (±140.8) 0.519 0.608 0.099
230 0.4087 (±0.0146/√100) 🟢 microsoft/Phi-3-small-8k-instruct 109.1 (±24.1) 0.514 0.644 0.068
231 0.4076 (±0.0142/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast-... 109.0 (±32.9) 0.503 0.644 0.076
232 0.4074 (±0.0207/√100) 💬 elyza/ELYZA-japanese-Llama-2-13b-inst... 156.6 (±65.9) 0.490 0.646 0.086
233 0.4073 (±0.0175/√100) 🟢 stabilityai/japanese-stablelm-instruc... 110.0 (±26.5) 0.490 0.663 0.070
234 0.4058 (±0.0295/√100) 💬 rinna/youri-7b-instruction 97.0 (±57.0) 0.439 0.713 0.065
235 0.4050 (±0.0191/√100) 🟢 mistralai/Mixtral-8x22B-v0.1 115.6 (±55.4) 0.517 0.615 0.084
236 0.4048 (±0.0175/√100) 🟢 meta-llama/Meta-Llama-3-8B 109.0 (±19.8) 0.505 0.641 0.068
237 0.4045 (±0.0186/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 133.1 (±57.4) 0.475 0.678 0.061
238 0.4042 (±0.0131/√100) 🟢 microsoft/Orca-2-13b 115.5 (±42.6) 0.510 0.630 0.073
239 0.4041 (±0.0218/√100) 💬 meta-llama/Meta-Llama-3-8B-Instruct 131.4 (±88.3) 0.508 0.614 0.090
240 0.4035 (±0.0151/√100) 🟢 SakanaAI/EvoLLM-JP-A-v1-7B 110.4 (±31.3) 0.508 0.633 0.069
241 0.4033 (±0.0164/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast... 107.2 (±28.5) 0.495 0.643 0.072
242 0.4032 (±0.0237/√100) 🟢 Qwen/Qwen1.5-32B 150.3 (±104.8) 0.505 0.605 0.100
243 0.4024 (±0.0187/√100) 🟢 01-ai/Yi-1.5-34B 109.9 (±28.2) 0.493 0.631 0.083
244 0.4011 (±0.0236/√100) 🟢 cyberagent/open-calm-7b 143.8 (±97.0) 0.472 0.641 0.091
245 0.4006 (±0.0166/√100) 💬 microsoft/Phi-3-small-8k-instruct 189.7 (±84.1) 0.500 0.630 0.073
246 0.4001 (±0.0199/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 117.6 (±48.9) 0.464 0.684 0.052
247 0.3985 (±0.0161/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b 138.4 (±51.8) 0.493 0.634 0.069
248 0.3960 (±0.0199/√100) 🟢 line-corporation/japanese-large-lm-1.7b 179.2 (±174.5) 0.474 0.650 0.065
249 0.3949 (±0.0193/√100) 💬 meta-llama/Meta-Llama-3.1-8B-Instruct 216.6 (±345.2) 0.487 0.624 0.074
250 0.3948 (±0.0190/√100) 💬 Qwen/Qwen1.5-14B-Chat 127.9 (±50.6) 0.500 0.604 0.080
251 0.3946 (±0.0201/√100) 🟢 Qwen/Qwen1.5-14B 130.9 (±67.8) 0.509 0.609 0.066
252 0.3934 (±0.0201/√100) 🟢 stabilityai/japanese-stablelm-instruc... 107.8 (±38.0) 0.466 0.648 0.066
253 0.3914 (±0.0172/√100) 🟢 mistralai/Mixtral-8x7B-Instruct-v0.1 95.1 (±25.2) 0.488 0.636 0.050
254 0.3863 (±0.0160/√100) 🟢 Qwen/Qwen1.5-14B-Chat 131.4 (±55.8) 0.491 0.593 0.075
255 0.3837 (±0.0188/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 117.4 (±42.4) 0.462 0.649 0.041
256 0.3823 (±0.0645/√100) 💬 mistralai/Mistral-Nemo-Instruct-2407 157.9 (±140.3) 0.484 0.563 0.100
257 0.3822 (±0.0647/√100) 💬 llm-jp/llm-jp-13b-instruct-full-dolly... 97.6 (±76.2) 0.397 0.664 0.086
258 0.3819 (±0.0265/√100) 🟢 google/gemma-2-27b 214.2 (±183.3) 0.450 0.608 0.087
259 0.3804 (±0.0161/√100) 🟢 Qwen/Qwen-7B-Chat 140.8 (±65.1) 0.485 0.612 0.045
260 0.3803 (±0.0249/√100) 💬 elyza/ELYZA-japanese-Llama-2-7b-instruct 136.4 (±70.7) 0.452 0.619 0.070
261 0.3772 (±0.0162/√100) 💬 microsoft/Phi-3-small-128k-instruct 199.7 (±111.9) 0.473 0.590 0.069
262 0.3760 (±0.0236/√100) 🟢 cyberagent/open-calm-3b 123.2 (±79.0) 0.442 0.624 0.062
263 0.3759 (±0.0149/√100) 🟢 lmsys/longchat-7b-v1.5-32k 116.9 (±31.6) 0.474 0.609 0.045
264 0.3740 (±0.0164/√100) 🟢 meta-llama/Llama-2-13b-hf 108.5 (±21.8) 0.474 0.603 0.045
265 0.3737 (±0.0197/√100) 🟢 meta-llama/Meta-Llama-3.1-8B-Instruct 204.5 (±303.4) 0.478 0.589 0.055
266 0.3720 (±0.0622/√100) 💬 Xwin-LM/Xwin-LM-7B-V0.2 205.3 (±79.1) 0.466 0.590 0.060
267 0.3720 (±0.0157/√100) 🟢 elyza/ELYZA-japanese-Llama-2-13b-fast 177.5 (±147.2) 0.458 0.598 0.061
268 0.3699 (±0.0345/√100) 💬 Qwen/Qwen-7B-Chat 182.9 (±110.3) 0.468 0.600 0.042
269 0.3694 (±0.0103/√100) 🟢 google/gemma-7b-it 89.7 (±21.6) 0.446 0.640 0.022
270 0.3685 (±0.0173/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b 140.0 (±52.8) 0.462 0.596 0.047
271 0.3673 (±0.0089/√100) 💬 google/gemma-7b-it 110.0 (±47.6) 0.448 0.633 0.020
272 0.3655 (±0.0116/√100) 🟢 deepseek-ai/deepseek-llm-7b-chat 113.9 (±24.7) 0.474 0.579 0.043
273 0.3642 (±0.0165/√100) 🟢 llm-jp/llm-jp-1.3b-v1.0 134.0 (±62.6) 0.437 0.612 0.044
274 0.3637 (±0.0223/√100) 🟢 cyberagent/open-calm-large 122.3 (±73.9) 0.424 0.611 0.056
275 0.3637 (±0.0152/√100) 🟢 elyza/ELYZA-japanese-Llama-2-7b-fast 168.0 (±77.4) 0.452 0.587 0.052
276 0.3632 (±0.0237/√100) 💬 elyza/ELYZA-japanese-Llama-2-7b-fast-... 178.6 (±113.6) 0.443 0.582 0.064
277 0.3628 (±0.0145/√100) 🟢 Qwen/Qwen-7B 117.3 (±39.0) 0.468 0.582 0.039
278 0.3554 (±0.0178/√100) 🟢 meta-llama/Llama-2-7b-chat-hf 139.3 (±93.1) 0.464 0.570 0.031
279 0.3545 (±0.0445/√100) 💬 llm-jp/llm-jp-13b-instruct-full-jaste... 48.8 (±50.1) 0.283 0.723 0.058
280 0.3543 (±0.0439/√100) 💬 lmsys/longchat-7b-v1.5-32k 160.1 (±73.5) 0.448 0.572 0.043
281 0.3538 (±0.0175/√100) 🟢 01-ai/Yi-1.5-9B 113.0 (±29.4) 0.457 0.555 0.050
282 0.3531 (±0.0159/√100) 🟢 mistralai/Mixtral-8x7B-v0.1 94.3 (±20.8) 0.450 0.573 0.037
283 0.3514 (±0.0102/√100) 🟢 google/gemma-1.1-2b-it 80.4 (±21.6) 0.404 0.625 0.025
284 0.3495 (±0.0268/√100) 🟢 cyberagent/open-calm-1b 141.3 (±110.0) 0.412 0.578 0.059
285 0.3471 (±0.0131/√100) 🟢 microsoft/Orca-2-7b 131.1 (±70.7) 0.447 0.555 0.039
286 0.3465 (±0.0202/√100) 💬 deepseek-ai/deepseek-llm-7b-chat 167.2 (±76.5) 0.435 0.562 0.042
287 0.3463 (±0.0178/√100) 💬 mistralai/Mixtral-8x7B-Instruct-v0.1 147.1 (±111.8) 0.448 0.548 0.043
288 0.3449 (±0.0986/√100) 💬 stabilityai/japanese-stablelm-instruc... 109.4 (±66.2) 0.397 0.585 0.053
289 0.3440 (±0.0978/√100) 💬 stabilityai/japanese-stablelm-3b-4e1t... 127.8 (±80.5) 0.401 0.576 0.055
290 0.3436 (±0.0126/√100) 💬 01-ai/Yi-1.5-9B-Chat 143.6 (±60.1) 0.438 0.540 0.053
291 0.3428 (±0.0163/√100) 🟢 meta-llama/Llama-2-7b-hf 112.3 (±28.0) 0.440 0.550 0.038
292 0.3408 (±0.0225/√100) 🟢 anthracite-org/magnum-32b-v2 191.9 (±223.2) 0.442 0.507 0.073
293 0.3393 (±0.0225/√100) 🟢 stockmark/gpt-neox-japanese-1.4b 92.2 (±63.7) 0.351 0.641 0.025
294 0.3322 (±0.0151/√100) 🟢 Qwen/Qwen1.5-7B-Chat 127.7 (±117.0) 0.431 0.520 0.045
295 0.3315 (±0.0203/√100) 🟢 Qwen/Qwen1.5-7B 141.8 (±126.5) 0.445 0.504 0.046
296 0.3313 (±0.0115/√100) 🟢 google/gemma-2b-it 85.9 (±24.7) 0.393 0.577 0.024
297 0.3293 (±0.0252/√100) 💬 Qwen/Qwen1.5-7B-Chat 195.7 (±113.1) 0.429 0.503 0.056
298 0.3276 (±0.0709/√100) 💬 elyza/ELYZA-japanese-Llama-2-13b-fast... 134.0 (±98.8) 0.395 0.543 0.045
299 0.3272 (±0.0101/√100) 💬 01-ai/Yi-1.5-6B-Chat 194.4 (±75.0) 0.426 0.530 0.025
300 0.3187 (±0.0142/√100) 🟢 Qwen/Qwen2-1.5B-Instruct 131.4 (±46.7) 0.421 0.513 0.022
301 0.3172 (±0.0150/√100) 🟢 Qwen/Qwen2-1.5B 120.9 (±30.7) 0.422 0.511 0.019
302 0.3161 (±0.0119/√100) 🟢 deepseek-ai/deepseek-llm-7b-base 113.7 (±21.6) 0.424 0.501 0.024
303 0.3147 (±0.0175/√100) 💬 Qwen/Qwen2-1.5B-Instruct 180.7 (±101.0) 0.408 0.511 0.025
304 0.3078 (±0.0195/√100) 🟢 cyberagent/open-calm-medium 117.3 (±59.4) 0.363 0.537 0.024
305 0.3058 (±0.1106/√100) 💬 rinna/nekomata-7b-instruction 61.2 (±57.0) 0.307 0.567 0.043
306 0.3053 (±0.0177/√100) 🟢 google/gemma-2b 151.5 (±113.6) 0.410 0.480 0.026
307 0.3050 (±0.0190/√100) 🟢 Qwen/Qwen1.5-MoE-A2.7B 146.4 (±90.3) 0.412 0.468 0.035
308 0.2993 (±0.0095/√100) 🟢 01-ai/Yi-1.5-6B-Chat 133.3 (±46.2) 0.394 0.481 0.022
309 0.2993 (±0.0107/√100) 🟢 tiiuae/falcon-11B 121.6 (±31.5) 0.398 0.483 0.016
310 0.2957 (±0.0641/√100) 💬 meta-llama/Llama-2-13b-chat-hf 305.2 (±299.7) 0.402 0.453 0.032
311 0.2953 (±0.0442/√100) 🟢 augmxnt/shisa-base-7b-v1 200.4 (±160.3) 0.378 0.478 0.030
312 0.2924 (±0.0506/√100) 💬 Qwen/Qwen1.5-MoE-A2.7B-Chat 245.1 (±209.1) 0.381 0.453 0.043
313 0.2914 (±0.0133/√100) 🟢 mistralai/Mistral-7B-v0.1 117.4 (±40.4) 0.402 0.454 0.018
314 0.2907 (±0.0175/√100) 🟢 Qwen/Qwen1.5-MoE-A2.7B-Chat 149.8 (±91.0) 0.388 0.448 0.036
315 0.2853 (±0.0163/√100) 🟢 Qwen/Qwen1.5-4B-Chat 127.8 (±71.2) 0.395 0.441 0.019
316 0.2809 (±0.0133/√100) 🟢 Qwen/Qwen1.5-1.8B-Chat 178.3 (±92.0) 0.381 0.445 0.017
317 0.2770 (±0.0131/√100) 🟢 mistralai/Mistral-7B-Instruct-v0.2 146.2 (±70.1) 0.387 0.419 0.024
318 0.2769 (±0.0324/√100) 💬 llm-jp/llm-jp-13b-instruct-full-jaste... 16.9 (±24.6) 0.125 0.693 0.013
319 0.2769 (±0.1029/√100) 💬 stabilityai/japanese-stablelm-instruc... 117.0 (±115.0) 0.307 0.489 0.035
320 0.2666 (±0.0241/√100) 🟢 deepseek-ai/deepseek-llm-67b-chat 140.2 (±83.0) 0.351 0.440 0.009
321 0.2661 (±0.0128/√100) 🟢 Qwen/Qwen1.5-1.8B 129.7 (±65.7) 0.360 0.424 0.014
322 0.2613 (±0.0136/√100) 🟢 Qwen/Qwen2-0.5B-Instruct 176.8 (±98.9) 0.351 0.426 0.007
323 0.2604 (±0.0148/√100) 🟢 mistralai/Mistral-7B-Instruct-v0.1 139.8 (±101.3) 0.367 0.400 0.014
324 0.2598 (±0.0129/√100) 🟢 Qwen/Qwen2-0.5B 122.7 (±43.5) 0.350 0.420 0.009
325 0.2581 (±0.0196/√100) 🟢 cyberagent/open-calm-small 119.1 (±54.1) 0.310 0.460 0.004
326 0.2555 (±0.0163/√100) 🟢 Qwen/Qwen1.5-4B 149.2 (±76.6) 0.363 0.388 0.015
327 0.2543 (±0.0266/√100) 🟢 mosaicml/mpt-30b-chat 121.3 (±46.4) 0.327 0.428 0.008
328 0.2414 (±0.0281/√100) 💬 Qwen/Qwen1.5-1.8B-Chat 480.0 (±210.3) 0.329 0.392 0.003
329 0.2394 (±0.0745/√100) 💬 Qwen/Qwen1.5-4B-Chat 105.3 (±104.1) 0.307 0.390 0.021
330 0.2317 (±0.0455/√100) 💬 mistralai/Mistral-7B-Instruct-v0.1 202.3 (±153.9) 0.320 0.362 0.012
331 0.2231 (±0.0166/√100) 💬 mistralai/Mistral-7B-Instruct-v0.2 261.2 (±166.3) 0.316 0.334 0.019
332 0.2182 (±0.0152/√100) 🟢 microsoft/phi-1 47.6 (±34.3) 0.234 0.420 0.000
333 0.2177 (±0.0110/√100) 🟢 Qwen/Qwen1.5-0.5B-Chat 143.4 (±52.1) 0.317 0.327 0.009
334 0.2169 (±0.0561/√100) 💬 Qwen/Qwen2-0.5B-Instruct 129.5 (±114.3) 0.265 0.379 0.006
335 0.2169 (±0.0218/√100) 🟢 mosaicml/mpt-30b-instruct 109.8 (±36.1) 0.274 0.370 0.008
336 0.2146 (±0.0151/√100) 🟢 microsoft/phi-2 78.0 (±31.4) 0.287 0.356 0.001
337 0.2061 (±0.0820/√100) 💬 meta-llama/Llama-2-70b-chat-hf 523.3 (±444.5) 0.271 0.303 0.045
338 0.2040 (±0.0152/√100) 🟢 Qwen/Qwen1.5-0.5B 138.6 (±55.9) 0.296 0.314 0.003
339 0.2038 (±0.0538/√100) 🟢 mosaicml/mpt-30b 236.5 (±433.3) 0.271 0.334 0.007
340 0.1885 (±0.0194/√100) 🟢 microsoft/phi-1_5 77.5 (±33.6) 0.258 0.306 0.001
341 0.1833 (±0.0406/√100) 💬 google/gemma-1.1-2b-it 32.6 (±26.7) 0.171 0.376 0.003
342 0.1765 (±0.0439/√100) 💬 Qwen/Qwen1.5-0.5B-Chat 214.3 (±172.6) 0.251 0.276 0.002
343 0.1687 (±0.0172/√100) 🟢 upstage/SOLAR-10.7B-v1.0 171.0 (±87.1) 0.265 0.237 0.004
344 0.1544 (±0.0132/√100) 🟢 01-ai/Yi-1.5-34B-Chat 730.0 (±533.6) 0.201 0.256 0.006
345 0.1475 (±0.0826/√100) 💬 mosaicml/mpt-30b-chat 112.2 (±112.4) 0.182 0.254 0.007
346 0.1241 (±0.0558/√100) 💬 google/gemma-2b-it 24.1 (±24.6) 0.115 0.257 0.000
347 0.1226 (±0.0240/√100) 🟢 Deci/DeciLM-7B 174.0 (±165.5) 0.190 0.174 0.003
348 0.1160 (±0.0081/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 212.1 (±148.9) 0.153 0.195 0.000
349 0.1009 (±0.0846/√100) 💬 meta-llama/Llama-2-7b-chat-hf 241.5 (±336.2) 0.136 0.158 0.009
350 0.1004 (±0.0094/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 123.1 (±128.8) 0.119 0.182 0.000
351 0.0987 (±0.0145/√100) 🟢 deepseek-ai/deepseek-llm-67b-base 154.2 (±77.3) 0.174 0.121 0.000
352 0.0982 (±0.1596/√100) 💬 rinna/nekomata-14b-instruction 16.0 (±38.1) 0.115 0.141 0.039
353 0.0955 (±0.0102/√100) 🟢 rinna/japanese-gpt-neox-3.6b-instruct... 129.5 (±141.0) 0.116 0.170 0.000
354 0.0939 (±0.0064/√100) 🟢 sbintuitions/tiny-lm-chat 250.2 (±275.6) 0.133 0.149 0.000
355 0.0936 (±0.0082/√100) 💬 sbintuitions/tiny-lm-chat 276.7 (±209.6) 0.135 0.145 0.000
356 0.0921 (±0.0058/√100) 🟢 sbintuitions/tiny-lm 471.9 (±199.0) 0.135 0.142 0.000
357 0.0880 (±0.0334/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 134.0 (±144.7) 0.105 0.159 0.000
358 0.0762 (±0.0033/√100) 🟢 line-corporation/japanese-large-lm-3.6b 1066.6 (±31.6) 0.125 0.103 0.000
359 0.0760 (±0.0032/√100) 🟢 line-corporation/japanese-large-lm-3.... 1066.4 (±31.8) 0.125 0.103 0.000
360 0.0758 (±0.0034/√100) 💬 line-corporation/japanese-large-lm-3.... 1067.2 (±31.8) 0.125 0.102 0.000
361 0.0673 (±0.0085/√100) 🟢 moneyforward/houou-instruction-7b-v3 143.2 (±112.2) 0.098 0.104 0.000
362 0.0625 (±0.0169/√100) 🟢 llm-jp/llm-jp-13b-instruct-full-ac_00... 31.6 (±10.3) 0.088 0.099 0.000
363 0.0429 (±0.0440/√100) 🟢 rinna/bilingual-gpt-neox-4b-instructi... 31.7 (±54.7) 0.045 0.084 0.000
364 0.0406 (±0.0028/√100) 🟢 microsoft/Phi-3-small-128k-instruct 268.1 (±123.4) 0.083 0.039 0.000
365 0.0337 (±0.0026/√100) 🟢 augmxnt/shisa-7b-v1 590.7 (±238.2) 0.076 0.025 0.000
366 0.0284 (±0.0012/√100) 🟢 lightblue/karasu-7B-chat-plus 285.1 (±53.8) 0.080 0.005 0.000
367 0.0225 (±0.0702/√100) 💬 SakanaAI/EvoLLM-JP-A-v1-7B 5.9 (±27.6) 0.026 0.037 0.005
368 0.0180 (±0.0039/√100) 🟢 mistralai/Mistral-Nemo-Base-2407 607.5 (±344.5) 0.039 0.015 0.000
369 0.0047 (±0.0024/√100) 🟢 ai-forever/mGPT-13B 321.1 (±266.7) 0.008 0.006 0.000
370 0.0022 (±0.0006/√100) 🟢 lightblue/qarasu-14B-chat-plus-unleashed 937.5 (±557.0) 0.004 0.002 0.000
371 0.0019 (±0.0002/√100) 🟢 01-ai/Yi-1.5-9B-Chat 1440.0 (±51.9) 0.005 0.001 0.000
372 0.0018 (±0.0004/√100) 🟢 CohereForAI/aya-23-8B 1676.6 (±351.0) 0.004 0.002 0.000
373 0.0006 (±0.0002/√100) 🟢 meta-llama/Llama-2-13b-chat-hf 1523.9 (±43.5) 0.001 0.001 0.000
374 0.0000 (±0.0000/√100) 🟢 01-ai/Yi-1.5-6B 0.0 (±0.0) 0.000 0.000 0.000
375 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-1.1B 0.0 (±0.0) 0.000 0.000 0.000
376 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-7B-chat-plus-unleashed 0.0 (±0.0) 0.000 0.000 0.000
377 0.0000 (±0.0000/√100) 🟢 lightblue/karasu-7B-chat 0.0 (±0.0) 0.000 0.000 0.000
378 0.0000 (±0.0000/√100) 🟢 lightblue/suzume-llama-3-8B-japanese 300.0 (±0.0) 0.000 0.000 0.000
379 0.0000 (±0.0000/√100) 🟢 lightblue/suzume-llama-3-8B-multilingual 300.0 (±0.0) 0.000 0.000 0.000

Citation

If you use this repository, please cite the following paper:

@preprint{Imos2024-pre-pfgen,
  title={{pfgen-bench: 日本語事前学習モデルのための文章生成性能評価ベンチマーク}},
  author={今城, 健太郎 and 平野, 正徳 and 鈴木, 脩司 and 三上, 裕明},
  doi={10.51094/jxiv.1008},
  year={2024}
}

Or cite this repository directly:

@misc{imajo2024-pfgen,
  title={{Preferred Generation Benchmark}},
  author={Kentaro Imajo and Masanori Hirano and Shuji Suzuki and Hiroaki Mikami},
  year={2024},
  url={https://github.com/pfnet-research/pfgen-bench}
}
