Overview of Japanese LLMs

[ English | Français | 日本語 ]

Parameter sizes of Japanese and non-Japanese LLMs over time

Evolution of parameter sizes for Japanese and non-Japanese LLMs. Information on the Japanese models is derived from this article, while information on the non-Japanese models is taken from the Models table on LifeArchitect.ai. Some models have been omitted from the figure due to space constraints, and some parameter counts for non-Japanese models are estimates. Please notify us of any corrections, additions, or updates.

This is a list of publicly available LLMs trained with a focus on Japanese, along with their evaluation benchmarks. It is maintained by volunteers who collect information from academic papers and other public sources.

::: warning Caution

  1. We can't guarantee the accuracy or completeness of any information here.
  2. Some information is based on conjecture and might not reflect your specific use case.
  3. While many models are released under permissive licenses like MIT or Apache 2.0, some are subject to more restrictive terms, including non-commercial use clauses (e.g., CC BY-NC-SA 4.0) or other stipulations.

:::

Please point out any errors on the issues page. Feel free to contribute directly with a pull request.

::: details Table of Contents
[[toc]]
:::

Text Generation Models

For multimodal models, see below.

Models built from scratch

General purpose

Architecture Max Context Length Training Data Developer License / Terms of Use
Sarashina2-8x70B Mixtral
(8x70b (465b))
8,192 Sparse Upcycling on Sarashina2 (70B) SB Intuitions Sarashina Model NonCommercial License
LLM-jp-3 172B Llama
(172b, 172b-instruct3)
4,096 Pre-training: llm-jp-corpus-v3
(2.1T tokens)
Instruction Tuning: ichikara-instruction, answer-carefully, magpie-sft-v1.0, Daring-Anteater, FLAN, ichikara-instruction-format, AutoMultiTurnByCalm3-22B, ramdom-to-fixed-multiturn-Calm3, wizardlm8x22b-logical-math-coding-sft-ja, wizardlm8x22b-logical-math-coding-sft_additional-ja, Synthetic-JP-EN-Coding-Dataset-567k
DPO: synthetic data
Research and Development Center for Large Language Models Pre-trained model: LLM-jp-3 172B Terms of Use
Post-trained model: llm-jp-3-172b-instruct3 Terms of Use
LLM-jp-3 172B beta2 Llama
(172b-beta2, 172b-beta2-instruct2)
4,096 Pre-training: part of llm-jp-corpus-v3
(1.4T tokens)
Instruction Tuning: ichikara-instruction, answer-carefully, magpie-sft-v1.0, Daring-Anteater, FLAN, ichikara-instruction-format, AutoMultiTurnByCalm3-22B, ramdom-to-fixed-multiturn-Calm3, wizardlm8x22b-logical-math-coding-sft-ja, wizardlm8x22b-logical-math-coding-sft_additional-ja, Synthetic-JP-EN-Coding-Dataset-567k
Research and Development Center for Large Language Models LLM-jp-3 172B beta2 Terms of Use
LLM-jp-3 172B beta1 Llama
(172b-beta1, 172b-beta1-instruct)
4,096 Pre-training: part of llm-jp-corpus-v3
(0.7T tokens)
Instruction Tuning: ichikara-instruction, answer-carefully, Dolly Dataset, OASST1, OASST2, Aya Dataset, ichikara-instruction-format, Daring-Anteater, FLAN
Research and Development Center for Large Language Models LLM-jp-3 172B beta1 Terms of Use
LLM-jp-3 172B alpha Llama
(172b-alpha1, 172b-alpha1-instruct, 172b-alpha2, 172b-alpha2-instruct)
4,096 Pre-training: part of llm-jp-corpus-v3
(alpha1: 0.7T tokens, alpha2: 1.4T tokens)
Instruction Tuning: ichikara-instruction, answer-carefully, Dolly Dataset, OASST1, OASST2, Aya Dataset, ichikara-instruction-format, Daring-Anteater, FLAN
Research and Development Center for Large Language Models Apache 2.0
Stockmark-100b Llama
(100b, 100b-instruct-v0.1)
4,096 Pre-training: RedPajama, Japanese Wikipedia, Japanese mC4, Japanese CommonCrawl, Japanese Patent, Stockmark Web Corpus
(910B tokens)
Instruction Tuning (LoRA): ichikara-instruction
Stockmark MIT
PLaMo-100B-Pretrained Llama[^1]
(100b)
4,096 Pre-training: Japanese CommonCrawl, RefinedWeb, undisclosed
(2.0T tokens)
Preferred Elements (Preferred Networks) PLaMo Non-Commercial License
Sarashina2 Llama
(7b, 13b, 70b)
7b, 13b: 4,096
70b: 8,192
Pre-training: Japanese Common Crawl, SlimPajama, StarCoder
(2.1T tokens)
SB Intuitions MIT
Sarashina1 GPT-NeoX
(7b, 13b, 65b)
2,048 Pre-training: Japanese Common Crawl
(1T tokens)
SB Intuitions MIT
Tanuki-8×8B Tanuki (MoE) (47b)
(v1.0, v1.0-AWQ, v1.0-GPTQ-4bit, v1.0-GPTQ-8bit, v1.0-GGUF)
4,096 Pre-training: various Web & synthetic datasets (1.7T tokens)
SFT, DPO: various synthetic datasets[^2]
Matsuo Lab LLM Development Project Apache 2.0
CyberAgentLM3 (CALM3) Llama
(22b-chat)
16,384 undisclosed
(2.0T tokens)
CyberAgent Apache 2.0
LLM-jp-3 13B Llama
(1.8b, 1.8b-instruct, 3.7b, 3.7b-instruct, 13b, 13b-instruct)
4,096 Pre-training: llm-jp-corpus-v3
(2.1T tokens)
Instruction Tuning: ichikara-instruction, answer-carefully, FLAN, ichikara-instruction-format, AutoMultiTurnByCalm3-22B, ramdom-to-fixed-multiturn-Calm3, wizardlm8x22b-logical-math-coding-sft_additional-ja, Synthetic-JP-EN-Coding-Dataset-567k
Research and Development Center for Large Language Models Apache 2.0
llm-jp-3-3.7b-instruct-EZO Llama
(3.7b-instruct-EZO-Common, 3.7b-instruct-EZO-Humanities)
4,096 additionally trained on LLM-jp-3 (3.7B) Axcxept Apache 2.0
LLM-jp-13B v2.0 Llama
(13b-v2.0, 13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0, 13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0, 13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0)
4,096 Pre-training: llm-jp-corpus-v2
(260B tokens)
Instruction Tuning: ichikara-instruction, answer-carefully, Dolly Dataset, OASST1, OASST2
LLM-jp Apache 2.0
Fugaku-LLM GPT
(13B, 13B-instruct, 13B-instruct-gguf)
2,048 Pre-training: undisclosed dataset
Instruction Tuning: OASST1, Dolly Dataset, GSM8K
Titech, Tohoku Univ., Fujitsu, RIKEN, Nagoya Univ., CyberAgent, Kotoba Technologies Fugaku-LLM Terms of Use
LLM-jp-13B v1.1 GPT
(13b-instruct-lora-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1, 13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1, 13b-dpo-lora-hh_rlhf_ja-v1.1)
2,048 Instruction Tuning (LoRA or Full-parameter FT): Dolly Dataset, OASST1, ichikara-instruction
DPO (LoRA): HH RLHF
LLM-jp Apache 2.0
LLM-jp-13B GPT
(1.3b-v1.0, 13b-v1.0, 13b-instruct-full-jaster-v1.0, 13b-instruct-full-jaster-dolly-oasst-v1.0, 13b-instruct-full-dolly-oasst-v1.0, 13b-instruct-lora-jaster-v1.0, 13b-instruct-lora-jaster-dolly-oasst-v1.0, 13b-instruct-lora-dolly-oasst-v1.0)
2,048 Pre-training: llm-jp-corpus (Wikipedia, Japanese mC4, The Pile, Stack) (300B tokens)
Instruction Tuning (Full-parameter FT or LoRA): jaster, Dolly Dataset, OASST1
LLM-jp Apache 2.0
PLaMo-13B Llama[^3]
(13b, 13b-instruct, 13b-instruct-nc)
base: 4,096
instruct, instruct-nc: 8,192
Pre-training: C4, Project Gutenberg, RedPajama, Japanese Wikipedia, Japanese mC4
(1.5T tokens)
Instruction Tuning: Dolly, HH RLHF, OASST1, wikinews (+Alpaca in NC model)
Preferred Networks Apache 2.0
(CC BY-NC 4.0 as for NC model)
Stockmark-13b Llama
(13b, 13b-instruct)
2,048 Pre-training: Japanese Wikipedia, Japanese CC-100, Japanese mC4, Japanese CommonCrawl, Japanese Patent, Stockmark Web Corpus
(220B tokens)
Instruction Tuning (LoRA): ichikara-instruction
Stockmark base: MIT
instruct: CC BY-NC-SA 4.0
Weblab-10B GPT-NeoX
(10b, 10b-instruction-sft)
2,048 Japanese mC4, The Pile
(600B tokens)
Instruction Tuning: Alpaca, FLAN
University of Tokyo Matsuo Lab CC BY‑NC 4.0
Tanuki-8B Tanuki (8b)
(v1.0, v1.0-AWQ, v1.0-GPTQ-4bit, v1.0-GPTQ-8bit, v1.0-GGUF)
4,096 Pre-training: various Web & synthetic datasets (1.3T tokens)
SFT, DPO: various synthetic datasets[^2]
Matsuo Lab LLM Development Project Apache 2.0
Japanese StableLM Alpha GPT-NeoX
(base-alpha-7b, instruct-alpha-7b, instruct-alpha-7b-v2)
2,048 Wikipedia, Japanese CC‑100, Japanese mC4, Japanese OSCAR, RedPajama, private datasets[^4]
(750B tokens)
Instruction Tuning: Dolly, HH‑RLHF, wikinews, Alpaca (discarded in v2)
Stability AI base: Apache 2.0
instruct (v1): Research license
instruct (v2): Apache 2.0
CyberAgentLM2 (CALM2) Llama
(7b, 7b-chat, 7b-chat-dpo-experimental)
base: 4,096
chat: 32,768
publicly available Japanese and English datasets (details unknown)
(1.3T tokens)
DPO: Chatbot Arena Conversations JA (calm2) Dataset
CyberAgent Apache 2.0
(CC BY 4.0 as for DPO model)
OpenCALM GPT-NeoX
(small, medium, large, 1b(1.4b), 3b(2.7b), 7b(6.8b))
2,048 Japanese Wikipedia, Japanese mC4, Japanese CC‑100 CyberAgent CC BY‑SA 4.0
Stormy GPT-NeoX
(7b(6.8b))
2,048 OpenCALM fine-tuned on
llm-japanese-dataset v0 non-translation tasks
University of Tokyo Izumi Lab CC BY‑SA 4.0
rinna GPT
(En-Ja Bilingual)
GPT-NeoX
(4b(3.8b), 4b(3.8b)-8k, 4b(3.8b)-instruction-sft, 4b(3.8b)-instruction-ppo)
8k model: 8,192
others: 2,048
Wikipedia, Japanese CC‑100, Japanese C4, RedPajama, The Pile
(524B tokens)
Instruction Tuning: HH‑RLHF, FLAN
PPO: HH‑RLHF for reinforcement learning
8k: trained with long context
rinna MIT
japanese-large-lm GPT-NeoX
(1.7b, 3.6b, 1.7b-instruction-sft, 3.6b-instruction-sft)
2,048 Japanese Wikipedia, Japanese CC‑100, Japanese C4, Japanese OSCAR and private datasets
(650GB)
Instruction Tuning: OASST1
LINE Apache 2.0
rinna GPT
(Japanese only)
GPT / GPT-NeoX
(xsmall, small, medium, 1b, neox-small, neox-3.6b, neox-3.6b-instruction-sft, neox-3.6b-instruction-sft-v2, neox-3.6b-instruction-ppo)
≤ 2,048 Japanese Wikipedia, Japanese CC‑100
(1b and up models add
Japanese mC4)
Instruction Tuning: HH‑RLHF, FLAN, SHP
PPO: HH‑RLHF for reinforcement learning
rinna MIT
RetrievaT5 T5
(small (short), small (medium), small (long), base (short), base (medium), base (long), large (short), large (medium), large (long), xl(3b))
Japanese Wikipedia, Japanese mC4 Retrieva CC BY‑SA 4.0
Spiral-RetNet-3b-base RetNet
(3b)
2,048 Wikipedia, Japanese CC-100, CulturaX Spiral.AI MIT
kotomamba-2.8B Mamba
(2.8B-v1.0)
2,048 Japanese Wikipedia, Swallow Corpus, SlimPajama Kotoba Technologies Apache 2.0
ABEJA GPT GPT / GPT-NeoX
(large, neox-2.7b)
Japanese Wikipedia, Japanese CC‑100, Japanese OSCAR ABEJA MIT
WasedaGPT GPT
(small, xl(1.5b))
Japanese Wikipedia, Japanese CC‑100 Waseda Kawahara Lab CC BY‑SA 4.0
StockmarkGPT GPT-NeoX
(1.4b)
Japanese Wikipedia (0.88B tokens), Japanese CC‑100 (10.5B tokens), private data (8.6B tokens) Stockmark MIT
YellowbackGPT GPT-NeoX
(1.3b)
Japanese Wikipedia, Japanese CC‑100, Japanese OSCAR Yellowback Apache 2.0
Sarashina2.1-1B Llama
(1b)
8,192 Japanese and English data on the web (10T tokens) SB Intuitions Sarashina Model NonCommercial License
colorfulscoop GPT GPT
(small)
Japanese Wikipedia Colorful Scoop CC BY‑SA 3.0
TitechGPT GPT
(medium, medium-reversed)[^5]
Japanese Wikipedia, Japanese CC‑100 Titech Okazaki Lab CC BY‑SA 4.0
KyotoUniversityGPT GPT
(small, medium, large)
Japanese Wikipedia (3.2GB), Japanese CC‑100 (85GB), Japanese OSCAR (54GB) Kyoto University Language Media Processing Lab CC BY‑SA 4.0
JapaneseBART BART
(base, large)
Japanese Wikipedia (18M sentences) Kyoto University Language Media Processing Lab CC BY‑SA 4.0
Megagon Labs T5 T5
(base)
Japanese mC4 (782 GB), Japanese wiki40b (2 GB) Megagon Labs
(Recruit Co.,Ltd.)
Apache 2.0

Domain Specific

| | Domain | Architecture | Training Data | Developer | License |
| --- | --- | --- | --- | --- | --- |
| Japanese Dialog Transformer | Dialog | Transformer | Twitter Japanese reply pairs | NTT | Evaluation Licence |
| Japanese News BART | Business | BART (base) | Japanese business news articles (21M articles) | Stockmark | MIT |
| AcademicBART | Science | BART (base) | CiNii Japanese Papers | Ehime University AI Lab | Apache 2.0 |

Models built off non-Japanese LLMs (w/ continual pre-training on Japanese)

General purpose

Base Model Training Data Developer License / Terms of Use
Llama 3.1 Swallow 70B
(70B-v0.1, 70B-Instruct-v0.1, 70B-Instruct-v0.3)
Llama 3.1 (70b) Pre-training: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus
Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie
Swallow Project Llama 3.1 Community License
(Gemma Terms of Use is also applied to the Instruct model)
cyberagent/Llama-3.1-70B-Japanese-Instruct-2407 Llama 3.1 (70b) undisclosed CyberAgent Llama 3.1 Community License
Llama 3 Swallow 70B
(70B-v0.1, 70B-Instruct-v0.1)
Llama 3 (70b) Pre-training: Algebraic Stack, Wikipedia, RefinedWeb, Swallow Corpus, Cosmopedia, Laboro ParaCorpus, OpenWebMath
Instruction Tuning: OASST1[^6]
Swallow Project Llama 3 Community License
turing-motors/Llama-3-heron-brain-70B-v0.3 Llama 3 (70b) additionally trained on Llama 3 Swallow 70B (details undisclosed) Turing Llama 3 Community License
Llama 3 Youko 70B
(70b, 70b-instruct, 70b-gptq, 70b-instruct-gptq)
Llama 3 (70b) Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset
(5B tokens)
Instruction Tuning: undisclosed dataset[^7]
rinna Llama 3 Community License
Swallow 70B
(70b-hf, 70b-instruct-hf, 70b-instruct-v0.1, 70b-NVE-hf, 70b-NVE-instruct-hf)
Llama 2 (70b) Pre-training: Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile
Instruction Tuning: Dolly Dataset, HH RLHF, OASST1
*v0.1: OASST1, OASST2
Swallow Project Llama 2 Community License
KARAKURI LM
(70b-v0.1, 70b-chat-v0.1)
Llama 2 (70b) Pre-training: mC4, CC100, OSCAR, RedPajama, undisclosed dataset
(16B tokens)
SteerLM: OASST2, undisclosed dataset
KARAKURI Llama 2 Community License[^8]
Japanese Stable LM Beta 70B
(base-beta-70b, instruct-beta-70b)
Llama 2 (70b) Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3)
(100B tokens)
Instruction Tuning: Dolly Dataset, HH RLHF, OASST1
Stability AI Llama 2 Community License
Swallow-MX 8x7B
(8x7b-NVE-v0.1)
Mixtral-8x7B-Instruct-v0.1 (46.7b) Pre-training: Algebraic Stack, Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile, The Vault Swallow Project Apache 2.0
KARAKURI LM 8x7B Instruct v0.1
(8x7b-instruct-v0.1)
Mixtral-8x7B-Instruct-v0.1 (46.7b) trained Swallow-MX 8x7B on the following datasets: Dolly Dataset, OASST2, HelpSteer, glaive-code-assistant-v3, glaive-function-calling-v2, synthetic_text_to_sql, MetaMathQA, orca-math-word-problems-200k, rag-dataset-12000, rag-hallucination-dataset-1000, undisclosed dataset KARAKURI Apache 2.0 (?)[^9]
KARAKURI LM 8x7B Chat v0.1
(8x7b-chat-v0.1)
Mixtral-8x7B-Instruct-v0.1 (46.7b) trained Swallow-MX 8x7B on OASST2, HelpSteer, and undisclosed datasets using SteerLM KARAKURI Apache 2.0
ABEJA-Mixtral-8x7B-japanese
(8x7B-v0.1-japanese, 8x7B-Instruct-v0.1-japanese, 8x7B-Instruct-v0.1-japanese-alpha, 8x7B-Instruct-v0.1-japanese-alpha-merged)
Mixtral-8x7B-Instruct-v0.1 (46.7b)
*The model without "Instruct" in its name is based on Mixtral-8x7B-v0.1
Pre-training: Japanese CC, Redpajama, undisclosed dataset
(450B tokens)
ABEJA Apache 2.0
Nekomata 14B
(14b, 14b-instruction, 14b-gguf, 14b-instruction-gguf)
Qwen (14b) Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset
(66B tokens)
Instruction Tuning: Dolly Dataset, FLAN, subsets of llm-japanese-dataset
rinna Tongyi Qianwen LICENSE
Swallow 13B
(13b-hf, 13b-instruct-hf, 13b-instruct-v0.1, 13b-NVE-hf)
Llama 2 (13b) Pre-training: Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile
Instruction Tuning: Dolly Dataset, HH RLHF, OASST1
*v0.1: OASST1, OASST2
Swallow Project Llama 2 Community License
LEIA-Swallow-13B
(13b)
Llama 2 (13b) additionally trained Swallow 13B using LEIA Individual (Ikuya Yamada, Ryokan Ri) Llama 2 Community License
ELYZA-japanese-Llama-2-13b
(13b, 13b-instruct, 13b-fast, 13b-fast-instruct)
Llama 2 (13b) Pre-training: Japanese Wikipedia, Japanese OSCAR, and other crawled data
(18B tokens)
Instruction Tuning: undisclosed dataset
ELYZA Llama 2 Community License
cyberagent/Mistral-Nemo-Japanese-Instruct-2408 Mistral NeMo (12b) undisclosed CyberAgent Apache 2.0
Llama 3.1 Swallow 8B
(8B-v0.1, 8B-Instruct-v0.1, 8B-v0.2, 8B-Instruct-v0.2, 8B-Instruct-v0.3)
Llama 3.1 (8b) Pre-training: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus
Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie
Swallow Project Llama 3.1 Community License
(Gemma Terms of Use is also applied to the Instruct model)
Llama 3 Swallow 8B
(8B-v0.1, 8B-Instruct-v0.1)
Llama 3 (8b) Pre-training: Algebraic Stack, Wikipedia, RefinedWeb, Swallow Corpus, Cosmopedia, Laboro ParaCorpus, OpenWebMath
Instruction Tuning: OASST1[^6]
Swallow Project Llama 3 Community License
turing-motors/Llama-3-heron-brain-8B-v0.3 Llama 3 (8b) additionally trained on Llama 3 Swallow 8B (details undisclosed) Turing Llama 3 Community License
Llama 3 Youko 8B
(8b, 8b-instruct, 8b-gptq, 8b-instruct-gptq)
Llama 3 (8b) Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset
(22B tokens)
Instruction Tuning[^7]: Aya Dataset (Japanese subset), FLAN, Dolly Dataset, HH RLHF, OASST1, OASST2, MetaMathQA, CodeAlpaca Dataset, undisclosed dataset
DPO: HelpSteer, HelpSteer2, undisclosed dataset
rinna Llama 3 Community License
Llama 3 ELYZA JP 8B
(8B, 8B-GGUF, 8B-AWQ)
Llama 3 (8b) undisclosed ELYZA Llama 3 Community License
Llama 3 neoAI 8B Chat v0.1
(8B-Chat-v0.1)
Llama 3 (8b) undisclosed neoAI Llama 3 Community License
Llama 3 tedllm
(v0)
Llama 3 (8b) Pre-training: Japanese generic corpus Tokyo Electron Device Llama 3 Community License
Swallow 7B
(7b-hf, 7b-instruct-hf, 7b-instruct-v0.1, 7b-NVE-hf, 7b-NVE-instruct-hf, 7b-plus-hf)
Llama 2 (7b) Pre-training: Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile
Instruction Tuning: Dolly Dataset, HH RLHF, OASST1
*v0.1: OASST1, OASST2
Swallow Project Llama 2 Community License
LEIA-Swallow-7B
(7b)
Llama 2 (7b) additionally trained Swallow 7B using LEIA Individual (Ikuya Yamada, Ryokan Ri) Llama 2 Community License
ELYZA-japanese-Llama-2-7b
(7b, 7b-instruct, 7b-fast, 7b-fast-instruct)
Llama 2 (7b) Pre-training: Japanese Wikipedia, Japanese OSCAR, and other crawled data
(18B tokens)
Instruction Tuning: undisclosed dataset
ELYZA Llama 2 Community License
Youri 7B
(7b, 7b-instruction, 7b-chat, 7b-gptq, 7b-instruction-gptq, 7b-chat-gptq)
Llama 2 (7b) Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset
(40B tokens)
Instruction Tuning: Dolly Dataset, FLAN, subsets of llm-japanese-dataset
rinna Llama 2 Community License
houou-7b
(instruction-7b-v1, instruction-7b-v2, instruction-7b-v3)
Llama 2 (7b) Instruction-tuned Youri 7B (base) on ichikara-instruction MoneyForward Llama 2 Community License
Japanese Stable LM Beta 7B
(base-beta-7b, base-ja_vocab-beta-7b, instruct-beta-7b, instruct-ja_vocab-beta-7b)
Llama 2 (7b) Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3)
(100B tokens)
Instruction Tuning: Dolly Dataset, HH RLHF, OASST1
Stability AI Llama 2 Community License
SambaLingo-Japanese
(Base, Chat)
Llama 2 (7b) Pre-training: CulturaX
Instruction Tuning: ultrachat_200k
DPO: ultrafeedback, cai-conversation-harmless
SambaNova Systems Llama 2 Community License (?)[^9]
blue-lizard
(blue-lizard)
Llama 2 (7b) undisclosed Deepreneur Llama 2 Community License
Swallow-MS 7B
(7b-v0.1, 7b-instruct-v0.1)
Mistral-7B-v0.1 (7b) Pre-training: Algebraic Stack, Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile
Instruction Tuning: Dolly Dataset, OASST1
Swallow Project Apache 2.0
RakutenAI-7B
(7B, 7B-instruct, 7B-chat)
Mistral-7B-v0.1 (7b) Pre-training: undisclosed
Instruction Tuning: Dolly Dataset, OASST1, datasets converted from the train split of NLU datasets (like jaster), undisclosed dataset
Rakuten Apache 2.0
Japanese Stable LM Gamma 7B
(base-gamma-7b, instruct-gamma-7b)
Mistral-7B-v0.1 (7b) Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3)
(100B tokens)
Instruction Tuning: Dolly Dataset, HH RLHF, wikinews subset of llm-japanese-dataset
Stability AI Apache 2.0
ChatNTQ JA 7B
(7b-v1.0)
Mistral-7B-v0.1 (7b) Instruction-tuned Japanese Stable LM Gamma 7B (base) on their own datasets NTQ Solution Apache 2.0
Shisa Gamma 7B
(7b-v1)
Mistral-7B-v0.1 (7b) Instruction-tuned Japanese Stable LM Gamma 7B (base) on ultra-orca-boros-en-ja AUGMXNT Apache 2.0 (?)[^9]
Shisa 7B
(base-7b-v1, 7b-v1)
Mistral-7B-v0.1 (7b) Pre-training: shisa-pretrain-en-ja-v1 (8B tokens)
Instruction Tuning & DPO: ultra-orca-boros-en-ja, shisa-en-ja-dpo-v1
AUGMXNT Apache 2.0 (?)[^9]
Karasu
(7B, 7B-chat, 7B-chat-plus, 7B-chat-plus-unleashed)
Mistral-7B-v0.1 (7b) Additionally trained Shisa 7B (base) on Aozora Bunko, Japanese Law Precedent Dataset, Japanese Wikipedia, Japanese domain webscrapes from the Japanese subset of CulturaX, UltraChat 200k
(7B tokens)
Instruction Tuning: ultra-orca-boros-en-ja-v1, OASST1, ShareGPT, undisclosed dataset
Lightblue Apache 2.0 (?)[^9]
Nekomata 7B
(7b, 7b-instruction, 7b-gguf, 7b-instruction-gguf)
Qwen (7b) Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset
(66B tokens)
Instruction Tuning: Dolly Dataset, FLAN, subsets of llm-japanese-dataset
rinna Tongyi Qianwen LICENSE
lightblue/japanese-mpt-7b MPT (7b) Japanese mC4 Lightblue Apache 2.0
Japanese Stable LM 3B-4E1T
(3b-4e1t-base, 3b-4e1t-instruct)
StableLM-3B-4E1T (3b) Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3)
(100B tokens)
Instruction Tuning: Dolly Dataset, HH RLHF, wikinews subset of llm-japanese-dataset
Stability AI Apache 2.0
kotomamba-2.8B-CL mamba-2.8b-slimpj
(2.8b)
Japanese Wikipedia, Swallow Corpus, SlimPajama Kotoba Technologies Apache 2.0
Gemma 2 Baku 2B
(2b, 2b-it)
Gemma 2 (2b) Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset
(80B tokens)
OPRO: undisclosed dataset[^10]
rinna Gemma Terms of Use
Japanese Stable LM 2 1.6B
(base, instruct)
Stable LM 2 1.6B (1.6b) Pre-training: Wikipedia, CulturaX
Instruction Tuning: jaster, ichikara-instruction, alpaca-gpt4-japanese, ultra-orca-boros-en-ja-v1
Stability AI STABILITY AI NON-COMMERCIAL RESEARCH COMMUNITY LICENSE
karasu-1.1B TinyLlama (1.1b) Pre-training: Japanese OSCAR, Japanese mC4
(3B tokens)
Lightblue Apache 2.0

Domain specific

| | Domain | Base Model | Developer | License |
| --- | --- | --- | --- | --- |
| Llama3-Preferred-MedSwallow-70B (70B) | Medicine | Llama 3 (70b) | Preferred Networks | Llama 3 Community License |
| AIgroup-CVM-utokyohospital/MedSwallow-70b | Medicine | Llama 2 (70b) | University of Tokyo Hospital Department of Cardiovascular Medicine AI Group | CC BY-NC-SA 4.0 |
| nekomata-14b-pfn-qfin (qfin, qfin-inst-merge) | Finance | Qwen (14b) | Preferred Networks | Tongyi Qianwen LICENSE |
| Watashiha-Llama-2-13B-Ogiri-sft (sft, sft-neuron) | Oogiri | Llama 2 (13b) | Watashiha | Llama 2 Community License |
| ELYZA-japanese-CodeLlama-7b (7b, 7b-instruct) | Coding | Code Llama (7b) | ELYZA | Llama 2 Community License |
| AIBunCho/japanese-novel-gpt-j-6b | Storytelling | GPT-J (6b) | Individual (Hiroyuki Osone) | CreativeML OpenRAIL-M License |
| NovelAI/genji-jp | Storytelling | GPT-J (6b) | NovelAI | |

Models built off non-Japanese LLMs (w/ post-training on Japanese)

General purpose

| | Base Model | Training Data | Developer | License / Terms of Use |
| --- | --- | --- | --- | --- |
| AXCXEPT/EZO-Qwen2.5-72B-Instruct, AXCXEPT/EZO-AutoCoTRAG-Qwen2.5-72B-Instruct_q4 | Qwen2.5 (72b) | | Axcxept | Qwen License |
| ao-Karasu (72B) | Qwen1.5 (72b) | ultra-orca-boros-en-ja-v1, OASST1, ShareGPT, Japanese technical blogs, News stories, QA site answers, undisclosed dataset | Lightblue | Tongyi Qianwen LICENSE (?)[^9] |
| AXCXEPT/Llama-3.1-70B-EZO-1.1-it | Llama 3.1 (70b) | | Axcxept | Llama 3.1 Community License |
| Llama 3 shisa-v1-llama3-70b (70b) | Llama 3 (70b) | ultra-orca-boros-en-ja-v1 | Shisa.AI | Llama 3 Community License (?)[^9] |
| AIgroup-CVM-utokyohospital/Llama-2-70b-chat-4bit-japanese | Llama 2 (70b) | | University of Tokyo Hospital Department of Cardiovascular Medicine AI Group | Llama 2 Community License |
| doshisha-mil/llama-2-70b-chat-4bit-japanese-v1 | Llama 2 (70b) | | Doshisha University Media Informatics Lab | |
| AXCXEPT/EZO-Qwen2.5-32B-Instruct, AXCXEPT/EZO-AutoCoTRAG-Qwen2.5-32B-Instruct | Qwen2.5 (32b) | | Axcxept | Apache 2.0 |
| Qarasu (14B-chat-plus-unleashed) | Qwen (14b) | ultra-orca-boros-en-ja-v1, OASST1, ShareGPT, undisclosed dataset | Lightblue | Tongyi Qianwen LICENSE (?)[^9] |
| Sparticle/llama-2-13b-chat-japanese-lora | Llama 2 (13b) | | Sparticle | |
| izumi-lab/llama-13b-japanese-lora-v0-1ep | Llama (13b) | | University of Tokyo Izumi Lab | |
| AXCXEPT/EZO-Common-9B-gemma-2-it | Gemma 2 (9b) | | Axcxept | Gemma Terms of Use |
| AXCXEPT/EZO-Humanities-9B-gemma-2-it | Gemma 2 (9b) | | Axcxept | Gemma Terms of Use |
| AXCXEPT/Llama-3.1-8B-EZO-1.1-it | Llama 3.1 (8b) | | Axcxept | Llama 3.1 Community License |
| Llama 3 Suzume 8B (8B-japanese, 8B-japanese-gguf) | Llama 3 (8b) | megagonlabs/instruction_ja, ShareGPT, undisclosed dataset | Lightblue | Llama 3 Community License (?)[^9] |
| Llama 3 shisa-v1-llama3-8b (8b) | Llama 3 (8b) | ultra-orca-boros-en-ja-v1 | Shisa.AI | Llama 3 Community License (?)[^9] |
| AXCXEPT/Llama-3-EZO-8b-Common-it | Llama 3 (8b) | | Axcxept | Llama 3 Community License |
| ganchengguang/Yoko-7B-Japanese-v1 | Llama 2 (7b) | | Yokohama National University Mori Lab | |
| Sparticle/llama-2-7b-chat-japanese-lora | Llama 2 (7b) | | Sparticle | |
| izumi-lab/llama-7b-japanese-lora-v0-5ep | Llama (7b) | | University of Tokyo Izumi Lab | |
| lightblue/jod | Mistral-7B-SlimOrca (7b) | | Lightblue | Apache 2.0 |
| NTQAI/chatntq-7b-jpntuned | RWKV-4 World (7b) | | NTQ Solution | |
| Borea (Jp, Common, Coding) | Phi-3.5 (3.8b) | | Axcxept | MIT |
| AXCXEPT/EZO-Llama-3.2-3B-Instruct-dpoE | Llama 3.2 (3b) | | Axcxept | Llama 3.2 Community License |
| Gemma-2-JPN (2b-jpn-it) | Gemma 2 (2b) | | Google | Gemma Terms of Use |
| AXCXEPT/EZO-gemma-2-2b-jpn-it | Gemma 2 (2b) | | Axcxept | Gemma Terms of Use |
| AXCXEPT/EZO-Common-T2-2B-gemma-2-it | Gemma 2 (2b) | | Axcxept | Gemma Terms of Use |

Domain specific

| | Domain | Base Model | Developer | License |
| --- | --- | --- | --- | --- |
| JMedLoRA (llama2-jmedlora-6.89ep) | Medicine | Llama 2 (70b) | University of Tokyo Hospital Department of Cardiovascular Medicine AI Group | CC BY-NC 4.0 |
| AXCXEPT/Qwen2.5-Math-7B-Instruct-jp-EZO_OREO | Mathematics | Qwen2.5-Math-7B-Instruct (7b) | Axcxept | Apache 2.0 |

Merged models

| | Original Models (Japanese LLMs in bold) | Developer | License |
| --- | --- | --- | --- |
| EQUES/MedLLama3-JP-v2 | **Llama 3 Swallow 8B (Instruct)**, OpenBioLLM-8B, MMed-Llama 3 8B, **Llama 3 ELYZA JP 8B** | EQUES | Llama 3 Community License |
| EvoLLM-JP-A (v1-7B) | **Shisa Gamma 7B (v1)**, Arithmo2 Mistral 7B, Abel 7B 002 | Sakana AI | Apache 2.0 |
| EvoLLM-JP (v1-7B, v1-10B) | **Shisa Gamma 7B (v1)**, WizardMath-7B-V1.1, Abel 7B 002 | Sakana AI | MICROSOFT RESEARCH LICENSE |

API-based models

| | Max Context Length | Developer | Platform |
| --- | --- | --- | --- |
| Solar mini chat ja (solar-1-mini-chat-ja) | 32,768 | Upstage | self-owned |
| AI Novelist | 2,400 ~ 8,192 | Bit192 | self-owned |
| LHTM-OPT | | alt Inc. | AWS Marketplace |
| tsuzumi (tsuzumi-7b) | | NTT | Azure AI Foundry |

Encoder models

General purpose

| | Architecture | Max Input Length | Training Data | Developer | License | HuggingFace? [^11] |
| --- | --- | --- | --- | --- | --- | --- |
| KyotoUniBERT | BERT (base, large) | 512 | Japanese Wikipedia (18M articles) | Kyoto University Language Media Processing Lab | Apache 2.0 | |
| TohokuUniversityBERT | BERT (base, large) | 512 | base (v1): Japanese Wikipedia (17M articles / 2.6GB)<br>base (v2) & large: Japanese Wikipedia 4.0GB<br>base (v3) & large (v2): Japanese Wikipedia (4.9GB), Japanese CC‑100 (74.3GB) | Tohoku University NLP Group | base (v1, v2) & large: CC BY‑SA 3.0<br>base (v3) & large (v2): Apache 2.0 | (base (v1), base (v1, char-level), base (v2), base (v2, char-level), large, large (char-level), base (v3), base (v3, char-level), large (v2), large (v2, char-level)) |
| TohokuNLP BERT-alpha 500M | Llama-based encoder[^12] | 4,096 or 8,192 | Japanese subset of llm-jp-corpus-v3 | Tohoku University NLP Group | Apache 2.0 | ◯ (sq4096-alpha, sq8192-alpha) |
| NICT BERT | BERT (base) | 512 | Japanese Wikipedia | NICT | CC BY 4.0 | |
| Laboro BERT | BERT (base, large) | 512 | Japanese Web Corpus (News and blogs, etc.) (12GB) | Laboro.AI | CC BY‑NC 4.0 | |
| colorfulscoop BERT | BERT (base) | 512 | Japanese Wikipedia | Colorful Scoop | CC BY‑SA 3.0 | |
| UniversityOfTokyoBERT | BERT (small) | 512 | Japanese Wikipedia (2.9GB) | University of Tokyo Izumi Lab | CC BY‑SA 4.0 | |
| chiTra (Sudachi Transformers) | BERT (base) | 512 | NINJAL Web Japanese Corpus (148GB) | NINJAL, WAP Tokushima Laboratory of AI and NLP | Apache 2.0 | |
| ACCMS BERT | BERT (base) | 512 | Japanese Wikipedia (3.3GB) | Kyoto University ACCMS | CC BY‑SA 4.0 | |
| HitachiBERT | BERT (base) | 512 | Japanese Wikipedia, Japanese CC‑100 | Hitachi | CC BY‑NC‑SA 4.0 [^13] | |
| RetrievaBERT | BERT [^14] | 2,048 | Japanese CommonCrawl, RefinedWeb, Chinese Wikipedia, Korean Wikipedia, The Stack | Retrieva | Apache 2.0 | |
| Bandai Namco DistilBERT | DistilBERT | 512 | (Distillation of TohokuUniversityBERT(base)) | Bandai Namco Research | MIT | |
| Laboro DistilBERT | DistilBERT | 512 | (Distillation of Laboro BERT(base)) | Laboro.AI | CC BY‑NC 4.0 | |
| LINE DistilBERT | DistilBERT | 512 | (Distillation of LINE internal BERT model) | LINE | Apache 2.0 | |
| rinna RoBERTa | RoBERTa (base) | 512 | Japanese Wikipedia, Japanese CC‑100 | rinna | MIT | |
| WasedaRoBERTa | RoBERTa (base, large) | 512 | Japanese Wikipedia, Japanese CC‑100 | Waseda Kawahara Lab | CC BY‑SA 4.0 | (base, large, large (seq512))[^15] |
| InformatixRoBERTa | RoBERTa (base) | 512 | Japanese Wikipedia, Web Articles (25GB) | Informatix | Apache 2.0 | |
| KyotoUniversityRoBERTa | RoBERTa (base, large) | 512 | Japanese Wikipedia, Japanese CC‑100 | Kyoto University Language Media Processing Lab | CC BY‑SA 4.0 | (base (char-level), large (char-level)) |
| YokohamaNationalRoBERTa | RoBERTa (base) | 512 | Japanese Wikipedia (3.45GB) | Yokohama National University Mori Lab | Apache 2.0 | |
| Megagon Labs RoBERTa | RoBERTa (base)[^16] | 1,282 | Japanese mC4 (200M sentences) | Megagon Labs (Recruit Co.,Ltd.) | MIT | |
| ACCMS RoBERTa | RoBERTa (base) | 512 | Japanese Wikipedia (3.3GB) + Japanese CC‑100 (70GB) | Kyoto University ACCMS | CC BY‑SA 4.0 | |
| CinnamonELECTRA | ELECTRA (small) | 512 | Japanese Wikipedia | Cinnamon | Apache 2.0 | |
| Megagon Labs ELECTRA | ELECTRA (base) | 512 | Japanese mC4 (200M sentences) | Megagon Labs (Recruit Co.,Ltd.) | MIT | |
| UniversityOfTokyoELECTRA | ELECTRA (small, base) | 512 | Japanese Wikipedia (2.9GB) | University of Tokyo Izumi Lab | CC BY‑SA 4.0 | (small, base) |
| JapaneseRoFormer | RoFormer (base) | 512 | Japanese Wikipedia (3.45GB) | Yokohama National University Mori Lab | Apache 2.0 | |
| JapaneseLUKE | LUKE (base, large) | 512 | Japanese Wikipedia | Studio Ousia | Apache 2.0 | (base, large) |
| KyotoUniversityDeBERTaV2 | DeBERTaV2 (tiny, base, large) | 512 | Japanese Wikipedia, Japanese CC‑100, Japanese OSCAR (171GB) | Kyoto University Language Media Processing Lab | CC BY‑SA 4.0 | (tiny, tiny (char-level), base, large) |
| KyotoUniversityDeBERTaV3 | DeBERTaV3 (base) | 512 | llm-jp-corpus | Kyoto University Language Media Processing Lab | Apache 2.0 | |
| UniversityOfTokyoDeBERTaV2 | DeBERTaV2 (small, base) | 512 | Japanese Wikipedia, Japanese Wikinews, Japanese CC-100, Japanese mC4, Japanese OSCAR | University of Tokyo Izumi Lab | CC BY-SA 4.0 | ◯ (small, base) |
| GLOBIS DeBERTaV3 | DeBERTaV3 (xsmall, base, large) | 512 | Wikipedia, WikiBooks, Aozora Bunko, Japanese CC-100, Japanese mC4, Japanese OSCAR | GLOBIS | CC BY-SA 4.0 | ◯ (xsmall, base, large) |
| JapaneseBigBird | BigBird (base) | 4,096 | Japanese Wikipedia, Japanese CC‑100, Japanese OSCAR | Waseda Kawahara Lab | CC BY‑SA 4.0 | |
| JapaneseLayoutLM | LayoutLM (base) | 512 | Pre-trained on Japanese Wikipedia, initialized with TohokuUniversityBERT | The Japan Research Institute, Limited | CC BY-SA 3.0 | |

Domain Specific

| | Domain | Architecture | Training Data | Developer | License | HuggingFace? |
| --- | --- | --- | --- | --- | --- | --- |
| JapaneseBlogELECTRA | Colloquial language | ELECTRA (small) | Japanese Blog Corpus (354M sentences) | Kitami Institute of Technology Masui-Ptaszynski Lab | CC BY‑SA 4.0 | |
| JapaneseSpokenLanguageBERT | Spoken language | BERT (base) | Additional training for TohokuUniversityBERT using Corpus of Spontaneous Japanese (CSJ) (the DAPT model also uses Diet records) | Retrieva | Apache 2.0 | |
| AcademicRoBERTa | Science | RoBERTa (base) | CiNii Japanese Papers (6.3M sentences) | Ehime University AI Lab | Apache 2.0 | |
| local-politics-BERT | Politics | BERT (base) | Wikipedia, Minutes of the National Diet, Minutes of the Local Assembly | Japanese Local Assembly Minutes Corpus Project | CC BY-SA 4.0 | ◯ (SC-min, SC-minwiki, SC-2M-wiki, SC-2M-min, SC-2M-minwiki, FP-min, FP-minwiki) [^17] |
| UBKE-LUKE | Economics | LUKE (base) | Japanese Wikipedia, Securities Reports, Economic News Articles | Uzabase | CC BY-NC | |
| JapaneseFinancialBERT | Finance | BERT (small, base)[^18] | Japanese Wikipedia, Japanese Financial Corpus (27M sentences/5.2GB) | University of Tokyo Izumi Lab | CC BY‑SA 4.0 | (small, base) |
| JapaneseFinancialELECTRA | Finance | ELECTRA (small) | Japanese Wikipedia (20M sentences/2.9GB), Japanese Financial Corpus (27M sentences/5.2GB) | University of Tokyo Izumi Lab | CC BY‑SA 4.0 | |
| JapaneseNewsBERT | Business | BERT (base) | Japanese Business Articles (3M articles) | Stockmark | CC BY 4.0 | |
| JapaneseNewsXLNet | Business | XLNet (base) | Japanese Business Articles (3M articles) | Stockmark | | ※ Unofficial release |
| JapaneseNewsALBERT | Business | ALBERT (base) | Japanese Business Articles (3M articles) | Stockmark | | |
| MinpakuBERT | Cultural Heritage | BERT (base) | Additional training with National Museum of Ethnology's cultural heritage data on top of Tohoku University BERT | University of Hyogo Ohshima Lab | MIT | ◯ (minpaku-v1, minpaku-v3, minpaku-v3-no-additional-token) |
| UTH-BERT | Medicine | BERT (base) | Japanese Medical Records (120M lines) | University of Tokyo Hospital Medical AI Development Course | CC BY‑NC‑SA 4.0 | |
| medBERTjp | Medicine | BERT (base) | Japanese Wikipedia, Japanese Medical Corpus ("今日の診療プレミアム/Today's Care Premium" Web Version) | Osaka University Hospital Medical Informatics Lab | CC BY‑NC‑SA 4.0 | |
| JMedRoBERTa | Medicine | RoBERTa (base) | Japanese Medical Papers (11M sentences/1.8GB) | NII Aizawa Lab | CC BY‑NC‑SA 4.0 | (ManbyoWordPiece, SentencePiece)[^19] |

Sentence and Document Embeddings [^20]

Bi-Encoders

Single-representation bi-encoders

| | Max Context Length | Developer | License |
| --- | --- | --- | --- |
| sbintuitions/sarashina-embedding-v1-1b | 8,192 | SB Intuitions | Sarashina Model NonCommercial License |
| RoSEtta<br>(pkshatech/RoSEtta-base-ja) | 1,024 | PKSHA Technology | Apache 2.0 |
| GLuCoSE v2<br>(pkshatech/GLuCoSE-base-ja-v2) | 512 | PKSHA Technology | Apache 2.0 |
| Ruri<br>(cl-nagoya/ruri-pt-small, cl-nagoya/ruri-pt-base, cl-nagoya/ruri-pt-large, cl-nagoya/ruri-small, cl-nagoya/ruri-base, cl-nagoya/ruri-large) | 512 | Nagoya University Sasano Group | Apache 2.0 |
| Japanese SimCSE<br>(cl-nagoya/unsup-simcse-ja-base, cl-nagoya/unsup-simcse-ja-large, cl-nagoya/sup-simcse-ja-base, cl-nagoya/sup-simcse-ja-large) | 512 | Nagoya University Sasano Group | CC BY-SA 4.0 |
| GLuCoSE<br>(pkshatech/GLuCoSE-base-ja) | 512 | PKSHA Technology | Apache 2.0 |
| colorfulscoop/sbert-base-ja | | Colorful Scoop | CC BY‑SA 4.0 |
| MU-Kindai/SBERT-JSNLI-base<br>MU-Kindai/SBERT-JSNLI-large | | Kindai University | |
| MU-Kindai/Japanese-SimCSE-BERT-base-unsup<br>MU-Kindai/Japanese-SimCSE-BERT-large-unsup<br>MU-Kindai/Japanese-SimCSE-RoBERTa-base-unsup<br>MU-Kindai/Japanese-SimCSE-BERT-base-sup<br>MU-Kindai/Japanese-SimCSE-BERT-large-sup | | Kindai University | MIT |
| pkshatech/simcse-ja-bert-base-clcmlp | | PKSHA Technology | CC BY‑SA 4.0 |
| MU-Kindai/Japanese-MixCSE-BERT-base<br>MU-Kindai/Japanese-MixCSE-BERT-large | | Kindai University | MIT |
| MU-Kindai/Japanese-DiffCSE-BERT-base | | Kindai University | MIT |
| bclavie/fio-base-japanese-v0.1 | | Individual (Benjamin Clavié) | |
| cl-nagoya/shioriha-large-pt | | Nagoya University Sasano Group | |

Multi-representation bi-encoders

| | Developer | License |
| --- | --- | --- |
| JaColBERTv2.5<br>(JaColBERTv2.4, JaColBERTv2.5) | Answer.AI | MIT |
| JaColBERTv2<br>(JaColBERTv2) | Individual (Benjamin Clavié) | MIT |
| JaColBERT<br>(JaColBERT) | Individual (Benjamin Clavié) | MIT |

Cross-Encoders

| | Developer | License |
| --- | --- | --- |
| Ruri-Reranker<br>(cl-nagoya/ruri-reranker-stage1-small, cl-nagoya/ruri-reranker-stage1-base, cl-nagoya/ruri-reranker-stage1-large, cl-nagoya/ruri-reranker-small, cl-nagoya/ruri-reranker-base, cl-nagoya/ruri-reranker-large) | Nagoya University Sasano Group | Apache 2.0 |
| hotchpotch/japanese-reranker-cross-encoder-xsmall-v1<br>hotchpotch/japanese-reranker-cross-encoder-small-v1<br>hotchpotch/japanese-reranker-cross-encoder-base-v1<br>hotchpotch/japanese-reranker-cross-encoder-large-v1<br>hotchpotch/japanese-bge-reranker-v2-m3-v1 | Individual (Yuichi Tateno) | MIT |
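
The difference between single-representation and multi-representation bi-encoders (see footnote 20) can be illustrated with a small sketch. It is not taken from any of the models listed above: the vectors are hypothetical toy values standing in for encoder outputs, and the multi-vector score assumes ColBERT-style "MaxSim" late interaction, which the footnote does not spell out.

```python
import math


def cosine_similarity(u, v):
    """Single-representation score: cosine of two sentence-level vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def maxsim_score(query_vecs, doc_vecs):
    """Multi-representation score (assumed ColBERT-style MaxSim).

    For every query token vector, take the largest dot product with any
    document token vector, then sum these maxima over the query tokens.
    """
    total = 0.0
    for q in query_vecs:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
    return total


# Hypothetical toy vectors; a real encoder would produce these.
query_vec = [1.0, 0.0]
doc_vec = [1.0, 1.0]
single_score = cosine_similarity(query_vec, doc_vec)

query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[0.5, 0.5], [0.0, 2.0]]
multi_score = maxsim_score(query_tokens, doc_tokens)
```

In both cases the query and the document are encoded independently, so document vectors can be computed ahead of time. A cross-encoder has no such decomposition, because it scores the concatenated pair in a single forward pass.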

Vision-Language Models

Text+Image to Text

Models built from scratch

General purpose

| | Architecture | Training Data | Developer | License / Terms of Use |
| --- | --- | --- | --- | --- |
| llava-calm2-siglip<br>(llava-calm2-siglip) | LLaVA-1.5 | conversational data generated from MS-COCO and VisualGenome | CyberAgent | Apache 2.0 |
| LLM-jp-3 VILA 14B<br>(14b) | LLaVA-1.5 | Japanese image text pairs, LLaVA-Pretrain, Japanese interleaved data, coyo (subset), mmc4-core (subset), llava-instruct-ja, japanese-photos-conv, ja-vg-vqa, synthdog-ja, LLaVA-1.5 instruction data (subset) | Research and Development Center for Large Language Models | Apache 2.0 & OpenAI Terms of Use |
| Heron<br>(blip-ja-stablelm-base-7b-v0, blip-ja-stablelm-base-7b-v1, blip-ja-stablelm-base-7b-v1-llava-620k, git-ja-stablelm-base-7b-v0, git-ELYZA-fast-7b-v0, git-ja-stablelm-base-7b-v1) | BLIP-2 / GIT | v1: LLaVA-Instruct-150K-JA or LLaVA-Instruct-620K-JA<br>v0: LLaVA-Instruct-150K-JA, Japanese STAIR Captions, Japanese Visual Genome VQA dataset | Turing | CC BY-NC 4.0 |
| Japanese Stable VLM<br>(japanese-stable-vlm) | LLaVA-1.5 | Japanese CC12M, STAIR Captions, Japanese Visual Genome VQA dataset | Stability AI | STABILITY AI JAPANESE STABLE VLM COMMUNITY LICENSE |
| Japanese InstructBLIP Alpha<br>(japanese-instructblip-alpha) | InstructBLIP | Japanese CC12M, STAIR Captions, Japanese Visual Genome VQA dataset | Stability AI | JAPANESE STABLELM RESEARCH LICENSE |
| rinna MiniGPT-4<br>(bilingual-gpt-neox-4b-minigpt4) | MiniGPT-4 | CC12M, COCO 2014, Visual Genome, STAIR Captions, Japanese Visual Genome VQA dataset | rinna | MIT |

Domain Specific

| | Architecture | Domain | Developer | License |
| --- | --- | --- | --- | --- |
| watashiha/Watashiha-Llama-2-13B-Ogiri-sft-vlm | LLaVA | Oogiri | Watashiha | Llama 2 Community License |

Models built off non-Japanese VLMs

| | Base Model | Training Data | Developer | License |
| --- | --- | --- | --- | --- |
| AXCXEPT/EZO-InternVL2-26B | InternVL2 | - | Axcxept | MIT |

Merged models

| | Original Models (Japanese LLMs in bold) | Developer | License |
| --- | --- | --- | --- |
| Llama-3-EvoVLM-JP-v2<br>(v2) | Mantis-8B-SigLIP-Llama-3, **Llama-3-ELYZA-JP-8B**, Bunny-v1.1-Llama-3-8B-V | Sakana AI | Llama 3 Community License |
| AXCXEPT/Llama-3-EZO-VLM-1 | - (trained from Llama-3-EvoVLM-JP-v2) | Axcxept | Llama 3 Community License |
| EvoVLM-JP<br>(v1-7B) | **Shisa Gamma 7B (v1)**, LLaVA-1.6-Mistral-7B | Sakana AI | Apache 2.0 |

Text to Image

General Purpose

| | Architecture | Training Data | Developer | License |
| --- | --- | --- | --- | --- |
| CommonArt β<br>(commonart-beta) | PixArt-Σ | CommonCatalog-cc-by, Megalith-10M, Smithsonian Open Access, ArtBench (CC-0 only) | AI Picasso | Apache 2.0 |
| EvoSDXL-JP<br>(v1) | Stable Diffusion | - (merged from several diffusion models, including Japanese Stable Diffusion XL) | Sakana AI | Apache 2.0[^21] |
| Japanese Stable Diffusion XL<br>(japanese-stable-diffusion-xl) | Stable Diffusion | undisclosed | Stability AI | STABILITY AI JAPANESE STABLE DIFFUSION XL COMMUNITY LICENSE |
| TohokuUniversity Stable Diffusion<br>(base, refiner) | Stable Diffusion | WMT2023 Shared Task English-Japanese parallel corpus, about 13 million captions from laion2B-multi | Tohoku University NLP Group | CreativeML OpenRAIL-M License |
| rinna Stable Diffusion<br>(japanese-stable-diffusion) | Stable Diffusion | LAION-5B Japanese Subset (100M images) | rinna | CreativeML OpenRAIL-M License |

Domain Specific

| | Architecture | Domain | Developer | License |
| --- | --- | --- | --- | --- |
| Evo-Nishikie<br>(v1) | Stable Diffusion (ControlNet) | Ukiyo-e | Sakana AI | Apache 2.0[^21] |
| Evo-Ukiyoe<br>(v1) | Stable Diffusion | Ukiyo-e | Sakana AI | Apache 2.0[^21] |

Text to Video

| | Architecture | Training Data | Developer | License |
| --- | --- | --- | --- | --- |
| AIdeaLab VideoJP<br>(AIdeaLab-VideoJP) | CogVideoX | Pixabay, FineVideo | AIdeaLab | Apache 2.0 |

Others

| | Architecture | Training Data | Developer | License |
| --- | --- | --- | --- | --- |
| LY CLIP<br>(clip-japanese-base) | CLIP | CommonCrawl, CC12M, YFCC100M | LY Corp. | Apache 2.0 |
| Recruit CLIP<br>(japanese-clip-vit-b-32-roberta-base) | CLIP | about 120 million captions from laion2B-multi | Recruit Co.,Ltd. | CC BY-4.0 |
| Japanese Stable CLIP<br>(japanese-stable-clip-vit-l-16) | SigLIP | CC12M translated to Japanese, STAIR Captions | Stability AI | STABILITY AI JAPANESE STABLE CLIP COMMUNITY LICENSE |
| rinna CLIP<br>(japanese-clip-vit-b-16) | CLIP | CC12M translated to Japanese | rinna | Apache 2.0 |
| rinna CLOOB<br>(japanese-cloob-vit-b-16) | CLOOB | CC12M translated to Japanese | rinna | Apache 2.0 |
| HAKUHODO Technologies CLIP<br>(base, deeper, wider) | CLIP | about 120 million captions from laion2B-multi | HAKUHODO Technologies | CC BY-NC-SA 4.0 |

Speech-Language Models

Automatic Speech Recognition

| | Architecture | Training Data | Developer | License |
| --- | --- | --- | --- | --- |
| Kotoba-Whisper<br>(v1.0, v1.0-ggml, v1.0-faster, v1.1, bilingual-v1.0, bilingual-v1.0-ggml, bilingual-v1.0-faster, v2.0, v2.0-ggml, v2.0-faster, v2.1, v2.2) | Distil-Whisper | ReazonSpeech | Kotoba Technologies | Apache 2.0 |
| Nue ASR<br>(nue-asr) | Nue ASR<br>(HuBERT + LLM) | ReazonSpeech | rinna | Apache 2.0 |
| ReazonSpeech<br>(espnet-v1, espnet-next, espnet-v2, nemo-v2) | ESPnet (Conformer-Transducer) / NeMo (FastConformer-RNNT) | ReazonSpeech | Reazon Holdings | Apache 2.0 |

Others

| | Architecture | Training Data | Developer | License |
| --- | --- | --- | --- | --- |
| Kotoba-Speech<br>(v0.1) | Transformer | undisclosed | Kotoba Technologies | Apache 2.0 |
| UniversityOfTokyoHuBERT<br>(base-jtube) | HuBERT | JTubeSpeech | University of Tokyo<br>Saruwatari & Takamichi Lab | MIT |
| rinna HuBERT<br>(base, large) | HuBERT | ReazonSpeech | rinna | Apache 2.0 |
| Reazon wav2vec 2.0<br>(base, large) | wav2vec 2.0 | ReazonSpeech | Reazon Holdings | Apache 2.0 |
| rinna wav2vec 2.0<br>(base) | wav2vec 2.0 | ReazonSpeech | rinna | Apache 2.0 |

Evaluation Benchmarks for Japanese LLMs

Hybrid Benchmarks

| | Description | Developer |
| --- | --- | --- |
| Nejumi LLM Leaderboard3 | Evaluates the Japanese language capabilities of LLMs from three perspectives: language understanding ability, application ability, and alignment (including controllability and safety). For more details, see this article. | Weights & Biases |
| Japanese LLM Evaluation | Conducts a comprehensive evaluation of various LLMs based on three types of tasks: Japanese language understanding and generation tasks, Japanese multi-turn dialogue tasks, and English language understanding and generation tasks. Also publishes swallow-evaluation, an evaluation script that integrates and improves existing LLM evaluation tools. | Swallow Project |

Traditional Benchmarks based on Natural Language Understanding tasks

| | Description | Developer |
| --- | --- | --- |
| Open Japanese LLM Leaderboard | Evaluates Japanese language models in 16 different tasks using llm-jp-eval. | LLM-jp, Hugging Face |
| llm-jp-eval | A tool that evaluates Japanese LLMs automatically across multiple datasets.<br>The complete list of supported datasets can be found here (which also includes tasks such as JNLI and JCommonsenseQA from JGLUE). | LLM-jp |
| JP Language Model Evaluation Harness | A fork by Stability AI of EleutherAI/lm-evaluation-harness. It is a tool for automatically evaluating Japanese LLMs across multiple datasets.<br>The complete list of supported datasets can be found here (which also includes tasks such as JNLI and JCommonsenseQA from JGLUE).<br>There is a detailed summary of the evaluation results by rinna: [rinna] Benchmark of Stability-AI/lm-evaluation-harness | Stability AI |
| JGLUE | Japanese version of the GLUE benchmark suite, including the MARC-ja, JCoLA, JSTS, JNLI, JSQuAD, and JCommonsenseQA tasks. JCoLA is by the University of Tokyo's Oseki Lab. See here and here (ja only) for further details about each task. | Waseda University Kawahara Lab and Yahoo |
| JMMLU | A benchmark constructed as a Japanese version of the MMLU Benchmark, consisting of multiple-choice questions from a wide range of academic fields including natural sciences, humanities, and social sciences. In addition to translating the original MMLU, it features newly added problems based on the unique cultural background of Japan (Japan-specific problems). | Waseda University Kawahara Lab |

Benchmarks on open-ended generative tasks

| | Description | Developer |
| --- | --- | --- |
| Japanese MT-bench | The Japanese version of MT-bench, which evaluates multi-turn conversational ability. It includes 80 questions, 10 from each of 8 categories: Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities. Some questions were modified to fit Japanese culture when the Japanese version was created. It also includes a script that performs a 10-level absolute evaluation by GPT-4. | Stability AI |
| ELYZA-tasks-100 | Ranking based on model responses to 100 complex and diverse tasks, including tasks testing summarization, correction, abstraction, induction, and other skills. Uses humans to score the model responses and then ranks models based on their mean scores. | ELYZA |
| Preferred Generation Benchmark<br>(pfgen-bench) | A benchmark to measure the Japanese language generation ability of LLMs based on 50 common sense questions unique to the Japanese context. It evaluates along three axes: Fluency, Truthfulness, and Helpfulness. The evaluation is conducted without using LLM-as-a-Judge by calculating n-gram or rule-based metrics. | Preferred Elements (Preferred Networks) |
| Rakuda Benchmark | Ranking based on model answers to 40 open-ended questions on Japanese geography, history, politics, and society. Uses GPT-4 to judge model outputs pairwise, and then ranks models by fitting a Maximum Likelihood Elo/Bradley-Terry model to GPT-4's preferences. | YuzuAI |
| Japanese Vicuna QA Benchmark | This is the Japanese version of vicuna-blog-eval, which is the predecessor of MT-Bench. It includes 80 questions on general knowledge, role-playing, common sense, Fermi estimation, counterfactual thinking, coding, mathematics, and writing. It also includes a script for automatic evaluation by GPT-4 (win-rate calculation). The leaderboard can be found here. | Kyoto University Language Media Processing Lab |
| Tengu-Bench | Includes 120 free-form questions from various categories. Categories of questions: table interpretation, logic puzzles, idea generation, function calling, long document summarization (over a thousand tokens), conversation summarization, long document closed QA (over a thousand tokens), honorifics, project creation, math, translation, extraction, ethical control, cost estimation, Japan, chit-chat, puns, formatting, construction, business, legal judgment, politics, hypothetical questions. | Lightblue |
| Shaberi | A framework that can collectively evaluate the Japanese MT-bench, Rakuda Benchmark, ELYZA-tasks-100, and Tengu-Bench. There is also a fork by Shisa.AI. | Lightblue |
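
The Rakuda Benchmark entry mentions ranking models by fitting a Bradley-Terry model to pairwise preferences. The following is a minimal sketch of that idea, not Rakuda's actual implementation: the function name `fit_bradley_terry`, the win counts, and the choice of the classic iterative (minorization-maximization) update are all assumptions made for illustration.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] is the number of comparisons in which model a was
    preferred over model b. Returns strengths normalized to sum to 1.
    Assumes every model has at least one win, so strengths stay positive.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom
        norm = sum(updated.values())
        strength = {m: s / norm for m, s in updated.items()}
    return strength


# Hypothetical judge preferences among three models.
wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("A", "C"): 8, ("C", "A"): 2,
}
strengths = fit_bradley_terry(wins)
ranking = sorted(strengths, key=strengths.get, reverse=True)
```

Under this model, the estimated probability that model `i` is preferred over model `j` is `strengths[i] / (strengths[i] + strengths[j])`, which is why the fitted values can be used directly as a ranking.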

Benchmarks for measuring performance in specific domains

| | Description | Developer |
| --- | --- | --- |
| Japanese Language Model Financial Evaluation Harness | A benchmark for Japanese LLMs in the financial sector. It includes tasks such as sentiment analysis in finance (chabsa), basic knowledge tasks in securities analysis (cma_basics), tasks related to audits in certified public accountant examinations (cpa_audit), multiple choice question tasks in financial planner exams (fp2), and mock exam tasks for securities salespeople exams (security_sales_1). For more details, please see here. | Preferred Networks |
| pfmt-bench-fin-ja | A benchmark for measuring the generation capabilities of Japanese LLMs in the financial domain. | Preferred Networks |
| Stockmark Business Questions | The collection includes 50 questions that probe knowledge on topics such as market trends, current affairs, social issues, and business trends. | Stockmark |
| JMED-LLM | A dataset for evaluating LLMs in the Japanese medical domain. It compiles previously developed Japanese medical language processing tasks for LLM benchmarking. | NAIST Social Computing Lab. |
| JMedBench | A benchmark for LLMs in the Japanese medical field. It includes 20 datasets in 5 types of tasks: multi-choice question-answering, machine translation, named entity recognition, document classification, and semantic textual similarity (some datasets are borrowed from JMMLU and JMED-LLM). A tool called med-eval is developed to facilitate evaluation on JMedBench. | NII Aizawa Lab |
| Japanese Medical Language Model Evaluation Harness | A benchmark for evaluating Japanese LLMs in the medical domain in both Japanese and English, executable by a single command. | Individual (Issey Sukeda) |
| karakuri-bench | A dataset for measuring performance of Japanese LLMs in customer support. | KARAKURI |

Benchmarks for measuring factuality and safety

| | Description | Developer |
| --- | --- | --- |
| JTruthfulQA | The Japanese version of TruthfulQA, a dataset for evaluating the factuality of LLMs. It includes questions about superstitions and other beliefs held by some people that are not factual, as well as questions about Japan-specific knowledge, all collected from scratch. | Waseda University Kawahara Lab |
| JCommonsenseMorality | A dataset on Japanese commonsense morality. Sentences describing actions are labeled with binary values indicating whether they are morally wrong or acceptable. | Hokkaido University Language Media Lab |
| JBBQ | The Japanese version of the social bias QA dataset BBQ, developed through translation, revision, and addition of questions based on Japanese culture and customs. | University of Tokyo Yanaka Lab |

Benchmarks for measuring logical reasoning capabilities

| | Description | Developer |
| --- | --- | --- |
| JFLD (Japanese Formal Logic Deduction) | A dataset for evaluating deductive reasoning capabilities of Japanese LLMs (the Japanese version of the FLD (Formal Logic Deduction) proposed by the same authors). It consists of counterfactual samples so that reasoning is evaluated independently of the knowledge the LLM already possesses. | Hitachi |
| JHumanEval | A Japanese version of the HumanEval benchmark, which assesses the ability to generate Python code from English instructions. In creating the Japanese version, the text was first machine-translated and then manually corrected. | Japan Women's University Kuramitsu Lab |

Benchmarks on controlled text generation

| | Description | Developer |
| --- | --- | --- |
| LCTG Bench | A benchmark for the controllability of Japanese LLMs. It evaluates whether LLMs can adhere to constraints in four aspects: output format, character count, keywords, and forbidden words. The quality of the generated text is also evaluated. | CyberAgent |
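
To give a sense of what checking such constraints can look like, here is a minimal rule-based sketch covering three of the four aspects named above (character count, keywords, forbidden words). It is not LCTG Bench's own evaluation code: the function `check_constraints`, its parameters, and the sample text are hypothetical, and output-format checking is left out because it depends on the format being requested.

```python
def check_constraints(text, max_chars=None, required_keywords=(), forbidden_words=()):
    """Return a pass/fail flag per constraint for one generated text."""
    results = {}
    if max_chars is not None:
        # len() counts Unicode code points, so each kana or kanji counts as one.
        results["character_count"] = len(text) <= max_chars
    results["keywords"] = all(k in text for k in required_keywords)
    results["forbidden_words"] = not any(w in text for w in forbidden_words)
    return results


# Hypothetical model output and constraints.
report = check_constraints(
    "東京は日本の首都です。",
    max_chars=20,
    required_keywords=["東京"],
    forbidden_words=["大阪"],
)
```

Substring matching is the simplest possible rule here; a stricter checker might tokenize the text first so that a keyword embedded in a longer word is not counted.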

Benchmarks for embedding models

| | Description | Developer |
| --- | --- | --- |
| JMTEB | A benchmark developed as the Japanese version of MTEB. It consists of tasks such as document clustering, text classification, sentence similarity, sentence pair labeling prediction, and document retrieval (a reranking task was recently added). | SB Intuitions |
| JQaRA | A dataset for evaluating Japanese document retrieval and reranking accuracy. Each of the 1,667 questions is assigned 100 candidate documents, of which at least one can answer the question. The questions are taken from JAQKET, and the candidate documents are sourced from Japanese Wikipedia. | Individual (Yuichi Tateno) |
| JaCWIR | A dataset created for evaluating document retrieval and reranking in domains other than Wikipedia. Each of the 5,000 questions is assigned one Web page that serves as the source of the question and 99 unrelated Web pages. | Individual (Yuichi Tateno) |
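
For a setup like JaCWIR's, where each question has exactly one relevant page among the candidates, reranking quality is commonly summarized by where the relevant page lands in the sorted list. The sketch below assumes mean reciprocal rank and hit rate at k as the metrics; the entry above does not say which metrics the dataset's authors use, and the scores are hypothetical.

```python
def rank_of_positive(positive_score, negative_scores):
    """1-based rank of the relevant page; ties count against the model."""
    return 1 + sum(1 for s in negative_scores if s >= positive_score)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all questions."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hit_rate_at_k(ranks, k):
    """Fraction of questions whose relevant page is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical reranker scores: (score of the relevant page, scores of unrelated pages).
examples = [
    (0.9, [0.1, 0.3, 0.5]),
    (0.4, [0.6, 0.2, 0.1]),
    (0.2, [0.8, 0.7, 0.1]),
]
ranks = [rank_of_positive(pos, negs) for pos, negs in examples]
mrr = mean_reciprocal_rank(ranks)
hit_at_2 = hit_rate_at_k(ranks, 2)
```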

Benchmarks for vision-language models

| | Description | Developer |
| --- | --- | --- |
| JMMMU | A benchmark constructed as the Japanese version of MMMU Benchmark. It consists of 720 translated MMMU problems and 600 new problems unique to Japanese culture. | University of Tokyo Aizawa Lab |
| JDocQA | A question-answer dataset based on Japanese documents (pamphlets, slides, reports, websites), consisting of a total of 11,600 questions. It includes various question formats, including unanswerable questions. | NAIST Watanabe Lab |
| Heron VLM Leaderboard powered by Nejumi/WandB | Summarizes the evaluation results of Japanese-Heron-Bench and LLaVA-Bench-In-the-Wild (Japanese). | Turing, Weights & Biases |
| Japanese-Heron-Bench | 21 images are assigned a total of 102 questions. It is characterized by image-question pairs that require knowledge related to Japan. | Turing |
| JA-VLM-Bench-In-the-Wild | A dataset independently prepared by Sakana AI to evaluate EvoVLM-JP-v1-7B. It consists of 50 questions assigned to 42 images. It is characterized by images and questions that require knowledge about Japan. | Sakana AI |
| JA-Multi-Image-VQA | A dataset for evaluating the question-answering ability in Japanese for multiple images. | Sakana AI |
| LLaVA-Bench-In-the-Wild (Japanese) | This is the Japanese version of LLaVA-Bench-In-the-Wild, translated using DeepL. It consists of 60 questions assigned to 24 images. | Turing |
| LLaVA-Bench (COCO) Japanese | This is the Japanese version, translated by DeepL, of the LLaVA-Bench (COCO) dataset used to evaluate LLaVA. It consists of 30 images, each with 3 types of questions assigned to them. | Turing |
| Japanese Visual Genome VQA dataset | A question-and-answer dataset annotated based on images from the Visual Genome dataset. A subset of this dataset, JA-VG-VQA-500, consisting of 500 questions, is sometimes used as a benchmark for evaluating VLMs. | Yahoo |

References for Models and Architectures

References for Training Methods

Our Contributors

We love contributors! Feel free to contribute to this project.

contributors

Citation

The summary of this repository is also published as a preprint: Exploring Open Large Language Models for the Japanese Language: A Practical Guide

When referencing this repository, please cite as follows:

@article{awesomeJapanese2024,
    title={{Exploring Open Large Language Models for the Japanese Language: A Practical Guide}},
    author={Kaito Sugimoto},
    doi={10.51094/jxiv.682},
    journal={Jxiv preprint},
    year={2024}
}

Footnotes

  1. Some architectural changes have been made. For details, refer to: 1,000億パラメータ規模の独自LLM「PLaMo-100B」の事前学習

  2. Refer to the following articles: 大規模言語モデルTanuki-8B, 8x8Bの位置づけや開発指針など, 大規模言語モデルを開発するにあたっての事前・事後学習の戦略メモー特に合成データについてー

  3. Some performance enhancements have been made to the original Llama model. See here for details.

  4. Details have not been made public but the private dataset includes data from the EleutherAI Polyglot project's Japanese team and from members of Stable Community Japan.

  5. This project conducted evaluation research on using right-to-left generation instead of the usual left-to-right generation, releasing both left-to-right and right-to-left models.

  6. Before conducting Instruction Tuning, a Chat Vector between Llama 3 Instruct and Llama 3 Base is added.

  7. After conducting Instruction Tuning, a Chat Vector between Llama 3 Instruct and Llama 3 Base is added.

  8. However, if commercial use of KARAKURI LM is desired, direct contact with the developer, KARAKURI Inc., is required.

  9. Because Instruction Tuning uses data generated by OpenAI models such as GPT-3.5 and GPT-4, the model may violate OpenAI's terms of use.

  10. Before conducting Instruction Tuning, a Chat Vector between Gemma 2 Instruct and Gemma 2 Base is added.

  11. ○: The model is on the HuggingFace Model Hub and can be loaded in with the AutoModel.from_pretrained() command. △: The model is not on the Model Hub but can be loaded in manually with the HuggingFace transformers library. ✕: The model is not directly loadable with HuggingFace.

  12. Llama is used as an encoder-type model by removing its causal attention.

  13. This project conducted evaluation research on pre-tokenization morphological analysis and released their best performing model, which used Juman++ and BPE.

  14. However, the maximum sequence length has been extended to 2048, and various architectural changes have been made compared to the original BERT. See the HuggingFace repository README for details.

  15. nlp-waseda/roberta-base-japanese and nlp-waseda/roberta-large-japanese were trained with a 128-token context length, whereas nlp-waseda/roberta-large-japanese-seq512 extends the context length to 512.

  16. Extended to a 1282 context length from the usual 512.

  17. For details of each model, please refer to Chapter 4 of the authors' paper. Note that the SC-2M-wiki model is strictly not a domain-specific model as it is pre-trained only on Wikipedia.

  18. The "small" model trains on Japanese Wikipedia and the Japanese Financial Corpus simultaneously, while the "base" model takes the TohokuUniversityBERT and conducts additional training on the Japanese Financial Corpus.

  19. ManbyoWordPiece conducts a pre-tokenization step using MeCab (IPA+Manbyo dictionaries) and uses WordPiece for subword tokenization, while the SentencePiece model tokenizes text directly using a unigram model.

  20. The classification of embedding models follows Dense Text Retrieval based on Pretrained Language Models: A Survey (Zhao+, 2022). The Bi-Encoder architecture feeds two inputs into the model separately and vectorizes each, using their dot product or cosine similarity as a measure of their proximity. In contrast, the Cross-Encoder architecture feeds the combined inputs into the model to compute their proximity directly inside the model. Although Cross-Encoders incur higher computational costs, they are often used as rerankers in information retrieval because they can compute input proximity more precisely. Among Bi-Encoders, some types (e.g., ColBERT) represent the input as multiple vectors (such as one per token) rather than a single vector, hence the further classification into Single-representation bi-encoders and Multi-representation bi-encoders.

  21. However, the developers ask that it be used with research and educational purposes in mind. Also note that some of the source models are not licensed under Apache 2.0.
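
Footnotes 6, 7, and 10 mention adding a Chat Vector between an Instruct model and its Base model. As a rough sketch of the arithmetic involved, assuming the common definition in which the Chat Vector is the element-wise difference between the two sets of weights, the toy example below uses plain Python lists in place of real tensors. The parameter name `layer.w`, the helper functions, and the `scale` argument are hypothetical, and the footnotes do not state how any particular model applies the vector.

```python
def chat_vector(instruct_weights, base_weights):
    """Element-wise difference (instruct - base) for every parameter."""
    return {
        name: [i - b for i, b in zip(instruct_weights[name], base_weights[name])]
        for name in base_weights
    }


def add_chat_vector(target_weights, vector, scale=1.0):
    """Add the (optionally scaled) Chat Vector to another model's weights."""
    return {
        name: [t + scale * v for t, v in zip(target_weights[name], vector[name])]
        for name in target_weights
    }


# Toy weights standing in for real model parameters (hypothetical values).
base = {"layer.w": [1.0, 2.0]}
instruct = {"layer.w": [1.5, 1.0]}
japanese_model = {"layer.w": [3.0, 3.0]}

vector = chat_vector(instruct, base)
merged = add_chat_vector(japanese_model, vector)
```

The approach requires the target model to share the architecture and parameter shapes of the Base model, since the difference is applied parameter by parameter.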