- A pretrained Japanese BERT base model, trained using the chiTra tokenizer.
- Cleaning processes of the NWJC corpus are added.
  - Total size after cleaning is 79 GB.
- Vocabulary is rebuilt in the same way.
  - Total vocab size is `32597`.
- Sudachi libraries are updated to:
  - SudachiPy: `0.6.6`
  - SudachiDict: `20220729-core`
  - SudachiTra: `0.1.8`
- `word_form_type` is changed to `normalized_nouns`.
- Total training steps is increased to `20472`.
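For reference, a minimal loading sketch using SudachiTra together with HuggingFace Transformers; the directory name `chiTra-1.1/` is an assumption standing in for wherever the downloaded model archive was extracted:

```python
from sudachitra.tokenization_bert_sudachipy import BertSudachipyTokenizer
from transformers import BertModel

# Assumption: the chiTra-1.1 archive has been downloaded and extracted
# into ./chiTra-1.1/ (the directory name is illustrative, not fixed).
tokenizer = BertSudachipyTokenizer.from_pretrained("chiTra-1.1/")
model = BertModel.from_pretrained("chiTra-1.1/")

inputs = tokenizer("吾輩は猫である。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768) for a BERT base model
```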
0.1.8 (2023-03-10)
- Add new `word_form_type`: `normalized_nouns` (#48, #50)
  - Normalizes morphemes that do not have a conjugation form; see the sketch after this list.
- Faster part-of-speech matching (#36)
- Use HuggingFace compatible pretokenizer (#38)
- Fix/Update pretraining scripts and documents (#39, #40, #45, #46)
- Fix GitHub test workflow (#49)
- Enable saving vocab files with duplicated items (#54)
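To illustrate the new word form, here is a minimal sketch of the `normalized_nouns` behavior written against plain SudachiPy (an illustration of the rule, not the library's internal code): a morpheme is replaced by its Sudachi normalized form only when its conjugation fields are `*`, i.e. when it does not conjugate, so verb and adjective endings are left as-is.

```python
from sudachipy import Dictionary, SplitMode

# Requires sudachipy plus a dictionary package such as sudachidict_core.
sudachi = Dictionary().create()

def normalized_nouns(text: str) -> list[str]:
    """Sketch of the normalized_nouns word form: normalize a morpheme
    only if it has no conjugation (e.g. nouns); keep conjugating words
    (verbs, adjectives) as their surface form."""
    tokens = []
    for m in sudachi.tokenize(text, SplitMode.C):
        # part_of_speech() returns a 6-tuple; fields 4 and 5 hold the
        # conjugation type and form, and are '*' for words that do not
        # conjugate.
        if m.part_of_speech()[4] == "*":
            tokens.append(m.normalized_form())
        else:
            tokens.append(m.surface())
    return tokens
```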
- A pretrained Japanese BERT base model, trained using the chiTra tokenizer.
- Model
  - chiTra-1.0 is a BERT base model.
- Corpus
  - We used the NINJAL Web Japanese Corpus (NWJC) from the National Institute for Japanese Language and Linguistics.
  - The cleaning process is explained here.
  - Total size after cleaning is 109 GB.
- Vocabulary
  - The vocabulary is built on the above corpus, using WordPiece with a vocab size of 32000.
  - We added the jōyō kanji (常用漢字) and jinmeiyō kanji (人名用漢字) to cover everyday Japanese text.
  - Total vocab size is `32615`.
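A rough sketch of such a vocabulary build using the HuggingFace `tokenizers` library; the file names, the standard BERT special tokens, and the step that appends the extra kanji after training are assumptions about the procedure, not the actual build script:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Assumption: one character per line, covering the jōyō and jinmeiyō kanji.
with open("jouyou_jinmeiyou_kanji.txt", encoding="utf-8") as f:
    extra_kanji = [line.strip() for line in f if line.strip()]

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
# Assumption: the cleaned NWJC corpus has been pre-segmented into
# space-separated morphemes (e.g. with Sudachi), one sentence per line.
tokenizer.train(files=["nwjc_cleaned.txt"], trainer=trainer)

# Append any listed kanji the trainer did not pick up; growing the
# vocabulary after training is one plausible way a 32000-entry target
# ends up at the reported 32615.
tokenizer.add_tokens([k for k in extra_kanji if tokenizer.token_to_id(k) is None])
```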
- Sudachi libraries
  - SudachiPy: `0.6.2`
  - SudachiDict: `20211220-core`
  - chiTra: `0.1.7`
  - We used `word_form_type`: `normalized_and_surface` (see the sketch below).
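A construction sketch showing how `word_form_type` is passed to the SudachiTra tokenizer; the vocab path is an assumption, and the remaining keyword arguments are left at their defaults:

```python
from sudachitra.tokenization_bert_sudachipy import BertSudachipyTokenizer

# Assumption: vocab.txt comes from the extracted chiTra-1.0 archive.
tokenizer = BertSudachipyTokenizer(
    vocab_file="chiTra-1.0/vocab.txt",
    do_lower_case=False,
    word_form_type="normalized_and_surface",
)
print(tokenizer.tokenize("機能を比較しながら検討しています。"))
```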
- Training Parameters
  - See our paper or the pretraining page.
  - Total training steps is `10236`.