CHANGELOG.md

File metadata and controls

56 lines (44 loc) · 2.15 KB

chiTra-1.1 model (2023-03-17)

  • A pretrained Japanese BERT base model, trained using the chiTra tokenizer.

Updates / Changes

  • Additional cleaning processes are applied to the NWJC corpus.
    • Total size after cleaning is 79 GB.
  • The vocabulary is rebuilt in the same way as for chiTra-1.0.
    • Total vocab size is 32597.
  • Sudachi libraries are updated to:
    • SudachiPy: 0.6.6
    • SudachiDict: 20220729-core
    • SudachiTra: 0.1.8
  • word_form_type is changed to normalized_nouns (see the tokenizer sketch after this list).
  • The total number of training steps is increased to 20472.
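For illustration, a minimal sketch of instantiating the tokenizer with the new setting. This is not taken from the changelog; the import path and keyword arguments follow the SudachiTra README, so verify them against the installed version:

```python
from sudachitra.tokenization_bert_sudachipy import BertSudachipyTokenizer

# Hypothetical local file: "vocab.txt" stands in for the rebuilt
# 32597-entry chiTra-1.1 vocabulary.
tokenizer = BertSudachipyTokenizer(
    vocab_file="vocab.txt",
    word_form_type="normalized_nouns",  # chiTra-1.0 used normalized_and_surface
)
print(tokenizer.tokenize("自然言語処理の勉強をしています"))
```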

0.1.8 (2023-03-10)

Highlights

  • Add new word_form_type: normalized_nouns. (#48, #50)
    • Normalizes morphemes that do not have a conjugated form (see the sketch below).
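To make the behavior concrete, here is a hypothetical re-implementation of the idea in plain SudachiPy; the actual SudachiTra logic may differ in detail (e.g. in which part-of-speech field it checks):

```python
from sudachipy import Dictionary, SplitMode

tokenizer = Dictionary().create()

def normalized_nouns(text: str) -> list[str]:
    """Normalize morphemes that have no conjugation; keep the rest as-is."""
    words = []
    for m in tokenizer.tokenize(text, SplitMode.C):
        pos = m.part_of_speech()  # 6-tuple; pos[4] is the conjugation type
        if pos[4] == "*":         # no conjugation paradigm: use normalized form
            words.append(m.normalized_form())
        else:                     # conjugating words (verbs etc.) keep surface
            words.append(m.surface())
    return words

print(normalized_nouns("固有名詞を正規化して出力した"))
```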

Other

  • Faster part-of-speech matching (#36)
  • Use a HuggingFace-compatible pretokenizer (#38); see the sketch after this list
  • Fix/Update pretraining scripts and documents (#39, #40, #45, #46)
  • Fix github test workflow (#49)
  • Enable saving the vocab file with duplicated items (#54)
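As an aside, SudachiPy 0.6 can expose its analysis as a pre-tokenizer for the HuggingFace tokenizers library via Dictionary.pre_tokenizer(); whether that is exactly what #38 uses is an assumption, but the integration looks roughly like this:

```python
from sudachipy import Dictionary

# Requires the `tokenizers` package. The returned object behaves like a
# tokenizers PreTokenizer and can be assigned to Tokenizer.pre_tokenizer.
pre_tokenizer = Dictionary().pre_tokenizer()
print(pre_tokenizer.pre_tokenize_str("すだちの香りが好きです"))
```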

chiTra-1.0 (2022-02-25)

  • A pretrained Japanese BERT base model, trained using the chiTra tokenizer.

Details

  • Model
    • chiTra-1.0 is a BERT base model (see the loading sketch at the end of this section).
  • Corpus
    • We used the NINJAL Web Japanese Corpus (NWJC) from the National Institute for Japanese Language and Linguistics.
    • The cleaning process is explained here.
      • Total size after cleaning is 109 GB.
  • Vocabulary
    • The vocabulary is built from the above corpus using WordPiece, with a vocab size of 32000.
    • We added 常用漢字 (jōyō kanji) and 人名用漢字 (jinmeiyō kanji, kanji for personal names) to cover common Japanese text.
      • Total vocab size is 32615.
  • Sudachi libraries
    • SudachiPy: 0.6.2
    • SudachiDict: 20211220-core
    • SudachiTra: 0.1.7
      • We used word_form_type: normalized_and_surface.
  • Training Parameters
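For reference, a minimal sketch of loading a released chiTra checkpoint, following the usage shown in the SudachiTra README; "path/to/chiTra-1.0" is a placeholder for the directory extracted from the model archive:

```python
from transformers import BertModel
from sudachitra.tokenization_bert_sudachipy import BertSudachipyTokenizer

# "path/to/chiTra-1.0" is a placeholder, not a real path.
tokenizer = BertSudachipyTokenizer.from_pretrained("path/to/chiTra-1.0")
model = BertModel.from_pretrained("path/to/chiTra-1.0")

inputs = tokenizer("日本語のテキストを符号化する", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768) for BERT base
```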