# German Transformer Training

The goal of this repository is to plan the training of German transformer models.

## 1. Datasets / Data Sources

| Dataset | Raw Size / Characters | Quality / Filtered? | URL | Notes / Status | Dupe Factor | Total (= 178 GB) |
|---|---|---|---|---|---|---|
| German Wikipedia Dump + Comments | 5.4 GB / 5.3b | ++ | | | 10 | 54 GB = 30 % |
| OSCAR Corpus (Common Crawl 2018-47) | 145 GB / 21b words | | Download | | ----- | ------ |
| FB cc_net (Common Crawl 2019-09), Head | 75 GB | + | Code | More broadly filtered versions (middle & tail) available too | 1 | 75 GB = 42 % |
| EU Bookshop | 2.3 GB / 2.3b | + | | | 5 | 11.5 GB = 6.5 % |
| News 2018 | 4.3 GB / 4.3b | + | | | 5 | 20 GB = 11 % |
| Wortschatz Uni Leipzig | > 20 × 200 MB | | Code | Possibly already part of News 2018? | ---- | ---- |
| Paracrawl | 3.2 GB / 3.2b | -- | | | --- | ---- |
| Open Subtitles | 1.3 GB / 288m tokens | o | | | 2 | 2.6 GB = 1.5 % |
| Open Legal Dump | 3.6 GB / 3.5b | + | Announcement | Used by Deepset | 5 | 15 GB = 8.4 % |
| Corpus of German-Language Fiction (txt) | 2735 prose works | | Download | Old (1510-1940) | | |
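The Dupe Factor and Total columns appear to combine as raw size × dupe factor, expressed as a share of the roughly 178 GB training mix. The short sketch below reproduces that assumed arithmetic with the sizes copied from the table:

```python
# Sketch of the arithmetic assumed behind the "Dupe Factor" and "Total" columns:
# each source contributes roughly raw size (GB) x dupe factor, and the last
# column is its share of the summed training mix (~178 GB in the table).
# Raw sizes and dupe factors are copied from the table above.
corpora = {
    "German Wikipedia Dump + Comments": (5.4, 10),
    "FB cc_net (2019-09) Head":         (75.0, 1),
    "EU Bookshop":                      (2.3, 5),
    "News 2018":                        (4.3, 5),
    "Open Subtitles":                   (1.3, 2),
    "Open Legal Dump":                  (3.6, 5),
}

total = sum(raw * dupe for raw, dupe in corpora.values())
for name, (raw, dupe) in corpora.items():
    weighted = raw * dupe
    print(f"{name:35s} {weighted:6.1f} GB  ({weighted / total:5.1%})")

# The sum lands near the stated 178 GB; the slightly smaller per-row totals for
# News 2018 and Open Legal Dump in the table suggest their effective sizes are
# rounded down there.
print(f"{'Total':35s} {total:6.1f} GB")
```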

### Additional Sources

- Million Post Corpus: https://ofai.github.io/million-post-corpus/

### Data Preparation

1. Clean the files
2. Split into distinct sentences (2.2 Create Vocab)
3. Tokenize (a minimal sketch of these steps follows below)
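The sketch below walks through the three steps, assuming NLTK for German sentence splitting and the Hugging Face `tokenizers` library for vocab creation; the file names are placeholders, not files in this repository.

```python
# Minimal sketch of the preparation steps: clean -> split into sentences -> tokenize.
import re
import nltk
from tokenizers import BertWordPieceTokenizer

nltk.download("punkt", quiet=True)

def clean(text: str) -> str:
    # Strip markup remnants and collapse whitespace; real cleaning is corpus-specific.
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# 1. Clean the files and 2. write one sentence per line
with open("raw_corpus.txt", encoding="utf-8") as fin, \
     open("sentences.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        for sent in nltk.sent_tokenize(clean(line), language="german"):
            fout.write(sent + "\n")

# 2.2 Create vocab / 3. tokenize: train a cased WordPiece vocab on the sentence file
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=["sentences.txt"], vocab_size=32_000)
tokenizer.save_model(".")  # writes vocab.txt
print(tokenizer.encode("Das ist ein Beispielsatz.").tokens)
```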

## 2. Training

- Training
- NLP Libs

### Training Runs from Scratch

| Name | Steps | Result URL | Training Time | Code | Paper |
|---|---|---|---|---|---|
| RoBERTa Base | | | | | RoBERTa |
| BERT Large | | | | GitHub | BERT |
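For reference, a from-scratch masked-language-model pretraining run with Hugging Face Transformers could look like the sketch below; the tokenizer path, corpus file, and hyperparameters are placeholders, not the settings of any run listed above.

```python
# Hypothetical sketch of a from-scratch RoBERTa-base MLM pretraining run.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling, RobertaConfig, RobertaForMaskedLM,
    RobertaTokenizerFast, Trainer, TrainingArguments,
)

# Assumed: a BPE tokenizer trained on the German corpus, saved locally.
tokenizer = RobertaTokenizerFast.from_pretrained("./german-bpe-tokenizer")

# One sentence per line, as produced by the data preparation step.
dataset = load_dataset("text", data_files={"train": "sentences.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="roberta-base-german",
    per_device_train_batch_size=32,
    max_steps=500_000,
    learning_rate=6e-4,
    warmup_steps=24_000,
    save_steps=50_000,
)
Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
```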

### TPU Info

## 3. Evaluation Metrics

### Comparison to Other German & Multilingual Models

| Name | Steps | Result URL | Training Time | Code | Metrics |
|---|---|---|---|---|---|
| Deepset German BERT Base | 810k (1024 SL) + 30k (512 SL) | Deepset | 9 days on a TPU v2-8 | | |
| dbmdz German BERT Base | 1500k (512 SL) | dbmdz | | dbmdz | stefan-it |
| Europeana BERT | | dbmdz | | Europeana-bert | |
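As a quick qualitative probe (separate from the metrics above), the published checkpoints can be compared with a fill-mask query; the model IDs below are the Hugging Face Hub names these checkpoints are believed to use and may need adjusting.

```python
# Probe published German BERT checkpoints with a fill-mask query.
from transformers import pipeline

models = [
    "bert-base-german-cased",          # Deepset German BERT (assumed Hub ID)
    "dbmdz/bert-base-german-cased",    # dbmdz German BERT (assumed Hub ID)
    "bert-base-multilingual-cased",    # multilingual baseline
]

for name in models:
    fill = pipeline("fill-mask", model=name)
    masked = f"Berlin ist die {fill.tokenizer.mask_token} von Deutschland."
    preds = fill(masked)
    top = ", ".join(p["token_str"].strip() for p in preds[:3])
    print(f"{name}: {top}")
```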

## 4. Contact