The goal of this repository is to plan the training of German transformer models.
- Germeval 2017: https://sites.google.com/view/germeval2017-absa/data
Dataset | Raw Size / Characters | Quality / Filtered? | URL | Notes / Status | Dupe Factor | Total (sum = 178 GB) |
---|---|---|---|---|---|---|
German Wikipedia Dump + Comments | 5.4 GB / 5.3b | ++ | | | 10 | 54 GB = 30 % |
Oscar Corpus (Common Crawl 2018-47) | 145 GB / 21b words | | Download | | | |
FB cc_net (Common Crawl 2019-09) | Head 75 GB | + | Code | More broadly filtered versions (middle & tail) available too | 1 | 75 GB = 42 % |
EU Book Shop | 2.3 GB / 2.3b | + | | | 5 | 11.5 GB = 6.5 % |
News 2018 | 4.3 GB / 4.3b | + | | | 5 | 20 GB = 11 % |
Wortschatz Uni Leipzig | > 20 × 200 MB | | Code | Part of News 2018? | | |
Paracrawl | 3.2 GB / 3.2b | -- | | | | |
Open Subtitles | 1.3 GB / 288m tokens | o | | | 2 | 2.6 GB = 1.5 % |
Open Legal Dump | 3.6 GB / 3.5b | + | Announcement | Used by Deepset | 5 | 15 GB = 8.4 % |
Corpus of German-Language Fiction (txt) | 2735 prose works | | Download | Old (1510-1940) | | |
- One Million Posts Corpus: https://ofai.github.io/million-post-corpus/
- Originally meant for translation tasks: WMT 19
- Maybe identical to News 2018? Leipzig Corpus Collection
- Huge German Corpus (HGC)
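The arithmetic behind the Dupe Factor and Total columns: each corpus is repeated "dupe factor" times in the training mix, so it contributes raw size × dupe factor, and its share is that contribution divided by the roughly 178 GB overall mix. A minimal sketch with values copied from the table above (the helper function is only illustrative):

```python
TOTAL_MIX_GB = 178  # sum of the "Total" column in the dataset table above

def contribution(raw_size_gb: float, dupe_factor: int):
    """Return (duplicated size in GB, share of the overall training mix)."""
    duplicated = raw_size_gb * dupe_factor
    return duplicated, duplicated / TOTAL_MIX_GB

# German Wikipedia + comments: 5.4 GB repeated 10x -> 54 GB, roughly 30 % of the mix
print(contribution(5.4, 10))  # (54.0, 0.303...)
# EU Book Shop: 2.3 GB repeated 5x -> 11.5 GB, roughly 6.5 %
print(contribution(2.3, 5))   # (11.5, 0.0646...)
```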
- Clean Files
- Split into distinct sentences (2.2 Create Vocab)
- Tokenize
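A minimal sketch of these three steps, assuming NLTK's German Punkt model for sentence splitting and the Hugging Face `tokenizers` library for the WordPiece vocabulary; file paths, the length filter, and the vocabulary size are placeholders, not decisions:

```python
import os
import nltk
from tokenizers import BertWordPieceTokenizer

nltk.download("punkt")  # sentence-splitting models (includes German)
os.makedirs("clean", exist_ok=True)
os.makedirs("vocab", exist_ok=True)

def clean_and_split(in_path: str, out_path: str) -> None:
    """Lightly clean a raw text file and write it out with one sentence per line."""
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = " ".join(line.split())   # normalise whitespace
            if len(line) < 20:              # drop very short / boilerplate lines
                continue
            for sentence in nltk.sent_tokenize(line, language="german"):
                dst.write(sentence + "\n")

clean_and_split("raw/wikipedia_de.txt", "clean/wikipedia_de.txt")  # placeholder paths

# Train a cased German WordPiece vocabulary on the cleaned, sentence-split file(s).
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
tokenizer.train(files=["clean/wikipedia_de.txt"], vocab_size=32_000, min_frequency=2)
tokenizer.save_model("vocab")  # writes vocab/vocab.txt for BERT-style pre-training
```

Keeping case and accents (umlauts) intact is the usual choice for a cased German vocabulary.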
Training
- Pre-training SmallBERTa - A tiny model to train on a tiny dataset: https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b
- Pre-training RoBERTa using your own data (Fairseq): https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md
- How to train a new language model from scratch using Transformers and Tokenizers: https://huggingface.co/blog/how-to-train
- Language model training: https://github.com/huggingface/transformers/tree/master/examples/language-modeling
- How to train a new language model from scratch using Transformers and Tokenizers (Colab notebook): https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
- Fairseq - GitHub
- Hugging Face - GitHub
- FARM - GitHub
- DeepSpeed: Speeding Up BERT Training @ Microsoft Github
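The resources above boil down to roughly the following workflow. A condensed sketch using the Hugging Face `transformers` Trainer, adapted from the linked "how to train" post to a BERT-style masked LM with the WordPiece vocabulary from the pre-processing sketch above; model size, paths, and hyper-parameters are placeholders for a smoke test, not the final configuration:

```python
from transformers import (
    BertConfig, BertForMaskedLM, BertTokenizerFast,
    LineByLineTextDataset, DataCollatorForLanguageModeling,
    Trainer, TrainingArguments,
)

# WordPiece vocab from the pre-processing sketch above (path is a placeholder).
tokenizer = BertTokenizerFast("vocab/vocab.txt", do_lower_case=False)

# A deliberately small config for a first test run; scale up for the real model.
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
)
model = BertForMaskedLM(config=config)

# One sentence per line, as produced by the pre-processing step.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="clean/wikipedia_de.txt", block_size=128
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="./german-bert-test",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    save_steps=10_000,
    save_total_limit=2,
)
Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
```

For the full-size runs, the same data pipeline carries over to the RoBERTa classes used in the blog post, or to Fairseq / FARM / DeepSpeed from the list above.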
Name | Steps | Result URL | Training Time | Code | Paper |
---|---|---|---|---|---|
RoBERTa Base | | | | | RoBERTa |
BERT Large | | | | Github | BERT |
- Overview of preemptible TPUs: TPU Unicorn
Name | Steps | Result URL | Training Time | Code | Metrics |
---|---|---|---|---|---|
Deepset German BERT Base | 810k (128 SL, batch 1024) + 30k (512 SL) | Deepset | 9 days on a TPU v2-8 | | |
dbmdz German BERT Base | 1500k (512 SL) | dbmdz | | dbmdz | stefan-it |
Europeana BERT | | dbmdz | | Europeana-bert | |
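For baseline comparisons against these released checkpoints, the published models load directly from the Hugging Face model hub; the identifiers below are the public ones for the deepset and dbmdz models. A quick masked-token sanity check:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# deepset's German BERT; the dbmdz model is published as "bert-base-german-dbmdz-cased"
# (the Europeana models live under the dbmdz organization on the hub).
model_name = "bert-base-german-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Masked-token prediction as a quick sanity check (example sentence is arbitrary).
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in fill(f"Die Hauptstadt von Deutschland ist {fill.tokenizer.mask_token}."):
    print(pred["token_str"], pred["score"])
```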