This release contains the cleaned datasets we used to pre-train a transformer-based Thai language model (WangchanBERTa; wangchanberta-base-att-spm-uncased).
The cleaned datasets are only partially available because the data from Wisesight, Pantip, and TNC is not under explicit open-source licenses.