
English | Tiếng Việt

Digitizing Vietnamese historical documents with Deep Learning

*(Demo image)*

I. Overview

1. Introduction

The Vietnamese language, with its extremely diverse phonetics and one of the most intricate writing systems in East Asia, has journeyed from Confucian Hán (Sino) characters, through the Nôm script, to the Quốc Ngữ alphabet based on the Latin writing system. Each of these scripts accompanies glorious chapters of the nation's history.

After the Thousand Years of Chinese Domination ended, our ancestors, driven by a sense of linguistic self-determination, created the Nôm script, an ideographic script based on Hán characters that represents Vietnamese speech. Together with Hán characters, the Nôm script was used to record the majority of Vietnamese documents for about 10 centuries. However, this heritage is now at risk of being lost in the shift to the modern Vietnamese script (Quốc Ngữ).

"Today, less than 100 scholars world-wide can read Nôm. Much of Việt Nam's vast,
written history is, in effect, inaccessible to the 80 million speakers of the language"

(Vietnamese Nôm Preservation Foundation – VNPF)

To make use of this vast source of knowledge, it must be digitized and translated into modern Quốc Ngữ. Because translation is difficult and time-consuming, and experts are few, these efforts cannot be completed in a short time.

👉 To accelerate this digitization process, Optical Character Recognition (OCR) techniques are key to making all major works in Sino-Nom available online.

2. Achievements

My teammate Nguyễn Đức Duy Anh and I have been working on this project for nearly 8 months under the dedicated guidance of Dr. Do Trong Hop (Faculty of Information Science and Engineering - VNUHCM UIT) and have achieved the following:

  • Successfully built the NomNaOCR dataset:

    • It serves the 2 OCR tasks of Text Detection and Text Recognition for historical documents written in Sino-Nom.
    • It is the largest dataset for the Sino-Nom script at the moment, with 2,953 Pages and 38,318 Patches.
  • Successfully built an OCR pipeline for Sino-Nom text using Deep Learning.

  • Implemented and experimented with models at the sequence level. This not only saves annotation costs but also helps retain the semantics of the sequence, instead of processing individual characters as most previous works do. For character-level implementations, take a look at the related open-source projects.

👉 You can take a look at this thesis_en.pdf file for a summary of the models used in this project.

II. The NomNaOCR Dataset

Note: You should use the NomNaTong font to render the Sino-Nom content properly.

1. Data Collection Process

VNPF has digitized many famous Sino-Nom works of high historical value. To make use of these invaluable resources, I used Automa to build an automated collection flow that gathers:

  • Images and their URLs.
  • Phonetics along with their digitized characters and Vietnamese translations (if any).
*(See `Automa.mp4` for a recording of the collection flow.)*

a. Collection Instructions

I was too lazy to write code for this step, so I did it a bit manually 😅.

  • Import workflow.json into Automa.
  • Choose the New tab block, click Edit, and enter the URLs of the Sino-Nom works you want to collect.
  • Edit the To number field of the Loop Data block to specify the number of pages to collect.
  • Edit the CSS Selector of the following blocks:
    • Element exists: checks whether the page is empty.
    • Blocks group: gets the image URL and the text of the current page.
  • Click Execute to start collecting.
  • Run automa2txt.py to parse the obtained automa.json into 3 files (a parsing sketch follows this subsection):
    • url.txt: contains the image URLs of the historical work.
    • nom.txt: contains Sino-Nom text.
    • modern.txt: contains the translated phonetics corresponding to nom.txt.

[*] For downloading the images, I simply used the Batch Download feature of Internet Download Manager.
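For illustration, here is a minimal sketch of what this parsing step could look like. The record layout of automa.json below (a list of per-page objects with url, nom, and modern fields) is a hypothetical assumption for the example, not the actual format exported by the workflow:

```python
import json

# Hypothetical layout: a list of per-page records with "url", "nom",
# and "modern" fields. The real automa.json depends on the workflow.
with open("automa.json", encoding="utf-8") as f:
    pages = json.load(f)

with open("url.txt", "w", encoding="utf-8") as url_f, \
     open("nom.txt", "w", encoding="utf-8") as nom_f, \
     open("modern.txt", "w", encoding="utf-8") as modern_f:
    for page in pages:
        url_f.write(page["url"] + "\n")                 # image URL of the page
        nom_f.write(page["nom"] + "\n")                 # Sino-Nom text
        modern_f.write(page.get("modern", "") + "\n")   # phonetic transcription, if any
```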

b. Collected Historical Works

| Document Name | Number of Pages |
|:--------------|----------------:|
| Lục Vân Tiên | 104 |
| Tale of Kiều ver 1866 | 100 |
| Tale of Kiều ver 1871 | 136 |
| Tale of Kiều ver 1872 | 163 |
| ĐVSKTT Quyển Thủ | 107 |
| ĐVSKTT Ngoại kỷ toàn thư | 178 |
| ĐVSKTT Bản kỷ toàn thư | 933 |
| ĐVSKTT Bản kỷ thực lục | 787 |
| ĐVSKTT Bản kỷ tục biên | 448 |
| **Total** | **2956** |

[*] ĐVSKTT: abbreviation of Đại Việt Sử Ký Toàn Thư (History of Greater Vietnam).

2. Labeling process

We used PPOCRLabel from the PaddleOCR ecosystem to assign bounding boxes automatically. By default, this tool uses DBNet to detect text, which is also the model we planned to experiment with for Text Detection. We divided this tool into 2 versions:

  • annotators.zip: For labelers. I removed unnecessary features like Auto annotation, ... to avoid mistakes due to over-clicking during labeling and to make the installation easier and less error-prone.
  • composer.zip: For guideline builders (whom I'll call Composers) to run Auto annotation; it retains almost all the functionality of the original PPOCRLabel. I removed the Auto recognition computation when running Auto annotation and used TEMPORARY as the text label. Additionally, I implemented image rotation to match the input of the Recognition models when running the Export Recognition Result feature.

👉 Annotators will replace the TEMPORARY labels according to the guidelines for poetry and prose. Finally, they will map in the actual labels collected from VNPF.

However, for images in NomNaOCR, PPOCRLabel mainly detects text areas in the horizontal orientation, so we rotated the images by 90 degrees to detect the vertical boxes:

  • Depending on the document, Composers chose to rotate the images by +90 degrees, -90 degrees, or both.
  • Run rotated_generator.py to generate the rotated images.
  • Then, input them into PPOCRLabel to predict the bounding boxes.
  • When the prediction is complete, run unrotated_convertor.py to rotate the bounding boxes back to vertical (the core coordinate transform is sketched below).
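The essence of the unrotation step is mapping each predicted box from the rotated image back into the original coordinate system. Here is a minimal sketch of that transform (not the actual unrotated_convertor.py, which also handles PPOCRLabel's annotation format; the ±90° sign convention is an assumption):

```python
def unrotate_point(x, y, width, height, angle):
    """Map a point (x, y) from an image rotated by `angle` back to the
    original image. `width`/`height` are the ORIGINAL image dimensions;
    +90 means the image was rotated counter-clockwise, -90 clockwise."""
    if angle == 90:    # CCW rotation sent (x0, y0) -> (y0, width - 1 - x0)
        return width - 1 - y, x
    if angle == -90:   # CW rotation sent (x0, y0) -> (height - 1 - y0, x0)
        return y, height - 1 - x
    raise ValueError("Only ±90 degree rotations are handled here")

def unrotate_box(box, width, height, angle):
    # PPOCRLabel boxes are quadrilaterals: [[x1, y1], ..., [x4, y4]]
    return [list(unrotate_point(x, y, width, height, angle)) for x, y in box]
```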

After the actual implementation, the NomNaOCR dataset ended up with 2,953 Pages (1 error-scanned page and 2 blank pages were excluded). Through this semi-manual annotation, we obtained an additional 38,318 Patches (1 Patch was ignored). We then used the splitting formula of the IHR-NomDB dataset to obtain a similar character distribution between the Train and Validate sets of the Recognition data. The Synthetic Nom String set of IHR-NomDB was also used to pretrain the Recognition models.

| Subset | Number of Records | Character Intersection |
|:-------|------------------:|-----------------------:|
| Train set | 30,654 | 93.24% |
| Validate set | 7,664 | 64.41% |

III. Approaches

1. Training Process

  • For Detection, I used PaddleOCR for training, with the corresponding config files in the Text detection folder.
  • For Recognition, during the PreTraining phase on the Synthetic Nom String set of IHR-NomDB, we found that adding a Skip Connection (SC) between the final feature map and a layer X that has the same shape but lies as far from that feature map as possible significantly improves model performance. We therefore experimented with 2 fundamental Skip Connection methods, Addition and Concatenation, on the most feasible models (those containing the aforementioned layer X); both variants are sketched below.
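A minimal sketch of the two merge variants on a toy Keras backbone (the layer shapes and the placement of layer X here are illustrative assumptions, not the exact NomNaOCR architectures):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_backbone(merge="add"):
    inputs = tf.keras.Input(shape=(48, 432, 1))   # hypothetical Patch size
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    skip = x                         # "layer X": same shape as the final feature map
    for _ in range(4):               # blocks between layer X and the feature map
        x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    if merge == "add":               # Addition: element-wise sum, depth unchanged
        x = layers.Add()([x, skip])
    else:                            # Concatenation: stacks channels, doubles depth
        x = layers.Concatenate()([x, skip])
    return tf.keras.Model(inputs, x)
```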

👉 Download the model weights here.

2. Evaluation Process

  • Metrics for evaluating Text Detection and End-to-End: We used a new method called CLEval to evaluate the effectiveness of both the Text Detection and Recognition stages (End-to-End). This method can also evaluate Text Detection alone, so depending on the problem, CLEval differs in its computational components.

  • Metrics for evaluating Text Recognition only: We used the same sequence-level metrics as previous related works: Sequence Accuracy, Character Accuracy, and Character Error Rate (CER). A minimal CER sketch follows this list.

  • Additionally, for Recognition, I only kept the outputs of the notebooks and models that achieved the best results on the Validate set of NomNaOCR.
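For reference, a minimal sketch of these sequence-level metrics using the standard edit-distance definition of CER (the actual evaluation code may differ):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate = Levenshtein(reference, hypothesis) / len(reference)."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))              # previous DP row of edit distances
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (reference[i - 1] != hypothesis[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

def sequence_accuracy(refs, hyps):
    """Fraction of predictions that match the label exactly."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```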

👉 Check thesis_en.pdf for more information.

IV. Experimental Results

1. Text Detection

2. Text Recognition

a. PreTraining results

b. Fine-tuning and ReTraining results

3. End-to-End

V. Many Thanks to

VI. TODO

  • Use Beam search, or going further, a Language model to decode the Text Recognition output, referencing projects by Harald Scheidl (see the decoding sketch at the end of this section).

  • NomNaOCRpp: Experiment with more recent or state-of-the-art (SOTA) models on famous benchmark datasets such as ICDAR 2013 and 2015.

  • NomNaSite: Develop a WebApp to apply the implemented solutions in practical scenarios.

  • NomNaNMT: Develop the following 2 machine translation tasks:

    • Translate Sino-Nom phonetics into Quốc Ngữ script: Already deployed by HCMUS.
    • Further translate the above Quốc Ngữ text into contemporary Vietnamese.

  • Record errors on VNPF into a file. During dataset creation, we discovered several errors in VNPF's translations, such as translations not matching the current Page, translations inconsistent with the image, and translations with extra or missing words. Below are a few examples:

    | Error Description | Work | Page | Location in Image | Note |
    |:------------------|:-----|-----:|:------------------|:-----|
    | The character 揆 in the dictionary does not mean "cõi" | Tale of Kiều ver 1866 | 1 | Sentence 1 | |
    | The character 別 is different from the image | Tale of Kiều ver 1866 | 9 | Sentence 22 | Variant of "别", appeared mostly in versions before 1902 |
    | The character 𥪞 is different from the image | Tale of Kiều ver 1866 | 55 | Sentence 15 | |
    | The character 󰁳 is different from the image | Tale of Kiều ver 1866 | 55 | Sentence 15 | |
    | There are 21 lines > 20 in the image | Lục Vân Tiên | 6 | - | |
    | There are 19 lines < 20 in the image | Lục Vân Tiên | 7 | - | |
    | The 5th character is displayed as [?] | Lục Vân Tiên | 7 | Sentence 10 | |
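As a starting point for the Beam search item above, here is a minimal sketch that assumes a CTC-trained Recognition model emitting per-timestep softmax probabilities (an assumption about the model head, not a confirmed detail of this project):

```python
import tensorflow as tf

def beam_decode(y_pred, beam_width=10):
    """y_pred: (batch, time_steps, num_classes) softmax output of a
    CTC-trained Recognition model (assumed). Returns label-index
    sequences padded with -1, plus their log probabilities."""
    seq_len = tf.fill([tf.shape(y_pred)[0]], tf.shape(y_pred)[1])
    decoded, log_probs = tf.keras.backend.ctc_decode(
        y_pred, seq_len, greedy=False, beam_width=beam_width, top_paths=1)
    return decoded[0], log_probs
```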

VII. References