Tools for preprocessing and AI-driven analysis of images with Tibetan text.

CodexAITeam/TibetanOCR

 
 

Tibetan Column Detection

Overview

This Python project generates training data for detecting columns or text blocks in Tibetan texts. It does so by embedding Tibetan text into images.

Validation results

It includes functions to create lorem ipsum-like Tibetan text, read random Tibetan text files from a directory, and calculate and embed text within specified bounding boxes in images. The project effectively handles Tibetan script, ensuring proper display and formatting within the images.
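To make the "lorem ipsum-like" idea concrete, here is a minimal sketch of a filler-text generator. It is not the project's own implementation: the function name `pseudo_tibetan` and its parameters are hypothetical, and the output consists of random letter sequences rather than real Tibetan words. It only relies on the Tibetan Unicode block, where the tsheg (U+0F0B) separates syllables and the shad (U+0F0D) ends a clause.

```python
import random

TSHEG = "\u0f0b"  # syllable separator
SHAD = "\u0f0d"   # clause terminator
# Base consonants U+0F40..U+0F68, skipping the unassigned code point U+0F48.
CONSONANTS = [chr(cp) for cp in range(0x0F40, 0x0F69) if cp != 0x0F48]
# Vowel signs i, u, e, o; the empty string stands for the inherent "a".
VOWELS = ["", "\u0f72", "\u0f74", "\u0f7a", "\u0f7c"]


def pseudo_tibetan(n_sentences=3, seed=None):
    """Return filler text made of random Tibetan-looking syllables (hypothetical helper)."""
    rng = random.Random(seed)
    sentences = []
    for _ in range(n_sentences):
        syllables = []
        for _ in range(rng.randint(4, 9)):
            letters = [rng.choice(CONSONANTS) for _ in range(rng.randint(1, 3))]
            # Attach the vowel sign to the first letter of the syllable.
            syllables.append(letters[0] + rng.choice(VOWELS) + "".join(letters[1:]))
        sentences.append(TSHEG.join(syllables) + SHAD)
    return " ".join(sentences)


print(pseudo_tibetan(n_sentences=2, seed=0))
```

Passing a fixed `seed` makes the generated text reproducible, which can help when comparing rendered images between runs.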

Features

  • Automated Data Generation: Simplifies the process of generating training data for Tibetan NLP tasks.
  • Customizable Input: Allows users to specify various input parameters like images, labels, directories for backgrounds and corporate images, etc.
  • Image Processing: Utilizes the PIL library for image manipulation.
  • Bounding Box Preparation: Includes a utility function prepare_bbox_string for handling bounding boxes.
  • Multiprocessing Support: Leverages multiprocessing for efficient data processing.
  • Debugging Mode: Includes a debug mode for troubleshooting and ensuring correct data processing.
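For readers unfamiliar with YOLO labels, the sketch below shows the kind of conversion a bounding-box helper has to perform. It is an illustration only: `to_yolo_line` is a hypothetical name, and the actual signature and behavior of `prepare_bbox_string` may differ. A YOLO label line holds a class index followed by the box center, width, and height, all normalized by the image size.

```python
def to_yolo_line(class_id, box, image_size):
    """Format a pixel box (left, top, right, bottom) as a YOLO label line.

    Hypothetical helper; not the project's prepare_bbox_string.
    """
    left, top, right, bottom = box
    img_w, img_h = image_size
    x_center = (left + right) / 2 / img_w
    y_center = (top + bottom) / 2 / img_h
    width = (right - left) / img_w
    height = (bottom - top) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A 200x400 pixel text block inside a 1000x1000 image.
print(to_yolo_line(0, (100, 200, 300, 600), (1000, 1000)))
# prints: 0 0.200000 0.400000 0.200000 0.400000
```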

Getting Started

Prerequisites

  • Python 3.x
  • Pillow (the maintained fork of PIL, the Python Imaging Library)
  • YOLO utilities (for bounding box handling)
  • Additional Python libraries: numpy, tqdm, yaml (provided by the PyYAML package)

Installation

Clone the repository to your local machine:

git clone https://github.com/nih23/Tibetan-NLP.git
cd Tibetan-NLP

Command-line Arguments

The script supports various command-line arguments to customize the data generation process:

  • --background_train: Folder with background images for training (default: './ext/TibetanOCR/data/background_images_train/')
  • --background_val: Folder with background images for validation (default: './ext/TibetanOCR/data/background_images_val/')
  • --dataset_folder: Folder for the generated YOLO dataset (default: './data/yolo_tibetan/')
  • --corpora_folder: Folder with Tibetan text corpora (default: './data/corpora/UVA Tibetan Spoken Corpus/')
  • --train_samples: Number of training samples to generate (default: 2)
  • --val_samples: Number of validation samples to generate (default: 1)
  • --no_cols: Number of text columns to generate, from 1 to 5 (default: 1)
  • --font_path: Path to a font file that supports Tibetan characters (default: 'ext/Microsoft Himalaya.ttf')
  • --single_label: Use a single label "tibetan" for all files instead of using filenames as labels (flag, no value required)
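The options listed above could be declared with `argparse` roughly as follows. This is a sketch that mirrors the documented names and defaults; the real parser in generate_training_data.py may be organized differently, and the `choices` restriction on `--no_cols` is an assumption based on the documented range.

```python
import argparse


def build_parser():
    """Sketch of a parser matching the documented flags (not the script's actual code)."""
    parser = argparse.ArgumentParser(
        description="Generate synthetic YOLO training data with Tibetan text."
    )
    parser.add_argument("--background_train",
                        default="./ext/TibetanOCR/data/background_images_train/")
    parser.add_argument("--background_val",
                        default="./ext/TibetanOCR/data/background_images_val/")
    parser.add_argument("--dataset_folder", default="./data/yolo_tibetan/")
    parser.add_argument("--corpora_folder",
                        default="./data/corpora/UVA Tibetan Spoken Corpus/")
    parser.add_argument("--train_samples", type=int, default=2)
    parser.add_argument("--val_samples", type=int, default=1)
    # Assumption: values outside 1..5 are rejected at parse time.
    parser.add_argument("--no_cols", type=int, choices=range(1, 6), default=1)
    parser.add_argument("--font_path", default="ext/Microsoft Himalaya.ttf")
    # A flag without a value: present means True, absent means False.
    parser.add_argument("--single_label", action="store_true")
    return parser


args = build_parser().parse_args(["--no_cols", "3", "--single_label"])
print(args.no_cols, args.single_label)  # prints: 3 True
```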

Generating training data

Training data is generated by running generate_training_data.py. Update the background image folders to match your setup. ⚠️ Set the path to Microsoft Himalaya.ttf correctly, because this font is needed to render Tibetan properly.

python generate_training_data.py --font_path "ext/Microsoft Himalaya.ttf" --single_label

Train YOLOv8n

Training of YOLOv8n is done by a CLI call to Ultralytics.

yolo detect train data=data/yolo_tibetan/tibetan_yolo.yml epochs=1000 imgsz=1024
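The `data=` argument points to a dataset description file. As a rough illustration of what Ultralytics expects in such a file, the sketch below builds one as plain text. The helper `dataset_yaml` and the `train/images` and `val/images` layout are assumptions for the example; the file written by the generator may use different paths and class names.

```python
def dataset_yaml(root, class_names):
    """Build the text of an Ultralytics-style dataset file (hypothetical helper)."""
    lines = [
        f"path: {root}",        # dataset root directory
        "train: train/images",  # assumed layout, relative to path
        "val: val/images",      # assumed layout, relative to path
        "names:",
    ]
    # One entry per class: index followed by a readable name.
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


print(dataset_yaml("data/yolo_tibetan", ["tibetan"]))
```

With `--single_label`, the class list would contain only the one label "tibetan", as in this example.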

The model is then exported to TorchScript for inference:

yolo detect export model=runs/detect/train9/weights/best.pt format=torchscript

Inference

We can now use the trained model to detect and classify Tibetan text blocks:

yolo predict task=detect model=runs/detect/train9/weights/best.torchscript imgsz=1024 source=data/my_inference_data/*.jpg

The results are saved to the folder runs/detect/predict.
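If predictions are also written as text labels (for example by adding `save_txt=True` to the predict call), each line holds a class index followed by a normalized center, width, and height. The sketch below converts such a line back to pixel coordinates, for instance to crop detected text blocks. The function name `yolo_line_to_pixels` is hypothetical and not part of this project.

```python
def yolo_line_to_pixels(line, image_size):
    """Convert a YOLO label line to (class_id, (left, top, right, bottom)) in pixels."""
    fields = line.split()
    class_id = int(fields[0])
    # Fields 1..4 are the box; an optional trailing confidence value is ignored.
    x_center, y_center, width, height = (float(v) for v in fields[1:5])
    img_w, img_h = image_size
    left = round((x_center - width / 2) * img_w)
    top = round((y_center - height / 2) * img_h)
    right = round((x_center + width / 2) * img_w)
    bottom = round((y_center + height / 2) * img_h)
    return class_id, (left, top, right, bottom)


print(yolo_line_to_pixels("0 0.5 0.5 0.2 0.4", (1000, 800)))
# prints: (0, (400, 240, 600, 560))
```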

Contributions

Contributions to this project are welcome! Please fork the repository and submit a pull request with your proposed changes.

License

This project is licensed under the MIT License - see the LICENSE file for details.
