docTR training reference datasets #1654
felixdittrich92 started this conversation in Show and tell
-
Hi guys! Thanks for your excellent work! Could you provide an example of how to train a detection model using internal datasets? I see that train_tensorflow.py expects a slightly different format for the training set. Expected format for the internal training set:
{
    "sample_img_01.png": {
        'img_dimensions': (900, 600),
        'img_hash': "theimagedumpmyhash",
        'polygons': [[[x1, y1], [x2, y2], [x3, y3], [x4, y4]], ...]
    },
    "sample_img_02.png": {
        'img_dimensions': (900, 600),
        'img_hash': "thisisahash",
        'polygons': [[[x1, y1], [x2, y2], [x3, y3], [x4, y4]], ...]
    },
    ...
}
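The mapping above can be produced with a small script. This is a minimal sketch, not docTR's own tooling: `build_detection_labels` is a hypothetical helper, and using SHA-256 for `img_hash` is an assumption, since the expected format does not say how the hash is computed.

```python
import hashlib
import json

def build_detection_labels(samples):
    """Build a labels mapping in the format shown above.

    `samples` yields (filename, image_bytes, (height, width), polygons)
    tuples. SHA-256 for 'img_hash' is an assumption; the hashing
    scheme is not stated in the expected format itself.
    """
    labels = {}
    for filename, image_bytes, dims, polygons in samples:
        labels[filename] = {
            "img_dimensions": dims,
            "img_hash": hashlib.sha256(image_bytes).hexdigest(),
            "polygons": polygons,
        }
    return labels

# One synthetic entry; real image bytes would be read from disk.
samples = [
    ("sample_img_01.png", b"fake-png-bytes", (900, 600),
     [[[10, 10], [200, 10], [200, 50], [10, 50]]]),
]
labels = build_detection_labels(samples)

# Dump to the JSON file the training script reads.
with open("labels.json", "w") as f:
    json.dump(labels, f, indent=2)
```

Note that `json.dump` serializes the `(900, 600)` tuple as a JSON array, which is how dimensions end up stored on disk.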
-
The provided link contains reference datasets:
NOTE: train and val contain the same data. You should split your custom dataset and avoid duplications.
detection_task: docTR detection training
recognition_task: docTR recognition training
Reference datasets: Datasets
Docs: Training documentation
Recognition: README
Detection: README
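The note about splitting can be sketched as a small helper that partitions one labels mapping into disjoint train/val mappings. This is an illustration, not part of docTR; `split_labels`, the 10% default, and the fixed seed are all choices made here for the example.

```python
import random

def split_labels(labels, val_fraction=0.1, seed=42):
    """Split a labels mapping into disjoint train/val mappings.

    Sorting before shuffling with a fixed seed makes the split
    reproducible, and the two outputs never share a sample, which
    avoids the train/val duplication warned about above.
    """
    names = sorted(labels)
    random.Random(seed).shuffle(names)
    n_val = max(1, int(len(names) * val_fraction))
    val_names = set(names[:n_val])
    train = {k: v for k, v in labels.items() if k not in val_names}
    val = {k: v for k, v in labels.items() if k in val_names}
    return train, val
```

Each resulting mapping can then be written to its own labels file for the training and validation paths.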