-
Hello, I have fine-tuned a Doctr model for text recognition, which you can find here (model). During training and evaluation, I achieved very high validation scores, with near-perfect metrics, as shown below:

To further validate the model's performance, I generated a PDF file containing words solely from the validation dataset. However, when I run inference on this PDF, the model's performance is significantly worse than expected. Based on the high validation scores, I anticipated full recognition of the words, but this was not the case. For example:

1. Single word case:
2. Multiple words case:

These detections are the most accurate I could achieve by setting the […]. I also noticed the model adds extra punctuation marks. Initially, I thought this was due to overlapping detection boxes, but the issue persisted even when testing with a PDF containing one word per line.

Am I wrong to expect 99% accuracy during inference on the validation dataset, given the near-perfect validation scores achieved during training?
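For context, here is a minimal sketch of how inference on such a PDF could be run with a fine-tuned recognition model plugged into the end-to-end predictor. The weight path, architecture names, and vocab string below are assumptions rather than my exact setup, and recent docTR versions are assumed to accept a model instance for `reco_arch`:

```python
import torch

from doctr.io import DocumentFile
from doctr.models import crnn_vgg16_bn, ocr_predictor

# Rebuild the recognition model with the vocab used for fine-tuning
# (architecture, vocab, and checkpoint path are placeholders).
reco_model = crnn_vgg16_bn(pretrained=False, vocab="0123456789abcdefghijklmnopqrstuvwxyz")
reco_model.load_state_dict(torch.load("my_finetuned_reco.pt", map_location="cpu"))

# Pair the fine-tuned recognizer with a pretrained detection model.
predictor = ocr_predictor(det_arch="db_resnet50", reco_arch=reco_model, pretrained=True)

# Run the full pipeline on the PDF built from validation words.
doc = DocumentFile.from_pdf("validation_words.pdf")
result = predictor(doc)
print(result.render())
```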
-
Hi @ep0p 👋, Yeah that's correct. In the meantime, you can check out the dataset I used to fine-tune the multilingual model: synth_multilingual_dataset
@ep0p Update :)
I tweaked around with your dataset a bit, and the model seems to be heavily biased toward the punctuation samples. Additionally, the dataset is really easy, so it seems to overfit slightly with too many samples.
It's still not 100% solved, but it's going in the right direction.
I trained only for one epoch to check.
Dirty code I used to clean only the train data (val data unchanged):
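The snippet itself didn't make it into this thread, so below is only a rough sketch of the kind of cleanup described above (dropping punctuation-heavy samples and downsampling the train split), assuming docTR's recognition format of an image folder plus a `labels.json` mapping image filename to label string. The paths, the 0.5 punctuation ratio, and the 50% keep fraction are placeholder values, not the original settings:

```python
import json
import random
import string

# Assumed docTR recognition layout: train/images/ + train/labels.json ({"img.png": "label"}).
LABELS_PATH = "train/labels.json"
CLEANED_PATH = "train/labels_cleaned.json"

with open(LABELS_PATH, "r", encoding="utf-8") as f:
    labels = json.load(f)

def punct_ratio(text: str) -> float:
    """Fraction of characters in the label that are punctuation."""
    if not text:
        return 1.0
    return sum(c in string.punctuation for c in text) / len(text)

# Drop samples that are mostly punctuation (threshold is a placeholder).
cleaned = {name: label for name, label in labels.items() if punct_ratio(label) < 0.5}

# Downsample the remaining (easy) samples to curb overfitting (fraction is a placeholder).
kept = random.sample(list(cleaned.items()), k=int(0.5 * len(cleaned)))
cleaned = dict(kept)

with open(CLEANED_PATH, "w", encoding="utf-8") as f:
    json.dump(cleaned, f, ensure_ascii=False)

print(f"Kept {len(cleaned)} of {len(labels)} train samples (val data untouched)")
```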