Unexpected Poor Inference Despite High Evaluation Scores: Validation Loss 0.0031 (Exact: 99.70% | Partial: 99.74%) #1677
@ep0p Quick question: does the dataset contain "real" word crops or synthetically generated ones?
Hi @felixdittrich92,

After fixing my dataset and fine-tuning parseq again, I'm still not getting good results: words come out garbled, with a lot of spurious punctuation marks.

I have also created a character map for my dataset; the only custom parameter I've used is the vocab. I was wondering whether the models are cached somewhere and I might be tuning a bad model (one that was badly tuned before), or whether there is some other explanation for why this is not working for me. My next, and last, step would be to create a synthetic dataset to test with.

Edit: Or could it be the size of the images in the dataset (example attached)? At this stage I am doubting everything...
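A minimal sketch of the kind of vocab registration involved, assuming docTR's `VOCABS` registry and a hypothetical character-map file `my_charmap.txt` (both the file name and the vocab key are placeholders). Note that the reference training script resolves `--vocab` against `doctr.datasets.VOCABS` at import time, so in practice the entry usually has to be added to `doctr/datasets/vocabs.py` (or patched in before the script imports it):

```python
from doctr.datasets import VOCABS

# Build the vocab string from a character-map file (hypothetical name);
# docTR expects a single string of unique characters.
with open("my_charmap.txt", encoding="utf-8") as f:
    chars = "".join(sorted(set(f.read()) - {"\n"}))

# Register it under a custom key so it can be referenced by name,
# e.g. via --vocab my_vocab on the training script.
VOCABS["my_vocab"] = chars
print(f"Registered {len(chars)} characters")
```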
Hi @ep0p 👋,
What I see right away is `--lr 0.01`. What was the reason for using such a high value? 😅

Mh.. we pretrained the models from scratch on ~10M samples with the default args (only the epoch count increased to 20; with less data maybe set it to 40 or 50). 3M samples are definitely enough, especially for fine-tuning.

I'm a bit puzzled, because I have also fine-tuned `parseq` (https://huggingface.co/Felix92/doctr-torch-parseq-multilingual-v1) on only ~1M synth samples without trouble (with `--pretrained` and `50 epochs`, other args unchanged).
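Something like this, as a minimal sketch: paths and the vocab name are placeholders, and the flag names follow docTR's `references/recognition/train_pytorch.py` script, so double-check `--help` on your checkout since the exact flag set may differ between versions. The key point is that no `--lr` override is passed, so the default learning rate is kept instead of 0.01:

```python
import subprocess

# Fine-tune parseq from the released checkpoint with default args,
# only raising the epoch count as suggested above.
subprocess.run(
    [
        "python", "references/recognition/train_pytorch.py", "parseq",
        "--train_path", "path/to/train",  # placeholder
        "--val_path", "path/to/val",      # placeholder
        "--vocab", "my_vocab",            # custom vocab registered in VOCABS
        "--epochs", "50",
        "--pretrained",                   # start from the pretrained weights
    ],
    check=True,
)
```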