
I trained a model for 50 epochs but the AP was still 0.002 #323

Open · devedse opened this issue Nov 12, 2020 · 23 comments

devedse commented Nov 12, 2020

Introduction

For a personal project I'd like to detect cars, buses, trucks, and their license plates. To accomplish this I wanted to train a custom model using YOLOv4 with this repository.

I followed the guidance described on this page: https://github.com/Tianxiaomo/pytorch-YOLOv4/blob/master/Use_yolov4_to_train_your_own_data.md

Based on this, I've created train.txt and val.txt files with the following content:

train.txt (255 lines)

20190413_184109.jpg 2391,883,2704,972,0 1824,470,2983,1309,1 2680,460,2955,629,1 1674,451,2019,771,1 2148,445,2299,510,1
20190413_184118.jpg 1906,1742,2190,1817,0 2789,1231,3468,1512,1 1467,1207,2451,1960,3
20190413_184131.jpg 1672,1834,2409,2102,0 804,-28,3589,2588,3 2913,2234,4035,3023,1
...

val.txt (64 lines)

20190413_184125.jpg 2346,1507,2708,1608,0 1402,1020,3017,2084,1 2880,1089,4018,1966,1 1328,1208,1409,1304,1
20190413_184207.jpg 2524,1097,3268,1495,0 324,7,3656,2275,1 2886,3,3758,277,1 4,-2,898,577,1
20190413_184320.jpg 721,2371,1758,2707,0 81,799,2608,3017,1 2561,905,3233,1138,1 3887,837,4043,1042,1 -8,1126,476,2282,1
...
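For reference, the label format is one image per line: `image_name x1,y1,x2,y2,class_id ...`. Note that a few boxes above have negative coordinates (e.g. `804,-28` and `-8,1126`). A small sanity-check script along these lines (the file names and class count are my assumptions; adjust to your setup) can rule out label problems:

```python
# Sanity-check label files in this repo's format:
#   image_name x1,y1,x2,y2,class_id [x1,y1,x2,y2,class_id ...]
NUM_CLASSES = 4  # the 4 classes used in this dataset

def check_labels(path):
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            parts = line.split()
            if not parts:
                continue
            image_name, boxes = parts[0], parts[1:]
            for box in boxes:
                x1, y1, x2, y2, cls = map(int, box.split(','))
                if min(x1, y1) < 0:
                    print(f'{path}:{lineno} {image_name}: negative coordinate in {box}')
                if x2 <= x1 or y2 <= y1:
                    print(f'{path}:{lineno} {image_name}: empty or inverted box {box}')
                if not 0 <= cls < NUM_CLASSES:
                    print(f'{path}:{lineno} {image_name}: class id {cls} out of range')

check_labels('train.txt')
check_labels('val.txt')
```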

When we look at the first image, it's tagged as follows:

[image: annotated example photo]

There are 4 cars tagged and 1 license plate.

Preparation

To start the actual training process I first changed some values in cfg.py:

# I changed this to False because I saw that `Cfg.pretrained` is only used when this is False.
Cfg.use_darknet_cfg = False

# I changed this to 32 because I was running into GPU Memory errors
Cfg.subdivisions = 32

I copied the val.txt file to pytorch-YOLOv4\data\val.txt.

Training

After this I ran the following command:

python train.py --gpu 0 -pretrained C:\git\pytorch-YOLOv4\yolov4.conv.137.pth -classes 4 -train_label_path C:\ml\data\train.txt -dir C:\ml\images

After 50 epochs, though, the results and AP were still really low (there also seems to be a logging error, which I haven't looked into):

Epoch 51/300:  38%|▍| 98/255 [00:41<00:39,  4.02im--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\******\.conda\envs\******\lib\logging\__init__.py", line 1028, in emit
    stream.write(msg + self.terminator)
  File "C:\Users\******\.conda\envs\******\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uff0c' in position 169: character maps to <undefined>
Call stack:
  File "train.py", line 626, in <module>
    device=device, )
  File "train.py", line 411, in train
    scheduler.get_lr()[0] * config.batch))
Message: 'Train step_6400: loss : 34547.26171875,loss xy : 42.09270095825195,loss wh : 7.28180456161499,loss obj : 34455.48046875,loss cls : 42.40758514404297,loss l2 : 12746.814453125,lr : 1.6000000000000004e-06'
Arguments: ()
Epoch 51/300: 100%|▉| 254/255 [01:24<00:00,  4.21iin function convert_to_coco_api...
creating index...
index created!
Accumulating evaluation results...
DONE (t=0.12s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.002
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.002
2020-11-12 16:49:27,147 train.py[line:446] INFO: Created checkpoint directory
2020-11-12 16:49:27,590 train.py[line:451] INFO: Checkpoint 51 saved !
Epoch 51/300: 100%|▉| 254/255 [02:09<00:00,  1.96i
Epoch 52/300: 100%|▉| 254/255 [01:22<00:00,  4.35iin function convert_to_coco_api...
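
As an aside, the logging error itself looks unrelated to training: the log message contains a fullwidth comma (U+FF0C), which the Windows cp1252 console encoding can't represent. If I'm reading the traceback right, launching training in Python's UTF-8 mode (Python 3.7+) should avoid the crash:

python -X utf8 train.py --gpu 0 -pretrained C:\git\pytorch-YOLOv4\yolov4.conv.137.pth -classes 4 -train_label_path C:\ml\data\train.txt -dir C:\ml\images

(Setting the environment variable PYTHONUTF8=1 has the same effect.)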

When I tried to use the model on an image, the results also came out strange:

[image: inference result]

Also, if we look at TensorBoard, we see the same issues:

[image: TensorBoard screenshot]

I'm not sure what I'm missing, but for some reason it seems the model is not training.

More investigation

I also tried with Cfg.use_darknet_cfg = True and Cfg.classes = 4; however, when I do this I keep getting the following error:

2020-11-12 17:47:24,605 train.py[line:611] INFO: Using device cuda
convalution havn't activate linear
convalution havn't activate linear
convalution havn't activate linear
2020-11-12 17:47:26,003 train.py[line:327] INFO: Starting training:
        Epochs:          300
        Batch size:      64
        Subdivisions:    32
        Learning rate:   0.001
        Training size:   255
        Validation size: 64
        Checkpoints:     True
        Device:          cuda
        Images size:     608
        Optimizer:       adam
        Dataset classes: 4
        Train label path:C:\*****\train.txt
        Pretrained:

Epoch 1/300:   0%|       | 0/255 [00:16<?, ?img/s]
Traceback (most recent call last):
  File "train.py", line 626, in <module>
    device=device, )
  File "train.py", line 380, in train
    loss, loss_xy, loss_wh, loss_obj, loss_cls, loss_l2 = criterion(bboxes_pred, bboxes)
  File "C:\Users\******\.conda\envs\******\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "train.py", line 242, in forward
    output = output.view(batchsize, self.n_anchors, n_ch, fsize, fsize)
RuntimeError: shape '[2, 3, 9, 76, 76]' is invalid for input of size 2945760
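
A likely reading of this shape error: the tensor actually holds 2 × 255 × 76 × 76 = 2,945,760 elements, and 255 = 3 × (5 + 80), i.e. the darknet .cfg is still configured for the 80 COCO classes, while the loss expects n_ch = 5 + 4 = 9 channels per anchor. If that's the case, each [yolo] section of the .cfg needs classes=4, and the convolutional layer right before it needs filters=(5 + 4) × 3 = 27 (see #138 and the comments further down).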

huqiyii commented Nov 15, 2020

I have the same problem.


devedse commented Nov 30, 2020

Is someone tracking these issues?

@ersheng-ai, @Tianxiaomo, @KelvinCPChiu, could you maybe chip in?

@lifan-ake

> Is someone tracking these issues?
>
> @ersheng-ai, @Tianxiaomo, @KelvinCPChiu, could you maybe chip in?

If you are using Cfg.use_darknet_cfg = True, you should try this solution #138 (comment).

@ErlingLie

Did you find any solution to this? I'm also trying to train with a custom dataset. I tried both Cfg.use_darknet_cfg = True and False. No matter what I do, I get AP = 0.000 in evaluation after one epoch of 5000 images, and my loss is in the thousands.


devedse commented Dec 16, 2020

Nope, still waiting...

@781458112

I have the same problem, and my initial loss is 200,000. Although the loss has been decreasing, it was still at 12,000 at epoch 140; the convergence rate is quite slow. I also found that the loss is concentrated in the regression losses, so I think there may be a problem with the loss calculation.

@missFuture

Hello, did you solve this problem? I have trained for hundreds of epochs, and AP and AR are almost zero when evaluating the model; for testing a single image using models.py, refer to #413 (comment). Could you give me some advice? Thank you.


devedse commented Apr 11, 2021

Nope, I've given up.

@Haofulong123

Nope, I've given up.


zyg519 commented Jul 15, 2021

Maybe this: in the function Yolo_loss.build_target(), the labels mismatch the predictions; one is an offset value and one is not.

@engyasin

I had the same problem.
What worked for me is using the darknet cfg again with:
Cfg.use_darknet_cfg = True
and loading your last weights using PyTorch's default method, load_state_dict, rather than the method provided with the class (I think that one should be deleted), so you have to edit the train.py script.
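
For anyone trying this, here's a minimal sketch of the loading step engyasin describes (the cfg and checkpoint paths are placeholders for whatever your setup produced):

```python
import torch
from tool.darknet2pytorch import Darknet  # the darknet model class in this repo

# Build the model from the darknet cfg, then load the trained checkpoint with
# PyTorch's standard load_state_dict instead of the weight loader provided
# with the class.
model = Darknet('cfg/yolov4-custom.cfg')  # placeholder path
state_dict = torch.load('checkpoints/Yolov4_epoch50.pth', map_location='cpu')  # placeholder path
model.load_state_dict(state_dict)
model.eval()
```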

@drumzhang

I have the same problem. Has anyone solved it?

@ljl1302924199

I have the same problem. Has anyone solved it?


ElHouas commented Feb 24, 2022

Hi,
I also have the same problem with my custom dataset: after training for 300 epochs the AP is zero, and when running inference with the resulting model I cannot detect anything, even on the training set.
Any ideas?

@ErlingLie

I just abandoned this repo altogether and used the original darknet YOLOv4 instead. Got it to work much quicker and better.


drumzhang commented Feb 24, 2022 via email

@YoungjaeDev

@Pigdrum
For example, what kind of change did you make?


YoungjaeDev commented Mar 3, 2022

@devedse
When using Cfg.use_darknet_cfg = True, the number of filters in the convolutional layer before each [yolo] layer must be changed to (5 + classes) * 3, just as in YOLOv3.
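
Concretely, for 4 classes that means editing the convolutional layer right before each of the three [yolo] sections of the .cfg, roughly like this (excerpt; the rest of the file stays unchanged):

```
[convolutional]
size=1
stride=1
pad=1
# filters must be (5 + classes) * 3 = 27 for 4 classes
filters=27
activation=linear

[yolo]
classes=4
```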

@YoungjaeDev

When debugging, IoU = 0 keeps coming out of the bbox loss. I think there's an issue in the anchor code, or the custom data and the anchor box sizes don't match.

@changzhilei

I have the same problem. Has anyone solved it?

@YoungjaeDev

Candidate matches between ground truth and anchors are drawn well, but nothing matches (IoU > 0) between predictions and anchors. There seems to be a problem with the model output.

@lee666-lee

Hey guys, I think the reason the AP and AR are quite low might be an inappropriate Cfg.burn_in in cfg.py.

  1. When I used the default Cfg.burn_in=1000 as Tianxiaomo did and trained on a coins dataset, the AP and AR were extremely small. Later I looked at the lr curve and noticed the lr was growing at around 1e-7 magnitude, very slowly, which I guess leads to equally slow updating of the model weights, resulting in small AP and AR but high loss even after hundreds of epochs.
  2. Then I boldly changed Cfg.burn_in=1, set -l 0.001 and epoch=100, and it turns out my guess was right (at least it works for me, and I hope it also works for you :D!!!)
    [images: lr curve and evaluation metrics]
    NOTES: At first I only ran for around 20 epochs and the metrics were still extremely small, so I decided to run more epochs and see what would happen (at least it won't hurt, why not give it a shot hahhhhh), and the results began to look reasonable.
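
This would also explain the numbers in the original post: with 255 training images and batch=64 there are only about 4 optimizer steps per epoch, so after 50 epochs the scheduler has taken roughly 200 of its 1000 burn-in steps. Assuming train.py uses the usual quartic burn-in ramp, that gives lr = 0.001 × (200/1000)^4 ≈ 1.6e-6, exactly the learning rate in the step_6400 log line above:

```python
import math

# Back-of-envelope check, assuming train.py ramps the learning rate as
# lr = base_lr * (step / burn_in) ** 4 while step < burn_in.
base_lr, burn_in = 0.001, 1000
steps_per_epoch = math.ceil(255 / 64)    # ~4 optimizer steps per epoch
step = 50 * steps_per_epoch              # ~200 steps after 50 epochs
print(base_lr * (step / burn_in) ** 4)   # ~1.6e-06, matching the log above
```

So with a tiny dataset, either lower Cfg.burn_in or train for many more steps before expecting the learning rate to reach a useful magnitude.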

@GeorgeTsio

> Hey guys, I think the reason the AP and AR are quite low might be an inappropriate Cfg.burn_in in cfg.py. […]

Can you show us what format your train.txt has?
