Training with other configurations. #3

mkocabas · 2018-07-15T09:52:12Z

Thanks for the great implementation. I'm eager to collaborate with you to test other configurations. I have 2 x 1080 and 2 x 1080ti. I can borrow more if needed. Looking forward to your response!


GengDavid · 2018-07-15T11:44:23Z

Hi @mkocabas ,

Thanks for your interest in my implementation.
There may be at least two configurations to be tested, ResNet-50+384x288 and ResNet-101+384x288. Which one do you prefer to test? Or do you want to test both of them?

I've modified the code a little, so please clone/pull the latest version before you run it. Please follow the README to configure the environment.

You can train a ResNet-50+384x288 model directly in the 384.288.model directory by running train.py.
You may need to modify the batch size in config.py, and use -g to specify the number of GPUs you use. For example, you may set batch_size = 12 and run python3 train.py -g 2 when you use 2 x 1080 GPUs to train the model.

To train a ResNet-101+384x288 model, you need to set model='CPN101' in config.py, and then train the model the same way.
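
The batch-size advice above can be sketched in a few lines. This is a hypothetical helper, not code from the repository: it assumes that batch_size in config.py is the total across all GPUs (which is how the later comments in this thread read) and that -g is parsed with argparse. The names `split_batch` and `--num_gpus` are made up for illustration, and for simplicity the sketch insists on an even split.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical CLI mirroring `python3 train.py -g 2`;
    # the long option name is an assumption.
    parser = argparse.ArgumentParser(description="Sketch of a training CLI")
    parser.add_argument("-g", "--num_gpus", type=int, default=1,
                        help="number of GPUs to train on")
    return parser.parse_args(argv)


def split_batch(total_batch_size, num_gpus):
    """Return how many samples each GPU gets when the batch is split evenly."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if total_batch_size % num_gpus != 0:
        raise ValueError("this sketch only handles evenly divisible batches")
    return total_batch_size // num_gpus


if __name__ == "__main__":
    args = parse_args(["-g", "2"])
    print(split_batch(12, args.num_gpus))
```

Under that assumption, batch_size = 12 on two GPUs means each card holds 6 samples per step, which is the figure to compare against a single card's memory.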

If you have any questions, feel free to contact me. You can also mail me at [email protected] or [email protected].

mkocabas · 2018-07-15T11:59:33Z

Cool, so I can start with ResNet-50+384x288. After that I can try ResNet-101.

I'll use 2 x 1080ti with the default hyperparameters as in config. Am I correct?

mkocabas · 2018-07-15T12:20:33Z

@GengDavid we have a little problem. 1080tis have 11GB memory. batch_size=6 barely fits the memory. This means that we can train with batch_size=12 using 2 gpus. What do you think?

GengDavid · 2018-07-15T12:47:56Z

If you are using 1080tis, I think you can set batch_size more than 12 with 2 gpus while running ResNet-50+384x288 model.

GengDavid · 2018-07-15T12:51:05Z

@mkocabas ResNet-50+384x288 model with batch_size=12 takes about 8G memory in my experiment.

mkocabas · 2018-07-15T12:53:08Z

I'm consistently getting OOM error, but let me check. I'll restart the computer, maybe there are some blocking processes. I'll inform you about the progress.

mkocabas · 2018-07-15T13:05:07Z

@GengDavid, restarting solved the problem. Thanks for pointing out! I'll update this issue as training continues.

How many epochs did you train the 256x192 model?

GengDavid · 2018-07-15T13:13:31Z

@mkocabas About 25 epochs. I don't remember the exact figure.

mkocabas · 2018-07-15T13:15:46Z

I see, so probably it'll take 4 days to converge.

GengDavid · 2018-07-15T13:17:58Z

Fine, thanks.

mkocabas · 2018-07-16T19:53:58Z

Epoch 6 (tested with GT bboxes)

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.688
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.894
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.750
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.654
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.742
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.719
 Average Recall     (AR) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.904
 Average Recall     (AR) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.776
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.681
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.777

Epoch 13 (tested with GT bboxes)

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.726
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.914
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.785
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.690
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.781
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.754
 Average Recall     (AR) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.924
 Average Recall     (AR) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.810
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.716
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.812
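
To compare two checkpoints without reading the tables by eye, the summary lines can be parsed into a dictionary. The following is a minimal sketch that assumes the output has exactly the layout shown above; `parse_summary` is a hypothetical helper, not part of the repository or of the COCO tooling.

```python
import re

# Matches one summary line, e.g.
# " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.688"
SUMMARY_RE = re.compile(
    r"Average (?:Precision|Recall)\s+\((AP|AR)\)\s+@\[\s*IoU=([\d.:]+)\s*\|"
    r"\s*area=\s*(\w+)\s*\|\s*maxDets=\s*(\d+)\s*\]\s*=\s*(-?[\d.]+)"
)


def parse_summary(text):
    """Map (metric, IoU range, area) to the reported score."""
    scores = {}
    for match in SUMMARY_RE.finditer(text):
        metric, iou, area, _max_dets, value = match.groups()
        scores[(metric, iou, area)] = float(value)
    return scores


# Two lines copied from the epoch 6 and epoch 13 tables above.
epoch6 = parse_summary(
    " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.688")
epoch13 = parse_summary(
    " Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.726")

key = ("AP", "0.50:0.95", "all")
print("AP gain from epoch 6 to 13: %.3f" % (epoch13[key] - epoch6[key]))
```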

mkocabas · 2018-07-18T06:08:54Z

@GengDavid do you have the weights from the 5th epoch of the ResNet50-256x192 model?

GengDavid · 2018-07-18T13:40:44Z

Yes, I have saved the 5th-epoch pre-trained model.
But I'm sorry to tell you that there's something different from the original paper in my code just as @Tiamo666 mentioned in issue #4.
The results seem very close, but I'm still going to modify the network and then re-test it.

mkocabas · 2018-07-18T13:44:13Z

Yeah I saw the discussion. Please let me know about the results after modification. If you don't have enough GPUs, I can test the corrected model.

GengDavid · 2018-07-18T15:22:14Z

I'll let you know the results, but it may take a while since I only have 1 x 1080 free to run the code. Maybe you can test the ResNet-50+384x288 model first.
Thanks!

mkocabas · 2018-07-18T17:29:21Z

I've started to train the fixed ResNet-50+384x288 on a Titan V with batch-size=24

GengDavid · 2018-07-26T01:41:53Z

Hi, @mkocabas
I've updated the ResNet-50+256*192 results. Have you got any results?
Thx.

Tiamo666 · 2018-08-13T02:39:55Z

Due to network limitations, I could not download the person detection results on COCO, so I just used the ground truth.

GengDavid · 2018-08-14T07:35:46Z

@Tiamo666 Great job!
Can you provide the pre-trained model so that I can test it with detection results?
I think you can open a PR with a link in it to download the pre-trained model.

GengDavid · 2018-08-14T08:17:30Z

@Tiamo666 Or if you do not want to open a PR, could you just provide a link to download the model? Google Drive, OneDrive, Dropbox and Baidu Yun are all fine.

Tiamo666 · 2018-08-14T10:33:05Z

OK, I guess Baidu Yun is a good choice. I will try to share the pre-trained model on it and provide you the link as soon as I have uploaded the model.

Tiamo666 · 2018-08-15T02:14:34Z

Hi, David, I've already uploaded the model to Baidu Yun.
Here is the link:
https://pan.baidu.com/s/1fdy5_0HQm63QtlOzxKbpuw

GengDavid · 2018-08-15T05:07:18Z

Great! I'll test it and update the result later.

GengDavid · 2018-08-15T06:33:30Z

@Tiamo666 I've updated the results.

Tiamo666 · 2018-08-27T06:33:20Z

That's cool!
I'll have time to train with ResNet101+384*288. I'll share the model after training finishes.

GengDavid · 2018-08-27T07:56:14Z

@Tiamo666 That's great! If you have any problem, feel free to contact me.

Tiamo666 · 2018-09-06T02:57:52Z

Hi, David. I've uploaded the model of cpn384*288 with Resnet101 on Baidu Yun.
Here is the link:
https://pan.baidu.com/s/1toikUHSqHhHP3DkIOkNctA

GengDavid · 2018-09-06T08:45:33Z

@Tiamo666 Great! Thanks a lot. I'll update the results soon.

Tiamo666 · 2018-09-10T06:22:39Z

Hello, David. I've just found that last week I trained with the old code, which has the "Color Normalized" bug. I'm sorry about that. I can retrain the model this week.

GengDavid · 2018-09-18T07:55:32Z

Cool. @Tiamo666 Could you please tell me the results you got before and after the fine-tuning process (using GT bboxes)?

Tiamo666 · 2018-09-19T02:21:32Z

GengDavid · 2018-09-22T10:29:56Z

@Tiamo666 Thanks! I'm a little busy these days, I'll update the results and model soon.

mingloo · 2018-09-27T01:45:37Z

Hi @GengDavid @Tiamo666

I've used commit 8e85af2 to train the ResNet50+ 256x192 model from scratch with GT bbox input and the default parameter settings, with the number of epochs set to 32. The overall result, 70.8 as shown below, is slightly worse than the reported 71.2:

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.708
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.905
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.782
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.683
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.749
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.740
Average Recall     (AR) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.918
Average Recall     (AR) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.804
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.710
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.786

How many epochs did you train for to achieve 71.2 for ResNet50+ 256x192?

As for the ResNet50+ 384x288 model trained from scratch with GT bbox input and the default parameter settings, the epoch=32 result is slightly better than the reported 73.7, as follows:

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.741
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.925
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.805
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.706
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.795
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.768
Average Recall     (AR) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.932
Average Recall     (AR) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.825
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.730
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.826

GengDavid · 2018-09-27T08:11:52Z

@mkocabas Sorry that I have not updated the results yet. 71.2 is the old result.
It is strange that the results after fixing the bugs are lower than the results before. I'll update all the results this weekend, but I still haven't figured out the reason. Maybe we need to adjust the parameter settings, since they were tuned for the old code.

mingloo · 2018-09-27T08:43:47Z

@GengDavid Now, the ResNet 50 + 256x192 with detection GT bboxes is slightly worse than the old result, but the ResNet 50 + 384x288 is slightly better than the old result.

GengDavid · 2018-09-27T09:02:35Z

Cool, so I think some slight differences are acceptable. Could you share your pre-trained ResNet 50 + 384x288 model with us? It would be great.

mingloo · 2018-09-27T09:17:31Z

@GengDavid Please see my comments #3 (comment)

GengDavid · 2018-09-27T10:21:48Z

Sorry, I don't clearly understand what you mean by referencing comment-424928303😳

mingloo · 2018-09-27T13:31:39Z

@GengDavid
Sorry, I misunderstood your comment.
The trained model for ResNet50+ 384x288 can be found at GoogleDrive.

GengDavid · 2018-09-30T16:23:08Z

Hi @Tiamo666 @mingloo
I've updated all the pre-trained models and results.
Sorry for taking a long time to update. Thanks for your great work!

GengDavid · 2018-09-30T16:23:35Z

However, it is a little confusing that the CPN-101-384x288 model performs even worse than CPN-50-384x288.
@Tiamo666 Could you show me the parameter setting you used to fine-tune the model? Thanks!
Have a good National Day.

mingloo · 2018-10-01T06:19:23Z

@GengDavid @Tiamo666 Thanks for updating the result.
I'll try to train CPN-ResNet101-384x288 from scratch on my side.

GengDavid · 2018-10-02T08:41:49Z

@mingloo Great! Thanks.

Tiamo666 · 2018-10-09T02:34:23Z

@GengDavid, thanks a lot. I just came back from my holiday. I didn't change any other parameters. I just replaced the learning rate scheduler with PyTorch's built-in optim.lr_scheduler package. Here is my code:

    # fine tune
    for k, v in pretrained_dict.items():
        if k in ['module.global_net.upsamples.0.1.bias',
                 'module.global_net.upsamples.1.1.bias',
                 'module.global_net.upsamples.2.1.bias']:
            continue
        new_dict[k] = v
    model.load_state_dict(new_dict)
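
The key-filtering loop above can be tried without PyTorch by using plain dictionaries. This is only a sketch: `drop_keys` is a hypothetical name, the toy values stand in for tensors, and the thread does not say why these bias entries are skipped (presumably the updated network no longer has them).

```python
def drop_keys(pretrained_dict, skip_keys):
    """Copy a checkpoint's state dict, leaving out the listed entries."""
    skip = set(skip_keys)
    return {k: v for k, v in pretrained_dict.items() if k not in skip}


# Toy stand-in for a real checkpoint; real values would be tensors.
pretrained = {
    "module.global_net.upsamples.0.1.weight": [1.0],
    "module.global_net.upsamples.0.1.bias": [0.0],
}
filtered = drop_keys(pretrained, ["module.global_net.upsamples.0.1.bias"])
print(sorted(filtered))
```

The resulting dictionary is what would then be handed to `model.load_state_dict`, as in the loop above.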

    # adjust lr rate
    scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=cfg.lr_dec_epoch, gamma=cfg.lr_gamma)
    for epoch in range(args.start_epoch, args.epochs):
        # lr = adjust_learning_rate(optimizer, epoch, cfg.lr_dec_epoch, cfg.lr_gamma)
        scheduler.step(epoch)
        lr = optimizer.state_dict()['param_groups'][0]['lr']
        print('\nEpoch: %d | LR: %.8f' % (epoch + 1, lr))

The following is part of my log.txt. I fine-tuned from epoch 32, and the total number of epochs is 35:

30.000000 0.000031 102.073177
31.000000 0.000016 101.399609
32.000000 0.000016 101.165480
33.000000 0.000016 101.801196
34.000000 0.000016 101.328027
35.000000 0.000016 101.059933
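
For readers who want to see how a multi-step schedule produces values like those in the second column of the log (assuming that column is the learning rate), here is a plain-Python sketch of the closed form: the base rate multiplied by gamma once for every milestone already reached. The base rate of 0.0005 appears in a training log later in this thread; the milestone list and gamma are illustrative assumptions, not values read from config.py, chosen only so that the tail shows the same halving.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate after one multiplication by gamma per milestone reached."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


# Illustrative values only (assumptions, not taken from config.py).
BASE_LR = 5e-4
MILESTONES = [6, 12, 18, 24, 30]
GAMMA = 0.5

for epoch in (29, 30, 34):
    lr = multistep_lr(BASE_LR, MILESTONES, GAMMA, epoch)
    print("Epoch: %d | LR: %.8f" % (epoch + 1, lr))
```

With these assumed settings the rate is 3.125e-05 just before the last milestone and 1.5625e-05 after it, and it stays flat once no milestones remain.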

Tiamo666 · 2018-10-09T02:48:02Z

GengDavid · 2018-10-10T01:35:57Z

@Tiamo666 Thanks! So the number of the epoch is the point.

mingloo · 2018-10-12T06:56:03Z

@GengDavid @Tiamo666

I've trained the CPN101-384x288 model from scratch. The model can be downloaded from GoogleDrive.

The evaluation result is as follows:

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.740
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.924
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.815
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.710
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.787
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.770
Average Recall     (AR) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.934
Average Recall     (AR) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.832
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.736
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.822

Tiamo666 · 2018-10-12T07:54:15Z

@mingloo great job!
Could you please tell me how many epochs you trained for?

mingloo · 2018-10-12T08:55:48Z

@Tiamo666
I trained the CPN101-384x288 model from scratch on a single 1080ti GPU with epoch=32.
One key difference is that batch_size is set to 18.

Training from scratch took almost 9 days.

One more thing to note is that I used GT bboxes for training the above model.

Tiamo666 · 2018-10-12T09:46:47Z

@mingloo Thanks a lot, I got it.

mingloo · 2018-10-12T13:29:26Z

@Tiamo666
Sorry. I've double-checked, and the CPN101-384x288 model that I trained from scratch uses the default parameter settings. So please ignore the previous #3 (comment).

GengDavid · 2018-10-26T04:24:35Z

@mingloo Thanks a lot.
I wonder whether you have tested the trained model at different epochs, or just at the last epoch (32)?

mingloo · 2018-10-26T07:38:22Z

@GengDavid
What I've tested is all for epoch=32.

Liz66666 · 2018-11-10T04:20:54Z

@GengDavid Hi, I have run into some problems with training. Can you share your log file for ResNet 50+256x192? Thanks

GengDavid · 2018-11-10T13:50:12Z

@YoungZiyu
Sure, you can find training log here

leonshek · 2019-01-09T08:04:48Z

@Tiamo666 @GengDavid
How to use the models to test one single image?
Is there any inference script?

my-hello-world · 2019-06-21T11:32:57Z

@GengDavid @aidarikako @mingloo
Hello, why am I getting such a large loss, like this:

    Total params: 104.55MB

Epoch: 1 | LR: 0.00050000
iteration 100 | loss: 362.8368835449219, global loss: 246.98593711853027, refine loss: 115.85093688964844, avg loss: 403.03418150042546

I changed lr to 1e-6, but it doesn't help.
Any advice? Thanks.
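
One quick sanity check on a log line like the one above is whether the total matches the sum of its two components. The sketch below assumes the exact line format shown in this comment; `parse_loss_line` is a hypothetical helper, not part of the repository. In this line the numbers do add up to float precision, which suggests, though the thread does not confirm it, that the reported loss is the global loss plus the refine loss.

```python
import math
import re

LOSS_RE = re.compile(
    r"iteration (\d+) \| loss: ([\d.]+), global loss: ([\d.]+), "
    r"refine loss: ([\d.]+), avg loss: ([\d.]+)"
)


def parse_loss_line(line):
    """Split one training-log line into its named numeric fields."""
    match = LOSS_RE.search(line)
    if match is None:
        raise ValueError("not a loss line: %r" % line)
    loss, global_loss, refine_loss, avg_loss = (float(g) for g in match.groups()[1:])
    return {
        "iteration": int(match.group(1)),
        "loss": loss,
        "global": global_loss,
        "refine": refine_loss,
        "avg": avg_loss,
    }


line = ("iteration 100 | loss: 362.8368835449219, global loss: 246.98593711853027, "
        "refine loss: 115.85093688964844, avg loss: 403.03418150042546")
fields = parse_loss_line(line)
# Loose tolerance, since the logged values appear to come from 32-bit floats.
print(math.isclose(fields["loss"], fields["global"] + fields["refine"], rel_tol=1e-6))
```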

my-hello-world · 2019-06-21T11:33:42Z

@GengDavid @mkocabas @Tiamo666 @mingloo @YoungZiyu
Hello, why am I getting such a large loss, like this:

    Total params: 104.55MB

Epoch: 1 | LR: 0.00050000
iteration 100 | loss: 362.8368835449219, global loss: 246.98593711853027, refine loss: 115.85093688964844, avg loss: 403.03418150042546

I changed lr to 1e-6, but it doesn't help.
Any advice? Thanks.

GengDavid added the enhancement label Sep 6, 2018

my-hello-world mentioned this issue Jun 21, 2019

@YoungZiyu #27

Open
