Questions about the performances. #1

Open
D-X-Y opened this issue Apr 27, 2017 · 23 comments

Comments

@D-X-Y

D-X-Y commented Apr 27, 2017

Hi,

May I ask about your final performance? The curves are a little confusing.
I have also implemented a different version (https://github.com/D-X-Y/ResNeXt), and my results are a little lower than the official code, by about 0.2 on CIFAR-10 and 1.0 on CIFAR-100.
I really want to know what causes the differences.

I have also tried training ResNet-20/32/44/56. I'm pretty sure the model architecture is the same as in the official code, but I still obtain a much lower accuracy.

Would you mind giving me some suggestions?

@wangdelp

I am also curious about the training performance. BTW, I need to run the training many times with different hyper-parameters, and running 300 epochs takes days even with four Titan X GPUs. Have you tried using fewer epochs or a different learning rate schedule? Please let me know if you have any suggestions. Thank you.

@prlz77
Owner

prlz77 commented Apr 28, 2017

@D-X-Y On CIFAR-10 it reaches 96.44%, and on CIFAR-100 81.62%. However, I am not keeping the random seed fixed between runs, so it sometimes does better than the baseline and sometimes worse.

As for what could be causing the performance difference, I talked with the author of the original paper, and he told me (he was right) that since I was using batch_size = 128 instead of 256, the lr should be divided by two. I have checked your code and I don't see much difference from mine, so could it just be a matter of finding the right random seed? Is the weight initialization exactly the same as in their code?
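For reference, the adjustment described above is the linear scaling rule: scale the learning rate in proportion to the batch size. A minimal sketch, assuming the paper's base setting of lr = 0.1 at batch size 256 (the helper name is just for illustration):

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    """Scale the learning rate linearly with the batch size."""
    return base_lr * batch_size / base_batch_size

print(scaled_lr(256))  # 0.1   (paper setting)
print(scaled_lr(128))  # 0.05  (this repository's batch size)
print(scaled_lr(64))   # 0.025
```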

@prlz77
Owner

prlz77 commented Apr 28, 2017

@wangdelp Using a single Titan X it takes me roughly one day on CIFAR. What are your batch size and learning rate?

@D-X-Y
Author

D-X-Y commented Apr 28, 2017

@prlz77 Thanks for your responses. The initialization is the same, and I have only trained on CIFAR-10 once, so maybe the average performance would be better.

There are two versions of the ResNeXt paper; they changed the batch size for CIFAR from 256 to 128 in version 2.
I notice that your performance on CIFAR-100 is about 1 point lower than the original paper. Do you think this is caused by the learning rate and multi-GPU training?

@prlz77
Owner

prlz77 commented Apr 28, 2017

@D-X-Y Since the performance on CIFAR-10 is correct, it is difficult to guess what is happening on CIFAR-100. Some possibilities are:

  • Running it many times with different random seeds might show there is no real difference (see the seeding sketch after this list).
  • The cuDNN configuration; I don't know if it is the same for the Torch and the PyTorch implementations.
  • As you said, multi-GPU training and the learning rate could also be an issue.
  • I have checked line by line, but there could still be a difference between the original implementation and mine. However, I don't know if that would explain the gap between the two CIFARs.
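To make multi-run comparisons meaningful, it helps to fix every source of randomness explicitly. A minimal PyTorch sketch (not part of this repository; the helper name is just for illustration):

```python
import random

import numpy as np
import torch

def set_seed(seed):
    """Seed Python, NumPy, and PyTorch (CPU and all GPUs) for repeatable runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(0)
```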

@prlz77
Owner

prlz77 commented Apr 28, 2017

By the way, take into account that the results I am providing are for the small net (cardinality 8, widen factor 4), so it gets 0.1 better on CIFAR-10 and 0.6 worse on CIFAR-100. When I have some time, I will provide multi-run results to see whether it is always like this.

@wangdelp

wangdelp commented Apr 28, 2017

@prlz77 I was using batch size 64, distributed among 4 GPUs, since I want to reduce memory consumption. I am using the default learning rate 0.1, decaying at [0.5, 0.75] * args.epochs, and running for 300 epochs. It sounds like I need two days to complete training on CIFAR-100. Maybe it's because other lab members are also using the GPUs.

Using batch size 256 leads to out of memory on a 12GB GPU. Maybe I should try batch size 128 on two GPUs.
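The decay points mentioned above amount to dropping the learning rate at 50% and 75% of training. A minimal sketch of such a step schedule with PyTorch's `MultiStepLR`; the model, optimizer hyper-parameters, and the decay factor of 0.1 are placeholders/assumptions, not values taken from this repository:

```python
import torch

# Placeholder model and optimizer, just to illustrate the schedule.
model = torch.nn.Linear(10, 10)
epochs = 300
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[int(0.5 * epochs), int(0.75 * epochs)],  # epochs 150 and 225
    gamma=0.1)                                           # assumed decay factor

for epoch in range(epochs):
    # ... training loop for one epoch goes here ...
    optimizer.step()   # step the optimizer before the scheduler
    scheduler.step()   # drops the lr at the milestone epochs
```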

@prlz77
Owner

prlz77 commented Apr 29, 2017

@wangdelp In my experience, bs=128 distributed over two 1080 Ti GPUs takes about one day; bs=128 on a single GPU takes a little longer, and bs=64 takes almost double the time for the same 300 epochs. I would suggest using bs=128 (note that with ngpu=4 you will be loading 128/4 samples per GPU, which is a small amount of memory). If the GPUs are already in use, that could be causing a performance issue, as you say. Also check that data loading is not the bottleneck, for instance by increasing the number of prefetching threads.
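If data loading is the suspect, raising the number of worker processes in the loader is the usual first step. A minimal sketch with torchvision's CIFAR-100 dataset (the loader settings are illustrative, not copied from this repository):

```python
import torch
import torchvision
import torchvision.transforms as transforms

train_set = torchvision.datasets.CIFAR100(
    root='./data', train=True, download=True,
    transform=transforms.ToTensor())

train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=128,
    shuffle=True,
    num_workers=8,     # more prefetching workers keep the GPU fed
    pin_memory=True)   # faster host-to-GPU transfers
```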

@wangdelp

@prlz77 Thank you. Should I use an initial lr of 0.05 when batch size = 128, and 0.025 when batch size = 64?

@prlz77
Owner

prlz77 commented Apr 30, 2017

@wangdelp Exactly!

@Queequeg92

Queequeg92 commented Oct 9, 2017

Hi, guys. I have a question about the results reported in the paper. Did they report the median of the best test error during training, or the median of the test error after training? @prlz77 @wangdelp

@prlz77
Owner

prlz77 commented Oct 9, 2017

@Queequeg92 I think it is the median of the best test error during training.
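For clarity, "median of the best test error" normally means: for each run, take the lowest test error observed over all epochs, then report the median of those per-run minima. A small illustrative sketch (the numbers are made up):

```python
from statistics import median

# Hypothetical per-epoch test errors (%) for three independent runs.
runs = [
    [25.3, 22.1, 20.4, 19.8, 20.1],
    [24.9, 21.7, 20.0, 19.5, 19.9],
    [25.6, 22.4, 20.8, 20.2, 20.5],
]

best_per_run = [min(errors) for errors in runs]  # best test error of each run
print(median(best_per_run))                      # -> 19.8
```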

@Queequeg92

@prlz77 I agree with you, since models are likely to be overfitting at the end of the training process. I have sent emails to some of the authors to confirm.

@Queequeg92

@prlz77 I think Part D of this paper gives the answer.

@wandering007

wandering007 commented May 10, 2018

@D-X-Y @prlz77 I'm facing the same problem when reproducing the performance of DenseNet-40 on CIFAR-100. With exactly the same configuration, the accuracy of the PyTorch version is often 1 point lower than the Torch version. I don't think it is caused by random seeds. However, after digging into the implementation details of the two frameworks, I find no differences. I am so confused...

@prlz77
Owner

prlz77 commented May 10, 2018

In the past I've noticed up to a 1% difference just from using the fastest cuDNN options, due to noise introduced by numerical imprecision.

@wandering007

@prlz77 I set cudnn.benchmark = True and cudnn.deterministic = True. Is that ok?

@prlz77
Owner

prlz77 commented May 10, 2018

@wandering007 Maybe with cudnn.deterministic = False you get better results.
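For reference, these switches live under `torch.backends.cudnn` in PyTorch. A minimal sketch of the two configurations being compared; dropping the determinism constraint lets cuDNN pick faster, possibly non-deterministic kernels:

```python
import torch

# Settings described above: autotuned kernel selection, restricted to
# deterministic algorithms.
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = True

# Suggested alternative: allow non-deterministic algorithms as well.
torch.backends.cudnn.deterministic = False
```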

@wandering007

wandering007 commented May 11, 2018

@prlz77 No improvements from my experiments. Thank you anyway.

@prlz77
Owner

prlz77 commented May 11, 2018

@wandering007 I'm sorry to hear that. I found this behaviour some years ago; maybe the library has changed, or noise is not that important in this model.

@boluoweifenda

@wandering007 I'm also confused about the differences between the two CIFAR datasets.
I have obtained similar accuracy with Wide-DenseNet on CIFAR-10.
But on CIFAR-100, with exactly the same model and training details, the accuracies are always about 1% lower than reported in the paper.
Do you have any suggestions on that?
BTW, I'm using TensorFlow.

@wandering007

wandering007 commented Jul 8, 2018

@boluoweifenda I haven't trained it with TensorFlow. There are a lot of ways to improve performance if you don't care about a fair comparison, like using dropout, a better lr schedule, or better data augmentation. Personally, I find a 1% performance difference between two frameworks acceptable. BTW, using the same settings across different frameworks is not entirely fair in itself :-)
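As one example of the data-augmentation knob mentioned above, the standard CIFAR recipe is pad-and-crop plus horizontal flip, followed by per-channel normalization. A torchvision sketch; the normalization statistics are the commonly quoted approximate CIFAR-100 values, not taken from any codebase in this thread:

```python
import torchvision.transforms as transforms

# Approximate CIFAR-100 per-channel statistics.
mean = (0.5071, 0.4865, 0.4409)
std = (0.2673, 0.2564, 0.2762)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),  # pad to 40x40, crop back to 32x32
    transforms.RandomHorizontalFlip(),     # flip with probability 0.5
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
])
```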

@boluoweifenda

@wandering007 Thanks for your reply~ But I do care about the fair comparison. Maybe I need to dig deeper to find the differences between the frameworks. However, I got the same accuracy on CIFAR-10 using TensorFlow, so the accuracy drop on CIFAR-100 is quite strange.
(╯°Д°)╯︵┻━┻
