
[Reduce Runtime] Better utilize GPU resources for PGD with random restarts #98

SohamTamba opened this issue May 1, 2021 · 0 comments

SohamTamba commented May 1, 2021

Summary

I'm working on a model that suffers from gradient masking, so I have to use multiple (16) random restarts. I developed an implementation that is faster than the current one when using multiple restarts.

My experiments involve training a BatchNorm-free ResNet-26 on CIFAR-10 with a V100 GPU, a batch size of 64, 16 random restarts, and 20 PGD steps.
My implementation costs 170 minutes per epoch, while the current implementation costs 250 minutes per epoch.
This is a 1.47x speed-up.

The training curves of both implementations match, so I'm fairly confident my implementation is correct.

Please let me know if you guys are interested in a Pull Request.


Details

The current implementation for PGD with random restarts works like this:

Input: X, Y, model

output = X.clone()
for restart_i in range(num_random_restarts):
    noise = random_constrained_noise(shape=X.shape) if restart_i > 0 else 0
    pert_X = X + noise
    adv, is_misclass = run_pgd(pert_X, Y, model)
    output[is_misclass] = adv[is_misclass]
return output
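
(run_pgd and random_constrained_noise above are pseudocode placeholders, not the library's actual API.) For an L-infinity threat model, the noise helper could be as simple as the following sketch; the eps default is just an example budget:

import torch

def random_constrained_noise(shape, eps=8/255, device=None):
    # Uniform noise inside the L-inf ball of radius eps, with the
    # same shape as the (stacked) clean inputs.
    return torch.empty(shape, device=device).uniform_(-eps, eps)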

If the user has enough GPU memory (which will often be the case for CIFAR-10), then the following implementation would improve GPU utilization:

B, C, H, W = X.shape
Y_stack = torch.stack([Y for _ in range(num_random_restarts)])
X_stack = X.unsqueeze(0) # Or torch.stack([X for _ in range(num_random_restarts)])

noise = random_constrained_noise(shape=[num_random_restarts, B, C, H, W])
noise[0, :, :, :, :] = 0
pert_X_stack = X_stack + noise

pert_X_stack = pert_X_stack.view(-1, C, H, W)
Y_stack = Y_stack.view(-1)

adv, is_misclass = run_pgd(pert_X_stack, Y_stack, model)
adv = adv.view(-1, B, C, H, W)
is_misclass = is_misclass.view(-1, B)
return_ind = is_misclass.int().argmax(dim=0)  # per-sample index of a successful restart
return adv[return_ind, torch.arange(B)]
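
The selection at the end is the only subtle part: is_misclass has shape [num_random_restarts, B], and for each of the B samples we pick the adversarial example from a restart that fooled the model (falling back to restart 0, the zero-noise restart, if none succeeded). A self-contained toy check of that indexing, with made-up shapes:

import torch

R, B, C, H, W = 4, 3, 3, 32, 32               # made-up sizes
adv = torch.randn(R, B, C, H, W)
is_misclass = torch.rand(R, B) < 0.3          # pretend per-restart success mask

return_ind = is_misclass.int().argmax(dim=0)  # [B], a successful restart per sample
selected = adv[return_ind, torch.arange(B)]   # [B, C, H, W]

for b in range(B):
    assert torch.equal(selected[b], adv[return_ind[b], b])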

If num_random_restarts is so large that an input of size num_random_restarts x batch_size x C x H x W does not fit in GPU memory, then the user could also be allowed to specify a mini_num_restarts < num_random_restarts so that the GPU processes mini_num_restarts restarts at a time, i.e. the input size is reduced to mini_num_restarts x batch_size x C x H x W. A sketch of this chunked variant follows.
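
A rough sketch of that chunked variant, reusing the hypothetical run_pgd / random_constrained_noise placeholders from above (mini_num_restarts is the user-supplied chunk size):

B, C, H, W = X.shape
output = X.clone()

for start in range(0, num_random_restarts, mini_num_restarts):
    chunk = min(mini_num_restarts, num_random_restarts - start)
    noise = random_constrained_noise(shape=[chunk, B, C, H, W])
    if start == 0:
        noise[0] = 0                        # first restart starts from the clean input
    pert_X = (X.unsqueeze(0) + noise).reshape(-1, C, H, W)
    Y_rep = Y.repeat(chunk)                 # labels tiled to match the flattened stack

    adv, is_misclass = run_pgd(pert_X, Y_rep, model)
    adv = adv.view(chunk, B, C, H, W)
    is_misclass = is_misclass.view(chunk, B)

    # Overwrite samples that were fooled by some restart in this chunk
    # (which successful restart wins does not matter for training).
    hit = is_misclass.any(dim=0)
    ind = is_misclass.int().argmax(dim=0)
    output[hit] = adv[ind[hit], torch.arange(B, device=X.device)[hit]]

return output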

My only concern is that this might negatively affect BatchNorm. But given how computationally intensive adversarial training is, this might be a worthwhile option to provide users.

Reference (the restart loop in the current implementation):

for _ in range(random_restarts):

@SohamTamba SohamTamba changed the title Better utilize GPU resources for PGD with random restarts [Reduce Runtime] Better utilize GPU resources for PGD with random restarts May 12, 2021