
Volume too large #13

Open
yuehuang2023 opened this issue Oct 15, 2024 · 7 comments

Comments

@yuehuang2023

yuehuang2023 commented Oct 15, 2024

Hi, I tried to reproduce the results for EMPIAR-10073 on an A6000 GPU, with the parameters set according to the supplementary material. However, RELION reports errors. Any suggestions for solving this error? Thank you.

The run log is:

Initializing the particle dataset
Assigning a diameter of 512 angstrom
Number of particles: 138899
Initialized data loaders for half sets of size 62505  and  62505
consensus updates are done every  0  epochs.
box size: 380 pixel_size: 1.400011 virtual pixel_size: 0.0026246719160104987  dimension of latent space:  10
Number of used gaussians: 30000
Optimizing scale only
volume too large: change size of output volumes. (If you want the original box size for the output volumes use a bigger gpu. The size of tensor a (380) must match the size of tensor b (190) at non-singleton dimension 2
Optimizing scale only
Initializing gaussian positions from reference
100%|##########| 50/50 [00:07<00:00,  6.29it/s]
Final error: 5.322801257534593e-07
Optimizing scale only
Initializing gaussian positions from reference
100%|##########| 50/50 [00:08<00:00,  6.12it/s]
Final error: 5.322801257534593e-07
consensus gaussian models initialized
consensus model  initialization finished
mean distance in graph for half 1: 2.4982950687408447 Angstrom ;This distance is also used to construct the initial graph 
mean distance in graph for half 2: 2.4982950687408447 Angstrom ;This distance is also used to construct the initial graph 
Computing half-set indices
100%|##########| 218/218 [00:14<00:00, 15.24it/s]
setting epoch type
generating graphs
100%|#########9| 217/218 [00:32<00:00,  6.77it/s]
Index tensor must have the same number of dimensions as self tensor

The run error is:

/.conda/envs/relion-5.0/lib/python3.10/site-packages/dynamight/models/decoder.py:235: UserWarning: Using a target size (torch.Size([190, 190, 190])) that is different to the input size (torch.Size([380, 380, 380])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  loss = torch.nn.functional.mse_loss(
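For reference, the warning and the tensor-size error above can be reproduced with a minimal PyTorch snippet (the shapes are taken from the log; this is not DynaMight code): mse_loss between a full-size 380 px volume and a 2x-downsampled 190 px reference first warns about the size mismatch and then fails, because 380 and 190 cannot be broadcast together.

import torch

# Shapes from the run log above: full-size output vs. 2x-downsampled reference.
volume = torch.zeros(380, 380, 380)
reference = torch.zeros(190, 190, 190)

try:
    torch.nn.functional.mse_loss(volume, reference)
except RuntimeError as err:
    # "The size of tensor a (380) must match the size of tensor b (190) ..."
    print(err)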
@huwjenkins

Yes, a box size of 360 px is hard-coded in multiple places in the code:

if reference_volume.shape[-1] > 360:

if reference_volume.shape[-1] > 360:

if self.box_size > 360:

if self.box_size > 360:

I couldn't find this mentioned in the Nature Methods paper, and as @yuehuang2023 points out, one of the example datasets used a box size of 380 px. @schwabjohannes, @scheres - why is 360 px hard-coded as a limit? The message:

If you want the original box size for the output volumes use a bigger gpu

seems a bit disingenuous when 360 px appears to be a hard-coded limit?

I also encountered the same message when running on one of my datasets with a 384 px box.
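For illustration, a standalone sketch of what the hard-coded check does to a reference volume above 360 px (the 380 px shape is taken from the EMPIAR-10073 log above; this is not a patch, just the downsampling step in isolation):

import torch

reference_volume = torch.zeros(380, 380, 380)  # EMPIAR-10073 box size from the log

if reference_volume.shape[-1] > 360:
    # avg_pool3d with kernel 2 halves every spatial dimension: 380 -> 190
    reference_volume = torch.nn.functional.avg_pool3d(
        reference_volume.unsqueeze(0).unsqueeze(0), 2)
    reference_volume = reference_volume.squeeze()

print(reference_volume.shape)  # torch.Size([190, 190, 190])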

@scheres

scheres commented Nov 5, 2024 via email

@huwjenkins

Yes, you are correct: the message is triggered by running out of GPU memory. Sorry, I should have looked more carefully. I was running on an A40 with 48 GB, which I thought was quite a big GPU!

However, the volume will still be downscaled by a factor of 2 with a 384 px box. Should I crop the particles to 360 px?

@yuehuang2023
Author

I used an A6000 GPU with the same configuration mentioned in the supplementary material, but this error was still raised.
[screenshot of the error]

@huwjenkins

I got DynaMight running on an H100, and with my dataset (384 px box) I got the same errors:

box size: 384 pixel_size: 0.825 virtual pixel_size: 0.0025974025974025974  dimension of latent space:  6
Number of used gaussians: 10000
Optimizing scale only
volume too large: change size of output volumes. (If you want the original box size for the output volumes use a bigger gpu. The size of tensor a (384) must match the size of tensor b (192) at non-singleton dimension 2

and

/xxx/miniforge/envs/relion-5.0/lib/python3.10/site-packages/dynamight/models/decoder.py:235: UserWarning: Using a target size (torch.Size([192, 192, 192])) that is different to the input size (torch.Size([384, 384, 384])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  loss = torch.nn.functional.mse_loss(

As I don't have access to a bigger GPU, I made the following change:

--- decoder.py.orig	2024-11-06 09:02:03.000000000 +0000
+++ decoder.py	2024-11-06 09:02:26.000000000 +0000
@@ -224,7 +224,7 @@
         print('Optimizing scale only')
         optimizer = torch.optim.Adam(
             [self.image_smoother.A], lr=100*lr)
-        if reference_volume.shape[-1] > 360:
+        if reference_volume.shape[-1] > 384:
             reference_volume = torch.nn.functional.avg_pool3d(
                 reference_volume.unsqueeze(0).unsqueeze(0), 2)
             reference_volume = reference_volume.squeeze()

and the errors went away. I think my earlier apology was premature.

@huwjenkins

The job with the modified dynamight/models/decoder.py is still running and is currently using ~21 GB of the 80 GB on the H100 GPU.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               Off | 00000000:21:00.0 Off |                    0 |
| N/A   74C    P0             221W / 310W |  21301MiB / 81559MiB |     79%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

@huwjenkins

So I believe the underlying bug is the failure to update self.vol_box around here:

--- dynamight/models/decoder.py.orig	2024-11-06 09:02:03.000000000 +0000
+++ dynamight/models/decoder.py	2024-11-06 16:35:26.000000000 +0000
@@ -228,6 +228,7 @@
             reference_volume = torch.nn.functional.avg_pool3d(
                 reference_volume.unsqueeze(0).unsqueeze(0), 2)
             reference_volume = reference_volume.squeeze()
+            self.vol_box //= 2

         for i in range(n_epochs):
             optimizer.zero_grad()

which is then used in generate_consensus_volume() here:

def generate_consensus_volume(self):
    scaling_fac = self.box_size/self.vol_box
    self.batch_size = 2
    p2v = PointsToVolumes(self.vol_box, self.n_classes,
                          self.grid_oversampling_factor)
    amplitudes = torch.stack(
        2 * [self.amp*torch.nn.functional.softmax(self.ampvar, dim=0)], dim=0
    )

However, I don't think this is the optimal way to deal with large boxes. If DynaMight has a cliff-edge limit of 360 px, then this should be documented and users advised to crop/downscale their particles appropriately. I could easily trim 12 px from the edges of my particle boxes, and other users with > 360 px boxes might also prefer to downsample to this size rather than accept the automatic 2x downsampling.
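As a hypothetical illustration of the crop-to-360-px alternative (the file names and the use of the mrcfile package are assumptions, not part of DynaMight or RELION), trimming 12 px from each edge of a 384 px stack brings the particles under the 360 px check; the corresponding STAR file metadata would also need updating to the new box size.

import mrcfile
import numpy as np

# Hypothetical example: crop a 384 px particle stack to 360 px by trimming
# 12 px from each edge. File names are placeholders.
with mrcfile.open('particles_384.mrcs') as mrc:
    stack = np.asarray(mrc.data)              # (n_particles, 384, 384)

pad = (stack.shape[-1] - 360) // 2            # 12 px per side
cropped = stack[:, pad:pad + 360, pad:pad + 360]

with mrcfile.new('particles_360.mrcs', overwrite=True) as mrc:
    mrc.set_data(cropped.astype(np.float32))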
