
memory exhaustion when scaling up training data and MCMC sample sizes #39

Open
grahamgower opened this issue Oct 1, 2021 · 0 comments


The following command runs for a couple of iterations and then fails with an out-of-memory error (with 768 GB available).

(genomcmcgan) [srx907@gpu01-snm-willerslev genomcmcgan]$ /usr/bin/time python genomcmcgan.py -k randomwalk -p 72 --seed 593141 -e 1 -n 1000 -b 1000 -t 8 -N 50000 -r 100 -i 5 genob-n50000-f64.pkl
genob.num_reps = 50000
generating 50000 genotype matrices with fixed params using msprime
generating 50000 genotype matrices with randomised params using msprime
X data shape is: (100000, 1, 198, 64)
Using 2 GPUs
Initializing weights of the model
Demographic model for inference - bottleneck
r inferable: False
mu inferable: False
seqerr inferable: False
N0 inferable: True
T1 inferable: False
N1 inferable: True
T2 inferable: False
N2 inferable: True
Starting the MCMC sampling chain for iteration 1
Training discriminator
[1 | 2800] TRAINING: loss: 0.128 | acc: 0.957
        VALIDATION: loss: 0.100 - acc: 0.968
genob.num_reps = 100
Selected mcmc kernel is randomwalk
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████▉| 9991/10000 [47:16<00:02,  3.52it/s]
sampling finished
Acceptance probability is: 0.324
/home/srx907/miniconda3/envs/genomcmcgan/lib/python3.9/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/home/srx907/miniconda3/envs/genomcmcgan/lib/python3.9/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/home/srx907/miniconda3/envs/genomcmcgan/lib/python3.9/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
N0 samples with median 8787.98828125 and std 9131.1787109375
N1 samples with median 7062.2998046875 and std 9357.9697265625
N2 samples with median 10235.05859375 and std 2396.896728515625
genob.num_reps = 50000
generating 50000 genotype matrices with fixed params using msprime
generating 50000 genotype matrices with randomised params using msprime
X data shape is: (100000, 1, 198, 64)
A single iteration of the MCMC-GAN took 3041.8686463832855 seconds
In total, it has been running for 3041.868728876114 seconds
Starting the MCMC sampling chain for iteration 2
Training discriminator
[1 | 2800] TRAINING: loss: 0.333 | acc: 0.859
        VALIDATION: loss: 0.309 - acc: 0.869
genob.num_reps = 100
Selected mcmc kernel is randomwalk
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████▉| 9991/10000 [43:51<00:02,  3.80it/s]
sampling finished
Acceptance probability is: 0.28
/home/srx907/miniconda3/envs/genomcmcgan/lib/python3.9/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/home/srx907/miniconda3/envs/genomcmcgan/lib/python3.9/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/home/srx907/miniconda3/envs/genomcmcgan/lib/python3.9/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
N0 samples with median 11369.4423828125 and std 9544.0146484375
N1 samples with median 2730.65185546875 and std 8635.1884765625
N2 samples with median 9256.42578125 and std 2385.8544921875
genob.num_reps = 50000
generating 50000 genotype matrices with fixed params using msprime
generating 50000 genotype matrices with randomised params using msprime
Traceback (most recent call last):
  File "/home/srx907/genomcmcgan/genomcmcgan.py", line 270, in <module>
    run_genomcmcgan(
  File "/home/srx907/genomcmcgan/genomcmcgan.py", line 138, in run_genomcmcgan
    xtrain, xval, ytrain, yval = mcmcgan.genob.generate_data(
  File "/home/srx907/genomcmcgan/genobuilder.py", line 427, in generate_data
    X = np.concatenate((gen1, gen0))
  File "<__array_function__ internals>", line 5, in concatenate
numpy.core._exceptions.MemoryError: Unable to allocate 9.44 GiB for an array with shape (100000, 1, 198, 64) and data type float64
256994.71user 4522.26system 1:43:28elapsed 4212%CPU (0avgtext+0avgdata 42752884maxresident)k
960096inputs+4896outputs (328major+1085479926minor)pagefaults 0swaps

It looks like the main process peaks at about 40 GB. Watching top while this runs shows that each worker process (running the msprime sims) climbs steadily in memory usage, and that memory is never released.
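For reference, here's a back-of-the-envelope check of the failing allocation (the shape and dtype are taken straight from the MemoryError above; nothing else is assumed):

```python
# Size of the array that np.concatenate tries to allocate in generate_data,
# using the shape and dtype reported in the MemoryError.
import numpy as np

shape = (100_000, 1, 198, 64)
n_elems = np.prod(shape)
print(f"float64: {n_elems * 8 / 2**30:.2f} GiB")  # ~9.44 GiB, as reported
print(f"float32: {n_elems * 4 / 2**30:.2f} GiB")  # ~4.72 GiB
```

Since np.concatenate allocates a fresh output array and copies into it, the concatenation step transiently needs both inputs plus the output (roughly 19 GiB at float64), on top of whatever else the main process is holding.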

(1) For the main process, there is a big jump in memory use when simulating training data for the discriminator. This appears to be caused by queueing a large number of jobs (some discussion here: https://bugs.python.org/issue34168). At least part of the problem is that the entire set of MCMC samples from the last iteration is passed as a parameter to each job that is queued (see the sketch below).
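A minimal sketch of an alternative (hypothetical names, not the actual genobuilder code): hand the sample array to each worker once via a Pool initializer, so that a queued task only carries an index rather than a pickled copy of the whole array.

```python
# Sketch: share a large array with worker processes via a Pool initializer,
# instead of passing it as an argument of every queued job.
import numpy as np
from multiprocessing import Pool

_samples = None  # per-worker global, set once when the worker starts

def _init_worker(samples):
    global _samples
    _samples = samples

def simulate_one(index):
    # Placeholder for one msprime simulation parameterised by a single MCMC draw.
    params = _samples[index]
    return params.sum()  # stand-in for a genotype matrix

if __name__ == "__main__":
    mcmc_samples = np.random.default_rng(0).normal(size=(10_000, 3))
    with Pool(processes=4, initializer=_init_worker, initargs=(mcmc_samples,)) as pool:
        results = pool.map(simulate_one, range(len(mcmc_samples)), chunksize=100)
    print(len(results))
```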

(2) Not sure what is going on with the worker processes.
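If the per-worker growth turns out to be memory that is held onto between tasks (whether by msprime or by something else in the worker), one standard-library mitigation to try would be recycling workers with maxtasksperchild. A sketch of that idea, not a diagnosis:

```python
# Sketch: recycle each worker process after a fixed number of tasks, so that
# any memory it has accumulated is returned to the OS when it exits.
import os
from multiprocessing import Pool

def simulate_one(seed):
    # Placeholder for one simulation; return the pid so recycling is visible.
    return os.getpid()

if __name__ == "__main__":
    with Pool(processes=4, maxtasksperchild=50) as pool:
        pids = pool.map(simulate_one, range(400))
    print(f"{len(set(pids))} distinct worker processes were used")
```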
