
Potential Memory Leak #27

Open
ddnimara opened this issue Mar 18, 2021 · 4 comments

@ddnimara

First off, great work and thank you for making this publicly available.

I have been experimenting with your code, as I want to adapt it to my own custom environment. For this reason, I adopted a slightly larger NN structure for world modelling and noticed that part of the code seems to leak memory. For instance, the save_state method constantly adds nodes to the TensorFlow graph which never get removed by the garbage collector. The same thing (to a lesser degree) occurs here. To test this claim, I simply counted the number of nodes and the memory usage before and after calling self._set_state().

I used len(self.sess.graph._nodes_by_name.keys()) to count the number of nodes in the tf graph and resource.getrusage(resource.RUSAGE_SELF).ru_maxrss to measure RAM usage (in KB). A typical instance of what I obtain is shown below: each call to this method increases RAM consumption by around 100 MB. When training for many epochs, this leads to OOM errors, as one would expect.
[screenshot: node count and max RSS logged before and after each self._set_state() call, with RAM growing by ~100 MB per call]
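
A minimal sketch of that measurement (the helper name is just illustrative; `sess` stands for the model's `tf.Session`):

```python
import resource

def graph_and_memory_stats(sess):
    """Return (node count of the tf graph, peak RSS of this process in KB on Linux)."""
    num_nodes = len(sess.graph._nodes_by_name.keys())
    max_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return num_nodes, max_rss_kb
```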

In the end, my question is twofold:

  1. Did you experience this sort of memory leak during your experiments? Note that, even though I am testing in a custom environment, that should not matter for this part of the code, which I have left untouched.
  2. What exactly is the purpose of the self.state variable? Inspecting the code, it does not seem to interact with anything other than keeping track of model states. To me, it just looks like copies of previous models, which unfortunately never get removed from RAM. Am I missing something?
@jannerm (Owner) commented Mar 22, 2021

Thanks for uncovering this! To start with, answers to your questions:

  1. I didn't experience any out-of-memory errors in my runs, though if you are right, it sounds like I might if I ran the code for longer. The longest training runs in the paper were for HalfCheetah, which had 400k environment steps.
  2. The self._state variable is used to choose a model snapshot based on the validation loss. Because the ensemble contains multiple models, it needs to keep track of the best validation loss (and the weights that achieved it) for each model independently (sketched below).
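
For intuition, the bookkeeping looks roughly like this (an illustrative sketch only; `num_models`, `val_losses`, and `get_weights` are placeholder names, not the repo's identifiers):

```python
num_models = 7  # placeholder ensemble size

# One record per ensemble member: the best validation loss seen so far and a
# copy of the weights that achieved it. This is roughly the role self._state plays.
best = [{"loss": float("inf"), "weights": None} for _ in range(num_models)]

def update_snapshots(val_losses, get_weights):
    """val_losses[i] is model i's current validation loss;
    get_weights(i) returns a copy of model i's parameters."""
    for i, loss in enumerate(val_losses):
        if loss < best[i]["loss"]:
            best[i]["loss"] = loss
            best[i]["weights"] = get_weights(i)
```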

In terms of fixing this, it sounds like the current method of setting state via assignment operations augments the graph. If you could return the new keys that are added during the set_state method, that might help to figure out what is going on.
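
Something along these lines would show exactly which nodes appear (a rough sketch; `sess` and `load_state` stand in for the session and the state-setting call in question):

```python
# Record the graph's node names before and after loading the state,
# then report whatever was added.
before = set(sess.graph._nodes_by_name.keys())
load_state()
new_nodes = sorted(set(sess.graph._nodes_by_name.keys()) - before)
print(f"{len(new_nodes)} new graph nodes:")
for name in new_nodes:
    print("  " + name)
```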

@ddnimara (Author)

Thanks for the tips.

I looked into it some more, and it seems that the increase of 20 nodes after every call to set_state comes down to the following:

  1. The world model has 5 layers, each containing weights and biases, thus generating 10 assign operations, as seen below (the tensor dimensions look correct, since I have an observation space of 808, an action space of 13, and a reward dimension of 1).

[screenshot: the 10 generated Assign operations for the 5 layers' weights and biases]

  2. Running sess.run on these operations generates 20 nodes: 2 for each operation, as seen below.

[screenshot: the 20 new graph nodes created by running the assign operations]

It seems that the assign operator does not have the desired behavior: it generates a new set of 10 operations (which then result in 20 nodes) at each world model training run (the next training run will generate Assign10:0 and so on). It is not entirely clear what the best solution is, since, from my understanding, manually deleting nodes from a graph is not good practice (and can be unstable).
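
A standalone sketch of the behavior described above (TF 1.x graph mode assumed; this is a toy reproduction, not the repo's code):

```python
import tensorflow as tf  # TF 1.x graph mode

v = tf.Variable(tf.zeros([3]))
sess = tf.Session()
sess.run(tf.global_variables_initializer())

for i in range(3):
    before = len(sess.graph._nodes_by_name)
    # Each call to tf.assign builds a fresh Assign op (plus the node holding
    # the new value), so the graph keeps growing.
    sess.run(tf.assign(v, [float(i)] * 3))
    after = len(sess.graph._nodes_by_name)
    print(f"iteration {i}: graph grew by {after - before} nodes")
```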

@jannerm (Owner) commented Mar 25, 2021

Here is the commit that should fix this. It's pretty minimal; I just replaced the tf.assign calls with variable.load to avoid augmenting the graph.
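
For anyone following along, the difference is visible in a tiny standalone check (TF 1.x assumed; this is just an illustration of variable.load, not the commit itself):

```python
import numpy as np
import tensorflow as tf  # TF 1.x graph mode

v = tf.Variable(tf.zeros([3]))
sess = tf.Session()
sess.run(tf.global_variables_initializer())

before = len(sess.graph._nodes_by_name)
# variable.load feeds the new value through the variable's existing
# initializer op, so no new nodes are added to the graph.
v.load(np.ones(3, dtype=np.float32), session=sess)
after = len(sess.graph._nodes_by_name)
print(f"graph grew by {after - before} nodes")  # expect 0
```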

I verified that this correctly loads self._state into the tf variables, and it looks like GPU memory consumption is now constant for the first ~20 epochs. I'm going to hold off on pushing this to master until I test it over a full training run, but in the meantime you can use the fix by checking out the memfix branch. If you use it and have anything to report, let me know.

@ddnimara (Author)

It appears that this small commit fixes the issue (at least on my end). Thanks!

[screenshot: memory usage remaining constant across epochs after the fix]
