
Potential Memory Leak #27

Open
ddnimara opened this issue Mar 18, 2021 · 4 comments

@ddnimara

First off, great work and thank you for making this publicly available.

I have been experimenting with your code, as I want to adapt it to my own custom environment. For this reason, I adopted a slightly larger NN structure for world modelling and noticed that part of the code seems to leak memory. For instance, the save_state method constantly adds nodes to the TensorFlow graph which never get removed by the garbage collector. The same thing (to a lesser degree) occurs here. To test this claim, I simply counted the number of nodes and the memory usage before and after calling self._set_state().

I used len(self.sess.graph._nodes_by_name.keys()) to count the number of nodes in the tf graph and resource.getrusage(resource.RUSAGE_SELF).ru_maxrss to measure RAM usage (in KB). A typical instance of what I obtain is shown below: each call to this method increases RAM consumption by around 100 MB. When training for many epochs, this leads to OOM errors, as one would expect.
[screenshot: node count and max RSS logged before and after each self._set_state() call, with RAM growing by ~100 MB per call]
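
A minimal sketch of that measurement (the helper name is just illustrative; `sess` stands for the model's `tf.Session`):

```python
import resource

def graph_and_memory_stats(sess):
    """Return (node count of the tf graph, peak RSS of this process in KB on Linux)."""
    num_nodes = len(sess.graph._nodes_by_name.keys())
    max_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return num_nodes, max_rss_kb
```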

In the end, my question is twofold:

  1. Did you experience this sort of memory leak during your experiments? Note that, even though I am testing in a custom environment, that should not matter for this part of the code, which I have left untouched.
  2. What exactly is the purpose of the self.state variable? Inspecting the code, it does not seem to interact with anything other than keeping track of model states. To me, it just looks like copies of previous models, which unfortunately never get removed from RAM. Am I missing something?
@jannerm (Owner) commented Mar 22, 2021

Thanks for uncovering this! To start with, answers to your questions:

  1. I didn't experience any out-of-memory errors in my runs, though if you are right, it sounds like I might if I ran the code for longer. The longest training runs in the paper were for HalfCheetah, which had 400k environment steps.
  2. The self._state variable is used to choose a model snapshot based on the validation loss. Because the ensemble contains multiple models, it needs to keep track of the best validation loss (and the weights that achieved it) for each model independently (sketched below).
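
For intuition, the bookkeeping looks roughly like this (an illustrative sketch only; `num_models`, `val_losses`, and `get_weights` are placeholder names, not the repo's identifiers):

```python
num_models = 7  # placeholder ensemble size

# One record per ensemble member: the best validation loss seen so far and a
# copy of the weights that achieved it. This is roughly the role self._state plays.
best = [{"loss": float("inf"), "weights": None} for _ in range(num_models)]

def update_snapshots(val_losses, get_weights):
    """val_losses[i] is model i's current validation loss;
    get_weights(i) returns a copy of model i's parameters."""
    for i, loss in enumerate(val_losses):
        if loss < best[i]["loss"]:
            best[i]["loss"] = loss
            best[i]["weights"] = get_weights(i)
```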

In terms of fixing this, it sounds like the current method of setting state via assignment operations augments the graph. If you could return the new keys that are added during the set_state method, that might help to figure out what is going on.
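
Something along these lines would show exactly which nodes appear (a rough sketch; `sess` and `load_state` stand in for the session and the state-setting call in question):

```python
# Record the graph's node names before and after loading the state,
# then report whatever was added.
before = set(sess.graph._nodes_by_name.keys())
load_state()
new_nodes = sorted(set(sess.graph._nodes_by_name.keys()) - before)
print(f"{len(new_nodes)} new graph nodes:")
for name in new_nodes:
    print("  " + name)
```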

@ddnimara (Author)

Thanks for the tips.

I looked into it some more, and it seems that the increase of 20 nodes after every call to set_state comes down to the following:

  1. The world model has 5 layers, each containing weights and biases, thus generating 10 assign operations, as seen below (the tensor dimensions look correct, since I have an observation space of 808, an action space of 13, and a reward dimension of 1).

[screenshot: the 10 generated Assign operations for the 5 layers' weights and biases]

  2. Running sess.run on these operations generates 20 nodes: 2 for each operation, as seen below.

[screenshot: the 20 new graph nodes created by running the assign operations]

It seems that the assign operator does not have the desired behavior: it generates a new set of 10 operations (which then result in 20 nodes) at each world model training run (the next training run will generate Assign10:0 and so on). It is not entirely clear what the best solution is, since, from my understanding, manually deleting nodes from a graph is not good practice (and can be unstable).
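
A standalone sketch of the behavior described above (TF 1.x graph mode assumed; this is a toy reproduction, not the repo's code):

```python
import tensorflow as tf  # TF 1.x graph mode

v = tf.Variable(tf.zeros([3]))
sess = tf.Session()
sess.run(tf.global_variables_initializer())

for i in range(3):
    before = len(sess.graph._nodes_by_name)
    # Each call to tf.assign builds a fresh Assign op (plus the node holding
    # the new value), so the graph keeps growing.
    sess.run(tf.assign(v, [float(i)] * 3))
    after = len(sess.graph._nodes_by_name)
    print(f"iteration {i}: graph grew by {after - before} nodes")
```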

@jannerm (Owner) commented Mar 25, 2021

Here is the commit that should fix this. It's pretty minimal; I just replaced the tf.assign calls with variable.load to avoid augmenting the graph.
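
For anyone following along, the difference is visible in a tiny standalone check (TF 1.x assumed; this is just an illustration of variable.load, not the commit itself):

```python
import numpy as np
import tensorflow as tf  # TF 1.x graph mode

v = tf.Variable(tf.zeros([3]))
sess = tf.Session()
sess.run(tf.global_variables_initializer())

before = len(sess.graph._nodes_by_name)
# variable.load feeds the new value through the variable's existing
# initializer op, so no new nodes are added to the graph.
v.load(np.ones(3, dtype=np.float32), session=sess)
after = len(sess.graph._nodes_by_name)
print(f"graph grew by {after - before} nodes")  # expect 0
```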

I verified that this correctly loads self._state into the tf variables, and it looks like GPU memory consumption is now constant for the first ~20 epochs. I'm going to hold off on pushing this to master until I test it over a full training run, but in the meantime you can use the fix by checking out the memfix branch. If you use it and have anything to report, let me know.

@ddnimara (Author)

It appears that this small commit fixes the issue (at least on my end). Thanks!

[screenshot: memory usage remaining constant across epochs after the fix]
