Potential Memory Leak #27
Comments
Thanks for uncovering this! To start with, here are answers to your questions:
In terms of fixing this, it sounds like the current method of setting state via assignment operations augments the graph. If you could return the new keys that are added during the
Thanks for the tips. I looked into it some more, and it seems that the increment of 20 nodes after every call of
It seems that the assign operator does not have the desired behavior: it generates a new set of 10 operations (which then result in 20 nodes) at each world-model training run (the next world-model training will generate Assign10:0, and so on). It is not entirely clear what the best solution is, since, from my understanding, manually deleting nodes from a graph is not good practice (and can be unstable).
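To illustrate the pattern being described, here is a minimal TF 1.x sketch; the variable and function names (`state_vars`, `set_state_leaky`) are hypothetical and not taken from this repository:

```python
import tensorflow as tf

# Hypothetical state variables, created once at graph-construction time.
state_vars = [tf.Variable(tf.zeros([10]), name="state_%d" % i) for i in range(10)]

def set_state_leaky(sess, new_values):
    # Every call adds 10 fresh `Assign` ops plus 10 `Const` ops (holding
    # `new_values`) to the default graph, roughly 20 new nodes per call,
    # and TensorFlow never garbage-collects nodes from a live graph.
    assign_ops = [tf.assign(var, val) for var, val in zip(state_vars, new_values)]
    sess.run(assign_ops)
```

Counting `len(sess.graph._nodes_by_name)` before and after two consecutive calls makes the growth visible.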
Here is the commit that should fix this. It's pretty minimal; I just replaced the [...]. I verified that this correctly loads [...].
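For anyone who lands here with the same problem, the usual remedy is to build the assign ops (and placeholders to feed them) once, next to the variables, and then only run them afterwards. The sketch below reuses the same hypothetical names as above and is not the literal contents of the commit:

```python
import tensorflow as tf

# Same hypothetical state variables as in the earlier sketch.
state_vars = [tf.Variable(tf.zeros([10]), name="state_%d" % i) for i in range(10)]

# Build placeholders and assign ops exactly once, at graph-construction time.
placeholders = [tf.placeholder(var.dtype.base_dtype, var.shape) for var in state_vars]
assign_ops = [tf.assign(var, ph) for var, ph in zip(state_vars, placeholders)]

def set_state_fixed(sess, new_values):
    # Only runs pre-built ops; the graph node count stays constant across calls.
    sess.run(assign_ops, feed_dict=dict(zip(placeholders, new_values)))
```

As a sanity check, calling `sess.graph.finalize()` after construction makes any later attempt to add nodes raise an error immediately.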
First off, great work and thank you for making this publicly available.
I have been experimenting with your code, as I want to adapt it to my own custom environment. For this reason, I adopted a slightly larger NN structure for world modelling and noticed that part of the code seems to be leaking memory. For instance, the `save_state` method constantly adds nodes to the TensorFlow graph, and these never get removed by the garbage collector. The same thing (to a lesser degree) occurs here. To test this claim, I simply counted the number of nodes and the memory usage before and after calling `self._set_state()`.
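A rough sketch of that check is below (the `agent` object standing in for the class that owns `sess` and `_set_state` is an assumption; `ru_maxrss` is reported in KB on Linux):

```python
import resource

def count_graph_nodes(sess):
    # Number of ops currently registered in the TensorFlow graph.
    return len(sess.graph._nodes_by_name.keys())

def max_rss_kb():
    # Peak resident set size of this process (KB on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def check_growth(agent):
    # `agent` stands in for the object that owns `sess` and `_set_state`.
    nodes_before, rss_before = count_graph_nodes(agent.sess), max_rss_kb()
    agent._set_state()
    nodes_after, rss_after = count_graph_nodes(agent.sess), max_rss_kb()
    print("nodes: %d -> %d, max RSS: %d KB -> %d KB"
          % (nodes_before, nodes_after, rss_before, rss_after))
```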
Specifically, I used
`len(self.sess.graph._nodes_by_name.keys())`
to count the number of nodes within the TF graph, and
`resource.getrusage(resource.RUSAGE_SELF).ru_maxrss`
to compute the RAM usage (in KB). A typical instance of what I obtain is seen below, where each call to this method increases RAM consumption by around 100 MB. When training for many epochs, this leads to OOM errors, as one would expect.

In the end, my question is twofold: