Fix for memory leak in Python target #1873
Conversation
I don't see any problems with this and it matches the description of this PR. Just making sure I understand -- the added `Py_XDECREF` on line 191 is correct because `serialized_pyobject` is connected to `pointerExpression`, which was already used previously by `PyBytes_AsStringAndSize(serialized_pyobject, &" + serializedVarName + ".buf, &" + serializedVarName + ".len);\n");`? This would be clearer if the reference count were decremented in `generateNetworkSerializerCode`, where `serialized_pyobject` is used. Would that work?
Yes, that is correct. Moving the decrement to `generateNetworkSerializerCode` would work.
I may do some rebases in this and in the companion PR to get this up to date with master. This will mess up your version history if you have a divergent version. I think the only reason why the tests failed was that the submodule was not updated.
Force-pushed from e026483 to bf357d4.
Here is a minimal reproduction for a test failure:

```
target Python {
    timeout: 1 sec,
    fast: true
}
reactor SendNumber {
    output out
    reaction(startup) -> out {=
        sm = 99
        out.set(sm/4.0)
    =}
}
reactor Print {
    input in_
    reaction(in_) {=
        print(in_.value)
    =}
}
main reactor {
    m = new SendNumber()
    p = new Print()
    m.out -> p.in_
}
```

Without the local variable …
Could this be because, IIRC, there are some generated reactions in the federated Python target whose bodies are actually C? Maybe we need to decrement reference counts at the ends of those generated reactions to emulate the behavior that I think CPython must have when local variables go out of scope. This is supported by the fact that if I add back the line …
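The CPython behavior being appealed to here -- a reaction-local variable is freed as soon as the frame exits, provided nothing else holds a reference -- can be demonstrated with a `weakref` finalizer. This is an illustrative sketch; `Payload` and `reaction_body` are made-up names standing in for a reaction body:

```python
import weakref

class Payload:
    """Stand-in for a value allocated inside a reaction body."""

freed = []

def reaction_body():
    local = Payload()                            # reaction-local variable
    weakref.finalize(local, freed.append, True)  # fires when the object is destroyed
    # no further references exist: CPython drops the refcount to zero on return

reaction_body()
assert freed == [True]  # the local was freed as soon as the frame exited
```

A C-bodied generated reaction gets no such automatic decrement, which is why an explicit `Py_XDECREF` at the end would be needed to emulate it.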
Hi Peter, your understanding and analysis are correct. Yes, using … I've conducted tests on both single-port and multiport in federated execution with the latest commit, and all of them pass. The memory leak issue is resolved for federated execution.
I saw that commit but am still not sure that I understand it. Decreasing the reference count as much as necessary to ensure that the object is freed looks like it could be hiding an off-by-one error in the reference count that is introduced elsewhere, and it is not clear to me that it is compatible with the behavior of the Python runtime, which could be saving references that our runtime is not accounting for. Here is an example of a program that segfaults on my machine (with the latest version of …). The output is the following: …
Please let me know if you are having trouble reproducing this or a similar error. The problem is that the Python runtime knows that there is a reference to the state variable, but our runtime doesn't know that. Therefore, when our runtime executes the destructor and decreases the reference count all the way to zero regardless of what the count was before, it is messing with (the counting of) references that it does not manage.
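One way to see why force-dropping the count to zero is unsafe is to observe, from Python, how many references the interpreter itself is tracking for a state variable. A sketch with illustrative names (`Reactor`, `borrowed` are not from the PR):

```python
import sys

class Reactor:
    def __init__(self):
        self.state = bytearray(1024)  # state variable owned by the Python runtime

r = Reactor()
borrowed = r.state  # a second reference, e.g. one saved by user code

# CPython sees at least three references here: the instance dict's entry,
# `borrowed`, and the temporary argument passed to getrefcount. An external
# destructor that forces the count to zero would free the object out from
# under `borrowed`, exactly the segfault scenario described above.
assert sys.getrefcount(r.state) >= 3
```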
If you share any steps required to reproduce the leak, then I would be happy to attempt to help. I'll probably have some time to tinker with it in the next few days.
I have successfully reproduced the error you mentioned. I agree that we should not decrease the reference count all the way to zero regardless of what the count was before. The segmentation fault occurring in the provided example is now fixed by this commit (lf-lang/reactor-c@825fe82). In the commit above, `python_count_decrement` is still used as the destructor for the output port, and the value assigned to the port only increments its reference count by one. For reproducing the leak, the following example could be used: …
A temporary fix has been implemented in this commit (lf-lang/reactor-c@ec6e4f8). Please let me know if you have trouble reproducing the leak. Thank you for your offer to help :)
Thanks Jacky! I'm assuming that we are interested in the top entry that is marked …

An interesting feature of the MRE that you shared, which differs from the screenshot at the top of this post, is that the large object does not appear to be what is being leaked. In your screenshot, the number of objects goes from 20000226 to 24000256 (a large increase), whereas when I run your MRE locally with reactor-c/825fe82 checked out, the number of objects initially jumps to about a million (as expected, since we send an array of length about a million), and then it jumps to 2 million (as I would expect if we are leaking all the million-long arrays). But after that, the number only creeps up slowly, from 2000016 to 2000027 to 2000034, and so on, with the total size stuck at around 61.6 MiB (i.e., not increasing substantially). So it seems that we are not leaking the large objects in this MRE, but rather a relatively small number of small objects.

Furthermore, even using reactor-c/ec6e4f8, I still see very similar behavior: an initial jump to about 2 million objects and 61.6 MiB, and then a slow leak of only a small number of objects in each report. I can't really discern a difference, even though the commit messages say 825fe82 leaks whereas ec6e4f8 doesn't.
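The kind of before/after allocation report being compared here can be produced with the standard-library `tracemalloc` module. This is a sketch of the measurement technique with a simulated leak, not the actual MRE:

```python
import tracemalloc

tracemalloc.start()
retained = []  # simulates a leak: references that are never released

def iteration():
    data = [0] * 1_000_000  # the "million-long array" sent each round
    retained.append(data)   # leaked copy; remove this line and the growth disappears

before = tracemalloc.take_snapshot()
iteration()
after = tracemalloc.take_snapshot()

# Net growth across snapshots: a steadily increasing total indicates a leak.
growth = sum(stat.size_diff for stat in after.compare_to(before, "lineno"))
assert growth > 1_000_000  # the list's pointer buffer (several MB) was not freed
```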
Thanks for testing the MRE. It is indeed a bit confusing why the total size appears to remain at around 61.6 MB. Based on my observations, the reference count of … You are right to point out that small objects seem to be continuously leaking. This could potentially be due to a few lingering temporary … As for the commit …
Just making sure this is clear -- I'm still not sure that the MRE actually is an MRE, in the sense that I do not see how it reproduces the leak that is blocking the federated learning project. If the reference count on the big list that is allocated in the reaction body were stuck at 1, then we would be accumulating lots of copies of that big list, which isn't happening. So I still am not sure what I would need to do in order to make sure I have actually reproduced the same bug that is blocking the federated learning project. |
OK, I'll update the submodule and merge in master so that the tests run.
Hm? Unfederated execution does support multiports. Here are the passing test cases that use multiports without federated execution.
This is expected to result in no segfaults but a leak.
Due to failing enclave tests.
To clarify, what I intended to say is that unfederated execution only supports the single-threaded runtime.
It appears that the recent commit reactor-c/825fe82 no longer has the memory-leak issue for the large list. The bug I mentioned seems to have been caused by the while loop in the previous commit, which decreases the reference count all the way to zero regardless of what the count was before. Thank you for your help; now we should prioritize ensuring that all tests pass.
Do not decrement the reference count twice.

In my most recent meeting with @jackykwok2024, I believe we determined that although there are multiple memory leaks, the most severe one is fixed in the commit referenced here, and that in any case the other commit in reactor-c is no less leaky than the one referenced here. Besides, I have found that in at least one test program a segfault is fixed by using the commit referenced here (d28a9f5).
This seems to work on DelayArrayWithAfter.
This pull request addresses a memory leak issue observed in the current implementation. As shown in the following screenshot, it is noticeable that with each iteration, the size of Python objects constructed using C extensions continues to grow.
This problem could be attributed to several key issues, which are as follows:
1. **PyObject reference count:** the reference counts for the majority of `PyObject`s are not being appropriately managed.
2. **Port assignment:** rather than using the `_LF_SET` macro to assign values to the `generic_port_instance_struct`, as the current implementation does, the more suitable method is to use `self_set_token` and define a destructor for the token to ensure proper garbage collection.
3. **Function definition:** the existing `_lf_free_token_value` function is incorrect and fails to invoke the destructor.

Notes:

- `Py_XINCREF(message_byte_array)` has been removed, because `PyBytes_FromStringAndSize` already performs the necessary reference-count increment.
- `_LF_GARBAGE_COLLECTED` has been renamed to `_PYTHON_TARGET_ENABLED`, which gives the macro a more indicative name.
- Other related changes have also been made in the reactor-c repository (LINK).
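The note about `PyBytes_FromStringAndSize` reflects that it returns a new reference to a bytes object holding its own copy of the source buffer, so no additional `Py_XINCREF` is needed. The same copy semantics are visible from Python; this is an illustrative sketch, not code from the PR:

```python
buf = bytearray(b"hello")
msg = bytes(buf)        # like PyBytes_FromStringAndSize: copies the buffer into a new object
buf[0:1] = b"H"         # mutating the source buffer afterwards...

assert msg == b"hello"  # ...does not affect the bytes object, which owns its own copy
```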