Fix for memory leak in Python target #1873

Merged
merged 9 commits into master from memory_leak on Jul 11, 2023

Conversation

jackyk02
Collaborator

This pull request addresses a memory leak observed in the current implementation. As shown in the screenshot below, the total size of the Python objects constructed through C extensions grows with each iteration.

[Screenshot: memory profile showing the number and total size of Python objects growing with each iteration]

This problem could be attributed to several key issues, which are as follows:

  1. PyObject Reference Count: The reference counts for the majority of PyObjects are not being appropriately managed.

  2. Port Assignment: Rather than assigning values to the generic_port_instance_struct with the _LF_SET macro, the more suitable approach is to use lf_set_token and define a destructor for the token to ensure proper garbage collection (see the sketch after this list).

  3. Function Definition: The existing _lf_free_token_value function is incorrect and fails to invoke the destructor.
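
For item 2, such a destructor essentially just releases the reference that the token holds on its Python payload. Below is a minimal sketch, borrowing the python_count_decrement name that appears later in this thread and assuming a generic void(void*) callback signature; the actual reactor-c hook may differ.

#include <Python.h>

/* Hypothetical sketch, not the actual reactor-c code: a destructor for
 * tokens that carry Python payloads, called when the token's C-side
 * reference count drops to zero. */
static void python_count_decrement(void* value) {
    /* Release the single reference the token owns on the PyObject.
     * Py_XDECREF tolerates NULL, so an empty token is safe. */
    Py_XDECREF((PyObject*) value);
}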

Notes:

  • The function call Py_XINCREF(message_byte_array) has been removed: PyBytes_FromStringAndSize already returns a new reference that the caller owns, so the extra increment was never balanced and leaked the bytes object (see the sketch below).
  • The macro identifier _LF_GARBAGE_COLLECTED has been renamed to _PYTHON_TARGET_ENABLED, which gives the macro a more descriptive name.
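
The first note rests on a CPython ownership rule: PyBytes_FromStringAndSize hands back a new reference that the caller already owns, so an extra Py_XINCREF is never matched by a decrement and the bytes object can never be freed. A minimal standalone sketch of that pattern, assuming an embedded interpreter and compilation against the CPython headers:

#include <Python.h>

int main(void) {
    Py_Initialize();

    const char raw[] = "payload";
    /* PyBytes_FromStringAndSize returns a *new* reference that the caller
     * already owns (the refcount starts at 1). */
    PyObject* message_byte_array =
        PyBytes_FromStringAndSize(raw, (Py_ssize_t)(sizeof raw - 1));
    if (message_byte_array == NULL) {
        Py_FinalizeEx();
        return 1;
    }

    /* The removed call: a second increment here would never be matched by a
     * corresponding decrement, so the bytes object would leak. */
    /* Py_XINCREF(message_byte_array); */

    Py_DECREF(message_byte_array);  /* balances the constructor's reference */

    return Py_FinalizeEx() < 0 ? 1 : 0;
}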

Other related changes have also been made in the reactor-c repository (LINK).

@petervdonovan
Collaborator

I don't see any problems with this and it matches the description on this PR. Just making sure I understand -- the added Py_XDECREF on line 191 is correct because serialized_pyobject is connected to pointerExpression, which was already used previously by

PyBytes_AsStringAndSize(serialized_pyobject, &"
            + serializedVarName
            + ".buf, &"
            + serializedVarName
            + ".len);\n");

? This would be clearer if the reference count were decremented in generateNetworkSerializerCode, where serialized_pyobject is used. Would that work?

@jackyk02
Collaborator Author

I don't see any problems with this and it matches the description on this PR. Just making sure I understand -- the added Py_XDECREF on line 191 is correct because serialized_pyobject is connected to pointerExpression, which was already used previously by

PyBytes_AsStringAndSize(serialized_pyobject, &"
            + serializedVarName
            + ".buf, &"
            + serializedVarName
            + ".len);\n");

? This would be clearer if the reference count were decremented in generateNetworkSerializerCode, where serialized_pyobject is used. Would that work?

Yes, that is correct. Moving Py_XDECREF(serialized_pyobject) to generateNetworkSerializerCode doesn't work because serialized_pyobject is still used afterwards by the sendingFunction.
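
For readers following along: the constraint is that PyBytes_AsStringAndSize exposes a pointer into the bytes object's internal storage without copying, so the buffer is only valid while serialized_pyobject stays alive. A rough sketch of the shape of the generated C code; serialize_and_send and send_timed_message_stub are hypothetical names, not the actual generated identifiers.

#include <Python.h>

/* Hypothetical stand-in for the generated sending function; the real
 * federated runtime call has a different name and signature. */
static void send_timed_message_stub(const char* buf, Py_ssize_t len) {
    (void) buf;
    (void) len;  /* would hand the bytes to the network layer here */
}

static void serialize_and_send(PyObject* serialized_pyobject) {
    char* buf = NULL;
    Py_ssize_t len = 0;
    /* buf points *into* serialized_pyobject's storage; no copy is made. */
    if (PyBytes_AsStringAndSize(serialized_pyobject, &buf, &len) == 0) {
        send_timed_message_stub(buf, len);  /* buffer must still be alive here */
    }
    /* Only now is it safe to drop the reference; decrementing earlier, e.g.
     * in the serializer, could free the buffer before the send uses it. */
    Py_XDECREF(serialized_pyobject);
}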

@petervdonovan
Collaborator

I may do some rebases in this and in the companion PR to get this up to date with master. This will mess up your version history if you have a divergent version.

I think the only reason why the tests failed was that the submodule was not updated.

@petervdonovan
Collaborator

Here is a minimal reproduction for a test failure:

target Python {
  timeout: 1 sec,
  fast: true
}

reactor SendNumber {
  output out

  reaction(startup) -> out {=
    sm = 99

    out.set(sm/4.0)
  =}
}

reactor Print {
  input in_
  reaction (in_) {=
    print(in_.value)
  =}
}

main reactor {
  m = new SendNumber()
  p = new Print()
  m.out -> p.in_
}

Without the local variable sm, the test passes; without dividing it by 4 on line 12, the test passes. With both of those features in SendNumber there is consistently a segmentation fault.

@petervdonovan
Collaborator

Could this be because

  1. a number object is created in the SendNumber reaction
  2. when we set the port to it, we access the number object in the C code without increasing its reference count
  3. then we return to the reaction and reach the end of it, at which point the number object goes out of scope and its reference count is decremented
  4. the number object gets freed and when we try to access it later, in the downstream reaction, it has been overwritten

This is supported by the fact that if I add back the line Py_INCREF(val); to python_port.c, there is no segmentation fault. I think that Py_INCREF(val); is actually the right thing to do if I understand this right.

IIRC there are some generated reactions in the federated Python target whose bodies are actually C? Maybe we need to decrement reference counts at the ends of those generated reactions to emulate the behavior that I think CPython must have when local variables go out of scope.

@jackyk02
Collaborator Author

jackyk02 commented Jul 1, 2023

Could this be because

  1. a number object is created in the SendNumber reaction
  2. when we set the port to it, we access the number object in the C code without increasing its reference count
  3. then we return to the reaction and reach the end of it, at which point the number object goes out of scope and its reference count is decremented
  4. the number object gets freed and when we try to access it later, in the downstream reaction, it has been overwritten

This is supported by the fact that if I add back the line Py_INCREF(val); to python_port.c, there is no segmentation fault. I think that Py_INCREF(val); is actually the right thing to do if I understand this right.

IIRC there are some generated reactions in the federated Python target whose bodies are actually C? Maybe we need to decrement reference counts at the ends of those generated reactions to emulate the behavior that I think CPython must have when local variables go out of scope.

Hi Peter, your understanding and analysis are correct. Yes, using Py_INCREF(val); is indeed the right thing to do. I've created a new commit (LINK) where I added back Py_INCREF(val). Additionally, I've defined a different destructor for the output port considering the reference count for the output value has now become 3.

Also, I believe python_count_decrement(port->value) is not necessary anymore, given that the reference count has already been decreased in the destructor. Currently, it works because Py_XDECREF allows the input object to be NULL.

I've conducted tests on both singleport and multiport in federated execution with the latest commit, and all the tests have successfully passed. The memory leak issue is resolved for federated execution.
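
For readers following along, the ownership convention being converged on here can be sketched with a hypothetical, simplified port (not the actual generic_port_instance_struct layout): setting the port takes one reference, so the payload survives the end of the reaction body even after the Python local goes out of scope, and the destructor later releases exactly that one reference.

#include <Python.h>

/* Hypothetical simplified port holding a Python payload. */
typedef struct {
    PyObject* value;
} py_port_stub_t;

/* Setting the port takes its own reference (the Py_INCREF(val) discussed
 * above), so the payload outlives the reaction-local variable. */
static void port_set(py_port_stub_t* port, PyObject* val) {
    Py_INCREF(val);            /* the port now owns one reference */
    Py_XDECREF(port->value);   /* drop any previously held payload */
    port->value = val;
}

/* Destructor invoked when the payload is no longer needed: it releases
 * exactly the one reference the port took, no more. */
static void port_clear(py_port_stub_t* port) {
    Py_XDECREF(port->value);
    port->value = NULL;
}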

@petervdonovan
Collaborator

petervdonovan commented Jul 1, 2023

I've created a new commit (LINK) where I added back Py_INCREF(val). Additionally, I've defined a different destructor for the output port considering the reference count for the output value has now become 3.

I saw that commit but am still not sure that I understand it. Decreasing the reference count as much as necessary to ensure that the object is freed looks like it could be hiding an off-by-one error in the reference count introduced elsewhere. It is also not clear to me that it is compatible with the behavior of the Python runtime, which could be holding references that our runtime is not accounting for.

Here is an example of a program that segfaults on my machine (with the latest version of memory_leak_fix checked out in reactor-c, and in the memory_leak branch of lingua-franca):

target Python {
  timeout: 1 sec,
  fast: true
}

reactor SendNumber {
  state sm = {= [] =}
  output out

  timer t(0, 100 msec)

  reaction(t) -> out {=

    out.set(self.sm)
  =}
}

reactor Print {
  input in_
  reaction (in_) {=
    print(in_.value)
  =}
}

main reactor {
  m = new SendNumber()
  p = new Print()
  m.out -> p.in_
}

The output is the following:

[]
[]
[[...]]
Segmentation fault (core dumped)

Please let me know if you are having trouble reproducing this or a similar error.

The problem is that the Python runtime knows that there is a reference to the state variable, but our runtime doesn't know that. Therefore, when our runtime executes the destructor and decreases the reference count all the way to zero, regardless of what the count was before, it is releasing references that it does not own.
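
In CPython terms, a destructor may only release references that it actually owns. A hedged sketch of the two patterns being contrasted here, with hypothetical function names rather than the literal reactor-c code:

#include <Python.h>

/* Problematic pattern described above: drive the count to zero no matter
 * how many owners exist. The final Py_DECREF deallocates the object even
 * though another owner (e.g. the reactor's state variable self.sm) still
 * points to it, which later surfaces as the segmentation fault. */
static void destroy_all_references_BAD(PyObject* value) {
    if (value == NULL) return;
    Py_ssize_t count = Py_REFCNT(value);
    while (count-- > 0) {
        Py_DECREF(value);  /* releases references this code never acquired */
    }
}

/* Correct pattern: release only the single reference that the runtime
 * itself took when the value was assigned to the port. */
static void destroy_own_reference(PyObject* value) {
    Py_XDECREF(value);
}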

@petervdonovan
Collaborator

If you share any steps required to reproduce the leak, then I would be happy to attempt to help. I'll probably have some time to tinker with it in the next few days.

@jackyk02
Collaborator Author

jackyk02 commented Jul 4, 2023

I have successfully reproduced the error you mentioned. I agree that we should not decrease the reference count all the way to zero regardless of what the count was before. The segmentation fault occurring in the provided example is now fixed by this commit (lf-lang/reactor-c@825fe82)

In the commit above, python_count_decrement is still used as the destructor for the output port, and the value assigned to the port only increments its reference count by one.

For reproducing the leak, the following example could be used:

target Python {
    timeout: 10 sec,
    logging: DEBUG
};


preamble {=
    import time
    import tracemalloc
    tracemalloc.start()
=}

reactor ClientReactor {
    input in_parameter
    output out_parameter
    
    reaction(startup) {=
     =}
     
    reaction(in_parameter) -> out_parameter {=
        time.sleep(2)
        param_temp = in_parameter.value
        print(len(param_temp))
        print("RECEIVED-----------------------------------------------------------------------------------------------")
        out_parameter.set(param_temp)
    =}
}

reactor ServerReactor {
    input in_parameter
    output out_parameter
    state large_param
    
    reaction(startup) -> out_parameter {=
        temp = [1.00005]*10
        self.large_param = temp * 100000
        out_parameter.set(self.large_param)
     =}
     
    reaction(in_parameter) -> out_parameter {=
        snapshot = tracemalloc.take_snapshot()
        top_stats = snapshot.statistics("lineno")
        
        print("[ Top 10 ]")
        for stat in top_stats[:10]:
            print(stat)
            
        time.sleep(2)
        param_temp = in_parameter.value
        out_parameter.set(param_temp)
    =}
}

federated reactor {
    client = new ClientReactor()
    server = new ServerReactor()
    server.out_parameter -> client.in_parameter after 0
    client.out_parameter -> server.in_parameter
}

A temporary fix has been implemented in this commit (lf-lang/reactor-c@ec6e4f8). Please let me know if you have trouble reproducing the leak. Thank you for your offer to help :)

@petervdonovan
Collaborator

petervdonovan commented Jul 5, 2023

Thanks Jacky!

I'm assuming that we are interested in the top entry that is marked <unknown>:0:, right? Since that corresponds to allocations in our generated C code? And we see the size and count associated with that entry increasing?

An interesting feature of the MRE that you shared that is different from the screenshot at the top of this post is that it doesn't look like the large object is what is being leaked. In your screenshot, the number of objects goes from 20000226 to 24000256 (a large increase), whereas when I run your MRE locally with reactor-c/825fe82 checked out, the number of objects initially jumps to about a million (as expected, since we send an array of length about a million), and then it jumps to 2 million (as I would expect if we are leaking all the million-long arrays). But after that the number only creeps up slowly, from 2000016 to 2000027 to 2000034, and so on, with the total size stuck at around 61.6 MiB (i.e., not increasing substantially). So it seems like we are maybe not leaking the large objects in this MRE, but instead we are leaking a relatively small number of small objects.

Furthermore, even using reactor-c/ec6e4f8, I still see very similar behavior, with an initial jump to about 2 million objects and 61.6 MiB and then a slow leak of only a small number of objects in each report. I can't really discern the difference, even though the commit messages say 825fe82 leaks whereas ec6e4f8 doesn't leak.

@jackyk02
Collaborator Author

jackyk02 commented Jul 6, 2023

Thanks for testing the MRE. It is indeed a bit confusing as to why the total size appears to remain at around 61.6 MB. Based on my observations, the reference count of port->value is 2 during the first iteration. However, in the following iterations, the reference count of that becomes 1. I'm conducting further investigations to identify which Python C API increments the reference count of port->value.

You are right to point out that small objects seem to be continuously leaking. This could potentially be due to a few lingering temporary PyObjects in the current implementation. Ideally, these should be removed through the garbage collection.

As for the commit reactor-c/825fe82, I'm curious to know whether all the tests passed successfully with it. Furthermore, it appears that unfederated execution only supports the single-threaded runtime...

@petervdonovan
Collaborator

I can't really discern the difference, even though the commit messages say 825fe82 leaks whereas ec6e4f8 doesn't leak.

You are right to point out that small objects seem to be continuously leaking. This could potentially be due to a few lingering temporary PyObjects in the current implementation.

Just making sure this is clear -- I'm still not sure that the MRE actually is an MRE, in the sense that I do not see how it reproduces the leak that is blocking the federated learning project. If the reference count on the big list that is allocated in the reaction body were stuck at 1, then we would be accumulating lots of copies of that big list, which isn't happening. So I still am not sure what I would need to do in order to make sure I have actually reproduced the same bug that is blocking the federated learning project.

@petervdonovan
Collaborator

I'm curious to know if all the tests passed successfully with this commit.

OK, I'll update the submodule and merge in master so that the tests run.

Furthermore, it appears that unfederated execution doesn't support multiport. Do you happen to have an example that demonstrates it?

Hm? Unfederated execution does support multiports. Here are the passing test cases that use multiports without federated execution.

This is expected to result in no segfaults but a leak.
Due to failing enclave tests.
@jackyk02
Collaborator Author

jackyk02 commented Jul 6, 2023

Hm? Unfederated execution does support multiports. Here are the passing test cases that use multiports without federated execution.

To clarify, what I intended to say is that unfederated execution only supports the single-threaded runtime.

So I still am not sure what I would need to do in order to make sure I have actually reproduced the same bug that is blocking the federated learning project.

It appears that the recent commit reactor-c/825fe82 no longer has the memory leak issue for the large list. The bug I mentioned seems to be caused by the while loop in the previous commit, where it decreases the reference count all the way to zero regardless of what the count was before. Thank you for your help, and now we should prioritize ensuring that all tests pass successfully.

Do not decrement the reference count twice.
In my most recent meeting with @jackykwok2024 I believe we determined
that although there are multiple memory leaks, the most severe one is
fixed in the commit being referenced here, and that in any case the
other commit in reactor-c is not less leaky than the one referenced
here. Besides, I have found that in at least one test program a segfault
is fixed by using the commit referenced here (d28a9f5).
@petervdonovan marked this pull request as draft on July 10, 2023 20:04
This seems to work on DelayArrayWithAfter.
@lhstrh changed the title from "Memory Leak during Federated Execution" to "Memory leak in Python target fixed" on Jul 11, 2023
@lhstrh enabled auto-merge on July 11, 2023 00:26
@lhstrh added this pull request to the merge queue on Jul 11, 2023
Merged via the queue into master with commit 045c429 on Jul 11, 2023
42 checks passed
@lhstrh deleted the memory_leak branch on July 11, 2023 04:41
@petervdonovan changed the title from "Memory leak in Python target fixed" to "Fix for memory leak in Python target" on Aug 26, 2023