You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In collective_create() (around about line 303) we attempt to acquire a child request handle. If we can't get a handle for some reason, we do NOT clean up and exit (as we do with most other error conditions). Instead, we log an error and set the pointer to NULL, but otherwise continue processing. If no other errors occur, then the function will return a pointer to a coll_request struct, but that struct will contain an unexpected NULL pointer.
Describe how to reproduce the problem
The bug is in the error handling for this function, so you'd have to trigger the specific error. I'm not sure how you'd do that (and I rather suspect it's never happened on a production system).
Include any warning or errors or releveant debugging data
This was discovered during a code inspection as part of the work on #758. For reference, Michael Brim's comment about it is reproduced below:
Regarding the TODO, the reason we don't return immediately in that error case is because that would leak the allocated coll_request structure and its member array allocations. However, it looks like we're missing the logic to notice that we failed to get one of the child handles and need to cleanup the structure before returning NULL. Unfortunately, it appears it's not as easy as just calling collective_cleanup(), since that would prematurely free the original group rpc request handle, which we use in the callers to collective_create() to send back an error. This is complicated enough that we probably should open an issue and address in a separate PR.
The text was updated successfully, but these errors were encountered:
System information
Describe the problem you're observing
In collective_create() (around about line 303) we attempt to acquire a child request handle. If we can't get a handle for some reason, we do NOT clean up and exit (as we do with most other error conditions). Instead, we log an error and set the pointer to NULL, but otherwise continue processing. If no other errors occur, then the function will return a pointer to a coll_request struct, but that struct will contain an unexpected NULL pointer.
Describe how to reproduce the problem
The bug is in the error handling for this function, so you'd have to trigger the specific error. I'm not sure how you'd do that (and I rather suspect it's never happened on a production system).
Include any warning or errors or releveant debugging data
This was discovered during a code inspection as part of the work on #758. For reference, Michael Brim's comment about it is reproduced below:
The text was updated successfully, but these errors were encountered: