Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix data race in node upsert #24127

Merged
merged 1 commit into from
Oct 4, 2024
Merged

fix data race in node upsert #24127

merged 1 commit into from
Oct 4, 2024

Conversation

tgross
Copy link
Member

@tgross tgross commented Oct 3, 2024

While testing with agents built with the race-detection option enabled, I encountered a data race while draining a node.

When we upsert a node we copy the NodeResources struct and then perform a fixup for backwards compatibility of the topology struct. This fixup was being executed on the original struct and not the copy, which means we're uselessly fixing up the wrong struct and we're corrupting the state store in the process (albeit harmlessly, I suspect).

Fix the data race by calling the method on the correct pointer.

@tgross tgross added theme/cgroups cgroups issues type/bug labels Oct 3, 2024
@tgross tgross added backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/1.8.x backport to 1.8.x release line labels Oct 3, 2024
Comment on lines 3265 to 3266
// COMPAT remove in 1.9+
// apply compatibility fixups covering node topology
Copy link
Member Author

@tgross tgross Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoenig: is this instead safe to simply remove in 1.9 as suggested by this comment? It looks like we introduced this in 1.6.x. We'll need to be safe for folks to upgrade from 1.6.x to 1.9 but not earlier than that.

We introduced this in 1.7. We should hold off removing until 1.10 LTS. I've updated the associated comments in this PR.

@tgross tgross marked this pull request as ready for review October 3, 2024 20:24
While testing with agents built with the race-detection option enabled, I
encountered a data race while draining a node.

When we upsert a node we copy the `NodeResources` struct and then perform a
fixup for backwards compatibility of the topology struct. This fixup was being
executed on the original struct and not the copy, which means we're uselessly
fixing up the wrong struct and we're corrupting the state store in the
process (albeit harmlessly, I suspect).

Fix the data race by calling the method on the correct pointer.
Copy link
Member

@gulducat gulducat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🌚 (;new_moon_with_face:)

@tgross tgross merged commit 7531b7a into main Oct 4, 2024
26 checks passed
@tgross tgross deleted the data-race-node-resources-compat branch October 4, 2024 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/1.8.x backport to 1.8.x release line theme/cgroups cgroups issues type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants