Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

policy: Do not store policy reference in Cilium socket option #1010

Merged
merged 3 commits into from
Nov 26, 2024

Conversation

jrajahalme
Copy link
Member

@jrajahalme jrajahalme commented Nov 21, 2024

Remove the member initial_policy_ from the Cilium SocketOption class, as that reference was possibly kept after the policy had already been deleted. This made it possible for the policy to be destructed from the worker thread, which can lead to Envoy crash.

initial_policy_ was stored as we already did the policy lookup and the same policy is needed in other Cilium filters. Have the cilium.network and cilium.tls_wrapper filters do their own policy lookups instead, so that we do not need to keep the reference in SocketOption.

Worker threads do the policy lookup from their own thread local maps, which hold a reference to the policy. These thread local references are relinquished from post function during policy update, which will release worker thread's last reference to the policy at the time. The main thread is the last one to update or delete policy, so the policy instance destruction happens from the main thread. For this to work it is imperative to only hold policy references in these thread local maps.

Note that even keeping a weak reference accessible to the worker threads outside of the policy maps is risky, as then the worker thread could convert that weak reference to a shared pointer while the reference has already been relinquished from the thread's local policy map. In this situation the other threads could also relinquish their references concurrently to the worker thread holding the shared pointer, which would then become the last reference and destruction would happen from the worker thread again.

@jrajahalme jrajahalme added the wip work-in-progress, no need to review label Nov 21, 2024
@jrajahalme jrajahalme requested a review from a team as a code owner November 21, 2024 13:34
@jrajahalme jrajahalme requested review from sayboras and removed request for a team November 21, 2024 13:34
@jrajahalme jrajahalme force-pushed the debug-build-for-testing branch 4 times, most recently from 07d4662 to 080d03a Compare November 22, 2024 14:57
@jrajahalme jrajahalme marked this pull request as draft November 22, 2024 14:58
@jrajahalme jrajahalme force-pushed the debug-build-for-testing branch from 080d03a to 2962755 Compare November 22, 2024 16:02
@jrajahalme jrajahalme changed the title bazel: debug build policy: Remove policy reference from SocketOption Nov 22, 2024
@jrajahalme jrajahalme added bug Something isn't working needs-backport/1.30 and removed wip work-in-progress, no need to review labels Nov 22, 2024
@jrajahalme jrajahalme marked this pull request as ready for review November 22, 2024 16:33
@jrajahalme jrajahalme marked this pull request as draft November 24, 2024 07:57
@jrajahalme jrajahalme force-pushed the debug-build-for-testing branch from c6111e0 to 63d0b4a Compare November 24, 2024 08:20
@jrajahalme jrajahalme changed the title policy: Remove policy reference from SocketOption SocketOption: Only store a weak policy reference Nov 24, 2024
Include also sni test not having "l7" in it's name, and check the logs
for errors as well.

Test with embedded cilium-envoy so that Envoy error and warning logs will
also be checked for.

Signed-off-by: Jarno Rajahalme <[email protected]>
Add a runtime assert that the policy destruction happens from the main or
test thread. The "test thread" case also covers the destruction from the
initial thread, which happens for the static members of
'AllowAllEgressPolicyInstanceImpl' when running `cilium-envoy version',
for example.

Signed-off-by: Jarno Rajahalme <[email protected]>
@jrajahalme jrajahalme force-pushed the debug-build-for-testing branch from 63d0b4a to 1810b93 Compare November 24, 2024 09:35
@jrajahalme jrajahalme changed the title SocketOption: Only store a weak policy reference policy: Do not store policy reference in socket_option Nov 24, 2024
@jrajahalme jrajahalme force-pushed the debug-build-for-testing branch from 1810b93 to a94e8f2 Compare November 24, 2024 09:38
@jrajahalme jrajahalme marked this pull request as ready for review November 24, 2024 09:40
@jrajahalme jrajahalme force-pushed the debug-build-for-testing branch from a94e8f2 to 2824f9e Compare November 24, 2024 09:53
Remove the member 'initial_policy_' from the Cilium SocketOption class,
as that reference was possibly kept after the policy had already been
deleted. This made it possible for the policy to be destructed from the
worker thread, which can lead to Envoy crash.

'initial_policy_' was stored as we already did the policy lookup and the
same policy is needed in other Cilium filters. Have the cilium.network
and cilium.tls_wrapper filters do their own policy lookups instead, so
that we do not need to keep the reference in SocketOption.

Worker threads do the policy lookup from their own thread local maps,
which hold a reference to the policy. These thread local references are
relinquished from post function during policy update, which will release
worker thread's last reference to the policy at the time. The main thread
is the last one to update or delete policy, so the policy instance
destruction happens from the main thread. For this to work it is
imperative to only hold policy references in these thread local maps.

Note that even keeping a weak reference accessible to the worker threads
outside of the policy maps is risky, as then the worker thread could
convert that weak reference to a shared pointer while the reference has
already been relinquished from the thread's local policy map. In this
situation the other threads could also relinquish their references
concurrently to the worker thread holding the shared pointer, which would
then become the last reference and destruction would happen from the
worker thread again.

Signed-off-by: Jarno Rajahalme <[email protected]>
@jrajahalme jrajahalme force-pushed the debug-build-for-testing branch from 2824f9e to 78b3394 Compare November 24, 2024 09:54
@jrajahalme jrajahalme changed the title policy: Do not store policy reference in socket_option policy: Do not store policy reference in Cilium socket option Nov 24, 2024
Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have verified the stack trace before and after the changes, nice find on socket option 🎖️.

Two comments as per below, but not a blocker.

cilium/network_policy.cc Show resolved Hide resolved
@jrajahalme jrajahalme added this pull request to the merge queue Nov 26, 2024
Merged via the queue into main with commit f09ed99 Nov 26, 2024
5 checks passed
@jrajahalme jrajahalme deleted the debug-build-for-testing branch November 26, 2024 06:03
@sayboras sayboras mentioned this pull request Dec 3, 2024
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-pending/1.30 bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants