Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.30 Backports 2024-12-03 #1022

Open
wants to merge 11 commits into
base: v1.30
Choose a base branch
from
Open

Conversation

sayboras
Copy link
Member

@sayboras sayboras commented Dec 3, 2024

Note: This backport PR consolidates all recent fixes related to SDS, we can push new commit if required.

Upstream testing is done in cilium/cilium#36925

Once this PR is merged, a GitHub action will update the labels of these PRs:

 1010 1020 1021 1016 1023 1082

@sayboras sayboras requested a review from jrajahalme December 3, 2024 01:31
@sayboras sayboras added the dont-merge/preview-only DON'T MERGE label Dec 3, 2024
@sayboras sayboras force-pushed the pr/v1.30-backport-2024-12-03-12-04 branch 2 times, most recently from e8071c1 to 7aa0126 Compare December 3, 2024 01:46
cilium/network_policy.h Outdated Show resolved Hide resolved
Copy link
Member

@jrajahalme jrajahalme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to get #1023 in as well

@sayboras
Copy link
Member Author

sayboras commented Dec 3, 2024

Need to get #1023 in as well

Yup, I will pick up the change after it is merged into main (just to get main commit sha)

@jrajahalme
Copy link
Member

Need to get #1023 in as well

Yup, I will pick up the change after it is merged into main (just to get main commit sha)

Merged :-)

@sayboras
Copy link
Member Author

sayboras commented Dec 3, 2024

Merged :-)

Picked it up 👍

@sayboras sayboras force-pushed the pr/v1.30-backport-2024-12-03-12-04 branch from cce5182 to 5373c55 Compare December 11, 2024 07:32
[ upstream commit 4879439 ]

Include also sni test not having "l7" in it's name, and check the logs
for errors as well.

Test with embedded cilium-envoy so that Envoy error and warning logs will
also be checked for.

Signed-off-by: Jarno Rajahalme <[email protected]>
Signed-off-by: Tam Mach <[email protected]>
[ upstream commit db5924e ]

Add a runtime assert that the policy destruction happens from the main or
test thread. The "test thread" case also covers the destruction from the
initial thread, which happens for the static members of
'AllowAllEgressPolicyInstanceImpl' when running `cilium-envoy version',
for example.

Signed-off-by: Jarno Rajahalme <[email protected]>
Signed-off-by: Tam Mach <[email protected]>
[ upstream commit f09ed99 ]

Remove the member 'initial_policy_' from the Cilium SocketOption class,
as that reference was possibly kept after the policy had already been
deleted. This made it possible for the policy to be destructed from the
worker thread, which can lead to Envoy crash.

'initial_policy_' was stored as we already did the policy lookup and the
same policy is needed in other Cilium filters. Have the cilium.network
and cilium.tls_wrapper filters do their own policy lookups instead, so
that we do not need to keep the reference in SocketOption.

Worker threads do the policy lookup from their own thread local maps,
which hold a reference to the policy. These thread local references are
relinquished from post function during policy update, which will release
worker thread's last reference to the policy at the time. The main thread
is the last one to update or delete policy, so the policy instance
destruction happens from the main thread. For this to work it is
imperative to only hold policy references in these thread local maps.

Note that even keeping a weak reference accessible to the worker threads
outside of the policy maps is risky, as then the worker thread could
convert that weak reference to a shared pointer while the reference has
already been relinquished from the thread's local policy map. In this
situation the other threads could also relinquish their references
concurrently to the worker thread holding the shared pointer, which would
then become the last reference and destruction would happen from the
worker thread again.

Signed-off-by: Jarno Rajahalme <[email protected]>
Signed-off-by: Tam Mach <[email protected]>
[ upstream commit 7b1dc6c ]

Log an error instead of crashing when Cilium NetworkPolicy resources are
destructed in a worker thread. Remove stacktrace that is not useful at
all with release builds.

Prior to enabling use of SDS secrets running NetworkPolicy destructors in
a worker thread did not cause visible problems, even though we had taken
measures to not do that a long time ago. Now that references to the
policy have been removed from the connection metadata (Cilium
SocketOption) this should no longer trigger. Nonetheless, crashing Envoy
is too drastic as the event may be survivable (as it apparently has been
for a long time when running without SDS references in the policy).

Enable trace level logging for troubleshooting if this error ever occurs.

Signed-off-by: Jarno Rajahalme <[email protected]>
Signed-off-by: Tam Mach <[email protected]>
[upstream commit 16e43f5]

Signed-off-by: Jarno Rajahalme <[email protected]>
Signed-off-by: Tam Mach <[email protected]>
Remove the build-tests-only argument to include all needed targets in the
tests archive. Add upstream tcp integration test in envoy-test-deps.

Signed-off-by: Jarno Rajahalme <[email protected]>
Upstream tcp_proxy is missing a nullptr check in ASSERT that crashes
tests when run with config that has ASSERT enabled.

Signed-off-by: Jarno Rajahalme <[email protected]>
Update tools/stack_decode.py from upstream Envoy.

Comment says it works with gcc debug builds, and indeed I had no luck
getting it working with clang debug builds. Did not try gcc. We are
currently dependent on clang for crosscompilation for arm64.

Signed-off-by: Jarno Rajahalme <[email protected]>
Change error logs to IS_ENVOY_BUG to get a stack trace when a policy is
destructed from a worker thread.

Signed-off-by: Jarno Rajahalme <[email protected]>
Add patch to Envoy to release thread local slots first in the worker
threads, and last in the main thread. This makes sure that destructors of
shared resources are called from the main thread instead of a random
worker thread. This fixes the issue where Cilium Network Policy Map was
destructed from a main thread, possibly interfering with the Secret xDS
gRPC protocol.

Signed-off-by: Jarno Rajahalme <[email protected]>
@sayboras sayboras force-pushed the pr/v1.30-backport-2024-12-03-12-04 branch from 5373c55 to 12f7899 Compare January 10, 2025 00:42
@sayboras sayboras removed the dont-merge/preview-only DON'T MERGE label Jan 10, 2025
@sayboras sayboras marked this pull request as ready for review January 10, 2025 01:44
@sayboras sayboras requested a review from a team as a code owner January 10, 2025 01:44
@sayboras sayboras requested review from mhofstetter and jrajahalme and removed request for a team January 10, 2025 01:44
@sayboras sayboras removed the request for review from mhofstetter January 10, 2025 01:46
@sayboras sayboras added the dont-merge/preview-only DON'T MERGE label Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants