Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(containerd-shim-wasm): add OpenTelemetry tracing library and feature #582

Merged
merged 22 commits into from
Jul 23, 2024

Conversation

Mossaka
Copy link
Member

@Mossaka Mossaka commented May 6, 2024

This PR introduces OpenTelemetry feature and new APIs on the core crate to add tracing capabilities to the shim.

  • Added OpenTelemetry tracing to the main entry point of the shim. It now parses the OTEL_EXPORTER_OTLP_ENDPOINT environment variable to determine if the shim should be started with OpenTelemetry tracing.
  • Enable that in the wasmtime shim

It builds on top of #564

@Mossaka Mossaka marked this pull request as draft May 6, 2024 04:53
@Mossaka Mossaka force-pushed the otel-wasmtime branch 2 times, most recently from 8a3f861 to a00f739 Compare May 6, 2024 05:03
@Mossaka Mossaka force-pushed the otel-wasmtime branch 2 times, most recently from 4d578ab to d9f1fe7 Compare May 16, 2024 22:42
@Mossaka Mossaka changed the title [DO NOT MERGE] WIP: wasmtime shim OpenTelemetry tracing support wasmtime shim OpenTelemetry tracing support May 16, 2024
@Mossaka Mossaka requested a review from jsturtevant May 16, 2024 22:44
@Mossaka Mossaka marked this pull request as ready for review May 16, 2024 22:44
@Mossaka Mossaka changed the title wasmtime shim OpenTelemetry tracing support feat(containerd-shim-wasm): add OpenTelemetry tracing library and feature May 18, 2024
@Mossaka
Copy link
Member Author

Mossaka commented May 20, 2024

Could you please take a look on this PR 🙏🏻 @devigned @jsturtevant @jprendes @cpuguy83

docs/opentelemetry.md Outdated Show resolved Hide resolved
crates/containerd-shim-wasm/src/sandbox/shim/otel.rs Outdated Show resolved Hide resolved
crates/containerd-shim-wasm/src/sandbox/shim/otel.rs Outdated Show resolved Hide resolved
crates/containerd-shim-wasm/src/sandbox/shim/otel.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@jsturtevant jsturtevant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, awesome to see that this has already been used to solve our startup perf!

docs/opentelemetry.md Show resolved Hide resolved
docs/opentelemetry.md Show resolved Hide resolved
docs/image.png Outdated Show resolved Hide resolved
docs/image-1.png Outdated Show resolved Hide resolved
docs/opentelemetry.md Outdated Show resolved Hide resolved
docs/opentelemetry.md Outdated Show resolved Hide resolved
crates/containerd-shim-wasm/src/sandbox/cli.rs Outdated Show resolved Hide resolved
use containerd_shim_wasmtime::WasmtimeInstance;

fn main() {
shim_main::<WasmtimeInstance>("wasmtime", version!(), revision!(), "v1", None);
#[tokio::main]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? I thought we were injecting the async runtime into the .install_batch(runtime::Tokio). Similar to how we do it for the containerd client code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, unfortunately opentelemetry SDK in rust all require tokio runtime

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar enough with rust async, will this cause any issues with shim or is it ok to sprinkle async in various areas?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR: I think it's fine, but it makes me feel a bit uncomfortable.

This could be a problem with runtimes that also use tokio. I believe it is not as it is here, but we need to be careful.

When I attempted to convert everythign to async, I had the problem that the wasm runtime was starting a tokio runtime. But that code was already running inside shim's tokio runtime. Tokio does not allow nesting runtimes. The solution to this is to spawn a new thread before calling the run_wasi method and block/wait for that thread to finish. The new thread get a clean new stack and can initialize a new tokio runtime. This is ok because the new thread is not managed by tokio (hence, not nested tokio runtimes), and is a documented behaviour. But we pay the cost of spawning a thread.

When the shim runs in async, and when the process forked by youki, the executed code is in a state where it believes it's inside an async runtime (hence we can't start a runtime again), but this is not actually true. After the forking, the code believe everything stays the same, but it's become the entry-point of a new process and can't be suspended anymore. I think we could add support for this at a youki level (don't quote me, I haven't looked too much into it).

Now, I believe in this case it's OK, as we are using the sync version of containerd-shim, which creates a thread pool to handle the ttrpc requests, which has the same result as the workaround I mentioned before.

@Mossaka
Copy link
Member Author

Mossaka commented Jun 5, 2024

Did a rebase of the PR and resolved all the comments. Could you please take another look, @cpuguy83 , @jsturtevant 🙏?

@Mossaka Mossaka closed this Jun 6, 2024
@Mossaka Mossaka reopened this Jun 6, 2024
@Mossaka Mossaka requested review from devigned and jprendes June 8, 2024 21:16
@Mossaka Mossaka closed this Jun 8, 2024
@Mossaka Mossaka reopened this Jun 8, 2024
@Mossaka Mossaka requested a review from jprendes July 19, 2024 17:40
jprendes
jprendes previously approved these changes Jul 19, 2024
Copy link
Collaborator

@jprendes jprendes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

this commit adds otel collector APIs and a new opentelemetry feature to the wasm shim

Signed-off-by: jiaxiao zhou <[email protected]>
Signed-off-by: jiaxiao zhou <[email protected]>
this commit adds a new env var OTEL_EXPORTER_OTLP_PROTOCOL to configure different types of OTLP protocols such as grpc and http/protobuf

by default, it uses http/protobuf

Signed-off-by: jiaxiao zhou <[email protected]>
Signed-off-by: jiaxiao zhou <[email protected]>
Signed-off-by: jiaxiao zhou <[email protected]>
this commit adds the traces specific env vars to the otel module. It also refactors the code and adds unit tests.

Signed-off-by: jiaxiao zhou <[email protected]>
this commit does a few things to refactor the opentelemetry codebase
1. rename otel funcitons
2. bump deps of otel
3. move tokio runtime to the crate

Signed-off-by: jiaxiao zhou <[email protected]>
this commit removes ConfigBuilder and adds build_from_env to Config and renames shim_main_with_otel to shim_main

Signed-off-by: jiaxiao zhou <[email protected]>
Signed-off-by: jiaxiao zhou <[email protected]>
Signed-off-by: jiaxiao zhou <[email protected]>
@Mossaka
Copy link
Member Author

Mossaka commented Jul 23, 2024

Going to merge this PR in and I will work on the follow ups

  1. splitting long-running function to 60s interval spans feat(otel): split long-running wait function to child spans #653
  2. tracing setup as an injected configuration
  3. figure out how to integrate with Spin

Thanks everyone for reviewing and commenting! 🥂

@Mossaka Mossaka merged commit e037d6a into containerd:main Jul 23, 2024
52 checks passed
@Mossaka Mossaka mentioned this pull request Jul 24, 2024
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants