From dc06f8dca8e522b77e291ef1102e4487fa9ad732 Mon Sep 17 00:00:00 2001 From: Jan Provaznik Date: Thu, 19 Dec 2024 18:14:34 +0100 Subject: [PATCH 1/3] write up --- documentation/specs/proposed/VS-OpenTelemetry | 123 ++++++++++++++++++ 1 file changed, 123 insertions(+) create mode 100644 documentation/specs/proposed/VS-OpenTelemetry diff --git a/documentation/specs/proposed/VS-OpenTelemetry b/documentation/specs/proposed/VS-OpenTelemetry new file mode 100644 index 00000000000..af6626ea9e5 --- /dev/null +++ b/documentation/specs/proposed/VS-OpenTelemetry @@ -0,0 +1,123 @@ +# Telemetry via OpenTelemetry design + +VS OTel provide packages compatible with ingesting data to their backend if we instrument it via OpenTelemetry traces (System.Diagnostics.Activity). +VS OTel packages are not open source so we need to conditionally include them in our build only for VS and MSBuild.exe + +Onepager: [telemetry onepager by JanProvaznik · Pull Request #11013 · dotnet/msbuild](https://github.com/dotnet/msbuild/pull/11013) + +## Requirements + + +### Performance + +- If not sampled, no infra initialization overhead. +- Avoid allocations when not sampled. +- Has to have no impact on Core without opting into tracing, small impact on Framework + +### Privacy + +- Hashing data points that could identify customers (e.g. names of targets) +- Opt out capability + +### Security + +- Providing a method for creating a hook in Framework MSBuild +- document the security implications of hooking custom telemetry Exporters/Collectors in Framework +- other security requirements (transportation, rate limiting, sanitization, data access) are implemented by VS Telemetry library or the backend + +### Data handling + +- Implement head [Sampling](https://opentelemetry.io/docs/concepts/sampling/) with the granularity of a MSBuild.exe invocation/VS instance. +- VS Data handle tail sampling in their infrastructure not to overwhelm storage with a lot of build events. + +#### Data points + +The data sent via VS OpenTelemetry is neither a subset neither a superset of what is sent to SDK telemetry and it is not a purpose of this design to unify them. + +##### Basic info + +- Build duration +- Host +- Build success/failure +- Version +- Target (hashed) + +##### Evnironment + +- SAC (Smart app control) enabled + +##### Features + +- BuildCheck enabled? + +The design allows for easy instrumentation of additional data points. + +## Core `dotnet build` scenario + +- Telemetry should not be collected via VS OpenTelemetry mechanism because it's already collected in sdk. +- There should be an opt in to initialize the ActivitySource to avoid degrading performance. +- [baronfel/otel-startup-hook: A .NET CLR Startup Hook that exports OpenTelemetry metrics via the OTLP Exporter to an OpenTelemetry Collector](https://github.com/baronfel/otel-startup-hook/) and similar enable collecting telemetry data locally by listening to the ActivitySource name defined in MSBuild. + +## Standalone MSBuild.exe scenario + +- Initialize and finalize in Xmake.cs + - ActivitySource, TracerProvider, VS Collector + - overhead of starting VS collector is fairly big (0.3s on Devbox)[JanProvaznik/VSCollectorBenchmarks](https://github.com/JanProvaznik/VSCollectorBenchmarks) + - head sampling should avoid initializing if not sampled + +## VS scenario + +- VS can call `BuildManager` in a thread unsafe way the telemetry implementation has to be mindful of [BuildManager instances acquire its own BuildTelemetry instance by rokonec · Pull Request #8444 · dotnet/msbuild](https://github.com/dotnet/msbuild/pull/8444) + - ensure no race conditions in initialization + - only 1 TracerProvider with VS defined processing should exist +- Visual Studio should be responsible for having a running collector, we don't want this overhead in MSBuild and eventually many components can use it + +## Implementation and MSBuild developer experience + +### Initialization at entrypoints + +- There are 2 entrypoints: + - for VS in BuildManager.BeginBuild + - for standalone in Xmake.cs Main + +### Exiting +Force flush TracerProvider's exporter in BuildManager.EndBuild. +Dispose collector in Xmake.cs at the end of Main. + +### Configuration + +- Class that's responsible for configuring and initializing telemetry and handles optouts, holding tracer and collector. +- Wrapping source so that it has correct prefixes for VS backend to ingest. + +### Instrumenting + +Instrument areas in code running in the main process. + +```csharp +using (Activity? myActivity = OpenTelemetryManager.DefaultActivitySource?.StartActivity(TelemetryConstants.NameFromAConstantToAvoidAllocation)) +{ +// something happens here + +// add data to the trace +myActivity?.WithTag("SpecialEvent","fail") +} +``` + +Interface for classes holding telemetry data + +```csharp +IActivityTelemetryDataHolder data = new SomeData(); +... +myActivity?.WithTags(data); +``` + +## Looking ahead + +- Create a way of using a "HighPrioActivitySource" which would override sampling and initialize Collector in MSBuild.exe scenario/tracerprovider in VS. + - this would enable us to catch rare events + +## Uncertainties + +- Configuring tail sampling in VS telemetry server side infrastructure to not overflow them with data. +- How much head sampling. +- In standalone we could start the collector async without waiting which would potentially miss some earlier traces (unlikely to miss the important end of build trace though) but would degrade performance less than waiting for it's startup. The performance and utility tradeoff is not clear. From 4c0605b098e29eb2d97ccd9c9465d1c213abdbf8 Mon Sep 17 00:00:00 2001 From: Jan Provaznik Date: Fri, 20 Dec 2024 16:56:22 +0100 Subject: [PATCH 2/3] rename, add some details --- .../{VS-OpenTelemetry => VS-OpenTelemetry.md} | 21 ++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) rename documentation/specs/proposed/{VS-OpenTelemetry => VS-OpenTelemetry.md} (85%) diff --git a/documentation/specs/proposed/VS-OpenTelemetry b/documentation/specs/proposed/VS-OpenTelemetry.md similarity index 85% rename from documentation/specs/proposed/VS-OpenTelemetry rename to documentation/specs/proposed/VS-OpenTelemetry.md index af6626ea9e5..3ee56b8754e 100644 --- a/documentation/specs/proposed/VS-OpenTelemetry +++ b/documentation/specs/proposed/VS-OpenTelemetry.md @@ -3,11 +3,10 @@ VS OTel provide packages compatible with ingesting data to their backend if we instrument it via OpenTelemetry traces (System.Diagnostics.Activity). VS OTel packages are not open source so we need to conditionally include them in our build only for VS and MSBuild.exe -Onepager: [telemetry onepager by JanProvaznik · Pull Request #11013 · dotnet/msbuild](https://github.com/dotnet/msbuild/pull/11013) +[Onepager](https://github.com/dotnet/msbuild/blob/main/documentation/specs/proposed/telemetry-onepager.md) ## Requirements - ### Performance - If not sampled, no infra initialization overhead. @@ -74,6 +73,12 @@ The design allows for easy instrumentation of additional data points. ## Implementation and MSBuild developer experience +### Sampling + +- We need to sample before initalizing infrastructure to avoid overhead. +- Enables opt-in and opt-out for guaranteed sample or not sampled. +- nullable ActivitySource, using `?` when working with them, we can be initialized but not sampled -> it will not reinitialize but not collect telemetry. + ### Initialization at entrypoints - There are 2 entrypoints: @@ -81,6 +86,7 @@ The design allows for easy instrumentation of additional data points. - for standalone in Xmake.cs Main ### Exiting + Force flush TracerProvider's exporter in BuildManager.EndBuild. Dispose collector in Xmake.cs at the end of Main. @@ -91,7 +97,9 @@ Dispose collector in Xmake.cs at the end of Main. ### Instrumenting -Instrument areas in code running in the main process. +2 ways of instrumenting: + +#### Instrument areas in code running in the main process ```csharp using (Activity? myActivity = OpenTelemetryManager.DefaultActivitySource?.StartActivity(TelemetryConstants.NameFromAConstantToAvoidAllocation)) @@ -111,6 +119,10 @@ IActivityTelemetryDataHolder data = new SomeData(); myActivity?.WithTags(data); ``` +#### Add data to activity in EndBuild + +- this activity would always be created at the same point when sdk telemetry is sent in Core and we can add data to it + ## Looking ahead - Create a way of using a "HighPrioActivitySource" which would override sampling and initialize Collector in MSBuild.exe scenario/tracerprovider in VS. @@ -121,3 +133,6 @@ myActivity?.WithTags(data); - Configuring tail sampling in VS telemetry server side infrastructure to not overflow them with data. - How much head sampling. - In standalone we could start the collector async without waiting which would potentially miss some earlier traces (unlikely to miss the important end of build trace though) but would degrade performance less than waiting for it's startup. The performance and utility tradeoff is not clear. +- Can we make collector startup faster? +- We could let users configure sample rate via env variable. +- Do we want to send antivirus state? It seems slow. From 68ecf056fa933042635663259c1342aa37aae14b Mon Sep 17 00:00:00 2001 From: Jan Provaznik Date: Fri, 3 Jan 2025 18:45:32 +0100 Subject: [PATCH 3/3] update with comments --- .../specs/proposed/VS-OpenTelemetry.md | 61 ++++++++++++++----- 1 file changed, 47 insertions(+), 14 deletions(-) diff --git a/documentation/specs/proposed/VS-OpenTelemetry.md b/documentation/specs/proposed/VS-OpenTelemetry.md index 3ee56b8754e..656f27f7ac1 100644 --- a/documentation/specs/proposed/VS-OpenTelemetry.md +++ b/documentation/specs/proposed/VS-OpenTelemetry.md @@ -5,6 +5,18 @@ VS OTel packages are not open source so we need to conditionally include them in [Onepager](https://github.com/dotnet/msbuild/blob/main/documentation/specs/proposed/telemetry-onepager.md) +## Concepts + +It's a bit confusing how things are named in OpenTelemetry and .NET and VS Telemetry and what they do. + +| OTel concept | .NET/VS | Description | +| --- | --- | --- | +| Span/Trace | System.Diagnostics.Activity | Trace is a tree of Spans. Activities can be nested.| +| Tracer | System.Diagnostics.ActivitySource | Creates and listens to activites. | +| Processor/Exporter | VS OTel provided default config | filters and saves telemetry as files in a desired format | +| TracerProvider | OTel SDK TracerProvider | Singleton that is aware of processors, exporters and Tracers (in .NET a bit looser relationship because it does not create Tracers just hooks to them) | +| Collector | VS OTel Collector | Sends to VS backend, expensive to initialize and finalize | + ## Requirements ### Performance @@ -20,8 +32,8 @@ VS OTel packages are not open source so we need to conditionally include them in ### Security -- Providing a method for creating a hook in Framework MSBuild -- document the security implications of hooking custom telemetry Exporters/Collectors in Framework +- Providing or/and documenting a method for creating a hook in Framework MSBuild +- If custom hooking solution will be used - document the security implications of hooking custom telemetry Exporters/Collectors in Framework - other security requirements (transportation, rate limiting, sanitization, data access) are implemented by VS Telemetry library or the backend ### Data handling @@ -47,14 +59,14 @@ The data sent via VS OpenTelemetry is neither a subset neither a superset of wha ##### Features -- BuildCheck enabled? +- BuildCheck enabled The design allows for easy instrumentation of additional data points. ## Core `dotnet build` scenario - Telemetry should not be collected via VS OpenTelemetry mechanism because it's already collected in sdk. -- There should be an opt in to initialize the ActivitySource to avoid degrading performance. +- opt in to initialize the ActivitySource to avoid degrading performance. - [baronfel/otel-startup-hook: A .NET CLR Startup Hook that exports OpenTelemetry metrics via the OTLP Exporter to an OpenTelemetry Collector](https://github.com/baronfel/otel-startup-hook/) and similar enable collecting telemetry data locally by listening to the ActivitySource name defined in MSBuild. ## Standalone MSBuild.exe scenario @@ -73,9 +85,19 @@ The design allows for easy instrumentation of additional data points. ## Implementation and MSBuild developer experience +### ActivitySource names + +... + ### Sampling -- We need to sample before initalizing infrastructure to avoid overhead. +Our estimation from VS and SDK data is that there are 10M-100M build events per day. +For proportion estimation (of fairly common occurence in the builds), with not very strict confidnece (95%) and margin for error (5%) sampling 1:25000 would be enough. + +- this would apply for the DefaultActivitySource +- other ActivitySources could be sampled more frequently to get enough data +- Collecting has a cost, especially in standalone scenario where we have to start the collector. We might decide to undersample in standalone to avoid performance frequent impact. +- We want to avoid that cost when not sampled, therefore we prefer head sampling. - Enables opt-in and opt-out for guaranteed sample or not sampled. - nullable ActivitySource, using `?` when working with them, we can be initialized but not sampled -> it will not reinitialize but not collect telemetry. @@ -119,20 +141,31 @@ IActivityTelemetryDataHolder data = new SomeData(); myActivity?.WithTags(data); ``` -#### Add data to activity in EndBuild +#### Default Build activity in EndBuild + +- this activity would always be created at the same point when sdk telemetry is sent in Core +- we can add data to it that we want in general builds +- the desired count of data from this should control the sample rate of DefaultActivitySource + +#### Multiple Activity Sources -- this activity would always be created at the same point when sdk telemetry is sent in Core and we can add data to it +We can create ActivitySources with different sample rates. Ultimately this is limited by the need to initialize a collector. -## Looking ahead +We potentially want apart from the Default ActivitySource: + +1. Other activity sources with different sample rates (in order to get significant data for rarer events such as custom tasks). +2. a way to override sampling decision - ad hoc starting telemetry infrastructure to catch rare events - Create a way of using a "HighPrioActivitySource" which would override sampling and initialize Collector in MSBuild.exe scenario/tracerprovider in VS. - - this would enable us to catch rare events +- this would enable us to catch rare events + ## Uncertainties -- Configuring tail sampling in VS telemetry server side infrastructure to not overflow them with data. -- How much head sampling. +- Configuring tail sampling in VS telemetry server side infrastructure. +- Sampling rare events details. - In standalone we could start the collector async without waiting which would potentially miss some earlier traces (unlikely to miss the important end of build trace though) but would degrade performance less than waiting for it's startup. The performance and utility tradeoff is not clear. -- Can we make collector startup faster? -- We could let users configure sample rate via env variable. -- Do we want to send antivirus state? It seems slow. +- Can collector startup/shutdown be faster? +- We could let users configure sample rate via env variable, VS profile +- Do we want to send antivirus state? Figuring it out is expensive: https://github.com/dotnet/msbuild/compare/main...proto/get-av ~100ms +- ability to configure the overal and per-namespace sampling from server side (e.g. storing it in the .msbuild folder in user profile if different then default values set from server side - this would obviously have a delay of the default sample rate # of executions)