Telemetry design #11175
Draft: JanProvaznik wants to merge 3 commits into dotnet:main from JanProvaznik:otel-design
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,171 @@ | ||
# Telemetry via OpenTelemetry design | ||
|
||
VS OTel provides packages compatible with ingesting data into their backend if we instrument our code via OpenTelemetry traces (System.Diagnostics.Activity). | ||
The VS OTel packages are not open source, so we need to conditionally include them in our build only for VS and MSBuild.exe. | ||
|
||
[Onepager](https://github.com/dotnet/msbuild/blob/main/documentation/specs/proposed/telemetry-onepager.md) | ||
|
||
## Concepts | ||
|
||
Naming differs between OpenTelemetry, .NET, and VS Telemetry, which makes it a bit confusing what each component does. | ||
|
||
| OTel concept | .NET/VS | Description | | ||
| --- | --- | --- | | ||
| Span/Trace | System.Diagnostics.Activity | Trace is a tree of Spans. Activities can be nested.| | ||
| Tracer | System.Diagnostics.ActivitySource | Creates activities; listeners subscribe to it. | | ||
| Processor/Exporter | VS OTel provided default config | Filters and saves telemetry as files in a desired format. | | ||
| TracerProvider | OTel SDK TracerProvider | Singleton that is aware of processors, exporters and Tracers (in .NET the relationship is a bit looser because it does not create Tracers, it just hooks into them). | | ||
| Collector | VS OTel Collector | Sends to VS backend, expensive to initialize and finalize | | ||
|
||
## Requirements | ||
|
||
### Performance | ||
|
||
- If not sampled, no infra initialization overhead. | ||
- Avoid allocations when not sampled. | ||
- Must have no impact on Core unless tracing is opted into, and only a small impact on Framework. | ||
|
||
### Privacy | ||
|
||
- Hashing data points that could identify customers (e.g. names of targets) | ||
- Opt out capability | ||
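
A minimal sketch of how a potentially identifying value, such as a target name, could be hashed before it is attached to telemetry. The helper name `TelemetryHasher` and the choice of SHA-256 are assumptions made for illustration; this design does not prescribe a hashing algorithm.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of MSBuild: hashes strings that could identify a customer.
public static class TelemetryHasher
{
    // Returns the SHA-256 digest of the UTF-8 bytes of 'value' as an uppercase hex string.
    public static string Hash(string value)
    {
        using SHA256 sha = SHA256.Create();
        byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(value));

        // BitConverter.ToString yields "AB-CD-..."; remove the dashes.
        return BitConverter.ToString(digest).Replace("-", string.Empty);
    }
}
```

The same input always produces the same hash, so hashed target names can still be counted and grouped on the backend without the original name being sent.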
|
||
### Security | ||
|
||
- Providing and/or documenting a method for creating a hook in Framework MSBuild. | ||
- If a custom hooking solution is used, document the security implications of hooking custom telemetry Exporters/Collectors in Framework. | ||
- Other security requirements (transportation, rate limiting, sanitization, data access) are implemented by the VS Telemetry library or the backend. | ||
|
||
### Data handling | ||
|
||
- Implement head [Sampling](https://opentelemetry.io/docs/concepts/sampling/) with the granularity of an MSBuild.exe invocation/VS instance. | ||
- VS Data handles tail sampling in its infrastructure so that storage is not overwhelmed by a large number of build events. | ||
|
||
#### Data points | ||
|
||
The data sent via VS OpenTelemetry is neither a subset nor a superset of what is sent to SDK telemetry, and unifying them is not a goal of this design. | ||
|
||
##### Basic info | ||
|
||
- Build duration | ||
- Host | ||
- Build success/failure | ||
- Version | ||
- Target (hashed) | ||
|
||
##### Environment | ||
|
||
- SAC (Smart App Control) enabled | ||
|
||
##### Features | ||
|
||
- BuildCheck enabled | ||
|
||
The design allows for easy instrumentation of additional data points. | ||
|
||
## Core `dotnet build` scenario | ||
|
||
- Telemetry should not be collected via the VS OpenTelemetry mechanism because it is already collected in the SDK. | ||
- The ActivitySource is initialized only on opt-in, to avoid degrading performance. | ||
- [baronfel/otel-startup-hook: A .NET CLR Startup Hook that exports OpenTelemetry metrics via the OTLP Exporter to an OpenTelemetry Collector](https://github.com/baronfel/otel-startup-hook/) and similar tools enable collecting telemetry data locally by listening to the ActivitySource name defined in MSBuild. | ||
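
A hedged sketch of what "listening to the ActivitySource name" can look like using only the in-box `System.Diagnostics` APIs. The source name below is a placeholder (this document does not fix the ActivitySource names), and `LocalListenerSketch` is a hypothetical class used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class LocalListenerSketch
{
    // Placeholder name: the real ActivitySource names are not fixed by this document.
    public const string SourceName = "Hypothetical.MSBuild.ActivitySource";

    public static List<string> Run()
    {
        var log = new List<string>();
        using var source = new ActivitySource(SourceName);

        // With no listener attached, StartActivity returns null and no Activity is created.
        Activity? unobserved = source.StartActivity("Build");
        log.Add(unobserved is null ? "no-listener:null" : "no-listener:created");

        using var listener = new ActivityListener
        {
            // Subscribe only to the source we care about.
            ShouldListenTo = s => s.Name == SourceName,
            // Record everything for this local experiment.
            Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData,
            // Called when an Activity is stopped or disposed.
            ActivityStopped = activity => log.Add(activity.OperationName),
        };
        ActivitySource.AddActivityListener(listener);

        using (Activity? observed = source.StartActivity("Build"))
        {
            observed?.SetTag("BuildSuccess", true);
        }

        return log;
    }
}
```

Because `StartActivity` returns null when nothing listens, code instrumented with `?.` does no telemetry work unless a listener such as a startup hook has opted in.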
|
||
## Standalone MSBuild.exe scenario | ||
|
||
- Initialize and finalize in Xmake.cs | ||
- ActivitySource, TracerProvider, VS Collector | ||
- overhead of starting the VS collector is fairly big (0.3s on Devbox), see [JanProvaznik/VSCollectorBenchmarks](https://github.com/JanProvaznik/VSCollectorBenchmarks) | ||
- head sampling should avoid initializing if not sampled | ||
|
||
## VS scenario | ||
|
||
- VS can call `BuildManager` in a thread-unsafe way, which the telemetry implementation has to be mindful of; see [BuildManager instances acquire its own BuildTelemetry instance by rokonec · Pull Request #8444 · dotnet/msbuild](https://github.com/dotnet/msbuild/pull/8444) | ||
- ensure no race conditions in initialization | ||
- only 1 TracerProvider with VS defined processing should exist | ||
- Visual Studio should be responsible for having a running collector; we don't want this overhead in MSBuild, and eventually many components can use it | ||
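
One possible way to rule out initialization races is to guard the one-time setup with a lock, sketched below. `TelemetryInitializer` and its members are hypothetical names, and the setup body is a placeholder for creating the single TracerProvider.

```csharp
using System;

// Hypothetical sketch: guards one-time telemetry setup against concurrent callers.
public sealed class TelemetryInitializer
{
    private readonly object _gate = new object();
    private bool _initialized;

    // Number of times the setup body actually ran; only modified under the lock.
    public int SetupCount { get; private set; }

    // Returns true only for the single caller that performed the setup.
    public bool TryInitialize()
    {
        lock (_gate)
        {
            if (_initialized)
            {
                return false;
            }

            // Placeholder for creating the single TracerProvider.
            SetupCount++;
            _initialized = true;
            return true;
        }
    }
}
```

Callers that lose the race block until the setup has finished, so they never observe a half-initialized state.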
|
||
## Implementation and MSBuild developer experience | ||
|
||
### ActivitySource names | ||
|
||
... | ||
|
||
### Sampling | ||
|
||
Our estimate from VS and SDK data is that there are 10M-100M build events per day. | ||
For estimating the proportion of a fairly common occurrence in builds, with a not very strict confidence level (95%) and margin of error (5%), sampling 1:25000 would be enough. | ||
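
As a rough check of the 1:25000 figure, assuming the standard normal-approximation formula for the sample size of a proportion with the worst case $p = 0.5$ (the formula behind the estimate is not stated here, so this is an assumption):

$$
n = \frac{z^2 \, p(1-p)}{e^2} = \frac{1.96^2 \cdot 0.5 \cdot 0.5}{0.05^2} \approx 385
$$

At the lower bound of 10M build events per day, sampling 1:25000 yields $10^7 / 25000 = 400$ samples per day, which is just above the required 385.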
|
||
- this would apply for the DefaultActivitySource | ||
- other ActivitySources could be sampled more frequently to get enough data | ||
- Collecting has a cost, especially in the standalone scenario where we have to start the collector. We might decide to undersample in standalone to avoid frequent performance impact. | ||
- We want to avoid that cost when not sampled, therefore we prefer head sampling. | ||
- Opt-in and opt-out are supported, to guarantee that a build is sampled or not sampled. | ||
- The ActivitySource is nullable and accessed with `?`; telemetry can be initialized but not sampled, in which case it does not reinitialize and does not collect telemetry. | ||
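
The head-sampling decision described in the list above could be sketched as follows. `HeadSampler` is a hypothetical name, and the rule that opt-out takes precedence over opt-in is an assumption, not something this design states.

```csharp
using System;

// Hypothetical sketch of a head-sampling decision made once per process.
public static class HeadSampler
{
    // 1:25000, the rate proposed for the default ActivitySource.
    public const double DefaultSampleRate = 1.0 / 25000;

    // 'optOut' and 'optIn' model explicit user choices; 'roll' is a uniform value in [0, 1).
    public static bool ShouldSample(bool optOut, bool optIn, double roll, double rate = DefaultSampleRate)
    {
        if (optOut)
        {
            return false; // assumption: opt-out always wins
        }

        if (optIn)
        {
            return true; // guaranteed sample
        }

        return roll < rate; // random head sampling
    }
}
```

A caller would evaluate this once at the entrypoint, for example with `roll: new Random().NextDouble()`, and leave the ActivitySource null when the result is false so that no telemetry infrastructure is started.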
|
||
### Initialization at entrypoints | ||
|
||
- There are 2 entrypoints: | ||
  - for VS in BuildManager.BeginBuild | ||
  - for standalone in Xmake.cs Main | ||
|
||
### Exiting | ||
|
||
Force flush TracerProvider's exporter in BuildManager.EndBuild. | ||
Dispose collector in Xmake.cs at the end of Main. | ||
|
||
### Configuration | ||
|
||
- A class responsible for configuring and initializing telemetry and handling opt-outs; it holds the tracer and the collector. | ||
- A wrapper around the source so that it has the correct prefixes for the VS backend to ingest. | ||
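
A minimal sketch of such a wrapper, assuming it only needs to prefix activity names. `PrefixedActivitySource` is a hypothetical name and the prefix value is a placeholder; the prefix the VS backend actually expects is not specified in this document.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper: prefixes activity names so that a backend can ingest them.
public sealed class PrefixedActivitySource : IDisposable
{
    // Placeholder prefix, not the real VS backend prefix.
    public const string Prefix = "Example/Prefix/";

    private readonly ActivitySource _inner;

    public PrefixedActivitySource(string name) => _inner = new ActivitySource(name);

    public static string PrefixedName(string name) => Prefix + name;

    // Returns null when nothing is listening; the name is concatenated only
    // when a listener exists, to avoid allocating when not sampled.
    public Activity? StartActivity(string name) =>
        _inner.HasListeners() ? _inner.StartActivity(PrefixedName(name)) : null;

    public void Dispose() => _inner.Dispose();
}
```

Checking `HasListeners()` before building the prefixed name keeps the unsampled path free of string allocations, in line with the performance requirements.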
|
||
### Instrumenting | ||
|
||
2 ways of instrumenting: | ||
|
||
#### Instrument areas in code running in the main process | ||
|
||
```csharp | ||
using (Activity? myActivity = OpenTelemetryManager.DefaultActivitySource?.StartActivity(TelemetryConstants.NameFromAConstantToAvoidAllocation)) | ||
{ | ||
// something happens here | ||
|
||
// add data to the trace | ||
myActivity?.WithTag("SpecialEvent", "fail"); | ||
} | ||
``` | ||
|
||
Interface for classes holding telemetry data | ||
|
||
```csharp | ||
IActivityTelemetryDataHolder data = new SomeData(); | ||
... | ||
myActivity?.WithTags(data); | ||
``` | ||
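
The snippets above rely on `IActivityTelemetryDataHolder`, `WithTag` and `WithTags`, which are not defined in this document. The sketch below shows one plausible shape for them; the member name `GetActivityProperties` and the example property on `SomeData` are assumptions, and the real definitions may differ.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

// Assumed shape of the interface; the real definition may differ.
public interface IActivityTelemetryDataHolder
{
    IEnumerable<KeyValuePair<string, object?>> GetActivityProperties();
}

public static class ActivityExtensions
{
    // Adds a single tag and returns the Activity to allow chaining.
    public static Activity WithTag(this Activity activity, string key, object? value)
    {
        activity.SetTag(key, value);
        return activity;
    }

    // Adds every property exposed by the data holder as a tag.
    public static Activity WithTags(this Activity activity, IActivityTelemetryDataHolder holder)
    {
        foreach (KeyValuePair<string, object?> tag in holder.GetActivityProperties())
        {
            activity.SetTag(tag.Key, tag.Value);
        }

        return activity;
    }
}

// Example data holder with one illustrative property.
public sealed class SomeData : IActivityTelemetryDataHolder
{
    public bool BuildCheckEnabled { get; set; }

    public IEnumerable<KeyValuePair<string, object?>> GetActivityProperties()
    {
        yield return new KeyValuePair<string, object?>("BuildCheckEnabled", BuildCheckEnabled);
    }
}
```

Since the extension methods are invoked through `?.`, the data holder is not enumerated at all when the Activity is null.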
|
||
#### Default Build activity in EndBuild | ||
|
||
- this activity would always be created at the same point where SDK telemetry is sent in Core | ||
- we can add data to it that we want in general builds | ||
- the desired count of data from this should control the sample rate of DefaultActivitySource | ||
|
||
#### Multiple Activity Sources | ||
|
||
We can create ActivitySources with different sample rates. Ultimately this is limited by the need to initialize a collector. | ||
|
||
Apart from the default ActivitySource, we potentially want: | ||
|
||
1. Other activity sources with different sample rates (in order to get significant data for rarer events such as custom tasks). | ||
2. A way to override the sampling decision: starting the telemetry infrastructure ad hoc to catch rare events. | ||
|
||
- Create a way of using a "HighPrioActivitySource" which would override sampling and initialize the Collector in the MSBuild.exe scenario or the TracerProvider in VS. | ||
Review comment: More generally, a sample rate per Activity/namespace (higher, even always, or lower, even never) |
||
- this would enable us to catch rare events | ||
|
||
|
||
## Uncertainties | ||
|
||
- Configuring tail sampling in VS telemetry server side infrastructure. | ||
- Sampling rare events details. | ||
- In standalone we could start the collector asynchronously without waiting, which could miss some earlier traces (though it is unlikely to miss the important end-of-build trace) but would degrade performance less than waiting for its startup. The performance and utility tradeoff is not clear. | ||
- Can collector startup/shutdown be faster? | ||
- We could let users configure sample rate via env variable, VS profile | ||
- Do we want to send antivirus state? Figuring it out is expensive: https://github.com/dotnet/msbuild/compare/main...proto/get-av ~100ms | ||
- ability to configure the overall and per-namespace sampling from the server side (e.g. storing it in the .msbuild folder in the user profile if it differs from the default values set from the server side; this would obviously have a delay of the default sample rate # of executions) |
Review comment: As discussed, we should not prevent ourselves from being able to add (in future versions):