RFC: Build observability using OpenTelemetry tracing
Signed-off-by: Josh W Lewis <[email protected]>
1 parent
822d702
commit e34a900
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,358 @@ | ||
# Meta | ||
[meta]: #meta | ||
- Name: Buildpack Observability | ||
- Start Date: 2022-10-05 | ||
- Author(s): @joshwlewis | ||
- Status: Draft <!-- Acceptable values: Draft, Approved, On Hold, Superseded --> | ||
- RFC Pull Request: (leave blank) | ||
- CNB Pull Request: (leave blank) | ||
- CNB Issue: (leave blank) | ||
- Supersedes: (put "N/A" unless this replaces an existing RFC, then link to that RFC) | ||
|
||
# Summary | ||
[summary]: #summary | ||
|
||
This RFC proposes leveraging [OpenTelemetry](https://opentelemetry.io/) to | ||
grant platform operators and buildpack operators more insight into buildpack | ||
performance and behavior. This RFC describes new opt-in functionality | ||
for both pack and the buildpack spec such that OpenTelemetry data may be | ||
exported to the build file system. | ||
|
||
# Definitions | ||
[definitions]: #definitions | ||
|
||
- [OpenTelemetry](https://opentelemetry.io/): A collection of APIs, SDKs, and tools that can be used to instrument, generate, collect, and export telemetry data.
- [Traces](https://opentelemetry.io/docs/concepts/signals/traces/): Telemetry | ||
category that describes the path of software execution. | ||
|
||
|
||
# Motivation | ||
[motivation]: #motivation | ||
|
||
Buildpack authors and platform operators desire insight into usage, error | ||
scenarios, and performance of builds and buildpacks on their platform. The | ||
following questions are all important for these folks, but difficult to answer: | ||
|
||
- "Which buildpacks commonly fail to compile?" | ||
- "How often does a particular error scenario occur?" | ||
- "How long does each buildpack compile phase take?" | ||
- "How often is a certain buildpack used?" | ||
- "Which versions of Go are being installed?" | ||
- "How long does it take to download node_modules?" | ||
|
||
Instrumenting lifecycle and buildpacks with opt-in OpenTelemetry tracing will | ||
allow platform operators to better understand performance and behavior of their | ||
builds and buildpacks and as a result, provide better service and build | ||
experiences. | ||
|
||
To protect privacy and prevent unnecessary collection of data, this | ||
functionality should be optional and anonymous. | ||
|
||
# What it is | ||
[what-it-is]: #what-it-is | ||
|
||
This RFC aims to provide a solution for two types of OpenTelemetry traces: | ||
|
||
1) Lifecycle tracing: Buildpack-agnostic trace data like which buildpacks were | ||
available, which buildpacks were detected, how long the detect, build, or | ||
export phase took, and so on. This telemetry data may be exported by lifecycle. | ||
2) Buildpack tracing: Telemetry data specific to a buildpack like how long it | ||
took to download a language binary, which language version was selected, and so | ||
on. This telemetry data may be exported by buildpacks. | ||
|
||
Though the sources and contents of the telemetry data differ, both types may | ||
be emitted to the build file system in OpenTelemetry's [File Exporter | ||
Format](https://opentelemetry.io/docs/specs/otel/protocol/file-exporter/). | ||
|
||
In this solution, each lifecycle phase would write a `.jsonl` file with
tracing data for that phase. For example, `lifecycle detector --telemetry`
would write to `/cnb/telemetry/lifecycle-detect.jsonl`. Additionally, each
buildpack may write tracing data to its own `.jsonl` file (at
`/cnb/telemetry/{BUILDPACK_ID}.jsonl`).
|
||
These `.jsonl` files may be read by platform operators for consumption, | ||
transformation, enrichment, and/or export to an OpenTelemetry backend. Given | ||
that builds may crash or fail at any point, these files must be written to | ||
often and regularly to prevent data loss. | ||
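As a minimal sketch of the "write often" requirement, a writer could append each telemetry object as one line and force it to disk before continuing. The helper name `append_telemetry` is hypothetical and not part of this RFC, lifecycle, or any OpenTelemetry SDK; a real implementation would more likely rely on an OpenTelemetry file exporter.

```python
import json
import os


def append_telemetry(path, payload):
    """Append one OTLP JSON object as a single line and force it to disk.

    Hypothetical helper for illustration; not an API defined by this RFC.
    """
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    # Compact separators keep the object on exactly one line, as .jsonl requires.
    line = json.dumps(payload, separators=(",", ":"))
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        # Flush Python's buffer and ask the OS to persist the data, so a
        # crash right after this call does not lose the record.
        f.flush()
        os.fsync(f.fileno())
```

Opening the file in append mode on every call means earlier records are never rewritten, which suits a build that may be terminated at any point.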
|
||
Platform operators will likely want to view or analyze this data. These
telemetry files are in an OTLP-compatible format, so they may be exported to
one or more OpenTelemetry backends like Honeycomb, Prometheus, and [many
others](https://opentelemetry.io/ecosystem/vendors/).
|
||
|
||
# How it Works | ||
[how-it-works]: #how-it-works | ||
|
||
### Lifecycle telemetry files | ||
|
||
If `lifecycle` is given a telemetry opt-in flag (such as `--telemetry`),
`lifecycle` phases (such as `detect`, `build`, `export`) may emit an
OpenTelemetry File Export with tracing data to a known location, such as
`/cnb/telemetry/lifecycle-detect.jsonl`, with contents like this:
|
||
```json
{
  "resourceSpans": [
    {
      "resource": {
        "attributes": [
          {
            "key": "lifecycle.version",
            "value": {
              "stringValue": "0.17.1"
            }
          }
        ]
      },
      "scopeSpans": [
        {
          "scope": {},
          "spans": [
            {
              "traceId": "",
              "spanId": "",
              "parentSpanId": "",
              "name": "buildpack-detect",
              "startTimeUnixNano": "1581452772000000321",
              "endTimeUnixNano": "1581452773000000789",
              "droppedAttributesCount": 2,
              "events": [
                {
                  "timeUnixNano": "1581452773000000123",
                  "name": "detect-pass"
                }
              ],
              "attributes": [
                {
                  "key": "buildpack-id",
                  "value": {
                    "stringValue": "heroku/nodejs-engine"
                  }
                }
              ],
              "droppedEventsCount": 1
            }
          ]
        }
      ]
    }
  ]
}
```
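To illustrate how a file shaped like the example above could answer a question such as "how long does each phase take?", the following sketch walks the `resourceSpans` / `scopeSpans` / `spans` nesting and computes each span's duration. It assumes each record sits on a single line in the actual `.jsonl` file (the example above is pretty-printed for readability). The function name `span_durations` is hypothetical.

```python
import json


def span_durations(jsonl_text):
    """Yield (span name, duration in nanoseconds) for every span in the text."""
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for resource_span in record.get("resourceSpans", []):
            for scope_span in resource_span.get("scopeSpans", []):
                for span in scope_span.get("spans", []):
                    # OTLP JSON encodes 64-bit timestamps as decimal strings.
                    start = int(span["startTimeUnixNano"])
                    end = int(span["endTimeUnixNano"])
                    yield span["name"], end - start
```

For the `buildpack-detect` span shown above, the difference between the two timestamps is 1000000468 nanoseconds, or roughly one second.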
|
||
|
||
### Buildpack telemetry files | ||
|
||
During a buildpack's `detect` or `build` execution, a buildpack may emit
an OpenTelemetry File Export with tracing data to `/cnb/telemetry/{BUILDPACK_ID}.jsonl`
with contents like this:
|
||
```json
{
  "resourceSpans": [
    {
      "resource": {
        "attributes": [
          {
            "key": "lifecycle.version",
            "value": {
              "stringValue": "0.17.1"
            }
          }
        ]
      },
      "scopeSpans": [
        {
          "scope": {},
          "spans": [
            {
              "traceId": "",
              "spanId": "",
              "parentSpanId": "",
              "name": "buildpack-detect",
              "startTimeUnixNano": "1581452772000000321",
              "endTimeUnixNano": "1581452773000000789",
              "droppedAttributesCount": 2,
              "events": [
                {
                  "timeUnixNano": "1581452773000000123",
                  "name": "detect-pass"
                }
              ],
              "attributes": [
                {
                  "key": "buildpack-id",
                  "value": {
                    "stringValue": "heroku/nodejs-engine"
                  }
                }
              ],
              "droppedEventsCount": 1
            }
          ]
        }
      ]
    }
  ]
}
```
|
||
### Lifetime | ||
|
||
Telemetry files may be written at any point during the build, so that they
are persisted in cases of failures to detect, failures to build, process
terminations, or crashes. The `jsonl` format allows telemetry libraries to
safely append additional JSON objects to the end of a telemetry file, so
telemetry data can be flushed to the file frequently. Telemetry files should
not be truncated or deleted, so that telemetry processing by a platform can
happen during or after a build. Telemetry files should not be included in the
build result, as they are not relevant to it and would likely harm
image size and reproducibility.
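Because a platform may read a telemetry file while a build is still running, or after a crash, the last line may be incomplete. One possible approach, sketched here under the assumption that writers terminate every record with a newline, is to treat a line without a trailing newline as not yet complete. The function name `read_complete_records` is hypothetical.

```python
import json


def read_complete_records(path):
    """Return parsed records, ignoring a trailing line that is still being written."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.endswith("\n"):
                # No newline yet: the writer was interrupted or is mid-write.
                break
            if line.strip():
                records.append(json.loads(line))
    return records
```

A reader that runs again later would pick up the record once the writer has finished the line.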
|
||
### Access | ||
|
||
The telemetry files should be readable so that they may be analyzed by | ||
the user and/or platform. However, they should be write protected | ||
to prevent malicious buildpacks from injecting tracing data into other | ||
buildpack or lifecycle telemetry files. | ||
|
||
|
||
### Consumption | ||
|
||
This RFC leaves the consumption of telemetry files to the platform operator.
Platform operators choosing to use this telemetry data need to read the files
either during or after the build. This can be done using existing OpenTelemetry
libraries. Platform operators may enrich or modify the tracing data
as they see fit (with data like `instance_id` or `build_id`). Platform
operators will likely want to export this data to an OpenTelemetry backend for
persistence and analysis, and again, this may be done with existing
OpenTelemetry libraries.
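As a rough illustration of enrichment and export, the sketch below adds platform-specific resource attributes to one parsed record and builds an OTLP/HTTP JSON request for it. It uses only the Python standard library rather than an OpenTelemetry SDK, and the function names `enrich` and `otlp_http_request` are hypothetical. It assumes each line of a telemetry file is a complete trace export payload that a collector accepts at its `/v1/traces` endpoint.

```python
import json
import urllib.request


def enrich(record, extra):
    """Return a copy of an OTLP trace record with extra resource attributes."""
    enriched = json.loads(json.dumps(record))  # deep copy via JSON round trip
    for resource_span in enriched.get("resourceSpans", []):
        resource = resource_span.setdefault("resource", {})
        attributes = resource.setdefault("attributes", [])
        for key, value in extra.items():
            attributes.append({"key": key, "value": {"stringValue": str(value)}})
    return enriched


def otlp_http_request(record, endpoint):
    """Build (but do not send) an OTLP/HTTP JSON request for one record."""
    return urllib.request.Request(
        endpoint,
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A platform could then pass the request to `urllib.request.urlopen`, for example with an endpoint such as `http://localhost:4318/v1/traces` when a collector listens on the default OTLP/HTTP port. The endpoint and any authentication headers are platform-specific and are not defined by this RFC.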
|
||
### Viewing and Analyzing | ||
|
||
Once the lifecycle and buildpack traces are exported to an OpenTelemetry | ||
backend, platform operators should be able to (depending on the features of the | ||
backend): | ||
|
||
- View the complete trace for a build | ||
- View or query attributes attached to spans (e.g. `buildpack_id`, | ||
`nodejs_version`) | ||
- View or query span durations | ||
- View or query error types and/or messages | ||
- and more | ||
|
||
# Migration | ||
[migration]: #migration | ||
|
||
No migration is necessary; this is net-new functionality with no backwards
compatibility concerns.
|
||
# Drawbacks | ||
[drawbacks]: #drawbacks | ||
|
||
### Privacy Concerns | ||
|
||
This RFC outlines functionality that could be perceived as user tracking. To
help address those concerns, these are some factors to remember about this
design:
|
||
1) This functionality is opt-in. `lifecycle` and `pack` will not emit telemetry
data unless the `--telemetry` flag is used.
2) This functionality emits telemetry data only to the build file system. For
`pack` users, the telemetry files are stored in docker volumes on the local
machine. Neither `pack` nor `lifecycle` will "phone home" with telemetry data.
3) Neither `pack` nor `lifecycle` collect user-identifiable data (no emails,
usernames, IP addresses, etc.), so the telemetry data emitted by `lifecycle`
will also be free of user-identifiable data.
|
||
### File Export Format Status | ||
|
||
While the [File Exporter
Format](https://opentelemetry.io/docs/specs/otel/protocol/file-exporter/) is
an official format, and matches the OTLP format nearly exactly (and thus seems
unlikely to change), its status is listed as experimental.
|
||
# Alternatives | ||
[alternatives]: #alternatives | ||
|
||
### OpenTelemetry Metrics | ||
|
||
[Metrics](https://opentelemetry.io/docs/concepts/signals/metrics/) are another | ||
category of telemetry data that could be used to answer questions about | ||
build and buildpack behavior and performance. However, metrics are intended to | ||
provide statistical information in aggregate. Since `lifecycle` and `pack` | ||
only run one build at a time, there is no way to aggregate information about | ||
multiple builds in `pack` or `lifecycle`. | ||
|
||
### OTLP | ||
|
||
The [OpenTelemetry Protocol](https://opentelemetry.io/docs/specs/otlp/) (OTLP) is a
network delivery protocol for OpenTelemetry data. Instead of emitting files as
this RFC describes, lifecycle and buildpacks could instead connect to an
OpenTelemetry collector provided by the platform operator. This pattern is
well supported and well known.
|
||
However, there are drawbacks: | ||
|
||
- In local `pack build` scenarios, it's unlikely that users would have an | ||
OpenTelemetry collector running. This RFC solution does not require a | ||
collector. | ||
- lifecycle and buildpacks would need to know where the OpenTelemetry collector | ||
is and how to authenticate with it. Lifecycle and buildpacks that wish to | ||
emit telemetry may not want to deal with the mountain of configuration to | ||
support various collectors. | ||
- Platform operators may have complex network topology that may make supporting | ||
this feature challenging (e.g. a firewall between lifecycle and the collector | ||
may still be perceived as a lifecycle malfunction). | ||
|
||
There is an [RFC for this alternative](https://github.com/buildpacks/rfcs/pull/300). | ||
|
||
# Prior Art | ||
[prior-art]: #prior-art | ||
|
||
|
||
- [Feature Request](https://github.com/buildpacks/lifecycle/issues/1208) | ||
- [Slack | ||
Discussion](https://cloud-native.slack.com/archives/C033DV8D9FB/p1695144574408979) | ||
|
||
# Unresolved Questions | ||
[unresolved-questions]: #unresolved-questions | ||
|
||
- What file paths should be used for lifecycle telemetry? | ||
- Does `lifecycle` emit files in other places that should be matched? | ||
|
||
- What file paths should be used for buildpack telemetry? | ||
- `/layers` paths are not available during detect, but `detect` tracing is
desirable.
- `/workspace` may not make sense, since telemetry files probably | ||
shouldn't be a part of the build result image. | ||
|
||
|
||
# Spec. Changes (OPTIONAL) | ||
[spec-changes]: #spec-changes | ||
|
||
Buildpack tracing file locations and format should be added to the [buildpack | ||
specification](https://github.com/buildpacks/spec/blob/main/buildpack.md#build). | ||
|
||
# History | ||
[history]: #history | ||
|
||
<!-- | ||
## Amended | ||
### Meta | ||
[meta-1]: #meta-1 | ||
- Name: (fill in the amendment name: Variable Rename) | ||
- Start Date: (fill in today's date: YYYY-MM-DD) | ||
- Author(s): (Github usernames) | ||
- Amendment Pull Request: (leave blank) | ||
### Summary | ||
A brief description of the changes. | ||
### Motivation | ||
Why was this amendment necessary? | ||
---> |