-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[cmd/opampsupervisor]: Implement PackagesAvailable for upgrading agent #35503
base: main
Are you sure you want to change the base?
[cmd/opampsupervisor]: Implement PackagesAvailable for upgrading agent #35503
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this on, its no small task. I'm still looking through this, but wanted to give early feedback on a few points.
require.Equal(t, &protobufs.PackageStatuses{ | ||
Packages: map[string]*protobufs.PackageStatus{ | ||
"": { | ||
// TODO: Should initital version be filled in? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be the version/hash of the Collector binary, right? I think the version can be the one obtained during bootstrapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that makes sense. I think the mental hurdle I'm having here is the sort of disconnect between the artifact the agent needs to end up with and what is available from the releases.
Because the release artifact is a tarball, but the agent executable is not, it means that the artifact offered by the server and the one the agent ends up with have completely different hashes. This initial status is kind of a weird edge case.
return nil | ||
} | ||
|
||
// TODO: Certificate paths? The certificate can be specified via SIGSTORE_ROOT_FILE for now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we link to an issue here? It would make it easier to track and provides some context.
agentHash := h.Sum(nil) | ||
|
||
// Load persisted package state, if it exists | ||
// TODO: use packagesStatusPath method somehow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is okay given that it's the only other usage of this and is a very simple call to filepath.Join
.
The other options I can think of aren't worth the additional complexity:
- Extract the
packagesStatusPath
functionality to a function and call the function in the method as well. Too much abstraction for a simple call. - Instantiate
packageManager
above and callpathagesStatusPath
here, then fill in other details on the struct afterward. Splits the initialization too much for fairly minimal gains.
return err | ||
} | ||
|
||
type packageState struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to instead use the persistent state file to store this? I see how the purpose of each file is distinct, but having two small-ish YAML files also feels a little odd.
This may also be premature, but I think abstracting state storage through the persistent state struct may keep things cleaner in the long run.
@@ -275,20 +297,20 @@ func (s *Supervisor) loadConfig(configFile string) error { | |||
// an OpAMP extension, obtains the agent description, then | |||
// shuts down the Collector. This only needs to happen | |||
// once per Collector binary. | |||
func (s *Supervisor) getBootstrapInfo() (err error) { | |||
func (s *Supervisor) getBootstrapInfo() (agentVersion string, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is another place where consolidating our storage to the persistent state may make things cleaner. We have two different data flows for the instance ID vs. the agent version in this function, when I think we should probably store the two next two each other so all of the agent's information is recorded in the same place. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify on what's being proposed here? I don't think we should store the agent version anywhere, we should always treat the reported version from the agent as the source of truth (as best we can, anyways).
} | ||
|
||
// overwrite agent process | ||
startAgent, err := p.am.stopAgentProcess(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it would substantially complicate things if we limited agent process management to the Supervisor struct and only deal with managing the binary file here? My first impression is that we should keep those responsibilities separated if possible and not have the package manager know anything about the Collector process, even if the current implementation isn't too complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought was that we need to stop the agent process to write the binary in the first place.
We could write the binary somewhere temporarily, and send a message to the agent goroutine to do the replacement, which could be cleaner (better separation of concerns). The package manager would still need to have some handle to signal that, but it wouldn't necessarily be tied to the collector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we be able to accomplish this with synchronous, one-way communication? I was thinking maybe we could do something vaguely like this in the Supervisor struct:
oldVersion := 0
newVersion := 1
s.commander.stopAgent()
s.packageManager.SwapInVersion(newVersion)
if err := s.commander.startAgent(); err != nil {
s.packageManager.SwapInVersion(oldVersion)
s.commander.startAgent()
s.reportFailedInstallation()
return
}
s.packageManager.DeleteVersion(oldVersion)
s.reportSuccessfulInstallation()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we could do something like that. I think this also sets us up nicely for restoring the previous version if the new one fails. I'll look into making those changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, so I'm looking a bit deeper and the reason I'm a little iffy on this is because it seems like there will be some odd consequences with doing this outside of the package syncer.
Essentially, the package syncer will run it's logic, then it will report the status off every package at the end. So it's possible that the package downloads and verifies fine, so the package syncer reports the package as installed - but then there is an error when swapping in the version, which causes an error.
I would prefer if we didn't report the package is installed before it's actually successfully installed, basically. Which means we'd have to do it in the package manager during the UpdateContent
call, as far as I can tell. This would require 2 way communication, one way from (packagemanager) -> (agent goroutine) to do the swap, then one from (agent goroutine) -> (package manager) to indicate success/failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If PackageSyncer's design and API makes this difficult we should feel free to redesign it. Please don't feel constrained by it. This was the most basic design I could come up with when implementing the example and there may very well be a better design. Feel free to suggest one.
if err != nil { | ||
s.logger.Error("Failed to sync PackagesAvailable message", zap.Error(err)) | ||
} | ||
// TODO: Should we wait for the sync to be done somehow? Should it be in a separate goroutine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I'm seeing, the Done
channel only communicates that the PackageSyncer is done communicating package statuses to the server, and doesn't track whether the packages have been downloaded: https://github.com/open-telemetry/opamp-go/blob/main/client/internal/packagessyncer.go#L51. It may be worth waiting for this before communicating any additional updates about package statuses to the server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually really confused reading that code. It looks like once Sync is done, that channel is closed, so waiting on Done here wouldn't actually add any extra synchronization.
Good call at looking at that impl, I definitely had a different idea of its purpose in my head.
@tigrannajaryan I'd appreciate you taking a look at this, even if just to review conformance to the spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started reviewing but got confused by the many moving parts of signatures and verifications.
I think we need a design doc or some other form of documentation that explains all the involved parties (cosign, Collector releases, OpAMP Server, Supervisor) and how they dance together to make sure the downloaded binary is what it pretends to be and is safe to use.
Some sort of a sequence diagram that starts with the Collector build on opentelemetry-collector-releases and ends with Supervisor launching the new executable and shows all the steps in between would be great. If we number the steps we could even refer to them in the code.
} | ||
|
||
// TODO: Certificate paths? The certificate can be specified via SIGSTORE_ROOT_FILE for now | ||
type AgentSignature struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document AgentSignature and AgentSignatureIdentity.
return splitSignature[0], splitSignature[1], nil | ||
} | ||
|
||
// sig is the decoded signature of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete comment?
} | ||
|
||
func parsePackageSignature(signature []byte) (b64Cert, b64Signature []byte, err error) { | ||
splitSignature := bytes.SplitN(signature, []byte(" "), 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this signature format documented somewhere? This imposes a requirement on the OpAMP server to generate signatures in a specific way, right?
Yeah, good call out. I will work on this and get something in our spec that explains everything here. |
by the OpAMP Server, if the AcceptsPackages capability is enabled. | ||
|
||
### Overview | ||
![Supervisor architecture diagram](agent-upgrades-e2e.png) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent diagram and workflow description! Thank you. This is very heplful.
the URL that the OpAMP server gave it for downloading. After | ||
downloading, it verifies the content hash matches the server | ||
provided content hash from the PackagesAvailable message. | ||
9. The supervisor verifies the certificate via the root certificates downloaded in step 7. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify what is the difference between "verifies the certificate via the root certificates" and checks against the Rekor transparency log to verify the certificate hasn't been tampered with". I assume certificate is signed by a root certificate (the CA), so if it was tampered it would not pass the first verification step. How is possible that it passes the first check but fails the second? Are we thinking about a scenario that the root certificate is tampered with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually a little iffy on this myself. I think the transparency log is related to the time of signing in some way, I'll dig a little more to see if I can understand better.
* This field may be explicitly set empty to skip verifying the repository field | ||
in the certificate | ||
* A set of [Fulcio] root certificates. | ||
* These certificates are retrieved from "https://tuf-repo-cdn.sigstore.dev" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be interesting to explore the possibility of offline certificate verification. We could embed the root certificates in the Supervisor at build time, but I don't know if the verification process itself can be done without connecting to Rekor.
This can be a future work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is an option to skip the Rekor/transparency log verification. I couldn't get it to work as I was developing, but I figure it's probably something simple I was missing. We should definitely be able to get it working in some follow-up.
c535e23
to
f7d67cc
Compare
@andykellr or @dpaasman00 do you plan to take over this PR? |
@tigrannajaryan Yes, I will be taking this PR over! |
f7d67cc
to
7e25bbb
Compare
7e25bbb
to
3750f52
Compare
@evan-bradley @tigrannajaryan Are you able to take another look when you have a chance? I believe feedback you left for Brandon was addressed, so let me know if there's anything that was missed! |
type AgentSignature struct { | ||
// TODO: The Fulcio root certificate can be specified via SIGSTORE_ROOT_FILE for now | ||
// But we should add it as a config option. | ||
// https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/3593 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong link?
// You can read more about Cosign and signing here. | ||
// https://docs.sigstore.dev/cosign/signing/overview/ | ||
type AgentSignature struct { | ||
// TODO: The Fulcio root certificate can be specified via SIGSTORE_ROOT_FILE for now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to embed (//go:embed) the root certificate into the executable? How often is the root certificate updated and how long is it valid for? Embedding could simplify things for the user (one less thing to obtain and supply via config) if it is valid for long enough.
// github_workflow_repository defines the expected repository field | ||
// on the sigstore certificate. | ||
CertGithubWorkflowRepository string `mapstructure:"github_workflow_repository"` | ||
|
||
// Identities is a list of valid identities to use when verifying the agent. | ||
// Only one needs to match the identity on the certificate for signature | ||
// verification to pass. | ||
Identities []AgentSignatureIdentity `mapstructure:"identities"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason that these need to be configurable? Is it so that non-official Collector builds can be supported?
// https://github.com/open-telemetry/opamp-spec/blob/main/specification.md#packages | ||
agentPackageKey = "" | ||
|
||
lastPackageStatusFileName = "last-reported-package-statuses.proto" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: .proto extension is normally used for Protobuf source code (IDL) files. .binpb is recommended extension for wire format: https://protobuf.dev/programming-guides/techniques/#suffixes
topLevelVersion string | ||
|
||
storageDir string | ||
agentPath string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the executable file path or a directory path where executable is contained? Perhaps agentExePath
or agentFilePath
to make it clear?
} | ||
|
||
// Create reader for new agent | ||
gzipReader, err := gzip.NewReader(bytes.NewBuffer(by)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know the content is gzip? Is this specified somewhere?
return fmt.Errorf("read tarball for collector: %w", err) | ||
} | ||
|
||
for h.Name != "otelcol-contrib" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we know the part name is "otelcol-contrib"? It would nice to document this and the fact that it is gziped tarball.
defer agentFile.Close() | ||
|
||
// Copy to backup | ||
_, err = io.Copy(backupFile, agentFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A possible alternate instead of copying the files is to rename them. Delete backup, rename original to backup, write to the original name. Skips the copying and is likely faster.
} | ||
defer restoreFile.Close() | ||
|
||
if _, err := io.Copy(restoreFile, backupFile); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to keep the backup file or we can rename instead of copying? Renaming likely is less susceptible to out-of-disk-space issues.
|
||
return &packageManager{ | ||
persistentState: persistentState, | ||
topLevelHash: agentHash, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bit confusing that we have both AllPackagesHash
and topLevelHash
. Are they the same or different? If different it would be nice to explain in a comment somewhere.
I am most concerned about the hash/signature verification parts. We want to make sure we don't have any vulnerabilities here and don't introduce any in the future. I think it is important that we have good test coverage for the verification code. Would you be able to add some tests? I can't see coverage results for this PR in codecov. Ideally we want to have full branch coverage for UpdateContent and funcs it calls. |
@dpaasman00 ping me when you want me to do another round. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Planning to get back and address review by the end of the next week, December 13. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Description:
Implements ReportPackageStatuses and taking PackagesAvailable for upgrading the agent.
The agent will only accept a top-level package with an empty name. This agent must be signed using cosign's keyless signing method (this is how the opentelemetry-collector-releases repository signs its releases).
The signature field must be populated with the resultant base64 encoded cert and signature, both being space separated (signature = b64_cert + " " + b64_signature).
This first implementation only allows online verification; In order to verify the certificate, it must reach out to the public rekor transparency log instance. It also fetches certificates from the internet. Some of these things are configurable through environment variables, but I figure we can parse out how offline signature verification works in a follow-up PR. This basic setup should allow for signature verification for agents that have access to the internet.
This PR also does not revert the collector if it is unhealthy. This will need to be done in a follow up PR. I think we should do it after #34907 is merged, as I imagine the logic will overlap here.
Link to tracking Issue: Closes #34734, Partially solves #33947
Testing: