Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cmd/opampsupervisor]: Implement PackagesAvailable for upgrading agent #35503

Open
wants to merge 53 commits into
base: main
Choose a base branch
from

Conversation

BinaryFissionGames
Copy link
Contributor

Description:
Implements ReportPackageStatuses and taking PackagesAvailable for upgrading the agent.

The agent will only accept a top-level package with an empty name. This agent must be signed using cosign's keyless signing method (this is how the opentelemetry-collector-releases repository signs its releases).

The signature field must be populated with the resultant base64 encoded cert and signature, both being space separated (signature = b64_cert + " " + b64_signature).

This first implementation only allows online verification; In order to verify the certificate, it must reach out to the public rekor transparency log instance. It also fetches certificates from the internet. Some of these things are configurable through environment variables, but I figure we can parse out how offline signature verification works in a follow-up PR. This basic setup should allow for signature verification for agents that have access to the internet.

This PR also does not revert the collector if it is unhealthy. This will need to be done in a follow up PR. I think we should do it after #34907 is merged, as I imagine the logic will overlap here.

Link to tracking Issue: Closes #34734, Partially solves #33947

Testing:

  • Unit testing, e2e testing

@BinaryFissionGames BinaryFissionGames marked this pull request as ready for review October 1, 2024 08:07
@BinaryFissionGames BinaryFissionGames requested a review from a team as a code owner October 1, 2024 08:07
Copy link
Contributor

@evan-bradley evan-bradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on, its no small task. I'm still looking through this, but wanted to give early feedback on a few points.

require.Equal(t, &protobufs.PackageStatuses{
Packages: map[string]*protobufs.PackageStatus{
"": {
// TODO: Should initital version be filled in?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be the version/hash of the Collector binary, right? I think the version can be the one obtained during bootstrapping.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that makes sense. I think the mental hurdle I'm having here is the sort of disconnect between the artifact the agent needs to end up with and what is available from the releases.

Because the release artifact is a tarball, but the agent executable is not, it means that the artifact offered by the server and the one the agent ends up with have completely different hashes. This initial status is kind of a weird edge case.

return nil
}

// TODO: Certificate paths? The certificate can be specified via SIGSTORE_ROOT_FILE for now
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we link to an issue here? It would make it easier to track and provides some context.

agentHash := h.Sum(nil)

// Load persisted package state, if it exists
// TODO: use packagesStatusPath method somehow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is okay given that it's the only other usage of this and is a very simple call to filepath.Join.

The other options I can think of aren't worth the additional complexity:

  1. Extract the packagesStatusPath functionality to a function and call the function in the method as well. Too much abstraction for a simple call.
  2. Instantiate packageManager above and call pathagesStatusPath here, then fill in other details on the struct afterward. Splits the initialization too much for fairly minimal gains.

return err
}

type packageState struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to instead use the persistent state file to store this? I see how the purpose of each file is distinct, but having two small-ish YAML files also feels a little odd.

This may also be premature, but I think abstracting state storage through the persistent state struct may keep things cleaner in the long run.

@@ -275,20 +297,20 @@ func (s *Supervisor) loadConfig(configFile string) error {
// an OpAMP extension, obtains the agent description, then
// shuts down the Collector. This only needs to happen
// once per Collector binary.
func (s *Supervisor) getBootstrapInfo() (err error) {
func (s *Supervisor) getBootstrapInfo() (agentVersion string, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is another place where consolidating our storage to the persistent state may make things cleaner. We have two different data flows for the instance ID vs. the agent version in this function, when I think we should probably store the two next two each other so all of the agent's information is recorded in the same place. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify on what's being proposed here? I don't think we should store the agent version anywhere, we should always treat the reported version from the agent as the source of truth (as best we can, anyways).

}

// overwrite agent process
startAgent, err := p.am.stopAgentProcess(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would substantially complicate things if we limited agent process management to the Supervisor struct and only deal with managing the binary file here? My first impression is that we should keep those responsibilities separated if possible and not have the package manager know anything about the Collector process, even if the current implementation isn't too complicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was that we need to stop the agent process to write the binary in the first place.

We could write the binary somewhere temporarily, and send a message to the agent goroutine to do the replacement, which could be cleaner (better separation of concerns). The package manager would still need to have some handle to signal that, but it wouldn't necessarily be tied to the collector.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we be able to accomplish this with synchronous, one-way communication? I was thinking maybe we could do something vaguely like this in the Supervisor struct:

oldVersion := 0
newVersion := 1

s.commander.stopAgent()
s.packageManager.SwapInVersion(newVersion)

if err := s.commander.startAgent(); err != nil {
	s.packageManager.SwapInVersion(oldVersion)
	s.commander.startAgent()
	s.reportFailedInstallation()
	return
}

s.packageManager.DeleteVersion(oldVersion)
s.reportSuccessfulInstallation()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we could do something like that. I think this also sets us up nicely for restoring the previous version if the new one fails. I'll look into making those changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, so I'm looking a bit deeper and the reason I'm a little iffy on this is because it seems like there will be some odd consequences with doing this outside of the package syncer.

Essentially, the package syncer will run it's logic, then it will report the status off every package at the end. So it's possible that the package downloads and verifies fine, so the package syncer reports the package as installed - but then there is an error when swapping in the version, which causes an error.

I would prefer if we didn't report the package is installed before it's actually successfully installed, basically. Which means we'd have to do it in the package manager during the UpdateContent call, as far as I can tell. This would require 2 way communication, one way from (packagemanager) -> (agent goroutine) to do the swap, then one from (agent goroutine) -> (package manager) to indicate success/failure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If PackageSyncer's design and API makes this difficult we should feel free to redesign it. Please don't feel constrained by it. This was the most basic design I could come up with when implementing the example and there may very well be a better design. Feel free to suggest one.

if err != nil {
s.logger.Error("Failed to sync PackagesAvailable message", zap.Error(err))
}
// TODO: Should we wait for the sync to be done somehow? Should it be in a separate goroutine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I'm seeing, the Done channel only communicates that the PackageSyncer is done communicating package statuses to the server, and doesn't track whether the packages have been downloaded: https://github.com/open-telemetry/opamp-go/blob/main/client/internal/packagessyncer.go#L51. It may be worth waiting for this before communicating any additional updates about package statuses to the server.

Copy link
Contributor Author

@BinaryFissionGames BinaryFissionGames Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually really confused reading that code. It looks like once Sync is done, that channel is closed, so waiting on Done here wouldn't actually add any extra synchronization.

Good call at looking at that impl, I definitely had a different idea of its purpose in my head.

@evan-bradley
Copy link
Contributor

@tigrannajaryan I'd appreciate you taking a look at this, even if just to review conformance to the spec.

Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started reviewing but got confused by the many moving parts of signatures and verifications.

I think we need a design doc or some other form of documentation that explains all the involved parties (cosign, Collector releases, OpAMP Server, Supervisor) and how they dance together to make sure the downloaded binary is what it pretends to be and is safe to use.

Some sort of a sequence diagram that starts with the Collector build on opentelemetry-collector-releases and ends with Supervisor launching the new executable and shows all the steps in between would be great. If we number the steps we could even refer to them in the code.

}

// TODO: Certificate paths? The certificate can be specified via SIGSTORE_ROOT_FILE for now
type AgentSignature struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document AgentSignature and AgentSignatureIdentity.

return splitSignature[0], splitSignature[1], nil
}

// sig is the decoded signature of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete comment?

}

func parsePackageSignature(signature []byte) (b64Cert, b64Signature []byte, err error) {
splitSignature := bytes.SplitN(signature, []byte(" "), 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this signature format documented somewhere? This imposes a requirement on the OpAMP server to generate signatures in a specific way, right?

@BinaryFissionGames
Copy link
Contributor Author

I think we need a design doc or some other form of documentation that explains all the involved parties (cosign, Collector releases, OpAMP Server, Supervisor) and how they dance together to make sure the downloaded binary is what it pretends to be and is safe to use.

Yeah, good call out. I will work on this and get something in our spec that explains everything here.

by the OpAMP Server, if the AcceptsPackages capability is enabled.

### Overview
![Supervisor architecture diagram](agent-upgrades-e2e.png)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent diagram and workflow description! Thank you. This is very heplful.

the URL that the OpAMP server gave it for downloading. After
downloading, it verifies the content hash matches the server
provided content hash from the PackagesAvailable message.
9. The supervisor verifies the certificate via the root certificates downloaded in step 7.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify what is the difference between "verifies the certificate via the root certificates" and checks against the Rekor transparency log to verify the certificate hasn't been tampered with". I assume certificate is signed by a root certificate (the CA), so if it was tampered it would not pass the first verification step. How is possible that it passes the first check but fails the second? Are we thinking about a scenario that the root certificate is tampered with?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually a little iffy on this myself. I think the transparency log is related to the time of signing in some way, I'll dig a little more to see if I can understand better.

cmd/opampsupervisor/specification/README.md Outdated Show resolved Hide resolved
* This field may be explicitly set empty to skip verifying the repository field
in the certificate
* A set of [Fulcio] root certificates.
* These certificates are retrieved from "https://tuf-repo-cdn.sigstore.dev"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be interesting to explore the possibility of offline certificate verification. We could embed the root certificates in the Supervisor at build time, but I don't know if the verification process itself can be done without connecting to Rekor.
This can be a future work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is an option to skip the Rekor/transparency log verification. I couldn't get it to work as I was developing, but I figure it's probably something simple I was missing. We should definitely be able to get it working in some follow-up.

@dpaasman00 dpaasman00 force-pushed the feat/supervisor-update-collector branch 2 times, most recently from c535e23 to f7d67cc Compare October 31, 2024 15:20
@tigrannajaryan
Copy link
Member

@andykellr or @dpaasman00 do you plan to take over this PR?

@dpaasman00
Copy link
Contributor

@tigrannajaryan Yes, I will be taking this PR over!

@dpaasman00 dpaasman00 force-pushed the feat/supervisor-update-collector branch from f7d67cc to 7e25bbb Compare November 8, 2024 15:05
@dpaasman00 dpaasman00 force-pushed the feat/supervisor-update-collector branch from 7e25bbb to 3750f52 Compare November 8, 2024 15:43
@dpaasman00
Copy link
Contributor

@evan-bradley @tigrannajaryan Are you able to take another look when you have a chance? I believe feedback you left for Brandon was addressed, so let me know if there's anything that was missed!

type AgentSignature struct {
// TODO: The Fulcio root certificate can be specified via SIGSTORE_ROOT_FILE for now
// But we should add it as a config option.
// https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/3593
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong link?

// You can read more about Cosign and signing here.
// https://docs.sigstore.dev/cosign/signing/overview/
type AgentSignature struct {
// TODO: The Fulcio root certificate can be specified via SIGSTORE_ROOT_FILE for now
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to embed (//go:embed) the root certificate into the executable? How often is the root certificate updated and how long is it valid for? Embedding could simplify things for the user (one less thing to obtain and supply via config) if it is valid for long enough.

Comment on lines +224 to +231
// github_workflow_repository defines the expected repository field
// on the sigstore certificate.
CertGithubWorkflowRepository string `mapstructure:"github_workflow_repository"`

// Identities is a list of valid identities to use when verifying the agent.
// Only one needs to match the identity on the certificate for signature
// verification to pass.
Identities []AgentSignatureIdentity `mapstructure:"identities"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason that these need to be configurable? Is it so that non-official Collector builds can be supported?

// https://github.com/open-telemetry/opamp-spec/blob/main/specification.md#packages
agentPackageKey = ""

lastPackageStatusFileName = "last-reported-package-statuses.proto"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: .proto extension is normally used for Protobuf source code (IDL) files. .binpb is recommended extension for wire format: https://protobuf.dev/programming-guides/techniques/#suffixes

topLevelVersion string

storageDir string
agentPath string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the executable file path or a directory path where executable is contained? Perhaps agentExePath or agentFilePath to make it clear?

}

// Create reader for new agent
gzipReader, err := gzip.NewReader(bytes.NewBuffer(by))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know the content is gzip? Is this specified somewhere?

return fmt.Errorf("read tarball for collector: %w", err)
}

for h.Name != "otelcol-contrib" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know the part name is "otelcol-contrib"? It would nice to document this and the fact that it is gziped tarball.

defer agentFile.Close()

// Copy to backup
_, err = io.Copy(backupFile, agentFile)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A possible alternate instead of copying the files is to rename them. Delete backup, rename original to backup, write to the original name. Skips the copying and is likely faster.

}
defer restoreFile.Close()

if _, err := io.Copy(restoreFile, backupFile); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to keep the backup file or we can rename instead of copying? Renaming likely is less susceptible to out-of-disk-space issues.


return &packageManager{
persistentState: persistentState,
topLevelHash: agentHash,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit confusing that we have both AllPackagesHash and topLevelHash. Are they the same or different? If different it would be nice to explain in a comment somewhere.

@tigrannajaryan
Copy link
Member

I am most concerned about the hash/signature verification parts. We want to make sure we don't have any vulnerabilities here and don't introduce any in the future.

I think it is important that we have good test coverage for the verification code. Would you be able to add some tests? I can't see coverage results for this PR in codecov. Ideally we want to have full branch coverage for UpdateContent and funcs it calls.

@tigrannajaryan
Copy link
Member

@dpaasman00 ping me when you want me to do another round.

Copy link
Contributor

github-actions bot commented Dec 3, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@dpaasman00
Copy link
Contributor

Planning to get back and address review by the end of the next week, December 13.

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 18, 2024
@mx-psi mx-psi assigned evan-bradley and unassigned mx-psi Dec 18, 2024
@github-actions github-actions bot removed the Stale label Dec 19, 2024
Copy link
Contributor

github-actions bot commented Jan 4, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement package manager in supervisor
5 participants