Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding archive entry paths #3638

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
Open

adding archive entry paths #3638

wants to merge 13 commits into from

Conversation

joeleonjr
Copy link
Contributor

@joeleonjr joeleonjr commented Nov 20, 2024

Description:

Related to and inspired by #1551. This PR adds a new ExtraData field named Archive Entry Path that contains the relative path of the file containing a secret within an archive file.

Note: When decompressing files that only contain one file (ex: .gz, .xz, .lz4, etc), it won't list the decompressed file's name due to a limitation in the archiver library we use. However, since it only results in one decompressed file, it should be self-explanatory to the user.

For example:

/archive.tar
-- secret.zip.gz
--/--/secret.zip
--/--/--/secret.txt

If there is a file in secret.txt, we'll see:

Archive Entry Path: secret.zip.gz/secret.txt
File: /path/to/archive.tar

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint this requires golangci-lint)?

Copy link
Collaborator

@mcastorina mcastorina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty clean, nice job!

pkg/handlers/archive.go Outdated Show resolved Hide resolved
pkg/sources/sources.go Show resolved Hide resolved
pkg/engine/engine.go Show resolved Hide resolved
pkg/handlers/archive.go Outdated Show resolved Hide resolved
pkg/handlers/archive.go Outdated Show resolved Hide resolved
pkg/handlers/archive.go Outdated Show resolved Hide resolved
@joeleonjr joeleonjr marked this pull request as ready for review November 21, 2024 18:39
@joeleonjr joeleonjr requested review from a team as code owners November 21, 2024 18:39
@@ -101,37 +102,37 @@ func (h *archiveHandler) HandleFile(ctx logContext.Context, input fileReader) ch
var ErrMaxDepthReached = errors.New("max archive depth reached")

// openArchive recursively extracts content from an archive up to a maximum depth, handling nested archives if necessary.
// It takes a reader from which it attempts to identify and process the archive format. Depending on the archive type,
// it either decompresses or extracts the contents directly, sending data to the provided channel.
// It takes a string representing the path to the archive and a reader from which it attempts to identify and process the archive format.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment reads "a string" but the actual function takes a string slice. Which is correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix that. That comment was from a previous version.

reader fileReader,
dataOrErrChan chan DataOrErr,
) error {
ctx.Logger().V(4).Info("Starting archive processing", "depth", depth)
defer ctx.Logger().V(4).Info("Finished archive processing", "depth", depth)
ctx.Logger().V(4).Info("Starting archive processing", "depth", len(archiveEntryPaths))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo just attach the paths here now that you have them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean change depth to paths in the logs?

return h.openArchive(ctx, depth+1, rdr, dataOrErrChan)
// Note: We're limited in our ability to add file names to the archiveEntryPath here, as the decompressor doesn't have access to a fileName value.
// We add a empty string so we can keep track of the archive depth.
return h.openArchive(ctx, append(archiveEntryPaths, ""), rdr, dataOrErrChan)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we ever end up stringifying the slice of path parts, an empty string isn't going to maximally visually obvious as another level of depth, compared to like a ? or something. (Compare: some/path///file.txt to e.g. some/path/?/?/file.txt) What do you think of using a non-empty marker like ? instead?

Copy link
Contributor Author

@joeleonjr joeleonjr Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 36bdef0 fixes an oversight from an earlier version and is relevant to this discussion.

The main change is using file.NameInArchive instead of file.Name(). The difference is file.NameInArchive contains the relative path inside the archive that is being extracted. This does a few things:

  1. Previously, I mistakenly treated all files during an extraction operation as if they were in a flat directory and didn't preserve the actual directory structure. I didn't realize the file.Name() operation just grabbed the filename, not the path. That's fixed and we now have complete relative file paths in all supported archives.
  2. This change makes it abundantly clear how a user would navigate an archive to get to the actual file. For example: if you have an archive file containing another archive named archive.tar.gz with a file named secret.txt, it would return this in the archive entry path archive.tar.gz/archive/secret.txt. When you manually double-click to unarchive archive.tar.gz it tosses all of the contents into a new folder named archive and then you'd see secret.txt. It's super clean and clear. This method still uses the "" to track depth during decompression, but in the filepath.Join() operation, empty strings are ignored, which is exactly what we want to reconstruct the relative file path. If we change to add ? or anything else, we'd need to write a custom function to strip those out during the filepath.Join(). Since we're able to keep accurate track of depth and relative file path, I'd suggest just leaving as is.
  3. One other note: I only updated the error logs to include file.NameInArchive b/c I didn't want to bloat our non-error logs with the entire file path. This means the log at the very beginning of the extractorHandler function only has file.Name(). Not sure if that was the correct call or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants