adding archive entry paths #3638

joeleonjr · 2024-11-20T20:56:52Z

Description:

Related to and inspired by #1551. This PR adds a new ExtraData field named Archive Entry Path that contains the relative path of the file containing a secret within an archive file.

Note: When decompressing single-file formats (e.g., .gz, .xz, .lz4), the decompressed file's name is not listed, due to a limitation in the archiver library we use. Since decompression yields exactly one file, the path should still be self-explanatory to the user.

For example:

/archive.tar
-- secret.zip.gz
--/--/secret.zip
--/--/--/secret.txt

If there is a secret in secret.txt, we'll see:

Archive Entry Path: secret.zip.gz/secret.txt
File: /path/to/archive.tar

Checklist:

Tests passing (make test-community)?
Lint passing (make lint; this requires golangci-lint)?

mcastorina

This looks pretty clean, nice job!

pkg/handlers/archive.go

pkg/sources/sources.go

pkg/engine/engine.go

pkg/handlers/archive.go

rosecodym · 2024-11-22T15:32:36Z

pkg/handlers/archive.go

@@ -101,37 +102,37 @@ func (h *archiveHandler) HandleFile(ctx logContext.Context, input fileReader) ch
 var ErrMaxDepthReached = errors.New("max archive depth reached")

 // openArchive recursively extracts content from an archive up to a maximum depth, handling nested archives if necessary.
-// It takes a reader from which it attempts to identify and process the archive format. Depending on the archive type,
-// it either decompresses or extracts the contents directly, sending data to the provided channel.
+// It takes a string representing the path to the archive and a reader from which it attempts to identify and process the archive format.


This comment reads "a string" but the actual function takes a string slice. Which is correct?

joeleonjr · 2024-11-25

I'll fix that. The comment is left over from a previous version.

rosecodym · 2024-11-22T15:33:07Z

pkg/handlers/archive.go

 	reader fileReader,
 	dataOrErrChan chan DataOrErr,
 ) error {
-	ctx.Logger().V(4).Info("Starting archive processing", "depth", depth)
-	defer ctx.Logger().V(4).Info("Finished archive processing", "depth", depth)
+	ctx.Logger().V(4).Info("Starting archive processing", "depth", len(archiveEntryPaths))


imo just attach the paths here now that you have them

joeleonjr · 2024-11-25

Do you mean change depth to paths in the logs?

rosecodym · 2024-11-22T15:35:29Z

pkg/handlers/archive.go

-		return h.openArchive(ctx, depth+1, rdr, dataOrErrChan)
+		// Note: We're limited in our ability to add file names to the archiveEntryPath here, as the decompressor doesn't have access to a fileName value.
+		// We add a empty string so we can keep track of the archive depth.
+		return h.openArchive(ctx, append(archiveEntryPaths, ""), rdr, dataOrErrChan)


If we ever end up stringifying the slice of path parts, an empty string isn't going to be maximally visually obvious as another level of depth, compared to something like a ?. (Compare: some/path///file.txt to e.g. some/path/?/?/file.txt) What do you think of using a non-empty marker like ? instead?
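To make the comparison concrete, here is a minimal sketch of what naive stringification would produce with each marker. The helper renderParts is hypothetical and not part of the PR; it only assumes the path parts are held in a string slice, as in the diff above.

```go
package main

import (
	"fmt"
	"strings"
)

// renderParts is a hypothetical helper (not from the PR). It joins the
// path parts with "/", substituting marker for every empty part.
func renderParts(parts []string, marker string) string {
	out := make([]string, len(parts))
	for i, p := range parts {
		if p == "" {
			p = marker
		}
		out[i] = p
	}
	return strings.Join(out, "/")
}

func main() {
	// Two empty parts stand in for two decompression levels.
	parts := []string{"some", "path", "", "", "file.txt"}

	fmt.Println(renderParts(parts, ""))  // some/path///file.txt
	fmt.Println(renderParts(parts, "?")) // some/path/?/?/file.txt
}
```

With an empty marker the extra levels show up only as repeated slashes, which is the readability concern raised here.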

joeleonjr · 2024-11-25

Commit 36bdef0 fixes an oversight from an earlier version and is relevant to this discussion.

The main change is using file.NameInArchive instead of file.Name(). The difference is that file.NameInArchive contains the file's relative path inside the archive being extracted, whereas file.Name() returns only the filename. This does a few things:

Previously, I mistakenly treated all files in an extraction operation as if they were in a flat directory, so the actual directory structure wasn't preserved. I didn't realize that file.Name() just grabs the filename, not the path. That's fixed, and we now have complete relative file paths in all supported archives.

This change makes it clear how a user would navigate an archive to reach the actual file. For example, if an archive contains another archive named archive.tar.gz, which in turn holds a file named secret.txt, the archive entry path is archive.tar.gz/archive/secret.txt. When you manually double-click archive.tar.gz to unarchive it, the contents land in a new folder named archive, and that is where you'd see secret.txt. It's clean and clear. This method still uses "" to track depth during decompression, but filepath.Join() ignores empty strings, which is exactly what we want when reconstructing the relative file path. If we switched to ? or any other marker, we'd need a custom function to strip those out before the filepath.Join(). Since we can already track both depth and relative file path accurately, I'd suggest leaving it as is.

One other note: I only updated the error logs to include file.NameInArchive b/c I didn't want to bloat our non-error logs with the entire file path. This means the log at the very beginning of the extractorHandler function only has file.Name(). Not sure if that was the correct call or not.


initial pass at adding archive entry paths

8c4c83b

mcastorina reviewed Nov 20, 2024


joeleonjr commented Nov 20, 2024


joeleonjr commented Nov 20, 2024


joeleonjr and others added 7 commits November 20, 2024 17:08

updated to combine archive paths + depth

fc528d2

Merge branch 'main' into archive-entry-paths

6cfae75

Merge branch 'main' into archive-entry-paths

05fcfe6

updated decompressor file naming

6c1feea

Merge branch 'main' into archive-entry-paths

6ec0688

updated test cases

704937a

Merge branch 'main' into archive-entry-paths

48d9323

joeleonjr marked this pull request as ready for review November 21, 2024 18:39

joeleonjr requested review from a team as code owners November 21, 2024 18:39

Merge branch 'main' into archive-entry-paths

8ab76fa

rosecodym reviewed Nov 22, 2024


joeleonjr and others added 4 commits November 25, 2024 14:25

Merge branch 'main' into archive-entry-paths

4283b1d

Merge branch 'main' into archive-entry-paths

ceefd1e

swapped file.Name() for file.NameInArchive

36bdef0

Merge branch 'archive-entry-paths' of https://github.com/trufflesecur…

a37a6dd

…ity/trufflehog into archive-entry-paths

