Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add binary vs. textual file enumeration. #2739

Merged
merged 13 commits into from
Dec 1, 2023

Conversation

michaelcfanning
Copy link
Member

@michaelcfanning michaelcfanning commented Nov 30, 2023

  • BRK: EnumeratedArtifact now sniffs artifacts to distinguish between textual and binary data. The Contents property will be null for binary files (use Bytes instead).
  • BRK: EnumeratedArtifact raises InvalidOperationException for non-seekable streams (unless the SupportNonSeekableStreams property is set). Non-seekable streams result in performance inefficiencies when expanding.
  • BRK: MultithreadedZipArchiveArtifactProvider now distinguishes binary vs. textual data using a hard-coded binary files extensions list. This data will be made configurable in a future change.
  • NEW: EnumeratedArtifact now automatically detects and populates a Bytes property for binary files such as executables and certificates.

Previously, artifact enumeration assumed all data consisted of text. This breaks attempts to load/examine binary data.

In EnumeratedArtifact, we now examine the beginning of all seekable streams to detect whether we are seeing textual or binary data. We do so using an OSS snippet that looks for null/control characters in that data. This code also attempts to retrieve a range of text file encodings but this capability is not verified. Binary files are retrieved as byte arrays. Text files go to the existing Contents property.

If we have a non-seekable stream, what we have to do is retrieve the file as a byte array and then subsequently convert it to text if we determine that it is textual. This is a performance degradation from our previous implementation and so we require users to explicitly set a SupportNonSeekableStreams property to ensure they know what they are opting into.

Separately, the MultithreadedZipArchiveArtifactProvider does enumerate non-seekable streams, since they comprise compressed data. This provider uses a specialized ZipArchiveArtifact kind. Rather than accepting the costs to double-transform a text stream to bytes and then a decoded string, we simply hard-code a set of binary file extensions and only retrieve these as bytes. This property is settable. This isn't a really ideal solution but works for our immediate scenarios. You could imagine updating this class to optionally allow for the non-performant behavior (expand as bytes, sniff to classify as binary or text, convert text to strings). This would allow for higher fidelity classification without requiring advance knowledge of file extensions.

Code coverage information:

94.38% : EnumeratedArtifact
97.22% : MultithreadedZipArtifactProvider
93.59% : ZipArchiveArtifact

@@ -25,108 +24,4 @@ public ArtifactProvider(IEnumerable<IEnumeratedArtifact> artifacts)

public IFileSystem FileSystem { get; set; }
}

public class SinglethreadedZipArchiveArtifactProvider : ArtifactProvider
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SinglethreadedZipArchiveArtifactProvider

Notoriously, this single file has included multiple class definitions, I finally took a few minutes and extracted all of them.

public string Contents
{
get => GetContents();
get => GetArtifactData().text;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetArtifactData

This is the core change. On any attempt to access a property that returns artifact contents, we attempt to fault in the data. This may entail reading the contents from disk or a stream. We now sniff the data to determine whether it is binary or textual and populate Contents or Bytes accordingly.

/// <param name="rawData">The raw data.</param>
/// <param name="start">The start.</param>
/// <returns></returns>
public static bool CheckForByteOrderMark(byte[] rawData, out Encoding encoding, int start = 0)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckForByteOrderMark

We do not have explicit testing/use of BOM detection. I have a stubbed test but haven't nailed down how to generate useful test cases. I'm leaving this (and a stub test) for the future.

In practice, detecting/handling esoteric encodings is not a prominent scenario for most analysis scenarios.

@@ -517,6 +518,7 @@ protected override string ConstructTestOutputFromInputResource(string inputResou
mockFileSystem.Setup(x => x.DirectoryGetDirectories(It.IsAny<string>())).Returns(Array.Empty<string>());
mockFileSystem.Setup(x => x.DirectoryEnumerateFiles(inputLogDirectory, inputLogFileName, SearchOption.TopDirectoryOnly)).Returns(new string[] { inputLogFilePath });
mockFileSystem.Setup(x => x.FileReadAllText(inputLogFilePath)).Returns(v2LogText);
mockFileSystem.Setup(x => x.FileOpenRead(inputLogFilePath)).Returns(new MemoryStream(Encoding.UTF8.GetBytes(v2LogText)));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileOpenRead

An interesting detail is that the new artifact read logic now never calls File.ReadAllText (which won't work for binary data) but FileOpenRead exclusively. We still mock ReadAllText to allow upstream tests that aren't going through new code paths.

@michaelcfanning michaelcfanning marked this pull request as ready for review November 30, 2023 17:12

GetArtifactData();
this.isBinary = this.contents == null;
return this.isBinary.Value;
Copy link
Collaborator

@LingZhou-gh LingZhou-gh Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combining the last two statements, will the return value be null? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! and just delete the backing field.

}
}

internal bool IsArtifactBinary()
Copy link
Contributor

@scottoneil-ms scottoneil-ms Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest: make private to narrow API surface to IsBinary and focus internal use/experience on the more public version? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'internal' is still private, i tend to do this to make things testable but not public. As it happens I can now delete this helper entirely.

It now occurs to me: this new property value is untested!! Need to add it to the test validations. So I'm glad you pointed to it. :)

throw new InvalidOperationException("Stream is not seekable. Provide a seekable stream or set the 'SupportNonSeekableStreams' property.");
}

this.bytes = new byte[Stream.Length];
Copy link
Contributor

@scottoneil-ms scottoneil-ms Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it the case that we can categorize a Seekable stream in 4096 bytes, but we need to read in the full stream for a Nonseekable one? If 4096 is sound, I suspect we'd both love to take a one-block peek in the Nonseekable case, and use that as the starting buffer for a full Stream or byte[] read depending on what is appropriate. So I probably just don't see a fatal flaw yet.

Where is 4096 coming from in the Seekable case? Looking at the implementation of FileEncoding, I'd optimistically credit that we probably don't get much more accuracy in the classifier after this many bytes if that's coming from some tribal knowledge or guidance stemming from other users of this code. But right now that's just a brittle prior without any experience on with so many of these file extensions, so I'm really curious what the author guidance or community discourse looks like. #Pending

Copy link
Contributor

@scottoneil-ms scottoneil-ms Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's maybe a way to put a finger on what I'm getting at:

The OSS code snippet we're taking asserts that if you see no more than 10% of the bytes as being non-textual, then that's a good-enough heuristic. I buy that. Not second-guessing it. But that's the kind of suspiciously-round number someone just yanked out of their brain one day and it just worked well enough in the real world that it stuck and became an institution.

Maybe 4096 is the same way and we don't need to do full file scans to classify in any scenario? This is what I'm wondering/hoping. (Not asserting.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so first, 4096 is way too much data and I dropped it to 1024. This is probably still more data than we need.

Let's talk offline further about this (in particular about whether we can address the inefficiency in the non-seekable case of reading the whole thing). I'm just not aware of how we can handle a string decoding from the original non-seekable string if we've already obtained 1024 bytes from it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. We can't. I was basically proposing a copy with extra steps. (And apologies if you're getting this twice. I tried the e-mail reply feature but don't see it in review.)

For a large file, it might be better to close and re-open the stream after reading 1k. That would work. But I was wrong that we can do this in about the same speed in both cases, so the API structure is fundamentally sound.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually: if the zip archive is always already in memory, maybe closing/reopening is trivially cheap?

this.Stream.Seek(0, SeekOrigin.Begin);

byte[] header = new byte[4096];
int length = this.Stream.Read(header, 0, header.Length);
Copy link
Contributor

@scottoneil-ms scottoneil-ms Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fully cognizant that this is almost certainly out of scope for what you've set out to accomplish but I'm duty-bound to wonder aloud when I see more synchronous code being crafted: do you see a path to the Task Programming Model? #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing! It is a SMOP. Just a 'small matter of programming'. :)

{
var artifacts = new List<IEnumeratedArtifact>();

foreach (ZipArchiveEntry entry in zipArchive.Entries)
Copy link
Contributor

@scottoneil-ms scottoneil-ms Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this indirectly recurse? Do we want it to? (I imagine we do.)

(This might become obvious to me later, but tossing it out there in case it doesn't and I forget.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this should not. This is the zip archive's own enumerator, which handles traversing all zip contents. As far as I'm aware, the zip archive format does not allow features (such as a kind of symbolic link) that would result in any cycles/recursion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unclear and imprecise. The zip archive doesn't know anything about the contents of its individual entries. It's counterproductive to zip up a file that is already zipped, but you can. And it probably happens in some cases where people are automating bulk archiving where other processes and/or user behavior may have caused some of the contents to be zipped up.

To get super concrete: if we're ever working through the entrys in "BulkArchive.zip", and find an entry in it such that entry.Name == "doublezipped.zip" --will we crack open doublezipped.zip?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, thanks. No, we do not recurse into other zip archives. In theory, we could instantiate a new provider in this case. In theory, we could try to expand other arbitrary archives as well. We've discussed this as an interesting possibility for the tool (you and I and others). But it's not realized today. I tend to think we'd want a different design if we had this capability beyond a simple iteration paradigm. Maybe a queue or more of a subscription model where something could observe names/file types and just grab what was interesting to them.

{
get
{
lock (this.archive)
Copy link
Contributor

@scottoneil-ms scottoneil-ms Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe strengthen error checking in the constructor to preclude the null condition and elide the lock?

Edit: maybe the lock is still necessary. (That's not clear to me.) The nullchecking could be potentially elided if the constructor enforces non-null values. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bonus: if we do need locking, it would be nice to be locking the entry as opposed to the archive. (Assuming you don't need a whole archive locked to safely open an entry and such.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to lock on the archive! Too bad. We tested the other case. There must be (makes sense) state maintained at the archive level about decompression/other entry state. I'm not sure why we permit a null entry. Will try to rip that out.

string extension = Path.GetExtension(Uri.ToString());
if (this.binaryExtensions.Contains(extension))
{
this.bytes = new byte[Stream.Length];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all our underlying Streams support .Length?

It is more defensive to promote these .bytes fields to MemoryStreams and use Stream.Copy. If I'm reading the docs right: wherever the underlying Stream supports .CanSeek, the MemoryStream can be precisely presized by safely querying .Length. Otherwise, .Length can throw for some implementations (but not others.)

The tradeoff of doing this is that if we ever needed a precisely-sized byte[] for an artifact later, we would potentially need to take a copy. (But not in cases where we were able to size it precisely.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ZipArchive supports providing a decompressed length. What's the performance impact of creating a memory stream as you describe? Will it effectively use the wrapped stream as its buffer? And so the length computation is computable before the memory stream did it? :)

This sounds like a nice next clean-up. It also makes me wonder if I've been over-concerned about handling non-seekable streams. This case may be entirely unremarkable based on what you've pointed out: we should simply read the original stream into a byte array, sniff it and then, if we need to decode, we wrap it in a memory stream (which will use the byte array as its buffer and not create a new copy).

Copy link
Contributor

@scottoneil-ms scottoneil-ms Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we know we have .Length always, then it doesn't matter if we store these things as byte[]s or MemoryStreams. We can go back and forth fluidly. (ie: new MemoryStream(this.bytes)). Which means we can get a decoded string easily, as you say. But that's a second copy. (Copy 1: zip stream into memory stream or byte[]; copy 2: into string). You've convinced me this is unavoidable for Nonseekables, at least without closing/reopening the stream.

For Seekables, it's worth avoiding. We can just do one copy. (Copy 1: zip stream into string.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. But for this specific location, ZipArchive, these streams are never seekable. Do you have a have suggestion or do you agree this code can ship as is? :)


if (Stream == null && this.contents == null)
if (Stream == null && this.contents == null && this.bytes == null)
Copy link
Collaborator

@yongyan-gh yongyan-gh Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stream

nit: have both Stream and this.Stream. could be consistent by using 1 style this.Stream #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made this consistent for this file. We should discuss this style point, whether properties have a this. prefix in code is very inconsistent across the code base.

@@ -5,6 +5,7 @@
using System.Collections.Generic;
using System.IO;
using System.Net.Mime;
using System.Text;

using FluentAssertions;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Esp. since there are some new files here and you're a fellow appreciator: have you measured code coverage? If so: where did it land?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! Thanks for asking, I meant to provide that data already. Those new files are mostly refactored classes. Here's the coverage data for the classes that have functional changes. I took a little time to fill out testing for ZipArchiveArtifact.

94.38% : EnumeratedArtifact
97.22% : MultithreadedZipArtifactProvider
93.59% : ZipArchiveArtifact

{
if (rawData[i - 1] == 0 && rawData[i] == 0)
{
if (++nullSequences > 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++nullSequences > 1 requires finding of 2 '\0\0' sequences.
Only 1 should be enough? >= 1 or > 0

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem, I believe, is that some BOM sequences has a '\0' character (but no more than 1). So your heuristic might make use believe that a text file with a BOM is binary.

throw new InvalidOperationException("Stream is not seekable. Provide a seekable stream or set the 'SupportNonSeekableStreams' property.");
}

this.bytes = new byte[Stream.Length];
Copy link
Collaborator

@yongyan-gh yongyan-gh Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bytes

should we consider adding lock to the bytes array in case a thread tries to read dirty data? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, our driver framework assumes enumerated artifacts aren't shared across threads. Madness would ensure. A single thread gets a single scan target and all rules run against that target on that thread.

I do see that some other scenarios would find value in hardening for this. But I wouldn't do it prematurely or do it without testing perf impact for our core scenario.

@@ -737,7 +737,7 @@ private async Task ScanTargetsAsync(TContext globalContext, IEnumerable<Skimmer<

DriverEventSource.Log.ReadArtifactStart(filePath);
// Reading the length property faults in the file contents.
long sizeInBytes = perFileContext.CurrentTarget.Contents.Length;
long sizeInBytes = perFileContext.CurrentTarget.SizeInBytes.Value;
Copy link
Collaborator

@yongyan-gh yongyan-gh Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SizeInBytes.Value

...SizeInBytes?.Value ?? 0

SizeInBytes may be null in some weird condition. #WontFix

Copy link
Member Author

@michaelcfanning michaelcfanning Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and we should not provoke those conditions. :) There is a unit test for this. What actually happens is that we will raise InvalidOperationException for the weird cases, so the nullness operator doesn't help.

In our code, however, we always provide a URI and contents and so there shouldn't be a problem. You do not want to code defensively in these cases because if you do so you'll mask an issue that you'd assumed, by design, is protected against.

That can lead to very subtle bugs. Better is to have the code as it is. We think our objects have integrity, if not, we will fail fast with an exception.

}
}

private void ValidateBinaryArtifact(EnumeratedArtifact artifact, int sizeInBytes)
Copy link
Contributor

@scottoneil-ms scottoneil-ms Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inferring standards from practice: are we okay with sequences of assertions wherever the [Fact] is validating a single input/case, as opposed to a multiplicity of them?

I'm hypothesizing that if this method were ValidateBinaryArtifacts (plural) then it would be using a StringBuilder.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! For a true, more atomic unit test, it's fine to break early on the first condition that fails. There's two interesting things to consider, one is what failure information is more useful to the debugger. The second is to ensure that the literal output of the test is illustrative.

Some teams have stringent standards around emitting textual messages with all failures. Seeing 'Expected '17' but observed '13'' isn't always helpful.


string guid = Guid.NewGuid().ToString();

#pragma warning disable SYSLIB0001 // Type or member is obsolete
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧐
Which type or member is obsolete and why can't we use the shiny new stuff?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! Thanks for catching this. So UTF7 is marked as obsolete in .NET. I previously was playing around with this encoding to test the behavior of our encoding sniffing. This turned into a rabbit hole so I pulled back (but left the #pragma in place).

@scottoneil-ms
Copy link
Contributor

scottoneil-ms commented Dec 1, 2023 via email

@scottoneil-ms
Copy link
Contributor

scottoneil-ms commented Dec 1, 2023 via email

Copy link
Contributor

@scottoneil-ms scottoneil-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last lingering question on my mind is if it might be fine to close/reopen the stream when we cannot seek to origin. (Apologies if I'm missing a response on this one.) I think that depends on what's underneath the stream. If it's in memory already, it should be cheap, and that could allow the code to not care about seekability at the API level.

If close/reopen means you go to network a second time, then you'd need to know it's large for that to be worth it.

Since the proposed API design solicits a seekability behavior preference, there's an implication for the API surface here, and I don't know what our commitments are, to whom, or when they start applying in terms of those becoming intractable to change. I think my parting suggestion for consideration is that it might be better to remove us caring about seekability in our API surface, handle the details internally, and later add it if we definitely need it --than to go in the other direction. But as noted: I don't know the tradeoffs there, so I'll leave the question to you.

This is awesome and I'm super stoked to see where your coverage numbers landed.

@michaelcfanning michaelcfanning merged commit b0952b0 into main Dec 1, 2023
7 of 8 checks passed
@michaelcfanning michaelcfanning deleted the enumerated-files-binaries branch December 1, 2023 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet