Add support for parent/child TypeMatchers #86

ferdnyc · 2020-07-04T04:07:40Z

Going a different route from #73, this PR modifies the matchers interface to account for parent/child relationships between types. (I initially called the children subtypes, as you can see from the branch name, but since that term's already used in the MIME code it would've been confusing as heck.)

Modifications to existing code

`matchers/matchers.go`

Two new interfaces are added, Matcher and Typer
- The Matcher interface requires a method, Match([]bytes), which returns true if the buffer matches the matching function
- The Typer interface adds Type([]bytes), which returns the associated types.Type` if the buffer matches the matching function
The matching function type is renamed to ByteMatcher
- ByteMatcher implements Matcher
TypeMatcher is now a struct, containing a types.Type and a ByteMatcher
TypeMatcher implements both Matcher and Typer
- TypeMatcher.Match() checks whether the buffer matches the type or any of its child types
- TypeMatcher.Type() checks the buffer against any child types, then the parent type, and returns the type on successful match
ChildMatchers is a map[types.Type][]TypeMatcher which registers the parent-child relationship between a Type and its child TypeMatchers (not directly between TypeMatchers)

`match.go`

The matching code is updated to use the Match() and Type() methods, instead of running ByteMatchers directly.

`filetype.go`

The Is() and IsType() functions now use Match() and Type(), instead of running ByteMatchers directly. (Actually, Is now just calls IsType, which checks the type's TypeMatcher.Match().)
- Is() and IsType() will return true for both a buffer's specific type, and its parent type. (e.g. IsType(types.TypeZip) and Is("zip") are true for a buffer containing .docx bytes.)

New source files

The new file matchers/children.go adds storage for the the parent-child mappings:

TypeMatcher is expanded with an AddChild(types.Type, ByteMatcher) method, to add a child of the matcher's Type
A non-method AddChildType(parent, child types.Type, ByteMatcher) is also added, to create children for types that don't yet have a TypeMatcher. (This was necessary to keep the matchers.Map{} functionality working, since ChildMatchers will end up getting populated before Matchers.)

Updates to existing data structures

Epub, Docx, Xlsx, and Pptx are all made children of TypeZip

Testing

A new unit test verifies ChildMatcher functionality

Notes/Caveats

In theory this could be extended to support a multi-level hierarchy, with child types having children of their own. However, as that wasn't necessary to support current needs, only one level is implemented. The Type() method of TypeMatcher doesn't use the Type() method of its children to look up their matching types, it queries the child's ByteMatcher directly. Recursive ChildMatcher lookups would be necessary to support children of children.

There's no way to support parent-child relationships in an arbitrary MatchMap() call. Because it does run the ByteMatchers it's passed directly, it will ignore ChildMatchers and can return incorrect results if it's passed a matchers.Map containing overlapping matchers.

Fixes #38, fixes #40

h2non · 2020-07-04T11:50:56Z

Thanks for investing the time on this.

Overall, LGTM, but just a small comment:

The hierarchy solves part of the #73 issue's problem about deterministic matching, but what about the other matchers? Should all they use hierarchical matchers according to the proposed implementation?

I might understand that the purpose of this PR is not strictly to solve this issue but makes sense to me to solve it consistently before releasing a new major version after the merge since there are public interface breaking changes here.

Fundamentally, I would change the map structure that now holds the matchers and turn into an ordered list implicitly defining its priority by declaration order. I see some other options here but I would pick that one for the sake of simplicity.

Would you be interested in pursuing that goal here or do you prefer to merge things and iterate separately in that direction?

ferdnyc · 2020-07-04T18:37:30Z

I thought about priority some.

What I realized was, even with priority you still want hierarchy. Otherwise, you're forced into priority inversion in order to establish the hierarchy.

IOW, say I have a buffer full of zip-signature data. In order for priority to correctly identify it no matter the type, I have to test every signature that represents a zip file, in backwards order from the least likely to the most, before I can declare it a zip file.

If it happens to not be an epub, or an Office doc, then it'll fail all of those tests and match the zip signature, at which point I can call it a zip file.

With hierarchy, for the most part the same thing happens. To declare it a zip file, I have to test it against every one of its child matchers. But there are two key differences:

I know that it's some kind of zip file as soon as I test it against the zip matcher, which doesn't have to wait until after I test all of the higher-priority matchers.
If it's NOT zip data at all, then I don't have to run any of the zip child matchers against it. I can skip the whole hierarchy as soon as it fails the zip matcher.

Ideally, the more-specific matchers wouldn't be in MatcherKeys at all -- they can still be in the Matchers map, so they can be used by IsType() to look up specific matchers, but they wouldn't be on the list of matchers to try initially against an unknown buffer. The tests for epub, docx, etc. would run iff TypeZip.Match() first returns true. (That's also good for performance, since the Zip matcher is the fastest & quickest of the set. The Epub matcher and especially the Office type matchers, they're all significantly more complex. So any opportunity to avoid running them is a win.)

Priority/ordering isn't necessarily a bad idea. But in terms of efficiency, I'd think the most benefit would come from prioritizing the most likely/common types. But to do that without then misidentifying more specific types that share the same signature, their parent/child relationship -- which in a sense is the inverse of priority -- still needs to be tracked as well.

As far as working on it, I can't promise any timeframe, but I'm not abandoning this PR. If it needs further development, I'm happy to discuss changes. A priority implementation, specifically, I think is probably a separate enough thing that even if I were to work on it, I'd do it as a separate PR. I don't see that as something that would invalidate this PR, or vice versa.

ferdnyc added 3 commits July 3, 2020 23:11

matchers: Support parent/child TypeMatchers

106be8e

Add tests for ChildMatchers

e5ee92b

Make Epub, Docx, Xlsx, Pptx children of Zip type

b3fd4b7

ferdnyc mentioned this pull request Jul 4, 2020

More free sample files #87

Open

Run 'go fmt' on matchers/children.go

834e23b

ferdnyc force-pushed the subtypes branch from 57a8acf to 834e23b Compare July 4, 2020 04:52

h2non self-assigned this Jul 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for parent/child TypeMatchers #86

Add support for parent/child TypeMatchers #86

ferdnyc commented Jul 4, 2020 •

edited

Loading

h2non commented Jul 4, 2020

ferdnyc commented Jul 4, 2020

Add support for parent/child TypeMatchers #86

Are you sure you want to change the base?

Add support for parent/child TypeMatchers #86

Conversation

ferdnyc commented Jul 4, 2020 • edited Loading

Modifications to existing code

matchers/matchers.go

match.go

filetype.go

New source files

Updates to existing data structures

Testing

Notes/Caveats

h2non commented Jul 4, 2020

ferdnyc commented Jul 4, 2020

ferdnyc commented Jul 4, 2020 •

edited

Loading

`matchers/matchers.go`

`match.go`

`filetype.go`