Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue/refactor filters #185

Merged
merged 1 commit into from
Aug 11, 2022
Merged

Conversation

colmsnowplow
Copy link
Collaborator

@colmsnowplow colmsnowplow commented Aug 8, 2022

Once I've had a sense check on this PR, I'll rebase this and squash into a single issue - have left it this way to leave the reasoning as auditable as I can.

Context: In reviewing the release/1.0.0 PR, I had some questions/concerns about how the fitlers worked. It then became apparent that the way we had factored the code made it very difficult to decipher the actual behaviour. This is all stuff that should've come out in the feature PR review, so that much is on me.

Since it was quite difficult to resolve those questions, and create unit tests to illustrate them etc., I went ahead and created this PR to refactor the code and make that process easier. On doing so I found a number of issues with the filters - some flaws in design and some other incorrect behaviours for which we had incorrect unit tests.

@istreeter - as mentioned on slack I've asked for review as a sense check for:
a) That this way of factoring the code is readable/sensible
b) That the various things I've directly called out in comments make sense, and are sensible decisions.

The below is more a description of the steps I took, feel free to peruse it too but it's a lot of info and it's not necessarily relevant to doing the above two things - so feel free not to pay attention. I'm using it as both a way to record what's in this PR for Robert's benefit, and also as basically what I'm putting in the issue description for posterity's sake.

@TiganeteaRobert - I've tried to create this PR in a way that's hopefully useful to review so that you can each of those things. Below is a description of what I did - I think it might be useful for you to take a look through, then we can discuss when we catch up tomorrow if you have any questions.

  • First, I refactored the code structure: 5dcaaa4

The main thing here is that we had two functions which handled multiple distinct pieces of logic (one for each type of filter) - I refactored to make those independent pieces of logic testable on their own.

I also changed the API for unstruct filters here to make the checks on event name and version explicit on configuration.

  • There's also some factoring out of compiling regexes: b2c0d68

This is for two reasons, 1. efficiency, 2. If we are to fail to compile, it should only happen at startup. I know we have a validation check on this so it's not a major concern.

These are two of the bugs that I found originally, which was hard to understand without refactoring. In sum, if we don't retrieve any value, the v1 PR behaviour is to always return false. However if we're matching negatively it should be true when we don't find the value.

Secondly, if we do retrieve a value, but it's nil, since we're using janky fmt.Sprint() to cajole it into string, we actually match against the string <nil> - this is down to pre-existing jankery for which I've created an issue: #186

  • Using Compile() instead of MustCompile(): 61dbba8

We don't ever want to panic, it's always better to handle the error.

This is primarily down to my now thinking the design decision we (I really) landed on in the analytics SDK was wrong. We should fix it there. Issue related to that can be found here: snowplow/snowplow-golang-analytics-sdk#36

Please pay particular attention to these. There were a number of tests passing that were configured to test for the opposite of the desired behaviour. It's important to step back and consider what the actual desired behaviour is for each test case.


Edit to add: Just for completeness of info, I've made issues for some future improvements we could make, which I didn't feel were worth delaying v1 release further. Happy to hear any dissenting opinions on them, or on their being pushed out to post-v1. :)

// if valuesFound is nil, we found no value.
// Because negative matches are a thing, we still want to match against an empty string
if valuesFound == nil {
valuesFound = make([]interface{}, 1)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, when there were no values, we never evaluated anything (just returned false)

This is important as a negative lookahead should evaluate to true when we have no value (beacuse "" != "someValue")

return nil, nil
for _, v := range valuesFound {
if v == nil {
v = "" // because nil gets cast to `<nil>`
Copy link
Collaborator Author

@colmsnowplow colmsnowplow Aug 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary because we're using fmt.Sprint() to cast everything to string when we evaluate the filter.

Really, this is a workaround to cover up some jank - I've opened an issue to do better here: #186

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's my mistake here, I did not realise that nil does not work properly with Sprintf. We should indeed find a better solution here. My vote goes towards handling the different possible types in the existing filter.

RegexTimeout int `hcl:"regex_timeout,optional"`
// For native filters
Description string `hcl:"description,optional"`
UnstructEventName string `hcl:"unstruct_event_name,optional"`
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For unstruct filters (ie filters on data within custom events) - we actually check the event name (because the same fields can be present in separate events), and optionally the version (if provided).

In this refactor, I realised that these should be explicit options - this forces the user into awareness that those checks are part of this filter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, tbh before this it was a bit confusing when it came to the path to the field that we wanted to apply the filter on, this makes it much friendlier.

// createSpEnrichedFilterFunction returns a TransformationFunction which filters messages based on a field in the Snowplow enriched event
// and a regex declared by the user.
func createSpEnrichedFilterFunction(field, regex string, regexTimeout int, isUnstructEvent bool) (TransformationFunction, error) {
func createSpEnrichedFilterFunction(regex string, regexTimeout int, getFunc valueGetter) (TransformationFunction, error) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to call out as it may make it easier to make sense of the changes:

This create function (and the make functions for Getters) run once, on startup. The filter function itself runs once per batch of event (which at the moment is usually batches of 1 event - so for every event).

The idea of this refactor is:

  • We only compile regexes once
  • We keep DRY principle
  • We outsoruce the part which gets the value to a separate function, which is independently unit testable. (getter)
  • The nuance of unstruct filters doing more work is also shifted to the getter function - which is appropriate to the design IMO: if the event name or version don't match, we should behave as if the field desired simply is nil.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is brilliant, I love the new structure both from a readability and from a technical view.


valueFound, err := parsedMessage.GetUnstructEventValue(convertPathToInterfaces(path)...)
// We don't return an error for empty field since this just means the value is nil.
if err != nil && err.Error() != analytics.EmptyFieldErr && !strings.Contains(err.Error(), "not found") {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since fields can be optional, I've changed this to return nil when we can't find the field. Actually this is a problem with the API design in the analytics SDK so there's an issue to fix that also.

Copy link

@istreeter istreeter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very much like the approach, and your comments in the PR were helpful.

Feel free to ignore my comment about edge cases -- honestly it's fine. But for reference, in enrich we use a 3rd party jsonpath library to handle this kind of thing, e.g. when parsing configs for api enrichment, sql enrichment, pii enrichment.

func NewSpEnrichedFilterFunctionUnstructEvent(eventNameToMatch, eventVersionToMatch, pathToField, regex string, regexTimeout int) (TransformationFunction, error) {
// This regex retrieves the path fields
// (e.g. field1.field2[0].field3 -> [field1, field2, 0, field3])
regexWords := `\w+`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for fun....

I can think of edge cases that will not work with this implementation. You might tell me these edge cases are crazy and we should ignore them, and tbf you would be correct.

Edge case 1

schema:

"properties": {
  "foo-bar": {
     "type": "string"
  }
}

valid event:

{
  "foo-bar": "baz"
}

And imagine pathToField is foo-bar. This would break because \w+ would create an array of two items ["foo", "bar"].

Edge case 2

"properties": {
  "123": {
    "type": "string"
  }
}

valid event:

{
  "123": "foo"
}

And imagine pathToField is 123. This would break because snowplow_enriched_util would parse it as an integer and then the analytics sdk would treat it as an array index.

Copy link
Collaborator Author

@colmsnowplow colmsnowplow Aug 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I don't think they're crazy, I simply didn't think of them.

The analytics SDK's API is this way (expecting a list of interface arguments), is because it uses json-iterator under the hood, which expects a list of interfaces. We use it because it allows for retrieving values without having to Unmarshal the JSON - and so is vastly more efficient than doing so.

At a cursory glance I don't think there's a jsonpath library that doesn't require unmarshalling. However there may be a solution which is better than what we have (albeit likely less reliable than using a full jsonpath library). I'll write some tests and then see what I can come up with.

Copy link
Collaborator Author

@colmsnowplow colmsnowplow Aug 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created an issue on the analytics SDK repo to explore a different JSON parser that looks like it may offer a simpler API (essentially jsonpath dot notation) for this task, without necessarily impeding performance: snowplow/snowplow-golang-analytics-sdk#38

I don't think blocking this release on that is the right call, however, so I'm still exploring whether I can provide an improved solution to what we have here also.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I have now come up with something different - but I don't know if it's better. What I've done is:

  • Make explicit config option for the full context name, rather than including that in dot notation
  • Replaced convertPathToInterfaces with parsePathToArguments (the rename is unnecessary now that I look at how it shakes out - it wasn't at first)

parsePathToArguments splits on . first, then we determine if it should be an integer via the length of a split on [.

It's kinda jank. I think the preferable solution long term is to change the analytics sdk API.

Also it still has edge cases - we couldn't have [ in the field name (probably not valid anyway), and we can't have a nested arrays (field[0][1] won't work). These to me seem less likely than the others.

Still, I'm not 100% about this - if we think it's too jank I'm happy to revert back to what we had. WDYT @istreeter ?

@colmsnowplow colmsnowplow mentioned this pull request Aug 9, 2022
14 tasks
return createSpEnrichedFilterFunction(regex, regexTimeout, getUnstructValuesForMatch)
}

func parsePathToArguments(pathToField string) ([]interface{}, error) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a neater implementation using something like:

	re := regexp.MustCompile(`\w+|\[\d+\]`)
	parts := re.FindAllString(pathToField, -1)

        convertedPath := make([]interface{}, 0)
	for _, s := range parts {
		// if it looks like `[42]` then extract the integer and treat it as an array index
                // otherwise treat it as an object key
                append(convertedPath, ???)
	}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That regex in my previous suggestion was flawed. This one might be better:

re := regexp.MustCompile(`\[\d+\]|[^\.\[]+`)

I'm not 100% sure this method will end up being neater than what you did. But I think it's worth following through with the effort, to see what the end result looks like and then to decide which is neater.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why I didn't play with a different regex first, just pivoted to something else... 🤔 let me give it a test and see where I get to.

Copy link
Collaborator Author

@colmsnowplow colmsnowplow Aug 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@istreeter after experimenting with it, this way is definitely better, covers more cases, and leaves us with cleaner code.

The one edge case that was still present was when the string provided contains an unclosed opening brace (like the case in our last unit test case: test1.test[2.test3. Because the first regex strips the opening brace, it thinks that's just a "2" - which means it wouldn't fail, and since a non-existent path just means 'match against nil'. It's an unlikely edge case but there's some downside.

I tried a different way of chopping it up to handle this explicitly, but it left us with less readable code and was quite awkward. But thinking about this case, I don't think it's likely that we have an unmatched [ in the path without it being an error. On that assumption, I just added a check that count([) == count(]) in the string before proceeding (and if it doesn't, we throw an error on startup).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to worry about such weird edge cases. If a user really has such weird field names, then they just have to accept that they cannot use stream replicator to perform transformations. That's OK, I think they'll get over it. Or they will re-design their schemas!

(What's important to me, is that nothing else breaks if a schema has weird fields. I mean, it would be bad if someone could generate an unexpected event that makes the replicator crash; but that's not what we're discussing here)

In other words, I think what you've done is good. It's a nice balance between trying to support all schemas, but not going overboard in unnecessarily supporting weird edge cases which don't deserve our support!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hang on..... does this regexp work, and also fix the edge case you were worried about?

re := regexp.MustCompile(`\[\d+\]|[^\.]+`)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It leaves us with parts like test1[1], which then needs to be handled gain.

That's not really a problem but I prefer what we have in the PR becuase I think it's actually preferable to throw the error when there's a miscount of braces - it's far more likely that this is misconfiguration than not. Without the check this misconfiguration would lead to the app deploying successfully and incorrectly filtering the data - which is a headache.

So I think it's better to keep the check and throw the error - in which case, I don't think there's a benefit to changing the regex.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@colmsnowplow colmsnowplow force-pushed the issue/refactor-filters branch from 41e0325 to 7507e23 Compare August 11, 2022 15:51
@colmsnowplow colmsnowplow force-pushed the issue/refactor-filters branch from 7507e23 to 58ce661 Compare August 11, 2022 16:33
@colmsnowplow colmsnowplow merged commit 2aaa079 into release/1.0.0 Aug 11, 2022
@colmsnowplow colmsnowplow deleted the issue/refactor-filters branch August 11, 2022 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants