
[POC] Add Collector Payload Thrift encoding / decoding transformations #251

Draft
wants to merge 8 commits into master from thrift-encoding-poc
Conversation

jbeemster
Member

@jbeemster commented Jan 20, 2023

This PR adds transformations that convert a JSON string into a Thrift-encoded Collector Payload, and decode a Thrift-encoded Collector Payload back into a JSON string.

This can be used to:

  1. Get access to Collector Payloads in a usable format without needing to deal with Thrift directly
  2. Publish custom events directly into the "raw" stream (for example as part of a streaming recovery operation)
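
For illustration, here is a minimal sketch of that round trip, assuming a Thrift-generated `collectorpayload` package (the import path below is hypothetical) and the Apache Thrift Go library; it is not the PR's actual implementation, and the `Read`/`Write` signatures vary slightly between thrift library versions.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/apache/thrift/lib/go/thrift"

	// Hypothetical import path for the Thrift-generated CollectorPayload bindings.
	collectorpayload "github.com/snowplow/collector-payload-go/gen-go/collectorpayload"
)

// JSONToThrift maps a JSON string onto the generated CollectorPayload struct
// (generated Go code carries json tags) and serialises it with Thrift's
// default binary protocol.
func JSONToThrift(ctx context.Context, jsonStr string) ([]byte, error) {
	payload := collectorpayload.NewCollectorPayload()
	if err := json.Unmarshal([]byte(jsonStr), payload); err != nil {
		return nil, err
	}
	return thrift.NewTSerializer().Write(ctx, payload)
}

// ThriftToJSON decodes a Thrift-encoded Collector Payload back into a JSON string.
func ThriftToJSON(ctx context.Context, raw []byte) (string, error) {
	payload := collectorpayload.NewCollectorPayload()
	if err := thrift.NewTDeserializer().Read(ctx, payload, raw); err != nil {
		return "", err
	}
	out, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	ctx := context.Background()
	raw, _ := JSONToThrift(ctx, `{"path":"/com.snowplowanalytics.snowplow/tp2"}`)
	back, _ := ThriftToJSON(ctx, raw)
	fmt.Println(back)
}
```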

Q: Should the CollectorPayload module live in this repo or in its own repo? Should it also carry the Apache 2 license rather than the Community license?

@jbeemster jbeemster requested a review from miike January 20, 2023 05:12
Collaborator

@miike left a comment


Looks good. I don't think these serializers need to be concerned about the heal context as that can get added elsewhere - likely in the function that yields the event itself.

@jbeemster jbeemster force-pushed the thrift-encoding-poc branch from 55a2246 to 767fd8a on January 23, 2023 00:16
@jbeemster jbeemster changed the title Thrift encoding poc [POC] Add Collector Payload Thrift encoding / decoding transformations Jan 23, 2023
@colmsnowplow
Collaborator

colmsnowplow commented Jan 23, 2023

Looks good!

Four questions/discussion points from me:

1. Re: Mike's comment:

I don't think these serializers need to be concerned about the heal context as that can get added elsewhere - likely in the function that yields the event itself.

I would argue that we should not rely on the function that yields the event to do this, because it's intended to be a user-configured custom script. We should neither rely on nor expect the everyday user to instrument that correctly IMO.

The way I see it, there are three places it could reasonably go:

  1. When we generate the bad row itself (ie in enrich and anywhere else we generate bad rows) - ie we have a 'failed event' context that gets attached to the event itself, recording some metadata about it. Then the 'should I attempt recovery' logic in this application can be 'if there's more than one failed event context, don't attempt to recover' (a small counting sketch follows this list).
  
  2. Where we generate the recovered event - ie the script to recover. If we want to do it here, I would suggest it's hardcoded into a wrapper around the scripting transformation itself - ie we don't rely on the user to instrument it. Even still, I'm not sure it's the cleanest (because what stops the user from just using normal JS transformation?)
  3. Where we serialise - ie this PR. I agree that it feels wrong to have this logic live in serialisation though.
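
To make option 1 concrete, a tiny sketch of that recovery-gating check could look like this; the schema URI and function are purely illustrative and exist nowhere in this PR.

```go
package recovery

// Placeholder schema URI for a hypothetical failed-event context; the real
// identifier would come from wherever that context ends up being defined.
const failedEventContextSchema = "iglu:com.example/failed_event/jsonschema/1-0-0"

// ShouldAttemptRecovery implements the "more than one failed event context
// means don't retry" rule: an event that already carries several of these
// contexts has been through recovery before and is skipped.
func ShouldAttemptRecovery(contextSchemas []string) bool {
	count := 0
	for _, schema := range contextSchemas {
		if schema == failedEventContextSchema {
			count++
		}
	}
	return count <= 1
}
```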

2. Suggestion:

In a similar vein to my argument above, do we rely here on the user to return a correct 'raw event' type from their JS enrichment? I wonder if we could assist with this by defining a struct that the raw event should fit, and coercing the data we receive from the script into that struct.

Perhaps there's a way to fail out with an error if they get it wrong? Or some other way to ensure that we are receiving the correct data.

This might be the responsibility of another transformation, or some other tooling to assist with testing though.
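
As a strawman for that coercion idea, something along these lines could decode the script's output strictly and fail loudly on a shape mismatch; the field set here is an assumption, not the actual raw event schema.

```go
package transform

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// RawEvent is a hypothetical shape that the custom script's output would be
// coerced into; the real field set would mirror the Collector Payload.
type RawEvent struct {
	Path        string   `json:"path"`
	QueryString string   `json:"querystring"`
	Body        string   `json:"body"`
	Headers     []string `json:"headers"`
	ContentType string   `json:"contentType"`
}

// CoerceRawEvent decodes the script output strictly, so unknown or malformed
// fields surface as an error instead of silently flowing downstream.
func CoerceRawEvent(scriptOutput []byte) (*RawEvent, error) {
	dec := json.NewDecoder(bytes.NewReader(scriptOutput))
	dec.DisallowUnknownFields()
	var ev RawEvent
	if err := dec.Decode(&ev); err != nil {
		return nil, fmt.Errorf("script output does not match the expected raw event shape: %w", err)
	}
	return &ev, nil
}
```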

3. Small question about b64:

When do we not want to b64 encode the data (ie why make it configurable)? Isn't it always b64 encoded in the raw stream?

4. Question about practicalities:

This PR is marked as a POC and is a draft, so I'm assuming that the review here is conceptual feedback and discussions at this point. Please let me know if I should begin to treat it as something to be scheduled into a release (or when). Just flagging in case my assumption is wrong. :)

@jbeemster
Member Author

@colmsnowplow on number 1 I think we have a few different options, but they should be out of scope of this initial PoC. Ideally we could have a decorator transformation that appends a context to a "raw" or "enriched" event, but we can think about that later!

On 2 - for sure we can add other transformations to ensure this, but again for the PoC I think we can write this into the JS / Lua layer to prove the idea works before improving the whole flow here. We are likely going to need:

  1. A transformation to filter and prepare a failed_event for recovery and;
  2. A transformation to take a recovered event and prepare it for turning back into a collector payload

On 3 - the b64 part was really only added so that we could have reliable unit tests. In production you would not base64 encode at that layer, as the transport to SQS / Kinesis / PubSub does this implicitly based on the []byte that is passed to it, which should just be the Thrift-encoded []byte. It exists entirely as an aid for portability of the result in unit tests or when using stdout (this is actually the same as what the Collector does in stdout mode).
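
To illustrate that point, a sketch of the output step might look like the following; the function name and flag are illustrative, not the PR's API.

```go
package transform

import "encoding/base64"

// EncodeForOutput passes the Thrift-encoded bytes through untouched for
// SQS / Kinesis / PubSub targets, and only base64-encodes them when a
// printable result is wanted (unit tests or the stdout target).
func EncodeForOutput(thriftBytes []byte, b64Encode bool) []byte {
	if !b64Encode {
		return thriftBytes
	}
	encoded := make([]byte, base64.StdEncoding.EncodedLen(len(thriftBytes)))
	base64.StdEncoding.Encode(encoded, thriftBytes)
	return encoded
}
```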


For sure still a PoC - we want to prove that this is viable and then we can dig a bit deeper into a lot of the very good questions you have posed on how we would make this much more user friendly / what other transformations we would want to make this really simple.
