schema_registry_encode truncation of input json before encode #2525

Open
hendoxc opened this issue Apr 16, 2024 · 3 comments
Labels
bug needs more info An issue that may be a bug or useful feature, but requires more information

Comments

hendoxc commented Apr 16, 2024

I previously posted an issue with schema_registry_encode that was resolved by fixing a race condition.

I'm now seeing a similar issue, but this time the input JSON is truncated by an extra key. For example, this message:

{"key_1":  1, "key_2": {"sub_key_1": 2}, "key_3": 3}

results in an error message like:

could not decode any json data in input {"sub_key_1": 2}, "key_3": 3} for key "key_2"
@hendoxc hendoxc changed the title schema_registry_encode missing keys causing error schema_registry_encode truncation of input json before encode Apr 22, 2024
@mihaitodor
Collaborator

Hey @hendoxc, I had a quick look at this, but I can't come up with a way to reproduce it. Are you able to trigger it consistently and is it possible to come up with a setup that I can run myself?

@mihaitodor mihaitodor added bug needs more info An issue that may be a bug or useful feature, but requires more information labels May 5, 2024
@hendoxc
Author

hendoxc commented Jun 26, 2024

Hey @mihaitodor, I've looked into this more, and the underlying issue was that my schema had a non-nullable field, but the value for it in the message was null.

I've run a local example and stepped through the Benthos code and down into the goavro library. The issue is that the error returned from goavro isn't very helpful here, and is actually misleading to an end user.

Another example: if an Avro schema has a long field but a message in the pipeline contains a string for it, you get the same error as above. The error message is very misleading.

I think the issue can be closed. However, given that schema_registry_encode is arguably a core piece of any stream processing application, it might be worth getting better error messages here, which might mean updating goavro or switching to a different Avro library.

Again, though, this issue can be closed :)
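
To illustrate the kind of error message suggested above, here is a minimal sketch of a pre-encode check that names the offending field and the type mismatch. It uses only the Go standard library; `fieldType`, `checkValue` and `validate` are hypothetical names made up for this example and are not part of Benthos or goavro. The "schema" is a deliberately tiny stand-in for Avro (long, string, record, plus a nullable flag), so treat it as an assumption-laden illustration rather than how either project works.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sort"
)

// fieldType is a hypothetical, heavily simplified stand-in for an Avro type.
type fieldType struct {
	Kind     string               // "long", "string" or "record"
	Nullable bool                 // true if null is acceptable for this field
	Fields   map[string]fieldType // child fields, used when Kind == "record"
}

// jsonKind names the JSON type of a decoded value for use in error messages.
func jsonKind(v any) string {
	switch v.(type) {
	case json.Number:
		return "number"
	case string:
		return "string"
	case bool:
		return "boolean"
	case map[string]any:
		return "object"
	case []any:
		return "array"
	}
	return "null"
}

// checkValue returns the first mismatch between v and t, including the field path.
// A missing key is treated the same as an explicit null.
func checkValue(path string, t fieldType, v any) error {
	if v == nil {
		if t.Nullable {
			return nil
		}
		return fmt.Errorf("field %q: schema type %q is not nullable, but value is null or missing", path, t.Kind)
	}
	switch t.Kind {
	case "long":
		n, ok := v.(json.Number)
		if !ok {
			return fmt.Errorf("field %q: schema expects long, got JSON %s", path, jsonKind(v))
		}
		if _, err := n.Int64(); err != nil {
			return fmt.Errorf("field %q: %s does not fit a long", path, n)
		}
	case "string":
		if _, ok := v.(string); !ok {
			return fmt.Errorf("field %q: schema expects string, got JSON %s", path, jsonKind(v))
		}
	case "record":
		obj, ok := v.(map[string]any)
		if !ok {
			return fmt.Errorf("field %q: schema expects record, got JSON %s", path, jsonKind(v))
		}
		// Visit fields in sorted order so the reported error is deterministic.
		names := make([]string, 0, len(t.Fields))
		for name := range t.Fields {
			names = append(names, name)
		}
		sort.Strings(names)
		for _, name := range names {
			child := name
			if path != "" {
				child = path + "." + name
			}
			if err := checkValue(child, t.Fields[name], obj[name]); err != nil {
				return err
			}
		}
	default:
		return fmt.Errorf("field %q: unsupported schema type %q", path, t.Kind)
	}
	return nil
}

// validate decodes data (keeping numbers as json.Number) and checks it against schema.
func validate(schema fieldType, data []byte) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.UseNumber()
	var v any
	if err := dec.Decode(&v); err != nil {
		return fmt.Errorf("invalid JSON: %w", err)
	}
	return checkValue("", schema, v)
}

func main() {
	schema := fieldType{Kind: "record", Fields: map[string]fieldType{
		"key_1": {Kind: "long"},
		"key_2": {Kind: "record", Fields: map[string]fieldType{"sub_key_1": {Kind: "long"}}},
		"key_3": {Kind: "long"},
	}}
	// A string where a long is expected, and a null in a non-nullable field.
	fmt.Println(validate(schema, []byte(`{"key_1": 1, "key_2": {"sub_key_1": "2"}, "key_3": 3}`)))
	fmt.Println(validate(schema, []byte(`{"key_1": 1, "key_2": {"sub_key_1": 2}, "key_3": null}`)))
}
```

With a check like this, the two cases described in this comment would be reported against `key_2.sub_key_1` and `key_3` respectively, instead of as a generic JSON decode failure.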

@mihaitodor
Collaborator

Thank you for digging into it @hendoxc! Indeed, goavro has some bugs / limitations (252 and 253 are the ones I'm aware of) and, since LinkedIn doesn't seem very motivated to invest in it further and maybe work towards a V2, I'm not sure if anyone is willing to pick up the slack, at least not when it comes to AVRO. One big issue is that serialising AVRO to JSON without preserving the message schema is lossy in some ways when it comes to logical types / unions and each implementation makes different tradeoffs...

I did see https://github.com/hamba/avro pop up a while back, but I think it was missing certain features like marshalling logical types in a way that is compatible with whatever Kafka Connect is doing. It did get plenty of updates since then, so maybe now one can fully replicate the Kafka Connect behaviour using this alternative library, but I don't really have a use case for AVRO right now, so it's hard to justify diving into that can of worms...

One way to do it would be to add an advanced flag to schema_registry_encode and schema_registry_decode which instructs them to use this alternative library so we don't break backwards compatibility, but it's a bit of work to figure out how that library works and to get the tests to pass.
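
As a rough sketch of the opt-in idea above: a boolean setting whose zero value keeps the current library, so existing configs behave exactly as before. Everything here is hypothetical, including the `encoderConfig` type, the `UseAlternativeLibrary` field and the stubbed encoders; no such flag exists in schema_registry_encode today, and the stubs do not call goavro or hamba/avro.

```go
package main

import (
	"errors"
	"fmt"
)

// encodeFunc turns a JSON message into Avro bytes (stubbed out in this sketch).
type encodeFunc func(jsonMsg []byte) ([]byte, error)

// encoderConfig is a hypothetical config struct. The zero value (false)
// selects the existing library, which preserves backwards compatibility.
type encoderConfig struct {
	UseAlternativeLibrary bool
}

// Placeholder encoders: a real implementation would call the respective library here.
func encodeWithCurrentLibrary(jsonMsg []byte) ([]byte, error) {
	return nil, errors.New("current-library encoder not wired up in this sketch")
}

func encodeWithAlternativeLibrary(jsonMsg []byte) ([]byte, error) {
	return nil, errors.New("alternative-library encoder not wired up in this sketch")
}

// selectEncoder returns a label and the encoder chosen by the flag.
func selectEncoder(conf encoderConfig) (string, encodeFunc) {
	if conf.UseAlternativeLibrary {
		return "alternative", encodeWithAlternativeLibrary
	}
	return "current", encodeWithCurrentLibrary
}

func main() {
	name, _ := selectEncoder(encoderConfig{}) // flag omitted: existing behaviour
	fmt.Println(name)
	name, _ = selectEncoder(encoderConfig{UseAlternativeLibrary: true}) // explicit opt-in
	fmt.Println(name)
}
```

The design point is only that the default path stays untouched; the real work, as noted above, would be in making the alternative path pass the existing tests.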
