RFC: Should the shredded JSON contain additional schema information? #31

Open
miike opened this issue Aug 8, 2017 · 2 comments



miike commented Aug 8, 2017

The current shredding of JSON contexts produces a simplified payload in which each context name carries only the model (the first component of the SchemaVer) alongside the data, e.g.:

    [
      ("context_com_acme_duplicated_1", [{"value": 1}, {"value": 2}]),
      ("context_com_acme_unduplicated_1", [{"unique": true}])
    ]

https://github.com/snowplow/snowplow-python-analytics-sdk/blob/master/snowplow_analytics_sdk/json_shredder.py#L102

This process is lossy, and there are circumstances where the revision and addition components of the SchemaVer matter, e.g. determining whether data is backwards compatible when running an aggregation, or filtering/dropping on specific schema versions. Should the payload be restructured to include the schema version information (or, more widely, the schema information available to Redshift)? Thoughts, @chuwy?
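To make the loss concrete, here is a minimal sketch. The helper `column_name` is hypothetical and written only for illustration; it is not the SDK's actual function, and it ignores any other name normalization the SDK may apply. It assumes Iglu URIs of the form `iglu:vendor/name/format/MODEL-REVISION-ADDITION`.

```python
import re

# Assumed Iglu URI shape: iglu:vendor/name/format/MODEL-REVISION-ADDITION
SCHEMA_URI = re.compile(
    r"^iglu:(?P<vendor>[^/]+)/(?P<name>[^/]+)/(?P<format>[^/]+)/"
    r"(?P<model>\d+)-(?P<revision>\d+)-(?P<addition>\d+)$"
)


def column_name(prefix, schema_uri):
    """Hypothetical helper: build a model-only column name from an Iglu URI."""
    match = SCHEMA_URI.match(schema_uri)
    if match is None:
        raise ValueError("Not a valid Iglu schema URI: " + schema_uri)
    vendor = match.group("vendor").replace(".", "_").lower()
    name = match.group("name").lower()
    # Only the model is kept; revision and addition are discarded here.
    return "{}_{}_{}_{}".format(prefix, vendor, name, match.group("model"))


# Two different schema versions collapse into the same column name.
a = column_name("context", "iglu:com.acme/duplicated/jsonschema/1-0-0")
b = column_name("context", "iglu:com.acme/duplicated/jsonschema/1-0-1")
```

Both `a` and `b` come out as `context_com_acme_duplicated_1`, so a consumer of the shredded payload cannot tell `1-0-0` data from `1-0-1` data.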

Contributor

chuwy commented Aug 8, 2017

Hey @miike,

I think the argument about information loss is quite strong. I'd like to preserve the revision and addition for as long as possible.

There's a function in the Scala SDK (not in Python yet) called transformWithInventory, which extracts the set of Iglu keys along with the transformed JSON result. Column names are still the same (version-lossy), but there's a good chance you can use the information about shred types in something like Spark to identify schema-compatibility issues. Do you think that could be a solution for the use cases you mentioned?
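As a rough illustration of that idea in Python, the sketch below returns an inventory of full Iglu URIs next to the usual model-keyed result. The function `shred_with_inventory` and its input shape (a list of `{"schema": ..., "data": ...}` objects) are assumptions made for this example; it does not mirror the Scala SDK's actual signature.

```python
import re
from collections import defaultdict

# Assumed Iglu URI shape: iglu:vendor/name/format/MODEL-REVISION-ADDITION
SCHEMA_URI = re.compile(r"^iglu:([^/]+)/([^/]+)/([^/]+)/(\d+)-(\d+)-(\d+)$")


def shred_with_inventory(contexts):
    """Hypothetical: return (set of full Iglu URIs, model-keyed shredded data)."""
    inventory = set()
    shredded = defaultdict(list)
    for context in contexts:
        uri = context["schema"]
        match = SCHEMA_URI.match(uri)
        if match is None:
            raise ValueError("Not a valid Iglu schema URI: " + uri)
        vendor, name, _format, model, _revision, _addition = match.groups()
        column = "context_{}_{}_{}".format(vendor.replace(".", "_"), name, model)
        inventory.add(uri)  # the full version survives in the inventory
        shredded[column].append(context["data"])  # column name stays lossy
    return inventory, dict(shredded)


contexts = [
    {"schema": "iglu:com.acme/duplicated/jsonschema/1-0-0", "data": {"value": 1}},
    {"schema": "iglu:com.acme/duplicated/jsonschema/1-0-1", "data": {"value": 2}},
    {"schema": "iglu:com.acme/unduplicated/jsonschema/1-0-0", "data": {"unique": True}},
]
inventory, shredded = shred_with_inventory(contexts)
```

Note the limitation: the inventory says which versions occurred in the event, but not which individual record under a column came from which version.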

Author

miike commented Aug 8, 2017

I think that makes sense, though what I have in mind would be a subset of that: transforming the input line would yield the Iglu schema version in the output, something like:

    {
      "app_id": "test",
      "context_com_acme_duplicated_1": {
        "schema_version": "1-0-1",
        ...
        "data": {
          "value": 1
        }
        ....
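One way such an output could be produced is sketched below. The function `shred_keeping_version` is hypothetical, and it assumes each column holds a list of entries, since several contexts can share the same model; only the `schema_version` and `data` keys come from the example above.

```python
import re

# Assumed Iglu URI shape: iglu:vendor/name/format/MODEL-REVISION-ADDITION
SCHEMA_URI = re.compile(r"^iglu:([^/]+)/([^/]+)/[^/]+/(\d+)-(\d+)-(\d+)$")


def shred_keeping_version(contexts):
    """Hypothetical: key by model-only column, but keep the full version per entry."""
    shredded = {}
    for context in contexts:
        match = SCHEMA_URI.match(context["schema"])
        if match is None:
            raise ValueError("Not a valid Iglu schema URI: " + context["schema"])
        vendor, name, model, revision, addition = match.groups()
        column = "context_{}_{}_{}".format(vendor.replace(".", "_"), name, model)
        entry = {
            "schema_version": "{}-{}-{}".format(model, revision, addition),
            "data": context["data"],
        }
        shredded.setdefault(column, []).append(entry)
    return shredded


result = shred_keeping_version([
    {"schema": "iglu:com.acme/duplicated/jsonschema/1-0-1", "data": {"value": 1}},
])
```

With this shape, filtering or dropping on a specific schema version becomes a per-entry check on `schema_version`, while the column name stays compatible with the current model-only naming.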
