Release R136 #1217
Hey @dilyand please can you explain more about what you are trying to achieve with this schema change, and therefore why you chose these values for maxLength?
We spoke earlier about this being "the schema change to end all schema changes" -- meaning we're all fed up of updating enrich each time the yauaa enrichment sees an unexpected user agent string. So we talked about lengthening each field to protect it from future changes.
But with that in mind, I can see you have chosen a mixture of 256, 1000, and 1024 as max length for different fields. How did you decide on these values?
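For reference, the kind of widening under discussion would look roughly like the snippet below; the field and the maxLength value here are illustrative rather than taken from the diff:
"agentNameVersion": {
  "description": "See https://yauaa.basjes.nl/README-Output.html",
  "type": "string",
  "maxLength": 1000
}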
},
"deviceName": {
  "description": "Example: Google Nexus 6",
  "type": ["string", "null"],
Why the change from type string to type [string, null]? I'm not against it if you have a good reason. It struck me as odd because enrich doesn't output nulls.
They are optional fields, so presumably at some point they may be sent with a null value. We even have it as a check in igluctl.
I appreciate we currently do not do that in enrich, but if we change it in the future, we won't need to worry about the schema not allowing it. It's part of trying to make the schema more future-proof.
My gut feeling at the moment is like this:
If you're going to allow nulls for some fields, then you need to allow nulls for all fields, or else it's not consistent between fields. Which means allow nulls for the enum fields too, like operatingSystemClass. Which means add null to the enum array.
If you're adding null to the enum array then you need to be very, very sure that RDB Loader will not choke on the schema migration. Remember that RDB Loader immediately (and probably wrongly) tries to migrate the table as soon as the schema is published; it does not wait until it sees the first 1-0-4 event. Remember also that we need to be sure old versions of RDB Loader don't choke on the schema migration, otherwise we might break OSS pipelines just by publishing a schema.
Now RDB Loader probably handles this OK. I'm just saying we need to be 100% sure before merging this.
For that reason... personally I would avoid adding the nulls and put it back to be more like how it was in 1-0-3. There is no reason whatsoever to add nulls to this schema; I see no reason why enrich would ever change in the future to start emitting nulls.
Yes we want to make the schema future-proof, but I see it as future-proofing against things we are not in control of, like crazy user agent strings. Whereas nulls are something we are completely in control of, so we don't need to future-proof against nulls.
I think that is a sensible idea. I was trying to make igluctl happy, but I agree that in this instance, where we're in control of this, there is no need to worry.
},
"operatingSystemClass": {
  "description": "See https://yauaa.basjes.nl/README-Output.html",
  "type": ["string", "null"],
You added "null" as a type... but this does not make any difference, unless you also add null to the enum array. As currently written, null is still not an allowed value for operatingSystemClass.
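For the enum fields, allowing nulls would therefore need both changes together. A sketch of what that could look like (the enum values here are shortened for illustration, not the full list from the schema):
"operatingSystemClass": {
  "description": "See https://yauaa.basjes.nl/README-Output.html",
  "type": ["string", "null"],
  "enum": ["Desktop", "Mobile", null]
}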
Yes, I did not realise that. I think if we keep null as an allowed type, I'll add it to the enum.
I put most of the reasoning in the issue #1216, but very briefly:
…
Looks good.
@@ -0,0 +1,234 @@
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for an entity context generated by the YAUAA enrichment after parsing the user agent",
Entity context or context entity?
"context entity" feels more right to me. I'd probably just go for "entity" though.
I changed it back to just 'context'. I think the word entity confuses things in this case. This schema does not describe an entity, but rather just a grab bag of things that could be deduced from the useragent.
I added a separate commit to add …
Could we also merge #1221 into …
I'm happy with the amendment made to this PR since I last approved it.
Force-pushed from b440d28 to dd6a49f
LGTM. A couple of nitpick comments, but happy for this to go out once the changelog is tidied.
CHANGELOG (outdated)
Add nl.basjes/yauaa_context/jsonschema/1-0-4 (#1216)
Extend copyright to 2022 (close #1218)
Update links in Readme (close #1219)
Add CONTRIBUTING.md (close #1220)
nit: I don't think we usually have the word close in these lines
Done.
Force-pushed from db44e7a to d553671
Added a new version of the …
@@ -0,0 +1,225 @@
{
@istreeter doing a diff with 1-0-0, a lot of these fields now have a maximum, and a few now allow an object, null pairing over just object. Just want to confirm that the addition of new limits and type groupings is not going to cause issues in any of our loaders / data warehouses?
I am not entirely certain whether this schema gets tracked into pipelines, to be fair, but if it were, would these changes get automatically represented in their respective columns?
The question is still valid; just wanted to say that the schema is not being used in pipelines. It is just there to help developers better define their configurations (some IDEs may use it for validating the JSONs).
I tested this on a pipeline with a loader. Even if it's not expected to be used in tracking, it was worth testing.
It is completely safe, because the field configurationBundle is a complex object, so RDB transformer simply stringifies the whole field, and RDB Loader loads it as a string. Any change you make to a sub-field (e.g. configurationBundle.namespace) does not bother the transformer/loader at all.
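Roughly, the shape in question is something like the sketch below (simplified, with an illustrative maxLength, not the actual schema text). Because the transformer stringifies the whole object, changes inside its properties never surface as separate warehouse columns:
"configurationBundle": {
  "type": "object",
  "properties": {
    "namespace": {
      "type": ["string", "null"],
      "maxLength": 1024
    }
  }
}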
@@ -0,0 +1,238 @@
{
Same question again actually @istreeter - do all of the columns get automatically widened when maxLength is changed or say a longer enum value is added?
WARNING -- this migration did not work when I tested it with Redshift RDB Loader. The new field got added to the table, but the existing columns were not updated for the new lengths.
I opened this issue in RDB Loader to investigate it. I spent quite a while trying to pinpoint the error, but no luck yet.
I'm not comfortable with merging this release until we have found why RDB Loader did not alter the lengths as it should.
Force-pushed from d553671 to e8cb2aa
},
"method": {
  "description": "The method used to send the requests (GET or POST).",
  "type": ["string", "null"],
Hey @matus-tomlein just a warning, the "null" on this line does not do what you think it does! If you want this field to be nullable then you must also add null to the enum:
"enum": ["get", "post", null]
Otherwise, a null will get rejected during validation.
I know this is slightly surprising. You can test it out though using an online validator.
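To spell out the difference, here are the two variants side by side (a sketch; only the second accepts a null value for method):
"method": {
  "type": ["string", "null"],
  "enum": ["get", "post"]
}
"method": {
  "type": ["string", "null"],
  "enum": ["get", "post", null]
}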
Thanks @istreeter, didn't know about this! Fixed.
Force-pushed from e8cb2aa to 2f88aeb
Force-pushed from 2f88aeb to 56c4f3f
Technically I have no issues with the schemas, but there are a few nitpicks and a few questions on whether these are the schemas we want to be committing to for whatever use case they are being used to build (especially object 🤮)
"name" | ||
], | ||
"self": { | ||
"vendor": "io.snowplow.foundation", |
What does foundation represent/mean? Are these not the recipe schemas from Try Snowplow? Why not io.snowplow.recipe, or a final part that represents the "use case" we typically see this schema used in, like io.snowplow.content?
I was OK with the name "foundation" for the vendor, meaning like a collection of schemas that can be used when first starting off with Snowplow, before graduating to more advanced use cases when the user would want to design their own bespoke schemas.
But... I find the names of the schemas a bit strange. The names "content" and "conversion" don't mean anything to me, but maybe that's because I don't come from the world of web analytics.
I don't have any problem with the schemas though, so I'd be OK to approve this release.
foundation was the group consensus for what Ian describes. We didn't want to go with use-case-specific names in case we later want to add more advanced content / conversion / funnel schemas that would then clash with these.
I'd be happy with recipe too, I don't know who that decision is ultimately up to.
Sorry to be that guy...
Now that I understand what foundation means, as in foundational... I think I prefer it over recipes as it's more generic and allows these schemas to be used in a broader context than just the recipe context.
LGTM - can I get a green light when you lot get online and I'll go ahead and merge it!