Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release R136 #1217

Merged
merged 8 commits into from
Sep 20, 2022
Merged

Release R136 #1217

merged 8 commits into from
Sep 20, 2022

Conversation

dilyand
Copy link
Contributor

@dilyand dilyand commented Aug 4, 2022

No description provided.

@dilyand dilyand marked this pull request as draft August 4, 2022 09:46
@dilyand dilyand marked this pull request as ready for review August 4, 2022 10:43
@dilyand dilyand requested review from istreeter and a team August 4, 2022 10:43
Copy link
Contributor

@istreeter istreeter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @dilyand please can you explain more about what you are trying to achieve with this schema change, and therefore why you chose these values for maxLength?

We spoke earlier about this being "the schema change to end all schema changes" -- meaning we're all fed up of updating enrich each time the yauaa enrichment sees an unexpected user agent string. So we talked about lengthening each field to protect it from future changes.

But with that in mind, I can see you have chosen a mixture of 256, 1000, and 1024 as max length for different fields. How did you decide on these values?

},
"deviceName": {
"description": "Example: Google Nexus 6",
"type": ["string", "null"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change from type string to type [string, null]? I'm not against it if you have a good reason. It struck me as odd because enrich doesn't output nulls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are optional fields, so presumably at some point they may be sent with a null value. We even have it as a check in igluctl.

I appreciate we currently do not do that in enrich, but if we change it in the future, we won't need to worry about the schema not allowing it. It's part of trying to make the schema more future-proof.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut feeling at the moment is like this:

If you're going to allow nulls for some fields, then you need to allow nulls for all fields, or else it's not consistent between fields. Which means allow nulls for the enum fields too, like operatingSystemClass. Which means add null to the enum array.

If you're adding null to the enum array then you need to be very very sure that RDB Loader will not choke on the schema migration. Remember that RDB Loader immediately (and probably wrongly) tries to migrate the table as soon as the schema is published; it does not wait until it sees the first 1-0-4 event. Remember also, we need to be sure that old versions of RDB Loader also don't choke on the schema migration, otherwise we might break oss pipelines just by publishing a schema.

Now RDB Loader probably handles this OK. I'm just saying we need to be 100% sure before merging this.

For that reason.... personally I would avoid adding the nulls, and put it back to be more like how it was in 1-0-3. There is no reason whatsoever to add nulls to this schema. I see no reason why enrich will ever change in future and to start emitting nulls.

Yes we want to make the schema future-proof, but I see it as future-proofing against things we are not in control of, like crazy user agent strings. Whereas nulls are something we are completely in control of, so we don't need to future-proof against nulls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is a sensible idea. I was trying to make igluctl happy, but I agree that in this instance, where we're in control of this, there is no need to worry.

},
"operatingSystemClass": {
"description": "See https://yauaa.basjes.nl/README-Output.html",
"type": ["string", "null"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You added "null" as a type... but this does not make any difference, unless you also add null to the enum array. As currently written, null is still not an allowed value for operatingSystemClass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did not realise that. I think if we keep null as an allowed type, I'll add it to the enum.

@dilyand
Copy link
Contributor Author

dilyand commented Aug 5, 2022

Hey @dilyand please can you explain more about what you are trying to achieve with this schema change, and therefore why you chose these values for maxLength?

We spoke earlier about this being "the schema change to end all schema changes" -- meaning we're all fed up of updating enrich each time the yauaa enrichment sees an unexpected user agent string. So we talked about lengthening each field to protect it from future changes.

But with that in mind, I can see you have chosen a mixture of 256, 1000, and 1024 as max length for different fields. How did you decide on these values?

I put most of the reasoning in the issue #1216, but very briefly:

  • All fields that are extracted from the useragent string have a limit of 1000 because this is the max length for the useragent field in atomic.events. So, even if the whole string is interpreted as a single property, it still can't exceed that value.
  • All fields that are deduced (rather than extracted) have a limit of 128 / 256. This somewhat arbitrary. I just wanted to increase the existing limits which seemed a bit too strict. My thinking here is that these values are controlled by the YAUAA library, not by whoever puts together the useragent strings, and so are less likely to fluctuate widely.
  • The fields with length 1024 were already like that. I do not want to lower this limit because that is not a backward-compatible change. But in practice, we should never have anything higher than 1000 because of the limitation for the atomic useragent field.

Copy link
Contributor

@istreeter istreeter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@@ -0,0 +1,234 @@
{
"$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
"description": "Schema for an entity context generated by the YAUAA enrichment after parsing the user agent",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Entity context or context entity?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"context entity" feels more right to me. I'd probably just go for "entity" though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it back to just 'context'. I think the word entity confuses things in this case. This schema does not describe an entity, but rather just a grab bag of things that could be deduced from the useragent.

@dilyand dilyand requested a review from a team August 9, 2022 13:15
@oguzhanunlu
Copy link
Member

oguzhanunlu commented Aug 16, 2022

I opened a PR in enrich to upgrade yauaa from 5.23 to 7.4.0

yauaa 7.4.0 has WebviewAppNameVersion as well (since v6.4), it is not documented here

is there a specific reason we didn't include webviewAppNameVersion field in this new version of the schema?

@adatzer adatzer mentioned this pull request Aug 16, 2022
@oguzhanunlu
Copy link
Member

I added a separate commit to add webviewAppNameVersion cc @istreeter @dilyand

@adatzer
Copy link
Contributor

adatzer commented Aug 19, 2022

Could we also merge #1221 into r136? cc @oguzhanunlu @dilyand

@istreeter
Copy link
Contributor

I'm happy with the amendment made to this PR since I last approved it.

@dilyand dilyand force-pushed the release/r136 branch 4 times, most recently from b440d28 to dd6a49f Compare August 23, 2022 14:37
Copy link
Contributor

@paulboocock paulboocock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Couple of nit pick comments but happy for this to go out once the changelog is tidied.

CHANGELOG Outdated
Comment on lines 3 to 6
Add nl.basjes/yauaa_context/jsonschema/1-0-4 (#1216)
Extend copyright to 2022 (close #1218)
Update links in Readme (close #1219)
Add CONTRIBUTING.md (close #1220)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think we usually have the word close in these lines

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -0,0 +1,234 @@
{
"$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
"description": "Schema for an entity context generated by the YAUAA enrichment after parsing the user agent",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"context entity" feels more right to me. I'd probably just go for "entity" though.

@matus-tomlein
Copy link
Contributor

Added a new version of the remote_config schema reviewed in #1226, also updated the changelog to include that, cc @paulboocock

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@@ -0,0 +1,225 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@istreeter doing a diff with 1-0-0 a lot of these fields now have a maximum as well as a few now allowing object, null pairing over just object. Just want to confirm that the addition of new limits and type groupings is not going to cause issues in any of our loaders / data-warehouses?

I am not entirely certain if this schema gets tracked into pipelines to be fair but if it was would these changes get automatically represented in their respective columns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is still valid, just wanted to say that the schema is not being used in pipelines, it is just to help developers better define their configurations (some IDEs may use it for validating the JSONs).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this on a pipeline with a loader. Even if it's not expected to be used in tracking, it was worth testing.

It is completely safe, because the field configurationBundle is a complex object, so RDB transformer simply stringifies the whole field, and RDB Loader loads it as a string. Any change you make to a sub-field (e.g. configurationBundle.namespace) does not bother the transformer/loader at all.

@@ -0,0 +1,238 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question again actually @istreeter - do all of the columns get automatically widened when maxLength is changed or say a longer enum value is added?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING -- this migration did not work when I tested in with Redshift RDB Loader. The new field got added to the table, but the existing columns were not updated for the new lengths.

I opened this issue in rdb loader to investigate it. I spent quite a while trying to pinpoint the error, but no luck yet.

I'm not comfortable with merging this release until we have found why RDB Loader did not alter the lengths as it should.

},
"method": {
"description": "The method used to send the requests (GET or POST).",
"type": ["string", "null"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @matus-tomlein just a warning, the "null" on this line does not do what you think it does! If you want this field to be nullable then you must also add null to the enum:

"enum": ["get", "post", null]

Otherwise, a null will get rejected during validation.

I know this is slightly surprising. You can test it out though using an online validator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @istreeter, didn't know about this! Fixed.

Copy link
Contributor

@paulboocock paulboocock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically I have no issues with the schemas, but there's a few nitpicks and a few questions on if these are the schemas we want to be commiting to for whatever use case they are being used to build (especially object 🤮)

"name"
],
"self": {
"vendor": "io.snowplow.foundation",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does foundation represent/mean? Are these not the recipe schemas from Try Snowplow? Why not io.snowplow.recipe or a final part that represents the "usecase" we typically see this schema used in io.snowplow.content?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was OK with the name "foundation" for the vendor, meaning like a collection of schemas that can be used when first starting off with Snowplow, before graduating to more advanced use cases when the user would want to design their own bespoke schemas.

But... I find the names of the schemas a bit strange. The names "content" and "conversion" don't mean anything to me, but maybe that's because I don't come from the world of web analytics.

I don't have any problem with the schemas though, so I'd be OK to approve this release.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

foundation was the group consensus for what Ian describes

we didn't want to go with use-case specific names in case we later want to add more advanced content / conversion / funnel schemas that would then clash with these

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy with recipe too, I don't who that decision is ultimately up to

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be that guy...

Now that I understand what foundation means, as in foundational... I think I prefer it over receipes as its more generic and allows these schemas to be used in a more broader context than just the recipe context.

schemas/io.snowplow.foundation/object/jsonschema/1-0-0 Outdated Show resolved Hide resolved
Copy link
Member

@jbeemster jbeemster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - can I get a green-light when you lot get online and ill go ahead and merge it!

@jbeemster jbeemster merged commit cb00a49 into master Sep 20, 2022
@istreeter istreeter mentioned this pull request Oct 24, 2022
@cksnp cksnp deleted the release/r136 branch February 21, 2024 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants