
Provide special treatment for human-written text #261

Open
chuwy opened this issue Aug 24, 2017 · 6 comments

@chuwy
Contributor

chuwy commented Aug 24, 2017

In the JSON Schema specification, the maxLength keyword specifies the number of characters in a field, not bytes. However, in Redshift (as in most databases, I believe), VARCHAR specifies a number of bytes, which can introduce a mismatch in text fields, usually those written by humans rather than computers.

This is generally not a problem, as analytical data typically contains ASCII text (written by computers), where the number of bytes precisely matches the number of characters.
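For illustration, a minimal Scala sketch of the mismatch (assuming UTF-8 encoding, which is what Redshift uses for VARCHAR):

```scala
val city = "Санкт-Петербург"

// JSON Schema's maxLength counts characters
println(city.length)                   // 15

// Redshift's VARCHAR(n) counts bytes
println(city.getBytes("UTF-8").length) // 29: each Cyrillic letter takes 2 bytes in UTF-8
```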

But at the same time, I can imagine an issue:

  1. A user is supposed to enter their city name in their native language (likely with non-ASCII characters)
  2. The web developer constrains the input field to 32 characters
  3. An analyst wrongly assumes that maxLength: 32 is the correct constraint
  4. Redshift truncates all non-ASCII city names to 16 characters

This could be done as part of #170 (format: "unicode", which specifies that the string has absolutely no structure) or a similar custom JSON Schema extension.
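A marked-up property might look like this (hypothetical: format is a standard keyword, but the "unicode" value would be our own extension):

```json
{
  "type": "string",
  "maxLength": 32,
  "format": "unicode"
}
```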

@kasper-agg

I would like to add to point 3 that it is not necessarily a mistake or a wrong assumption.

In our case we use the schema as the source of truth and generate DataTransferObjects from it, in order to comply with the schema.
In other words, if the schema defines a specific maxLength expressed in characters, then that is what the application (and developer) honors, and they should be able to trust that values are stored as the schema describes.

Maybe igluctl could "detect" whether specific storage engines need over-provisioning based on the schema requirements, and thus generate "larger" fields in the generated DDL (Redshift SQL schema) files? Something like the sketch below.
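A minimal Scala sketch of that idea (the names and the marker flag are illustrative; the 4× factor is UTF-8's maximum bytes per character, and 65535 is Redshift's VARCHAR limit in bytes):

```scala
// Hypothetical over-provisioning: widen VARCHAR for fields marked as
// human-written text, since maxLength counts characters while
// Redshift's VARCHAR(n) counts bytes
val RedshiftVarcharMax = 65535 // Redshift's maximum VARCHAR size, in bytes

def varcharSize(maxLength: Int, humanWritten: Boolean): Int =
  if (humanWritten) math.min(maxLength * 4, RedshiftVarcharMax) // UTF-8: up to 4 bytes per character
  else maxLength

// "maxLength": 32 marked as human-written => VARCHAR(128)
val ddl = s""""city" VARCHAR(${varcharSize(32, humanWritten = true)})"""
```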

@chuwy
Contributor Author

chuwy commented Aug 25, 2017

Hey @kasper-agg,

Yes, I agree that using JSON Schemas as the source of truth is a great idea, and this issue prevents engineers from doing that. I also agree that over-provisioning Redshift would be the right thing to do, but the problem is that we need to come up with a way to "mark" these fields, as they are a relatively narrow subset of cases.

@kasper-agg

Regardless of the solution, I must agree it sounds like a challenge. Even more so to fix it in a clean way...

@chuwy
Contributor Author

chuwy commented Aug 25, 2017

I'm quite keen to add custom extensions to JSON Schema (for example #170, but new properties as well; this is fully compatible with the standard) and I think this is the way to go here.

@alexanderdean
Member

I have a weird feeling we may end up having to drive a Draft 5 (?) of the JSON Schema spec, given there don't seem to be many other vendors taking a major interest in it...

@chuwy
Contributor Author

chuwy commented Aug 25, 2017

That would be lovely! I've seen efforts on the Draft 5 specification, but almost no implementations out there.
