[v2] Unicode punctuation: Go Big Or Go Home? #386
Oh, this is an even deeper can of worms. Number signs for keywords? Fullwidth characters for keyword names? OTOH, if we do this and we nail it, it means people can write KDL documents in their native language and input modes, and it'll Just Work, and there won't be any surprises? This also affects reserved KQL characters in identifiers (and KQL itself).
Summarizing some of the thoughts I had from discussing this with you on fedi, hopefully not adding too much noise here:
So…the sum there is that I find it a commendable goal to support human language, and the context-free stuff (
So it turns out, after I looked into it, too, that UAX #44 defines character classes for opening/closing brace pairs, using the This gets us 90% of the way there, I think. The rest is picking out the exceptions we want for one meaning or another, and reserving those for, e.g., "curly" semantics vs "type annotation" semantics. We can also exclude some as needed. There are also things like the EU's guidance on quotation marks, but I think Unicode's tagging would be enough for us tbh.
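The exact property named in the comment above is missing from this copy of the thread. As a rough sketch, assuming it refers to the General_Category values for paired punctuation (Ps/Pe for open/close, Pi/Pf for initial/final quotes), Python's standard `unicodedata` module can list the candidates. The function name `chars_by_category` is made up for illustration.

```python
import sys
import unicodedata

# General_Category values for paired punctuation:
# Ps = open, Pe = close, Pi = initial quote, Pf = final quote.
PAIRED_CATEGORIES = ("Ps", "Pe", "Pi", "Pf")


def chars_by_category():
    """Group every code point whose General_Category is Ps, Pe, Pi or Pf."""
    groups = {cat: [] for cat in PAIRED_CATEGORIES}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        cat = unicodedata.category(ch)
        if cat in groups:
            groups[cat].append(ch)
    return groups


groups = chars_by_category()
for cat, chars in groups.items():
    # Print code points rather than the characters themselves so the
    # output is readable on any terminal.
    sample = " ".join(f"U+{ord(c):04X}" for c in chars[:8])
    print(cat, len(chars), sample)
```

The categories alone do not say which closer belongs to which opener, and some conventions cross categories: U+201E („) is Ps, while the U+201C (“) that closes it in German is Pi. That is the kind of exception the comment says would need to be picked out by hand.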
Sorry, it took me a sec to take a break from stuff so I could sit and read this properly: I understand the pathological issue, but I'm wondering: what if we just let "any It does mean that things can look weird, but they won't be dangerously wrong. It just means you might have a document that looks like As far as the
I think the worst-case scenario is that KDL would cover a fairly wide swath of natural language with maybe some edge cases. Maybe that's not a very big deal, especially as I'm parachuting into this issue without a proper understanding of the project's goals, attracted by the interesting problem. I was thinking of cases like
I think it boils down to whether or not anybody would ever want to write a string whose contents include other punctuation that could have opened the string as KDL's parser sees it. If that's not something you want to support, then I think doing it the way you describe where you always open with
One of the goals of KDL, for me, is to have a human-oriented configuration language. Humans are messy, they type things in funny ways, and it's important for KDL to support that and protect them from the worst cases (like, we disallow some bidi stuff because it can be malicious and has little actual usage). If they want to write it literally into a string, they can use raw strings ( (which reminds me, I guess we need NFKC for
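What NFKC was wanted for is cut off in this copy, so the following is only background on what that normalization form does to the punctuation under discussion. It is a small sketch with a made-up helper name (`nfkc_report`), using only the standard library: fullwidth forms fold to ASCII, while curly quotes and guillemets are left alone.

```python
import unicodedata


def nfkc_report(text):
    """Return (original, normalized) pairs for the characters NFKC changes."""
    changed = []
    for ch in text:
        folded = unicodedata.normalize("NFKC", ch)
        if folded != ch:
            changed.append((ch, folded))
    return changed


# Fullwidth = { } followed by curly quotes and guillemets.
sample = "\uff1d\uff5b\uff5d\u201c\u201d\u00ab\u00bb"
for ch, folded in nfkc_report(sample):
    target = " ".join(f"U+{ord(c):04X}" for c in folded)
    print(f"U+{ord(ch):04X} -> {target}")
```

So normalization by itself would cover the fullwidth variants but not the quote styles, which would still need explicit handling.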
I have to admit, this does scare me. My main issues are that this adds a lot of complexity that may require a lot of complex rules to manage, and that you have to keep up with updates to Unicode and with the understanding of what characters mean. I personally vote to keep the document syntax as simple as possible. This reminds me of another file format called Collada. It is an XML 3D asset/scene format that basically allowed the exporter to dump the data in any way it wanted (Z up vs Y up, calling things blend shapes or shape keys, etc). This made it incredibly hard for importers to consume the data, and it's basically why you don't see much support for that format. The reason FBX and glTF are more successful is that they expect the data to be formatted in a certain way and make the exporters do the hard work, so that it is easier for importers to behave. I bring this up as an example of what could happen if this extra complexity gets introduced. But that's my two cents. I do agree with the idea of making the document format really great for humans, but I'm concerned about how this will become a maintenance burden and bad for machines.
@scott-wilson I'm heading in the direction of "look up the current table of Unicode characters under the specific tags we want, and name every pair in the spec, in specific tables (just like we do with newlines and whitespace and equals right now), and possibly, MAYBE, add a clause that says 'implementations MAY extend this table if future revisions of Unicode introduce new pairs'", but that last clause is… not very likely to fire, and if it does fire, we have the tools to specify consistent semantics. In the end, I see this working the same way as our other Unicode support stuff (with the tables), not some mysterious vague mention that leaves questions in the air. Does that help you?
That does help, but I'm still worried about some of the rules with stuff like opening and closing characters. For example, if we had something like Also, I admit that my worries could be 100% unfounded. Right now alarm bells are ringing in my brain, and I'm still trying to understand what it is that is triggering my reaction to this idea.
@scott-wilson from what we were talking about above, non-"matching" openers and closers would be valid, as long as they're the same "class", but that's not necessarily how we have to do it. We could require they be matched by their paired opener. The idea is that yes, your example would ideally "just work", because, honestly, it looks like it should. I don't see much of a reason to say it shouldn't.
I do think it's dangerously wrong that I would be incapable of writing If we're allowing a bunch of quote styles, this suggests we should also allow the ASCII Unless (if I'm reading between the lines correctly on a later comment of yours) you're requiring that paired quoting characters come in pairs? So my first example using French quotes is fine, because the ending French quote associates with the opening French quote, and thus doesn't close the string? (ASCII apostrophe would still be a problem.)
Or hm maybe I misinterpreted your sentence
and you're suggesting not that there's pair tracking, but that using an opening quote character in a string at all is invalid, since it would attempt to pair with the quote that actually opened the string, and that's invalid? If so I really don't like that. It feels like a big footgun that I'd have to escape or raw-string all quote characters, ever.
Thinking about The Way Things Are Done, it does make sense that we would do pair tracking, which is what every programming language that lets you use both ' and " for strings does. So in that case you WOULD be able to write your example string as is, and only escape things if you were using guillemets as your opener/closer. How does that sound, @tabatkins?
If we're tracking the "appropriate" closer for the given opener, and you don't need to escape anything inside the string except for the appropriate closer, then I'm a lot happier, yeah. (There might be multiple valid closers for a given opener, per @SnoopJ's example of (I'm neutral to weakly positive on the overall change; your reasoning makes sense, but it's relatively unorthodox in this space.)
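The pair-tracking rule agreed on above can be sketched in a few lines of Python. This is an illustration only, not how any KDL implementation works: the `CLOSERS` table and the `read_string` name are hypothetical, the table entries are examples rather than a proposed list, and the backslash handling is simplified to "take the next character literally".

```python
# Hypothetical opener -> allowed closers table; a real one would come from the spec.
CLOSERS = {
    '"': {'"'},
    "\u201c": {"\u201d"},            # “ ... ”
    "\u00ab": {"\u00bb"},            # « ... »
    "\u201e": {"\u201c", "\u201d"},  # „ ... “ or „ ... ” (one opener, two closers)
}


def read_string(src, start=0):
    """Read one quoted string beginning at src[start].

    Returns (contents, index_after_closer). Only a closer that belongs to
    the opener ends the string; any other quote character is ordinary
    content. A backslash makes the next character literal.
    """
    opener = src[start]
    if opener not in CLOSERS:
        raise ValueError(f"not a string opener: {opener!r}")
    closers = CLOSERS[opener]
    out = []
    i = start + 1
    while i < len(src):
        ch = src[i]
        if ch == "\\" and i + 1 < len(src):
            out.append(src[i + 1])
            i += 2
        elif ch in closers:
            return "".join(out), i + 1
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated string")


# A guillemet-quoted string containing ASCII double quotes, no escapes needed.
text, end = read_string('\u00abil a dit "oui"\u00bb reste')
print(ascii(text), end)
```

The only state the reader carries is the set of closers for the current opener, so a string opened with ASCII `"` can contain guillemets freely, and vice versa.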
I am both terrified and very excited by this. For what it's worth, the most common French keyboard layouts make it very difficult to type « guillemets ». I use a regular French PC keyboard and I have to rely on my OS to enter them (I use the compose key on Linux; on Windows I have to install a third-party tool). 99.9% of French people incorrectly use " instead. Also, typographically correct guillemets must include a narrow non-breaking space after the opening one and before the closing one. I am sure that a lot of other languages have little quirks like this. I don't think we can cover all the cases, but we are going to set expectations.
I highly doubt people using guillemets in their documents actually insert nnbsps into their source as a general rule, anyway. That sounds like something done by a typesetter. ^_^
More stuff in favor of quotes and how to implement them: CSS has a quotation system that automatically changes based on language: https://www.w3.org/TR/css-content-3/
Tho that's purely a display artifact, and doesn't auto-match anything - you have to provide the opening and closing quotes yourself. It just lets you write
While I don't use narrow non-breaking spaces, I do use regular non-breaking spaces. Most French people use straight quotes (because that is what is on their keyboard). Those who use guillemets use regular spaces, but word processors will replace them with non-breaking spaces (or even insert the space if it was not entered). I know nobody will write KDL in Word, but if the goal is to allow people to write KDL as they would write their natural language, people will add spaces and they will expect those spaces to be part of the punctuation, not the quoted string.
I do like this idea, but I worry about its effect on performance. It would drastically increase the number of cases parsers need to handle, which seems like it would make a decent difference. I don't know how KDL benchmarks now, and it's definitely not focused on performance from what I have seen, but I can't imagine it's irrelevant either, since you mention it's also meant to be used as a serialization language, and those have the singular goal of performance for their niche.
I'd be very surprised if a modern parser really slowed down from having to check a ~dozen cases for a certain token. It's really not a very big number.
Fair enough. Partially it's not knowing how many punctuation characters there are (maybe there's like 40 types of quotes, I don't know!), and with what is mentioned above they are paired rather than just "find an open, find a close", so the state would need to be stored. Both scale badly with a lot more options, but a few dozen is likely more than fine.
To update on this: I've recently started writing a lot of KDL by hand on my phone because I'm using (v1) in Iron Vault, and one thing that stood out is that I have to constantly remember to long-press on the double quotes to pick "programmer quotes". So, I think it's a great idea to specify this, at least for some of the usual suspects.
Terrifying. I'm sorry for being a bit sceptical; I don't want to spoil anyone's excitement, but please let me share some reasons why I think "go home" would be the better/safer choice here:
Thus, it seems to me, "go big" conflicts with most of the design principles of KDL:
I’m withdrawing this
So I've been working on the kdl-rs update for v2, and something that jumped out at me is that even though we're doing multiple Unicode = signs for property delimiters, we're still using "regular" characters for quotation marks and curly braces.

After thinking about it, I'm wondering if the equals sign thing is just silly? I do like the idea, at the security level, that foo=bar can't pass itself off as an argument with a string value of "foo=bar", as opposed to a property. That seems important!

But folks can still do:

    foo｛ bar ｝

(those are the fullwidth versions of the curly braces)

or:

    foo “bar”

(fancy quotes).

I think the only thing that would leave is the node termination semicolon, which also has various Unicode variants.

So the question is: Do we expand the treatment we gave = to all the other punctuation, thus making it so "a kdl document means basically what it looks like, no surprises", or do we roll back the equals change and surrender to our Unicode overlords altogether? I think we should go big or go home on this one. Doing it halfway doesn't feel right.

I do think it would be cool to include all the various Unicode variants, though. That's not a thing you see very often...
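To make the lookalike problem concrete, here is a small Python sketch that prints the code point and Unicode name of each ASCII delimiter next to a few visually similar characters. The `LOOKALIKES` table is an illustrative selection, not the list from the KDL spec, and `describe` is a made-up helper.

```python
import unicodedata

# ASCII delimiter -> a few lookalikes (illustrative selection only).
LOOKALIKES = {
    "=": ["\uff1d"],            # fullwidth equals sign
    "{": ["\uff5b"],            # fullwidth left curly bracket
    "}": ["\uff5d"],            # fullwidth right curly bracket
    '"': ["\u201c", "\u201d"],  # curly double quotes
    ";": ["\uff1b", "\u037e"],  # fullwidth semicolon, Greek question mark
}


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ascii_char, variants in LOOKALIKES.items():
    print(describe(ascii_char))
    for variant in variants:
        print("   looks like:", describe(variant))
```

Each pair renders almost identically in many fonts but is a distinct code point, which is why a parser that accepts only the ASCII form can read a document differently from how a person does.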