-
Notifications
You must be signed in to change notification settings - Fork 856
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Backout Unicode bare keys #979
base: main
Are you sure you want to change the base?
Conversation
This backs out the unicode bare keys from toml-lang#891. This does *not* mean we can't include it in a future 1.2 (or 1.3, or whatever); just that right now there doesn't seem to be a clear consensus regarding to normalisation and which characters to include. It's already the most discussed single issue in the history of TOML. I kind of hate doing this as it seems a step backwards; in principle I think we *should* have this so I'm not against the idea of the feature as such, but things seem to be at a bit of a stalemate right now, and this will allow TOML to move forward on other fronts. It hasn't come up *that* often; the issue (toml-lang#687) wasn't filed until 2019, and has only 11 upvotes. Other than that, the issue was raised only once before in 2015 as far as I can find (toml-lang#337). I also can't really find anyone asking for it in any of the HN threads on TOML. Reverting this means we can go forward releasing TOML 1.1, giving people access to the much more frequently requested relaxing of inline tables (toml-lang#516, with 122 upvotes, and has come up on HN as well) and some other more minor things (e.g. `\e` has 12 upvotes in toml-lang#715). Basically, a lot more people are waiting for this, and all things considered this seems a better path forward for now, unless someone comes up with a proposal which addresses all issues (I tried and thus far failed). I proposed this over here a few months ago, and the responses didn't seem too hostile to the idea: toml-lang#966 (comment)
Yeah, rolling back all the work done to allow more bare keys is a very bad idea. Don't confuse bare keys with the issues surrounding potential normalization of key strings. Quoted keys could always raise these issues. And we can wait until after v1.1.0 is released to address normalization issues, should they arise in real life. @arp242 I think I know where you're coming from by floating this proposal. I once opened a PR, #663, that I hesitated to suggest but seemed to have some support in its favor, in order to advance discussion. Ultimately it was closed, but it certainly advanced discussion, and it prompted a counter proposal, #676, that did have broad consensus. That said, someone may think to push a PR to move the normalization debate. I would do so if I had more time, but I've left plenty of proposal language in the discussions which anyone is invited to use. |
I don't really see it so much as "rolling back" but rather as "delaying". No work is really lost. I'd just like to move forward with TOML. Right now no one seems 100% happy with how things are with the unicode keys (although the reasons for the unhappiness are very varied, which is part of the problem) and it's been like that for over a year now.
Yes, but very few people write quoted keys so it's much less of an issue (almost a non-issue in reality). It's not a surprise this has gone largely unnoticed/undiscussed before the Unicode PR was merged.
Possibly solutions such as "only normalized forms will be allowed in TOML bare keys" will be closed off due to backwards compatibility, so I don't really agree with that. I have said this many times before, but I very much think we should consider both issues together. |
I did spend quite a bit of time trying to come up with a better proposal which should address all issues by the way, but it's not so easy. IMHO by far the best way would be to spec out all problems so they can't occur, but then the Unicode ranges get much larger – I managed to reduce it to about 100, which is still quite a lot. It's a problem a lot of environments have struggled with. I thought I mentioned that in the commit message but it seems that got lost in editing. |
I'm 1000% opposed to this proposal and I'm very relieved to see that it doesn't seem to have any realistic chance anyway. As I see it, bare keys written in any natural language will the the big new feature and advantage of TOML 1.1 if and when it comes. As for the million dollar question: I don't think TOML 1.1 is blocked by the normalization issue, rather it's blocked by nobody working on it. I once tried to triage the issues that should be addressed in some way or other for it to reach completion. Possibly one or two of the issues mentioned there have meanwhile been resolved, and probably a few new ones have emerged, but I think that analysis is still largely valid. In an ideal world, there would be somebody working on these issues, trying to find a suitable solution (which might include closing as "out of scope" or deferring after 1.1) for each of them, unless finally an RC for 1.1 is ready to be published. In a perfect world, that person or persons would be TOML maintainers – that world, sadly, is not in sight, but even some other community member(s) should be able to make progress if they focus on this task. Maybe there are any volunteers here willing to pick it up? (I can't, due to other obligations.) @arp242: Maybe it would be something for you, seeing that you seem very interested in getting 1.1 finished? |
I don't think we really need to do anything for TOML 1.1 @ChristianSi; anything can be in any future version, so any "milestones" are rather arbitrary, IMHO. |
Honestly, I'm not opposed to this idea as strongly as others seem to be. While I do agree that allowing bare keys across the board would be an improvement, the argument for "this is complicated and no one has the time to work on it, let's get to this later" is a sound one.
FWIW, I agree with this. Don't use what's been put in the milestones as a hammar to shut down a line of argument. :) I think it comes down to how important we think it is to have this in this release rather than delaying it for a future release. IIUC, @arp242 thinks that it's worth getting a release out sooner, while @eksortso and @ChristianSi think that it's worth resolving the outstanding issues with bare keys and including it in the release. I don't know yet what my preferences are -- I haven't caught up on the entirety of the bare keys discussion yet, to have an updated sense of the complexities involved -- but I don't think that this "doesn't seem to have any realistic chance" is an accurate read. It is very much possible that we go down this route, if the bare keys discussion doesn't reach a resolution / have a feasible solution that there's sufficient agreement on. Two arguments against deferring this for a new release are that:
Footnotes
|
It's not even "lack of time", it's that I simply can't find a solution which would satisfy everyone. This may be a failure on my part though, but it's not just "lack of time", it's "we can't agree on a good way to proceed on all these issues". Every possible resolution I can see has strong opposition from at least one person. |
Let's keep in mind that the true problem is not at all this one, but #966. And I think that one is solvable. But even if we decide to leave the normalization issue unsolved in 1.1 and address it only later, nothing would be lost. |
That would be unwise, because we will no longer be able to solve them by changing the specification. There's also #954. |
Yea, I'd mentally bundled those together as "bare keys need changes before being good-enough for release". |
I believe this would greatly improve things and solves all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO. I don't think there are ANY perfect solutions here and ANY solution is a trade-off. That said, I do believe some trade-offs are better than others, and after looking at a bunch of different options I believe this is by far the best path for TOML. Advantages: - This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much. We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we will take will need to include letters and digits from all scripts. This is the strongest argument in favour of this and the biggest improvement: we can't really do anything wrong here in a way that we can't correct later. Being conservative is probably the right way forward. - This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point. For quoted keys normalisation is mostly a non-issue because few people use them and the specification even strongly discourages people from using them, which is why this gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1] - It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!" Note that toml-lang#954 was NOT about "I want all emojis to work", but "this character works fine, but this very similar doesn't". This shows up in a number of things: a.toml: Input: ; = 42 # U+037E GREEK QUESTION MARK (Other_Punctuation) Error: line 1: expected '.' or '=', but got ';' instead b.toml: Input: · = 42 # # U+0387 GREEK ANO TELEIA (Other_Punctuation) Error: (none) c.toml: Input: – = 42 # U+2013 EN DASH (Dash_Punctuation) Error: line 1: expected '.' or '=', but got '–' instead d.toml: Input: ⁻ = 42 # U+207B SUPERSCRIPT MINUS (Math_Symbol) Error: (none) e.toml: Input: #x = "commented ... or is it?" # # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation) Error: (none) "Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all. People don't read specifications, nor should they. People try something and sees if it works. Now it seems to work on first approximation, and then (possibly months later) it seems to "break". From the user's perspective this seems like a bug in the TOML parser. There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't". In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrary while allowing other very similar characters. - This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more: '#' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation) '"' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation) '﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation) '﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol) '﹐' U+FE50 SMALL COMMA (Other_Punctuation) '︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation) '˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol) '՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation) '܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation) 'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter) '₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol) '⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation) '࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation) Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe. - Maps to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and while views differ (mostly because they're both) it seems to me that making it map *closer* is better. This is a minor issue, but it's nice. That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are: - The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present. However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table. The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the code adding multibyte support in the first case will probably be harder, with this range table being a minor part). - There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2. Go is Unicode 8.0. etc. I don't think this is really much of an issue in practice. I choose Unicode 9 as everyone supports this; I doubted a long time over it, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability while also future-proofing things. - ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look like, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational". I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running in to this limitation, and is really something that IETF should address in a new RFC or something "Extra Augmented BNF?" Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds. Fixes toml-lang#954 Fixes toml-lang#966 Fixes toml-lang#979 Ref toml-lang#687 Ref toml-lang#891 Ref toml-lang#941 [1]: Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like: [1950] Labour = [13_266_176, 315, 617] Conservative = [12_492_404, 298, 619] Liberal = [ 2_621_487, 9, 475] Sinn_Fein = [ 23_362, 0, 2] That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.
I believe this would greatly improve things and solves all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO. I don't think there are ANY perfect solutions here and that *anything* will be a trade-off. That said, I do believe some trade-offs are better than others, and after looking at a bunch of different options I believe this is by far the best path for TOML. Advantages: - This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much. We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we will take will need to include letters and digits from all scripts. This is a strong argument in favour of this and a huge improvement: we can't really do anything wrong here in a way that we can't correct later. Being conservative for these type of things is is good! - This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point. For quoted keys normalisation is mostly a non-issue because few people use them and the specification even strongly discourages people from using them, which is why this gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1] - It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!" Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but "this character works fine, but this very similar doesn't". This shows up in a number of things aside from emojis: a.toml: Input: ; = 42 # U+037E GREEK QUESTION MARK (Other_Punctuation) Error: line 1: expected '.' or '=', but got ';' instead b.toml: Input: · = 42 # # U+0387 GREEK ANO TELEIA (Other_Punctuation) Error: (none) c.toml: Input: – = 42 # U+2013 EN DASH (Dash_Punctuation) Error: line 1: expected '.' or '=', but got '–' instead d.toml: Input: ⁻ = 42 # U+207B SUPERSCRIPT MINUS (Math_Symbol) Error: (none) e.toml: Input: #x = "commented ... or is it?" # # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation) Error: (none) "Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all. People don't read specifications, nor should they. People try something and sees if it works. Now it seems to work on first approximation, and then (possibly months later) it seems to "break". It should either allow everything or nothing. This in-between is just horrible. From the user's perspective this seems like a bug in the TOML parser, but it's not: it's a bug in the specification. There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't". In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrary while allowing other very similar characters. - This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more: '#' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation) '"' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation) '﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation) '﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol) '﹐' U+FE50 SMALL COMMA (Other_Punctuation) '︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation) '˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol) '՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation) '܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation) 'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter) '₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol) '⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation) '࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation) Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe. - Maps to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and while views differ (mostly because they're both) it seems to me that making it map *closer* is better. This is a minor issue, but it's nice. That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are: - The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present. However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table. The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the code adding multibyte support in the first case will probably be harder, with this range table being a minor part). - There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2. Go is Unicode 8.0. etc. I don't think this is really much of an issue in practice. I choose Unicode 9 as everyone supports this; I doubted a long time over it, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability while also future-proofing things. - ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look like, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational". I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running in to this limitation, and is really something that IETF should address in a new RFC or something "Extra Augmented BNF?" Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds. Fixes toml-lang#954 Fixes toml-lang#966 Fixes toml-lang#979 Ref toml-lang#687 Ref toml-lang#891 Ref toml-lang#941 --- [1]: Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like: [1950] Labour = [13_266_176, 315, 617] Conservative = [12_492_404, 298, 619] Liberal = [ 2_621_487, 9, 475] Sinn_Fein = [ 23_362, 0, 2] That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.
I believe this would greatly improve things and solves all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO. I don't think there are ANY perfect solutions here and that *anything* will be a trade-off. That said, I do believe some trade-offs are better than others, and after looking at a bunch of different options I believe this is by far the best path for TOML. Advantages: - This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much. We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we will take will need to include letters and digits from all scripts. This is a strong argument in favour of this and a huge improvement: we can't really do anything wrong here in a way that we can't correct later. Being conservative for these type of things is is good! - This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point. For quoted keys normalisation is mostly a non-issue because few people use them and the specification even strongly discourages people from using them, which is why this gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1] - It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!" Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but "this character works fine, but this very similar doesn't". This shows up in a number of things aside from emojis: a.toml: Input: ; = 42 # U+037E GREEK QUESTION MARK (Other_Punctuation) Error: line 1: expected '.' or '=', but got ';' instead b.toml: Input: · = 42 # # U+0387 GREEK ANO TELEIA (Other_Punctuation) Error: (none) c.toml: Input: – = 42 # U+2013 EN DASH (Dash_Punctuation) Error: line 1: expected '.' or '=', but got '–' instead d.toml: Input: ⁻ = 42 # U+207B SUPERSCRIPT MINUS (Math_Symbol) Error: (none) e.toml: Input: #x = "commented ... or is it?" # # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation) Error: (none) "Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all. People don't read specifications, nor should they. People try something and sees if it works. Now it seems to work on first approximation, and then (possibly months later) it seems to "break". It should either allow everything or nothing. This in-between is just horrible. From the user's perspective this seems like a bug in the TOML parser, but it's not: it's a bug in the specification. There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't". In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrary while allowing other very similar characters. - This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more: '#' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation) '"' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation) '﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation) '﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol) '﹐' U+FE50 SMALL COMMA (Other_Punctuation) '︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation) '˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol) '՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation) '܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation) 'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter) '₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol) '⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation) '࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation) Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe. - Maps to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and while views differ (mostly because they're both) it seems to me that making it map *closer* is better. This is a minor issue, but it's nice. That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are: - The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present. However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table. You already need this with TOML 1.0, it's just that the range tables become larger. The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the tomlc99 code adding multibyte support at all will be the harder part, with this range table being a minor part). - There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2. Go is Unicode 8.0. etc. I don't think this is really much of an issue in practice. I choose Unicode 9 as everyone supports this; I doubted a long time over it, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability while also future-proofing things. - ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look like, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational". I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running in to this limitation, and is really something that IETF should address in a new RFC or something ("Extra Augmented BNF"?) Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds. Fixes toml-lang#954 Fixes toml-lang#966 Fixes toml-lang#979 Ref toml-lang#687 Ref toml-lang#891 Ref toml-lang#941 --- [1]: Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like: [1950] Labour = [13_266_176, 315, 617] Conservative = [12_492_404, 298, 619] Liberal = [ 2_621_487, 9, 475] Sinn_Fein = [ 23_362, 0, 2] That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.
I believe this would greatly improve things and solves all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO. I don't think there are ANY perfect solutions here and that *anything* will be a trade-off. That said, I do believe some trade-offs are better than others, and I've made it no secret that I feel the current trade-off is a bad one. After looking at a bunch of different options I believe this is by far the best path for TOML. Advantages: - This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much. We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we will take will need to include letters and digits from all scripts. This is a strong argument in favour of this and a huge improvement: we can't really do anything wrong here in a way that we can't correct later, unlike what we have now, which is "well I think it probably won't cause any problems, based on what these 5 European/American guys think, but if it does: we won't be able to correct it". Being conservative for these type of things is good! - This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point. For quoted keys normalisation is mostly a non-issue because few people use them, which is why this gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1] - It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!" Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but "this character works fine, but this very similar doesn't". This shows up in a number of things aside from emojis: a.toml: Input: ; = 42 # U+037E GREEK QUESTION MARK (Other_Punctuation) Error: line 1: expected '.' or '=', but got ';' instead b.toml: Input: · = 42 # # U+0387 GREEK ANO TELEIA (Other_Punctuation) Error: (none) c.toml: Input: – = 42 # U+2013 EN DASH (Dash_Punctuation) Error: line 1: expected '.' or '=', but got '–' instead d.toml: Input: ⁻ = 42 # U+207B SUPERSCRIPT MINUS (Math_Symbol) Error: (none) e.toml: Input: #x = "commented ... or is it?" # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation) Error: (none) "Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all (especially outside of the Latin character range by the way, which shows the Euro/US bias in how it's written). People don't read specifications in great detail, nor should they. People try something and sees if it works. Now it seems to work on first approximation, and then (possibly months or years later) it seems to "suddenly break". From the user's perspective this seems like a bug in the TOML parser, but it's not: it's a bug in the specification. It should either allow everything or nothing. This in-between is confusing and horrible. There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't". In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrary while allowing other very similar characters. - This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more: '#' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation) '"' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation) '﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation) '﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol) '﹐' U+FE50 SMALL COMMA (Other_Punctuation) '︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation) '˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol) '՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation) '܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation) 'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter) '₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol) '⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation) '࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation) Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe. Confusables is also an issue with different scripts (Latin and Cyrillic is well-known), but this is less of an issue since it's not syntax, and also something that's fundamentally unavoidable in any multi-script environment. - Maps closer to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and while views differ (mostly because they're both) it seems to me that making it map *closer* is better. This is a minor issue, but it's nice. That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are: - The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present. However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table. You already need this with TOML 1.0, it's just that the range tables become larger. The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the tomlc99 code adding multibyte support at all will be the harder part, with this range table being a minor part). - There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2. Go is Unicode 8.0. etc. I don't think this is really much of an issue in practice. I choose Unicode 9 as everyone supports this; I doubted a long time over it, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability while also future-proofing things. - ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look like, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational". I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running in to this limitation, and is really something that IETF should address in a new RFC or something ("Extra Augmented BNF"?) Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds. Fixes toml-lang#954 Fixes toml-lang#966 Fixes toml-lang#979 Ref toml-lang#687 Ref toml-lang#891 Ref toml-lang#941 --- [1]: Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like: [1950] Labour = [13_266_176, 315, 617] Conservative = [12_492_404, 298, 619] Liberal = [ 2_621_487, 9, 475] Sinn_Fein = [ 23_362, 0, 2] That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.
I believe this would greatly improve things and solves all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO. I don't think there are ANY perfect solutions here and that *anything* will be a trade-off. That said, I do believe some trade-offs are better than others, and I've made it no secret that I feel the current trade-off is a bad one. After looking at a bunch of different options I believe this is by far the best path for TOML. Advantages: - This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much. We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we will take will need to include letters and digits from all scripts. This is a strong argument in favour of this and a huge improvement: we can't really do anything wrong here in a way that we can't correct later, unlike what we have now, which is "well I think it probably won't cause any problems, based on what these 5 European/American guys think, but if it does: we won't be able to correct it". Being conservative for these type of things is good! - This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point. For quoted keys normalisation is mostly a non-issue because few people use them, which is why this gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1] - It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!" Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but "this character works fine, but this very similar doesn't". This shows up in a number of things aside from emojis: a.toml: Input: ; = 42 # U+037E GREEK QUESTION MARK (Other_Punctuation) Error: line 1: expected '.' or '=', but got ';' instead b.toml: Input: · = 42 # # U+0387 GREEK ANO TELEIA (Other_Punctuation) Error: (none) c.toml: Input: – = 42 # U+2013 EN DASH (Dash_Punctuation) Error: line 1: expected '.' or '=', but got '–' instead d.toml: Input: ⁻ = 42 # U+207B SUPERSCRIPT MINUS (Math_Symbol) Error: (none) e.toml: Input: #x = "commented ... or is it?" # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation) Error: (none) "Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all (especially outside of the Latin character range by the way, which shows the Euro/US bias in how it's written). People don't read specifications in great detail, nor should they. People try something and sees if it works. Now it seems to work on first approximation, and then (possibly months or years later) it seems to "suddenly break". From the user's perspective this seems like a bug in the TOML parser, but it's not: it's a bug in the specification. It should either allow everything or nothing. This in-between is confusing and horrible. There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't". In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrary while allowing other very similar characters. - This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more: '#' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation) '"' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation) '﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation) '﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol) '﹐' U+FE50 SMALL COMMA (Other_Punctuation) '︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation) '˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol) '՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation) '܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation) 'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter) '₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol) '⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation) '࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation) Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe. Confusables is also an issue with different scripts (Latin and Cyrillic is well-known), but this is less of an issue since it's not syntax, and also something that's fundamentally unavoidable in any multi-script environment. - Maps closer to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and while views differ (mostly because they're both) it seems to me that making it map *closer* is better. This is a minor issue, but it's nice. That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are: - The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present. However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table. You already need this with TOML 1.0, it's just that the range tables become larger. The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the tomlc99 code adding multibyte support at all will be the harder part, with this range table being a minor part). - There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2. Go is Unicode 8.0. etc. I don't think this is really much of an issue in practice. I choose Unicode 9 as everyone supports this; I doubted a long time over it, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability while also future-proofing things. - ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look like, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational". I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running in to this limitation, and is really something that IETF should address in a new RFC or something ("Extra Augmented BNF"?) Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds. Fixes toml-lang#954 Fixes toml-lang#966 Fixes toml-lang#979 Ref toml-lang#687 Ref toml-lang#891 Ref toml-lang#941 --- [1]: Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like: [1950] Labour = [13_266_176, 315, 617] Conservative = [12_492_404, 298, 619] Liberal = [ 2_621_487, 9, 475] Sinn_Fein = [ 23_362, 0, 2] That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.
I don't think this entire feature is worth it: the issue doesn't have that many upvotes, the PR has zero, when I did this PR I searched for people asking for this feature in HN threads about TOML and I couldn't find anyone (I saw a lot of people complaining about a lot of stuff, it is HN after all, but not this), YAML allows it but no one seems to be using it, we haven't had people ask about it in issues or anywhere else. In short, it's not really solving a significant practical problem. I say this as someone who is actually using it right now (my And that would be acceptable if it's a small thing with simple implications. But it's not. TOML is for configuration files. That's what we're always saying, right? How often do you want non-ASCII in configuration files? Rarely. My The best we can do is "we're going to do a kind of bad half-Unicode implementation, but because not a lot of people will use it, probably, it's not going to be a problem, probably, but if people do it will be a problem." Meh. That's basically what we have now. #990 improves on that IMHO, but still doesn't really solve everything. The only way we can solve it is by going full-in Unicode, but people don't want that either, for reasonable reasons, so we just can't win. We've spent more time on this than it will ever save. This is increasingly just a sunk-cost fallacy. |
This backs out the unicode bare keys from #891.
This does not mean we can't include it in a future 1.2 (or 1.3, or whatever); just that right now there doesn't seem to be a clear consensus regarding to normalisation and which characters to include. It's already the most discussed single issue in the history of TOML.
I kind of hate doing this as it seems a step backwards; in principle I think we should have this so I'm not against the idea of the feature as such, but things seem to be at a bit of a stalemate right now, and this will allow TOML to move forward on other fronts.
It hasn't come up that often; the issue (#687) wasn't filed until 2019, and has only 11 upvotes. Other than that, the issue was raised only once before in 2015 as far as I can find (#337). I also can't really find anyone asking for it in any of the HN threads on TOML.
Reverting this means we can go forward releasing TOML 1.1, giving people access to the much more frequently requested relaxing of inline tables (#516, with 122 upvotes, and has come up on HN as well) and some other more minor things (e.g.
\e
has 12 upvotes in #715).Basically, a lot more people are waiting for this, and all things considered this seems a better path forward for now, unless someone comes up with a proposal which addresses all issues (I tried and thus far failed).
I proposed this over here a few months ago, and the responses didn't seem too hostile to the idea:
#966 (comment)