Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backout Unicode bare keys #979

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

arp242
Copy link
Contributor

@arp242 arp242 commented Jun 2, 2023

This backs out the unicode bare keys from #891.

This does not mean we can't include it in a future 1.2 (or 1.3, or whatever); just that right now there doesn't seem to be a clear consensus regarding to normalisation and which characters to include. It's already the most discussed single issue in the history of TOML.

I kind of hate doing this as it seems a step backwards; in principle I think we should have this so I'm not against the idea of the feature as such, but things seem to be at a bit of a stalemate right now, and this will allow TOML to move forward on other fronts.

It hasn't come up that often; the issue (#687) wasn't filed until 2019, and has only 11 upvotes. Other than that, the issue was raised only once before in 2015 as far as I can find (#337). I also can't really find anyone asking for it in any of the HN threads on TOML.

Reverting this means we can go forward releasing TOML 1.1, giving people access to the much more frequently requested relaxing of inline tables (#516, with 122 upvotes, and has come up on HN as well) and some other more minor things (e.g. \e has 12 upvotes in #715).

Basically, a lot more people are waiting for this, and all things considered this seems a better path forward for now, unless someone comes up with a proposal which addresses all issues (I tried and thus far failed).

I proposed this over here a few months ago, and the responses didn't seem too hostile to the idea:
#966 (comment)

This backs out the unicode bare keys from toml-lang#891.

This does *not* mean we can't include it in a future 1.2 (or 1.3, or
whatever); just that right now there doesn't seem to be a clear
consensus regarding to normalisation and which characters to include.
It's already the most discussed single issue in the history of TOML.

I kind of hate doing this as it seems a step backwards; in principle I
think we *should* have this so I'm not against the idea of the feature
as such, but things seem to be at a bit of a stalemate right now, and
this will allow TOML to move forward on other fronts.

It hasn't come up *that* often; the issue (toml-lang#687) wasn't filed until
2019, and has only 11 upvotes. Other than that, the issue was raised
only once before in 2015 as far as I can find (toml-lang#337). I also can't
really find anyone asking for it in any of the HN threads on TOML.

Reverting this means we can go forward releasing TOML 1.1, giving people
access to the much more frequently requested relaxing of inline tables
(toml-lang#516, with 122 upvotes, and has come up on HN as well) and some other
more minor things (e.g. `\e` has 12 upvotes in toml-lang#715).

Basically, a lot more people are waiting for this, and all things
considered this seems a better path forward for now, unless someone
comes up with a proposal which addresses all issues (I tried and thus
far failed).

I proposed this over here a few months ago, and the responses didn't
seem too hostile to the idea:
toml-lang#966 (comment)
@eksortso
Copy link
Contributor

eksortso commented Jun 3, 2023

Yeah, rolling back all the work done to allow more bare keys is a very bad idea.

Don't confuse bare keys with the issues surrounding potential normalization of key strings. Quoted keys could always raise these issues. And we can wait until after v1.1.0 is released to address normalization issues, should they arise in real life.

@arp242 I think I know where you're coming from by floating this proposal. I once opened a PR, #663, that I hesitated to suggest but seemed to have some support in its favor, in order to advance discussion. Ultimately it was closed, but it certainly advanced discussion, and it prompted a counter proposal, #676, that did have broad consensus.

That said, someone may think to push a PR to move the normalization debate. I would do so if I had more time, but I've left plenty of proposal language in the discussions which anyone is invited to use.

@arp242
Copy link
Contributor Author

arp242 commented Jun 3, 2023

rolling back all the work done

I don't really see it so much as "rolling back" but rather as "delaying". No work is really lost.

I'd just like to move forward with TOML. Right now no one seems 100% happy with how things are with the unicode keys (although the reasons for the unhappiness are very varied, which is part of the problem) and it's been like that for over a year now.

Don't confuse bare keys with the issues surrounding potential normalization of key strings. Quoted keys could always raise these issues.

Yes, but very few people write quoted keys so it's much less of an issue (almost a non-issue in reality). It's not a surprise this has gone largely unnoticed/undiscussed before the Unicode PR was merged.

And we can wait until after v1.1.0 is released to address normalization issues, should they arise in real life.

Possibly solutions such as "only normalized forms will be allowed in TOML bare keys" will be closed off due to backwards compatibility, so I don't really agree with that. I have said this many times before, but I very much think we should consider both issues together.

@arp242
Copy link
Contributor Author

arp242 commented Jun 3, 2023

Ultimately it was closed, but it certainly advanced discussion, and it prompted a counter proposal, #676, that did have broad consensus.

I did spend quite a bit of time trying to come up with a better proposal which should address all issues by the way, but it's not so easy. IMHO by far the best way would be to spec out all problems so they can't occur, but then the Unicode ranges get much larger – I managed to reduce it to about 100, which is still quite a lot.

It's a problem a lot of environments have struggled with. I thought I mentioned that in the commit message but it seems that got lost in editing.

@ChristianSi
Copy link
Contributor

ChristianSi commented Jun 4, 2023

I'm 1000% opposed to this proposal and I'm very relieved to see that it doesn't seem to have any realistic chance anyway. As I see it, bare keys written in any natural language will the the big new feature and advantage of TOML 1.1 if and when it comes.

As for the million dollar question: I don't think TOML 1.1 is blocked by the normalization issue, rather it's blocked by nobody working on it. I once tried to triage the issues that should be addressed in some way or other for it to reach completion. Possibly one or two of the issues mentioned there have meanwhile been resolved, and probably a few new ones have emerged, but I think that analysis is still largely valid.

In an ideal world, there would be somebody working on these issues, trying to find a suitable solution (which might include closing as "out of scope" or deferring after 1.1) for each of them, unless finally an RC for 1.1 is ready to be published. In a perfect world, that person or persons would be TOML maintainers – that world, sadly, is not in sight, but even some other community member(s) should be able to make progress if they focus on this task.

Maybe there are any volunteers here willing to pick it up? (I can't, due to other obligations.) @arp242: Maybe it would be something for you, seeing that you seem very interested in getting 1.1 finished?

@arp242
Copy link
Contributor Author

arp242 commented Jun 6, 2023

I don't think we really need to do anything for TOML 1.1 @ChristianSi; anything can be in any future version, so any "milestones" are rather arbitrary, IMHO.

@pradyunsg
Copy link
Member

pradyunsg commented Aug 28, 2023

Honestly, I'm not opposed to this idea as strongly as others seem to be. While I do agree that allowing bare keys across the board would be an improvement, the argument for "this is complicated and no one has the time to work on it, let's get to this later" is a sound one.

any "milestones" are rather arbitrary, IMHO.

FWIW, I agree with this. Don't use what's been put in the milestones as a hammar to shut down a line of argument. :)


I think it comes down to how important we think it is to have this in this release rather than delaying it for a future release. IIUC, @arp242 thinks that it's worth getting a release out sooner, while @eksortso and @ChristianSi think that it's worth resolving the outstanding issues with bare keys and including it in the release.

I don't know yet what my preferences are -- I haven't caught up on the entirety of the bare keys discussion yet, to have an updated sense of the complexities involved -- but I don't think that this "doesn't seem to have any realistic chance" is an accurate read. It is very much possible that we go down this route, if the bare keys discussion doesn't reach a resolution / have a feasible solution that there's sufficient agreement on.

Two arguments against deferring this for a new release are that:

  • It's unlikely that implementers will pick a new TOML version to just support this additional complexity (better bare keys), so it makes sense to couple the functionality that people want with complicated-but-valuable functionality that enables non-english TOML configuration files to be easy to write as well.
  • As a configuration format specification which would end up serving as a very-low-level piece of language and tooling ecosystems, each release of TOML requires a bunch of work from those downstream ecosystems. Fewer releases will mean less work for those downstream ecosystems.1

Footnotes

  1. I'm biased, as I expect that I'll likely end up doing a bunch of work on the Python side of this.

@arp242
Copy link
Contributor Author

arp242 commented Aug 28, 2023

"this is complicated and no one has the time to work on it, let's get to this later" is a sound one.

It's not even "lack of time", it's that I simply can't find a solution which would satisfy everyone. This may be a failure on my part though, but it's not just "lack of time", it's "we can't agree on a good way to proceed on all these issues". Every possible resolution I can see has strong opposition from at least one person.

@ChristianSi
Copy link
Contributor

ChristianSi commented Sep 1, 2023

Let's keep in mind that the true problem is not at all this one, but #966. And I think that one is solvable. But even if we decide to leave the normalization issue unsolved in 1.1 and address it only later, nothing would be lost.

@arp242
Copy link
Contributor Author

arp242 commented Sep 1, 2023

if we decide to leave the normalization issue unsolved in 1.1 and address it only later, nothing would be lost.

That would be unwise, because we will no longer be able to solve them by changing the specification.

There's also #954.

@pradyunsg
Copy link
Member

Yea, I'd mentally bundled those together as "bare keys need changes before being good-enough for release".

arp242 added a commit to arp242/toml that referenced this pull request Sep 22, 2023
I believe this would greatly improve things and solves all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and ANY solution is a
trade-off. That said, I do believe some trade-offs are better than
others, and after looking at a bunch of different options I believe this
is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is the strongest argument in favour of this and the biggest
  improvement: we can't really do anything wrong here in a way that we
  can't correct later. Being conservative is probably the right way
  forward.

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them and the specification even strongly discourages people from
  using them, which is why this gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work", but "this
  character works fine, but this very similar doesn't". This shows up in
  a number of things:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   #x = "commented ... or is it?"  # # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try
  something and sees if it works. Now it seems to work on first
  approximation, and then (possibly months later) it seems to "break".

  From the user's perspective this seems like a bug in the TOML parser.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrary while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '#' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '"' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed
  whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
  while views differ (mostly because they're both) it seems to me that
  making it map *closer* is better. This is a minor issue, but it's
  nice.

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the code adding multibyte support in the first case will
  probably be harder, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0. etc. I don't think this is really much of an issue
  in practice.

  I choose Unicode 9 as everyone supports this; I doubted a long time
  over it, and we can also use a more recent version. I feel this gives
  us a nice balance between reasonable interoperability while also
  future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look like,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running in to this limitation, and is really something that IETF
  should address in a new RFC or something "Extra Augmented BNF?"

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

[1]: Aside: I encountered this just the other day as I created a TOML
     file with all UK election results since 1945, which looks like:

         [1950]
         Labour       = [13_266_176, 315, 617]
         Conservative = [12_492_404, 298, 619]
         Liberal      = [ 2_621_487,   9, 475]
         Sinn_Fein    = [    23_362,   0,   2]

     That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just
     wrote it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this pull request Sep 23, 2023
I believe this would greatly improve things and solves all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and after looking at a bunch of different options I believe
this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later. Being conservative for these type of things is is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them and the specification even strongly discourages people from
  using them, which is why this gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar doesn't". This shows
  up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   #x = "commented ... or is it?"  # # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try
  something and sees if it works. Now it seems to work on first
  approximation, and then (possibly months later) it seems to "break".

  It should either allow everything or nothing. This in-between is just
  horrible. From the user's perspective this seems like a bug in the
  TOML parser, but it's not: it's a bug in the specification.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrary while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '#' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '"' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed
  whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
  while views differ (mostly because they're both) it seems to me that
  making it map *closer* is better. This is a minor issue, but it's
  nice.

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the code adding multibyte support in the first case will
  probably be harder, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0. etc. I don't think this is really much of an issue
  in practice.

  I choose Unicode 9 as everyone supports this; I doubted a long time
  over it, and we can also use a more recent version. I feel this gives
  us a nice balance between reasonable interoperability while also
  future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look like,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running in to this limitation, and is really something that IETF
  should address in a new RFC or something "Extra Augmented BNF?"

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this pull request Sep 23, 2023
I believe this would greatly improve things and solves all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and after looking at a bunch of different options I believe
this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later. Being conservative for these type of things is is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them and the specification even strongly discourages people from
  using them, which is why this gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar doesn't". This shows
  up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   #x = "commented ... or is it?"  # # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try
  something and sees if it works. Now it seems to work on first
  approximation, and then (possibly months later) it seems to "break".

  It should either allow everything or nothing. This in-between is just
  horrible. From the user's perspective this seems like a bug in the
  TOML parser, but it's not: it's a bug in the specification.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrary while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '#' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '"' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed
  whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
  while views differ (mostly because they're both) it seems to me that
  making it map *closer* is better. This is a minor issue, but it's
  nice.

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table. You
  already need this with TOML 1.0, it's just that the range tables
  become larger.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the tomlc99 code adding multibyte support at all will be the
  harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0. etc. I don't think this is really much of an issue
  in practice.

  I choose Unicode 9 as everyone supports this; I doubted a long time
  over it, and we can also use a more recent version. I feel this gives
  us a nice balance between reasonable interoperability while also
  future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look like,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running in to this limitation, and is really something that IETF
  should address in a new RFC or something ("Extra Augmented BNF"?)

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this pull request Sep 23, 2023
I believe this would greatly improve things and solves all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and I've made it no secret that I feel the current
trade-off is a bad one. After looking at a bunch of different options I
believe this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later, unlike what we have now, which is "well I think it probably
  won't cause any problems, based on what these 5 European/American guys
  think, but if it does: we won't be able to correct it".

  Being conservative for these type of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them, which is why this gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar doesn't". This shows
  up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   #x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all (especially outside of the
  Latin character range by the way, which shows the Euro/US bias in how
  it's written).

  People don't read specifications in great detail, nor should they.
  People try something and sees if it works. Now it seems to work on
  first approximation, and then (possibly months or years later) it
  seems to "suddenly break". From the user's perspective this seems like
  a bug in the TOML parser, but it's not: it's a bug in the
  specification. It should either allow everything or nothing. This
  in-between is confusing and horrible.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrary while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '#' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '"' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

  Confusables is also an issue with different scripts (Latin and
  Cyrillic is well-known), but this is less of an issue since it's not
  syntax, and also something that's fundamentally unavoidable in any
  multi-script environment.

- Maps closer to identifiers in more (though not all) languages. We
  discussed whether TOML keys are "strings" or "identifiers" last week
  in toml-lang#966 and while views differ (mostly because they're both) it seems
  to me that making it map *closer* is better. This is a minor issue,
  but it's nice.

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table. You
  already need this with TOML 1.0, it's just that the range tables
  become larger.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the tomlc99 code adding multibyte support at all will be the
  harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0. etc. I don't think this is really much of an issue
  in practice.

  I choose Unicode 9 as everyone supports this; I doubted a long time
  over it, and we can also use a more recent version. I feel this gives
  us a nice balance between reasonable interoperability while also
  future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look like,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running in to this limitation, and is really something that IETF
  should address in a new RFC or something ("Extra Augmented BNF"?)

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this pull request Sep 23, 2023
I believe this would greatly improve things and solves all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and I've made it no secret that I feel the current
trade-off is a bad one. After looking at a bunch of different options I
believe this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later, unlike what we have now, which is "well I think it probably
  won't cause any problems, based on what these 5 European/American guys
  think, but if it does: we won't be able to correct it".

  Being conservative for these type of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them, which is why this gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar doesn't". This shows
  up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   #x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all (especially outside of the
  Latin character range by the way, which shows the Euro/US bias in how
  it's written).

  People don't read specifications in great detail, nor should they.
  People try something and sees if it works. Now it seems to work on
  first approximation, and then (possibly months or years later) it
  seems to "suddenly break". From the user's perspective this seems like
  a bug in the TOML parser, but it's not: it's a bug in the
  specification. It should either allow everything or nothing. This
  in-between is confusing and horrible.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrary while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '#' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '"' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

  Confusables is also an issue with different scripts (Latin and
  Cyrillic is well-known), but this is less of an issue since it's not
  syntax, and also something that's fundamentally unavoidable in any
  multi-script environment.

- Maps closer to identifiers in more (though not all) languages. We
  discussed whether TOML keys are "strings" or "identifiers" last week
  in toml-lang#966 and while views differ (mostly because they're both) it seems
  to me that making it map *closer* is better. This is a minor issue,
  but it's nice.

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table. You
  already need this with TOML 1.0, it's just that the range tables
  become larger.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the tomlc99 code adding multibyte support at all will be the
  harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0. etc. I don't think this is really much of an issue
  in practice.

  I choose Unicode 9 as everyone supports this; I doubted a long time
  over it, and we can also use a more recent version. I feel this gives
  us a nice balance between reasonable interoperability while also
  future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look like,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running in to this limitation, and is really something that IETF
  should address in a new RFC or something ("Extra Augmented BNF"?)

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
@arp242
Copy link
Contributor Author

arp242 commented Sep 30, 2023

I don't think this entire feature is worth it: the issue doesn't have that many upvotes, the PR has zero, when I did this PR I searched for people asking for this feature in HN threads about TOML and I couldn't find anyone (I saw a lot of people complaining about a lot of stuff, it is HN after all, but not this), YAML allows it but no one seems to be using it, we haven't had people ask about it in issues or anywhere else.

In short, it's not really solving a significant practical problem. I say this as someone who is actually using it right now (my Sinn_Féin key that I mentioned the other day). Yes, it will be useful on occasion, but I feel that >95% of this is "adding it for the sake of adding it".

And that would be acceptable if it's a small thing with simple implications. But it's not.

TOML is for configuration files. That's what we're always saying, right? How often do you want non-ASCII in configuration files? Rarely. My Sinn_Féin usage is as a "data file" rather than "config file".

The best we can do is "we're going to do a kind of bad half-Unicode implementation, but because not a lot of people will use it, probably, it's not going to be a problem, probably, but if people do it will be a problem."

Meh.

That's basically what we have now. #990 improves on that IMHO, but still doesn't really solve everything. The only way we can solve it is by going full-in Unicode, but people don't want that either, for reasonable reasons, so we just can't win.

We've spent more time on this than it will ever save. This is increasingly just a sunk-cost fallacy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants