YAML parser rejects Unicode surrogates. #2206
Comments
YAML 1.2 specification, first line of section 5.2:
I can get

My conclusion from all this is that surrogates are invalid in UTF-8, but the YAML parser ought to do something sensible when given surrogate code points written as \uD800-\uDFFF escapes. Throwing an error is not sensible, especially when the same input works fine as UTF-16 with a byte order mark. (I'd claim it ought to work without the BOM too: all the other characters come out fine, so we know the byte order was detected correctly.)
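To illustrate the UTF-8 versus UTF-16 point, here is a minimal sketch in Python (not yq or go-yaml, whose behaviour is not reproduced here). It assumes only the standard library; the helper names are made up for the illustration.

```python
import codecs

# U+10336 (GOTHIC LETTER IUJA) as a real character, not a pair of escapes.
ch = "\U00010336"

utf8_hex = ch.encode("utf-8").hex()       # a single four-byte UTF-8 sequence
utf16 = ch.encode("utf-16-be")
utf16_hex = utf16.hex()                   # two 16-bit units: a surrogate pair

# With a byte order mark in front, a UTF-16 decoder reassembles the pair.
from_bom = (codecs.BOM_UTF16_BE + utf16).decode("utf-16")


def lone_surrogate_is_encodable() -> bool:
    """Try to encode an unpaired high surrogate as UTF-8."""
    try:
        "\ud800".encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def surrogate_bytes_are_decodable() -> bool:
    """Try to decode the three bytes that would spell U+D800 in UTF-8."""
    try:
        b"\xed\xa0\x80".decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(utf8_hex, utf16_hex, from_bom == ch)
print(lone_surrogate_is_encodable(), surrogate_bytes_are_decodable())
```

So surrogates only exist as a UTF-16 encoding detail; a strict UTF-8 codec refuses them in both directions, which is consistent with the claim that they are invalid in UTF-8 itself.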
To be more explicit: The only spec-compliant way to represent 𐌶 in JSON is as So regardless of what the YAML spec says or doesn't say about surrogates in
Yeah, this is a known issue with the underlying go-yaml library, which hasn't received much love recently :(
Describe the bug
The YAML parser rejects Unicode surrogates, but the JSON parser accepts them. This breaks the expectation that you can parse JSON as if it were YAML.
I have a real-world example where the JSON response body from an HTTP call to an API endpoint contains surrogates (in string values), but I'll illustrate it with a single non-BMP character (U+10336 Gothic Letter Iuja, 𐌶):
So arguably the author of the JSON should have used \u{10336}, but that only works for ECMAScript strings. (Tested in the Firefox console, but `yq -pj` doesn't grok it. Firefox also accepts surrogates.)

YAML supports \U00010336, but that only works with `yq -py`. FWIW, the YAML 1.2 spec doesn't mention surrogates, but you can argue that they aren't "characters". I just need them to work...

(This sort of proves that YAML doesn't have JSON as a subset if you use the full ECMAScript string definition. The JSON spec only has \uxxxx, so that's fine, but you do need surrogates to reach outside the BMP.)
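For reference, the surrogate arithmetic behind "you need surrogates to reach outside the BMP" can be sketched as follows. This is a Python illustration using the standard `json` module, not yq; `to_surrogate_pair` is a hypothetical helper name.

```python
import json


def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10336)

# Build the JSON string literal that uses two \uXXXX escapes.
json_text = '"\\u{:04x}\\u{:04x}"'.format(high, low)

# Python's JSON decoder joins the pair back into one code point.
decoded = json.loads(json_text)

print(json_text, hex(ord(decoded)), len(decoded))
```

A JSON parser that combines the two escapes ends up with a single character, which is the behaviour one would hope for from a YAML parser reading the same text.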
Version of yq: 4.44.5
Operating system: linux amd64
Installed via: binary release
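A possible workaround for the JSON-as-YAML case described above, sketched in Python and not tested against yq: re-serialize the JSON so that non-BMP characters appear literally instead of as surrogate escapes, then hand the result to the YAML parser. `expand_surrogate_escapes` is a hypothetical name, and the sketch assumes every surrogate in the input is properly paired.

```python
import json


def expand_surrogate_escapes(json_text: str) -> str:
    """Round-trip JSON so \\uD8xx\\uDCxx escape pairs become literal characters.

    Caveats: number formatting and whitespace may change, and an unpaired
    surrogate in the input would still come out as an unpaired surrogate.
    """
    return json.dumps(json.loads(json_text), ensure_ascii=False)


raw = '{"letter": "\\ud800\\udf36"}'
normalized = expand_surrogate_escapes(raw)

print(normalized)
```

The output contains 𐌶 as an ordinary four-byte UTF-8 character once written to a file or pipe, so no surrogate escapes are left for the YAML side to reject.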