Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example JSON transformer doesn't process escape sequences #415

Closed
hniksic opened this issue Jul 25, 2019 · 13 comments
Closed

Example JSON transformer doesn't process escape sequences #415

hniksic opened this issue Jul 25, 2019 · 13 comments
Labels
discussion Discuss new features or other changes

Comments

@hniksic
Copy link

hniksic commented Jul 25, 2019

The JSON transformer presented in the documentation doesn't handle the escape sequences in JSON strings. For example, copy the last full source (from Part 5 - Step 1) to a lark-json script and run it:

$ python lark-json /dev/stdin
"foo\nbar"
^D
foo\nbar

The expected output would be foo and bar in two separate lines.

An easy and (I hope) fast way to achieve this is to use the unicode_escape built-in codec, which also handles the \uhhhh sequences:

diff --git a/lark-json.orig b/lark-json
index 3d07bd6..8442239 100644
--- a/lark-json.orig
+++ b/lark-json
@@ -25,7 +25,7 @@ json_grammar = r"""
 
 class TreeToJson(Transformer):
     def string(self, (s,)):
-        return s[1:-1]
+        return s[1:-1].decode('utf-8').decode('unicode_escape')
     def number(self, (n,)):
         return float(n)
 

This will also modify string to return Unicode strings in Python 2. Since JSON is defined in terms of Unicode, this is probably the right thing to do, but if it's undesirable, the result could be re-encoded to UTF-8.

@hniksic
Copy link
Author

hniksic commented Jul 25, 2019

BTW I noticed this while working on my own parser and discovering that common.ESCAPED_STRING doesn't automagically process the escapes in the token. That's reasonable, but I was curious to see how the JSON code does it, only to find that it doesn't. :)

@erezsh
Copy link
Member

erezsh commented Jul 25, 2019

Well, it's not really the parser's job, and I didn't try to make a functional json parser, just to provide an example so newcomers can see how to use Lark.

I might include your fix as a comment in the tutorial, but I don't want to complicate it with non-parsing code.

@hniksic
Copy link
Author

hniksic commented Jul 25, 2019

No problem, it's just a little thing I noticed. Technically the change only affects the transformer, so the parser stays concerned with parsing. :)

But please note that at least some of the parsers you are comparing performance with do handle string escapes - parsimonious_json calls ast.literal_eval and funcparselib unescapes them manually. (Both approaches are significantly slower than decoding with the unicode_escape codec.) Adding escape processing to the Lark code would ensure a fair comparison.

@erezsh
Copy link
Member

erezsh commented Jul 25, 2019

That's a good point about benchmarking. I don't imagine it would change the results by much, though.

@erezsh erezsh added the discussion Discuss new features or other changes label Mar 8, 2020
@erezsh erezsh closed this as completed Mar 8, 2020
@hniksic
Copy link
Author

hniksic commented Mar 8, 2020

@erezsh Why is this closed without fix? The contributed patch fixes the parser without incurring a significant overhead.

@erezsh
Copy link
Member

erezsh commented Mar 8, 2020

I decided against it, because it doesn't fix Lark. It fixes an example program, that was written in order to teach, not to solve a real-life problem.

The result still wouldn't be a proper JSON parser, because there are many other edge cases to solve.

I'm not even sure your solution is 100% correct. It takes a lot of careful tests to be sure you "solved" unicode conversions. So by adding it I might be propagating an incorrect idiom. I'd much prefer adding literal_eval, even though it's slower, just to make sure I'm not missing anything.

@goodmami
Copy link

goodmami commented Mar 23, 2020

Sorry to comment on a closed issue, but this may be relevant for people turning up late:

The ast.literal_eval() function is not perfect for two (and a half) reasons:

  • it will evaluate some things that the JSON spec does not allow, such as octal escapes:

    >>> ast.literal_eval(r'"\100"')
    '@'
    >>> json.loads(r'"\100"')
    Traceback (most recent call last):
    [...]
    json.decoder.JSONDecodeError: Invalid \escape: line 1 column 2 (char 1)
  • The \/ escape sequence in JSON is not a Python escape sequence, so the escaping character itself becomes escaped:

    >>> ast.literal_eval(r'"\/"')
    '\\/'
    >>> json.loads(r'"\/"')
    '/'
  • (half reason) it's pretty slow

It is more correct and faster to just use something like regex replaces. E.g.:

import re

_json_unesc_re = re.compile(r'\\(["/\\bfnrt]|u[0-9A-Fa-f])')
_json_unesc_map = {
    '"': '"',
    '/': '/',
    '\\': '\\',
    'b': '\b',
    'f': '\f',
    'n': '\n',
    'r': '\r',
    't': '\t',
}

def _json_unescape(m):
    c = m.group(1)
    if c[0] == 'u':
        return chr(int(c[1:], 16))
    else:
        return _json_unesc_map[c]

def json_unescape(s):
    return _json_unesc_re.sub(_json_unescape, s[1:-1])

Use the json_unescape() function on the strings (with the quotes, or else remove the [1:-1] part).

(code from here)

By the way, I found some other places where the illustrative json parser doesn't follow the spec:

  • It allows non-whitespace data after the top-level element (see Add EOF symbol to match end of input #237)
  • It allows invalid JSON characters and escape sequences in strings
  • It allows + signs before numbers
  • It allows leading zeros in numbers

The following changes to the grammar seem to work:

    string : STRING
    STRING: "\"" INNER* "\""
    INNER: /[ !#-\[\]-\U0010ffff]*/
         | /\\(?:["\/\\bfnrt]|u[0-9A-Fa-f]{4})/

    NUMBER : INTEGER FRACTION? EXPONENT?
    INTEGER: ["-"] ("0" | "1".."9" INT?)
    FRACTION: "." INT
    EXPONENT: ("e"|"E") ["+"|"-"] INT

    %import common.INT

(code from here)

These changes collectively slow down things a little bit, but not terribly.

@erezsh
Copy link
Member

erezsh commented Mar 23, 2020

@goodmami Thanks for the additional information! I labeled the issue as "discussion", so it's definitely welcome. At the same time, I'm sure there are plenty of open-source json parsers with a very detailed escaping function, if anyone ever needs to seriously parse JSON by himself.

@hniksic
Copy link
Author

hniksic commented Mar 23, 2020

I think there is value in the JSON parser example being correct, even if it somewhat complicates the example. The other parsers seem to strive to do so, as pointed out in the issue description. Consider that LARK's Python parser makes a serious attempt to parse all of Python.

Examples that show how to solve real-world problems rather than simplifications are a great way to show the true capabilities of the software. Their existence often distinguishes between production-ready and toy projects, and Lark is definitely in the former category.

@goodmami
Copy link

@hniksic I'd agree with @erezsh that having a fully correct parser is not appropriate for the tutorial. All the little details distract from the purpose of giving a simple walkthrough of the Lark's features. However it wouldn't be a bad idea to mention the "next steps" for a fully valid parser, perhaps in the afterword, maybe even with a complete implementation in examples/json_parser.py.

Also, I forgot to mention my fix for trailing characters, but it's not very satisfying. I require a ; character (somewhat arbitrarily chosen) after the top-level element, and I append it to the string that's passed in to be parsed. I welcome other ideas for fixing this.

@erezsh
Copy link
Member

erezsh commented Mar 23, 2020

LARK's Python parser makes a serious attempt to parse all of Python.

maybe even with a complete implementation in examples/json_parser.py.

I would definitely welcome adding a real json parser to the examples, as long as an actual attempt is made at completeness and clean code. I could then link to it from the toy json parser.

But as for the tutorial code, I think it's best to leave it as-is.

@erezsh
Copy link
Member

erezsh commented Mar 23, 2020

To expand a little on what I mean by real attempt: Lark's example Python parser can parse every Python file in the built-in library, and a special Python file full of gotchas, made by someone who isn't me.

If you can get a JSON parser to that level of rigor, I will be happy to include it.

@goodmami
Copy link

Maybe something like this: http://www.json.org/JSON_checker/

See the linked test suite. Note, however, that this test suite expects the top-level element to only be an object or an array, but the spec allows other value types. I think the spec may have changed at some point and the tests were not updated, but it still is a useful set of examples.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Discuss new features or other changes
Projects
None yet
Development

No branches or pull requests

3 participants