Example JSON transformer doesn't process escape sequences #415

hniksic · 2019-07-25T09:21:27Z

The JSON transformer presented in the documentation doesn't handle the escape sequences in JSON strings. For example, copy the last full source (from Part 5 - Step 1) to a lark-json script and run it:

$ python lark-json /dev/stdin
"foo\nbar"
^D
foo\nbar

The expected output would be foo and bar in two separate lines.

An easy and (I hope) fast way to achieve this is to use the unicode_escape built-in codec, which also handles the \uhhhh sequences:

diff --git a/lark-json.orig b/lark-json
index 3d07bd6..8442239 100644
--- a/lark-json.orig
+++ b/lark-json
@@ -25,7 +25,7 @@ json_grammar = r"""
 
 class TreeToJson(Transformer):
     def string(self, (s,)):
-        return s[1:-1]
+        return s[1:-1].decode('utf-8').decode('unicode_escape')
     def number(self, (n,)):
         return float(n)

This will also modify string to return Unicode strings in Python 2. Since JSON is defined in terms of Unicode, this is probably the right thing to do, but if it's undesirable, the result could be re-encoded to UTF-8.

The text was updated successfully, but these errors were encountered:

hniksic · 2019-07-25T09:26:10Z

BTW I noticed this while working on my own parser and discovering that common.ESCAPED_STRING doesn't automagically process the escapes in the token. That's reasonable, but I was curious to see how the JSON code does it, only to find that it doesn't. :)

erezsh · 2019-07-25T10:12:56Z

Well, it's not really the parser's job, and I didn't try to make a functional json parser, just to provide an example so newcomers can see how to use Lark.

I might include your fix as a comment in the tutorial, but I don't want to complicate it with non-parsing code.

hniksic · 2019-07-25T10:38:11Z

No problem, it's just a little thing I noticed. Technically the change only affects the transformer, so the parser stays concerned with parsing. :)

But please note that at least some of the parsers you are comparing performance with do handle string escapes - parsimonious_json calls ast.literal_eval and funcparselib unescapes them manually. (Both approaches are significantly slower than decoding with the unicode_escape codec.) Adding escape processing to the Lark code would ensure a fair comparison.

erezsh · 2019-07-25T10:51:10Z

That's a good point about benchmarking. I don't imagine it would change the results by much, though.

hniksic · 2020-03-08T11:12:53Z

@erezsh Why is this closed without fix? The contributed patch fixes the parser without incurring a significant overhead.

erezsh · 2020-03-08T12:19:10Z

I decided against it, because it doesn't fix Lark. It fixes an example program, that was written in order to teach, not to solve a real-life problem.

The result still wouldn't be a proper JSON parser, because there are many other edge cases to solve.

I'm not even sure your solution is 100% correct. It takes a lot of careful tests to be sure you "solved" unicode conversions. So by adding it I might be propagating an incorrect idiom. I'd much prefer adding literal_eval, even though it's slower, just to make sure I'm not missing anything.

goodmami · 2020-03-23T10:07:27Z

Sorry to comment on a closed issue, but this may be relevant for people turning up late:

The ast.literal_eval() function is not perfect for two (and a half) reasons:

it will evaluate some things that the JSON spec does not allow, such as octal escapes:

>>> ast.literal_eval(r'"\100"')
'@'
>>> json.loads(r'"\100"')
Traceback (most recent call last):
[...]
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 2 (char 1)

The \/ escape sequence in JSON is not a Python escape sequence, so the escaping character itself becomes escaped:
```
>>> ast.literal_eval(r'"\/"')
'\\/'
>>> json.loads(r'"\/"')
'/'
```
(half reason) it's pretty slow

It is more correct and faster to just use something like regex replaces. E.g.:

import re

_json_unesc_re = re.compile(r'\\(["/\\bfnrt]|u[0-9A-Fa-f])')
_json_unesc_map = {
    '"': '"',
    '/': '/',
    '\\': '\\',
    'b': '\b',
    'f': '\f',
    'n': '\n',
    'r': '\r',
    't': '\t',
}

def _json_unescape(m):
    c = m.group(1)
    if c[0] == 'u':
        return chr(int(c[1:], 16))
    else:
        return _json_unesc_map[c]

def json_unescape(s):
    return _json_unesc_re.sub(_json_unescape, s[1:-1])

Use the json_unescape() function on the strings (with the quotes, or else remove the [1:-1] part).

(code from here)

By the way, I found some other places where the illustrative json parser doesn't follow the spec:

It allows non-whitespace data after the top-level element (see Add EOF symbol to match end of input #237)
It allows invalid JSON characters and escape sequences in strings
It allows + signs before numbers
It allows leading zeros in numbers

The following changes to the grammar seem to work:

    string : STRING
    STRING: "\"" INNER* "\""
    INNER: /[ !#-\[\]-\U0010ffff]*/
         | /\\(?:["\/\\bfnrt]|u[0-9A-Fa-f]{4})/

    NUMBER : INTEGER FRACTION? EXPONENT?
    INTEGER: ["-"] ("0" | "1".."9" INT?)
    FRACTION: "." INT
    EXPONENT: ("e"|"E") ["+"|"-"] INT

    %import common.INT

(code from here)

These changes collectively slow down things a little bit, but not terribly.

erezsh · 2020-03-23T11:27:51Z

@goodmami Thanks for the additional information! I labeled the issue as "discussion", so it's definitely welcome. At the same time, I'm sure there are plenty of open-source json parsers with a very detailed escaping function, if anyone ever needs to seriously parse JSON by himself.

hniksic · 2020-03-23T11:41:56Z

I think there is value in the JSON parser example being correct, even if it somewhat complicates the example. The other parsers seem to strive to do so, as pointed out in the issue description. Consider that LARK's Python parser makes a serious attempt to parse all of Python.

Examples that show how to solve real-world problems rather than simplifications are a great way to show the true capabilities of the software. Their existence often distinguishes between production-ready and toy projects, and Lark is definitely in the former category.

goodmami · 2020-03-23T12:55:58Z

@hniksic I'd agree with @erezsh that having a fully correct parser is not appropriate for the tutorial. All the little details distract from the purpose of giving a simple walkthrough of the Lark's features. However it wouldn't be a bad idea to mention the "next steps" for a fully valid parser, perhaps in the afterword, maybe even with a complete implementation in examples/json_parser.py.

Also, I forgot to mention my fix for trailing characters, but it's not very satisfying. I require a ; character (somewhat arbitrarily chosen) after the top-level element, and I append it to the string that's passed in to be parsed. I welcome other ideas for fixing this.

erezsh · 2020-03-23T15:49:48Z

LARK's Python parser makes a serious attempt to parse all of Python.

maybe even with a complete implementation in examples/json_parser.py.

I would definitely welcome adding a real json parser to the examples, as long as an actual attempt is made at completeness and clean code. I could then link to it from the toy json parser.

But as for the tutorial code, I think it's best to leave it as-is.

erezsh · 2020-03-23T15:51:27Z

To expand a little on what I mean by real attempt: Lark's example Python parser can parse every Python file in the built-in library, and a special Python file full of gotchas, made by someone who isn't me.

If you can get a JSON parser to that level of rigor, I will be happy to include it.

goodmami · 2020-03-23T15:59:51Z

Maybe something like this: http://www.json.org/JSON_checker/

See the linked test suite. Note, however, that this test suite expects the top-level element to only be an object or an array, but the spec allows other value types. I think the spec may have changed at some point and the tests were not updated, but it still is a useful set of examples.

erezsh added the discussion Discuss new features or other changes label Mar 8, 2020

erezsh closed this as completed Mar 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Example JSON transformer doesn't process escape sequences #415

Example JSON transformer doesn't process escape sequences #415

hniksic commented Jul 25, 2019

hniksic commented Jul 25, 2019

erezsh commented Jul 25, 2019

hniksic commented Jul 25, 2019 •

edited

Loading

erezsh commented Jul 25, 2019

hniksic commented Mar 8, 2020

erezsh commented Mar 8, 2020

goodmami commented Mar 23, 2020 •

edited

Loading

erezsh commented Mar 23, 2020

hniksic commented Mar 23, 2020

goodmami commented Mar 23, 2020

erezsh commented Mar 23, 2020

erezsh commented Mar 23, 2020 •

edited

Loading

goodmami commented Mar 23, 2020

Example JSON transformer doesn't process escape sequences #415

Example JSON transformer doesn't process escape sequences #415

Comments

hniksic commented Jul 25, 2019

hniksic commented Jul 25, 2019

erezsh commented Jul 25, 2019

hniksic commented Jul 25, 2019 • edited Loading

erezsh commented Jul 25, 2019

hniksic commented Mar 8, 2020

erezsh commented Mar 8, 2020

goodmami commented Mar 23, 2020 • edited Loading

erezsh commented Mar 23, 2020

hniksic commented Mar 23, 2020

goodmami commented Mar 23, 2020

erezsh commented Mar 23, 2020

erezsh commented Mar 23, 2020 • edited Loading

goodmami commented Mar 23, 2020

hniksic commented Jul 25, 2019 •

edited

Loading

goodmami commented Mar 23, 2020 •

edited

Loading

erezsh commented Mar 23, 2020 •

edited

Loading