Example JSON transformer doesn't process escape sequences #415
Comments
BTW I noticed this while working on my own parser and discovering that |
Well, it's not really the parser's job, and I didn't try to make a functional json parser, just to provide an example so newcomers can see how to use Lark. I might include your fix as a comment in the tutorial, but I don't want to complicate it with non-parsing code. |
No problem, it's just a little thing I noticed. Technically the change only affects the transformer, so the parser stays concerned with parsing. :) But please note that at least some of the parsers you are comparing performance with do handle string escapes - |
That's a good point about benchmarking. I don't imagine it would change the results by much, though. |
@erezsh Why is this closed without a fix? The contributed patch fixes the parser without incurring a significant overhead. |
I decided against it, because it doesn't fix Lark. It fixes an example program that was written to teach, not to solve a real-life problem. The result still wouldn't be a proper JSON parser, because there are many other edge cases to solve. I'm not even sure your solution is 100% correct. It takes a lot of careful tests to be sure you "solved" unicode conversions. So by adding it I might be propagating an incorrect idiom. I'd much prefer adding |
Sorry to comment on a closed issue, but this may be relevant for people turning up late: The ast.literal_eval() function is not perfect for two (and a half) reasons:
It is more correct and faster to just use something like regex replaces. E.g.:

    import re

    # Matches one JSON escape: a single-character escape, or \u plus exactly 4 hex digits.
    _json_unesc_re = re.compile(r'\\(["/\\bfnrt]|u[0-9A-Fa-f]{4})')
    _json_unesc_map = {
        '"': '"',
        '/': '/',
        '\\': '\\',
        'b': '\b',
        'f': '\f',
        'n': '\n',
        'r': '\r',
        't': '\t',
    }

    def _json_unescape(m):
        c = m.group(1)
        if c[0] == 'u':
            # \uXXXX: convert the hex digits to the corresponding code point.
            return chr(int(c[1:], 16))
        else:
            return _json_unesc_map[c]

    def json_unescape(s):
        # s is the raw string token, including its surrounding double quotes.
        return _json_unesc_re.sub(_json_unescape, s[1:-1])

Use the (code from here)

By the way, I found some other places where the illustrative json parser doesn't follow the spec:
The following changes to the grammar seem to work:
(code from here) These changes collectively slow down things a little bit, but not terribly. |
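
One case a per-escape replacement like the regex approach above does not cover is a character outside the Basic Multilingual Plane, which JSON writes as two consecutive \uXXXX escapes (a UTF-16 surrogate pair). Converting each half with chr() separately leaves two lone surrogates in the result. The following is only a sketch of one way to combine them; the names unescape_json_string, _ESCAPE_RE, _SIMPLE and _replace are made up for illustration and do not come from Lark or from the code in this thread.

```python
import json
import re

# Alternatives are tried left to right: surrogate pair first, then a single
# \uXXXX escape, then the one-character escapes.
_ESCAPE_RE = re.compile(
    r'\\(?:u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})'
    r'|u([0-9a-fA-F]{4})'
    r'|(["/\\bfnrt]))'
)
_SIMPLE = {'"': '"', '/': '/', '\\': '\\',
           'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r', 't': '\t'}


def _replace(m):
    high, low, single, simple = m.groups()
    if high is not None:
        # Combine the high and low surrogate into one code point.
        code = 0x10000 + ((int(high, 16) - 0xD800) << 10) + (int(low, 16) - 0xDC00)
        return chr(code)
    if single is not None:
        return chr(int(single, 16))
    return _SIMPLE[simple]


def unescape_json_string(token):
    """Unescape a raw JSON string token (hypothetical helper).

    The token is expected to include its surrounding double quotes.
    """
    return _ESCAPE_RE.sub(_replace, token[1:-1])


# Use the standard library's json module as a reference for a few samples.
for sample in ['"foo\\nbar"', '"\\u0041\\u00e9"', '"\\ud83d\\ude00"', '"a\\\\u0041"']:
    assert unescape_json_string(sample) == json.loads(sample)
```

Comparing against json.loads on a handful of inputs is not a proof of correctness, but it is a cheap way to catch the kind of mistakes mentioned earlier in the thread. Unpaired surrogates and invalid escapes are deliberately left unhandled here.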
@goodmami Thanks for the additional information! I labeled the issue as "discussion", so it's definitely welcome. At the same time, I'm sure there are plenty of open-source JSON parsers with a very detailed escaping function, if anyone ever needs to seriously parse JSON on their own. |
I think there is value in the JSON parser example being correct, even if it somewhat complicates the example. The other parsers seem to strive to do so, as pointed out in the issue description. Consider that Lark's Python parser makes a serious attempt to parse all of Python. Examples that show how to solve real-world problems rather than simplifications are a great way to show the true capabilities of the software. Their existence often distinguishes between production-ready and toy projects, and Lark is definitely in the former category. |
@hniksic I'd agree with @erezsh that having a fully correct parser is not appropriate for the tutorial. All the little details distract from the purpose of giving a simple walkthrough of Lark's features. However, it wouldn't be a bad idea to mention the "next steps" for a fully valid parser, perhaps in the afterword, maybe even with a complete implementation in Also, I forgot to mention my fix for trailing characters, but it's not very satisfying. I require a |
I would definitely welcome adding a real json parser to the examples, as long as an actual attempt is made at completeness and clean code. I could then link to it from the toy json parser. But as for the tutorial code, I think it's best to leave it as-is. |
To expand a little on what I mean by real attempt: Lark's example Python parser can parse every Python file in the built-in library, and a special Python file full of gotchas, made by someone who isn't me. If you can get a JSON parser to that level of rigor, I will be happy to include it. |
Maybe something like this: http://www.json.org/JSON_checker/ See the linked test suite. Note, however, that this test suite expects the top-level element to only be an object or an array, but the spec allows other value types. I think the spec may have changed at some point and the tests were not updated, but it still is a useful set of examples. |
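
A small harness for a test suite laid out like that could look roughly like the sketch below. It assumes a directory of .json files whose names start with "pass" or "fail" according to the expected outcome (my recollection of how the JSON_checker suite is organized, so treat the naming as an assumption). run_suite is a made-up name, and json.loads only stands in for whatever parser is being checked.

```python
import json
from pathlib import Path


def run_suite(directory, parse):
    """Return the names of files whose outcome disagrees with their name.

    Hypothetical helper: `parse` is any callable that takes the file's text
    and raises an exception on invalid input.
    """
    mismatches = []
    for path in sorted(Path(directory).glob("*.json")):
        expect_ok = path.name.startswith("pass")
        try:
            parse(path.read_text(encoding="utf-8"))
            ok = True
        except Exception:
            ok = False
        if ok != expect_ok:
            mismatches.append(path.name)
    return mismatches


if __name__ == "__main__":
    import sys
    # Usage: python run_suite.py path/to/test/directory
    if len(sys.argv) > 1:
        print(run_suite(sys.argv[1], json.loads))
```

A parser that follows the current spec would be reported as mismatching on any "fail" file that only contains a top-level scalar, which is the discrepancy noted above, so such files need to be excluded or reclassified by hand.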
The JSON transformer presented in the documentation doesn't handle the escape sequences in JSON strings. For example, copy the last full source (from Part 5 - Step 1) to a lark-json script and run it:

The expected output would be foo and bar in two separate lines.

An easy and (I hope) fast way to achieve this is to use the unicode_escape built-in codec, which also handles the \uhhhh sequences:

This will also modify string to return Unicode strings in Python 2. Since JSON is defined in terms of Unicode, this is probably the right thing to do, but if it's undesirable, the result could be re-encoded to UTF-8.
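
The snippets from the original post did not survive on this page. As a rough Python 3 sketch of the unicode_escape idea (not the original patch), a helper such as the hypothetical unescape_with_codec below could be called from the transformer's string method. It only uses the standard library.

```python
def unescape_with_codec(token):
    """Decode a double-quoted JSON string token via the unicode_escape codec.

    Hypothetical helper for illustration; not part of Lark.
    """
    body = token[1:-1]  # drop the surrounding double quotes
    # unicode_escape decodes bytes as Latin-1, so encode with Latin-1 and let
    # backslashreplace turn anything above U+00FF into a backslash escape first.
    raw = body.encode("latin-1", "backslashreplace")
    return raw.decode("unicode_escape")


# Prints foo and bar on two separate lines.
print(unescape_with_codec('"foo\\nbar"'))
```

The codec follows Python's escape rules rather than JSON's, so the two do not line up exactly: it accepts escapes such as \x41 that JSON forbids, and as far as I know it neither turns \/ into / nor combines surrogate pairs. That is in line with the caution voiced in the comments about getting Unicode conversions right.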