Accented characters #4064

Korporal · 2023-01-07T21:32:36Z

I'm seeing a problem when I try to consume text that contains French accented characters, as in déclarer. The parser/lexer seems to treat the accented character as a separator, which breaks the expected single-token keyword into two parts.

The file is a UTF-8 file, and when I view the file the character has value E9.

The parsing works fine if I edit the grammar and replace the é with an ordinary e (value 65) and edit the source text to replace it there too.

I even tried defining the token like this, but that fails too:

('d\u00E9clarer')

What must I do to be able to consume this correctly? I'm running the GUI tool which runs the generated parser/lexer and it fails.

I did just discover something puzzling. If I look at the .g4 file in a hex editor, this is how the token déclarer looks:

The accented char appears as two characters, the two bytes C3 A9. If I paste that into my source code then it parses! But it looks like this in an editor:

ericvergnaud · 2023-01-07T22:12:24Z

Maintainer

I suspect the issue comes from your editor settings. UTF-8 will indeed encode 'é' as 2 bytes, so the top screenshot is certainly not UTF-8. This issue could also originate from your git settings, silently changing the encoding in the background.
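The byte counts mentioned above can be checked with a few lines of Python. This is only an illustrative sketch: Python is not used anywhere in this thread, and the choice of Latin-1 and Windows-1252 as the "other" encodings is an assumption, since the thread does not say which encoding the editor actually used.

```python
# Illustrative sketch: how 'é' (U+00E9) is stored under different encodings.
# Latin-1 / cp1252 are assumed stand-ins for "a non-UTF-8 editor encoding".
word = "d\u00e9clarer"  # "déclarer"

utf8_bytes = word.encode("utf-8")      # 'é' becomes the two bytes C3 A9
latin1_bytes = word.encode("latin-1")  # 'é' becomes the single byte E9

print(utf8_bytes.hex(" "))    # 64 c3 a9 63 6c 61 72 65 72
print(latin1_bytes.hex(" "))  # 64 e9 63 6c 61 72 65 72

# Reading the UTF-8 bytes through a single-byte code page shows two
# characters (U+00C3, U+00A9) where the 'é' should be.
mojibake = utf8_bytes.decode("cp1252")

# A lone E9 byte followed by ASCII is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    single_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    single_byte_is_valid_utf8 = False

print(single_byte_is_valid_utf8)  # False
```

So a file in which the character shows up as a single E9 byte cannot be UTF-8, which matches the two-byte C3 A9 form seen in the .g4 file.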

Korporal · 2023-01-07T23:21:04Z

Author

Thanks @ericvergnaud, that's definitely what's happening (or was). For reasons I won't bore you with, I was editing the test source input files inside Visual Studio, where I also had a C# project open that consumes the C# lexer/parser generated by ANTLR. But I had the .g4 file open and was editing it in Visual Studio Code (because it has extensions that understand ANTLR grammars; Visual Studio 2022 does not).

So under the hood, saving source code from VS and saving the grammar file from VSC were doing different things: the accented text looked the same in each editor, but the bytes inside the files differed.
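For anyone hitting the same mismatch, a quick check of the raw bytes can confirm which file is the odd one out. The following is a minimal Python sketch, not anything from ANTLR itself. The helper names `sniff_encoding` and `normalize_to_utf8` are hypothetical, and the fallback to Windows-1252 is an assumption about what a Windows editor might have written; the thread does not confirm the actual encoding.

```python
from pathlib import Path


def sniff_encoding(data: bytes) -> str:
    """Return 'utf-8' if data decodes strictly as UTF-8, else 'cp1252'.

    The cp1252 fallback is an assumption, not a real detection result.
    """
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"


def normalize_to_utf8(path: Path) -> str:
    """Rewrite the file as UTF-8 if needed; return the encoding it had."""
    data = path.read_bytes()
    encoding = sniff_encoding(data)
    if encoding != "utf-8":
        # May raise UnicodeDecodeError for the few bytes cp1252 leaves undefined.
        text = data.decode(encoding)
        path.write_bytes(text.encode("utf-8"))
    return encoding
```

Running `normalize_to_utf8` on both the grammar and the test input would leave them in the same encoding, which is the same end state as saving both from one editor.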

I now edit the g4 file and my test source files from VSC and all is well, nice n simple, what a puzzler, what a hair-puller-outer!

Thx
