Accented characters #4064
I'm seeing a problem when I try to consume text that contains French accented characters, like here:

The file is a UTF-8 file, and the character when I view the file has value

The parsing works fine if I edit the grammar and replace the

I even tried defining the token like this, but that too fails:

What must I do to be able to consume this correctly? I'm running the GUI tool, which runs the generated parser/lexer, and it fails.

I did just discover something puzzling: if I look at the .g4 file in the hex editor, this is how the token

The accented char appears as two characters, the two bytes
Replies: 2 comments
I suspect the issue comes from your editor settings. UTF-8 does indeed encode 'é' as 2 bytes, so the file in the top screenshot is certainly not UTF-8. The issue could also originate from your git settings silently changing the encoding in the background.
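To make the byte counts concrete, here is a small Python sketch (not part of the original reply). It assumes the single-byte form comes from a legacy code page such as Latin-1 or Windows-1252, which is a common cause but is not confirmed in this thread.

```python
# 'é' is code point U+00E9.
ch = "\u00e9"

# UTF-8 needs two bytes for it; Latin-1 and Windows-1252 need only one.
utf8_bytes = ch.encode("utf-8")
latin1_bytes = ch.encode("latin-1")
cp1252_bytes = ch.encode("cp1252")

print(utf8_bytes.hex(" "))    # c3 a9
print(latin1_bytes.hex(" "))  # e9
print(cp1252_bytes.hex(" "))  # e9

# Reading the two UTF-8 bytes with a single-byte codec yields two
# characters instead of one, which is how the mismatch shows up on screen.
mojibake = utf8_bytes.decode("cp1252")
print(len(mojibake))          # 2
```

So a hex view showing a lone `e9` points to a single-byte encoding, while `c3 a9` is what a UTF-8 file would contain.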
Thanks @ericvergnaud, that's definitely what's happening (or was). I was (for reasons I won't bore you with) editing the test source input files inside Visual Studio, where I also had a C# project open that consumes the C# lexer/parser generated by Antlr. But I had the .g4 file open and was editing it in Visual Studio Code (because that has extensions that understand Antlr grammars; Visual Studio 2022 does not). So under the hood, saving source code from VS and saving the grammar file from VSC were doing different things: the accented text looked the same in each editor, but inside the files it was not. I now edit the .g4 file and my test source files from VSC and all is well. Nice n simple. What a puzzler, what a hair-puller-outer! Thx
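For anyone hitting the same thing, a rough way to spot this kind of mismatch without a hex editor is to check whether each file decodes as strict UTF-8. This is only a sketch: the helper names and the file names in the usage note are hypothetical, and a pure-ASCII file will always pass, so passing does not prove the editor is saving as UTF-8.

```python
from pathlib import Path


def looks_like_utf8(data):
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def non_ascii_bytes(data):
    """List the non-ASCII bytes as hex, to compare against a hex editor view."""
    return [f"{b:02x}" for b in data if b >= 0x80]


def check_file(path):
    """Read a file as raw bytes (no decoding) and test it for valid UTF-8."""
    return looks_like_utf8(Path(path).read_bytes())
```

Running `check_file("MyGrammar.g4")` and `check_file("test_input.txt")` (both names made up here) should return the same answer for the grammar and the test input. If one returns `False`, the two editors are saving with different encodings.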