What to do with Unicode strings #1

Ashvin-Ranjan · 2022-01-02T21:14:38Z

The issue

Currently EMP removes all non-ASCII characters from its strings using this line of code. This might make conversion harder, because both JSON and NBT support some form of Unicode.
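To make the cost of the current behavior concrete, here is a minimal sketch of ASCII stripping. The linked line is not reproduced in this issue, so `strip_non_ascii` is a hypothetical stand-in that only assumes the described effect (dropping every non-ASCII character), not EMP's actual code.

```python
def strip_non_ascii(text: str) -> str:
    """Hypothetical stand-in: drop every character outside the ASCII range."""
    # errors="ignore" silently discards anything that cannot be encoded as ASCII.
    return text.encode("ascii", errors="ignore").decode("ascii")


print(strip_non_ascii("naïve café"))  # non-ASCII letters are silently lost
print(repr(strip_non_ascii("日本語")))  # an all-non-ASCII string becomes empty
```

The loss is not reversible, so a JSON or NBT string that goes through this step cannot be round-tripped.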

JSON

In JSON:

A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes.

From json.org

There are some useful things to note from this definition. Including UTF-8 or UTF-16 might not work, because strings over length 16 cannot contain the byte 00000100 (0x04): it would terminate the string prematurely and mess up the rest of the decoding. However, we can still include \b, \f, \n, \t, and \r (bytes 0x08, 0x0C, 0x0A, 0x09, and 0x0D), as none of those conflict with the current system.
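As a rough check of the byte-level concern above, the sketch below tests whether an encoded string contains the byte 00000100. It assumes the terminator is the single byte 0x04, as this issue describes; `TERMINATOR` and `contains_terminator` are hypothetical names, not part of EMP.

```python
TERMINATOR = 0x04  # assumption: the byte 00000100 described in this issue


def contains_terminator(data: bytes) -> bool:
    """Return True if the encoded bytes include the terminator byte."""
    return TERMINATOR in data


# The JSON control escapes map to bytes 0x08, 0x0C, 0x0A, 0x09, 0x0D.
escapes_ok = not contains_terminator("\b\f\n\t\r".encode("ascii"))

# U+0416 (Cyrillic "Ж") is 0xD0 0x96 in UTF-8 but 0x04 0x16 in UTF-16 (big-endian).
utf8_hit = contains_terminator("Ж".encode("utf-8"))
utf16_hit = contains_terminator("Ж".encode("utf-16-be"))

print(escapes_ok, utf8_hit, utf16_hit)
```

In UTF-8 every byte of a multi-byte sequence is 0x80 or higher, so 0x04 can only come from the character U+0004 itself. In UTF-16, 0x04 can appear as half of an ordinary code unit, so the conflict is much broader there.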

NBT

According to wiki.vg, NBT uses Modified UTF-8. This could perhaps be modified further to disallow the byte 00000100.
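One possible way to do that, sketched below, is to reuse the trick Modified UTF-8 already applies to NUL: U+0000 is written as the overlong two-byte form 0xC0 0x80, so U+0004 could be written as 0xC0 0x84. The `escape_eot` flag and the 0xC0 0x84 form are assumptions made for illustration; they are not part of Modified UTF-8 or of EMP.

```python
def _three_bytes(unit: int) -> bytes:
    """Encode a 16-bit value in the three-byte UTF-8 pattern."""
    return bytes([0xE0 | (unit >> 12), 0x80 | ((unit >> 6) & 0x3F), 0x80 | (unit & 0x3F)])


def encode_modified_utf8(text: str, escape_eot: bool = False) -> bytes:
    """Encode text as Modified UTF-8; optionally also escape U+0004 (hypothetical)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0x00 or (escape_eot and cp == 0x04):
            # Overlong two-byte form, so the raw byte never appears in the output.
            out += bytes([0xC0, 0x80 | cp])
        elif cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += _three_bytes(cp)
        else:
            # Supplementary characters become a surrogate pair, three bytes each.
            cp -= 0x10000
            out += _three_bytes(0xD800 | (cp >> 10))
            out += _three_bytes(0xDC00 | (cp & 0x3FF))
    return bytes(out)


print(encode_modified_utf8("a\x04b", escape_eot=True).hex())
```

A matching decoder would have to accept the overlong 0xC0 0x84 sequence, which a strict UTF-8 decoder rejects, so both ends would need to agree on this variant.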


Ashvin-Ranjan added the "help wanted" and "question" labels on Jan 2, 2022
