NBT does not use UTF-8, it's MUTF-8. #144
Comments
@TkTech Thanks for the report! And great suggestion, I would support a fix. (Unfortunately, I'm not actively maintaining this package anymore, so I will not do it myself, at least not anytime soon I'm afraid). Interesting topic, I thought I'd seen it all after the different line-endings, different normalizations in UTF, and the BOM-or-no-BOM. Another variant I was not aware of. Seems like one of the original Java programmers had a field day torturing the original UTF-8 and UTF-16 specs. Alas, that's what we have to deal with. [Edit, seems I was wrong at first] Just for the record, am I correct to assume that:
And if I'm correct, there are two differences between this format and UTF-8
|
That would be correct. I know it can be confusing, especially since some of the first posts you find when searching (such as on Stack Overflow) just suggest replacing NULLs, which is incorrect. |
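For anyone landing here later, a minimal sketch of why replacing NULLs alone is not enough. It assumes Python 3 and is neither the patch from this thread nor the mutf8 package's code; `decode_mutf8` is a hypothetical name. MUTF-8 (as defined for Java's `DataInput`) encodes U+0000 as the two bytes `C0 80`, and encodes characters above U+FFFF as a UTF-16 surrogate pair with each half written as its own 3-byte sequence, so both cases need handling.

```python
def decode_mutf8(data: bytes) -> str:
    """Lenient sketch of a Modified UTF-8 decoder (hypothetical helper).

    It only checks sequence lengths; continuation bytes are not validated.
    """
    units = []  # UTF-16 code units recovered from the byte stream
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:
            # One-byte form, same as ASCII.
            units.append(b0)
            i += 1
        elif b0 & 0xE0 == 0xC0:
            # Two-byte form; C0 80 decodes to U+0000 here.
            if i + 1 >= n:
                raise ValueError("truncated 2-byte sequence at offset %d" % i)
            units.append(((b0 & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        elif b0 & 0xF0 == 0xE0:
            # Three-byte form; may be one half of a surrogate pair.
            if i + 2 >= n:
                raise ValueError("truncated 3-byte sequence at offset %d" % i)
            units.append(
                ((b0 & 0x0F) << 12)
                | ((data[i + 1] & 0x3F) << 6)
                | (data[i + 2] & 0x3F)
            )
            i += 3
        else:
            # MUTF-8 has no 4-byte form.
            raise ValueError("invalid lead byte 0x%02x at offset %d" % (b0, i))
    # Reading the code units as UTF-16 fuses surrogate pairs into one character.
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    return raw.decode("utf-16-le")


print(repr(decode_mutf8(b"a\xc0\x80b")))
print(decode_mutf8(b"\xed\xa0\xbd\xed\xb8\x80") == "\U0001F600")
```

Going through UTF-16 code units keeps the surrogate handling in one place; an unpaired surrogate makes the final decode raise instead of silently producing a broken string.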
Since this library wants to be dependency free, and py2.7 compatible (which
|
Alright, so it looks like I'll need to fork the project and apply this patch. My block entity scanning script currently crashes when trying to iterate over HermitCraft Season 7's world due to this bug. |
Turns out Hermitcraft 6 has a region file which breaks mutf8. I've isolated the broken region file to be |
Interestingly enough, the message doesn't show up in the mutf8 source code. However, I believe the message is supposed to come from line 65 of mutf8.py, as it's the message with the if statement that checks for byte

The problem with the way the exceptions are handled right now is that I have no way of finding out which chunk or block is corrupted. So I have to play "guess which block is corrupted", as I cannot get the coordinates from the stack trace. I'm about to wipe out every block that isn't bedrock or marked as axe mineable (for chests). If the error goes away, then it'll most likely be a hopper, furnace, blast furnace, etc. To make it easier to debug, it might do me well to create a datapack that includes every vanilla block that isn't a block entity and use that for wiping out blocks.

It may not even be a block that's causing the exception, but I can't load the region file in https://irath96.github.io/webNBT/ or in the NBT plugin I have installed in IntelliJ IDEA. So it may do well to build an editor that can log every exception without breaking the chunk loop.

I should also mention that neither Minecraft nor Amulet detects a problem with loading these chunks. Even using the Optimize World feature to upgrade the region file from 1.14.4 to 1.17.1 doesn't fix the issue. So, most likely, the data is valid and mutf8 just can't handle it. |
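A rough sketch of the "log every exception without breaking the chunk loop" idea, assuming Python 3. Everything here is hypothetical: `scan_chunks` and the `load_chunk` callable are illustrative names, not part of NBT or mutf8, and the loader is passed in so the sketch does not depend on any particular library call.

```python
import logging
from typing import Callable, Iterable, List, Tuple

Coord = Tuple[int, int]


def scan_chunks(
    coords: Iterable[Coord],
    load_chunk: Callable[[int, int], object],
) -> List[Tuple[Coord, Exception]]:
    """Try to load every chunk and collect failures with their coordinates."""
    failures = []
    for x, z in coords:
        try:
            load_chunk(x, z)
        except Exception as exc:  # deliberately broad: this is a diagnostic pass
            # Record where it failed so the bad chunk can be found later.
            logging.warning("chunk (%d, %d) failed to load: %r", x, z, exc)
            failures.append(((x, z), exc))
    return failures
```

The loader would wrap whatever call actually parses a chunk from the region file; the point is only that the coordinates travel with the exception, so a decode error can be traced back to a chunk instead of aborting the scan.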
Can you attach the region file? It may or may not be an issue with mutf8, might just be a genuinely corrupted tag. |
Yes. I've located the chest causing the problem too. Turns out Docm had created a series of books with weird characters and named them Alien Tech. Even Minecraft froze for a second when I ran the I copied the chest to every hotbar save, so you can run something like x+1 to get the chest which breaks mutf8. |
As I may have accidentally uploaded a copy made after I broke the chest (to confirm that the problem was with the chest), here's an unedited copy of the region file. Also, here's the command which has the chest's coordinates in it. Edit: If it helps, this is the scanner I'm working on (https://github.com/alexis-evelyn/WorldScanner/blob/master/scanner.py). I'm currently using the patch you provided at #144 (comment). I have not uploaded the patched version of NBT yet, but can do so if you don't already have a fork that has a patch (if I add my own patches, then I can include yours too if you'd like). |
Still getting the same issue though:
|
This was an interesting one! Thankfully I found an issue page about MUTF-8 handling on the repo for Twoolie/NBT, the Python project. It gave me some insight and a file to test against. I wrote my own script to slim it down a bunch, and dedupe the tags that are used multiple times. It's crazy how big just book text can get!

I used this actual version of NBTify in this commit, to write the new content to the file. That's also why I diffed it out, I wanted to make sure when I slimmed it down that the content coming out of it was actually what it was supposed to be as well. When using older NBTify, it didn't work correctly, because MUTF-8 handles things differently than standard UTF-8.

```js
// @ts-check

import { readFile, writeFile } from "node:fs/promises";
import * as NBT from "./NBTify/src/index.ts";

// Reference bytes sliced straight out of the original file.
const data = await readFile("./hotbar.nbt");
const trimmed = data.subarray(0x000BAE96, 0x000CA7C2);
console.log(trimmed);

/** @type {NBT.NBTData<any>} */
const hotbar = await NBT.read(data);

const book = hotbar.data[0][1].tag.BlockEntityTag.Items[12];
console.log(book);

// Write the book back out on its own.
const mutf8Demo = await NBT.write(book);
console.log(mutf8Demo);

// Buffer.compare returns 0 when the two byte ranges are identical.
const demoDiff = mutf8Demo.subarray(1, -2);
console.log(Buffer.compare(trimmed, demoDiff));

await writeFile("./alien-book.nbt", mutf8Demo);
```

#42 #44 twoolie/NBT#144 (comment) twoolie/NBT#144

I'm still not sure if I'm going to use the dependency itself or if I should just embed that into NBTify on its own. I think I may just use it as a dependency, as I've been trying to get more used to not reinventing the wheel for everything, unless that has benefits. The MUTF-8 library already does everything I need it to, and it's ESM TypeScript, so I'm not sure what other reason I have to not just use it, it's great! Eventually I want to move my compression handling into a separate module too, so I will have to use module resolution for that down the road either way. I say heck to it!
Let's do it :) Gonna look into if there's anything I'm forgetting, before doing that though. I really like having the ability to use projects like these (NBTify) without needing a transpilation or build step. Modern CDNs seem to handle this nicely, so we'll see.
Just wanted to stop by and say thanks for documenting this! I'm working on an NBT library as well, and MUTF-8 does have a notable difference in output for the character ranges that it handles compared to UTF-8. Having that |
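To make the difference in output concrete, here is a small encoder sketch, assuming Python 3. `encode_mutf8` is a hypothetical name and this is not the code of any library mentioned in this thread. Only two ranges differ from standard UTF-8: U+0000 and everything above U+FFFF.

```python
def encode_mutf8(text: str) -> bytes:
    """Sketch of a Modified UTF-8 encoder (hypothetical helper)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NULL is written as the two-byte sequence C0 80.
            out += b"\xc0\x80"
        elif cp < 0x10000:
            # Everything else in the BMP matches standard UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary characters become a surrogate pair,
            # each half written as its own 3-byte sequence.
            cp -= 0x10000
            for unit in (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


for sample in ("\x00", "\U0001F600", "plain text"):
    print(repr(sample), sample.encode("utf-8").hex(), encode_mutf8(sample).hex())
```

The `surrogatepass` error handler is what lets Python emit the 3-byte form of a lone surrogate, which the strict UTF-8 codec refuses to do.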
NBT uses MUTF-8, not UTF-8. Valid game-generated files will result in UnicodeDecodeErrors when using Twoolie's NBT. Minimal reproduction file with an embedded MUTF-8 NULL: encoded.dat.gz

I'd normally send you a PR to use my MUTF-8 encoder, but being dependency-free seems to be a project goal. There's a pure-python version in there you can just copy.
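The failure can also be shown without the attached file. A minimal sketch, assuming Python 3 and using only the standard library: the byte strings below are hand-written MUTF-8 samples (not taken from encoded.dat.gz), and the strict UTF-8 codec rejects both of them.

```python
# Hand-written MUTF-8 samples (assumption: illustrative bytes, not from the repro file).
samples = {
    "embedded NULL": b"\xc0\x80",
    "supplementary character": b"\xed\xa0\xbd\xed\xb8\x80",
}

failures = {}
for label, raw in samples.items():
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Strict UTF-8 treats both sequences as malformed input.
        failures[label] = exc
        print("%s: %s" % (label, exc))
```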
@1dt