-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate par2 format for parity #22
Comments
I re-read your comment and I'm still a little lost on what you mean by either "alignment" (just in the sense of padding?) or scrambling. Could you provide a reference on the par2 format? I'm familiar with the tool, but not the actual file format, and staring at it would probably help. |
What dependencies are required for restore is always a critical question. Personally, the approach I've taken* is to throw a tar file header on the first QR code, and use Par2 for the rest. My par2 implementation is still in progress, but explicitly has the goal of supporting the optional "Input File Slice packet", and zero padding the packets for maximum recoverability in QR codes. One of the primary things I am focusing on is using Par2 packets to allow file recovery, even if all QR codes are scrambled. An intermediate solution is to use a tar file header (512 bytes) in front of the file data, and another header (512 bytes) in front of the par2 file. As long as the QR codes are read in the proper order and those headers survived, this would allow for recovery even if a data QR code was unreadable. * Different requirements, and design philosophy. Important information regarding Par2:
|
Oh, another approach is the master branch of The down side is you must use a par2 program to do so. However, my approach ends up with that requirement anyways, so thinking about it, I will probably ignore the whole "Input File Slice packet" thing, and just generate over 100% recovery data. That allows any par2 program to recover the data! That does put a dependency that you may not be happy with however. Which is another reason I am working on a pure python implementation. |
One of the primary things I am focusing on is using Par2 packets to allow file recovery, even if all QR codes are scrambled.
Again, what do you mean by "scrambled"?
I'll point out that the QR codes I'm using have ~200 bytes, so 64-byte overhead would be a lot. If you're having luck with actually scanning/photographing significantly bigger ones, can you let me know what tools/hardware you're using?
I'm taking a look at par2. Hand-rolling par2 IMO seems more complex than using reed-solomon directly--it would only be worthwhile for the restore process. I think we may have the same design requirements, not different ones, if I'm understanding correctly. https://github.com/za3k/qr-backup/blob/master/docs/FAQ.md#what-are-the-design-goals-of-qr-backup
Are you decoding to a 2-member tar archive (separate file and .par2)? Or is it a 1-member archive? I'd like the restore to work without par2, which seems doable, but needs a lot of magic knowledge on the encoding end. The main downside is that no one will be able to understand the format any more--only the restore process.
P.S. Make sure think about what happens if your headers are the QR codes which are damaged--add redundancy to these! I'm sure you already are :)
|
OK yeah on the whole I think this seems like a good parity option because of the command-line restore option. Thanks for taking the extra time to explain the format you're using to me. Obstacles:
Let me know if you have any answers here. |
Sorry, I took so long to respond. Yes I mean if something like the pages of QR codes are mixed up, or the scanner decides to do something like read top to bottom instead of left to right. Tar format and QR sizeI have focused on 512 byte QR codes, and am considering actually bumping up to 1024 bytes. With a full 25% per code redundancy. Just using my phone at this point, 512 scans every time, but 1024 does not. The tar format works on 512 byte blocks and is a streaming format. Basically, you take a regular file and slap a tar header in front of it. So you can do:
Python has a tar file parser as part of the standard library, so there are no additional dependencies to do that. My goals, and Par2My goal is to be able to make a book of QR codes* that if I ever need to recover I can just throw in an auto document feed scanner, convert the whole thing to one big raw file, and then recover everything with one command. Even if a page is missing, the pages were not in the right order, or water/fire damage destroyed a corner of every page. Par2, with each packet being on a separate QR code, supports everything I just said. Especially since the format supports something like Unfortunately, packet size consistency and alignment is explicitly not a goal of the par2 format. Here is what I have found, using the names from my Par2Py:
* Even at different times, provided all file names are different. |
If you haven't checked it out already, I have a much simpler method to "de-scramble" things: |
Thanks! That's a super simple way of handling it, and has far less overhead than the 68 bytes of a Par2 packet header. Re-reading your last comment, the largest issue is each Par2 archive is limited to 32768 input blocks. Which at a 444 byte block size results in a maximum size of just under 14MB. Here is me discussing the issue with the par2cmdline people, and being informed of this limitation. However, that's per archive. You can have multiple archives, they just wont share recovery data. |
As a note to my future self, the backup format would be a file always called 'file' and a file always called 'file.par2' both inside a nameless tarball. Here are the needed qr-backup QR "packet" types:
Restore process
|
Minor notes:
Edit: More notes: Here is a spec for Tar headers.
The Tar file format explicitly allows a single Tar file to have two headers with the same file name. This is because it was designed for tape drives where you could be backing up a file multiple times. It is implementation defined how those are unpacked. Most implementations either use the last file in the archive, or the file with the newest timestamp. |
Revised, there's no reason to invoke tar, what was I thinking?
Restore process To decode an uncorrupted file without par2 installed, take "N" packets and concatenate them.
|
OK, looked into whether it's possible to trim the 64 bytes--yes, but not while keeping it as a readable bash oneliner. |
All right, overall I'm leaning toward not using par2 for now. On qr-backup's default settings, 64 binary bytes overhead for each code would mean something like doubling the number of QR codes, which is quite bad. Also, par2 adds a must-have "header" QR code, which I have been specifically avoiding because it makes things complicated for users to think about. My feeling is that currently the costs outweigh the benefits. I do think it's still a good idea, so let me know if/when you get it working on your own project please. And I may revisit if I get larger QR codes working, and a system for requesting duplicate copies of QR codes. |
You were following one of my early approaches, not yours. Different design goals and considerations. Pros and cons to each. I suspect we have slightly different definitions of "simple". I'm glad we could discuss this. Thanks :) |
Blergh, all right, given that the alternatives look worse (reed-solomon coding isn't clearly a real option), let's give this a try! I'll see what the blowup looks like in practice. I do like the possibility of CLI restore quite a lot. |
Hey, don't stress out about it. I'm already working on things to make it easier/aligned. Let me deal with that part, and you can focus on how you want everything to be packed / laid out. |
Heh, I appreciate it. I just want parity working, and I'm frustrated that all the options are bad (IMO anyway). But having something is better than having nothing. I'll probably do it in a couple days. That said, if you want to give me a PR feel free! |
Investigated a little further. Linux, Windows, and Mac should be able to restore the par2 format. There are no apps to do so on Android |
Closed since I added erasure coding |
I've been working on this on my own implementation.
There's an easy solution that already exists. Par2 is a recovery file which generates Reed Solomon codes. It's a packetized format, so you can zero pad each packet to fit in a QR code and the programs will still read them.
The downside is that you need to be aware that recovery packets are 64 bytes larger compared to the recovery block size.
If you don't care about exact alignment, you can just use Par2 as is. If you do, I am currently working on Par2Py, which will allow for alignment and for using an optional packet type that stores the data itself inside the Par2 file itself.
The advantage of this approach is if you do one Par2 packet per QR code, then not only can you recover from some codes missing, but you can also recover from codes being scrambled!
Originally posted by @EmperorArthur in #2 (comment)
The text was updated successfully, but these errors were encountered: