
Discussion of Par3 Specification #1

Open
mdnahas opened this issue Jan 29, 2022 · 108 comments

@mdnahas
Member

mdnahas commented Jan 29, 2022

Par3's specification is in near-final form. I created this issue to discuss any changes that need to be made, as we work on a reference implementation of Par3.

The earlier discussion about the Par3 specification is in this issue on the par2cmdline repo on GitHub:

Parchive/par2cmdline#130

The current draft of the specification is available at:

https://parchive.github.io/doc/Parity_Volume_Set_Specification_v3.0.html

@Yutaka-Sawada
Collaborator

I found a typo in the spec, line 578:

Table: File Packet Body Contents

This should probably be "Table: Contents of a chunk description".

I also have a question about how input files are mapped to input blocks. Is there any rule about the order of the mapping? The PAR2 spec defines the files' order through the FileIDs in the Main packet. In the PAR3 spec, the Root packet has a list of File packets, so their order is defined. But there seems to be no relation between that order and the mapping of input files to input blocks.

@mdnahas
Member Author

mdnahas commented Jan 31, 2022

No, there is no relation between the order of input files and the mapping to input blocks. I wanted that freedom, in case multiple threads were processing files separately.

It is best to use low numbers for the input blocks. E.g., if you use 50 input blocks, they should be numbered 0 to 49. (Or something like that.) Using low numbers leaves room for incremental backups.

@Yutaka-Sawada
Collaborator

I saw an interesting point in the spec, line 347 under Start Packet Body Contents:

| 2 | unsigned int | The size of the Galois field in bytes.

The Galois field size is 2 for 16-bit Reed-Solomon codes like PAR2, and 1 for 8-bit Reed-Solomon codes like PAR1. Currently, most programming languages support 64-bit integers (8 bytes). Recent CPU SIMD supports 128-bit, 256-bit, or 512-bit operations (16, 32, or 64 bytes). But wouldn't a 2040-bit Galois field size (255 bytes) be enough for PAR3? I feel that using 1 byte for this item might be enough.

Also, when only XOR is used, as in simple parity or LDPC, may the Galois field size be stored as 0? In that case, is there no generator item in the Start packet?

@mdnahas
Member Author

mdnahas commented Feb 1, 2022

Hmmm.... for some reason, I thought SIMD 512 was 512 bytes, not bits. Yes, I think 1 byte is enough.

Change made. The attached MarkDown file has the fix.

For the moment, I left out support for XOR. First, I feel like there should be a generator polynomial that acts exactly like XOR, but I haven't taken time to look into it. Second, I wondered how much it would be used. Par3 operates on bytes, and even using random 1-byte Galois field factors will significantly improve an LDPC's recovery rate. Lastly, it isn't hard to add support for it later, if we want.

Parity_Volume_Set_Specification_v3.0.md

@mdnahas
Member Author

mdnahas commented Feb 1, 2022

I remembered why we don't need XOR: all Galois Field addition is XOR. We can create a recovery block using only XOR by making all the factors in the Code Matrix equal to 1. And that works for any Galois Field of any size.

If a client wants fast LDPC, they can just do the optimization that multiplying any input block by a factor of 1 is just the input block itself.
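To illustrate, here is a minimal sketch (mine, not from the spec or any client) of computing a recovery block with every Code Matrix factor set to 1. Since addition in any Galois field of characteristic 2 is XOR and multiplying by 1 is the identity, the whole computation collapses to XOR of the input blocks:

```c
/* Sketch: a recovery block whose Code Matrix factors are all 1 is just the
 * XOR of the input blocks, because GF addition is XOR and multiplying by the
 * factor 1 is the identity. A general encoder would call gf_multiply() here. */
#include <stddef.h>
#include <stdint.h>

void recovery_block_all_factors_one(uint8_t *recovery,
                                    const uint8_t *const *input_blocks,
                                    size_t block_count,
                                    size_t block_size)
{
    for (size_t i = 0; i < block_size; i++)
        recovery[i] = 0;

    for (size_t b = 0; b < block_count; b++)
        for (size_t i = 0; i < block_size; i++)
            recovery[i] ^= input_blocks[b][i];   /* factor 1: plain XOR */
}
```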

@Yutaka-Sawada
Collaborator

While I was thinking about how to map input files onto input blocks, I found some interesting points. Basically PAR3 is freer than PAR2, and the construction may depend on each PAR3 client's developer. It's complex enough to be a headache.

Compared to PAR2, in PAR3 it is difficult (or almost impossible) to predict the number of input blocks. In PAR2, there is a strong relation between block size and number of input blocks. When a user sets a block size in a PAR2 client, it can calculate the number of resulting input blocks for the given input files. Also, when a user sets a preferred block count, it can suggest a reasonable block size for the given input files. In PAR3, the new features (packing of tail chunks and deduplication) may reduce the number of actual input blocks below the pre-calculated number. So the difference between a user's specified block count and the resulting block count will become larger.

There may be a sensible range for the block size. Because it's stored as a 64-bit integer (8 bytes), the format's maximum block size is 2^64 - 1. The maximum is also limited by the total size of the input files: when all input files are put in one input block, the block size equals the total file size, and setting a larger block size than the total file size is worthless. If the packing feature isn't used, the maximum useful block size is the size of the largest file. As for the minimum, setting a size of less than 8 bytes is worthless: when the block size is 1-8 bytes, the checksums (CRC-64) alone can restore each input block without any recovery blocks. Thus it may be good to mention that 8 is the minimum value of the block size.

I came up with another case where recovery blocks are not required. When all input files are 39 bytes or smaller, they can be put in tail chunks by setting the block size to 40 bytes. In this case there are no input blocks at all, because each input file is stored as raw data in the File Packets. When some files are 40 bytes or larger and other files are 39 bytes or smaller, setting the block size to 40 bytes or more means only the small files can be restored without recovery blocks. Developers should be careful when handling these small files (39 bytes or less), because PAR2 had no such feature.

I have a question about the External Data Packet. This packet contains checksums of input blocks. Does it require checksums of all input blocks? When an input block consists of one or more tail chunks, their checksums are already stored in each File Packet. If it's possible to omit the checksums of such tail-chunk blocks, the number of External Data Packets may equal the total number of chunks in the File Packets. Though it's possible to store checksums of all input blocks, duplicated checksums may not be useful.

Furthermore, when the block size is less than 24 bytes, the checksums in the External Data Packets become larger than the original data of the input blocks. In this case, it may be good to store the raw data of the input blocks, as is done for small tail chunks. Or it may be good to suggest a block size larger than 24 bytes for efficiency. In any case, users will rarely set such a small block size.

@mdnahas
Member Author

mdnahas commented Feb 2, 2022

Yes, it is harder to determine the number of input blocks. But I'm not sure that users care about that. I think they care more about how small a piece of damage can be repaired and how much recovery data there is.

Par3 is not good for very small blocks. There is too much overhead. I would be surprised if anyone wants to use block sizes below 1024 bytes, because of the overhead. Personally, I would want blocks of at least 2400 bytes, because the overhead is at least 24 bytes per block and I'd want the overhead to be less than 1%.

I don't know that every 8-byte value has a unique CRC-64-ISO. It's possible, but I don't know that it is guaranteed.

Yes, large collections of tiny files would result in everything being stored in the tail data of the File packets. But I don't see a way around that ... other than zipping or tarring the files before using Par3.

The External Data packet is for a sequence of input blocks. The specification does NOT require that every input block has to be covered by a checksum in an External Data packet. It is up to the client to decide if they have enough information to do the recovery. I suppose a client author could try to deduplicate checksums, but I don't think there is much data to be saved by doing that.

As I said before, the minimum overhead is 24-bytes per block and around 131-bytes per file. So, it is not suited to small blocks or many very tiny files. If users want the overhead to be less than 1%, we're talking block sizes at least 4kb.

@Yutaka-Sawada
Collaborator

While trying to get file information through the C runtime library, I may have found a possible problem in the File System Specific Packets (UNIX Permissions Packet and FAT Permissions Packet). There is some documentation under the "File System Specific Packets" section in the PAR3 spec. A PAR3 client may write only the items it prefers or considers safe in a packet. Another PAR3 client may read those items from the packet and apply the attributes or permissions to a file. For example, storing and restoring the timestamp is safe and may be useful.

Now, each item has two states: a written value or zero. When a PAR3 client didn't write an item to eliminate a risk (or could not get the information), the item becomes zero. (The PAR3 spec requires filling unused bytes with zero.) When another PAR3 client reads the item as zero bytes, it cannot distinguish whether the value was originally zero or simply not written. This may be a fault, because zero has meaning as an attribute.

For example, there is the "i_mode" item in the UNIX Permissions Packet. The item has bits which are non-zero for read or write permission. When the value is zero, it means that no user has read/write permission. Another PAR3 client cannot determine whether the file should have permissions or not.

Similarly, there is the "FileAttributes" item in the FAT Permissions Packet. The item has a bit for the "Archive" attribute. This bit is normally non-zero after creation or modification. When the value is zero, it means that the attribute, which was set by default, needs to be erased. (Because this attribute is rarely used or referred to, there may be no problem.)

Thus, there should be a bit flag which indicates whether the item was written or not. For example, setting either the Directory or File bit flag would be good. Both the "i_mode" and "FileAttributes" items have these two bits. Because one of those bit flags is always non-zero, a written item can never become all zero. When a PAR3 client reads an item as zero, it indicates that the other PAR3 client didn't write the item.

By the way, I felt that 1 byte might be enough for the "FileAttributes" item in the FAT Permissions Packet. There are many unused bits in the field. It's strange to allocate a larger field than is required. In any case, it's impossible to use the raw bit data from the Win32 API GetFileAttributes directly, so I think you don't need to keep the original bit alignment. A programmer will be able to edit the bit flags with ease.

@mdnahas
Member Author

mdnahas commented Feb 3, 2022

I'm not sure what you mean by "favorite or safe".

I think you mean that a client might write some pieces of data in the UNIX or FAT Permissions packet, but not all of them. For example, it might write timestamps, but not the UID or GID.

The GNU C library supports all of the UNIX Permission packet fields, except for "xattr". https://www.gnu.org/software/libc/manual/html_node/File-Attributes.html I assumed that xattr could be left empty.

I assumed that any Windows client would support all the FAT Permissions packet fields.

The permission packets were not required, so any Java or Javascript client can just ignore them.

If a client wants to fill in some but not all parts of a Permissions packet, I figured that they could find reasonable default values:

  • timestamps set to current time
  • UID/GID set to MAX_INT (POSIX for an omitted value)
  • username and groupname set to an empty string
  • i_mode set to rwx------
  • FAT Attributes set to all 0s.

Yes, if the encoding client writes those default values into the Permissions packet, the decoding client needs to respect them, if possible. But, then again, if they were not written into the file, the decoding client would need to make something up anyway.

Perhaps I should add that to the specification. Particularly about the UID/GID set to MAX_INT and usernames set to empty string

As for the size of the FAT Attributes field, I think it's okay to use 1 more byte to make it hard to screw up. (The per-file overhead is about 130 bytes, so adding 1 more isn't too expensive.) Those 2 bytes are returned by GetFileAttributesA and accepted by SetFileAttributesA. Actually, those functions return/expect 4 bytes, but I thought only 2 bytes were necessary, so I did save 2 bytes.

@mdnahas
Member Author

mdnahas commented Feb 3, 2022

I added the "unset values" to the specification.
Parity_Volume_Set_Specification_v3.0.md

@Yutaka-Sawada
Collaborator

I think you mean that a client might write some piece of data in the UNIX or FAT Permissions packet, but not all of them.

Yes, I do. Some users requested a feature to keep the modification time of source files a while ago. The C runtime library is able to restore that timestamp. Because the modification time is the only information common to the UNIX and FAT Permissions packets, storing that item may be useful for both kinds of users.

I figured that they could find reasonable default values:

  • timestamps set to current time

I think the default value for a timestamp should be 0, because it's not a value used in practice. The definition of "current time" is unclear, since time passes while creating the PAR3 files.

The same UNIX Permissions can be applied to multiple files, directories, or links.
The same FAT Permissions can be applied to multiple files, directories, or links.

According to the PAR3 spec, the same packet may be applied to multiple files, directories, or links. But that becomes impossible when timestamps differ, even if the permissions are the same. I think timestamps normally differ, unless a user sets a specific time manually. Or do most files have the same timestamp on Linux/UNIX? If the timestamps differ between files, each file needs its own Permissions Packet. I feel it may be good to split the timestamp from the permissions, for example as two independent packet types: a Permissions Packet and a Timestamp Packet.

The i_mode value contain the lower 12 bits of EXT4's values for the i_mode. The top 4 bits are all zeroes.

According to the PAR3 spec, only 12 bits are used in the 16-bit field. Then, is it OK to set 0x8180 (regular file, owner read+write) as the default i_mode value? Because I don't know much about Linux/UNIX, I may not be the right person to touch the UNIX Permissions Packet.

Start packets are used to generate the InputSetID, which is included in the header of every packet. The InputSetID is the first 8 bytes of the Blake3 hash of the body of the Start packet.

When I read how the InputSetID is calculated, I understood that the randomness of the 8-byte value depends on another random 16-byte hash value. So I felt that a CRC-64 might be enough to represent the packet body, instead of using the first 8 bytes of the Blake3 hash. Calculating the Blake3 hash of another Blake3 hash may be overkill. (Hashing twice may not give more randomness.)

There are many ways to generate a globally unique random number. One way is to use the fingerprint hash of the triple consisting of: a machine identifier, a process identifier, and a high-resolution timestamp.

I'm not sure whether you intended this big difference between PAR2 and PAR3. In PAR2, the SetID is specific to the input files and block size, even though the value looks random. If a user creates recovery data for the same files and block size, the resulting recovery data has the same SetID, so the sets are compatible. In PAR3, the InputSetID will differ every time a user creates recovery data for the same files. Even though the recovery data itself could be compatible, a PAR3 client won't use data from another set with a different InputSetID.

For example, par2cmdline has an option -f<n> (First Recovery-Block-Number). With this option, a user can append more recovery blocks to existing PAR2 files, as long as they specify the same source files and block size. If the source files were changed, it creates incompatible recovery blocks with a different SetID. QuickPar and MultiPar have such an "Extra" feature, too.

Now, if par3cmdline has a similar feature, it will require a parent PAR3 file instead of having the user specify the input files. It's almost the same as the parent PAR3 file used for incremental backup, but it inherits the parent's InputSetID and Start Packet when the input files are the same. So it needs to verify the input files before creating additional recovery blocks. If the input files were changed after the first creation, it will fail to create extra blocks with the same InputSetID. At that point, the user may choose to repair back to the original files or to run an incremental backup for the changed files.

Anyway, PAR3 developers and users should be aware of this behavioral difference from PAR2. Even when a user creates PAR3 files for the same source files with the same settings, simply running creation multiple times will result in incompatible recovery files. (Though the recovery blocks themselves are compatible, they have different InputSetIDs.) A special feature is required to create extra recovery blocks with the same InputSetID.

@mdnahas
Member Author

mdnahas commented Feb 5, 2022

I think the default value for a timestamp should be 0, because it's not a value used in practice.

0 is a perfectly valid timestamp. In both UNIX and Windows formats!

2^64-1 is a better value.

I should add a note to the specification about times from the future.

According to the PAR3 spec, only 12 bits are used in the 16-bit field. Then, is it OK to set 0x8180 (regular file, owner read+write)...

Good catch! I've updated the spec to use 0x0180.

BTW, setting the UNIX i_mode to 0x0180 and FAT Attributes to all 0s are going to act like default values. That is, the decoding client cannot tell the default value from an "unset" value. I am okay with that right now. In most cases where that is a problem, the permission packets shouldn't be in the file at all. If you can present an expected usage where that is a problem, we can talk about changing it.

Some users requested a feature to keep the modification time of source files a while ago.

First, I agree with the users. If a file exists and has the correct data, I don't know why a Par2 client should modify it. The client should only have to read the file, not write it. But that's a decoding client behavior and has nothing to do with the file specification. I'd expect a Par3 client to behave the same if no Permissions Packets are present.

If the timestamps differ between files, each file needs its own Permissions Packet.

Yes, that is correct.

I feel it may be good to split the timestamp from the permissions, for example as two independent packet types: a Permissions Packet and a Timestamp Packet.

The overhead for packet header is 48 bytes. The timestamps are 24 bytes. The other permissions are usually going to be about 28 bytes.

So, in the current specification, there is 1 packet for each file and its length is 48+24+28 bytes. You propose to replace that with one packet that holds the permissions at length 48+28 bytes plus one packet per file holding the timestamps of length 48+24 bytes. Since the File packet overhead is already around 140 bytes, with a very large number of files you'd cut the per-file-overhead from 240 bytes to 212 bytes. I don't think it's a big enough win to be worth a different packet.

When I read how the InputSetID is calculated, I understood that the randomness of the 8-byte value depends on another random 16-byte hash value. So I felt that a CRC-64 might be enough to represent the packet body, instead of using the first 8 bytes of the Blake3 hash. Calculating the Blake3 hash of another Blake3 hash may be overkill. (Hashing twice may not give more randomness.)

The Blake3 hash has uniqueness guarantees that the CRC-64 does not have. We need the InputSetID to be unique.

In PAR3, the InputSetID will differ every time a user creates recovery data for the same files.

First, it doesn't have to. Client authors can use a technique similar to Par2's calculation of the RecoverySetID. Line 368 of the Markdown version of the spec says: "Another method for generating a globally unique number is, if the input set's data is known ahead of time, is to use a fingerprint hash of the parent's InputSetID, block size, Galois field parameters, and all the files' contents."

But Par3 does allow a client to use a globally-unique random number. The reason for the change from Par2 to Par3 was that some client authors reported that Par2's value was too restrictive. In Par2, an encoding client had to hash all the files and file names first, to generate the RecoverySetID. Only then, could it write out the packets to the Par2 file. Moreover, if anything changed, even just a file name, the RecoverySetID changed.

For Par3, I wanted to add a little more freedom. The InputSetID could be the same value as with Par2, or it could be a globally-unique random number.

And, yes, that design impacts the usage. In Par2, for the same set of input files, every client would create the same packets. But that's not the case with Par3. But, given the other features (like deduplication), that wasn't going to be guaranteed anyway. (In fact, I think I need to rewrite the specification to make sure that there aren't clashes with deduplication.)

If a Par3 client has an existing Par3 file, it can create an incremental backup or just create new Recovery Packets (and possibly matrix packets) that reuse the existing InputSetID.

Yes, you are correct when you say "it needs to verify input files before creating additional recovery blocks." But, to create Cauchy-matrix recovery blocks, the client needs to read all the files anyway. So that isn't a big deal.

But, if we assume two clients create two different Par3 files using the same block size and Galois field, it is possible for a client to identify overlapping input blocks of common files and reuse the recovery information. The client would definitely be more complicated than the Par2 client, but it is possible. And, if a client supports incremental backup, it may not be too different.

@Yutaka-Sawada
Collaborator

While reading the GNU C Library manual to see the differences from the Microsoft C runtime library, I found an interesting point. The stat64() function gets a file's information. There are 3 timestamp items for a file;

st_atime = access time
st_mtime = modification time
st_ctime = attribute modification time

The utime() function sets only 2 items (access time and modification time). It automatically sets the attribute modification time to the current time. So there may be no reason to store ctime in the UNIX Permissions Packet. Even when a PAR3 client reads a valid ctime from the packet, it cannot set that value on a real file anyway. Or is there another function that can edit the ctime? Though I don't know much about Linux/UNIX, I feel that storing ctime will be useless.
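For reference, a minimal sketch of the round trip discussed above, using only the calls mentioned (stat() to capture, utime() to restore); on Windows the MSVC equivalents live in <sys/utime.h> as _utime(). Note that ctime is updated to "now" by the system as a side effect and cannot be restored this way:

```c
/* Sketch: capture a file's access/modification times with stat() and restore
 * them with utime(). The attribute-change time (ctime) is set to the current
 * time by the system and cannot be written back through this interface. */
#include <sys/stat.h>
#include <utime.h>

int capture_and_restore_times(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;                     /* could not read file information */

    struct utimbuf times;
    times.actime  = st.st_atime;       /* access time */
    times.modtime = st.st_mtime;       /* modification time */

    /* ... verify or repair the file here ... */

    return utime(path, &times);        /* ctime becomes "now", not restored */
}
```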

I don't think it's a big enough win to be worth a different packet.

Oh, I see. You are right; that was a bad idea. Then I came up with a solution that seems perfect for the FAT Permissions Packet and a maybe-good solution for the UNIX Permissions Packet. The FAT Permissions Packet has 4 items, and the packet body size is 26 bytes. If the priority order is fixed, it's possible to determine which items exist from the body size. The list of body sizes and contents is like below;

26-bytes = CreationTimestamp, LastAccessTimestamp, LastWriteTimestamp, FileAttributes
24-bytes = CreationTimestamp, LastAccessTimestamp, LastWriteTimestamp
18-bytes = CreationTimestamp, LastWriteTimestamp, FileAttributes
16-bytes = CreationTimestamp, LastWriteTimestamp
10-bytes = LastWriteTimestamp, FileAttributes
8-bytes  = LastWriteTimestamp
2-bytes  = FileAttributes

Because the UNIX Permissions Packet has more items, it's impossible to select each one individually. So it only splits the timestamps from the permissions. The list of body sizes and contents is like below;

8-bytes   = mtime
16-bytes  = atime, mtime
24-bytes  = atime, ctime, mtime (But, I think ctime may be useless.)
more size = both timestamp and permissions

How about this new idea? A single packet type can store various combinations of items, and the packet's body size indicates which items are present.
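To make the proposal concrete, here is a sketch (mine, not anything in the spec) of how a reader could decide which FAT Permissions items are present purely from the body size, following the list above; the struct and field names are illustrative:

```c
/* Sketch of the proposal above (not part of the PAR3 spec): infer which items
 * a FAT Permissions Packet body carries from its size alone. Assumes a
 * little-endian host, matching PAR3's byte order. Field names are illustrative. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t creation, last_access, last_write;   /* FAT timestamps */
    uint16_t attributes;                          /* FileAttributes */
    int has_creation, has_access, has_write, has_attributes;
} fat_permissions;

int parse_fat_permissions(const uint8_t *body, size_t size, fat_permissions *p)
{
    size_t off = 0;
    memset(p, 0, sizeof *p);

    if (size != 26 && size != 24 && size != 18 && size != 16 &&
        size != 10 && size != 8 && size != 2)
        return -1;                    /* not one of the defined layouts */

    /* Which items are present, by total body size (see the list above). */
    p->has_creation   = (size == 26 || size == 24 || size == 18 || size == 16);
    p->has_access     = (size == 26 || size == 24);
    p->has_write      = (size != 2);
    p->has_attributes = (size == 26 || size == 18 || size == 10 || size == 2);

    if (p->has_creation)   { memcpy(&p->creation,    body + off, 8); off += 8; }
    if (p->has_access)     { memcpy(&p->last_access, body + off, 8); off += 8; }
    if (p->has_write)      { memcpy(&p->last_write,  body + off, 8); off += 8; }
    if (p->has_attributes) { memcpy(&p->attributes,  body + off, 2); off += 2; }

    return 0;
}
```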

@malaire

malaire commented Feb 6, 2022

ctime is not meant to be set to any other time than current time. See e.g. https://unix.stackexchange.com/a/36105/182083

You cannot change the ctime by ordinary means. This is by design ...

@malaire

malaire commented Feb 6, 2022

btw, ctime is meant for backup tools to detect if file metadata or contents have changed since it can't be (easily) faked.

Could storing it help with incremental backups here?

@mdnahas
Member Author

mdnahas commented Feb 6, 2022

Yes, mtime is the time the file's contents changed and ctime is when the file's metadata changed.

The ctime can only be set by (1) mounting the drive without mounting the file system or (2) changing the system time. (I think #2 is called a "time storm".) And we don't want clients doing either of those.

The "tar" program stores ctime. If either the mtime or the ctime on a file changes, tar will backup the file and its metadata.

I believe @malaire is right --- the ctime might be useful for incremental backups. If the ctime changes, we update the UNIX Permissions packet. (And, obviously, if the mtime changes, we update the file's contents.)

The other way to check if permissions changed would be to call stat() and getxattrs() for every file. That will probably be expensive with xattrs. Also, "tar" stores ctime and, when in doubt, I'll imitate "tar".

I will add a note to the specification about ctime.

@Yutaka-Sawada Your idea of various-sized permissions packets is interesting. But it requires picking an ordering of the fields by importance. That's easy for the FAT Permissions packet. I don't see an obvious ordering for the UNIX Permission fields.

@mdnahas
Member Author

mdnahas commented Feb 6, 2022

Latest Markdown version of the specification.

Parity_Volume_Set_Specification_v3.0.md

@Yutaka-Sawada
Collaborator

A chunk description starts with the length of the chunk and a fingerprint hash of the chunk's entire contents. If the chunk's contents are unknown and not protected by the recovery data, the fingerprint hash value is set to all zeroes. (This is used with the "Par inside" feature described below.)

When I tried to construct chunk information for files, I found the 16-byte hash of each chunk troublesome. When the chunk size is equal to or larger than the block size, the External Data packets contain the hash of each block in the chunk. When the chunk size is smaller than the block size, or there is a tail, the hash of the last piece is saved in the File Packet. So the hash of the chunk's entire contents isn't required to verify the file data.

When a whole file's data can be kept in memory, there is no problem calculating the chunk hashes. But it may be annoying to read and hash a chunk's data sequentially from the first byte to the last when the file is very large. File access order is the problem, because it wants to read the same offset of every input block at once, instead of reading each block incrementally. Technically it's possible to read the whole file first and then read every block, but reading the same file data twice isn't fast. Also, I'm not sure it's worth calculating a hash value that has no use later.

Thus, I suggest using a 1-byte flag instead of the 16-byte hash. It would indicate the chunk usage. If the flag is 0, the chunk isn't protected (as with PAR inside). If the flag is 1, the chunk's hashes may be stored in External Data packets. If the flag is 2, the chunk's data may be stored in Data packets. If the flag is 3, there may be both External Data packets and Data packets. This reduces the packet size by 15 bytes, and PAR3 clients won't need to calculate an additional redundant hash value.

@mdnahas
Member Author

mdnahas commented Feb 8, 2022

This sounds like the same argument again. That there is a hash of each block, so we don't need a per-file hash. I believe we need per-file hashes. I've relaxed my position a little to have per-chunk hashes instead of per-file hashes, but am still having second thoughts about it.

The very first thing an archiving program should do when encoding is compute a checksum of the whole dataset. And the very last thing it should do when decoding data is verify that checksum.

Any extra calculation or extra complication to calculating that checksum is an opportunity for a bug to sneak in and for the client to report "the data is just fine!" when it isn't. That checksum is the most important guarantee to our users. If we ever say "the data is just like it was at the beginning" and it isn't, we've lost the trust of all our users.

I know that per-file hashes (or per-chunk hashes) are extra work. And they may require a second pass over the data, or a more complicated client to avoid two passes. But I think it is important and worth it.

As for the details, I'm not sure what you mean by "File access order is the problem, because it wants to read the same offset of every input block at once, instead of reading each block incrementally." I'm not sure what "it" is. Is it the per-chunk hash algorithm? Is it the per-input-block hash algorithm? Is it the calculation of the recovery data? Because I'm pretty sure all of those can be coded to pass over the file incrementally from start to finish. I don't know of any algorithm that has to process the first byte (or Galois field) of every input block at once, before going to the second byte (or Galois field) of every input block.

I don't know how every Blake3 library is coded, but I can believe that some require the data to arrive sequentially. That is, the file has to be loaded in order. That may be a problem for clients that want to go very fast and have multiple CPUs available to calculate the hash. In that case, the encoding client might want to break the input file into N equal-sized chunks, so that each of the N CPUs can calculate the hash for one chunk of the file.

@Yutaka-Sawada
Collaborator

The very first thing an archiving program should do when encoding is compute a checksum of the whole dataset. And the very last thing it should do when decoding data is verify that checksum.
That checksum is the most important guarantee to our users.

I understand what you say. Verifying a whole file with a checksum is important. But how to calculate that checksum at creation time is the problem. Calculating BLAKE3 of chunks felt worthless to me, while I was already calculating BLAKE3 of each input block in every chunk.

How about using the common CRC-32 for each file's checksum? Most archivers and checksum tools support CRC-32. (On the other hand, most tools don't support BLAKE3 yet.) Because CRC-32 is a different algorithm from the input blocks' checksums, the file's CRC-32 gives additional confidence. (My real aim, though, is that CRCs can be combined at calculation time.) A client would check the CRC-64 and BLAKE3 of the input blocks, and the CRC-32 of the input file as a final confirmation. Those are worth checking, instead of checking BLAKE3 hashes of the same file data twice.

Also, users would be able to see raw CRC-32 values in the recovery set, and other tools could confirm that the file was read or verified correctly. One weak point of PAR3 is that its hash algorithms are not widely used. By using the popular CRC-32 as the file checksum, PAR3's results become comparable with other tools. Though MD5 was the common checksum in PAR2, we use BLAKE3 instead of MD5 in PAR3. While we know BLAKE3 is more reliable than PAR2's MD5 or CRC-32, it's difficult to check with third-party tools. So storing a CRC-32 in PAR3 may be good for compatibility.

I've relaxed my position a little to have per-chunk hashes instead of per-file hashes, but am still having second thoughts about it.

If PAR3 uses the 1-byte flag in Chunk Descriptions and uses CRC-32 as the per-file hash, we don't need per-chunk hashes. In this case, the file's checksum simply skips the bytes of non-protected chunks. Storing 16 zero bytes as a flag for a non-protected chunk is strange. By the nature of CRC, CRC-32 is suitable for per-file hashes. (It's easy to calculate on very large files at creation time.)
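The "CRCs can be combined" point refers to the fact that the CRC-32 of concatenated data can be derived from the CRCs of the pieces without rereading them. zlib exposes this as crc32_combine(); a minimal sketch, assuming zlib is available:

```c
/* Sketch: derive the CRC-32 of a whole file from per-chunk CRC-32 values and
 * chunk lengths, without re-reading the data, using zlib's crc32_combine(). */
#include <zlib.h>

unsigned long file_crc_from_chunks(const unsigned long *chunk_crc,
                                   const long *chunk_len,
                                   int chunk_count)
{
    unsigned long crc = crc32(0L, Z_NULL, 0);        /* CRC-32 of empty input */
    for (int i = 0; i < chunk_count; i++)
        crc = crc32_combine(crc, chunk_crc[i], chunk_len[i]);
    return crc;
}
```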

@malaire

malaire commented Feb 8, 2022

Any CRC is unsuitable as a file checksum, as CRCs are too short and have too many collisions (they are also easy to forge, but I'm not sure if that is an issue here).

Personally I think that every file must have a single cryptographic hash (at least 128 bits, but I'd prefer 256) which covers the whole file, so that it is trivial to check that the file was created correctly.

@Yutaka-Sawada
Collaborator

I suggested CRC-32 for catching bugs or failures in the implementation. It doesn't need strength against forgery or collisions. BLAKE3 is already there for detecting errors or damage in the file data.

Personally I think that every file must have a single cryptographic hash

If it's a single hash to check the integrity of a file, I think so, too. Because PAR3 has a BLAKE3 hash for every input block in a file, this case doesn't require cryptographic strength to confirm the correctness of the read file data or the repaired result. But a client might arrange the blocks incorrectly or miss something somewhere; for that, a checksum of the whole sequence of input blocks is calculated as the chunk's hash. For this limited usage, I thought CRC-32 was suitable, because it's easy to concatenate.

All clients MUST be able to decoded File packets with multiple chunk descriptions and perform recovery.

There is a reason why I wanted to shorten Chunk Descriptions. According to the PAR3 spec, a File Packet may contain multiple Chunk Descriptions, and all PAR3 clients must handle them. While thinking about how to implement deduplication, I found that in rare cases it's difficult (or impossible) to load that many Chunk Descriptions into memory.

For example, take a 1 MB file of random data. Setting the block size to 1 byte makes 256 unique input blocks and 1,048,320 duplicate blocks, so there may be up to 1,048,576 Chunk Descriptions in the File Packet. Because one Chunk Description is 32 bytes in total now, 1,048,576 Chunk Descriptions consume around 32 MB. Even if you think 32 MB is small, a 1 GB file would make a File Packet of 32 GB. I predict that most PAR3 clients won't be able to decode such a large packet. A small block size may result in many duplicated blocks; that is why I suggested earlier setting the minimum block size to 8 or 40 bytes.

@mdnahas
Member Author

mdnahas commented Feb 9, 2022

We use the Blake3 hash for uniqueness. We use CRCs for rolling hashes.

We want the data, after decoding, to match what it was at the beginning. Uniquely. That is the entire purpose of having 1 hash for all of the data.

Thus, we use a tree of Blake3 hashes for it.

@malaire I agree that I would like every file to have a cryptographic hash. But Par3 doesn't protect all the data, in order to allow Par-in-Par. So, hashing the entire file was problematic. I settled on cryptographic hashes for each chunk. I feel that in most cases, each file will be one chunk.

@Yutaka-Sawada
Collaborator

I came up with what I think is the best solution for a problem: the lack of a per-file checksum. I explain the problem and our opinions here again.

Design of PAR3

PAR3 treats an input file as an array of input blocks. Each input block is protected by a BLAKE3 hash and a CRC-64. For deduplication, incremental backup, or PAR inside, an input file may have one or more chunks, which consist of multiple input blocks or unknown data. Though each chunk is protected by a BLAKE3 hash, an input file has no checksum covering the entire file data.

The construction of a file is like below;

When a file has single chunk,
file{ chunk(block, block, block, block, block, block, block, block, block) }

When a file has multiple chunks,
file{ chunk(block, block, block), chunk(block, block, block), chunk(block, block, block) }

When a file has multiple chunks and unknown data,
file{ chunk(block, block, block), chunk(block, block, block), chunk(block, block), chunk(unknown) }

My (Yutaka Sawada's) opinion

I like simple construction and speed. Because I believe the tree of BLAKE3 hashes can protect the entire file data indirectly, I don't mind the lack of a per-file checksum. Even when there is no checksum of the file itself, the internal blocks' BLAKE3 hashes are enough to establish completeness. So I oppose the chunk checksum; I want to remove the per-chunk BLAKE3 hash as a troublesome factor. Since the blocks' BLAKE3 hashes can prove the integrity of the file, an additional BLAKE3 hash per chunk should not be required.

Markus Laire's opinion

He wants every file to have a single cryptographic hash which covers the whole file. Because it's impossible to check PAR3's array of BLAKE3 hashes manually, a single hash is preferable for compatibility with other checksum tools. Some users requested a while ago that PAR3 include multiple kinds of hashes to check the integrity of source files. I understand their hope. There are some meticulous or paranoid users who calculate hashes of their files; Parchive is their last resort when their files are broken.

Michael Nahas's opinion

While PAR3 protects file data through a tree structure of BLAKE3 hashes, a PAR3 client may happen to get the calculation wrong. So the BLAKE3 hash of each chunk is checked. He declined to remove the chunk checksum, as opposed to my suggestion.

Though he agreed with the requirement for a per-file cryptographic hash, he could not put the file's checksum in the File Packet. For the "PAR inside" feature, a chunk may have no checksum because its data is unknown. When a chunk isn't protected, the whole file cannot be protected either. Because of that rare case, PAR3 doesn't contain a per-file checksum.

Idea of Checksum Packet

Now I suggest a new packet type: the Checksum Packet. A Checksum Packet contains multiple hashes of an input file. It's linked to an input file as an optional packet, in the same way as the UNIX Permissions Packet. When a file has non-protected chunks, it cannot have a Checksum Packet, because the file data is unknown. The Checksum Packet behaves as an extra packet for additional insurance: if users want to protect their files strongly, they can add their preferred cryptographic hashes in this packet. PAR3 clients will be able to support many hash functions. Markus Laire would be satisfied with this solution.

The Checksum Packet has a type value of "PAR CHK\0" (ASCII). The packet's body contains the following:

*Table: Checksum Packet Body Contents*

| Length (bytes) | Type | Description |
|---------------:|:-----|:------------|
| ?*? | checksum descriptions | see below. |


*Table: Checksum Description Layout*

| Length (bytes) | Type | Description |
|---------------:|:-----|:------------|
| 1 | unsigned int | length of checksum description |
| ? | UTF-8 string | name of hash function (It's NUL-terminated as boundary.) |
| ? | checksum bytes | hash of the entire file |

For example, MD5 hash is stored as Length = 20, Name is "MD5\0", Hash is 16-bytes.
SHA-1 hash is stored as Length = 26, Name is "SHA-1\0", Hash is 20-bytes.
SHA-256 hash is stored as Length = 40, Name is "SHA-256\0", Hash is 32-bytes.
SHA-512 hash is stored as Length = 72, Name is "SHA-512\0", Hash is 64-bytes.
SHA-3 (256-bit) hash is stored as Length = 38, Name is "SHA-3\0", Hash is 32-bytes.
SHA-3 (512-bit) hash is stored as Length = 70, Name is "SHA-3\0", Hash is 64-bytes.
Whirlpool hash is stored as Length = 74, Name is "Whirlpool\0", Hash is 64-bytes.
BLAKE3 (128-bit) hash is stored as Length = 23, Name is "BLAKE3\0", Hash is 16-bytes.
BLAKE3 (256-bit) hash is stored as Length = 39, Name is "BLAKE3\0", Hash is 32-bytes.
It's possible to store arbitrary output lengths under the same name, since the description length differs.
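A small sketch of serializing one checksum description in the layout proposed above (again, this packet type is only a suggestion, not part of the spec):

```c
/* Sketch of the proposed (not in the spec) checksum description layout:
 * a 1-byte length, a NUL-terminated hash-function name, then the raw hash.
 * The length byte covers the name and hash, but not itself. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the number of bytes written, or 0 if the description does not fit. */
size_t write_checksum_description(uint8_t *out, size_t out_size,
                                  const char *name,        /* e.g. "SHA-256" */
                                  const uint8_t *hash, size_t hash_len)
{
    size_t name_len = strlen(name) + 1;           /* include the NUL */
    size_t desc_len = name_len + hash_len;        /* value of the length byte */
    if (desc_len > 255 || 1 + desc_len > out_size)
        return 0;

    out[0] = (uint8_t)desc_len;
    memcpy(out + 1, name, name_len);
    memcpy(out + 1 + name_len, hash, hash_len);
    return 1 + desc_len;
}
```

For SHA-256 this produces a description of 1 + 8 + 32 bytes, matching the "Length = 40" example above.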

Change of Chunk Description

As mentioned above, this solves the fault of PAR3 that a whole file cannot be checked with common hash functions in a compatible way. When a Checksum Packet protects the file data with cryptographic hash functions, the chunk checksum doesn't need such a strong hash. So I suggest using CRC-32 as the chunk checksum instead of the current BLAKE3 hash. Though I wanted to remove the chunk checksum entirely, this is a compromise with Michael Nahas's concern about internal calculation failures.

CRC-32 can be the minimal error check for both chunks and the entire file. When a user wants speed and doesn't create a Checksum Packet, the CRC-32 is shown as the checksum of the whole file. (A PAR3 client can combine the CRC-32s of the chunks to produce the CRC-32 of the entire file data.) As a file checksum, 4 bytes is very small, but it's better than nothing. (Currently PAR3 has no per-file checksum at all.)

Because 32 bits is too small to reliably indicate a non-protected chunk, a 1-byte flag may still be required. Even with a 1-byte flag and a 4-byte CRC-32 for each chunk, the total of 5 bytes is less than the BLAKE3 hash's 16 bytes. A small chunk description keeps the packet size small.

@mdnahas
Member Author

mdnahas commented Feb 11, 2022

Par3 is about safety, not speed. Saying "I can make this much faster by giving up safety" goes against the whole point of Par3. Saying "safety is optional" goes against Par3.

I care about speed, but it is a far secondary consideration far below safety.

We should be discussing how to put per-file hash back in. Can we get rid of the per-chunk checksums and say that unprotected chunks are assumed to be 0-bytes when calculating the per-file hash?

@Yutaka-Sawada
Collaborator

Can we get rid of the per-chunk checksums and say that unprotected chunks are assumed to be 0-bytes when calculating the per-file hash?

Yes, we can, though the "Par Inside Another File" feature needs a tweak. When a user cannot check the per-file hash with another tool, the hash value is impossible to confirm, just like the per-block or per-chunk hashes. If a PAR3 client shows the per-file hash to a user, it should also have a feature to reconstruct the original "outside" file.

For example, say there is a ZIP file and its hash value. A user can check the integrity of the ZIP file by recalculating the hash. When a PAR3 client creates "PAR inside ZIP", the hash value becomes different. When a PAR3 client removes the inside PAR from the ZIP file, the hash value returns to the original. So PAR3 clients would have two repair modes for "PAR inside ZIP": reconstruct the original ZIP file (by removing the PAR3 packets), or repair the modified ZIP file (by keeping the PAR3 packets).

To perform such a "remove PAR from outside file" feature, the "PAR inside" feature needs some restrictions. First, the concatenation of the protected chunks must reproduce the original outside file. When a PAR3 client puts PAR3 packets or extra data between or after the original file's data, such additional bytes must belong to an unprotected chunk. That way, another PAR3 client will be able to join the protected chunks to reconstruct the original file when a user wants to check it against the per-file hash manually.

@Yutaka-Sawada
Collaborator

While making packets, I found a problem with the order of packets.

"Order of Packets" in PAR3 spec is written as following;

In order for a client to do single-pass recovery, the recommended order of packets is:

  • Creator
  • Start
  • Matrix
  • Data or External Data
  • Root
  • Recovery Data

The File and Directory packets can appear in any place.

To do single-pass recovery, the File, Directory, and Root packets must appear before the Data packets. The External Data packets should also appear before the Data packets. Data packets contain the raw data of input blocks; a PAR3 client needs to know which input file an input block belongs to at recovery time, and which input files are damaged before writing the block. So a PAR3 client wants to read the file and directory tree information first.

By the way, the order of the File, Directory, and Root packets seems to be decided at creation time. Because a Directory Packet includes the checksums of its child files and directories, a PAR3 client needs to make the children's packets first. So the creation order must be: File packets -> Directory packets -> Root packet. Normally the creation order and the writing order are the same.

Then, the recommended order of packets would be:

  • Creator
  • Start
  • Matrix
  • File
  • Directory
  • Root
  • External Data
  • Data
  • Recovery Data

@mdnahas
Member Author

mdnahas commented Feb 14, 2022

Can we get rid of the per-chunk checksums and say that unprotected chunks are assumed to be 0-bytes when calculating the per-file hash?

Yes, we can, though the "Par Inside Another File" feature needs a tweak. When a user cannot check the per-file hash with another tool, the hash value is impossible to confirm.

I don't expect another program, like "b3sum", to be able to confirm the hashes for files with Par-inside. In most cases, we're modifying those files. It's a rare enough case that I'm not going to worry about it.

I'll try to spend some time working on putting the per-file hash back into the specification.

@Yutaka-Sawada
Collaborator

I found a limitation in my support of the PAR3 spec: the "Link Packet". It seems that the Microsoft C runtime library cannot distinguish between a real file and a hard or symbolic link on Windows, and I don't know how they are searched or listed. Because adding/removing links seems to require administrator privileges on Windows, it's difficult for normal applications like PAR3 clients to use. So I won't support the Link Packet in my PAR3 client for Windows. Implementing the Link Packet for Linux/UNIX will be another programmer's task.

@Yutaka-Sawada
Collaborator

The sparse random matrix are not expected to be solved by Gaussian Elimination except in very bad cases. Most will be solved by the LDPC algorithm of solving for one input block at a time.

This will be possible only when very few blocks are damaged. Otherwise, it depends on randomness.

Also, there is a problem with the memory requirement for the generator matrix. The matrix size seems to be a well-known problem for LDPC, and there are many papers that try to solve it. To decrease the memory needed for the matrix, a structured matrix is used instead of a randomly created one. One such construction is called Quasi-Cyclic LDPC. I tried to read some papers, but I could not understand the construction. There seems to be a fast encoding method, too.

Construction of quasi-cyclic LDPC codes from quadratic congruences

Most methods for designing good LDPC codes are based on random construction, but the lack of structure makes the encoding process complicated and the memory in the decoder required to store the parity-check matrices may be prohibitive in practical application.

Arbitrary Bit Generation and Correction Technique for Encoding QC-LDPC Codes with Dual-Diagonal Parity Structure

The quasi-cyclic LDPC codes has memory efficiency due to their algebraic structure, compared to general LDPC codes.

@Yutaka-Sawada
Collaborator

Yutaka-Sawada commented Jul 2, 2022

that mentions limitation of "recovery_count <= original_count" but that limitation can be removed quite easily

Though I referred to Sian-Jheng Lin's original sample code and Markus Laire's Rust implementation, it was too difficult for me to implement the Low Rate Encoder/Decoder. It's not so easy, hehe. My mathematical knowledge is at junior high school level.

But I found that the leopard-RS library works with over 100% redundancy. While it's a High Rate Encoder/Decoder currently, it works by zero padding internally, so there seem to be speed and memory usage problems. If someone implements a Low Rate Encoder/Decoder in the leopard-RS library, that will solve the problem.

By the way, I found a paper about this problem. There may be another solution in the future.
Fast Encoding and Decoding Algorithms for Arbitrary (n,k) Reed-Solomon Codes Over F2m

@malaire

malaire commented Jul 2, 2022

Though I referred to Sian-Jheng Lin's original sample code ...

That sample code has two bugs in the encodeL path. To change that code to use encodeL:

  • comment line 230 (encodeH)
  • uncomment line 231 and change it to encodeL(&data[Size-k], k, codeword);
  • change line 233 to memcpy(data, codeword, sizeof(GFSymbol)*Size);

I also had to rename log to LOG and exp to EXP as otherwise gcc complained about similarly named functions.

P.S. In the other thread I said that generally you can't use the same decoder for encodeH and encodeL since they are not compatible. That sample code does use the same decoder for both, and in this specific example it works, but not in general.

@Yutaka-Sawada
Collaborator

I have a question about the UNIX Permissions Packet. Should I repeat this optional packet in each PAR3 file, the same as the File Packets? At this time, I categorize the packets as below.

One time in each PAR3 file:
Creator packet, Comment packet

Multiple times in each PAR3 file, if there are multiple Data Packets or Recovery Data Packets:
Start packet, External Data Packet, Matrix Packet, File Packet, Directory Packet, Root Packet

One time in whole PAR3 files:
Data Packet, Recovery Data Packet

@Yutaka-Sawada
Collaborator

I implemented mtime and i_mode (permissions) in the UNIX Permissions Packet. The other fields are not supported (or are incompatible) on Windows. The feature to store/recover mtime (modification time) will be useful, since some ZIP archivers restore the modification time.

By the way, changing the READ or WRITE permissions of a file may make verification and/or repair fail. If you don't have read permission, you may not be able to verify. If you don't have write permission, you may not be able to repair. Though I tested the behavior on Windows, I don't know what happens on Linux.

On Windows, the timestamp and permissions of folders (directories) may not work. Even when it restores the original timestamp (modification time) of a directory, the timestamp changes again after repairing child files. It seems that the permissions of folders are ignored. So I added the optional packet for files only.

@malaire

malaire commented Jul 17, 2022

On Windows OS, timestamp and permissions of folders (directories) may not work. Even when it restores original timestamp (modification time) of a directory, it will be changed after repairing child files.

Directory timestamp needs to be restored after its child items (files & directories) have been handled.

@Yutaka-Sawada
Collaborator

Directory timestamp needs to be restored after its child items (files & directories) have been handled.

Oh, I see. Thank you for the idea. Though I implemented the function, it doesn't work on Windows. I found that the Microsoft C runtime library does not support changing the properties of a directory; I cannot change either the timestamp or the permissions of a directory on Windows. It seems that the Win32 API can modify the timestamp, but then I cannot write it in standard C. I don't know whether it works on Linux/UNIX.
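For reference, the Win32 way to do this (outside standard C, as noted above) is to open the directory handle with FILE_FLAG_BACKUP_SEMANTICS and then call SetFileTime; a minimal sketch:

```c
/* Sketch (Win32 only, not standard C): set a directory's last-write time.
 * FILE_FLAG_BACKUP_SEMANTICS is required to open a handle to a directory. */
#include <windows.h>

BOOL set_directory_mtime(const wchar_t *dir_path, const FILETIME *mtime)
{
    HANDLE h = CreateFileW(dir_path,
                           FILE_WRITE_ATTRIBUTES,
                           FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                           NULL,
                           OPEN_EXISTING,
                           FILE_FLAG_BACKUP_SEMANTICS,  /* needed for directories */
                           NULL);
    if (h == INVALID_HANDLE_VALUE)
        return FALSE;

    /* NULL leaves creation and access times unchanged. */
    BOOL ok = SetFileTime(h, NULL, NULL, mtime);
    CloseHandle(h);
    return ok;
}
```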

@Yutaka-Sawada
Collaborator

I found a problem with the File System Specific packets (options in the File Packet). When a File Packet includes an option (such as a UNIX Permissions Packet), the checksum of the File Packet becomes different. Because the Root Packet includes the checksums of the File Packets, the Root Packet becomes different, too. So the Recovery Data ends up incompatible, even though the file data itself is the same.

This isn't so serious when one user creates PAR3 files with the same settings. It may cause a compatibility issue when another user tries to create additional PAR3 files. Because it's difficult to retrieve all the settings of the original PAR3 files, it would be good for par3cmdline to have a special feature for creating additional PAR3 files, like the e(xtra) command in QuickPar. But it will be complex to implement.

@Yutaka-Sawada
Collaborator

A while ago, I posted an idea of cohorts to support many blocks. The technique seems to be generally known as interleaving. Interleaving is mostly used to correct burst errors (loss of a whole file in our case) with small code words. Because the advantage is proven, it would be good to try. As one kind of interleaving, it maps blocks to cohorts using a very simple modulo-based interleaver.

For example, when there are 10 blocks (index = 0 1 2 3 4 5 6 7 8 9).
When 3 cohorts are made, the calculation is "index modulo 3".
number0 cohort : "index modulo 3 is 0", result in 4 blocks (index = 0 3 6 9)
number1 cohort : "index modulo 3 is 1", result in 3 blocks (index = 1 4 7)
number2 cohort : "index modulo 3 is 2", result in 3 blocks (index = 2 5 8)

The relation between blocks and cohorts is easy to calculate (a small sketch is below). While this method is simple, printing the verification result may become complex. I will implement and test.
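A tiny sketch of the modulo-based interleaver described above, reproducing the 10-block / 3-cohort example:

```c
/* Sketch of the modulo interleaver described above (illustrative only):
 * block "index" goes to cohort (index % cohort_count), and becomes the
 * (index / cohort_count)-th block inside that cohort. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t block_count = 10, cohort_count = 3;

    for (uint64_t index = 0; index < block_count; index++) {
        uint64_t cohort = index % cohort_count;   /* which cohort */
        uint64_t local  = index / cohort_count;   /* position inside it */
        printf("block %llu -> cohort %llu, local index %llu\n",
               (unsigned long long)index,
               (unsigned long long)cohort,
               (unsigned long long)local);
    }
    return 0;
}
```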

@Yutaka-Sawada
Collaborator

I implemented cohorts (interleaving blocks) to support many blocks for the FFT-based Reed-Solomon codes. It seems to work well against loss (burst errors) of input files. Even though the Reed-Solomon code is calculated over a 16-bit finite field, par3cmdline currently supports a maximum of 2^47 blocks (2^32 cohorts * 2^15 blocks = 2^47 total blocks). That is more than 32-bit Reed-Solomon codes allow, and the speed is faster. It's possible to support even more blocks in the future, when PCs contain huge amounts of RAM.

But the recovery capability is lower than with a single whole code. Recovery blocks in a cohort can recover input blocks in that cohort only. If many blocks in one cohort are damaged, that cohort may not be restorable. Even when total repair is impossible, other cohorts may be recoverable independently, so it works like a locally repairable code.

I described how to interleave blocks and the construction of the FFT Matrix Packet in Appendix.txt in par3cmdline's source code.

@Yutaka-Sawada
Collaborator

I implemented "extend" command in par3cmdline. It reads a PAR3 set and creates compatible recovery blocks. A user may add more recovery files or re-create a damaged PAR3 file newly.

Internally, the mechanism is combination of "verify" and "create". It verifies input files with the PAR3 set, and creates recovery blocks for them in same settings. So, a user cannot change arrangement of input files nor alignment of input blocks. Because a user is difficult to set unknown options by another user, this "extend" command will be useful.

QuickPar has this feature as "Extra" button. In par2cmdline, same input files and block size will make same recovery blocks. There was no compatibility issue in the age of PAR2. On the other hand, PAR3 has more factors and may be incompatible by creating PAR3 clients. So, I made this function to follow settings of original PAR3 set.

@Yutaka-Sawada
Collaborator

I consider "PAR inside ZIP" feature (putting recovery data inside a ZIP).

From Parity_Volume_Set_Specification_v3.0:

The ZIP file format stores compressed files at the front and the list of all files at the end,
but, inbetween, space can be used for any data without affecting the ZIP file's contents.
Par3 packets can put in this space.

Simple construction of ZIP file is like below;
[ file data ] [ file data ] [ list of content ]

The "list of content" must exist at the end of ZIP file. So, par3cmdline may put PAR3 packets between "file data".

The protected ZIP construction will be like below;
[PAR3] [ file data ] [PAR3] [ file data ] [PAR3] [ list of content ]

Now, there is a restriction on where to put the PAR3 data: it cannot split the real "file data" in a ZIP file. When a ZIP file contains only 1 file, there are only 2 places to put PAR3 packets.

Construction of single file in ZIP is like below;
[ file data ] [ list of content ]

The protected ZIP construction will be like below;
[PAR3] [ file data ] [PAR3] [ list of content ]

For example, take a ZIP file of 200 MB. Setting 3% redundancy creates 6 MB of recovery data. When the ZIP file contains only 1 file, it puts 3 MB of recovery data in front of and behind the "file data".
[3MB PAR3] [200MB file data] [3MB PAR3] [list of content]

When the ZIP file contains multiple files, it may put 2 MB of recovery data at the front, middle, and back of the "file data".
[2MB PAR3] [100MB file data] [2MB PAR3] [100MB file data] [2MB PAR3] [list of content]

I'm not sure how many places are good. Putting data in 2 places (front and back) is simple, but it may be easy to lose all the PAR3 packets: deleting the top of the file and the end of the file strips off the protection with ease. Putting data in 3 places (front, middle, and back) is more complex and harder to remove. But is it worth trying? As I'm lazy, the simple construction (2 places) would be enough.

@Yutaka-Sawada
Collaborator

I tested a method to insert 2 spaces into a ZIP file. Inserting an unprotected chunk before the local file data requires updating the offsets in the central file directory and in the end of central directory record. The result is a valid ZIP file; Windows and 7-Zip can recognize it.

But I found a little problem: because the offsets in the ZIP file are updated, the hash value becomes different.

Construction of original zip file;
[ local file data ] [ list of content ]

Construction of spaced zip file;
[ space ] [ local file data ] [ space ] [ list of content ]

Construction of protected zip file;
[ PAR3 packets ] [ local file data ] [ PAR3 packets ] [ list of content ]

Also, it needs to create a temporary file that is larger than the original ZIP file. A small ZIP file could be kept in RAM, but a large ZIP file will require more file access. Inserting bytes into the middle of a file is a heavy task.

I feel that a "PAR2-style recovery record" may be good. That method just appends recovery data after the ZIP file. It's simple, fast, and easy to implement, but it's weaker against sequential damage (burst errors), because all the recovery data sits at the end of the file.

Construction of protected zip file with PAR2 recovery record;
[ zip file ] [ PAR2 packets ]

@Yutaka-Sawada
Collaborator

While trying to implement the "PAR inside ZIP" feature, I found a fault in the current PAR3 specification.

In the File Packet, there are 2 hash values:
"rolling hash of the first 16kB of the file" and "fingerprint hash of the protected data in the file"

Because a protected outside file consists of both unprotected chunks and protected chunks, these hash values cannot be used to detect damage to the file as a whole. It's pointless to store such values in the packet.

These values should instead be:
"rolling hash of the first 16kB of the original file" and "fingerprint hash of the original data in the file"

When the outside file is a ZIP file, the File Packet should contain hash values of the original ZIP file instead of the protected ZIP file. Inserting PAR3 packets inside a ZIP file may modify the original data in the ZIP file, and removing PAR3 packets from a protected ZIP file requires knowing what the file looked like before. By checking the hash value of the original file, a client can confirm that the removal was done correctly.

par3cmdline will have two features: "insert PAR3 packets into a ZIP file" and "remove PAR3 packets from a protected ZIP file". Then, even if the PAR3 packets inside the ZIP are damaged, it's possible to restore them at any time.

@Yutaka-Sawada
Collaborator

Yutaka-Sawada commented Jun 2, 2023

Though I posted the idea of "storing a checksum of the original file for PAR inside" earlier, I take it back. It's hard to determine the original form of a file; that is specific to the file format and to the PAR3 client. So it's impossible to define a common usage for such a checksum.

I implemented the "PAR inside ZIP" feature. The File Packet contains two checksums of an outside file: the CRC-64 of the first 16 KB and the BLAKE3 hash of the protected data. If the first 16 KB includes any unprotected chunks, the CRC-64 value becomes useless (it is set to zero). All unprotected chunks are ignored when calculating the BLAKE3 hash. So a PAR3 client's verification process must take the existence of unprotected chunks into account.

I tested two file formats, ZIP (.zip) and 7-Zip (.7z). The method is the same as that of a PAR2 recovery record: it doesn't change any bytes of the original file, it just adds a recovery record at the end of it.

How to add a PAR3 recovery record to a ZIP file (.zip):

Construction of original .zip file;
[ local file data ] [ list of content ]

Construction of spaced .zip file;
[ local file data ] [ list of content ] [ space ] [ list of content ]

Construction of protected .zip file;
[ local file data ] [ list of content ] [ PAR3 packets ] [ list of content ]
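As a side note, this layout only needs one patching step: the appended copy of the "list of content" (the ZIP central directory) must be referenced by a new end-of-central-directory record whose offset field points at the copied directory. Below is a rough sketch of that step, not par3cmdline's actual code; the field offsets follow the public ZIP format, everything else is a hypothetical illustration.

```c
/* Sketch: patch the End Of Central Directory record after appending
   [ PAR3 packets ] [ copy of central directory ] to a .zip file. */
#include <stdint.h>
#include <stdio.h>

/* Write a 32-bit value in little-endian order. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* eocd points at a copy of the 22-byte End Of Central Directory record
   (signature 0x06054b50).  new_cd_offset is the file offset of the copied
   "list of content" (central directory) in the protected file. */
static void patch_eocd(uint8_t *eocd, uint32_t new_cd_offset)
{
    /* "offset of start of central directory" lives at byte 16 of the EOCD */
    put_le32(eocd + 16, new_cd_offset);
}

int main(void)
{
    uint8_t eocd[22] = {0x50, 0x4b, 0x05, 0x06};  /* signature only, rest zeroed */
    patch_eocd(eocd, 123456u);                    /* hypothetical new offset */
    printf("patched offset bytes: %02x %02x %02x %02x\n",
           eocd[16], eocd[17], eocd[18], eocd[19]);
    return 0;
}
```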

How to add a PAR3 recovery record to a 7-Zip file (.7z):

Construction of original .7z file;
[ stream data ]

Construction of spaced .7z file;
[ stream data ] [ space ]

Construction of protected .7z file;
[ stream data ] [ PAR3 packets ]

Because it doesn't modify the original file, it's easy to append or remove PAR3 packets. I made an insert command and a delete command for "PAR inside ZIP". I also made a "verify itself" command to treat a ZIP file as a PAR3 file: it searches for PAR3 packets inside a protected ZIP file. They work well.

But there is one problem: verification cannot state whether a protected ZIP file is complete or not. Because it cannot determine the status of the unprotected chunks, it checks the protected chunks only. It's also difficult to repair a protected ZIP file while keeping it protected. Though it's possible to repair a damaged protected ZIP file, the repair will erase the appended PAR3 packets in the unprotected chunks. Because the original data of the unprotected chunks is unknown, it's difficult to copy the PAR3 packets back into the repaired protected ZIP file. So while it can protect a ZIP file (the outside file), it cannot protect itself (the PAR3 packets inside the ZIP file). This is a shortcoming of the "PAR inside" feature.

@Yutaka-Sawada
Collaborator

There is a defect in the current PAR3 specification: it cannot determine whether a PAR3 file is complete or not. While a PAR3 file consists of many packets, there is no checksum of the entire file. Even though every packet has its own checksum to detect damage, there is no way to detect the loss, insertion, replacement, or modification of packets. It's impossible to reconstruct the original formation of a PAR3 file. PAR3 inherits this problem from PAR2, which uses the same packet construction.

Thus, PAR2 clients cannot repair PAR2 files themselves, even when they can repair the input files. While PAR3 introduces the "PAR inside PAR" feature to protect another (outside) PAR3 file, it cannot protect itself. "PAR inside" allocates unprotected chunks to hold PAR3 packets, and as I wrote in the last post, the state of PAR3 packets in unprotected chunks is unknown. We should solve this problem in PAR3.

I came up with a new idea, which keeps information about each PAR3 file. An input file consists of many input blocks, and PAR3 files store information about all input blocks in the External Data Packet. I think the same approach can be applied to the PAR3 file itself. Because a PAR3 file consists of many PAR3 packets, keeping the order of those packets is enough to reconstruct the file in the future. I named this usage the Order Packet. It stores a PAR3 filename and the checksums of all PAR3 packets in that PAR3 file. If some PAR3 packets are lost or damaged, a PAR3 client will be able to copy the packet from another position. It can arrange packets to reconstruct a PAR3 file in the order specified in the Order Packet. It cannot recover lost or damaged packets; it just copies complete packets. Because most PAR3 packets are duplicated many times in a set of PAR3 files, it's usually possible to find the same packet in another PAR3 file.

I describe the current details below. If someone finds a problem or has a better idea, please advise me.

Order Packet

This packet specifies the filename and the order of PAR3 packets in a PAR3 file.
The information is used to verify the PAR3 file's completeness.

The Order packet has a type value of "PAR ORD\0" (ASCII). The packet's body contains the following:

Table: Order Packet Body Contents

| Length (bytes) | Type | Description |
|----------------|------|-------------|
| 2 | unsigned int | length of filename in bytes |
| ? | UTF-8 string | filename of the PAR3 file |
| 16*? | fingerprint hashes | checksums of ordered packets in the PAR3 file |

The first and second fields store the filename of each PAR3 file.
When there are multiple Order Packets in a set of PAR3 files,
PAR3 clients can distinguish them from each other.
It's easy to know which Order Packet belongs to which PAR3 file.

When there is only one PAR3 file, the filename may be omitted.
Because there is only one kind of Order Packet in that PAR3 file,
there is no need to know which PAR3 file the Order Packet belongs to.
If PAR3 packets are made for "Par inside",
the Order Packet doesn't store the outside filename either.

It's possible to correct a misnamed PAR3 file by reading the filename in the Order Packet.
If a user wants to change the filename later, he may omit the name fields.

When a PAR3 client doesn't store the PAR3 filename in the Order Packet,
the first field's value is zero and the second field doesn't exist.

The number of included checksums can be calculated from this packet's length.
Some packet types use an identifier instead of a packet checksum.
Because a PAR3 file may include outside data in addition to PAR3 packets,
the number of checksums (or identifiers) may differ from the number of packets.

Table: Identifier for Order Packet

| Length (bytes) | Type | Description |
|----------------|------|-------------|
| 4 | byte[4] | Type value = {'O', 'R', 'D', 'E'} (ASCII) |
| 4 | byte[4] | The first 4 bytes of PAR3 filename's hash value |
| 8 | unsigned int | The file size of the PAR3 file |

Because the Order Packet cannot store a checksum of itself, it uses an identifier.
The PAR3 file's name and size are used to determine which PAR3 file an Order Packet belongs to.
When an Order Packet doesn't store the filename, the hash value of the filename is set to zeros.
If the file size is unknown or impossible to pre-calculate, the size value is stored as zero.

Normally, a PAR3 file should include only its own Order Packet.
If there are 4 PAR3 files, 4 kinds of Order Packets are made independently, one for each PAR3 file.
Though an Order Packet may be duplicated multiple times within a PAR3 file, it is still the same single kind.
An Order Packet is specific to the PAR3 file it belongs to and doesn't affect other PAR3 files.

Example case of four PAR3 files;
"something.par3" includes Order Packet for "something.par3".
"something.vol0+1.par3" includes Order Packet for "something.vol0+1.par3".
"something.vol1+2.par3" includes Order Packet for "something.vol1+2.par3".
"something.vol3+4.par3" includes Order Packet for "something.vol3+4.par3".

But, as a rare case, it's possible to put Order Packets of other PAR3 files in a file.
In that case, those Order Packets need to be distinguished.
When a PAR3 file contains multiple kinds of Order Packets,
they are told apart by their stored name and size.

Unimportant packets, which may not be duplicated in a set of PAR3 files.

When a PAR3 file is damaged (some PAR3 packets are missing or damaged),
it's possible to reconstruct the file by copying the packets from another PAR3 file.
In a set of PAR3 files, important packets are usually duplicated.
But unimportant packets (like the Creator Packet or Comment Packet) may not be duplicated.
So the Order Packet stores both their checksum and size.
If a client finds a packet with the same checksum, it just copies that packet.
If it doesn't find the same checksum, it may fill the space with a dummy packet.

Table: Identifier for Creator Packet

| Length (bytes) | Type | Description |
|----------------|------|-------------|
| 4 | byte[4] | Type value = {'C', 'R', 'E', 'A'} (ASCII) |
| 4 | byte[4] | The first 4 bytes of the checksum from the Creator packet |
| 8 | unsigned int | Length of the packet |

Because the Creator Packet doesn't need to be protected, it uses an identifier.
When this packet is missing, a PAR3 client may put its own Creator Packet in its place.
The type value is 4 characters so that it doesn't accidentally match the 3-character packet type in the Packet Header.

Table: Identifier for Comment Packet

| Length (bytes) | Type | Description |
|----------------|------|-------------|
| 4 | byte[4] | Type value = {'C', 'O', 'M', 'M'} (ASCII) |
| 4 | byte[4] | The first 4 bytes of the checksum from the Comment packet |
| 8 | unsigned int | Length of the packet |

Because the Comment Packet doesn't need to be protected, it uses an identifier.
The type value is 4 characters so that it doesn't accidentally match the 3-character packet type in the Packet Header.

Unique packets, which aren't duplicated in a set of PAR3 files.

Normally, PAR3 files don't contain identical Data Packets or Recovery Data Packets.
So a PAR3 client cannot copy them by comparing checksums.
But it can recreate these packets from the complete input files.
Thus, the Order Packet stores their index instead of a packet checksum.

Table: Identifier for Data Packet

| Length (bytes) | Type | Description |
|----------------|------|-------------|
| 8 | byte[8] | Type value = {'D', 'A', 'T', 'A', '\0', '\0', '\0', '\0'} (ASCII) |
| 8 | unsigned int | The index of the input block |

Table: Identifier for Recovery Data Packet

| Length (bytes) | Type | Description |
|----------------|------|-------------|
| 4 | byte[4] | Type value = {'R', 'E', 'C', 'O'} (ASCII) |
| 2 | byte[2] | The first 2 bytes of the checksum from the Root packet |
| 2 | byte[2] | The first 2 bytes of the checksum from the Matrix packet |
| 8 | unsigned int | The index of the recovery block |

Outside data, which doesn't belong to any PAR3 packet.

When a PAR3 file includes some data outside its PAR3 packets, that data is called "outside data".
The outside data may consist of protected chunks used by the "Par inside" feature.

The Order Packet stores the size of the area where PAR3 packets may not exist.
That way, it's possible to determine whether a PAR3 file is complete,
even if the file includes additional unknown data.
When a PAR3 client uses the "Par inside Par" feature,
the outside data may include PAR3 packets with a different InputSetID.

Table: Identifier for outside data

| Length (bytes) | Type | Description |
|----------------|------|-------------|
| 8 | byte[8] | Type value = {'O', 'U', 'T', 'S', 'I', 'D', 'E', '\0'} (ASCII) |
| 8 | unsigned int | Length of outside data |
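As a sketch only, the identifiers above could be laid out as packed C structures like the following. The struct and field names are hypothetical and simply mirror the proposed tables; they are not part of the specification. Each identifier occupies 16 bytes, the same size as a packet checksum in the list.

```c
/* Sketch of the proposed Order Packet identifiers as packed C structs.
   Field order and sizes follow the tables above; names are hypothetical. */
#include <stdint.h>

#pragma pack(push, 1)

typedef struct {
    uint8_t  type[8];      /* "DATA\0\0\0\0" */
    uint64_t block_index;  /* index of the input block */
} order_id_data;

typedef struct {
    uint8_t  type[4];         /* "RECO" */
    uint8_t  root_hash[2];    /* first 2 bytes of Root packet checksum */
    uint8_t  matrix_hash[2];  /* first 2 bytes of Matrix packet checksum */
    uint64_t block_index;     /* index of the recovery block */
} order_id_recovery;

typedef struct {
    uint8_t  type[8];      /* "OUTSIDE\0" */
    uint64_t length;       /* length of outside data in bytes */
} order_id_outside;

#pragma pack(pop)
```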

@mdnahas
Member Author

mdnahas commented Jun 13, 2023

Sorry, I've been inactive for a while. I didn't fall off the boat, if that's what you were wondering!

Yutaka Sawada, you say that "There is a defect in current PAR3 specification" that PAR3 files cannot repair themselves and there isn't a checksum for the entire file. I'm willing to discuss the details of implementing this, but first I want to know the use case. Where do you expect users to need this?

Right now, my use case is that someone either (1) stores data on a faulty disk or (2) transmits data over a faulty network. When data is lost, they use the PAR3 files to repair it. I'm not sure why someone would want to repair the PAR3 file itself. If a user wants to store the data again or transmit the data again, any damaged PAR3 data can be deleted and new PAR3 recovery data can be generated. Can you say why a user needs the exact same PAR3 as the original?

@mdnahas
Member Author

mdnahas commented Jun 13, 2023

I'm not clear on what you mean when you say you implemented "cohorts". Are certain packets written out of the usual order? Is something done to the matrix?

What is the goal of cohorts? Interleaving is used to distribute errors evenly among the input blocks. Randomizing the placement of errors is useful when the recovery matrix has a harder time recovering errors near each other. The only recovery mechanisms we've talked about are either random or Cauchy. Neither of those is affected by errors being nearer to each other.

@mdnahas
Member Author

mdnahas commented Jun 13, 2023

Regarding Par-in-ZIP, I find it amazing that you're able to do that! It was certainly not something I expected to do with the initial release of Par3.

I figured the initial versions of Par-in-ZIP would work the way you implemented: appending to the end of the file and duplicating the "list of contents" section. I'm sorry that I wasn't active and able to explain it to you at the time.

@mdnahas
Member Author

mdnahas commented Jun 13, 2023

Regarding file/directory permissions, yes, the checksum in the Root Packet was designed to change if the file permissions change. I considered it best to treat a directory snapshot with permissions as completely different from one without permissions. It's more of a "superset" relationship, but I decided it wasn't worth modifying the specification to capture that aspect of the relationship. I don't think it will come up much.

@mdnahas
Member Author

mdnahas commented Jun 13, 2023

I don't know when I'll get time to look at your code, but I'll try to make some time soon.

@Yutaka-Sawada
Collaborator

Welcome back, Michael Nahas.

Where do you expect users to need this?

Users want to know which PAR files are damaged. Only when a user knows a PAR file is damaged will he delete it and re-create it.

Below is the user's real complaint.

I am worried there are many other instances of damaged PAR files and I've no way to detect them currently. I used Multipar on 8TB of data across hundreds of folders so would like to be able to detect and fix them before the underlying data gets corrupted.

Basically there's no way damaged Par files can be detected unless you look at the verify in real time, like how I had spotted it accidentally. I program so I accept that bugs are inevitable, however I still have this worry and would greatly welcome any suggestions on how to overcome this bug.

Because the original post disappeared at the end of the old web forum, I put the MIME HTML and a web capture (in case you cannot open the old html file) on my web space. They are "Error handling does not detect damaged par files.mht" and "Error handling does not detect damaged par files.jpeg" in the "MultiPar_sample" folder on OneDrive.

If a user wants to store the data again or transmit the data again, any damaged PAR3 data can be deleted and new PAR3 recovery data can be generated.

Because I'm not good with English, it may be hard for me to express what I think, but I'll try as best I can. You seem to be missing the starting point: "How does a user find a damaged PAR3 file?" While a user doesn't know whether a PAR3 file is damaged or not, he won't delete it or re-create it. That is why the user posted the complaint quoted above: "Error handling does not detect damaged par files".

On the web forum, another user suggested making a checksum file for the created PAR files, so that damage to the PAR files can be detected with the checksum file. Because he seems a little paranoid, he doesn't rely on a single solution but uses multiple checkers.

There are input files.
Create PAR files to protect input files.
Create checksum files to verify both input files and PAR files.

But most users will use PAR3 without another checksum tool. In your example use case, you would not store or send a checksum file for the PAR files. Are you suggesting "detect damage to the PAR3 file by using an attached checksum file"? I think it's useful to offer a way for a PAR file to detect its own damage.

Can you say why a user needs the exact same PAR3 as the original?

It's not so important to restore the original formation, but it is important to know the difference. If a PAR file differs from its original data, the user knows something is wrong with that PAR file. It's up to him whether to recreate it or keep it as-is. At the least, he needs a chance to act by seeing the status of his PAR files.

After I read the post, I modified my PAR2 client to return the status of PAR2 files. While it's impossible to determine that a PAR2 file is a "complete original formation", it can find damaged PAR2 packets or unknown bytes between packets. Each PAR2 file is labeled as "good" (all PAR2 packets are complete), "damaged" (there are damaged packets or unknown bytes), or "useless" (it doesn't include any packets). Though this is simple and may rarely miss damage, it seems to be enough for most usage.

The rare case is an interrupted transfer (or save) that stops after some complete PAR2 packets. Because a PAR2 file consists of many packets, it can still look like a good PAR2 file with only a few complete packets. For example, take a PAR2 file of 100 MB: if only the first 1 MB of the file is sent or saved, it can still show "good" status because the packets it does contain are complete. This may happen when someone sends (or saves) PAR2 packets one by one. Normally it won't happen, since multiple PAR2 packets are sent (or saved) at once. If the transfer (or save) stops in the middle of a packet, that packet is truncated and will be detected as damaged.

Now, it's harder to determine whether a PAR3 file is damaged than it was for a PAR2 file. PAR3 allows padding between packets, and the "PAR inside" feature inserts unknown bytes (outside file data) between packets. It will be hard to distinguish "remains of a damaged packet" from "intentionally inserted bytes". Though it's possible to label a PAR3 file's status at the current time, it's not very accurate.

@Yutaka-Sawada
Collaborator

I'm not clear what you mean by you implemented "cohorts".

The "cohort" implementation is based on John M Długosz 's idea. It distributes input blocks into some groups. It creates recovery blocks in each group independently. He named the group of input blocks and recovery blocks as "cohort". Each cohort contains a set of input blocks and recovery blocks. A recovery block in a cohort can recover an input block in the cohort only. The idea of "cohort" can be adapted to all Error Correction Codes, because the behavior is just a regular interleaver.

Is something done to the matrix?

At this time, I implemented the "cohort" system (interleaving) only for FFT-based Reed-Solomon Codes, because Cauchy Reed-Solomon Codes are too slow to handle many blocks. I added a new packet type, the "FFT Matrix Packet", to support FFT-based Reed-Solomon Codes. I wrote the details in "Appendix.txt" in par3cmdline's directory.

Are certain packets written out of the usual order?

For convenience, I write out one stripe's Recovery Data Packets at a time. A PAR3 file's volume number represents the number of recovery blocks per cohort instead of the total number of recovery blocks.

For example, suppose there are 200,000 input blocks and someone creates PAR3 files with 10% redundancy (20,000 recovery blocks). Because 16-bit Reed-Solomon Codes cannot handle that many blocks, my PAR3 client distributes them into 4 cohorts. There are then 50,000 input blocks per cohort, and the 16-bit Reed-Solomon Code creates 5,000 recovery blocks from the 50,000 input blocks in each cohort.

If it numbered the 20,000 recovery blocks in the usual order, it would make 15 recovery files.
something.vol00000+0001.par3
something.vol00001+0002.par3
something.vol00003+0004.par3
something.vol00007+0008.par3
something.vol00015+0016.par3
something.vol00031+0032.par3
something.vol00063+0064.par3
something.vol00127+0128.par3
something.vol00255+0256.par3
something.vol00511+0512.par3
something.vol01023+1024.par3
something.vol02047+2048.par3
something.vol04095+4096.par3
something.vol08191+8192.par3
something.vol16383+3617.par3

Because it numbers the 20,000 recovery blocks per cohort, it makes only 13 recovery files.
something.vol00000+0001.par3
something.vol00001+0002.par3
something.vol00003+0004.par3
something.vol00007+0008.par3
something.vol00015+0016.par3
something.vol00031+0032.par3
something.vol00063+0064.par3
something.vol00127+0128.par3
something.vol00255+0256.par3
something.vol00511+0512.par3
something.vol01023+1024.par3
something.vol02047+2048.par3
something.vol04095+0905.par3

This style gives smaller volume numbers with fewer PAR3 files. Even when there are many recovery blocks, the number of PAR3 files doesn't grow too much. This numbering also reflects the real recovering capability: while there are 20,000 recovery blocks in total, it can recover at most 5,000 lost input blocks per cohort.

What is the goal of cohorts?

As in the example above, the "cohort" approach supports an unlimited number of input blocks, beyond what 16-bit Reed-Solomon Codes allow. Because 32-bit Reed-Solomon Codes are too slow, I wanted a faster method to support many blocks. If there were another way (faster recovery codes such as LDPC) to handle many blocks, I wouldn't use cohorts for that Error Correction Code. At this time, I haven't found a reliable, good implementation of LDPC. Using cohorts with FFT-based Reed-Solomon Codes currently seems to be the best method for many blocks.

Interleaving is used to distribute errors evenly among the input blocks.

The English Wikipedia page on interleaving is hard for me to understand, but it seems to say that "burst errors" happen more often than "random errors". This is true in PAR3 usage: when a file is missing, the sequential input blocks in that file are lost, which is the same as a burst error. Interleaving (cohorts in PAR3) is good at handling such burst errors. Though it has a weak point (it cannot recover too many losses within a single cohort), the risk is small when there are very many blocks. That is why I didn't apply cohorts to Cauchy Reed-Solomon Codes, which are used for only a few blocks. If 32-bit or 64-bit Reed-Solomon Codes were faster than 16-bit Codes, I wouldn't use interleaving. To support very many blocks, like 2^32 or 2^64, we will need the cohort system.
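As a concrete illustration of the cohort arithmetic in the 200,000-block example above, here is a small sketch. Dividing by the 16-bit limit of 65535 blocks is an assumption made for the example; par3cmdline's exact rule may differ.

```c
/* Sketch: choose the number of cohorts so that each cohort stays within
   the 16-bit Reed-Solomon limit, then split blocks evenly. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    uint64_t input_blocks    = 200000;  /* example from above */
    uint64_t recovery_blocks = 20000;   /* 10% redundancy */
    uint64_t limit = 65535;             /* max total blocks per 16-bit RS code */

    /* ceiling divisions */
    uint64_t cohorts = (input_blocks + recovery_blocks + limit - 1) / limit;
    uint64_t input_per_cohort    = (input_blocks + cohorts - 1) / cohorts;
    uint64_t recovery_per_cohort = (recovery_blocks + cohorts - 1) / cohorts;

    printf("cohorts = %" PRIu64 "\n", cohorts);                            /* 4    */
    printf("input blocks per cohort = %" PRIu64 "\n", input_per_cohort);   /* 50000 */
    printf("recovery blocks per cohort = %" PRIu64 "\n", recovery_per_cohort); /* 5000 */
    return 0;
}
```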

@mdnahas
Member Author

mdnahas commented Nov 4, 2023

Hey. Sorry for the absence.

I created a linux/ directory. It has the code from the windows/ directory, with some changes. The big change is to use the portable %" PRIu64 " to print 64-bit numbers, instead of %I64u. I wrote Automake files (Makefile.am and configure.ac) for it. It also has #ifdefs to include Linux header files and to exclude any functions that call Windows system calls.
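For illustration only, here is the portable 64-bit print pattern in question as a trivial standalone example (not a file from the repo):

```c
/* Portable printing of 64-bit values via <inttypes.h>,
   instead of the MSVC-specific "%I64u". */
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t block_count = 12345678901234ULL;
    printf("block count = %" PRIu64 "\n", block_count);
    return 0;
}
```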

The code compiles in Linux, but doesn't create a binary because those functions are missing.

The code should compile on Windows. I was able to get it to compile with MinGW, which allows Windows source code to compile on Linux, and run it in WINE, the Windows emulator. I was able to create a par3 file and verify it! Yes!

WINE doesn't support AVX instructions, so I had to modify linux/configure.ac to disable AVX in Blake3. I don't know if I'll run into a similar problem with the Leopard library calling AVX instructions.

The functions that need to be implemented natively for Linux are:

  • check_directory
  • check_file
  • check_file_system_info
  • extra_search
  • generate_set_id
  • get_absolute_path
  • make_fat_permission_packet
  • make_unix_permission_packet
  • par_search
  • path_search
  • reset_file_system_info
  • restore_directory

I wasn't sure what the Windows system call _stat64 does. We may not have to implement all of these.

Some things volunteers can do:

  • verify the new code compiles and runs on Windows natively
  • implement the missing Linux functions
  • write some unit tests
  • do code reviews to search for bugs

I'll see what I can do. I also need to learn about the Leopard library, reply to Yutaka's messages, and update the specification.

@mdnahas
Member Author

mdnahas commented Nov 4, 2023

@Yutaka-Sawada Almost every C file starts with "_CRT_SECURE_NO_WARNINGS". I googled it and those warnings are associated with buffer overflows in scanf. Do we want to suppress those warnings? Especially in every file?

@Yutaka-Sawada
Collaborator

Almost every C file starts with "_CRT_SECURE_NO_WARNINGS".
Do we want to suppress those warnings? Especially in every file?

The definition is there to disable a warning from Microsoft Visual Studio, so you may move the line into the _WIN32 section. The warning is like this:

error C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.

Microsoft C suggests using the fopen_s function instead of fopen. But that function is not available with gcc, so I still use fopen and disable the warning.
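A minimal sketch of the guard being discussed, assuming the define should only apply to Windows builds (checking _MSC_VER instead would limit it strictly to Visual Studio):

```c
/* Sketch: only suppress the MSVC "secure CRT" warnings on Windows builds,
   so the define doesn't leak into other toolchains. */
#ifdef _WIN32
#ifndef _CRT_SECURE_NO_WARNINGS
#define _CRT_SECURE_NO_WARNINGS
#endif
#endif

#include <stdio.h>
```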

@mdnahas
Member Author

mdnahas commented Nov 6, 2023

Okay, but as one person put it, "using _CRT_SECURE_NO_WARNINGS is the equivalent of turning off the alarm for both "you left your bedroom door unlocked" and "you left a nuclear bomb in the kitchen"."

Can we limit usage of the #define? Or can we call fopen_s and write our own fopen_s for Linux? I did that with _filelengthi64().
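If we go the fopen_s route, a shim for non-Windows platforms might look roughly like this. It is a hypothetical sketch following the documented MSVC signature, not existing par3cmdline code:

```c
/* Hypothetical fopen_s shim for non-Windows platforms, mirroring the MSVC
   signature: errno_t fopen_s(FILE**, const char*, const char*). */
#ifndef _WIN32
#include <stdio.h>
#include <errno.h>

typedef int errno_t;  /* glibc does not define errno_t by default */

static errno_t fopen_s(FILE **pFile, const char *filename, const char *mode)
{
    if (pFile == NULL || filename == NULL || mode == NULL)
        return EINVAL;
    *pFile = fopen(filename, mode);
    return (*pFile == NULL) ? errno : 0;
}
#endif
```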

@Yutaka-Sawada
Collaborator

Can we limit usage of the #define?

The _CRT_SECURE_NO_WARNINGS define is used for Microsoft Visual Studio only. I edited the source code to put the #define inside the _WIN32 section. It silences the warning message; there is no difference in the application's behavior.

Or can we call fopen_s and write our own fopen_s for Linux?

There are pages about "Security Features in the CRT" and "Security-enhanced versions of CRT functions". Many functions are flagged with warnings, and there is no need to make Linux versions of all those secure functions. At this time, I use fopen, sprintf, _ctime64, strcpy, strcat, mbstowcs, and wcstombs in par3cmdline. But I don't know what the secure versions do internally; they seem to check buffer sizes.

@mdnahas
Member Author

mdnahas commented Nov 7, 2023

I created a pull request with my latest changes.
I'm going to help my friend with a rush job for a few months, so I'm not sure when I'm going to get to work on the code again. There are only like 3 functions that need Linux versions.

@Yutaka-Sawada
Collaborator

I found a possible problem with _stricmp and _strnicmp in the source code. Because Windows is case-insensitive, I often use _stricmp to compare paths. But Unix/Linux is usually case-sensitive, and I'm not sure how file/directory names are handled on Linux. Would it be good to make a special function like path_cmp? Then the Linux version can use strcmp, while the Windows version uses _stricmp. Python has a utility function like this.
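A sketch of what such a helper could look like (path_cmp itself is only a proposed name, not an existing function in the repo):

```c
/* Sketch of a hypothetical path_cmp helper: case-insensitive on Windows,
   case-sensitive on other platforms. */
#include <string.h>

static int path_cmp(const char *a, const char *b)
{
#ifdef _WIN32
    return _stricmp(a, b);  /* Windows file systems compare names case-insensitively */
#else
    return strcmp(a, b);    /* most Unix/Linux file systems are case-sensitive */
#endif
}
```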
