[bug] Creating new parity files: "panic: too many shards" #7

Open
brenthuisman opened this issue Jan 14, 2021 · 9 comments


@brenthuisman
Contributor

For larger files (the threshold is somewhere between 9.2 MB and 77 MB) I consistently get this error when I try to create parity. Looking at memory usage, all files (one is 2.7 GB) seem to be loaded into memory in full. The error seems to come right after loading:

[1/1] Loaded data file "bsc.tar.zst" (578352090 bytes)
panic: too many data shards

goroutine 1 [running]:
main.main()
brenthuisman changed the title from "[bug] panic: too many shards" to "[bug] Creating new parity files: 'panic: too many shards'" on Jan 15, 2021
@akalin
Owner

akalin commented Jan 16, 2021

That error is when the number of data shards is >256, which is a par2 limitation. I suspect it has to do with the default block size not being intelligently picked, but fixed at 2000. Can you try with par2 c -s <n> ... where n is larger than 2000 (but a multiple of 4) and see if that fixes it?
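For a rough sense of the numbers, here's a quick sketch of the arithmetic using the file size from the log above (variable names are just for illustration):

	f_size = 578352090                             # bytes, from the "Loaded data file" line
	blocksize = 2000                               # the fixed default block size
	shards = (f_size + blocksize - 1) // blocksize
	print(shards)                                  # 289177 data shards, far over any par2 limit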

@brenthuisman
Contributor Author

It also happens when -s and -c are not specified. Maybe a better default would be nice?

I'm cracking my head a second time over how par2cmdline calculates blocksize and blockcount from a redundancy level (percentage of protection). I believe the blockcount is always 2000 if you specify a redundancy:

https://github.com/brenthuisman/libpar2/blob/master/src/commandline.cpp#L1099

On the other hand, there's a recoveryblockcount being calculated; is that the one I should set for -c?

https://github.com/brenthuisman/libpar2/blob/master/src/commandline.cpp#L1223

In summary, it boils down to this:

	for file in files:
		filesize = sizeof(file)
		sourceblockcount += (filesize + blocksize - 1) // blocksize
	recoveryblockcount = (sourceblockcount * redundancy + 50) // 100
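As a sketch with concrete numbers (taking the file from the panic above and assuming 10% redundancy for illustration):

	blocksize = 2000
	redundancy = 10                                                   # percent, assumed
	filesize = 578352090
	sourceblockcount = (filesize + blocksize - 1) // blocksize        # 289177
	recoveryblockcount = (sourceblockcount * redundancy + 50) // 100  # 28918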

The remaining question is how to get the blocksize. Assuming blockcount = 2000, I think we almost always end up in this block:

https://github.com/brenthuisman/libpar2/blob/master/src/commandline.cpp#L1151

Unfortunately I'm struggling to follow the code there.

Do you have any idea for a heuristic for -s and -c based on the number of files and filesize (I always use per-file parity, so the only variable is filesize)?

@brenthuisman
Contributor Author

brenthuisman commented Jan 16, 2021

This seems pretty robust and does what I think it should:

	def getblocksizecount(self, filename):  # requires "import os" at module level
		f_size = os.path.getsize(filename)
		blocksize_min = f_size // 2**15  # par2 allows at most 2**15 data shards, so blocksize can never be below this
		blocksize_f = (f_size * self.percentage) // 100  # desired total amount of recovery data
		blockcount_max = 2**7 - 1  # cap blockcount to keep overhead for small files under control
		if f_size < 1e6:
			blockcount_max = 2**3 - 1
		elif f_size < 4e6:
			blockcount_max = 2**4 - 1
		elif f_size < 20e6:
			blockcount_max = 2**5 - 1
		if blocksize_f > blocksize_min:
			try:
				blockcount = min(blockcount_max, blocksize_f // blocksize_min)
				blocksize = blocksize_f / blockcount
			except ZeroDivisionError:  # blocksize_min is 0 for files under 2**15 bytes
				blockcount = 1
				blocksize = blocksize_min
		else:
			blockcount = 1
			blocksize = 4
		blocksize = (blocksize // 4 + 1) * 4  # round up to a multiple of 4
		return int(blocksize), int(blockcount)
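For example, this is roughly how I'd call it and hand the result to gopar (hypothetical usage; self.percentage is the redundancy level):

	blocksize, blockcount = self.getblocksizecount("bsc.tar.zst")
	# then invoke gopar with these values: par2 c -s <blocksize> -c <blockcount> ...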

@akalin
Owner

akalin commented Jan 17, 2021

I'll keep this open as a reminder to calculate the parameters a bit more intelligently. The snippet you posted looks plausible; I assume you're gonna calculate that in your external app and pass that in.

(Also, I misspoke above, the shard limit for par2 is 65536, not 256 (which is par1).)

akalin reopened this Jan 17, 2021
@brenthuisman
Contributor Author

brenthuisman commented Jan 18, 2021

OK, good idea. Indeed, this is what I calculate and pass in. Made a small modification to handle very small files.

The shard limit I found in par2cmdline is 2**15 (~32k), not 65536. I tested this, and gopar also showed a threshold there. Hence the 2**15 in the snippet.
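As a quick sketch of what that limit implies for the file from the original report, the smallest usable block size (rounded up to a multiple of 4) would be:

	import math
	f_size = 578352090
	blocksize = math.ceil(f_size / 2**15)  # 17651
	blocksize += -blocksize % 4            # 17652, the next multiple of 4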

@akalin
Owner

akalin commented Jan 18, 2021

Ah, yes you're right! Forgot there was a smaller limit for data shards.

@brenthuisman
Contributor Author

A nicer place for the snippet would be in gopar's own flag handling, of course, but I didn't do that because I felt that having logic different from par2cmdline's for the -r flag could be confusing. On the other hand, maybe that's taking legacy compatibility a bit too far. What's your opinion on that?

@akalin
Owner

akalin commented Jan 18, 2021

Yeah, I don't think there's any real need to implement par2cmdline's computation exactly -- in fact, it seems pretty ad hoc, and if I think about it for a bit I can probably come up with a more systematic way.

The calculation above is only for single files, right? In general, par2 would have to handle multiple files, which might change things a bit.
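For illustration, one possible multi-file generalization (a hypothetical sketch along the lines of your per-file logic, not a worked-out proposal):

	import math, os

	def estimate_params(filenames, percentage, max_shards=2**15):
		# the shard limit applies to the whole recovery set, so derive the
		# block size from the total amount of data (assumption of this sketch)
		total = sum(os.path.getsize(f) for f in filenames)
		blocksize = max(4, math.ceil(total / max_shards))
		blocksize += -blocksize % 4  # round up to a multiple of 4
		sourceblockcount = sum(
			(os.path.getsize(f) + blocksize - 1) // blocksize for f in filenames)
		# per-file rounding can still push the count slightly over the limit
		# for many tiny files, so a real implementation would have to iterate
		recoveryblockcount = (sourceblockcount * percentage + 50) // 100
		return blocksize, recoveryblockcount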

@brenthuisman
Contributor Author

Correct, this is only for single files, and therefore not ready for inclusion. I think par2cmdline takes the largest file as the basis for a first blockcount estimate, but then there's a loop that converges on something; I'm not sure what, or what the goal is there.

I only work with single file parity (that's the whole idea of par2deep).
