Skip to content
This repository has been archived by the owner on Apr 23, 2022. It is now read-only.

xdelta3.XDeltaError: Output of decoding delta longer than expected #2

Open
karamanolev opened this issue Dec 21, 2017 · 12 comments · May be fixed by #3
Open

xdelta3.XDeltaError: Output of decoding delta longer than expected #2

karamanolev opened this issue Dec 21, 2017 · 12 comments · May be fixed by #3

Comments

@karamanolev
Copy link

karamanolev commented Dec 21, 2017

I'm trying to use your library. Here's a simple example of how I'm trying it:

    b1 = open('b1.bin', 'rb').read()
    b2 = open('b2.bin', 'rb').read()

    d = xdelta3.encode(b1, b2)
    b3 = xdelta3.decode(b1, d)
    print(b2 == b3)

This gives me an exception: xdelta3.XDeltaError: Output of decoding delta longer than expected

I'm attaching a zip file with b1.bin and b2.bin. By the way, they're simple HTML files, I just made sure to treat them as binary so I don't trip over encodings. Also, when trying it in reverse (not b1 -> b2, but b2 -> b1) it all works well.

$ python --version
Python 3.6.1
Running on Mac OS, Python installed via pyenv. xdelta3==0.0.5.

files.zip

@samuelcolvin
Copy link
Owner

I'm getting the same error.

look here

output shouldn't be bigger than the original plus the delta, but give a little leeway

Basically I assume that the size of the output (b3 here) won't be bigger than len(b1) + len(d) plus some breaking space.

I guess that was wrong. Perhaps b1 has lots of repeated content which xdelta 3 is cleverly encoding. In short I think xdelta3 is performing better than I assumed it could.

I'm no expert C developer, but I would guess the solution is to just increase output_alloc to maybe input_size + source_size * 2.

I'll create a PR for this. Do you mind if I use the two files in files.zip for unit tests?

@samuelcolvin samuelcolvin linked a pull request Dec 21, 2017 that will close this issue
@samuelcolvin
Copy link
Owner

see #3 @karamanolev, would be great if you could try that branch and check it's working for you.

If ok, I'll merge and deploy an update.

@karamanolev
Copy link
Author

@samuelcolvin Feel free to use the files. Please mind that they were retrieved from a website that I don't own, but I don't think it's ever going to be an issue.

As for the output size, it can be very big compared to inputs. Take a look:

import xdelta3
b1 = b''
b2 = b't' * 10000
len(xdelta3.encode(b1, b2))  # This is 17

This means you can have an input of 0 bytes and 17 bytes respectively and generate an output of 10000 bytes (10k bytes). If you have 0 bytes and 1 gigabyte of repeated characters, the delta is 1923 bytes. There won't be a formula in the world that allows you to statically allocate a buffer you know is going to enough.

@samuelcolvin
Copy link
Owner

Ok, but in real work #3 is better than the current value.

Unless you want to submit a PR to dynamically allocate the output buffer size?

@karamanolev
Copy link
Author

I'll take a look. The right way to do it, I think, is to implement the streaming capabilities of xdelta3. A compromise with amortized linear complexity would be to start with a reasonably sized buffer (e.g. the old size) and then double it each time you end up with ENOSPC. That shouldn't be too hard to implement.

@samuelcolvin
Copy link
Owner

Agreed doubling would be easier. The problem is it will be slow(er) since the decode will have to retry until a suitable size is reached, what's worse this performance degradation will be silent.

I can see a situation along the lines of "Here's this sitation where xdelta three should be brilliant and save me loads of data but it's being a bit slow. Why is that?". But then I don't see a better solution.

Perhaps it would be worth asking on the main xdelta3 repo for advice.

@karamanolev
Copy link
Author

It shouldn't be slower by more than 2 times (amortized linear). And certainly better than throwing errors. As for the right way - the streaming interface + allocating buffer blocks as you go. https://github.com/jmacd/xdelta/blob/wiki/ProgrammingGuide.md it's documented here, xd3_stream/xd3_config. The implementation would be a bit more tricky though.

@samuelcolvin
Copy link
Owner

ok, makes sense. Could you create a PR for doubling buffer size?

@samuelcolvin
Copy link
Owner

ok, I've updated #3 to double output_alloc until the buffer is big enough.

@karamanolev does that look ok to you?

@karamanolev
Copy link
Author

@samuelcolvin Your PR looks OK to me. I've opened another, alternative one, which also happens to fix my other big issue with the library as is (the NoDelta exception is just surprising, in a bad way). I put argumentation in the PR, but here's some more:
If you get the NoDelta exception, you'll have to add application logic to handle it anyway (you can't just ignore it or something like this). If you're going to add app logic for that, you can just as well check for the delta length yourself.
Imagine you don't want to add logic and you just want to store the delta, even if it's a little larger, that's fine. That will likely be 0.001% of your cases and will probably not cause a big storage issue. In that case you're left to hang dry - you can't use the library, because it just throws an exception, no return value at all. This is what happened in my case.

@mwojnars
Copy link

Alternative solution: upon encoding, prepend in "delta" the size of the original string - that's a single integer. Upon decoding, read this value first, then allocate a buffer sized exactly as needed - you won't need reallocation nor compromise on speed.

@samuelcolvin
Copy link
Owner

Makes lots of sense, then default to the current value.

PR welcome.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants