Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process XOR/ROL optimisations #21

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Process XOR/ROL optimisations #21

wants to merge 2 commits into from

Conversation

arekbulski
Copy link
Member

@arekbulski arekbulski commented Apr 9, 2018

Identical code as in C# and Python runtimes, well almost identical.

I am unable to run tests so someone may need to fix type cases. Please assist.

Copy link

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really comment on the algorithmic correctness of the code, but I've got a few comments for you on the implementation itself.


while (new_len > 0 && src[new_len - 1] == pad_byte)
new_len--;

if (new_len == max_len)
return src;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this optimization have a measurable impact?

As I understand it, the src parameter is a copy of the value specified by the caller, and the value returned by this function is similarly a copy of the value specified in the return statement, so in this case, there are already two copies being performed. (If we're lucky, the compiler is optimizing away at least one of them.) Given that, I can't tell just by reading the code whether skipping the call to substr() actually saves anything. (That is, if the compiler has to copy the return value, then the overhead of using substr() to generate the value copied is probably "in the noise", but adding the optimization makes the code more complicated and introduces an extra (arguably unnecessary) branch.

In any case, you should consider passing the argument as a reference to a const value, which would remove one of the two copies. (Alternatively, if we had access to C++-11 semantics, you would want to consider using std::move() to create the return value.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand the problem. In about every other language, both static like C# and dynamically typed like Python Ruby, strings are passed around in O(1). Do I understand correctly that in C++ those strings are passed by full copy in O(N)? If so, then I dont know about the optimisations...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ offers all three options:

  • pass by const reference, where the callee receives a reference (what you're calling O(1)) to the caller's object (here a string), but the callee may not modify it;
  • pass by mutable reference, where the callee receives a reference to the caller's object, and is able to modify it;
  • pass by value, where the callee receives a "copy" of the caller's object -- "copy" is in quotes, because the value is produced using the (possibly default-generated) "copy constructor", which, really, could do anything, including creating a reference to the original object.

However, you have to specify the one that you want...and, the most straightforward syntax produces arguably the safest and simplest effect: pass by value.

Copy link
Member Author

@arekbulski arekbulski Apr 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its funny that with all that convoluted answers that you gave me, it makes me think that you actually dont know... which kind of doesnt come as a suprise, I think whoever said "nobodody really understands quantum mechanics" DEFINATELY meant C++!!

I am working on a benchmark, I will test it. Standby.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize for the confusion. Which passing mechanism is used is made clear by the function's signature in the declaration/definition.

Pass by "const reference":

std::string kaitai::kstream::bytes_strip_right(const std::string& src, char pad_byte);

The callee and caller share the storage for the object (here, src), but the callee is prevented (at compile-time) from modifying it. However, it looks, inside the function, like the object was declared in the local scope. (I.e., src is not a pointer, and it is accessed with "dot" notation, as src.length().)

Pass by "reference":

std::string kaitai::kstream::bytes_strip_right(std::string& src, char pad_byte);

This is the old-fashioned "pass by pointer" hidden under a bunch of syntactic sugar -- the callee and the caller share the storage for the object, the callee accesses it as though it were declared locally, and the callee has the ability to modify it. This works well for "output parameters" and "in-out parameters", but it is somewhat dangerous, as it's not always apparent to the reader that the caller's value can be modified.

Pass by "value":

std::string kaitai::kstream::bytes_strip_right(std::string src, char pad_byte);

This one does what it says: only the "value" of the object gets passed, and the callee receives a copy-constructed version of the object. By default, that is a full and separate copy of it, which means that the caller is protected from anything that the callee might do, and the callee has full freedom to (ab)use the object as it wishes (once again, the object is accessed as if it were a local declaration); however, there is a performance penalty associated with this if the object is large or complicated to copy.

So, the code as it stands is passing by value, which is arguably the safest, most conservative, least confusing, and probably most expensive approach.

However, that said, it is possible that the compiler can apply some optimizations transparently. The most obvious is for it to allocate the return value in the caller and use that storage to receive the callee's copy, which removes the need for one of the two copy operations. (There are other options, depending on what the caller does with the result and on which version of the C++ compiler is being used.)

@@ -417,6 +421,9 @@ std::string kaitai::kstream::bytes_terminate(std::string src, char term, bool in
if (include && new_len < max_len)
new_len++;

if (new_len == max_len)
return src;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto my previous comment.


return result;
}

std::string kaitai::kstream::process_xor_many(std::string data, std::string key) {
size_t len = data.length();
size_t kl = key.length();
if (len == 1)
return process_xor_one(data, key[0]);
Copy link

@webbnh webbnh Apr 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this optimization produce a measurable impact?

That is, is it really cheaper to execute a branch and a subroutine call than it is to execute the loop overhead for a single iteration? (I can't tell by reading the code.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All XOR and ROL optimisations are assuming that data is much larger than the key in xor, and group in rol. Therefore, the impl with cheaper main loop is considered better. Therefore, if data is passed in O(1) time, then yes.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the definition of this function stands above, the input data will be copied during the invocation.

Nevertheless, if the length of the data is 1, then it will result in only one iteration of the loop, and so the question is, which is cheaper -- executing a branch and invoking a (possibly in-lined) simple function, or setting up the loop and then exiting it after the first iteration. I can't tell.

ki++;
if (ki >= kl)
ki = 0;
result[i] = data[i] ^ key[i % kl];
Copy link

@webbnh webbnh Apr 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than allocating a single-byte string and overwriting the byte(s), I suggest allocating an empty string and using an append() operation to add byte(s) to the end of it. I don't know which is faster, but it would be worth an experiment.

Alternatively, since you know how long the input string is, you know how long the output string is, so you should be able to pre-allocate the output string. (I don't know that this will make any difference for small strings, as I suspect that std::string() pre-allocates a minimum size already, but it might be very helpful for large translations.)

Copy link

@webbnh webbnh Apr 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I have to ask: does the introduction of the modulo operator produce a measurable impact?

It's not obvious from reading that the modulo operation will perform better than the add-and-reset operation from the original code. (The answer will likely depend on the hardware that the code is run on -- division can be a very expensive operation, whereas conditional assignment can be very cheap.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came to understanding, although I have to admit that I dont remember where exactly I learned this, so dont ask me for reference or benchmark data, that modulo operation is as cheap as any other "atomic" operation.
You see, in Python which is my primary lang, there is literally no difference in perfomance between >> or | or + or * or / or %. That is because a simple operator call involves a lot of red tape, checking if both operands/objects are integers or floats, etc. The actual "C operation" is one of many instructions.
In lower level langs like C# which is my previous primary language, the difference is a bit more pronounced but not as much as people make you believe. At university they teach you that bitwise ops are cheaper than integer addition then multiplication etc. In reality as benchmarks show, the difference is so small, that you just dont care.
So assuming that this theory holds for C++ as well, single op (modulo) is better than several ops (addition, if, branch misprediction).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh its also consistent with the other runtimes implementations. :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real answer is, there are levels of "cheap". The fact of the matter is that, while any form of division is probably several times as expensive as a shift, logical-OR, addition, or multiplication, if the operands are not already in the processor memory cache, then the difference doesn't really matter.

Analogously, it is less expensive to execute a modulo operation than it is to load an instruction which is not in the instruction memory cache. So, there is an argument in favor of fewer instructions (regardless of what they are), under the notion that it will fit better into cache. Similarly, a mis-predicted branch is a performance pothole (relative to any single operation); however, many architectures offer predicated instructions (including assignment) or speculative execution, which remove the need for branches or cover for many branching problems, respectively.

So, the cost of a modulo is probably not better than the cost of the other approach; but the cost of fetching an operand (or instruction) from L2 cache is probably 10 times more than either one; and, the cost of fetching the operand from memory is probably 100 times more than that. I would recommend whatever is most easily understood by the maintainers, and the trust the compiler to make it "go".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we need a benchmark to know for sure. Please standby.

}

return result;
}

std::string kaitai::kstream::process_rotate_left(std::string data, int amount) {
// NOTE: perhaps it should be heap allocated?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you intend to deallocate it (or, unless you're going to try to allocate and initialize it lazily), I see no point in allocating it from the heap.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I dont intend to deallocate it. I am concerned with permanently using stack space. This adds 2KB to stack. What is a stack capacity nowadays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The answer on StackOverflow ranges from 16KB to 8MB. So perma adding 2KB of precomputed data to the stack is reasonable but just might be detrimental on some less common machines.
https://stackoverflow.com/a/1826072/2375119

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not allocated on the stack. There are three basic areas (more or less) from which memory can come: dynamic allocation, stack, and static storage. This one comes from the last. (It's similar to the area where the code resides, but it's writable.) So, it won't impose on the available stack space at all (which is one of the reasons for declaring it to be static).

For the record, stack sizes (and even locations) are generally configurable if you know where the knobs are. :-)

uint8_t bits = data[i];
result[i] = (bits << amount) | (bits >> (8 - amount));
if (groupSize == 1) {
// NOTE: perhaps its `amount * 256` in the index?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this note is trying to say. (Replace the pronoun, perhaps?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please nevermind. Forget you saw it.

}

return result;
if (len % groupSize != 0)
throw std::runtime_error("process_rotate_left: data length must be a multiple of group size");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than runtime_error, I think invalid_argument would probably be more appropriate.

throw std::runtime_error("process_rotate_left: data length must be a multiple of group size");

if (amount % 8 == 0) {
size_t[groupSize] indices;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look like legal C++ syntax, to me...does it compile? I would expect

        size_t indices[groupSize];

Ditto lines 520 & 521.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you meant the syntax was wrong, gosh... stupid me. Fixed!

}

{
int amount1 = amount % 8;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be tempted to write this as

       int amount1 = amount & 7;

}
}

std::string kaitai::kstream::process_rotate_left(std::string data, int amount, int groupSize = 1) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think my C++ compiler permits specifying default parameter values in the definition if they are already specified in the declaration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have absolutely no idea about this. C++ syntax is completely outside my area of expertise.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand...but, does the code as is compile?

@arekbulski
Copy link
Member Author

arekbulski commented Apr 10, 2018

All points made by @webbnh were resolved, either fixed or disregarded, aside of one: whether std::string is passed around as cheap O(1) or by full copy in O(N). As I am not proficient with C++, I wouldnt know. Oh two things actually: also whether optional arguments are declared properly.

Last commit amended.

@GreyCat
Copy link
Member

GreyCat commented Apr 10, 2018

I would be highly surprised that any precalculation like that would be faster for C++. Anything related to C++ generally needs rigorous testing, and especially these performance-related things. So, it's not even a question of running a benchmark, it needs a lot of benchmarks to be considered — i.e. different OSes / compilers combinations, x86 / x86_64 targets, etc. There's a huge potential of improvement, but

The suggested changes are not harmless, unfortunately. You allocate global array of 2KB, which would cause trouble with dynamically linked library, and it could cause trouble with super low-RAM embedded systems where we can't allow to eat extra 2KB of RAM and unconditionally allocating 2KB of static memory is a bad idea as well. There are also numerous coding style flaws, but that's relatively minor.

@arekbulski
Copy link
Member Author

arekbulski commented Apr 11, 2018

I would be highly surprised that any precalculation like that would be faster for C++.

In C# benchmarks, array lookup is definately faster than bitwise << | >> so it stands to reason that it would also be with C++. But I will get you a benchmark to confirm.

You allocate global array of 2KB, which would cause trouble with dynamically linked library, and it could cause trouble with super low-RAM embedded systems where we can't allow to eat extra 2KB of RAM

I think that is just balooney. It may be that someone working on embedded systems code would want to use Kaitai but that doesnt mean that we have to support it at the cost of making ridiculous restrictions. If they want to use Kaitai in embedded code, they can fork it and strip it of any non-used functionality. We just dont need to go to that much lenghts. Most processes on the desktop you are using have several MB of working space. 2KB is like next to nothing.

If you demand, I can change the code to generate that translation table on usage but it will lower benchmarks where the data is short, <256 bytes. Its not worth it.

@GreyCat
Copy link
Member

GreyCat commented Apr 11, 2018

In C# benchmarks, array lookup is definately faster than bitwise << | >> so it stands to reason that it would also be with C++. But I will get you a benchmark to confirm.

Benchmarking VM-powered languages like C# is hard, microbenchmarking such intricate matters as a single inlinable method is super hard. It's certainly not as simple as running a single executable with a Stopwatch. At the very least, it should be using something like https://github.com/dotnet/BenchmarkDotNet, ensure that we've past the warmup phase, that the code got compiled, etc.

@arekbulski
Copy link
Member Author

arekbulski commented Apr 18, 2018

Benchmark for C++, results identical to C# with respect to "which of the two is slower".

It shows (again) that array lookup is faster than several bitwise shifts/ors but its not faster than a simple xor. It also shows the same that modulo is slower than check-and-wrap.
But... it also discovered that passing string is by copy and reference is equally cheap!

#include <iostream>
#include <string>
#include <boost/timer/timer.hpp>
using namespace std;
using namespace boost::timer;

string pass_copy(string data) 
{
    data[0] = 1;
    return data;
}

string & pass_reference(string & data) 
{
    data[0] = 1;
    return data;
}

string loop_xor(string data) 
{
    size_t len = data.length();
    string result(len, ' ');
    uint8_t key = 1;

    for (size_t i = 0; i < len; i++) {
        result[i] = data[i] ^ key;
    }

    return result;
}

string loop_shifts(string data) 
{
    size_t len = data.length();
    string result(len, ' ');
    int amount = 1;
    int antiamount = 7;

    for (size_t i = 0; i < len; i++) {
        uint8_t x = data[i];
        result[i] = (x << amount)|(x >> antiamount);
    }

    return result;
}

string loop_lookup(string data) 
{
    size_t len = data.length();
    string result(len, ' ');
    uint8_t translate[256];

    for (size_t i = 0; i < len; i++) {
        result[i] = translate[data[i]];
    }

    return result;
}

string loop_modulo(string data) 
{
    size_t len = data.length();
    string result(len, ' ');
    size_t keylen = 10;
    uint8_t key[10];

    for (size_t i = 0; i < len; i++) {
        result[i] = data[i] ^ key[i % keylen];
    }

    return result;
}

string loop_checkwrap(string data) 
{
    size_t len = data.length();
    string result(len, ' ');
    size_t keylen = 10;
    uint8_t key[10];

    size_t k = 0;
    for (size_t i = 0; i < len; i++) {
        result[i] = data[i] ^ key[k];
        k++;
        if (k == keylen)
            k = 0;
    }

    return result;
}

int main() 
{
    string data(1024*1024*1024, ' ');

    {
        cout << "Pass data by copy (current impl)" << endl;
        auto_cpu_timer timer;

        string result = pass_copy(data);
    }
    {
        cout << "Pass data by reference" << endl;
        auto_cpu_timer timer;

        string result = pass_reference(data);
    }
    {
        cout << "Loop with ^ key" << endl;
        auto_cpu_timer timer;

        string result = loop_xor(data);
    }
    {
        cout << "Loop with << >> |" << endl;
        auto_cpu_timer timer;

        string result = loop_shifts(data);
    }
    {
        cout << "Loop with translate[...]" << endl;
        auto_cpu_timer timer;

        string result = loop_lookup(data);
    }
    {
        cout << "Loop with data[i %% keylen]" << endl;
        auto_cpu_timer timer;

        string result = loop_modulo(data);
    }
    {
        cout << "Loop with data[k]; if(k==keylen)k=0" << endl;
        auto_cpu_timer timer;

        string result = loop_checkwrap(data);
    }
}
$ g++ --version
g++ (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0
$ g++ test.cpp -lboost_timer
$ ./a.out 
Pass data by copy (current impl)
 0.731110s wall, 0.350000s user + 0.380000s system = 0.730000s CPU (99.8%)
Pass data by reference
 0.731649s wall, 0.310000s user + 0.420000s system = 0.730000s CPU (99.8%)
Loop with ^ key
 5.903469s wall, 5.060000s user + 0.830000s system = 5.890000s CPU (99.8%)
Loop with << >> |
 7.911201s wall, 7.080000s user + 0.830000s system = 7.910000s CPU (100.0%)
Loop with translate[...]
 6.368772s wall, 5.550000s user + 0.810000s system = 6.360000s CPU (99.9%)
Loop with data[i %% keylen]
 16.802076s wall, 15.970000s user + 0.790000s system = 16.760000s CPU (99.7%)
Loop with data[k]; if(k==keylen)k=0
 6.545717s wall, 5.700000s user + 0.830000s system = 6.530000s CPU (99.8%)

@arekbulski arekbulski force-pushed the patch-xor-rol branch 2 times, most recently from 03a9c25 to f378a58 Compare April 18, 2018 13:51
@arekbulski
Copy link
Member Author

Code was updated according to benchmarks, and is therefore READY TO MERGE, although it would be prudent for someone to check if the runtime compiles properly with some KSY-compiled cpp, just to be sure.

@webbnh
Copy link

webbnh commented Apr 18, 2018

I'm not sure the code is ready for merge: I started looking at it and I think I found an error. (I'll do a more detailed review and post comments.)

Also, I'm not convinced that your program actually demonstrates that passing by copy and reference is equally cheap: since your code doesn't do anything with the results of the subroutines, I suspect that the compiler is within its rights to optimize away the whole call (that is, the definitions are in-scope, and the implementations have no side-effects, so the compiler could be able to tell that it doesn't matter whether the functions are actually called or not).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants