Process XOR/ROL optimisations #21
base: master
Conversation
I can't really comment on the algorithmic correctness of the code, but I've got a few comments for you on the implementation itself.
    while (new_len > 0 && src[new_len - 1] == pad_byte)
        new_len--;

    if (new_len == max_len)
        return src;
Does this optimization have a measurable impact?
As I understand it, the src parameter is a copy of the value specified by the caller, and the value returned by this function is similarly a copy of the value specified in the return statement, so in this case, there are already two copies being performed. (If we're lucky, the compiler is optimizing away at least one of them.) Given that, I can't tell just by reading the code whether skipping the call to substr() actually saves anything. (That is, if the compiler has to copy the return value, then the overhead of using substr() to generate the value copied is probably "in the noise", but adding the optimization makes the code more complicated and introduces an extra (arguably unnecessary) branch.)
In any case, you should consider passing the argument as a reference to a const value, which would remove one of the two copies. (Alternatively, if we had access to C++11 semantics, you would want to consider using std::move() to create the return value.)
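For illustration, here is a minimal sketch of the const-reference variant; the body is reconstructed from the diff context above, so treat it as an assumption rather than the actual patch:

    std::string kaitai::kstream::bytes_strip_right(const std::string& src, char pad_byte) {
        size_t new_len = src.length();

        // Trim trailing pad bytes, as in the loop shown above.
        while (new_len > 0 && src[new_len - 1] == pad_byte)
            new_len--;

        // substr() produces the single (return-value) copy; the parameter itself is never copied.
        return src.substr(0, new_len);
    }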
I think I understand the problem. In about every other language, both statically typed like C# and dynamically typed like Python or Ruby, strings are passed around in O(1). Do I understand correctly that in C++ those strings are passed by full copy in O(N)? If so, then I don't know about the optimisations...
C++ offers all three options:
- pass by const reference, where the callee receives a reference (what you're calling O(1)) to the caller's object (here a string), but the callee may not modify it;
- pass by mutable reference, where the callee receives a reference to the caller's object, and is able to modify it;
- pass by value, where the callee receives a "copy" of the caller's object -- "copy" is in quotes, because the value is produced using the (possibly default-generated) "copy constructor", which, really, could do anything, including creating a reference to the original object.

However, you have to specify the one that you want... and the most straightforward syntax produces arguably the safest and simplest effect: pass by value.
It's funny that, with all those convoluted answers you gave me, it makes me think that you actually don't know... which kind of doesn't come as a surprise; I think whoever said "nobody really understands quantum mechanics" DEFINITELY meant C++!!
I am working on a benchmark, I will test it. Stand by.
I apologize for the confusion. Which passing mechanism is used is made clear by the function's signature in the declaration/definition.
Pass by "const reference":
std::string kaitai::kstream::bytes_strip_right(const std::string& src, char pad_byte);
The callee and caller share the storage for the object (here, src), but the callee is prevented (at compile-time) from modifying it. However, it looks, inside the function, like the object was declared in the local scope. (I.e., src is not a pointer, and it is accessed with "dot" notation, as src.length().)
Pass by "reference":
std::string kaitai::kstream::bytes_strip_right(std::string& src, char pad_byte);
This is the old-fashioned "pass by pointer" hidden under a bunch of syntactic sugar -- the callee and the caller share the storage for the object, the callee accesses it as though it were declared locally, and the callee has the ability to modify it. This works well for "output parameters" and "in-out parameters", but it is somewhat dangerous, as it's not always apparent to the reader that the caller's value can be modified.
Pass by "value":
std::string kaitai::kstream::bytes_strip_right(std::string src, char pad_byte);
This one does what it says: only the "value" of the object gets passed, and the callee receives a copy-constructed version of the object. By default, that is a full and separate copy of it, which means that the caller is protected from anything that the callee might do, and the callee has full freedom to (ab)use the object as it wishes (once again, the object is accessed as if it were a local declaration); however, there is a performance penalty associated with this if the object is large or complicated to copy.
So, the code as it stands is passing by value, which is arguably the safest, most conservative, least confusing, and probably most expensive approach.
However, that said, it is possible that the compiler can apply some optimizations transparently. The most obvious is for it to allocate the return value in the caller and use that storage to receive the callee's copy, which removes the need for one of the two copy operations. (There are other options, depending on what the caller does with the result and on which version of the C++ compiler is being used.)
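One way to see what the compiler actually does is to count copies with a small instrumented type; this is a standalone sketch (the Tracer struct is made up for the experiment, not part of the runtime):

    #include <iostream>

    struct Tracer {
        Tracer() {}
        Tracer(const Tracer&) { std::cout << "copy" << std::endl; }
    };

    Tracer by_value(Tracer t) { return t; }            // copy on the way in, and usually another on the way out
    Tracer by_const_ref(const Tracer& t) { return t; } // only the return value is copied

    int main() {
        Tracer a;
        std::cout << "by_value:" << std::endl;
        Tracer b = by_value(a);
        std::cout << "by_const_ref:" << std::endl;
        Tracer c = by_const_ref(a);
        (void)b; (void)c;
    }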
@@ -417,6 +421,9 @@ std::string kaitai::kstream::bytes_terminate(std::string src, char term, bool include)

    if (include && new_len < max_len)
        new_len++;

    if (new_len == max_len)
        return src;
Ditto my previous comment.
kaitai/kaitaistream.cpp (Outdated)

        return result;
    }

    std::string kaitai::kstream::process_xor_many(std::string data, std::string key) {
        size_t len = data.length();
        size_t kl = key.length();
        if (len == 1)
            return process_xor_one(data, key[0]);
Does this optimization produce a measurable impact?
That is, is it really cheaper to execute a branch and a subroutine call than it is to execute the loop overhead for a single iteration? (I can't tell by reading the code.)
All XOR and ROL optimisations assume that the data is much larger than the key (in xor) or the group (in rol). Therefore, the implementation with the cheaper main loop is considered better. So, if data is passed in O(1) time, then yes.
As the definition of this function stands above, the input data will be copied during the invocation.
Nevertheless, if the length of the data is 1, then it will result in only one iteration of the loop, and so the question is, which is cheaper -- executing a branch and invoking a (possibly in-lined) simple function, or setting up the loop and then exiting it after the first iteration. I can't tell.
kaitai/kaitaistream.cpp (Outdated)

        ki++;
        if (ki >= kl)
            ki = 0;
        result[i] = data[i] ^ key[i % kl];
Rather than allocating a single-byte string and overwriting the byte(s), I suggest allocating an empty string and using an append() operation to add byte(s) to the end of it. I don't know which is faster, but it would be worth an experiment.
Alternatively, since you know how long the input string is, you know how long the output string is, so you should be able to pre-allocate the output string. (I don't know that this will make any difference for small strings, as I suspect that std::string() pre-allocates a minimum size already, but it might be very helpful for large translations.)
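As a sketch of the pre-allocation idea applied to the XOR loop above (illustrative only, not the actual patch):

    std::string result;
    result.reserve(len);                 // one allocation up front, no byte-filling
    for (size_t i = 0; i < len; i++)
        result.push_back(data[i] ^ key[i % kl]);
    return result;

reserve() allocates capacity without writing the bytes, while std::string result(len, ' ') allocates and fills them before the loop overwrites each one.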
Again, I have to ask: does the introduction of the modulo operator produce a measurable impact?
It's not obvious from reading that the modulo operation will perform better than the add-and-reset operation from the original code. (The answer will likely depend on the hardware that the code is run on -- division can be a very expensive operation, whereas conditional assignment can be very cheap.)
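For reference, the two index-advance styles under discussion look roughly like this (fragment based on the diff above):

    // Original: add-and-reset -- a compare and a conditional assignment per byte.
    size_t ki = 0;
    for (size_t i = 0; i < len; i++) {
        result[i] = data[i] ^ key[ki];
        ki++;
        if (ki >= kl)
            ki = 0;
    }

    // Proposed: modulo index -- usually an integer division per byte,
    // unless the compiler can prove kl is a constant or a power of two.
    for (size_t i = 0; i < len; i++)
        result[i] = data[i] ^ key[i % kl];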
I came to understand (although I have to admit that I don't remember where exactly I learned this, so don't ask me for a reference or benchmark data) that the modulo operation is as cheap as any other "atomic" operation.
You see, in Python, which is my primary language, there is literally no difference in performance between >> or | or + or * or / or %. That is because a simple operator call involves a lot of red tape, checking whether both operands/objects are integers or floats, etc. The actual "C operation" is one of many instructions.
In lower-level languages like C#, which is my previous primary language, the difference is a bit more pronounced, but not as much as people make you believe. At university they teach you that bitwise ops are cheaper than integer addition, then multiplication, etc. In reality, as benchmarks show, the difference is so small that you just don't care.
So, assuming that this theory holds for C++ as well, a single op (modulo) is better than several ops (addition, if, branch misprediction).
Oh, it's also consistent with the other runtimes' implementations. :)
The real answer is, there are levels of "cheap". The fact of the matter is that, while any form of division is probably several times as expensive as a shift, logical-OR, addition, or multiplication, if the operands are not already in the processor memory cache, then the difference doesn't really matter.
Analogously, it is less expensive to execute a modulo operation than it is to load an instruction which is not in the instruction memory cache. So, there is an argument in favor of fewer instructions (regardless of what they are), under the notion that it will fit better into cache. Similarly, a mis-predicted branch is a performance pothole (relative to any single operation); however, many architectures offer predicated instructions (including assignment) or speculative execution, which remove the need for branches or cover for many branching problems, respectively.
So, the cost of a modulo is probably not better than the cost of the other approach; but the cost of fetching an operand (or instruction) from L2 cache is probably 10 times more than either one; and the cost of fetching the operand from memory is probably 100 times more than that. I would recommend whatever is most easily understood by the maintainers, and then trust the compiler to make it "go".
I agree that we need a benchmark to know for sure. Please stand by.
kaitai/kaitaistream.cpp (Outdated)

        }

        return result;
    }

    std::string kaitai::kstream::process_rotate_left(std::string data, int amount) {
        // NOTE: perhaps it should be heap allocated?
Unless you intend to deallocate it (or, unless you're going to try to allocate and initialize it lazily), I see no point in allocating it from the heap.
No, I don't intend to deallocate it. I am concerned with permanently using stack space. This adds 2KB to the stack. What is the stack capacity nowadays?
The answers on StackOverflow range from 16KB to 8MB. So permanently adding 2KB of precomputed data to the stack is reasonable, but it just might be detrimental on some less common machines.
https://stackoverflow.com/a/1826072/2375119
It's not allocated on the stack. There are three basic areas (more or less) from which memory can come: dynamic allocation, stack, and static storage. This one comes from the last. (It's similar to the area where the code resides, but it's writable.) So, it won't impose on the available stack space at all (which is one of the reasons for declaring it to be static).
For the record, stack sizes (and even locations) are generally configurable if you know where the knobs are. :-)
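To make the distinction concrete, here is a sketch of the three storage options for a translation table (the 8x256 shape is only an assumption based on the 2KB figure mentioned above):

    #include <stdint.h>

    // Static storage: lives alongside the program's data segment for its whole
    // lifetime; it does not consume stack space.
    static uint8_t translate_static[8][256];

    void demo() {
        // Automatic storage: occupies 2KB of the current stack frame,
        // but only while demo() is running.
        uint8_t translate_stack[8][256];

        // Dynamic storage: allocated from the heap and must be freed explicitly.
        uint8_t* translate_heap = new uint8_t[8 * 256];
        delete[] translate_heap;

        (void)translate_stack;
        (void)translate_static;
    }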
kaitai/kaitaistream.cpp (Outdated)

        uint8_t bits = data[i];
        result[i] = (bits << amount) | (bits >> (8 - amount));
        if (groupSize == 1) {
            // NOTE: perhaps its `amount * 256` in the index?
I don't understand what this note is trying to say. (Replace the pronoun, perhaps?)
Please never mind. Forget you saw it.
    }

    return result;
    if (len % groupSize != 0)
        throw std::runtime_error("process_rotate_left: data length must be a multiple of group size");
Rather than runtime_error, I think invalid_argument would probably be more appropriate.
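I.e., something along these lines (assuming <stdexcept> is already included):

    if (len % groupSize != 0)
        throw std::invalid_argument("process_rotate_left: data length must be a multiple of group size");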
kaitai/kaitaistream.cpp (Outdated)

        throw std::runtime_error("process_rotate_left: data length must be a multiple of group size");

    if (amount % 8 == 0) {
        size_t[groupSize] indices;
This doesn't look like legal C++ syntax, to me...does it compile? I would expect
size_t indices[groupSize];
Ditto lines 520 & 521.
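Note also that size_t indices[groupSize]; is a variable-length array if groupSize is only known at run time, which is a compiler extension rather than standard C++; a portable sketch, under the assumption that groupSize is a runtime value, would be:

    #include <vector>

    std::vector<size_t> indices(groupSize); // runtime-sized, zero-initialized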
Ah you meant the syntax was wrong, gosh... stupid me. Fixed!
    }

    {
        int amount1 = amount % 8;
I would be tempted to write this as
int amount1 = amount & 7;
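The two agree for non-negative amounts (for negative amounts they can differ); a small sketch of that caveat:

    #include <cassert>

    int amount1 = amount & 7;                      // same result as amount % 8 when amount >= 0
    assert(amount >= 0 && amount1 == amount % 8);  // illustrative check, assuming a non-negative rotation amount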
        }
    }

    std::string kaitai::kstream::process_rotate_left(std::string data, int amount, int groupSize = 1) {
I don't think my C++ compiler permits specifying default parameter values in the definition if they are already specified in the declaration.
I have absolutely no idea about this. C++ syntax is completely outside my area of expertise.
I understand...but, does the code as is compile?
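For reference, the usual arrangement is to specify the default only once, in the declaration, and omit it from the out-of-line definition; a sketch assuming a matching declaration in the header:

    // In the header (declaration): the default value goes here.
    std::string process_rotate_left(std::string data, int amount, int groupSize = 1);

    // In kaitaistream.cpp (definition): no default value is repeated.
    std::string kaitai::kstream::process_rotate_left(std::string data, int amount, int groupSize) {
        // ...
    }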
Force-pushed from 0fca875 to 2c8e817
All points made by @webbnh were resolved, either fixed or disregarded, aside from one: whether std::string is passed around as cheap O(1) or by full copy in O(N). As I am not proficient with C++, I wouldn't know. Oh, two things actually: also whether the optional arguments are declared properly. Last commit amended.
Force-pushed from 2c8e817 to 109e2bc
I would be highly surprised if any precalculation like that were faster for C++. Anything related to C++ generally needs rigorous testing, especially these performance-related things. So, it's not even a question of running a benchmark; it needs a lot of benchmarks to be considered, i.e. different OS / compiler combinations, x86 / x86_64 targets, etc. There's a huge potential for improvement, but the suggested changes are not harmless, unfortunately. You allocate a global array of 2KB, which would cause trouble with a dynamically linked library, and it could cause trouble with super low-RAM embedded systems where we can't afford to eat an extra 2KB of RAM; unconditionally allocating 2KB of static memory is a bad idea as well. There are also numerous coding style flaws, but that's relatively minor.
In C# benchmarks, array lookup is definitely faster than bitwise shifts/ors.
I think that is just baloney. It may be that someone working on embedded systems code would want to use Kaitai, but that doesn't mean we have to support it at the cost of making ridiculous restrictions. If they want to use Kaitai in embedded code, they can fork it and strip it of any unused functionality. We just don't need to go to such lengths. Most processes on the desktop you are using have several MB of working space; 2KB is next to nothing. If you demand, I can change the code to generate that translation table on usage, but it will lower benchmarks where the data is short, <256 bytes. It's not worth it.
Benchmarking VM-powered languages like C# is hard; microbenchmarking such intricate matters as a single inlinable method is super hard. It's certainly not as simple as running a single executable with a Stopwatch. At the very least, it should be using something like https://github.com/dotnet/BenchmarkDotNet, ensuring that we're past the warmup phase, that the code got compiled, etc.
Benchmark for C++; results identical to C# with respect to "which of the two is slower". It shows (again) that array lookup is faster than several bitwise shifts/ors, but it's not faster than a simple xor. It also shows, again, that modulo is slower than check-and-wrap.

    #include <iostream>
    #include <string>
    #include <stdint.h>            // for uint8_t
    #include <boost/timer/timer.hpp>

    using namespace std;
    using namespace boost::timer;

    string pass_copy(string data)
    {
        data[0] = 1;
        return data;
    }

    string & pass_reference(string & data)
    {
        data[0] = 1;
        return data;
    }

    string loop_xor(string data)
    {
        size_t len = data.length();
        string result(len, ' ');
        uint8_t key = 1;
        for (size_t i = 0; i < len; i++) {
            result[i] = data[i] ^ key;
        }
        return result;
    }

    string loop_shifts(string data)
    {
        size_t len = data.length();
        string result(len, ' ');
        int amount = 1;
        int antiamount = 7;
        for (size_t i = 0; i < len; i++) {
            uint8_t x = data[i];
            result[i] = (x << amount) | (x >> antiamount);
        }
        return result;
    }

    string loop_lookup(string data)
    {
        size_t len = data.length();
        string result(len, ' ');
        uint8_t translate[256] = {0};                    // initialized so the reads are well-defined
        for (size_t i = 0; i < len; i++) {
            result[i] = translate[(uint8_t)data[i]];     // cast avoids a negative index for chars >= 0x80
        }
        return result;
    }

    string loop_modulo(string data)
    {
        size_t len = data.length();
        string result(len, ' ');
        size_t keylen = 10;
        uint8_t key[10] = {0};
        for (size_t i = 0; i < len; i++) {
            result[i] = data[i] ^ key[i % keylen];
        }
        return result;
    }

    string loop_checkwrap(string data)
    {
        size_t len = data.length();
        string result(len, ' ');
        size_t keylen = 10;
        uint8_t key[10] = {0};
        size_t k = 0;
        for (size_t i = 0; i < len; i++) {
            result[i] = data[i] ^ key[k];
            k++;
            if (k == keylen)
                k = 0;
        }
        return result;
    }

    int main()
    {
        string data(1024*1024*1024, ' ');
        {
            cout << "Pass data by copy (current impl)" << endl;
            auto_cpu_timer timer;
            string result = pass_copy(data);
        }
        {
            cout << "Pass data by reference" << endl;
            auto_cpu_timer timer;
            string result = pass_reference(data);
        }
        {
            cout << "Loop with ^ key" << endl;
            auto_cpu_timer timer;
            string result = loop_xor(data);
        }
        {
            cout << "Loop with << >> |" << endl;
            auto_cpu_timer timer;
            string result = loop_shifts(data);
        }
        {
            cout << "Loop with translate[...]" << endl;
            auto_cpu_timer timer;
            string result = loop_lookup(data);
        }
        {
            cout << "Loop with data[i % keylen]" << endl;
            auto_cpu_timer timer;
            string result = loop_modulo(data);
        }
        {
            cout << "Loop with data[k]; if(k==keylen)k=0" << endl;
            auto_cpu_timer timer;
            string result = loop_checkwrap(data);
        }
    }
Force-pushed from 03a9c25 to f378a58
Code was updated according to benchmarks, and is therefore READY TO MERGE, although it would be prudent for someone to check whether the runtime compiles properly with some KSY-compiled cpp, just to be sure.
Force-pushed from f378a58 to bc8baa5
I'm not sure the code is ready for merge: I started looking at it and I think I found an error. (I'll do a more detailed review and post comments.) Also, I'm not convinced that your program actually demonstrates that passing by copy and passing by reference are equally cheap: since your code doesn't do anything with the results of the subroutines, I suspect that the compiler is within its rights to optimize away the whole call (that is, the definitions are in scope, and the implementations have no side effects, so the compiler could be able to tell that it doesn't matter whether the functions are actually called or not).
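One way to guard against that in the benchmark above (a sketch, not something I have run) is to fold every result into a value the program actually prints, for example:

    size_t sink = 0;
    {
        cout << "Pass data by copy (current impl)" << endl;
        auto_cpu_timer timer;
        string result = pass_copy(data);
        sink += result.length() + (unsigned char)result[0]; // forces the result to be used
    }
    // ...same for the other cases...
    cout << "checksum: " << sink << endl; // printing the sink keeps the calls observable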
Force-pushed from bc8baa5 to 6fccdb8
Identical code as in the C# and Python runtimes, well, almost identical.
I am unable to run the tests, so someone may need to fix the type casts. Please assist.