Skip to content
This repository has been archived by the owner on Apr 13, 2023. It is now read-only.

make ceylon.language::sum fast #5606

Closed
CeylonMigrationBot opened this issue Aug 13, 2015 · 22 comments
Closed

make ceylon.language::sum fast #5606

CeylonMigrationBot opened this issue Aug 13, 2015 · 22 comments

Comments

@CeylonMigrationBot
Copy link

[@jvasileff] This implementation avoids using a boxed type for the sum accumulator variable when summing Integers and Floats.

If you want to merge this, I can also make the change for product.

Benchmark Script

shared void sumTest() {
    for (size in [1, 2, 4, 8, 32, 128, 256, 1024]) {
        value ints = (1..size).collect(identity);
        value floats = (1..size).collect(Integer.float);
        value repeat = 2^27 / size;

        benchmark {
            scale = repeat * size  / 10;
            warmupIter = 3;
            benchIter = 5;
            ["lang::sum(ints); size=``size``:", () => (0:repeat).each((_) => sum(ints))],
            ["math::sum(ints); size=``size``:", () => (0:repeat).each((_) => mathSum(ints))],
            ["optimized(ints); size=``size``:", () => (0:repeat).each((_) => sumNew4(ints))]
        };

        benchmark {
            scale = repeat*size  / 10;
            warmupIter = 3;
            benchIter = 5;
            ["lang::sum(floats); size=``size``:", () => (0:repeat).each((_) => sum(floats))],
            ["math::sum(floats); size=``size``:", () => (0:repeat).each((_) => mathSumF(floats))],
            ["optimized(floats); size=``size``:", () => (0:repeat).each((_) => sumNew4(floats))]
        };
    }
}

Integer Results

Summary min/max/avg/rstddev/pct
math::sum(ints); size=1: 116/124/118/3% (100%)
optimized(ints); size=1: 120/127/123/2% (103%)
lang::sum(ints); size=1: 139/141/140/0% (119%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=2: 65/69/66/1% (100%)
lang::sum(ints); size=2: 69/71/70/1% (105%)
math::sum(ints); size=2: 73/76/74/1% (112%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=4: 47/47/47/0% (100%)
math::sum(ints); size=4: 52/53/52/1% (109%)
lang::sum(ints); size=4: 57/59/58/1% (121%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=8: 37/38/37/1% (100%)
math::sum(ints); size=8: 41/42/41/0% (111%)
lang::sum(ints); size=8: 53/55/54/1% (143%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=32: 32/32/32/0% (100%)
math::sum(ints); size=32: 33/37/34/4% (103%)
lang::sum(ints); size=32: 55/56/56/0% (172%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=128: 29/33/30/5% (100%)
math::sum(ints); size=128: 30/31/30/0% (105%)
lang::sum(ints); size=128: 50/52/51/1% (173%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=256: 29/33/31/4% (99%)
math::sum(ints); size=256: 30/35/32/6% (102%)
lang::sum(ints); size=256: 51/57/55/4% (173%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=1024: 29/30/29/2% (100%)
math::sum(ints); size=1024: 29/30/30/0% (102%)
lang::sum(ints); size=1024: 51/52/51/1% (175%)

Float Results

Summary min/max/avg/rstddev/pct
lang::sum(floats); size=1: 92/97/95/1% (100%)
optimized(floats); size=1: 126/130/128/1% (135%)
math::sum(floats); size=1: 149/154/152/1% (160%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=2: 68/70/69/1% (100%)
lang::sum(floats); size=2: 68/72/70/2% (100%)
math::sum(floats); size=2: 92/94/93/0% (134%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=4: 50/52/51/2% (100%)
lang::sum(floats); size=4: 59/63/61/3% (117%)
math::sum(floats); size=4: 63/66/65/1% (127%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=8: 39/40/40/0% (100%)
math::sum(floats); size=8: 48/50/49/1% (122%)
lang::sum(floats); size=8: 55/57/56/1% (138%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=32: 32/33/33/1% (100%)
math::sum(floats); size=32: 37/37/37/0% (113%)
lang::sum(floats); size=32: 57/59/58/1% (174%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=128: 29/30/29/1% (100%)
math::sum(floats); size=128: 33/34/34/1% (113%)
lang::sum(floats); size=128: 53/55/54/1% (184%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=256: 28/30/29/2% (100%)
math::sum(floats); size=256: 32/33/33/1% (115%)
lang::sum(floats); size=256: 53/54/53/0% (188%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=1024: 28/29/28/1% (100%)
math::sum(floats); size=1024: 32/34/32/2% (115%)
lang::sum(floats); size=1024: 53/54/54/0% (190%)

[Migrated from ceylon/ceylon.language#728]
[Closed at 2015-10-01 15:46:20]

@CeylonMigrationBot
Copy link
Author

[@quintesse] Hmm I should test how my experimental boxing/unboxing code handles this. Because although it's a clever trick what you are doing here it depends completely on implementation details that could change at any time.

@CeylonMigrationBot
Copy link
Author

[@jvasileff] I beg to differ.

  1. sum cannot possibly be fast without testing for Integer and Float; Value on its own will never be a primitive.
  2. "Trick 1" is to test first rather than the Iterable or Iterator. This is solely to avoid the huge isReified cost
  3. "Trick 2" is to explicitly assign values to explicit Integers. This should always result in primitives, unless boxing optimizations are totally broken.

For the last item, certainly a compiler improvement would be to treat Value&Integer as a primitive, which it can do since Integer is final. But that would not change the behavior of this code.

Again, for the last item, boxing logic could be improved such that only the accumulator would need to explicitly be an Integer (IOW, primitive + boxed always unboxes rather than sometimes boxing the primitive). But again, that improvement would not "break" this code.

@CeylonMigrationBot
Copy link
Author

[@jvasileff] Minor addition: I suppose pure Integers don't have to be primitives, with a "don't unbox" strategy. But then my other point above would have to be fixed for such a strategy to work, and in the end, this would result in this code performing as expected.

@CeylonMigrationBot
Copy link
Author

[@quintesse] I also beg to differ with your point 3, you don't get to decide what the compiler should always do, that's the compiler's business :)
It's not that this code will code will work or not, of course it works, and I agree with "Trick 1" (although hopefully even that we can improve in the future) but it's "Trick 2" that just looks nasty, even if it works.
That's why I'd be interested to know if my experimental code makes it unnecessary (because it already figured out by itself that the best thing to do would be to have an unboxed accumulator).

@CeylonMigrationBot
Copy link
Author

[@jvasileff] Ok, yeah, I agree that it looks ugly.

@CeylonMigrationBot
Copy link
Author

[@quintesse] Sorry, it's not the accumulator that "bothers" me, it's the unboxed value, that's the part that should be obvious to the compiler.

@CeylonMigrationBot
Copy link
Author

[@gavinking]
@jvasileff I don't love this implementation, but the performance results are significant.

My feeling is that if we're going to do an optimization like this I would prefer to do it by making sum() native and writing the optimized version in Java. I think that would be a cleaner way to wring every last ounce of performance out. But perhaps I'm wrong about that...

@CeylonMigrationBot
Copy link
Author

[@gavinking] P.S. I agree that this hack:

Integer unboxed = val;

Should not be necessary and that the Java backend should recognize that val is unboxed. Please open an issue against ceylon-compiler.

@CeylonMigrationBot
Copy link
Author

[@jvasileff]

this hack

See #2270

@CeylonMigrationBot
Copy link
Author

[@quintesse] Just a question @gavinking , but why is this an issue for the compiler? Couldn't the typechecker just immediately simplify a Value&Integer to plain Integer?

@CeylonMigrationBot
Copy link
Author

[@gavinking] @quintesse No, of course not. Value is not a subtype of Integer!

@CeylonMigrationBot
Copy link
Author

[@gavinking] I mean Integer is not a subtype of Value.

@CeylonMigrationBot
Copy link
Author

[@jvasileff] AFAICT, a native implementation would:

  1. Hide the ugliness
  2. Optimize away assert (is Value result = sum);

But I'm not seeing further optimization opportunities. The second item, while significant for small streams, may also be eliminated with #4505 and #2270 .

Some results:

Summary min/max/avg/rstddev/pct
lang::sum(ints); size=1: 85/95/91/3% (100%)
nativeSum(ints); size=1: 98/102/100/1% (114%)
math::sum(ints); size=1: 114/117/116/1% (133%)
optimized(ints); size=1: 167/175/169/1% (194%)

Summary min/max/avg/rstddev/pct
nativeSum(ints); size=2: 62/63/63/1% (100%)
optimized(ints); size=2: 64/66/65/1% (104%)
lang::sum(ints); size=2: 70/71/70/0% (113%)
math::sum(ints); size=2: 73/74/73/0% (117%)

Summary min/max/avg/rstddev/pct
nativeSum(ints); size=8: 35/37/36/1% (100%)
optimized(ints); size=8: 36/37/37/1% (100%)
math::sum(ints); size=8: 40/41/41/1% (112%)
lang::sum(ints); size=8: 52/53/53/1% (145%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=1024: 27/29/28/2% (100%)
nativeSum(ints); size=1024: 27/29/28/1% (100%)
math::sum(ints); size=1024: 27/30/28/3% (100%)
lang::sum(ints); size=1024: 48/52/50/2% (175%)

for:

package sandbox.ceylon.snap.base;

import com.redhat.ceylon.compiler.java.metadata.Ignore;
import com.redhat.ceylon.compiler.java.metadata.Name;
import com.redhat.ceylon.compiler.java.metadata.TypeInfo;
import com.redhat.ceylon.compiler.java.metadata.TypeParameter;
import com.redhat.ceylon.compiler.java.metadata.TypeParameters;
import com.redhat.ceylon.compiler.java.metadata.Variance;
import com.redhat.ceylon.compiler.java.runtime.model.TypeDescriptor;

import ceylon.language.Float;
import ceylon.language.Integer;
import ceylon.language.Iterable;
import ceylon.language.Iterator;
import ceylon.language.SharedAnnotation$annotation$;
import ceylon.language.Summable;

@com.redhat.ceylon.compiler.java.metadata.Ceylon(major = 8)
@com.redhat.ceylon.compiler.java.metadata.Method
public class nativeJavaSum_ {
    private nativeJavaSum_() {
    }

    @SharedAnnotation$annotation$
    @TypeInfo(
            value = "Value",
            erased = true)
    @TypeParameters({@TypeParameter(
            value = "Value",
            variance = Variance.NONE,
            satisfies = {"ceylon.language::Summable<Value>"},
            caseTypes = {})})
    public static <Value extends Summable<Value>>Value nativeJavaSum(@Ignore
    final TypeDescriptor $reified$Value, @Name("values")
    @TypeInfo(
            value = "{Value+}",
            erased = true)
    final Iterable<? extends Value, ? extends Object> values) {
        final Iterator<? extends Value> it = values.iterator();
        final Object first = it.next();
        if (first instanceof Integer) {
            long sum = ((Integer)first).longValue();
            while (true) {
                Object val$670;
                if ((val$670 = it.next()) instanceof Integer) {
                    sum += ((Integer)val$670).longValue();
                } else {
                    break;
                }
            }
            return (Value)Integer.instance(sum);
        } else {
            if (first instanceof Float) {
                double sum = ((Float)first).doubleValue();
                while (true) {
                    Object val$665;
                    if ((val$665 = it.next()) instanceof Float) {
                        sum += ((Float)val$665).doubleValue();
                    } else {
                        break;
                    }
                }
                return (Value) Float.instance(sum);
            } else {
                throw new UnsupportedOperationException();
            }
        }
    }
}

@CeylonMigrationBot
Copy link
Author

[@gavinking] Just to throw something extra out there, it seems to me that this (slightly ugly) implementation using each() is significantly faster than the version with a for loop, at least for streams with an optimized each() implementation:

shared Value sum2<Value>({Value+} values) 
        given Value satisfies Summable<Value> {
    variable value sum = values.first;
    variable value first = true;
    values.each(void (element) {
        if (first) {
            sum = element;
            first = false;
        }
        else {
            sum+=element;
        }
    });
    return sum;
}

@CeylonMigrationBot
Copy link
Author

[@gavinking] Scratch that: with a correct implementation it doesn't seem to make much difference.

However this implementation does seem to be faster:

shared Value sum4<Value>({Value+} values) 
        given Value satisfies Summable<Value> 
        => values.reduce(plus);

... at least in some cases.

@CeylonMigrationBot
Copy link
Author

[@gavinking] Yeah, that makes sense, since there is an optimized implementation of reduce() for collections backed by arrays.

@CeylonMigrationBot
Copy link
Author

[@jvasileff]
each on a Sequence is much faster than the while loop for the Integer optimized case:

    if (is Integer first) {
        // unbox; don't infer type Value&Integer
        variable Integer sum = 0;
        values.each(void (item) {
            assert (is Integer item);
            Integer unboxed = item;
            sum += unboxed;
        });
        assert (is Value result = sum);
        return result;
    }
Summary min/max/avg/rstddev/pct
opti-Each(ints); size=1024: 20/20/20/1% (100%)
optimized(ints); size=1024: 39/41/40/2% (197%)

but the overhead kills performance for n<=8:

Summary min/max/avg/rstddev/pct
optimized(ints); size=1: 129/151/142/7% (100%)
opti-Each(ints); size=1: 369/435/399/7% (284%)

and of course this ugly impl reads the first element twice, so it isn't really an option anyway. But for ceylon.math.integer::sum, maybe?

I wasn't able to see an advantage with reduce over the standard while loop:

Summary min/max/avg/rstddev/pct
lang::sum(ints); size=1: 102/118/109/5% (100%)
   Reduce(ints); size=1: 125/134/131/2% (122%)

Summary min/max/avg/rstddev/pct
lang::sum(ints); size=8: 73/76/75/1% (100%)
   Reduce(ints); size=8: 86/99/93/5% (117%)

Summary min/max/avg/rstddev/pct
lang::sum(ints); size=1024: 70/79/73/5% (100%)
   Reduce(ints); size=1024: 72/75/74/2% (103%)

@CeylonMigrationBot
Copy link
Author

[@gavinking] reduce() is much faster in the case of an ArraySequence.

@CeylonMigrationBot
Copy link
Author

[@jvasileff] Ok, weird.

print(type(ints));
// ceylon.language::ArraySequence<ceylon.language::Integer>

But hotspot is a strange beast.

@CeylonMigrationBot
Copy link
Author

[@gavinking] Apologies, @jvasileff, I have pushed a copy/paste of your code instead of merging this PR because I wanted to make it a native implementation. I know it's a bit rude. I try not to do that.

@CeylonMigrationBot
Copy link
Author

[@gavinking] P.S. @jvasileff thanks for the contribution!

@CeylonMigrationBot
Copy link
Author

[@jvasileff] No problem at all. Looks like the approach saved us both some time.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants