Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make ceylon.language::sum fast #728

Closed
wants to merge 1 commit into from

Conversation

jvasileff
Copy link
Member

This implementation avoids using a boxed type for the sum accumulator variable when summing Integers and Floats.

If you want to merge this, I can also make the change for product.

Benchmark Script

shared void sumTest() {
    for (size in [1, 2, 4, 8, 32, 128, 256, 1024]) {
        value ints = (1..size).collect(identity);
        value floats = (1..size).collect(Integer.float);
        value repeat = 2^27 / size;

        benchmark {
            scale = repeat * size  / 10;
            warmupIter = 3;
            benchIter = 5;
            ["lang::sum(ints); size=``size``:", () => (0:repeat).each((_) => sum(ints))],
            ["math::sum(ints); size=``size``:", () => (0:repeat).each((_) => mathSum(ints))],
            ["optimized(ints); size=``size``:", () => (0:repeat).each((_) => sumNew4(ints))]
        };

        benchmark {
            scale = repeat*size  / 10;
            warmupIter = 3;
            benchIter = 5;
            ["lang::sum(floats); size=``size``:", () => (0:repeat).each((_) => sum(floats))],
            ["math::sum(floats); size=``size``:", () => (0:repeat).each((_) => mathSumF(floats))],
            ["optimized(floats); size=``size``:", () => (0:repeat).each((_) => sumNew4(floats))]
        };
    }
}

Integer Results

Summary min/max/avg/rstddev/pct
math::sum(ints); size=1: 116/124/118/3% (100%)
optimized(ints); size=1: 120/127/123/2% (103%)
lang::sum(ints); size=1: 139/141/140/0% (119%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=2: 65/69/66/1% (100%)
lang::sum(ints); size=2: 69/71/70/1% (105%)
math::sum(ints); size=2: 73/76/74/1% (112%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=4: 47/47/47/0% (100%)
math::sum(ints); size=4: 52/53/52/1% (109%)
lang::sum(ints); size=4: 57/59/58/1% (121%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=8: 37/38/37/1% (100%)
math::sum(ints); size=8: 41/42/41/0% (111%)
lang::sum(ints); size=8: 53/55/54/1% (143%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=32: 32/32/32/0% (100%)
math::sum(ints); size=32: 33/37/34/4% (103%)
lang::sum(ints); size=32: 55/56/56/0% (172%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=128: 29/33/30/5% (100%)
math::sum(ints); size=128: 30/31/30/0% (105%)
lang::sum(ints); size=128: 50/52/51/1% (173%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=256: 29/33/31/4% (99%)
math::sum(ints); size=256: 30/35/32/6% (102%)
lang::sum(ints); size=256: 51/57/55/4% (173%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=1024: 29/30/29/2% (100%)
math::sum(ints); size=1024: 29/30/30/0% (102%)
lang::sum(ints); size=1024: 51/52/51/1% (175%)

Float Results

Summary min/max/avg/rstddev/pct
lang::sum(floats); size=1: 92/97/95/1% (100%)
optimized(floats); size=1: 126/130/128/1% (135%)
math::sum(floats); size=1: 149/154/152/1% (160%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=2: 68/70/69/1% (100%)
lang::sum(floats); size=2: 68/72/70/2% (100%)
math::sum(floats); size=2: 92/94/93/0% (134%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=4: 50/52/51/2% (100%)
lang::sum(floats); size=4: 59/63/61/3% (117%)
math::sum(floats); size=4: 63/66/65/1% (127%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=8: 39/40/40/0% (100%)
math::sum(floats); size=8: 48/50/49/1% (122%)
lang::sum(floats); size=8: 55/57/56/1% (138%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=32: 32/33/33/1% (100%)
math::sum(floats); size=32: 37/37/37/0% (113%)
lang::sum(floats); size=32: 57/59/58/1% (174%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=128: 29/30/29/1% (100%)
math::sum(floats); size=128: 33/34/34/1% (113%)
lang::sum(floats); size=128: 53/55/54/1% (184%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=256: 28/30/29/2% (100%)
math::sum(floats); size=256: 32/33/33/1% (115%)
lang::sum(floats); size=256: 53/54/53/0% (188%)

Summary min/max/avg/rstddev/pct
optimized(floats); size=1024: 28/29/28/1% (100%)
math::sum(floats); size=1024: 32/34/32/2% (115%)
lang::sum(floats); size=1024: 53/54/54/0% (190%)

This implementation avoids using a boxed type for the `sum`
accumulator variable when summing Integers and Floats.
@quintesse
Copy link
Member

Hmm I should test how my experimental boxing/unboxing code handles this. Because although it's a clever trick what you are doing here it depends completely on implementation details that could change at any time.

@jvasileff
Copy link
Member Author

I beg to differ.

  1. sum cannot possibly be fast without testing for Integer and Float; Value on its own will never be a primitive.
  2. "Trick 1" is to test first rather than the Iterable or Iterator. This is solely to avoid the huge isReified cost
  3. "Trick 2" is to explicitly assign values to explicit Integers. This should always result in primitives, unless boxing optimizations are totally broken.

For the last item, certainly a compiler improvement would be to treat Value&Integer as a primitive, which it can do since Integer is final. But that would not change the behavior of this code.

Again, for the last item, boxing logic could be improved such that only the accumulator would need to explicitly be an Integer (IOW, primitive + boxed always unboxes rather than sometimes boxing the primitive). But again, that improvement would not "break" this code.

@jvasileff
Copy link
Member Author

Minor addition: I suppose pure Integers don't have to be primitives, with a "don't unbox" strategy. But then my other point above would have to be fixed for such a strategy to work, and in the end, this would result in this code performing as expected.

@quintesse quintesse added this to the 1.2 milestone Aug 13, 2015
@quintesse
Copy link
Member

I also beg to differ with your point 3, you don't get to decide what the compiler should always do, that's the compiler's business :)
It's not that this code will code will work or not, of course it works, and I agree with "Trick 1" (although hopefully even that we can improve in the future) but it's "Trick 2" that just looks nasty, even if it works.
That's why I'd be interested to know if my experimental code makes it unnecessary (because it already figured out by itself that the best thing to do would be to have an unboxed accumulator).

@jvasileff
Copy link
Member Author

Ok, yeah, I agree that it looks ugly.

@quintesse
Copy link
Member

Sorry, it's not the accumulator that "bothers" me, it's the unboxed value, that's the part that should be obvious to the compiler.

@gavinking
Copy link
Member

@jvasileff I don't love this implementation, but the performance results are significant.

My feeling is that if we're going to do an optimization like this I would prefer to do it by making sum() native and writing the optimized version in Java. I think that would be a cleaner way to wring every last ounce of performance out. But perhaps I'm wrong about that...

@gavinking
Copy link
Member

P.S. I agree that this hack:

Integer unboxed = val;

Should not be necessary and that the Java backend should recognize that val is unboxed. Please open an issue against ceylon-compiler.

@jvasileff
Copy link
Member Author

this hack

See ceylon/ceylon-compiler#2270

@quintesse
Copy link
Member

Just a question @gavinking , but why is this an issue for the compiler? Couldn't the typechecker just immediately simplify a Value&Integer to plain Integer?

@gavinking
Copy link
Member

@quintesse No, of course not. Value is not a subtype of Integer!

@gavinking
Copy link
Member

I mean Integer is not a subtype of Value.

@jvasileff
Copy link
Member Author

AFAICT, a native implementation would:

  1. Hide the ugliness
  2. Optimize away assert (is Value result = sum);

But I'm not seeing further optimization opportunities. The second item, while significant for small streams, may also be eliminated with ceylon/ceylon-spec#1399 and ceylon/ceylon-compiler#2270 .

Some results:

Summary min/max/avg/rstddev/pct
lang::sum(ints); size=1: 85/95/91/3% (100%)
nativeSum(ints); size=1: 98/102/100/1% (114%)
math::sum(ints); size=1: 114/117/116/1% (133%)
optimized(ints); size=1: 167/175/169/1% (194%)

Summary min/max/avg/rstddev/pct
nativeSum(ints); size=2: 62/63/63/1% (100%)
optimized(ints); size=2: 64/66/65/1% (104%)
lang::sum(ints); size=2: 70/71/70/0% (113%)
math::sum(ints); size=2: 73/74/73/0% (117%)

Summary min/max/avg/rstddev/pct
nativeSum(ints); size=8: 35/37/36/1% (100%)
optimized(ints); size=8: 36/37/37/1% (100%)
math::sum(ints); size=8: 40/41/41/1% (112%)
lang::sum(ints); size=8: 52/53/53/1% (145%)

Summary min/max/avg/rstddev/pct
optimized(ints); size=1024: 27/29/28/2% (100%)
nativeSum(ints); size=1024: 27/29/28/1% (100%)
math::sum(ints); size=1024: 27/30/28/3% (100%)
lang::sum(ints); size=1024: 48/52/50/2% (175%)

for:

package sandbox.ceylon.snap.base;

import com.redhat.ceylon.compiler.java.metadata.Ignore;
import com.redhat.ceylon.compiler.java.metadata.Name;
import com.redhat.ceylon.compiler.java.metadata.TypeInfo;
import com.redhat.ceylon.compiler.java.metadata.TypeParameter;
import com.redhat.ceylon.compiler.java.metadata.TypeParameters;
import com.redhat.ceylon.compiler.java.metadata.Variance;
import com.redhat.ceylon.compiler.java.runtime.model.TypeDescriptor;

import ceylon.language.Float;
import ceylon.language.Integer;
import ceylon.language.Iterable;
import ceylon.language.Iterator;
import ceylon.language.SharedAnnotation$annotation$;
import ceylon.language.Summable;

@com.redhat.ceylon.compiler.java.metadata.Ceylon(major = 8)
@com.redhat.ceylon.compiler.java.metadata.Method
public class nativeJavaSum_ {
    private nativeJavaSum_() {
    }

    @SharedAnnotation$annotation$
    @TypeInfo(
            value = "Value",
            erased = true)
    @TypeParameters({@TypeParameter(
            value = "Value",
            variance = Variance.NONE,
            satisfies = {"ceylon.language::Summable<Value>"},
            caseTypes = {})})
    public static <Value extends Summable<Value>>Value nativeJavaSum(@Ignore
    final TypeDescriptor $reified$Value, @Name("values")
    @TypeInfo(
            value = "{Value+}",
            erased = true)
    final Iterable<? extends Value, ? extends Object> values) {
        final Iterator<? extends Value> it = values.iterator();
        final Object first = it.next();
        if (first instanceof Integer) {
            long sum = ((Integer)first).longValue();
            while (true) {
                Object val$670;
                if ((val$670 = it.next()) instanceof Integer) {
                    sum += ((Integer)val$670).longValue();
                } else {
                    break;
                }
            }
            return (Value)Integer.instance(sum);
        } else {
            if (first instanceof Float) {
                double sum = ((Float)first).doubleValue();
                while (true) {
                    Object val$665;
                    if ((val$665 = it.next()) instanceof Float) {
                        sum += ((Float)val$665).doubleValue();
                    } else {
                        break;
                    }
                }
                return (Value) Float.instance(sum);
            } else {
                throw new UnsupportedOperationException();
            }
        }
    }
}

@gavinking
Copy link
Member

Just to throw something extra out there, it seems to me that this (slightly ugly) implementation using each() is significantly faster than the version with a for loop, at least for streams with an optimized each() implementation:

shared Value sum2<Value>({Value+} values) 
        given Value satisfies Summable<Value> {
    variable value sum = values.first;
    variable value first = true;
    values.each(void (element) {
        if (first) {
            sum = element;
            first = false;
        }
        else {
            sum+=element;
        }
    });
    return sum;
}

@gavinking
Copy link
Member

Scratch that: with a correct implementation it doesn't seem to make much difference.

However this implementation does seem to be faster:

shared Value sum4<Value>({Value+} values) 
        given Value satisfies Summable<Value> 
        => values.reduce(plus);

... at least in some cases.

@gavinking
Copy link
Member

Yeah, that makes sense, since there is an optimized implementation of reduce() for collections backed by arrays.

@jvasileff
Copy link
Member Author

each on a Sequence is much faster than the while loop for the Integer optimized case:

    if (is Integer first) {
        // unbox; don't infer type Value&Integer
        variable Integer sum = 0;
        values.each(void (item) {
            assert (is Integer item);
            Integer unboxed = item;
            sum += unboxed;
        });
        assert (is Value result = sum);
        return result;
    }
Summary min/max/avg/rstddev/pct
opti-Each(ints); size=1024: 20/20/20/1% (100%)
optimized(ints); size=1024: 39/41/40/2% (197%)

but the overhead kills performance for n<=8:

Summary min/max/avg/rstddev/pct
optimized(ints); size=1: 129/151/142/7% (100%)
opti-Each(ints); size=1: 369/435/399/7% (284%)

and of course this ugly impl reads the first element twice, so it isn't really an option anyway. But for ceylon.math.integer::sum, maybe?

I wasn't able to see an advantage with reduce over the standard while loop:

Summary min/max/avg/rstddev/pct
lang::sum(ints); size=1: 102/118/109/5% (100%)
   Reduce(ints); size=1: 125/134/131/2% (122%)

Summary min/max/avg/rstddev/pct
lang::sum(ints); size=8: 73/76/75/1% (100%)
   Reduce(ints); size=8: 86/99/93/5% (117%)

Summary min/max/avg/rstddev/pct
lang::sum(ints); size=1024: 70/79/73/5% (100%)
   Reduce(ints); size=1024: 72/75/74/2% (103%)

@gavinking
Copy link
Member

reduce() is much faster in the case of an ArraySequence.

@jvasileff
Copy link
Member Author

Ok, weird.

print(type(ints));
// ceylon.language::ArraySequence<ceylon.language::Integer>

But hotspot is a strange beast.

@gavinking gavinking self-assigned this Sep 18, 2015
gavinking added a commit that referenced this pull request Oct 1, 2015
this is based on code + ideas by @jvasileff in #728, but using native
@gavinking
Copy link
Member

Apologies, @jvasileff, I have pushed a copy/paste of your code instead of merging this PR because I wanted to make it a native implementation. I know it's a bit rude. I try not to do that.

@gavinking gavinking closed this Oct 1, 2015
@gavinking
Copy link
Member

P.S. @jvasileff thanks for the contribution!

@jvasileff
Copy link
Member Author

No problem at all. Looks like the approach saved us both some time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants