
Haskell version should be more idiomatic, use proper tools & threading, use benchmarking #26

Merged: 13 commits into atemerev:master, Feb 15, 2016

Conversation

bitemyapp
Contributor

Fixes #17, #24, #25

@bitemyapp
Contributor Author

New version (Based on @bartavelle's solution) takes 51ms instead of 2.5 seconds on my machine.

@bitemyapp
Contributor Author

@atemerev is there something blocking this PR?

@codygman

@atemerev I'm looking forward to seeing this Haskell version in the benchmark, if there's anything blocking this PR I can help as well!

@bitemyapp
Contributor Author

I added more results from running these on Ubuntu and tried to get at least basic intervals for the non-Haskell implementations (I don't know of criterion equivalents for the others), ideally so people can get a sense of the upper and lower bounds.

@ChristopherKing42
Contributor

The current version of Parallel.hs is wrong. It outputs 500000500000 instead of 499999500000. This is fixed in https://github.com/bitemyapp/skynet/pull/1.

@Gabriella439

I like this version because it shows the performance trade-offs at various points on the design spectrum.

@Rydgel

Rydgel commented Feb 15, 2016

I really want to see this merged as well.

cabal-version: >=1.10

executable skynet
ghc-options: -O2 -threaded -rtsopts

💥

@ghost

ghost commented Feb 15, 2016

+1. Go gets an unfair advantage in the current master, since it's using channels while Haskell simply uses waits. The proper comparison should be between languages using the same paradigm (and since Haskell can use multiple paradigms, that offers it an advantage here).
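To make the paradigm difference concrete, here is a minimal sketch (my own illustration, not code from this repo) of what a channel-based Haskell version might look like, using Control.Concurrent.Chan so it mirrors the Go goroutine-and-channel approach:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)

-- Each "agent" either reports its number (leaf) or spawns 10
-- children on a fresh channel and sums their replies.
skynet :: Chan Int -> Int -> Int -> IO ()
skynet out num size
  | size == 1 = writeChan out num
  | otherwise = do
      c <- newChan
      let childSize = size `div` 10
      mapM_ (\i -> forkIO (skynet c (num + i * childSize) childSize)) [0 .. 9]
      total <- sum <$> sequence (replicate 10 (readChan c))
      writeChan out total

main :: IO ()
main = do
  c <- newChan
  _ <- forkIO (skynet c 0 1000000)
  readChan c >>= print
```

Built with -O2 -threaded and run with +RTS -N, every level's children report back over a fresh channel, just as the Go version does; the MVar/wait-based version skips that bookkeeping.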

version: 0.1.0.0
synopsis: Simple project template from stack
description: Please see README.md
homepage: http://github.com/bitemyapp/skynet#readme

Shouldn't this be @atemerev's repo? (I'm not sure what the conventions for GitHub and cabal are.)


All of these fields are optional and can be removed. Only the name field is mandatory IIRC

Contributor Author


It's from the stack template, an oversight. I can remove it if it's an issue. I didn't expect it would get uploaded to Hackage so shouldn't have any relevance.

@ChristopherKing42
Contributor

@wizzard0 is there anything else that needs to be done with this pull request?

atemerev added a commit that referenced this pull request Feb 15, 2016
Haskell version should be more idiomatic, use proper tools & threading, use benchmarking
@atemerev atemerev merged commit 82f318a into atemerev:master Feb 15, 2016
@bitemyapp
Contributor Author

@atemerev Thank you :)

@oblitum

oblitum commented Feb 16, 2016

The chan version of this is returning a wrong result.

After analysing the parallel version, I'm contemplating whether a C++ pure compile-time version would be doable and fair.

@oblitum

oblitum commented Feb 16, 2016

OK, this is essentially the same algorithm used in the new Haskell "parallel" version that is being unfairly compared to all the other, non-parallel schemes; it's just ported to C++ and, hey, it's a compile-time version, isn't that also fair? I'm not bothering to construct a PR for this, except in case the owner is interested in merging it for a side-by-side comparison, just like Haskell is doing, because, hey, it's an idiomatic language feature! Why not 😄

#include <chrono>
#include <cstdint>
#include <iostream>

constexpr int64_t skynet(int64_t levels, int64_t children, int64_t position = 0) {
    if (levels == 0)
        return position;
    int64_t sum = 0;
    for(int64_t i = 0; i < children; ++i)
        sum += skynet(levels - 1, children, position*children + i);
    return sum;
}

int main() {
    using namespace std;
    using namespace std::chrono;

    auto start = high_resolution_clock::now();
    constexpr auto result = skynet(6, 10);  // remove constexpr to eval at runtime
    auto elapsed = high_resolution_clock::now() - start;

    cout << "Result: " << result << '\n';
    cout << "time    " << duration_cast<milliseconds>(elapsed).count() << " ms" << endl;
}

With clang, you should compile this with:

  • clang++ -fconstexpr-steps=10000000 -std=c++14 -O3 skynet.cpp -o skynet

with gcc, do this:

  • g++ -std=c++14 -O3 skynet.cpp -o skynet

Clang is smarter and compiles this in quite a short time compared to GCC; compilation time with clang should be shorter than what Haskell/stack takes to compile its stuff.

Time to run: 0 ms

Switching to runtime version is easy, as I've commented in the code:

Time for runtime version: 10 ms

In any case, this one has evolved like the Haskell version: it has deprecated not only concurrency/channels/CSP/actors/whatever, but parallelism as well! As they say: "less is exponentially more".

@oblitum

oblitum commented Feb 16, 2016

cc #2, #11.

@bartavelle

What makes you think the Haskell version is not parallel, or is not computing its result at runtime? The version I wrote, which is very close to what has been merged, does both!

@bartavelle

Here is a profiling run without the benchmarking:

        Tue Feb 16 08:29 2016 Time and Allocation Profiling Report  (Final)

           skynet +RTS -p -RTS

        total time  =        0.12 secs   (125 ticks @ 1000 us, 1 processor)
        total alloc = 182,304,880 bytes  (excludes profiling overheads)

COST CENTRE  MODULE   %time %alloc

skynet.sky   Parallel  57.6   70.7
skynet.sky.\ Parallel  42.4   29.3


                                                               individual     inherited
COST CENTRE          MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN                 MAIN                     60           0    0.0    0.0   100.0  100.0
 run                 Parallel                122           0    0.0    0.0     0.0    0.0
  skynet             Parallel                123           1    0.0    0.0     0.0    0.0
 CAF:main1           Main                    118           0    0.0    0.0     0.0    0.0
  main               Main                    120           1    0.0    0.0     0.0    0.0
 CAF:run1            Parallel                110           0    0.0    0.0     0.0    0.0
  run                Parallel                121           1    0.0    0.0     0.0    0.0
 CAF:run4            Parallel                109           0    0.0    0.0   100.0  100.0
  run                Parallel                124           0    0.0    0.0   100.0  100.0
   skynet            Parallel                125           0    0.0    0.0   100.0  100.0
    skynet.sky       Parallel                126     1111111   57.6   70.7   100.0  100.0
     skynet.sky.\    Parallel                130     1111110   42.4   29.3    42.4   29.3
 CAF:childnums_r4kW  Parallel                108           0    0.0    0.0     0.0    0.0
  run                Parallel                127           0    0.0    0.0     0.0    0.0
   skynet            Parallel                128           0    0.0    0.0     0.0    0.0
    skynet.childnums Parallel                129           1    0.0    0.0     0.0    0.0
 CAF:run2            Parallel                107           0    0.0    0.0     0.0    0.0
  run                Parallel                132           0    0.0    0.0     0.0    0.0
 CAF:run3            Parallel                106           0    0.0    0.0     0.0    0.0
  run                Parallel                131           0    0.0    0.0     0.0    0.0
 CAF                 Data.Time.Clock.UTC     105           0    0.0    0.0     0.0    0.0
 CAF                 GHC.IO.Handle.FD        100           0    0.0    0.0     0.0    0.0
 CAF                 GHC.Event.Thread         98           0    0.0    0.0     0.0    0.0
 CAF                 GHC.IO.Encoding          95           0    0.0    0.0     0.0    0.0
 CAF                 GHC.IO.Handle.Text       92           0    0.0    0.0     0.0    0.0
 CAF                 GHC.Conc.Signal          88           0    0.0    0.0     0.0    0.0
 CAF                 GHC.IO.Encoding.Iconv    85           0    0.0    0.0     0.0    0.0

You can see that the sky function is executed the correct number of times. Also:

$ time .stack-work/install/x86_64-linux/lts-5.1/7.10.3/bin/skynet +RTS -N8 -p
Result: 499999500000 in 0.128098s
 +RTS -N8 -p  0,93s user 0,05s system 735% cpu 0,134 total

It is using all cores on my laptop.
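For readers following along: the merged Parallel.hs is never reproduced in this thread. Judging from the cost centres in the profile above (skynet.sky, skynet.childnums), its shape is roughly like the following sketch built on the parallel package's evaluation strategies. This is a reconstruction under those assumptions, not the actual merged code:

```haskell
import Control.Parallel.Strategies (parList, rseq, using)

-- Pure recursion over the tree; parList rseq sparks the evaluation
-- of the ten subtree sums at each level so the runtime can steal
-- them across capabilities.
skynet :: Int -> Int -> Int
skynet levels children = sky levels 0
  where
    childnums = [0 .. children - 1]
    sky 0 position = position
    sky l position =
      sum ([ sky (l - 1) (position * children + cn) | cn <- childnums ]
             `using` parList rseq)

main :: IO ()
main = print (skynet 6 10)
```

Built with ghc -O2 -threaded (plus the parallel package) and run with +RTS -N, the subtrees are evaluated by work-stealing sparks rather than communicated over channels, which is exactly the paradigm difference being debated here.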

@oblitum

oblitum commented Feb 16, 2016

@bartavelle I didn't state it's not parallel; I stated it is, meaning that this turn in the implementation has dramatically changed what's being compared, whatever this bench is trying to compare. Comparing Go using channels and goroutines with this Haskell version, which is just trivial parallel code without any channels, makes as much sense as comparing the Haskell parallel version to the C++ compile-time/runtime version (I can't even say to what extent the compiler would unroll or vectorize the linear C++ runtime code, effectively making it parallel to some degree). Despite this, maybe it's still possible to shoehorn something like Boost.Coroutine2 into it, just for the sake of claiming to use coroutines.

@bartavelle

Ah sorry, I didn't get your point. I think you are right, but this is a benchmark game ;)

@oblitum

oblitum commented Feb 16, 2016

@bartavelle so, is there anyone left to beat 0? ;-)

@bitemyapp
Contributor Author

@oblitum I already rejected a PR to calculate it at compile-time in Haskell on my fork. There's your "0". I wouldn't accept anything not parallel/concurrent either.

@oblitum

oblitum commented Feb 16, 2016

@bitemyapp you mean compile-time?

@bitemyapp
Contributor Author

@oblitum It's 0142 in my timezone and not far off in yours. Slow down son.

@oblitum

oblitum commented Feb 16, 2016

@bitemyapp I'm not into this bench anymore; I'd just like to point out that after this pull the README has been left inconsistent/misleading. The implementation has been massaged for parallelism, and numbers were published for it, while the claim that it is "coroutine/channel"-based has been left untouched. And, as said before, the internal chan version is still returning wrong results.

@ChristopherKing42
Contributor

@oblitum See #54

@jb55

jb55 commented Feb 17, 2016

👎 How is this in the spirit of the competition at all?

The main benchmark should be changed to use concurrency (forkIO), not tight-loop parallelism.

@Gabriella439

So I think the number to report should be the one from the version using concurrency (i.e. forkIO), but I believe the other versions should still be retained to show better ways to solve the same problem.

@oblitum

oblitum commented May 29, 2016

@ghost

ghost commented May 29, 2016

@oblitum I disagree.

Just to be clear, Haskell lists are not iterators. Consider the following sequential Haskell version:

main :: IO ()
main = print $ sum [0..999999]

$ ghc -O2 slow.hs
$ time ./slow
499999500000

real 0m0.040s
user 0m0.038s
sys  0m0.004s

Let's compare that with a sequential Rust version:

fn main() {
    let sum: i64 = (0..1000000).fold(0, |sum, x| sum+x);
    println!("{}",sum);
}

$ rustc -O test.rs
$ time ./test
499999500000

real 0m0.003s
user 0m0.002s
sys  0m0.001s

You might be tempted to think that Haskell is slower. But it's not! Lists are not loops. Consider a fairer comparison:

summation :: Int -> Int -> Int
summation accum end
    | end == 0  = accum
    | otherwise = summation (accum+end) (end-1)

main :: IO ()
main = print $ summation 0 999999

$ ghc -O2 fast.hs
$ time ./fast
499999500000

real 0m0.003s
user 0m0.002s
sys  0m0.001s

The sequential Haskell version is pretty close to the sequential Rust version, despite the fact that the Haskell version also needed to spawn a garbage collector. You might get slightly better performance in a language without a garbage collector, such as Rust, but it's not going to be a huge difference, provided you properly optimize the code.

I do agree that rpar does work stealing, and it is slightly unfair to compare it to other languages which are not doing the same thing. However, I don't think the Rust version you wrote is equivalent, because it's using iterators instead of lists.

@oblitum

oblitum commented May 29, 2016

@siddhanathan 😄 thanks for your analysis. Sorry for not being fair with Haskell now, I guess we are even then?

To tell the truth, I just cared to use Haskell's parallel version as-is because it's what is present here in the repo; whether lists are not loops, are lazily evaluated, etc., seems a bit far from the rpar topic, and a Haskell-specific optimization.

IIRC, Haskell offers mechanisms for non-lazily-evaluated lists, so I wonder whether just using one of those would lead to the same improvement you got. Even then it would still be different from Rust's iterator-over-range version, which should be reduced to a loop automatically by the compiler (I'm unsure when using Rayon, but it's a common expectation of the Rust compiler). Anyway, in Rust I'm still working with the high-level concept of a list (a range and an iterator are not conceptually the same as a loop) while knowing the compiler is smart enough to turn it into a loop in machine code.

@ghost

ghost commented May 30, 2016

@oblitum I doubt the strictness (strict vs lazy) or type of data structure (linked list vs arrays) would yield that sort of speedup. There's stream fusion, but that's a whole new topic.
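As an illustrative aside (a sketch added for clarity, not code from this thread): with -O2, GHC's foldr/build fusion can often turn a strict left fold over an enumeration into a tight accumulator loop, which is the effect stream fusion generalizes:

```haskell
import Data.List (foldl')

-- foldl' is strict in the accumulator; with -O2 GHC typically fuses
-- the [0 .. 999999] enumeration away, leaving a plain counting loop
-- with no intermediate list allocated.
main :: IO ()
main = print (foldl' (+) 0 [0 .. 999999 :: Int])
```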

I played around with the MVar code a little, and managed to reduce the time it spent in garbage collection:

{-# LANGUAGE BangPatterns #-}

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Data.Time.Clock (getCurrentTime, diffUTCTime)

loop :: Int -> Int -> Int -> Int -> Int -> IO Int
loop !accum num size div !i
    | i == 0 = return accum
    | otherwise = do
        c <- newEmptyMVar
        forkIO (skynet c subNum sizeDiv div)
        s <- takeMVar c
        loop (accum+s) num size div (i-1)
      where
        !subNum = num + (i-1) * sizeDiv
        !sizeDiv = size `quot` div

skynet :: MVar Int -> Int -> Int -> Int -> IO ()
skynet c num size div
    | size == 1 = putMVar c num
    | otherwise = do
        result <- loop 0 num size div div
        putMVar c result

main :: IO ()
main = do
    start <- getCurrentTime
    c <- newEmptyMVar
    forkIO (skynet c 0 1000000 10)
    result <- takeMVar c
    end <- getCurrentTime
    putStrLn $ concat [ "Result: "
                      , show result
                      , " in "
                      , show (diffUTCTime end start) ]

Definitely not the prettiest code, but it yields performance similar to the Go version:

$ go run skynet.go 
Result: 499999500000 in 1221 ms.
$ ghc -O2 -threaded -rtsopts mvar.hs
$ ./mvar +RTS -N4
Result: 499999500000 in 1.343052s

It's still spending way too much time in garbage collection. I'm sure the numbers can definitely improve further.

As @jb55 mentioned earlier, perhaps it's best to replace the numbers for the current Haskell benchmark with these numbers.

@oblitum

oblitum commented May 30, 2016

@siddhanathan Yes, de facto. For example, the following Rust version, which creates a vector instead of using a range, is an order of magnitude slower, on par with the current Haskell version:

extern crate time;
extern crate rayon;

use time::PreciseTime;
use rayon::prelude::*;

type T = usize;

fn skynet(levels: T, children: T) -> T {
    fn sky(levels: T, children: T, position: T) -> T {
        let childnums: Vec<_> = (0..children).collect();
        match levels {
            0 => position,
            _ => childnums.par_iter()
                 .map(|cn| sky(levels - 1, children, position * children + cn)).sum()
        }
    }
    sky(levels, children, 0)
}

fn main() {
    let start = PreciseTime::now();
    let result = skynet(6, 10);
    let end = PreciseTime::now();
    println!("Result: {} in {} ms", result, start.to(end).num_milliseconds());
}

I'd like a short way of creating a stack-allocated, initialized array instead of using a vector; still, I guess it wouldn't change things much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants