Haskell version should be more idiomatic, use proper tools & threading, use benchmarking #26
Conversation
New version (based on @bartavelle's solution) takes 51 ms instead of 2.5 seconds on my machine.

@atemerev is there something blocking this PR?

@atemerev I'm looking forward to seeing this Haskell version in the benchmark; if there's anything blocking this PR I can help as well!

I added more results for running these on Ubuntu and tried to get at least basic intervals for the non-Haskell stuff (I don't know of criterion equivalents for the others), ideally so people can get a sense of the upper and lower bounds.

The current version of …

I like this version because it shows the performance trade-off of various points on the design spectrum.

I really want to see this merged as well.
```
cabal-version:       >=1.10

executable skynet
  ghc-options:       -O2 -threaded -rtsopts
```
💥
+1. Go gets an unfair advantage in current master since it's using channels, while Haskell simply uses waits. The proper comparison should be between languages using the same paradigm (and since Haskell can use multiple, that offers an advantage here).
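For context, the Go side of that comparison spawns a goroutine per node and sums the results over channels. A minimal sketch of that algorithm (my reconstruction of the idea, not necessarily the repo's exact skynet.go):

```go
package main

import "fmt"

// skynet covers the positions num..num+size-1. A leaf (size == 1)
// reports its own ordinal; an inner node spawns div child goroutines
// and sums their results, sending the total on channel c.
func skynet(c chan int64, num, size, div int64) {
	if size == 1 {
		c <- num
		return
	}
	rc := make(chan int64)
	for i := int64(0); i < div; i++ {
		go skynet(rc, num+i*(size/div), size/div, div)
	}
	var sum int64
	for i := int64(0); i < div; i++ {
		sum += <-rc
	}
	c <- sum
}

func main() {
	c := make(chan int64)
	go skynet(c, 0, 1000000, 10)
	fmt.Println(<-c) // sum of 0..999999 = 499999500000
}
```

This is the channel/goroutine paradigm the commenter is contrasting with the merged Haskell version's plain parallel waits.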
```
version:             0.1.0.0
synopsis:            Simple project template from stack
description:         Please see README.md
homepage:            http://github.com/bitemyapp/skynet#readme
```
Shouldn't this be @atemerev's repo? (I'm not sure what the conventions for GitHub and Cabal are.)
All of these fields are optional and can be removed. Only the `name` field is mandatory, IIRC.
It's from the stack template; an oversight. I can remove it if it's an issue. I didn't expect it would get uploaded to Hackage, so it shouldn't have any relevance.
Add stack hidden files to .gitignore
@wizzard0 is there anything else that needs to be done with this pull request?
@atemerev Thank you :)
The chan version of this is returning a wrong result. After analysing the parallel version… I'm contemplating whether a pure compile-time C++ version is doable and fair.
OK, this is essentially the same algorithm that is being used in the new Haskell "parallel" version that is being unfairly compared to all other non-parallel schemes, only ported to C++, and, hey, it's a compile-time version, isn't that also fair? I'm not bothering to construct a PR for this, except in case the owner is interested in merging it to compare side-by-side, just like Haskell is doing, because, hey, it's an idiomatic language feature! Why not 😄

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>

constexpr int64_t skynet(int64_t levels, int64_t children, int64_t position = 0) {
    if (levels == 0)
        return position;
    int64_t sum = 0;
    for (int64_t i = 0; i < children; ++i)
        sum += skynet(levels - 1, children, position * children + i);
    return sum;
}

int main() {
    using namespace std;
    using namespace std::chrono;
    auto start = high_resolution_clock::now();
    constexpr auto result = skynet(6, 10); // remove constexpr to eval at runtime
    auto elapsed = high_resolution_clock::now() - start;
    cout << "Result: " << result << '\n';
    cout << "time " << duration_cast<milliseconds>(elapsed).count() << " ms" << endl;
}
```

With clang, you should compile this with: …

With gcc, do this: …

Clang is smarter and compiles this in quite a short time compared to GCC. Compilation time for clang should be shorter than what haskell/stack takes to compile its stuff.

Time to run: 0 ms. Switching to the runtime version is easy, as I've commented in the code. Time for the runtime version: 10 ms.

In all cases, this one has evolved like the Haskell version, and has deprecated not only concurrency/channels/CSP/actors/whatever, but also parallelism! As they say: "less is exponentially more".
What makes you think the Haskell version is not parallel, or is not computing its result at runtime? The version I wrote, which is very close to what has been merged, does!
Here is a profiling run without the benchmarking: …

You can see that the …

It is using all cores on my laptop.
@bartavelle I didn't state it's not parallel; I stated it is, meaning that this turn in the implementation has changed dramatically what's being compared, whatever this bench is trying to compare. Comparing Go using channels and goroutines with this Haskell version, which is just trivial parallel code without any channels, makes as much sense as comparing the Haskell parallel version to the C++ compile-time/runtime version (I can't even state to what extent the compiler would unroll, vectorize, etc., the linear C++ code at runtime, effectively making it parallel to some degree). Despite this, maybe it's still possible to shoehorn something like Boost.Coroutine2 into this just for the sake of claiming to be using coroutines.
Ah sorry, I didn't get your point. I think you are right, but this is a benchmark game ;)

@bartavelle so, is there anyone left to beat 0? ;-)

@oblitum I already rejected a PR to calculate it at compile time in Haskell on my fork. There's your "0". I wouldn't accept anything not parallel/concurrent either.

@bitemyapp you mean compile-time?

@oblitum It's 0142 in my timezone and not far off in yours. Slow down, son.

@bitemyapp I'm not into this bench anymore; I'd just like to point out that after this pull the README has been left inconsistent/misleading. The implementation has been massaged for parallelism, and numbers were published for it, while the claim that it uses coroutines/channels has been left untouched. And, as said before, the internal chan version is also returning wrong results.

👎 How is this in the spirit of the competition at all? The main benchmark should be changed to use concurrency (forkIO), not tight-loop parallelism.

So I think the number to report should be the version using concurrency (i.e. …
@oblitum I disagree. Just to be clear, Haskell lists are not iterators. Consider the following sequential Haskell version:

```haskell
main :: IO ()
main = print $ sum [0..999999]
```

```
$ ghc -O2 slow.hs
$ time ./slow
499999500000

real	0m0.040s
user	0m0.038s
sys	0m0.004s
```

Let's compare that with a sequential Rust version:

```rust
fn main() {
    let sum: i64 = (0..1000000).fold(0, |sum, x| sum + x);
    println!("{}", sum);
}
```

```
$ rustc -O test.rs
$ time ./test
499999500000

real	0m0.003s
user	0m0.002s
sys	0m0.001s
```

You might be tempted to think that Haskell is slower. But it's not! Lists are not loops. Consider a fairer comparison:

```haskell
summation :: Int -> Int -> Int
summation accum end
  | end == 0  = accum
  | otherwise = summation (accum+end) (end-1)

main :: IO ()
main = print $ summation 0 999999
```

```
$ ghc -O2 fast.hs
$ time ./fast
499999500000

real	0m0.003s
user	0m0.002s
sys	0m0.001s
```

The sequential Haskell version is pretty close to the sequential Rust version, despite the fact that the Haskell version also needed to spawn a garbage collector. You might get slightly better performance in a language without a garbage collector, such as Rust, but it's not going to be a huge difference, given that you properly optimize the code. I do agree that …
@siddhanathan 😄 thanks for your analysis. Sorry for not being fair with Haskell; I guess we are even then? To tell the truth, I just cared to use Haskell's parallel version as-is because it's what is present here in the repo; whether lists are not loops, are lazily evaluated, etc., seems a bit far from the point. IIRC, Haskell offers mechanisms for non-lazily-evaluated lists, so I wonder whether just using those would lead to the same improvement you got, even though it would still be different from Rust's iterator-over-a-range version, which should probably be reduced to a loop automatically by the compiler (unsure when using Rayon, but that's a common expectation of the Rust compiler). Anyway, in Rust I'm still working in the language with the high-level concept of a list (a range and an iterator are not conceptually the same as a loop) while knowing the compiler is smart enough to turn it into a loop in machine code.
@oblitum I doubt the strictness (strict vs lazy) or the type of data structure (linked list vs array) would yield that sort of speedup. There's stream fusion, but that's a whole new topic.

I played around with the MVar code a little, and managed to reduce the time it spent in garbage collection:

```haskell
{-# LANGUAGE BangPatterns #-}

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Data.Time.Clock (getCurrentTime, diffUTCTime)

loop :: Int -> Int -> Int -> Int -> Int -> IO Int
loop !accum num size div !i
  | i == 0    = return accum
  | otherwise = do
      c <- newEmptyMVar
      forkIO (skynet c subNum sizeDiv div)
      s <- takeMVar c
      loop (accum+s) num size div (i-1)
  where
    !subNum  = num + (i-1) * sizeDiv
    !sizeDiv = size `quot` div

skynet :: MVar Int -> Int -> Int -> Int -> IO ()
skynet c num size div
  | size == 1 = putMVar c num
  | otherwise = do
      result <- loop 0 num size div div
      putMVar c result

main :: IO ()
main = do
  start <- getCurrentTime
  c <- newEmptyMVar
  forkIO (skynet c 0 1000000 10)
  result <- takeMVar c
  end <- getCurrentTime
  putStrLn $ concat [ "Result: "
                    , show result
                    , " in "
                    , show (diffUTCTime end start) ]
```

Definitely not the prettiest code, but it yields similar performance compared to the Go version:

```
$ go run skynet.go
Result: 499999500000 in 1221 ms.
$ ghc -O2 -threaded -rtsopts mvar.hs
$ ./mvar +RTS -N4
Result: 499999500000 in 1.343052s
```

It's still spending way too much time in garbage collection. I'm sure the numbers can definitely improve further. As @jb55 mentioned earlier, perhaps it's best to replace the numbers from the current Haskell benchmarks with these numbers.
@siddhanathan Yes, de facto. For example, the following Rust version, which creates a vector instead of using a range, is an order of magnitude slower, on par with the current Haskell version:

```rust
extern crate time;
extern crate rayon;

use time::PreciseTime;
use rayon::prelude::*;

type T = usize;

fn skynet(levels: T, children: T) -> T {
    fn sky(levels: T, children: T, position: T) -> T {
        let childnums: Vec<_> = (0..children).collect();
        match levels {
            0 => position,
            _ => childnums.par_iter()
                          .map(|cn| sky(levels - 1, children, position * children + cn))
                          .sum(),
        }
    }
    sky(levels, children, 0)
}

fn main() {
    let start = PreciseTime::now();
    let result = skynet(6, 10);
    let end = PreciseTime::now();
    println!("Result: {} in {} ms", result, start.to(end).num_milliseconds());
}
```

I'd like to have a short way of creating a stack-allocated, initialized array instead of using a vector, but still, I guess it would not change things much.
Fixes #17, #24, #25