
Performance TODO #19

Open · treeowl opened this issue Jul 24, 2021 · 11 comments
Comments

@treeowl
Collaborator

treeowl commented Jul 24, 2021

I'm opening this ticket to track performance improvement ideas for a bit.

  1. Investigate using the static argument transformation to enhance inlining in (>>=) and so on. In particular, this would allow GHC to inline f in m >>= f. Does this actually help? We'll need measurements to see. (See the sketch after this list.)
  2. Investigate alternative queue implementations. Discussion on this in Do we need to reimplement sequence or type-aligned dependencies for performance reasons? #4.
  3. What rewrite rules do we want, if any? See RULES exploration #34. Do we want a full fusion framework? Caution: either or both of these could lead to optimization-dependent behavior in the face of unlawful Monad instances.
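
Regarding item 1, a minimal sketch of the static argument transformation on a plain list analogue (purely illustrative, not the SeqT code): the function argument f never changes across the recursion, so the recursion moves into a local worker that closes over f, letting GHC inline the wrapper and specialize the loop to the concrete f at each call site.

-- Illustrative only: SAT applied to a list-flavored bind, not the SeqT implementation.
bindList :: [a] -> (a -> [b]) -> [b]
bindList xs0 f = go xs0
  where
    -- 'go' closes over the static argument f instead of passing it around,
    -- so once bindList inlines, GHC can specialize this loop to f.
    go []     = []
    go (x:xs) = f x ++ go xs
{-# INLINE bindList #-}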
@dagit
Owner

dagit commented Jul 24, 2021

Do people expect monad transformers to work with unlawful monads as a regular thing? I was hoping we could just tell people they'll get undefined behavior if they do that. If it's a big concern, we could use the module system to make the optimization "opt-in": RULES can appear in any module and get imported much like instances do.

@treeowl
Collaborator Author

treeowl commented Jul 24, 2021

As long as the expectations and consequences are adequately documented, the sheriff won't come and drag you away. I'm mostly imagining issues with others' test suites that intentionally use (slightly) unlawful instances. That said, if you're actually able to get reliably good performance in realistic situations using rewrite rules, it's hard to argue with that result.

@treeowl
Collaborator Author

treeowl commented Jul 24, 2021

I attempted a lift m >>= \x -> lift (f x) ==> lift (m >>= \x -> f x) rule, but for whatever reason I could never get it to fire. I tried using auxiliary specialized bindSeqT and liftSeqT functions, messing with INLINE and NOINLINE, and tweaking arity, but got nowhere.
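
A hedged reconstruction of what that rule might look like, using the bindSeqT and liftSeqT helpers mentioned (not the exact code that was tried):

{-# RULES
      "liftSeqT/bindSeqT" forall m f.
        bindSeqT (liftSeqT m) (\x -> liftSeqT (f x)) = liftSeqT (m >>= f)
  #-}

The lambda on the left-hand side may well be part of the problem: RULES matching is syntactic, and eta-expansion differences can keep a rule like this from firing.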

@dagit
Owner

dagit commented Jul 25, 2021

I'm having the same issue with a rewrite rule for <|>.

I'll explain how far I made it with stream fusion.

First I added these definitions, which I believe are a pretty straightforward adaptation of the stream fusion literature, although I'm new to this so I might have done something goofy here without realizing it:

-- A stepper either stops or yields an element along with the next state.
data Step s a = Done | Yield a s
-- A stream pairs a stepper with its (existentially hidden) state.
data Stream a = forall s. Stream (s -> Step s a) s

type ViewS m a = Stream (m (View m a))

stream :: SeqT m a -> ViewS m a
stream (SeqT m) = Stream next m where
  {-# INLINE next #-}
  next s = case viewl s of
    EmptyL -> Done
    h S.:< t -> Yield h t
{-# INLINE[0] stream #-}

unstream :: ViewS m a -> SeqT m a
unstream (Stream next s0) = SeqT (unfold s0)
  where
  unfold s = case next s of
    Done -> S.empty
    Yield x xs -> x S.<| unfold xs
{-# INLINE[0] unstream #-}

{-# RULES
      "stream-unstream" [1] forall (s::ViewS m a). stream (unstream s) = s;
  #-}
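
To make the intent explicit (my gloss): each SeqT operation is written as unstream around a stream-level worker, so when two such operations compose, the inner stream (unstream ...) round trip cancels via the rule and only one fused stepper survives. Schematically, with hypothetical workers f_s and g_s:

-- g (f x y) z
--   = unstream (g_s (stream (unstream (f_s (stream x) (stream y)))) (stream z))
--   = unstream (g_s (f_s (stream x) (stream y)) (stream z))   -- "stream-unstream" fired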

For a warm-up, I tried to make a fusible version of <|>:

alt :: SeqT m a -> SeqT m a -> SeqT m a
alt a b = unstream (alt_s (stream a) (stream b))
{-# INLINE[2] alt #-}

alt_s :: ViewS m a -> ViewS m a -> ViewS m a
alt_s (Stream next_a a) (Stream next_b b) = Stream next (a,b)
  where
  {-# INLINE next #-}
  next (s_a,s_b) = case next_a s_a of
    Done -> case next_b s_b of
      Done -> Done
      Yield y ys -> Yield y (s_a, ys)
    Yield x xs -> Yield x (xs, s_b)
{-# INLINE[2] alt_s #-}

I'm not a huge fan of how next came out inside of alt_s, but conceptually it makes sense. I just don't love that each time it's called it keeps trying to step s_a after it's exhausted. There might be ways around that, but this type checks and has the desired behavior when I test it in ghci.
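
One way around the re-stepping, as a sketch (an illustration of the idea, not code from the branch): use an Either state that switches permanently to the second stream once the first is exhausted, so the spent stepper is never consulted again.

alt_s' :: ViewS m a -> ViewS m a -> ViewS m a
alt_s' (Stream next_a a0) (Stream next_b b0) = Stream next (Left a0)
  where
  {-# INLINE next #-}
  -- Left: still consuming the first stream; Right: switched to the second.
  next (Left s_a) = case next_a s_a of
    Done -> case next_b b0 of
      Done -> Done
      Yield y ys -> Yield y (Right ys)
    Yield x xs -> Yield x (Left xs)
  next (Right s_b) = case next_b s_b of
    Done -> Done
    Yield y ys -> Yield y (Right ys)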

Next I thought I would attempt >>=. And that's where I got stuck. This type checks but is definitely incorrect:

bind :: Monad m => SeqT m a -> (a -> SeqT m b) -> SeqT m b
bind m f = unstream (bind_s (stream m) (stream . f))

bind_s :: Monad m => ViewS m a -> (a -> ViewS m b) -> ViewS m b
bind_s (Stream next_a a) f = Stream next a
  where
  {-# INLINE next #-}
  next s_a = case next_a s_a of
    Done -> Done
    Yield y ys -> Yield (y >>= go ys) ys
  go ys Empty = return Empty
  go ys (b :< bs) = case f b `alt_s` (bind_s (stream bs) f) of
    Stream next_fb fb -> case next_fb fb of
      Done -> return Empty
      Yield h t -> h

In simple test cases like observeAllT ((pure 1 >> pure 2) <|> (pure 3 >> pure 4)) (where >> is using the default definition and not the specialized one you wrote), it actually returns the expected [2,4]. However, when I use this version with real programs they don't produce the expected results.

I think the issue is in the last line of bind_s, specifically Yield h t -> h. It type checks, but I think it needs to be doing something else entirely, possibly involving the state t. I think this is missing some sort of join or flatten operation. Conceptually, I'd rather have go ys (b :< bs) = f b `alt_s` (bind_s (stream bs) f) and not case on the result, but having the call of go inside a Yield makes it hard to do that. Or at least, I couldn't come up with a variation that avoided it and still type checked.

I think to fix the code above, I would need something of type:

flatten :: ViewS m a -> m (View m a)

And I'm not sure that function makes sense. I thought "okay, maybe deal with >>= later and benchmark <|> for now." And that's when I ran into the same issue you ran into. The GHC user's guide gives advice on how to apply RULES to type class methods, but the techniques it suggests didn't seem to apply, and I don't have any idea how to debug that. I looked at the stats and the code after each simplification step, but nothing jumped out at me.
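
Going back to flatten, a hedged guess at what it could look like (my sketch, using the alt defined earlier; whether it plays well with fusion is exactly the open question): run the first suspension and, when it yields a result, append the remainder of the stream back onto its tail via unstream.

-- Sketch only, not code from the branch.
flatten :: Monad m => ViewS m a -> m (View m a)
flatten (Stream next s0) = go s0
  where
  go s = case next s of
    Done -> return Empty
    Yield h t -> do
      v <- h
      case v of
        Empty -> go t
        a :< rest -> return (a :< (rest `alt` unstream (Stream next t)))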

That said, I'm curious: if we could only fuse <|>, would there be any benefit? So I exported alt directly and tried to see if that would fuse. That only seemed to work when I wrote simple things like foo = pure 1 `alt` pure 2 `alt` pure 3. Things like choose that use alt with foldr don't seem to benefit. I even tried manually writing a fusible version of choose like this:

chooseSeqT :: (Foldable t, Monad m) => t a -> SeqT m a
chooseSeqT xs = unstream (foldr (alt_s . stream . pure) done xs)
{-# INLINEABLE chooseSeqT #-}
{-# SPECIALIZE chooseSeqT :: [a] -> Seq a #-}
{-# SPECIALIZE chooseSeqT :: Foldable t => t a -> Seq a #-}
done :: ViewS m a
done = Stream (const Done) Empty
{-# INLINE CONLIKE [0] done #-}

However, that doesn't appear to lead to any better performance than just using the existing code with good inlining/specialization. I did learn one thing along the way, though: functions like alt_s should inline late (i.e., INLINE[0]) and functions like alt should inline early (i.e., INLINE[2]) relative to the stream-unstream rewrite rule. So I adjusted the pragmas accordingly compared to what you see earlier in this post.
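
In pragma form, the ordering described above looks roughly like this (illustrative, not verbatim from the branch):

{-# INLINE [2] alt   #-}  -- phase 2: wrapper inlines early, exposing stream (unstream ...)
                          -- phase 1: the "stream-unstream" [1] rule cancels the round trip
{-# INLINE [0] alt_s #-}  -- phase 0: stream-level worker and stream/unstream inline late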

@treeowl
Collaborator Author

treeowl commented Jul 25, 2021

Ah, you're looking at stream fusion of the queue... I was thinking more about LogicT or similar: m >>= f = fromLogicT $ toLogicT m >>= toLogicT . f and so on, with various rules. The goal (I don't know if it would work/win) would be to handle parts of the computation between/around msplit/toView/whatever in LogicT, only breaking out the big guns when necessary.
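
A minimal sketch of the kind of rule that approach would rely on, assuming conversion functions toLogicT :: SeqT m a -> LogicT m a and fromLogicT :: LogicT m a -> SeqT m a (the rule name and exact form are illustrative): adjacent conversions should cancel so that a run of binds stays inside LogicT.

{-# RULES
      "toLogicT/fromLogicT" forall m. toLogicT (fromLogicT m) = m
  #-}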

@dagit
Owner

dagit commented Jul 25, 2021

That's an interesting avenue, but I would expect it to also be more brittle and harder to keep correct. I guess when you have multiple rules you need to ensure they are confluent. That's one of the nice things about using a single general rule like stream-unstream and manually writing the fusible definitions: you don't have to think about making things confluent, and there are fewer rules that might fail to apply for random reasons.

@treeowl
Collaborator Author

treeowl commented Jul 26, 2021 via email

@dagit
Owner

dagit commented Jul 26, 2021

I worked on this a bit more today after all. A friend (dmwit) helped me with it a bit. It's in the stream branch if you want to see it. It's back to the original performance of SeqT before we started optimizing things. So that's a bummer; however, I did see that stream-unstream was firing on my local example code. So that's kind of cool.

@treeowl
Collaborator Author

treeowl commented Jul 26, 2021 via email

@dagit
Owner

dagit commented Jul 26, 2021

Okay, I really goofed implementing >>= yesterday. I just pushed a new version thanks to dmwit. The stream version is now the fastest version by quite a bit.

Stream version:

SeqT IO with msplit:                                                                                                                               
Batch generated in: 0.24157 seconds                                                                                                                
Batch generated in: 0.236334 seconds                                                                                                               
Batch generated in: 0.22816799999999998 seconds                                                                                                    
Batch generated in: 0.228131 seconds                                                                                                               
Batch generated in: 0.22116 seconds                                                                                                                
Total time: 1.155495 seconds                                                                                                                       

LogicT:

LogicT Identity without msplit:
Batch generated in: 0.14227499999999998 seconds
Batch generated in: 0.124216 seconds 
Batch generated in: 0.1135 seconds                                       
Batch generated in: 0.159997 seconds 
Batch generated in: 0.120277 seconds 
Total time: 0.660371 seconds

Basically, it's within 2x of LogicT now.

@treeowl
Collaborator Author

treeowl commented Jul 26, 2021 via email
