
Performance TODO #19

Open · treeowl opened this issue Jul 24, 2021 · 11 comments
Comments

@treeowl
Collaborator

treeowl commented Jul 24, 2021

I'm opening this ticket to track performance improvement ideas for a bit.

  1. Investigate using the static argument transformation to enhance inlining in (>>=) and so on. In particular, this would allow GHC to inline f in m >>= f. Does this actually help? We'll need measurements to see. (See the sketch after this list.)
  2. Investigate alternative queue implementations. Discussion on this in Do we need to reimplement sequence or type-aligned dependencies for performance reasons? #4.
  3. What rewrite rules do we want, if any? See RULES exploration #34. Do we want a full fusion framework? Caution: either or both of these could lead to optimization-dependent behavior in the face of unlawful Monad instances.
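
Regarding item 1, a minimal sketch of the static argument transformation on a plain list analogue (purely illustrative, not the SeqT code): the function argument f never changes across the recursion, so the recursion moves into a local worker that closes over f, letting GHC inline the wrapper and specialize the loop to the concrete f at each call site.

-- Illustrative only: SAT applied to a list-flavored bind, not the SeqT implementation.
bindList :: [a] -> (a -> [b]) -> [b]
bindList xs0 f = go xs0
  where
    -- 'go' closes over the static argument f instead of passing it around,
    -- so once bindList inlines, GHC can specialize this loop to f.
    go []     = []
    go (x:xs) = f x ++ go xs
{-# INLINE bindList #-}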
@dagit
Owner

dagit commented Jul 24, 2021

Do people expect monad transformers to work with unlawful monads as a regular thing? I was hoping we could just tell people they'll get undefined behavior if they do that. If it's a big concern, we could use the module system to make the optimization "opt-in": RULES can appear in any module and get imported much like instances do.

@treeowl
Collaborator Author

treeowl commented Jul 24, 2021

As long as the expectations and consequences are adequately documented, the sheriff won't come and drag you away. I'm mostly imagining issues with others' test suites that intentionally use (slightly) unlawful instances. That said, if you're actually able to get reliably good performance in realistic situations using rewrite rules, it's hard to argue with that result.

@treeowl
Collaborator Author

treeowl commented Jul 24, 2021

I attempted a lift m >>= \x -> lift (f x) ==> lift (m >>= \x -> f x) rule, but for whatever reason I could never get it to fire. I tried using auxiliary specialized bindSeqT and liftSeqT functions, messing with INLINE and NOINLINE, and tweaking arity, but got nowhere.
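
A hedged reconstruction of what that rule might look like, using the bindSeqT and liftSeqT helpers mentioned (not the exact code that was tried):

{-# RULES
      "liftSeqT/bindSeqT" forall m f.
        bindSeqT (liftSeqT m) (\x -> liftSeqT (f x)) = liftSeqT (m >>= f)
  #-}

The lambda on the left-hand side may well be part of the problem: RULES matching is syntactic, and eta-expansion differences can keep a rule like this from firing.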

@dagit
Owner

dagit commented Jul 25, 2021

I'm having the same issue with a rewrite rule for <|>.

I'll explain how far I made it with stream fusion.

First I added these definitions, which I believe are a pretty straightforward adaptation of the stream fusion literature, although I'm new to this so I might have done something goofy here without realizing it:

-- A stepper either stops or yields an element along with the next state.
data Step s a = Done | Yield a s
-- A stream pairs a stepper with its (existentially hidden) state.
data Stream a = forall s. Stream (s -> Step s a) s

type ViewS m a = Stream (m (View m a))

stream :: SeqT m a -> ViewS m a
stream (SeqT m) = Stream next m where
  {-# INLINE next #-}
  next s = case viewl s of
    EmptyL -> Done
    h S.:< t -> Yield h t
{-# INLINE[0] stream #-}

unstream :: ViewS m a -> SeqT m a
unstream (Stream next s0) = SeqT (unfold s0)
  where
  unfold s = case next s of
    Done -> S.empty
    Yield x xs -> x S.<| unfold xs
{-# INLINE[0] unstream #-}

{-# RULES
      "stream-unstream" [1] forall (s::ViewS m a). stream (unstream s) = s;
  #-}
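
To make the intent explicit (my gloss): each SeqT operation is written as unstream around a stream-level worker, so when two such operations compose, the inner stream (unstream ...) round trip cancels via the rule and only one fused stepper survives. Schematically, with hypothetical workers f_s and g_s:

-- g (f x y) z
--   = unstream (g_s (stream (unstream (f_s (stream x) (stream y)))) (stream z))
--   = unstream (g_s (f_s (stream x) (stream y)) (stream z))   -- "stream-unstream" fired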

For a warm-up, I tried to make a fusible version of <|>:

alt :: SeqT m a -> SeqT m a -> SeqT m a
alt a b = unstream (alt_s (stream a) (stream b))
{-# INLINE[2] alt #-}

alt_s :: ViewS m a -> ViewS m a -> ViewS m a
alt_s (Stream next_a a) (Stream next_b b) = Stream next (a,b)
  where
  {-# INLINE next #-}
  next (s_a,s_b) = case next_a s_a of
    Done -> case next_b s_b of
      Done -> Done
      Yield y ys -> Yield y (s_a, ys)
    Yield x xs -> Yield x (xs, s_b)
{-# INLINE[2] alt_s #-}

I'm not a huge fan of how next came out inside of alt_s, but conceptually it makes sense. I just don't love that each time it's called it keeps trying to step s_a after it's exhausted. There might be ways around that, but this type checks and has the desired behavior when I test it in ghci.
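
One way around the re-stepping, as a sketch (an illustration of the idea, not code from the branch): use an Either state that switches permanently to the second stream once the first is exhausted, so the spent stepper is never consulted again.

alt_s' :: ViewS m a -> ViewS m a -> ViewS m a
alt_s' (Stream next_a a0) (Stream next_b b0) = Stream next (Left a0)
  where
  {-# INLINE next #-}
  -- Left: still consuming the first stream; Right: switched to the second.
  next (Left s_a) = case next_a s_a of
    Done -> case next_b b0 of
      Done -> Done
      Yield y ys -> Yield y (Right ys)
    Yield x xs -> Yield x (Left xs)
  next (Right s_b) = case next_b s_b of
    Done -> Done
    Yield y ys -> Yield y (Right ys)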

Next I thought I would attempt >>=. And that's where I got stuck. This type checks but is definitely incorrect:

bind :: Monad m => SeqT m a -> (a -> SeqT m b) -> SeqT m b
bind m f = unstream (bind_s (stream m) (stream . f))

bind_s :: Monad m => ViewS m a -> (a -> ViewS m b) -> ViewS m b
bind_s (Stream next_a a) f = Stream next a
  where
  {-# INLINE next #-}
  next s_a = case next_a s_a of
    Done -> Done
    Yield y ys -> Yield (y >>= go ys) ys
  go ys Empty = return Empty
  go ys (b :< bs) = case f b `alt_s` (bind_s (stream bs) f) of
    Stream next_fb fb -> case next_fb fb of
      Done -> return Empty
      Yield h t -> h

In simple test cases like observeAllT ((pure 1 >> pure 2) <|> (pure 3 >> pure 4)) (where >> is using the default definition and not the specialized one you wrote), it actually returns the expected [2,4]. However, when I use this version with real programs they don't produce the expected results.

I think the issue is in the last line of bind_s, specifically Yield h t -> h. It type checks, but I think it needs to be doing something else entirely, possibly involving the state t. I think this is missing some sort of join or flatten operation. Conceptually, I'd rather have go ys (b :< bs) = f b `alt_s` (bind_s (stream bs) f) and not case on the result, but having the call of go inside a Yield makes it hard to do that. Or at least, I couldn't come up with a variation that avoided it and still type checked.

I think to fix the code above, I would need something of type:

flatten :: ViewS m a -> m (View m a)

And I'm not sure that function makes sense. I thought "okay, maybe deal with >>= later and benchmark <|> for now." And that's when I ran into the same issue you ran into. The GHC user's guide gives advice on how to apply RULES to type class methods, but the techniques it suggests didn't seem to apply, and I don't have any idea how to debug that. I looked at the stats and the code after each simplification step, but nothing jumped out at me.
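
Going back to flatten, a hedged guess at what it could look like (my sketch, using the alt defined earlier; whether it plays well with fusion is exactly the open question): run the first suspension and, when it yields a result, append the remainder of the stream back onto its tail via unstream.

-- Sketch only, not code from the branch.
flatten :: Monad m => ViewS m a -> m (View m a)
flatten (Stream next s0) = go s0
  where
  go s = case next s of
    Done -> return Empty
    Yield h t -> do
      v <- h
      case v of
        Empty -> go t
        a :< rest -> return (a :< (rest `alt` unstream (Stream next t)))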

That said, I'm curious: if we could only fuse <|>, would there be any benefit? So I exported alt directly and tried to see if that would fuse. That only seemed to work when I wrote simple things like foo = pure 1 `alt` pure 2 `alt` pure 3. Things like choose that use alt with foldr don't seem to benefit. I even tried manually writing a fusible version of choose like this:

chooseSeqT :: (Foldable t, Monad m) => t a -> SeqT m a
chooseSeqT xs = unstream (foldr (alt_s . stream . pure) done xs)
{-# INLINEABLE chooseSeqT #-}
{-# SPECIALIZE chooseSeqT :: [a] -> Seq a #-}
{-# SPECIALIZE chooseSeqT :: Foldable t => t a -> Seq a #-}
done :: ViewS m a
done = Stream (const Done) Empty
{-# INLINE CONLIKE [0] done #-}

However, that doesn't appear to lead to any better performance than just using the existing code with good inlining/specialization. I did learn one thing along the way, though: functions like alt_s should inline late (i.e., INLINE[0]) and functions like alt should inline early (i.e., INLINE[2]) relative to the stream-unstream rewrite rule. So I adjusted the pragmas accordingly compared to what you see earlier in this post.
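
In pragma form, the ordering described above looks roughly like this (illustrative, not verbatim from the branch):

{-# INLINE [2] alt   #-}  -- phase 2: wrapper inlines early, exposing stream (unstream ...)
                          -- phase 1: the "stream-unstream" [1] rule cancels the round trip
{-# INLINE [0] alt_s #-}  -- phase 0: stream-level worker and stream/unstream inline late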

@treeowl
Collaborator Author

treeowl commented Jul 25, 2021

Ah, you're looking at stream fusion of the queue... I was thinking more about LogicT or similar: m >>= f = fromLogicT $ toLogicT m >>= toLogicT . f and so on, with various rules. The goal (I don't know if it would work/win) would be to handle parts of the computation between/around msplit/toView/whatever in LogicT, only breaking out the big guns when necessary.
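
A minimal sketch of the kind of rule that approach would rely on, assuming conversion functions toLogicT :: SeqT m a -> LogicT m a and fromLogicT :: LogicT m a -> SeqT m a (the rule name and exact form are illustrative): adjacent conversions should cancel so that a run of binds stays inside LogicT.

{-# RULES
      "toLogicT/fromLogicT" forall m. toLogicT (fromLogicT m) = m
  #-}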

@dagit
Owner

dagit commented Jul 25, 2021

That's an interesting avenue, but I would expect it to also be more brittle and harder to keep correct. I guess when you have multiple rules you need to ensure they are confluent. That's one of the nice things about using a single general rule like stream-unstream and manually writing the fusible definitions: you don't have to think about making things confluent, and there are fewer rules that might fail to apply for random reasons.

@treeowl
Collaborator Author

treeowl commented Jul 26, 2021 via email

@dagit
Owner

dagit commented Jul 26, 2021

I worked on this a bit more today after all. A friend (dmwit) helped me with it a bit. It's in the stream branch if you want to see it. It's back to the original performance of SeqT before we started optimizing things. So that's a bummer; however, I did see that stream-unstream was firing on my local example code. So that's kind of cool.

@treeowl
Collaborator Author

treeowl commented Jul 26, 2021 via email

@dagit
Owner

dagit commented Jul 26, 2021

Okay, I really goofed implementing >>= yesterday. I just pushed a new version thanks to dmwit. The stream version is now the fastest version by quite a bit.

Stream version:

SeqT IO with msplit:                                                                                                                               
Batch generated in: 0.24157 seconds                                                                                                                
Batch generated in: 0.236334 seconds                                                                                                               
Batch generated in: 0.22816799999999998 seconds                                                                                                    
Batch generated in: 0.228131 seconds                                                                                                               
Batch generated in: 0.22116 seconds                                                                                                                
Total time: 1.155495 seconds                                                                                                                       

LogicT:

LogicT Identity without msplit:
Batch generated in: 0.14227499999999998 seconds
Batch generated in: 0.124216 seconds 
Batch generated in: 0.1135 seconds                                       
Batch generated in: 0.159997 seconds 
Batch generated in: 0.120277 seconds 
Total time: 0.660371 seconds

Basically, it's within 2x of LogicT now.

@treeowl
Collaborator Author

treeowl commented Jul 26, 2021 via email
