
Improve the time performance of Data.List.unsnoc #307

Open
mpilgrem opened this issue Nov 21, 2024 · 10 comments
@mpilgrem

Motivation:

Specifically, @meooow25 observed:

There is a performance penalty to using unsnoc in place of init or last, because of all the Maybes and (,)s. It is particularly bad for last because allocations increase from zero to O(n). ... the Core [can be checked at]: https://play.haskell.org/saved/ySiaRuho

It seemed that this 'naive' total implementation:

import Data.List.NonEmpty ( NonEmpty (..) )
import qualified Data.List.NonEmpty as NE

unsnoc :: [a] -> Maybe ([a], a)
unsnoc [] = Nothing
unsnoc (x : xs) = let l = x :| xs in Just (NE.init l, NE.last l)

had similar time performance to this 'partial' version:

partial :: [a] -> Maybe ([a], a)
partial [] = Nothing
partial s = Just (init s, last s)

while the existing Data.List.unsnoc was much slower than both.

@Bodigrim observed:

These are cheap, short-living allocations in the nursery, so they are not materially worse than O(n) time.

The existing implementation of unsnoc wraps and unwraps both Maybe and (,) on each iteration, which is expensive ... If we make a special case for unsnoc [] = Nothing (thus sacrificing list fusion), we can avoid wrapping / unwrapping of Maybe. If we additionally sacrifice the requirement of traversing the list only once, we can indeed avoid wrapping / unwrapping of (,) as in Just (NE.init l, NE.last l) ...

Sacrificing list fusion is probably reasonable: in cases where list fusion used to happen we will lose one allocation on materializing (:) but save one allocation on Maybe, which is no worse; and we simply win in cases where fusion didn't use to happen.

Traversing the list twice is less palatable to my taste, because it can double maximum memory residence: we would materialise the input in full while computing NE.init l, then keep both l and NE.init l in memory while computing NE.last l (after which l can be finally freed).

The following experiment (with GHC 9.8.3) using the criterion package:

module Main (main) where

import Criterion.Main
import Data.List.NonEmpty ( NonEmpty (..) )
import qualified Data.List.NonEmpty as NE

partial :: [a] -> Maybe ([a], a)
partial [] = Nothing
partial s = Just (init s, last s)

naive :: [a] -> Maybe ([a], a)
naive [] = Nothing
naive (x : xs) = let l = x :| xs in Just (NE.init l, NE.last l)

-- The existing unsnoc, copied from the base package:
unsnoc :: [a] -> Maybe ([a], a)
unsnoc = foldr (\x -> Just . maybe ([], x) (\(~(a, b)) -> (x : a, b))) Nothing
{-# INLINABLE unsnoc #-}

-- The existing unsnoc, copied from the base package, but with the tilde removed:
unsnocNoTilde :: [a] -> Maybe ([a], a)
unsnocNoTilde = foldr (\x -> Just . maybe ([], x) (\(a, b) -> (x : a, b))) Nothing
{-# INLINABLE unsnocNoTilde #-}

unsnocGo :: [a] -> Maybe ([a], a)
unsnocGo [] = Nothing
unsnocGo (x : xs) = Just $ go (x :| xs)
 where
  go (y :| []) = ([], y)
  go (y1 :| (y2 : ys)) = (\(a, b) -> (y1 : a, b)) (go (y2 :| ys))
{-# INLINABLE unsnocGo #-}

list :: [Int]
list = [1 .. 10000]

main :: IO ()
main = defaultMain
  [ bgroup "unsnoc" [ bench "partial" $ nf partial list
                    , bench "naive" $ nf naive list
                    , bench "unsnoc" $ nf unsnoc list
                    , bench "unsnocNoTilde" $ nf unsnocNoTilde list
                    , bench "unsnocGo" $ nf unsnocGo list
                    ]
  ]

had results that were typically like this:

benchmarking unsnoc/partial
time                 86.54 μs   (83.61 μs .. 89.08 μs)
                     0.992 R²   (0.989 R² .. 0.995 R²)
mean                 87.59 μs   (85.35 μs .. 91.38 μs)
std dev              10.66 μs   (5.786 μs .. 17.82 μs)
variance introduced by outliers: 88% (severely inflated)

benchmarking unsnoc/naive
time                 83.51 μs   (81.15 μs .. 86.66 μs)
                     0.994 R²   (0.993 R² .. 0.996 R²)
mean                 85.34 μs   (83.90 μs .. 86.86 μs)
std dev              4.928 μs   (4.369 μs .. 5.827 μs)
variance introduced by outliers: 60% (severely inflated)

benchmarking unsnoc/unsnoc
time                 320.9 μs   (315.8 μs .. 326.5 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 320.9 μs   (317.6 μs .. 325.3 μs)
std dev              12.70 μs   (8.529 μs .. 16.35 μs)
variance introduced by outliers: 35% (moderately inflated)

benchmarking unsnoc/unsnocNoTilde
time                 269.5 μs   (265.0 μs .. 278.3 μs)
                     0.996 R²   (0.991 R² .. 0.999 R²)
mean                 266.3 μs   (263.9 μs .. 270.5 μs)
std dev              11.94 μs   (8.054 μs .. 20.08 μs)
variance introduced by outliers: 42% (moderately inflated)

benchmarking unsnoc/unsnocGo
time                 115.9 μs   (114.1 μs .. 117.9 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 114.8 μs   (113.9 μs .. 116.4 μs)
std dev              3.864 μs   (2.812 μs .. 5.289 μs)
variance introduced by outliers: 32% (moderately inflated)

That is, the 'naive' total version is about as quick as the 'partial' version. The copy of base's unsnoc is well over three times slower. Dropping the tilde saves about 15% in time. Finally, the 'go' version is about 1.4 times slower than the 'naive' total version.

@Bodigrim
Collaborator

Dropping the tilde saves about 15% in time.

Dropping the tilde changes the behavior, namely,

head . fst <$> unsnoc (1 : 2 : undefined)

will diverge. See the original discussion of implementation in #165 (comment)
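The laziness point is easy to check concretely. The following is a minimal sketch (the module and main are mine, but unsnocLazy is the base definition quoted above): thanks to the lazy (~) pattern, the init can be consumed incrementally without forcing the rest of a partially undefined input list.

```haskell
import Data.Maybe (fromJust)

-- Copy of base's unsnoc, with the lazy (~) pattern intact, as quoted
-- earlier in the thread.
unsnocLazy :: [a] -> Maybe ([a], a)
unsnocLazy = foldr (\x -> Just . maybe ([], x) (\(~(a, b)) -> (x : a, b))) Nothing

main :: IO ()
main = do
  -- The lazy pattern means we can take the head of the init without
  -- evaluating the undefined tail of the input:
  print (head . fst . fromJust $ unsnocLazy (1 : 2 : undefined :: [Int]))
```

With the tilde removed (as in unsnocNoTilde above), the same expression forces the whole spine of the input and hits the undefined.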

I'm supportive of changing the definition of unsnoc to unsnocGo: as quoted above, losing potential list fusion (that is, paying one extra allocation on a (:) cell) allows us to eliminate the Maybe wrapper unconditionally, so even in the worst case we do not use more memory.

naive isn't enough faster than unsnocGo to justify traversing the list twice, which has a detrimental effect on maximum residency.

(In the context of #292, it is plausible that we might want to add lastMaybe :: [a] -> Maybe a with a hand-rolled implementation instead of reusing fmap snd . unsnoc, but that's really a topic for #292)
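For reference, a hand-rolled lastMaybe along the lines suggested might look like this. This is only a sketch, not a settled proposal for #292: the name and implementation are illustrative.

```haskell
-- Hypothetical hand-rolled lastMaybe: one pass, one Maybe allocation
-- at the start, and a plain tail-recursive loop thereafter (unlike
-- fmap snd . unsnoc, which builds the init as well).
lastMaybe :: [a] -> Maybe a
lastMaybe []       = Nothing
lastMaybe (x : xs) = Just (go x xs)
  where
    go y []       = y
    go _ (y : ys) = go y ys

main :: IO ()
main = print (lastMaybe [1 .. 10 :: Int])
```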

@mpilgrem
Author

@Bodigrim, is that a green light to create a merge request consistent with the unsnocGo alternative?

@Bodigrim
Collaborator

From my personal perspective - yes, although I’d love to hear from others. But you can fire up an MR - maybe it will help to gather more feedback.

@mpilgrem
Author

@Bodigrim
Collaborator

I commented on the MR, but overall looks good to me.

Dear CLC members, any more opinions before we vote?

@meooow25

 benchmarking unsnoc/unsnoc
 time                 320.9 μs   (315.8 μs .. 326.5 μs)
 ...
 benchmarking unsnoc/unsnocGo
 time                 115.9 μs   (114.1 μs .. 117.9 μs)

This difference is likely because unsnocGo is not lazy like unsnoc.

https://gitlab.haskell.org/ghc/ghc/-/merge_requests/13730

The version here is different from the benchmarked function above and appropriately lazy.
But I see almost no difference in running time between this version and the current unsnoc on my machine. I guess GHC is very good with allocating and deallocating those Maybes.

benchmarking unsnoc/unsnoc
time                 210.2 μs   (209.8 μs .. 210.7 μs)
...
benchmarking unsnoc/unsnocGo
time                 203.0 μs   (202.9 μs .. 203.2 μs)

Not surprisingly, it is worse when there is fusion.

  -- a simple case where we expect fusion is [1..n]
  , bgroup "unsnoc fusion" $
      let n = 10000 :: Int in
      [ bench "unsnoc" $ nf (\i -> unsnoc [1..i]) n
      , bench "unsnocGo" $ nf (\i -> unsnocGo [1..i]) n
      ]
benchmarking unsnoc fusion/unsnoc
time                 215.0 μs   (214.4 μs .. 216.1 μs)
...
benchmarking unsnoc fusion/unsnocGo
time                 268.5 μs   (268.0 μs .. 268.9 μs)

So I don't think this change would be an improvement.

@meooow25

I doubt it's possible to make unsnoc have good performance (but someone can correct me if I'm wrong). It is trying to do too many things at once.

  • Calculate both init and last
  • Make one pass over the list
  • Generate the init lazily

If anything, the benchmarks in the OP indicate that using unsnoc is a poor choice compared to alternatives, performance-wise. If one wants init or last, they are better off using that. If they want both, it's still better to use init and last, if the potentially increased residency is not a problem. If it is, a strict version of unsnoc is better. If that too is not desirable, the current unsnoc is (finally) suitable.
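To make the "strict version of unsnoc" in that trade-off concrete, here is a minimal sketch (the name unsnocStrict is mine; compare GHC's snocView, quoted later in the thread). It makes one pass and never wraps intermediate results in Maybe, but it cannot produce any of the init before reaching the end of the list.

```haskell
-- Sketch of a strict, single-pass unsnoc: the case expression forces
-- each pair to WHNF, so there is no chain of lazy-pattern thunks, but
-- head . fst <$> unsnocStrict (1 : 2 : undefined) would diverge.
unsnocStrict :: [a] -> Maybe ([a], a)
unsnocStrict []       = Nothing
unsnocStrict (x : xs) = Just (go x xs)
  where
    go y []       = ([], y)
    go y (z : zs) = case go z zs of
      (ys, l) -> (y : ys, l)
```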

@mpilgrem
Author

I can reproduce @meooow25's observation that adding the tilde to unsnocGo above significantly worsens its time performance (which surprised me). On my laptop (a different machine from the one above):

benchmarking unsnoc/unsnoc
time                 514.9 μs   (500.3 μs .. 538.6 μs)
                     0.989 R²   (0.976 R² .. 0.998 R²)
mean                 503.0 μs   (496.7 μs .. 519.2 μs)
std dev              34.75 μs   (23.44 μs .. 60.91 μs)
variance introduced by outliers: 60% (severely inflated)

benchmarking unsnoc/unsnocNoTilde
time                 385.2 μs   (382.0 μs .. 388.7 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 386.6 μs   (383.8 μs .. 391.0 μs)
std dev              11.42 μs   (7.929 μs .. 19.98 μs)
variance introduced by outliers: 22% (moderately inflated)

benchmarking unsnoc/unsnocGoNoTilde
time                 154.1 μs   (152.6 μs .. 155.8 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 153.9 μs   (152.8 μs .. 155.2 μs)
std dev              4.251 μs   (3.234 μs .. 5.443 μs)
variance introduced by outliers: 23% (moderately inflated)

benchmarking unsnoc/unsnocGoTilde
time                 520.4 μs   (495.2 μs .. 551.1 μs)
                     0.966 R²   (0.946 R² .. 0.987 R²)
mean                 587.0 μs   (565.2 μs .. 609.0 μs)
std dev              83.77 μs   (75.80 μs .. 94.62 μs)
variance introduced by outliers: 87% (severely inflated)

If anything, unsnocGoTilde has slightly worse time performance than current unsnoc.

@mpilgrem
Author

mpilgrem commented Dec 17, 2024

This issue may be misconceived (sorry). If:

>>> head . fst <$> unsnoc (1 : 2 : undefined)
Just 1

is a requirement,§ then perhaps the (slow) existing unsnoc cannot be materially improved upon?

§ EDIT: The GHC project's version (GHC.Utils.Misc.snocView) is strict:

-- | Split a list into its last element and the initial part of the list.
-- @snocView xs = Just (init xs, last xs)@ for non-empty lists.
-- @snocView xs = Nothing@ otherwise.
-- Unless both parts of the result are guaranteed to be used
-- prefer separate calls to @last@ + @init@.
-- If you are guaranteed to use both, this will
-- be more efficient.
snocView :: [a] -> Maybe ([a],a)
snocView = fmap go . nonEmpty
  where
    go :: NonEmpty a -> ([a],a)
    go (x:|xs) = case nonEmpty xs of
        Nothing -> ([],x)
        Just xs -> case go xs of !(xs', x') -> (x:xs', x')

@Bodigrim
Collaborator

I think for the purposes of #292 it might be more fruitful to focus on pushing for separate initMaybe / lastMaybe, because otherwise it will be very challenging to satisfy all requirements for unsnoc performance and semantics.
