Super slow lockfile parsing #1262

finnhodgkin · 2024-08-01T11:08:46Z

At my work we have a very large workspace with many libs.

Adding or removing any modules from any of the libs takes more than three minutes. I've narrowed most of this time down to lockfile generation.

Adding a bunch of extra caching and combining queries gets that time down to <1.5s.

finnhodgkin · 2024-08-19T11:01:58Z

A somewhat linked issue to this is that the lockfile can become insanely large, and even reading and parsing it becomes pretty slow. My work codebase has a lockfile with 123k lines and takes this long to parse & decode:

This wouldn't be too bad if it was occasional but every spago command triggers a parse+decode. 3 seconds on every command is definitely noticeable.

finnhodgkin · 2024-08-19T11:25:42Z

This may be a me problem and not something that's going to impact the average user...

Our lockfile is made up of 120k lines of workspace.package.[package with build plan] and then a few thousand lines for everything else.

@f-f I know very little about the internals so this might be a stupid question, but are the build plans necessary or could they be worked out at runtime by looking up the transient deps and nubbing the results? Here's one example from our lockfile:

dependencies:
  - argonaut-codecs
  - bifunctors
  - console
  - effect
  - either
  - foreign
  - foreign-generic
  - graphql-client
  - oa-make-fixture
  - prelude
  - transformers
test_dependencies:
  - oa-no-tests
build_plan:
  - aff
  - aff-promise
  - affjax
  - affjax-node
  - affjax-web
  - ansi
  - argonaut
  - argonaut-codecs
  - argonaut-core
  - argonaut-traversals
  - arraybuffer-types
  - arrays
  - avar
  - bifunctors
  - catenable-lists
  - colors
  - console
  - const
  - contravariant
  - control
  - datetime
  - debug
  - distributive
  - effect
  - either
  - enums
  - exceptions
  - exists
  - filterable
  - foldable-traversable
  - foreign
  - foreign-generic
  - foreign-object
  - fork
  - form-urlencoded
  - free
  - functions
  - functors
  - gen
  - graphql-client
  - halogen-subscriptions
  - heterogeneous
  - http-methods
  - identity
  - integers
  - invariant
  - js-date
  - js-uri
  - lazy
  - lcg
  - lists
  - maybe
  - media-types
  - mmorph
  - newtype
  - nonempty
  - now
  - nullable
  - numbers
  - oa-make-fixture
  - ordered-collections
  - orders
  - parallel
  - partial
  - pipes
  - prelude
  - profunctor
  - profunctor-lenses
  - psci-support
  - quickcheck
  - random
  - record
  - refs
  - safe-coerce
  - spec
  - spec-discovery
  - st
  - string-parsers
  - strings
  - tailrec
  - transformers
  - tuples
  - type-equality
  - typelevel-lists
  - typelevel-peano
  - typelevel-prelude
  - unfoldable
  - unicode
  - unsafe-coerce
  - unsafe-reference
  - variant
  - web-dom
  - web-events
  - web-file
  - web-html
  - web-storage
  - web-xhr

f-f · 2024-08-19T15:06:19Z

I know very little about the internals so this might be a stupid question, but are the build plans necessary or could they be worked out at runtime by looking up the transient deps and nubbing the results?

The reason why we have a lockfile is to enable reprodicible and offline builds in a self-contained manner - we need a certain amount of information to be able to have them work both in Spago and in other tools such as Nix (where one can just read the lockfile and not have to invoke spago at all) without hitting the internet or some other local cache.

We might be able to get away with less info, but the current structure stems from how the code is structured to come up with a build plan with a package set or with the solver, and with and without the lockfile. May be worth trying to have a look to see if we can have less info, but the Nix case needs to be taken into account (cc @thomashoneyman)

This may be a me problem and not something that's going to impact the average user...

It's a general problem. I have also noticed how slow the yaml parsing is. A few ideas about that:

we could implement some custom yaml parsing
we could just change the format. This option has my preference - again here we need to keep in mind the usecase where other tools need to be able to read the lockfile, so JSON is our first candidate here since Nix can read that natively

thomashoneyman · 2024-08-19T15:19:14Z

We could definitely get away with a more compressed lockfile format. My primary considerations from the Nix side:

We need to be able to determine all of the packages necessary to produce a build, at specific versions and with hashes, from the lockfile alone. But Nix is a programming language, so it's fine to have to do some work on the Nix side. For example, it should be possible to figure out the transient dependencies from the build dependencies, so long as we still keep build and test deps separate.
We need to be able to read the lockfile format, as @f-f noted. As far as I'm aware, all that Nix natively supports is JSON, though it's possible to read other formats so long as they can be converted to JSON first. JSON may not be a suitable file format if you want it to be as fast as possible, though.

We might not want to take too much information out of the lockfile because then Spago has to compute it at runtime. But at a glance it seems we're on the wrong side of the trade right now.

f-f · 2024-08-23T13:48:14Z

@finnhodgkin now that #1261 is merged I pivoted this issue to track slow parsing of the lockfiles.

f-f · 2024-09-05T23:04:14Z

I had a look at the code that uses the lockfile, and I think we can let go of having a build_plan section in the lockfile - which should save us a lot of time when decoding the file. We'll have to recompute the build plans on every execution, but hopefully that'll be quicker.

It should not be too hard to patch either - sketch:

This is the only place where we use the build plans from the lockfile:

spago/src/Spago/Command/Fetch.purs

Lines 556 to 564 in 188e61d

    
           -- If we have a lockfile we can just use that - we don't need build a plan, since we store it for every workspace 
        
           -- package, so we can just filter out the packages we need. 
        
           Right lockfile -> do 
        
             case Map.lookup workspacePackage.package.name lockfile.workspace.packages of 
        
               Nothing -> 
        
                 die $ "Package " <> PackageName.print workspacePackage.package.name <> " not found in lockfile" 
        
               Just envs -> 
        
                 pure 
        
                   { core: fromBuildPlan envs.core.build_plan

If we'd remove the build_plan then we'd have to recompute it on the spot, and I think we could use the same code as the case with no lockfile and a package set build (because in the lockfile we basically store a package set even in the case of a solver build):

spago/src/Spago/Command/Fetch.purs

Lines 598 to 600 in 188e61d

    
           PackageSetBuild _info set -> do 
        
             depsRanges # onEachEnvM \depsRanges' -> 
        
               getTransitiveDepsFromPackageSet set $ (Array.fromFoldable $ Map.keys depsRanges')

At this point we could try to adapt the code in getTransitiveDepsFromPackageSet to accept a pre-populated packageDependenciesCache (instead of creating the Ref inside the function), since we have all that data and we shouldn't hit the network
As an aside, we should patch memoisedGetPackageDependencies to respect the offline flag, since something that is not cached might hit the network. We can push the failures even lower in the stack, but I don't know off the top of my head if that would cause trouble (e.g. if we gate all the Spago.Registry functions with it). The Spago.Git module has it at least, we just haven't propagated in all the low-level places that hit the network since we usually handle it at high level.
Another possibility would be to implement something like a getTransitiveDepsFromLockfile that doesn't need to go through a PackageSet at all and can work with the lockfile data structures. I have the impression that would be more code or a more complicated implementation, since we'd need to share this code that collects the transitive deps. No strong opinion though, it might turn out fine in the end

@finnhodgkin I won't manage to get to this for a while, so feel free to give it a spin if you'd like

finnhodgkin changed the title ~~Super slow lockfile parsing and generating~~ Super slow lockfile generating on large projects Aug 1, 2024

finnhodgkin mentioned this issue Aug 1, 2024

Slow lockfile gen #1261

Merged

3 tasks

f-f changed the title ~~Super slow lockfile generating on large projects~~ Super slow lockfile parsing Aug 23, 2024

f-f added bug performance labels Aug 23, 2024

f-f mentioned this issue Aug 23, 2024

Allow to publish packages with test dependencies not in the registry #1270

Merged

2 tasks

f-f mentioned this issue Aug 31, 2024

JSON lockfile #1280

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Super slow lockfile parsing #1262

Super slow lockfile parsing #1262

finnhodgkin commented Aug 1, 2024

finnhodgkin commented Aug 19, 2024

finnhodgkin commented Aug 19, 2024

f-f commented Aug 19, 2024

thomashoneyman commented Aug 19, 2024 •

edited

Loading

f-f commented Aug 23, 2024

f-f commented Sep 5, 2024

Super slow lockfile parsing #1262

Super slow lockfile parsing #1262

Comments

finnhodgkin commented Aug 1, 2024

finnhodgkin commented Aug 19, 2024

finnhodgkin commented Aug 19, 2024

f-f commented Aug 19, 2024

thomashoneyman commented Aug 19, 2024 • edited Loading

f-f commented Aug 23, 2024

f-f commented Sep 5, 2024

thomashoneyman commented Aug 19, 2024 •

edited

Loading