Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Super slow lockfile parsing #1262

Open
finnhodgkin opened this issue Aug 1, 2024 · 6 comments
Open

Super slow lockfile parsing #1262

finnhodgkin opened this issue Aug 1, 2024 · 6 comments

Comments

@finnhodgkin
Copy link
Contributor

At my work we have a very large workspace with many libs.

Adding or removing any modules from any of the libs takes more than three minutes. I've narrowed most of this time down to lockfile generation.

Adding a bunch of extra caching and combining queries gets that time down to <1.5s.

@finnhodgkin finnhodgkin changed the title Super slow lockfile parsing and generating Super slow lockfile generating on large projects Aug 1, 2024
@finnhodgkin finnhodgkin mentioned this issue Aug 1, 2024
3 tasks
@finnhodgkin
Copy link
Contributor Author

A somewhat linked issue to this is that the lockfile can become insanely large, and even reading and parsing it becomes pretty slow. My work codebase has a lockfile with 123k lines and takes this long to parse & decode:

image

This wouldn't be too bad if it was occasional but every spago command triggers a parse+decode. 3 seconds on every command is definitely noticeable.

@finnhodgkin
Copy link
Contributor Author

This may be a me problem and not something that's going to impact the average user...

Our lockfile is made up of 120k lines of workspace.package.[package with build plan] and then a few thousand lines for everything else.

@f-f I know very little about the internals so this might be a stupid question, but are the build plans necessary or could they be worked out at runtime by looking up the transient deps and nubbing the results? Here's one example from our lockfile:

dependencies:
  - argonaut-codecs
  - bifunctors
  - console
  - effect
  - either
  - foreign
  - foreign-generic
  - graphql-client
  - oa-make-fixture
  - prelude
  - transformers
test_dependencies:
  - oa-no-tests
build_plan:
  - aff
  - aff-promise
  - affjax
  - affjax-node
  - affjax-web
  - ansi
  - argonaut
  - argonaut-codecs
  - argonaut-core
  - argonaut-traversals
  - arraybuffer-types
  - arrays
  - avar
  - bifunctors
  - catenable-lists
  - colors
  - console
  - const
  - contravariant
  - control
  - datetime
  - debug
  - distributive
  - effect
  - either
  - enums
  - exceptions
  - exists
  - filterable
  - foldable-traversable
  - foreign
  - foreign-generic
  - foreign-object
  - fork
  - form-urlencoded
  - free
  - functions
  - functors
  - gen
  - graphql-client
  - halogen-subscriptions
  - heterogeneous
  - http-methods
  - identity
  - integers
  - invariant
  - js-date
  - js-uri
  - lazy
  - lcg
  - lists
  - maybe
  - media-types
  - mmorph
  - newtype
  - nonempty
  - now
  - nullable
  - numbers
  - oa-make-fixture
  - ordered-collections
  - orders
  - parallel
  - partial
  - pipes
  - prelude
  - profunctor
  - profunctor-lenses
  - psci-support
  - quickcheck
  - random
  - record
  - refs
  - safe-coerce
  - spec
  - spec-discovery
  - st
  - string-parsers
  - strings
  - tailrec
  - transformers
  - tuples
  - type-equality
  - typelevel-lists
  - typelevel-peano
  - typelevel-prelude
  - unfoldable
  - unicode
  - unsafe-coerce
  - unsafe-reference
  - variant
  - web-dom
  - web-events
  - web-file
  - web-html
  - web-storage
  - web-xhr

@f-f
Copy link
Member

f-f commented Aug 19, 2024

I know very little about the internals so this might be a stupid question, but are the build plans necessary or could they be worked out at runtime by looking up the transient deps and nubbing the results?

The reason why we have a lockfile is to enable reprodicible and offline builds in a self-contained manner - we need a certain amount of information to be able to have them work both in Spago and in other tools such as Nix (where one can just read the lockfile and not have to invoke spago at all) without hitting the internet or some other local cache.

We might be able to get away with less info, but the current structure stems from how the code is structured to come up with a build plan with a package set or with the solver, and with and without the lockfile. May be worth trying to have a look to see if we can have less info, but the Nix case needs to be taken into account (cc @thomashoneyman)

This may be a me problem and not something that's going to impact the average user...

It's a general problem. I have also noticed how slow the yaml parsing is. A few ideas about that:

  • we could implement some custom yaml parsing
  • we could just change the format. This option has my preference - again here we need to keep in mind the usecase where other tools need to be able to read the lockfile, so JSON is our first candidate here since Nix can read that natively

@thomashoneyman
Copy link
Member

thomashoneyman commented Aug 19, 2024

We could definitely get away with a more compressed lockfile format. My primary considerations from the Nix side:

  1. We need to be able to determine all of the packages necessary to produce a build, at specific versions and with hashes, from the lockfile alone. But Nix is a programming language, so it's fine to have to do some work on the Nix side. For example, it should be possible to figure out the transient dependencies from the build dependencies, so long as we still keep build and test deps separate.

  2. We need to be able to read the lockfile format, as @f-f noted. As far as I'm aware, all that Nix natively supports is JSON, though it's possible to read other formats so long as they can be converted to JSON first. JSON may not be a suitable file format if you want it to be as fast as possible, though.

We might not want to take too much information out of the lockfile because then Spago has to compute it at runtime. But at a glance it seems we're on the wrong side of the trade right now.

@f-f f-f changed the title Super slow lockfile generating on large projects Super slow lockfile parsing Aug 23, 2024
@f-f
Copy link
Member

f-f commented Aug 23, 2024

@finnhodgkin now that #1261 is merged I pivoted this issue to track slow parsing of the lockfiles.

@f-f
Copy link
Member

f-f commented Sep 5, 2024

I had a look at the code that uses the lockfile, and I think we can let go of having a build_plan section in the lockfile - which should save us a lot of time when decoding the file. We'll have to recompute the build plans on every execution, but hopefully that'll be quicker.

It should not be too hard to patch either - sketch:

  • This is the only place where we use the build plans from the lockfile:
    -- If we have a lockfile we can just use that - we don't need build a plan, since we store it for every workspace
    -- package, so we can just filter out the packages we need.
    Right lockfile -> do
    case Map.lookup workspacePackage.package.name lockfile.workspace.packages of
    Nothing ->
    die $ "Package " <> PackageName.print workspacePackage.package.name <> " not found in lockfile"
    Just envs ->
    pure
    { core: fromBuildPlan envs.core.build_plan
  • If we'd remove the build_plan then we'd have to recompute it on the spot, and I think we could use the same code as the case with no lockfile and a package set build (because in the lockfile we basically store a package set even in the case of a solver build):
    PackageSetBuild _info set -> do
    depsRanges # onEachEnvM \depsRanges' ->
    getTransitiveDepsFromPackageSet set $ (Array.fromFoldable $ Map.keys depsRanges')
  • At this point we could try to adapt the code in getTransitiveDepsFromPackageSet to accept a pre-populated packageDependenciesCache (instead of creating the Ref inside the function), since we have all that data and we shouldn't hit the network
  • As an aside, we should patch memoisedGetPackageDependencies to respect the offline flag, since something that is not cached might hit the network. We can push the failures even lower in the stack, but I don't know off the top of my head if that would cause trouble (e.g. if we gate all the Spago.Registry functions with it). The Spago.Git module has it at least, we just haven't propagated in all the low-level places that hit the network since we usually handle it at high level.
  • Another possibility would be to implement something like a getTransitiveDepsFromLockfile that doesn't need to go through a PackageSet at all and can work with the lockfile data structures. I have the impression that would be more code or a more complicated implementation, since we'd need to share this code that collects the transitive deps. No strong opinion though, it might turn out fine in the end

@finnhodgkin I won't manage to get to this for a while, so feel free to give it a spin if you'd like

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants