Use binary cache at store_uri for certain operations #1360

Open · Ma27 wants to merge 4 commits into nix-next from nix-perl-bindings-with-bugfix
Conversation

@Ma27 (Member) commented Feb 12, 2024

Based on #1359. Trivial now that we have Perl bindings that accept a store URI.

This changes the following operations to use the binary cache store instead of the local store:

  • Check if build products were GCed
  • Check for valid paths when generating a nix channel
  • Generating the runtime dependencies
  • Hydra's builtin binary cache

Also switched from global variables to subs to keep the include test working: there, the config (with the store_uri) would otherwise be read before the files to be included are created.

Closes #1352 as it solves the same issue.

cc @Ericson2314

Ma27 added 3 commits on February 12, 2024, 18:52:
Replaces / Closes NixOS#1352

Consider the following setup:

* `store_uri` points to a "special" store, e.g. `s3://` or `file://`.
* Hydra runs inside e.g. an nspawn container and exclusively builds
  on remote systems.

Then, build products are never in the nspawn's `/nix/store`. This in
turn means that `$MACHINE_LOCAL_STORE->isValidPath` is always `false`
(or `0` 🤷) for output paths and thus Hydra wrongly claims that the
output got garbage collected.

Instead, use the BINARY_CACHE_STORE to check for availability in the
correct store.

Everything touching the `drv` rather than the output paths still uses
MACHINE_LOCAL_STORE: this is because `hydra-eval-jobs` does `openStore()`
on `auto`, so the derivations end up there rather than on the
BINARY_CACHE_STORE.
Now that we don't assume the local store anymore, but use the
BINARY_CACHE_STORE which is effectively `openStore(store_uri)`, it
doesn't really matter where the store is.

Also, this doesn't seem too expensive for at most 5 build outputs.
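
For illustration, a minimal sketch of the kind of call-site change this describes (`getBinaryCacheStore()` is a hypothetical accessor name standing in for the BINARY_CACHE_STORE handle, not necessarily the exact code in this PR):

```perl
# Before: asks the nspawn container's local /nix/store, which never contains
# the build outputs in the setup described above, so products always look
# garbage-collected.
if ($MACHINE_LOCAL_STORE->isValidPath($outPath)) { ... }

# After: asks the store that store_uri actually points to (e.g. s3:// or
# file://), i.e. the place the outputs really live.
if (getBinaryCacheStore()->isValidPath($outPath)) { ... }
```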
Otherwise the `include` test would fail: this test defines a `hydra.conf`
with an `include` statement. However, the file to be included is created
after the test base is set up. This means that, among others, the
`Hydra::Helper::Nix` module is loaded before the file to include exists.
Since the global variables are initialized directly and try to read from
the config, the test fails because loading the config at that time
doesn't work. Delaying this into a subroutine solves the issue.

I decided against messing around with the config mechanism here since I
consider this behavior to be correct.

The local store (i.e. `Hydra::Store->new()` without arguments) is already
cached in the Perl bindings[1]; however, this isn't the case for the binary
cache store. Because of that, a `state` variable is used to cache the
binary cache store. At runtime, the location of that store doesn't
change anyway.

[1] https://github.com/NixOS/nix/blob/e3ccb7e42a55529d8d50544e7e0cebb5cbc606cf/perl/lib/Nix/Store.xs#L88-L96
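
A minimal sketch of that pattern, with hypothetical names (the config lookup and the constructor argument are assumptions, not the PR's verbatim code):

```perl
use feature 'state';

# Hypothetical sketch: open the store behind store_uri lazily, on first use,
# instead of at module load time.  That way hydra.conf (and any file it
# includes) only has to exist once the first store operation happens, and the
# `state` variable caches the handle for all later calls.
# getHydraConfig() and the constructor argument are stand-ins for however
# the PR actually reads the config and opens the store.
sub getBinaryCacheStore {
    state $store = do {
        my $uri = getHydraConfig()->{store_uri} // 'auto';
        Hydra::Store->new($uri);
    };
    return $store;
}
```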
Ma27 force-pushed the nix-perl-bindings-with-bugfix branch from 1d5b3d5 to 67de7a3 on February 12, 2024, 17:53
Ericson2314 changed the base branch from nix-next to master on February 12, 2024, 17:56
Ericson2314 changed the base branch from master to nix-next on February 12, 2024, 17:57
@Ericson2314 (Member) commented:

I'm still sorta confused how it ever worked before. And are the performance ramifications of looking up in the binary cache when we previously looked up in the local store (presumably sometimes successfully?) OK?

@Ma27 (Member, Author) commented Feb 12, 2024

> I'm still sorta confused how it ever worked before

Given how the Perl bindings used to look, my best guess is that /nix/store was always used (IIRC there was no real way around that pre Nix 2.0 unless you changed some configure flags) and store_uri was added later on, but a lot of the Perl code was never updated accordingly.

Perhaps @edolstra knows a little more?

> are the performance ramifications of looking up in the binary cache when we previously looked up in the local store (presumably sometimes successfully

It should be OK for file://, but I'm absolutely not sure about e.g. s3:// (on hydra.nixos.org); that's part of the reason why I put that change into its own commit (also, I don't have an environment to reasonably test this).

Perhaps we want to leave that part out for now, or even split it up once more?

@Ericson2314 (Member) commented:

@Ma27 Right. Yeah, the fact that Hydra presumably supported binary caches long before Nix had the store abstraction that included binary caches is what makes things confusing: why does it seem like the old way of working with the binary cache is wrong even though we've had it for quite a while?

This method is applicable on all kinds of stores, so let's
give it a store rather than letting it decide which store to
use.
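
In other words, a dependency-injection style change. A hypothetical before/after sketch (the helper name is made up, since the commit doesn't quote the actual method):

```perl
# Before: the helper decides on its own which store to consult.
sub product_is_available_old {
    my ($path) = @_;
    return $MACHINE_LOCAL_STORE->isValidPath($path);
}

# After: the caller hands in a store, so the same helper works against the
# local store, a file:// cache, or an s3:// cache alike.
sub product_is_available {
    my ($store, $path) = @_;
    return $store->isValidPath($path);
}
```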
@Ma27 (Member, Author) commented Feb 13, 2024

> why does it seem like the old way of working with the binary cache is wrong even though we've had it for quite a while?

Yeah, I tried to use that when I first set up a Hydra and I think I never managed to set up binary caching back then.

Considering that binary_cache_secret_key_file raises a deprecation warning anyway, I'm wondering if we should rip that out instead and suggest serving a flat-file binary cache via nginx. See also #1331.

I wouldn't expect much of an impact for local stores, but you know the implementations better than I do. Also, the most-used thing is probably the build overview, which is basically a few isValidPath() calls, and I can't remember derivations with >10 outputs.

OTOH I'm not sure doing that for remote stores is sensible because of potential latency and because S3 is pay by requester now: NixOS/infra#299
cc @NixOS/infra for opinions as well.

So perhaps we want to exclude 44c98e1 from this PR, @Ericson2314?

@delroth (Contributor) commented Feb 14, 2024

> OTOH I'm not sure doing that for remote stores is sensible because of potential latency and because S3 is pay by requester now: NixOS/infra#299

Pay by requester shouldn't be relevant here - I doubt anyone other than us is directly using s3://nix-cache for their Hydra instance, and when the requester is us (NixOS Hydra) we'd be paying either way even if it wasn't pay by requester. Plus, the operations here are all super cheap; it's bandwidth that's the expensive part.

Latency could indeed be a problem: this is switching from an operation that completes in milliseconds (checking the existence of a local path) to an operation that completes in hundreds of milliseconds (a cross-Atlantic round trip), and there's no parallelism or batching anywhere to alleviate this. Of course, I'm not saying that it is a problem - I don't know nearly enough about the Hydra codebase to even understand what's being changed here. But if we were previously checking 100+ paths in the local store and we switch to checking 100+ paths in a remote store, we're going from something near-instant (sub-second) to something that needs roughly a minute to run.
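
For a rough sense of the scale involved (the per-lookup latencies below are illustrative assumptions, not measurements):

$$100 \times 1\,\mathrm{ms} \approx 0.1\,\mathrm{s} \;\ \text{(local lookups)} \qquad \text{vs.} \qquad 100 \times 400\,\mathrm{ms} \approx 40\,\mathrm{s} \;\ \text{(sequential remote lookups)}$$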

@Ma27 (Member, Author) commented Mar 4, 2024

Yep, agreed about the latency part.

  • The queryPathFromHashPart in Controller/Root.pm is only used for Hydra's builtin binary cache, so it's irrelevant for hno. In fact, if infra-only Hydra becomes a thing, that's something I'd suggest ripping out.
  • I think the same applies to the NixChannel change, but somebody should probably double-check that.
  • The change in build_GET in Controller/Build.pm is used to check if a product was GCed. This only affects the outputs of a single derivation, i.e. O(n) with n being the number of outputs; I can't recall an instance where this was >10. This will also be helpful for hno I think, because then you'll see on Hydra's build overview whether a certain build was part of the upcoming GC.
  • The output sub in Controller/Build.pm also works on a single derivation only.

These are the changes that I'd consider good to go.

While revisiting build_deps & runtime_deps in Controller/Build.pm, I'm not sure anymore if that's something we want, especially on hno, which is already rather loaded AFAIK. Perhaps we want to split this off for now, @delroth @Ericson2314?

@lheckemann (Member) commented:

Regarding the latency, I think this should be fine because a lot of the queries are going to be repeats and the narinfo cache prevents repeated checks? Not 100% sure but I'm in favour of merging it without the split and seeing how it impacts performance.
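
For context on that cache: Nix keeps a local on-disk cache of narinfo lookups, so repeated queries for the same path are cheap, but negative results ("path not present") are kept for a much shorter time than positive ones by default. The TTLs are tunable via standard Nix settings; the values below are only examples, not recommendations:

```
# nix.conf (illustrative values)
narinfo-cache-positive-ttl = 2592000   # seconds to remember "path exists"
narinfo-cache-negative-ttl = 3600      # seconds to remember "path is missing"
```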

@Ma27 (Member, Author) commented Mar 7, 2024

Fine by me, but that call should be made by the operators of hno.
