queue-runner: resolve CA derivations after dependency outputs land #1629
amaanq wants to merge 1 commit into NixOS:master from
Conversation
At the point where we try_resolve before, we do this:

- If the result is `None` or "same derivation": just continue with this step, same as we always did.
- If the result is `Some(drv)` and that drv is not the same as the current drv: the step is finished with status "resolved", we fill in the DB pointer to the new step, and the new step now builds; this step is done.

The current "try resolve reverse dependencies" is a possible optimisation --- incrementally resolve reverse dependencies, reminiscent of union-find --- but that's a more complex thing we shouldn't worry about at this time.
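The two cases above can be sketched as a small decision function. This is a hypothetical illustration, not Hydra's real code: `ResolveOutcome`, `StepAction`, and `plan_step` are stand-in names.

```rust
// Stand-in for the result of try_resolve on a step's derivation.
#[derive(Debug, PartialEq)]
enum ResolveOutcome {
    None,             // could not resolve (yet)
    Resolved(String), // resolved to this drv path (may equal the current one)
}

// What the scheduler should do with the step afterwards.
#[derive(Debug, PartialEq)]
enum StepAction {
    Continue,                             // build this step as we always did
    FinishAsResolved { new_drv: String }, // mark "resolved", point the DB at the new step
}

fn plan_step(current_drv: &str, outcome: ResolveOutcome) -> StepAction {
    match outcome {
        // None or "same derivation": just continue with this step.
        ResolveOutcome::None => StepAction::Continue,
        ResolveOutcome::Resolved(drv) if drv == current_drv => StepAction::Continue,
        // A genuinely different drv: this step is done; the new step builds.
        ResolveOutcome::Resolved(drv) => StepAction::FinishAsResolved { new_drv: drv },
    }
}
```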
(force-pushed ae692ac to 05eef39)
(force-pushed 05eef39 to de9aed8)
(force-pushed 0768499 to 80430bf)
I think I have this working, but there are a few things I'm unclear on and would like to figure out.

I also believe my change needs substantial refactoring, as I'm using only parts of several functions (e.g.
(force-pushed 3c7b586 to 7eb2319)
Rebased onto #1642
(force-pushed 7eb2319 to 62abe6f)
        } else {
            None
        }
    };

    if let Some(resolved_path) = resolved {
I think we can combine this with the previous branch? Or at least we could compute a "do we want to resolve" bool, and then have the actual resolving (which must succeed) and this stuff in the same branch.
This is doable, though it will take out a lock on the step's derivation for longer than strictly necessary.
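The reviewer's suggestion can be sketched as follows, with hypothetical names: compute the decision up front, then keep the (must-succeed) resolution and its follow-up work in a single branch, rather than two branches that each re-derive the condition.

```rust
// Hypothetical shape of the combined branch; `is_ca` and `all_deps_built`
// stand in for whatever state the real queue runner consults.
fn handle_step(is_ca: bool, all_deps_built: bool) -> &'static str {
    let want_resolve = is_ca && all_deps_built;
    if want_resolve {
        // The actual try_resolve() call must succeed here; failure is a
        // hard error, not a fall-through to the other branch.
        "resolve, finish this step, enqueue the resolved step"
    } else {
        "dispatch the step as-is"
    }
}
```

The trade-off noted above still applies: the single branch holds the lock on the step's derivation for the whole resolve-and-finish sequence.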
    resolved_result.set_start_time_now();
    resolved_result.set_stop_time_now();
    resolved_result.log_file.clone_from(&job.result.log_file);
    finish_build_step(
Let's try to do everything in one transaction here. finish_build_step currently creates its own transaction, so let's make it take the transaction as an argument instead.
    tx.set_resolved_to(build_id, step_nr, nr).await?;
    tx.commit().await?;
TODO: think about restart resiliency. I believe everything else after this point in this branch doesn't use the database, so that's good: we aren't logically splitting a transaction in two (assuming the previous stuff is done).

What is left to think about is: if the queue runner was restarted at this point, would the general "rebuild state machine from build database state" logic get us into the same state as what this code right here does?
We could make a variation of our CA tests where the queue runner dies here and then restarts, to test this.
Unfortunately it looks like hydra-queue-runner is creating duplicate BuildStep rows in the database on restart. I'm unsure whether this also occurs for non-CA derivations, but it should be fixed. Restart currently doesn't fail, but it leaves the DB in a not-very-clean state.
Oh, that may be an issue with CA builds, since in create_step we assume that all CAFloating derivations are unfinished, instead of checking whether the outputs exist.
It is still true that the queue runner does not attempt to load steps from the database on restart, it recalculates them.
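Since steps are recalculated rather than loaded on restart, the fix is to make step creation idempotent. A hypothetical model of that, where a `HashSet` plays the role of the database's existing BuildStep rows (names are illustrative, not Hydra's):

```rust
use std::collections::HashSet;

// `seen` models the BuildStep rows already in the database.
// Returns true iff a new step (row) was actually created.
fn create_step(seen: &mut HashSet<String>, drv: &str, outputs_already_valid: bool) -> bool {
    if outputs_already_valid {
        // Check output validity even for CAFloating derivations instead of
        // assuming they are unfinished.
        return false;
    }
    // HashSet::insert returns false if the entry already exists,
    // so re-running this after a restart creates no duplicate row.
    seen.insert(drv.to_string())
}
```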
    Arc::new(parking_lot::RwLock::new(HashSet::new())),
    Arc::new(parking_lot::RwLock::new(HashSet::new())),
    Arc::new(parking_lot::RwLock::new(HashSet::new())),
We should have a big fat comment explaining why it is correct not to give this function the mutable state it expects.
    // If we're the root of a build then we need to become the new root,
    // lest the Step get garbage collected.
The comment should make clear this is about the `Arc` keeping things alive, not the store keeping .drv files alive.

We should also be clear that in the non-root case, the idea is that we registered some reverse deps above that will keep us alive. Maybe we should assert that we did in fact go through that loop.
We will need to figure out how to keep the .drv file alive for all resolved derivations, and the GC roots that nix-eval-jobs made will not do the job since these are new drvs.
    foreign key (build) references Builds(id) on delete cascade,
-   foreign key (propagatedFrom) references Builds(id) on delete cascade
+   foreign key (propagatedFrom) references Builds(id) on delete cascade,
+   foreign key (build, resolvedToStep) references BuildSteps(build, stepnr) on delete cascade
So long term, when the database distinguishes derivations from build attempts, this should actually point to a derivation, not another build attempt.

A build attempt samples the state of the database at that point to resolve, and creates a new (resolved) derivation. That derivation can be attempted to be built multiple times (we don't really care to privilege the first attempt with a foreign key). However, subsequent attempts to build the original derivation --- if we rebuild other upstream derivations and they are content-addressed and non-deterministic --- will result in a different resolved derivation, and that too can be attempted to build multiple times.

Memoizing realization in this way does give us something close to a derived build trace, but the fact that our build trace is thus keyed off attempts, not derivations, gives us some ability to represent (and not choke on) incoherence.
(force-pushed 4be2bbb to 0344518)
Unfortunately, once #1667 is merged in, this PR will hang.
(force-pushed 1f18bcf to f040eaa)
(force-pushed 676817a to 90af869)
Instead of resolving at StepInfo construction and carrying two drv identities through the gRPC layer, resolve in realise_drv_on_valid_machine once all deps are built. If resolution yields a different drv, the original step is marked Resolved and a new DB step is created for the resolved drv with a resolvedTo FK linking them. The builder only ever sees one drv.

We create a new Step for that resolution and bunt it back to the scheduler. This grants us more flexibility in execution, and the method can be used in the future for dynamic derivations, which won't map 1:1 with the original derivations.

We also now only create database steps when we are sure they are necessary, reducing the number of duplicate `BuildStep` rows.

In order to make tests more consistent, CA derivations will fail if they cannot be fully resolved. Otherwise, there could be inconsistent successes depending on which builder a step was performed on.

As part of this, add local outputs to the resolution table. With the current queue-runner design, all dependency outputs of a CAFloating derivation must be recorded in the hydra database. This is true for things built or substituted by hydra, but until now not for things found in the local nix store. This may occur for outputs that are part of the system configuration. Therefore, add all local outputs that are not already in the database to the resolution table upon creating a step. This makes it possible to build derivations from `contentAddressedByDefault` nixpkgs.

Co-Authored-By: Artemis Tosini <artemis.tosini@obsidian.systems>
Co-Authored-By: John Ericson <John.Ericson@Obsidian.Systems>
Co-Authored-By: Amaan Qureshi <git@amaanq.com>
(force-pushed 90af869 to 3e6dae1)
Instead of resolving at dispatch time and carrying two drv identities through the gRPC layer, resolve in `succeed_step` once outputs are in the DB. The resolved drv becomes a fresh `Step` that dispatch picks up normally; the builder only ever sees one drv. A new `resolvedFromBuild`/`resolvedFromStep` foreign key on `BuildSteps` ties the resolved step back to the dependency that triggered it.