Store path provenance tracking #11749

Draft: wants to merge 3 commits into base: master


edolstra
Member

Motivation

Nix has historically been poor at answering the question "where did this store path come from?", i.e. at providing traceability from a store path back to the Nix expression from which it was built. Nix tracks the "deriver" of a store path (the .drv file that built it), but that is of little use in practice, since it doesn't link back to the Nix expressions.

So this PR adds a "provenance" field (a JSON object) to the ValidPaths table and to .narinfo files that describes where the store path came from and how it can be reproduced.

There are currently 3 types of provenance:

  • copied: Records that the store path was copied or substituted from another store (typically a binary cache). Its "from" field is the URL of the origin store. Its "provenance" field propagates the provenance of the store path on the origin store.

  • derivation: Records that the store path is the output of a .drv file. This is equivalent to the "deriver" field, but it has a nested "provenance" field that records how the .drv file was created.

  • flake: Records that the store path was created during the evaluation of a flake output.

Example:

$ nix path-info --json /nix/store/xcqzb13bd60zmfw6wv0z4242b9mfw042-patchelf-0.18.0
{
  "/nix/store/xcqzb13bd60zmfw6wv0z4242b9mfw042-patchelf-0.18.0": {
    "provenance": {
      "from": "https://cache.example.org/",
      "provenance": {
        "drv": "rlabxgjx88bavjkc694v1bqbwslwivxs-patchelf-0.18.0.drv",
        "output": "out",
        "provenance": {
          "flake": {
            "lastModified": 1729856604,
            "narHash": "sha256-obmE2ZI9sTPXczzGMerwQX4SALF+ABL9J0oB371yvZE=",
            "owner": "NixOS",
            "repo": "patchelf",
            "rev": "689f19e499caee8e5c3d387008bbd4ed7f8dc3a9",
            "type": "github"
          },
          "output": "packages.x86_64-linux.default",
          "type": "flake"
        },
        "type": "derivation"
      },
      "type": "copied"
    },
    ...
  }
}

This specifies that the store path was copied from the binary cache https://cache.example.org/ and that it is the "out" output of a store derivation produced by evaluating the flake output packages.x86_64-linux.default of the recorded revision of the patchelf GitHub repository.
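
As a minimal sketch of how such a nested record can be read, the following walks the chain from the outermost entry to the innermost one. The `Provenance` struct, `describe` and `exampleChain` are hypothetical names made up for illustration; they are not the types used in this PR, and the single `detail` string stands in for the type-specific fields ("from", "drv", "output") shown in the JSON above.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical in-memory model of one provenance record. The field names are
// illustrative and are not the PR's actual data structures.
struct Provenance
{
    std::string type;                 // "copied", "derivation" or "flake"
    std::string detail;               // origin URL, .drv name, or flake output name
    std::shared_ptr<Provenance> next; // the nested "provenance" field; may be null
};

// Walk from the outermost record inwards, producing one line per step.
std::vector<std::string> describe(const Provenance & p)
{
    std::vector<std::string> steps;
    for (const Provenance * cur = &p; cur; cur = cur->next.get()) {
        if (cur->type == "copied")
            steps.push_back("copied from " + cur->detail);
        else if (cur->type == "derivation")
            steps.push_back("built by " + cur->detail);
        else if (cur->type == "flake")
            steps.push_back("evaluated from flake output " + cur->detail);
        else
            steps.push_back("unknown provenance type " + cur->type);
    }
    return steps;
}

// Build the three-level chain from the example output by hand.
Provenance exampleChain()
{
    auto flake = std::make_shared<Provenance>(
        Provenance{"flake", "packages.x86_64-linux.default", nullptr});
    auto drv = std::make_shared<Provenance>(
        Provenance{"derivation", "rlabxgjx88bavjkc694v1bqbwslwivxs-patchelf-0.18.0.drv", flake});
    return Provenance{"copied", "https://cache.example.org/", drv};
}
```

Calling `describe(exampleChain())` yields three steps, in the same order as the sentence above: copied, then built, then evaluated.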

Depends on #11668.

Context

Priorities and Process

Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@github-actions github-actions bot added the store (Issues and pull requests concerning the Nix store) and fetching (Networking with the outside (non-Nix) world, input locking) labels Oct 25, 2024
@edolstra edolstra marked this pull request as draft October 25, 2024 12:53
@edolstra edolstra force-pushed the provenance branch 3 times, most recently from e08ec75 to 0956b7e October 25, 2024 16:50
Backward-compatible schema changes (e.g. those that add tables or
nullable columns) now no longer need a change to the global schema
file (/nix/var/nix/db/schema). Thus, old Nix versions can continue to
access the database.

This is especially useful for schema changes required by experimental
features. In particular, it replaces the ad-hoc handling of the schema
changes for CA derivations (i.e. the file /nix/var/nix/db/ca-schema).

Schema versions 8 and 10 could have been handled by this mechanism in
a backward-compatible way as well.
@edolstra edolstra force-pushed the provenance branch 2 times, most recently from f2b796f to 31d1d7e October 26, 2024 15:49
@github-actions github-actions bot added the with-tests label (Issues related to testing. PRs with tests have some priority) Oct 26, 2024
@johnrichardrinehart

johnrichardrinehart commented Oct 26, 2024

This looks like a cool idea. How does it help me determine which expression (which line of which file) in the checkout of some repository defines the .drv?

Like, you implied this would support tracking the store path back to the expression. And, in the flake case I guess someone could make an argument that that's good enough. But, what about in the case of an ad-hoc derivation floating around on my filesystem that I realise with nix-build and which gets post-build-hooked to a substituter? Seems like the provenance might be hard in that case? I should play around with this because I'll probably be able to answer my own questions.

@edolstra
Member Author

> How does it help me determine which expression (which line of which file) in the checkout of some repository defines the .drv?

It doesn't currently, since that information wouldn't be enough to reproduce the store derivation (i.e. a package function in Nixpkgs requires arguments to be able to reproduce its output, not to mention stuff like overrides). But storing the top-level flake + flake output name that caused the store derivation to be created does allow the store derivation to be reproduced.

> But, what about in the case of an ad-hoc derivation floating around on my filesystem

The problem there is that evaluation of non-flake expressions is not hermetic, so we really do need something like flakes for provenance.

@roberth
Member

roberth commented Nov 6, 2024

> not hermetic

It will be less likely that you can verify the provenance, but something could be recorded nonetheless.
Expressions written with "purity" in mind may actually verify just fine if, say, a git revision is stored when e.g. a default.nix is in a git repo.

@roberth
Member

roberth commented Nov 6, 2024

(I haven't read the whole diff yet, so apologies for questions I could have answered myself, but these will need to be documented anyway, so also you're welcome :) )

>   • flake: Records that the store path was created during the evaluation of a flake output.

Many evaluations will produce the same paths. How do we deal with that? I suppose we only need a flake provenance for the outputs that are immediately in the flake outputs, and we can find provenance of the closure by following the referrers relation.
Denormalizing all this into the closure is too expensive.
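
The referrers-based lookup proposed here could look roughly like the sketch below: only paths that are directly a flake output carry provenance, and any other path finds it by searching upwards through the paths that refer to it. `ToyStore` and `findProvenance` are hypothetical and use plain strings in place of real store paths and provenance records; nothing here is taken from the PR's diff.

```cpp
#include <deque>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical store model for illustration only.
struct ToyStore
{
    // referrers[p] lists the paths that reference p.
    std::map<std::string, std::vector<std::string>> referrers;
    // Only paths that are directly a flake output carry a provenance string.
    std::map<std::string, std::string> flakeProvenance;
};

// Breadth-first search up the referrers relation until a path with its own
// provenance is found. Returns std::nullopt if no such path is reachable.
std::optional<std::string> findProvenance(const ToyStore & store, const std::string & start)
{
    std::deque<std::string> queue{start};
    std::set<std::string> seen{start};
    while (!queue.empty()) {
        std::string path = queue.front();
        queue.pop_front();
        if (auto it = store.flakeProvenance.find(path); it != store.flakeProvenance.end())
            return it->second;
        if (auto it = store.referrers.find(path); it != store.referrers.end())
            for (const auto & r : it->second)
                if (seen.insert(r).second) // enqueue each referrer only once
                    queue.push_back(r);
    }
    return std::nullopt;
}
```

This sketch returns the first provenance it reaches, so a path with several referrers still gets an arbitrary answer. It also needs an enumerable referrers relation, which is the limitation noted for binary cache stores in the next paragraph.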

Another solution is to only store the first provenance, but this is too arbitrary IMO, and can also be achieved with a first referrer field if we feel like storing all referrers edges is too expensive or impractical for "non-enumerating" stores like the binary cache stores.

Putting new appendable data into the stores including the binary caches stores is quite a step.

Do we really need this to be in the binary cache?

A lot of the value of this feature could instead be produced by a local database, since that's where evaluation and realisation ultimately happen anyway.
It's only when you're doing deployments with store-level-only operations like closure copying that you lose this info, but I think this is fine. Deployment targets don't need to know their evaluation provenance; only the machines that manage those targets really need to know.

Some questions

Things to be documented and/or implemented

  • How do we deal with the many-to-one relationship between evaluations and a product of those evaluations?
  • How does this work for ca-derivations realisations?
  • Documentation in the protocols section of the manual

struct ProvFlake
{
    std::shared_ptr<nlohmann::json> flake; // FIXME: change to Attrs
    std::string flakeOutput;

Suggested change
- std::string flakeOutput;
+ std::vector<std::string> flakeOutput;
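
The suggestion gives no rationale. One plausible reading, offered here as an assumption, is that a list of attribute names stays unambiguous even when a name itself contains a dot, whereas a single dotted string does not. The hypothetical helper `showAttrPath` below renders such a list for display and quotes any component that contains a dot.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: render an attribute path for display. Components that
// contain a dot are quoted so the result can be split back unambiguously.
std::string showAttrPath(const std::vector<std::string> & path)
{
    std::string out;
    for (const auto & name : path) {
        if (!out.empty())
            out += '.';
        if (name.find('.') != std::string::npos)
            out += '"' + name + '"';
        else
            out += name;
    }
    return out;
}
```

For the output in the PR description, `showAttrPath({"packages", "x86_64-linux", "default"})` gives the familiar dotted form, while a component such as `foo.bar` would be rendered in quotes.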

* derivation input source) that was produced by the evaluation of
* a flake.
*/
struct ProvFlake

This is a layer violation. We could define something like struct ProvOther { std::string type; nlohmann::json value; } at the store layer and refine this in upper layers.
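
A rough sketch of that layering, under stated assumptions: the store layer carries an opaque record, and a higher layer refines it into a typed view when it recognises the type. To keep the example dependency-free, a string map stands in for the `nlohmann::json` value from the suggestion. `FlakeView` and `asFlake` are hypothetical names, not part of the PR.

```cpp
#include <map>
#include <optional>
#include <string>

// Store layer: knows nothing about flakes. The string map is a stand-in for
// the nlohmann::json value in the suggestion (an assumption of this sketch).
struct ProvOther
{
    std::string type;
    std::map<std::string, std::string> value;
};

// Upper layer (hypothetical): a typed view of records whose type is "flake".
struct FlakeView
{
    std::string flakeRef;
    std::string output;
};

// Refine an opaque record into a FlakeView. Returns std::nullopt if the
// record is of another type or lacks the expected keys.
std::optional<FlakeView> asFlake(const ProvOther & p)
{
    if (p.type != "flake")
        return std::nullopt;
    auto ref = p.value.find("flake");
    auto out = p.value.find("output");
    if (ref == p.value.end() || out == p.value.end())
        return std::nullopt;
    return FlakeView{ref->second, out->second};
}
```

With this split, the store can persist and copy records of types it has never heard of, and only the evaluator-side code needs to understand what a "flake" record contains.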

Member Author

I'm thinking about getting rid of all the Prov* types and just passing provenance around as a JSON value.

@edolstra
Member Author

edolstra commented Nov 6, 2024

> It will be less likely that you can verify the provenance, but something could be recorded nonetheless.

Indeed, provenance doesn't need to be hermetic or reproducible, so we could certainly have a provenance type for non-flake evaluations.

> Many evaluations will produce the same paths. How do we deal with that?

The provenance is the evaluation that produced the store path, i.e. the first one. There can of course be many other evaluations that produce the same store path, but those are not the provenance for that particular store / binary cache. (The same applies to other types of provenance like substitution: a path can be substituted from many binary caches, but we only record the one we actually used.)

Recording other provenances makes the metadata for a store path potentially grow without bounds. And in the case of .narinfo files, we really don't want to update them after creation due to caching etc.

This is the same semantics as the deriver field BTW.
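
The "first one wins" semantics described above can be sketched as a registry that stores provenance only when a path is first registered and ignores later attempts. `ProvenanceRegistry` is a hypothetical stand-in and does not reflect how the ValidPaths table or .narinfo files are actually written.

```cpp
#include <map>
#include <string>

// Hypothetical registry for illustration; not Nix's ValidPaths table.
class ProvenanceRegistry
{
    std::map<std::string, std::string> byPath;

public:
    // Record provenance only if the path has none yet.
    // Returns true if this call stored the provenance, false if one existed.
    bool registerPath(const std::string & path, const std::string & provenance)
    {
        return byPath.try_emplace(path, provenance).second;
    }

    // Returns the recorded provenance, or nullptr if the path is unknown.
    const std::string * lookup(const std::string & path) const
    {
        auto it = byPath.find(path);
        return it == byPath.end() ? nullptr : &it->second;
    }
};
```

Because a record is never rewritten after creation, the metadata per path stays bounded, which matches the concern about not updating .narinfo files once they have been published.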

> Do we really need this to be in the binary cache?

I think so, because without that you can't query the ultimate provenance of a store path in a binary cache like cache.nixos.org.

Labels
fetching: Networking with the outside (non-Nix) world, input locking
store: Issues and pull requests concerning the Nix store
with-tests: Issues related to testing. PRs with tests have some priority
3 participants