
cudaPackages: enable cross-compilation #275560

Conversation

@ConnorBaker (Contributor) commented Dec 20, 2023

Description of changes

  • Switches to makeScopeWithSplicing' in an attempt to support cross-compilation (a minimal sketch follows this list)
  • Pushes additional configuration data (like gpus.nix and nvcc-compatibilities.nix) into the module system, allowing us to define their structure
  • The module system handles the data loading and transformation necessary to produce package sets from manifests or releases.nix expressions
  • Allows users to override certain configuration options through the module system (see the module evaluation performed in pkgs/top-level/cuda-packages.nix)
  • The result of the module-system evaluation is available as the config attribute in each CUDA package set
  • Instead of evaluating the modules once for each group of packages (CUDA redistributables, CuTensor, CuDNN, TensorRT, etc.) and each CUDA version, the modules are now evaluated once and re-used by every CUDA package set
  • Recurses into all CUDA package sets by default
  • Additional cleanups and refactoring
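
A minimal sketch of the splicing-aware scope construction follows; the file path and attribute names here are hypothetical, and the real pkgs/top-level/cuda-packages.nix does considerably more:

# Illustrative sketch only -- not the actual implementation
{ makeScopeWithSplicing'
, generateSplicesForMkScope
}:

makeScopeWithSplicing' {
  # otherSplices supplies the buildBuild/buildHost/etc. variants of this
  # scope so the splicing-aware callPackage can resolve each dependency to
  # the variant matching its role during cross-compilation.
  otherSplices = generateSplicesForMkScope "cudaPackages";
  f = final: {
    # The shared module-system evaluation would be exposed as `config`
    # here, with the CUDA packages defined alongside it.
    config = { };
  };
}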

To-do:

  • Ensure cross-compilation works as expected
  • If one squints, the genericManifest wrapper in pkgs/top-level/cuda-packages.nix is very similar to the multiplexed builder -- investigate further
  • Better describe dynamic platforms and our use of badPlatforms as a way to work around being unable to express combinations of target CUDA version and GPU capabilities as distinct platforms (see the sketch after this list)
  • Update documentation to describe what belongs in the module system (raw data like manifest files, releases.nix expressions, paths to fixup functions or shims) and what belongs in pkgs/top-level/cuda-packages.nix (actual package set construction)
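
A sketch of the badPlatforms workaround mentioned above; isUnsupportedCombination is a hypothetical predicate standing in for whatever logic rules out a given pairing of target CUDA version and GPU capability:

# Illustrative sketch only
{ lib, stdenv, isUnsupportedCombination ? false }:

{
  meta = {
    platforms = [ "x86_64-linux" "aarch64-linux" ];
    # Listing the current host platform in badPlatforms marks the
    # derivation unsupported there at evaluation time, even though the
    # platform is otherwise in meta.platforms.
    badPlatforms = lib.optionals isUnsupportedCombination [ stdenv.hostPlatform.system ];
  };
}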

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@ConnorBaker added the 6.topic: cuda label Dec 20, 2023
@ConnorBaker self-assigned this Dec 20, 2023
@ofborg added the 6.topic: cross-compilation label Dec 20, 2023
@SomeoneSerge (Contributor) commented:

Are the extra moves, renames, and changes in the manifests (including releases.nix &c) required?

@SomeoneSerge (Contributor) commented:

The part reducing the number of evalModules calls can probably be moved out into its own PR, and I'd imagine it'll have fewer blockers than splicing or modifications to the top-level config (well, I guess, technically, this PR doesn't declare anything in there, but still)

@ConnorBaker (Contributor, Author) commented:

> Are the extra moves, renames, and changes in the manifests (including releases.nix &c) required?

The files can stay where they are, but the changes to the releases.nix expressions are needed (the un-nesting of the attribute sets).

I moved them because I felt it made more sense for them to be in the same directory as the thing which is reading and processing them (same reason why gpus.nix and nvcc-compatibilities.nix were moved).

> The part reducing the number of evalModules calls can probably be moved out into its own PR, and I'd imagine it'll have fewer blockers than splicing or modifications to the top-level config (well, I guess, technically, this PR doesn't declare anything in there, but still)

I'm actually not sure it can be without larger changes -- part of the reason I'm able to evaluate it only once is that all-packages.nix now uses callPackage once on cuda-packages.nix and then inherits the different versions of the CUDA package sets from it.

That's only possible now because cuda-packages.nix generates all of the different versions of the package set. Without those changes, we'd still be using callPackage to create each version of the CUDA package set, which would mean evaluating the modules once for each version we create.

That, and without the other changes in this PR, the modules are still evaluated by each group of packages (that is, the CUDA redistributables, CuDNN, CuTensor, and TensorRT each evaluate the modules on their own).
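
A minimal sketch of that single call site, with hypothetical attribute names:

# all-packages.nix -- illustrative sketch only
{ callPackage }:

rec {
  # cuda-packages.nix is called exactly once; it evaluates the modules once
  # and builds every versioned CUDA package set from that shared result.
  cudaPackagesScopes = callPackage ../top-level/cuda-packages.nix { };

  inherit (cudaPackagesScopes)
    cudaPackages_11_8
    cudaPackages_12_2
    cudaPackages; # alias for the default version
}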

@SomeoneSerge (Contributor) commented Dec 21, 2023

> That's only possible now because cuda-packages.nix generates all of the different versions of the package set. Without those changes, we'd still be using callPackage to create each version of the CUDA package set, which would mean evaluating the modules once for each version we create.

I probably like your solution better, but I thought (now I'm unsure) that Nix would just cache the evalModules result and reuse it between the different cudaPackages sets, as long as it was invoked with the exact same static arguments? E.g.

# cuda-packages.nix:
# ...
  final: {
    extraModules = [ ];
    # all-modules.nix doesn't depend on final or on cudaVersion,
    # although we do create a thunk until we've computed final.extraModules
    metadata = (evalModules { modules = [ ../pkgs/cuda-modules/all-modules.nix ] ++ final.extraModules; }).config;
    # ...
  }
# all-packages.nix:
  # now cudaPackages_11_8.metadata contains information about all of the known CUDA releases, not just 11.8;
  # that's probably weird, but does this affect performance?
  cudaPackages_11_8 = callPackage ...;

Introducing a top-level cudaVersions would also be consistent with boostVersions and the other versioned sets we've got.

> I moved them because I felt it made more sense for them to be in the same directory as the thing which is reading and processing them (same reason why gpus.nix and nvcc-compatibilities.nix were moved).

That's probably OK to do (I haven't looked in detail), and even to do now, but I'd ask you to submit it in a separate PR as soon as you think that particular change is ready: it's a refactoring that we risk conflating with other potentially breaking changes here.

A bit of reflection on the previous big change follows. Looking at the title, I expect this PR (in its final state) to be about enabling splicing. That sounds relatively contained, and probably shouldn't affect native builds at all. The current diff clearly contains more. I'd guess this means that enabling splicing actually depends on other tasks which involve moving things around and regenerating the manifests.

It's only natural to start by solving all of these tasks in the same branch, because you can't look into the future and see whether you're going to encounter more obstacles that would make you change your mind about earlier decisions. However, as we get closer to merging, we should assign more value to disentangling the individual changes, and the earlier the better. Solving multiple tasks at once, this patch now potentially affects everything, just like we had to assume for the multiplatform PR (we expected it would break something; we couldn't predict what).

If those tasks and refactors went in separate PRs, there'd be a chance that each individual one would end up "small enough" (the new manifests affect every package, but in a homogeneous way, because it's a change of the same "type" everywhere). We could also merge those dependencies earlier, which means we could "test" them in the unstable channels before you finish work on the actual cross-compilation feature. To conclude, I'd absolutely hate to block you or interrupt your flow, but I think there are benefits to having things split up by the time of merging, and to reviewing and merging the dependencies earlier.

@wegank added the 2.status: merge conflict label Mar 20, 2024
@ConnorBaker (Contributor, Author) commented:

Closing as superseded by #301416.

Anything outside the scope of cross-compilation lives on in my out-of-tree copy of cuda-modules, where I am working on moving more evaluation into the module system to avoid infinite recursion errors by essentially forcing two stages of evaluation: https://github.com/ConnorBaker/cuda-modules.

@ConnorBaker closed this Apr 4, 2024