Shim packages #61

Open
1 task done
jaimergp opened this issue Jul 8, 2021 · 13 comments

Comments

@jaimergp
Member

jaimergp commented Jul 8, 2021

We all know that cudatoolkit packages are huge and cause a lot of unneeded traffic, especially in GPU-less CI settings where they wouldn't work anyway, or on HPC systems that provide their own CUDA installation. I'd say it would be beneficial to have cudatoolkit shim packages that are empty and contain only the required metadata. They would be deprioritized with respect to the real package, much along the lines of conda-forge/mpich-feedstock#44.

Interested users would specify cudatoolkit=*=external or similar in their package list. Alternatively (or additionally), the empty packages could be uploaded under a different label, and interested users would add that label with higher priority.
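For illustration, here is a minimal sketch of how a user might opt in, assuming a hypothetical "external" build string and a hypothetical "external" label (neither exists today; cupy is just an illustrative downstream package):

```bash
# Sketch only: the "external" build string and the "external" label are hypothetical.
# Option 1: pin the shim build string in the package list / environment file.
conda install "cudatoolkit=11.2=*external*" cupy

# Option 2: pull the shim from a dedicated label with higher channel priority.
conda install -c conda-forge/label/external -c conda-forge "cudatoolkit=11.2" cupy
```

Either way, the solver would only prefer the shim when the user explicitly asks for it.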

Would this be something we'd agree on?

@jakirkham
Member

I'm not sure.

In issue ( conda-forge/conda-forge.github.io#687 ) we agreed with the broader community on aligning around the cudatoolkit packages and having CUDA versions selected through them. This is something all parties agreed to and do today. If we were to change that, I would like to see the same parties weigh in.

While the issue of download and decompression pain does come up periodically, my hope would be that better split-up packages ( #62 ) will cut down on this pain. Additionally, I would hope that generally moving to the newer .conda packages will cut down on the download/decompression pain (by using zip files that allow random access and newer compressors like Zstd, which work well with filesystems like SquashFS). Also, some of this pain can simply be due to using the wrong filesystem on a cluster (like NFS), which just requires better engagement with system admins (though maybe we can document this better to help people realize this potential issue sooner).

If even after all of that we still wanted to do something like this, I think the question becomes where can we reliably expect external libraries to live on users' systems? If there is no reliable place, how do we specify it? Then following that, how do we support this functionality in packages using CUDA libraries without needing to add another build for external vs. internal?

@jaimergp
Member Author

> how do we support this functionality in packages using CUDA libraries without needing to add another build for external vs. internal?

Is an extra build needed? I thought adding cudatoolkit=*=*external to the environment file or package list would suffice for the solver to pick the shim version. After that, it's on the user if they deliberately choose to opt out of the full cudatoolkit download.

@jakirkham
Member

If we can do it without an extra build, I agree that is more useful. I would much rather people make a runtime decision. That said, I don't know if people will generally configure this correctly, so it might need to be an optimization they apply themselves.

@jaimergp
Member Author

I envision this as an advanced piece of knowledge that users in CI and HPC centers will reach for only after realizing they don't like the full download; they will Google this issue and find out that they can use the *external trick (or find it mentioned in the documentation).

With CUDA 11.3+ it will be less of a problem, I guess, but some libraries are still well into the hundreds of megabytes anyway. Could we add this only for <=11.2?

@dicta

dicta commented Jul 28, 2021

One of the build problems this will cause: when building downstream packages with CMake, the CMAKE_FIND_ROOT_PATH variable is used in conda-build environments to restrict header and library lookups to those in the environment. Other build systems may do something similar; I'm not nearly as familiar with them, so I can't comment directly.
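To make that restriction concrete, here is a rough sketch of a typical recipe-style CMake invocation (illustrative, not taken from an actual feedstock): with lookups rooted at the environment prefixes, a toolkit that only exists on the host system would not be found.

```bash
#!/bin/bash
# Typical conda-forge-style CMake invocation: header/library lookups are rooted
# at the environment prefixes, so an "external" cudatoolkit living only on the
# system (e.g. under /usr/local/cuda) would be invisible to find_package/find_library.
cmake ${CMAKE_ARGS} \
      -DCMAKE_FIND_ROOT_PATH="${PREFIX};${BUILD_PREFIX}" \
      -DCMAKE_FIND_ROOT_PATH_MODE_INCLUDE=ONLY \
      -DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY \
      -DCMAKE_INSTALL_PREFIX="${PREFIX}" \
      .
cmake --build . --target install
```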

So there are some real compatibility problems to be solved so that we don't break conda-build infrastructure for anyone who has installed a cudatoolkit=*=external package.

IMO this should be thought about at the global conda-forge policy level -- and not just for CUDA as a special case -- so that we consistently deal with any environment that may need to depend on software provided by the system. My personal preference for cases like this is for any generated "external" metapackage to do something like the following:

  1. Generate symlinks from the system locations into the environment for each file that's part of the external package. This may involve variants of the metapackage for different OS distributions, but this cannot be helped and it's better to do that here, in one place, than to have every downstream package that depends on us have to deal with it.

  2. Test at environment activation time (likely via an activation script) whether all the targets of these symlinks exist, so the user has some assurance that their environment will actually work (a rough sketch follows below).
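A rough bash sketch of what such an activation-time check could look like (everything here is hypothetical: the file list, the assumed system location, and the script itself do not reflect any existing package):

```bash
#!/bin/bash
# Hypothetical activation script for an "external" metapackage: verify that the
# symlinks created at install time still resolve to real files on the system.
CUDA_SYS_ROOT="${CUDA_SYS_ROOT:-/usr/local/cuda}"     # assumed system location

for lib in libcudart.so libcublas.so libcufft.so; do  # illustrative file list
    link="${CONDA_PREFIX}/lib/${lib}"
    # -e dereferences symlinks, so this catches both missing links and dangling targets.
    if [ ! -e "${link}" ]; then
        echo "WARNING: ${link} is missing or dangling;" \
             "expected a target under ${CUDA_SYS_ROOT}/lib64." >&2
    fi
done
```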

@jaimergp
Member Author

AFAIK this variant would only be relevant at runtime, not during the build. Actually, our nvcc wrapper points to the global CUDA installation at /usr/local/cuda, so I don't see how the shims would affect this process (but I am not exactly an expert here, and I am sure I am missing something).
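For context, the wrapper essentially defers to a toolkit outside the environment; a minimal sketch of the idea, assuming the usual /usr/local/cuda default (the real script in the nvcc_linux-64 package does more, e.g. setting the host compiler):

```bash
#!/bin/bash
# Simplified idea of the nvcc wrapper: use a CUDA toolkit installed outside the
# conda environment. The real wrapper also wires up the host compiler and flags.
export CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
exec "${CUDA_HOME}/bin/nvcc" "$@"
```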

The intention here is to provide empty packages that only carry the metadata. The effect would be the same as letting conda install cudatoolkit and then force-removing it (which some people have to do in their setups now for various reasons, mainly file size), but without having to download anything.
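In other words, the net effect would roughly match what some users already do by hand:

```bash
# What some users do today: install to satisfy the metadata, then drop the payload.
conda install cudatoolkit
conda remove --force cudatoolkit   # removes the files but leaves dependents installed

# A shim build would provide the same metadata without downloading the payload at all.
```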

@jakirkham
Member

> With CUDA 11.3+ it will be less of a problem, I guess, but some libraries are still well into the hundreds of megabytes anyway. Could we add this only for <=11.2?

The move to 11.3+ already seems pretty fraught. I would rather not add one more thing to the list.

@jakirkham
Member

More generally, I am interested in getting the new packages squared away ( #62 ). This will help us solve a lot of things, including adding new CUDA versions, better splitting, and providing access to build tooling. We can also take that opportunity to leverage CUDA enhanced compatibility and lighten our build matrices.

@jakirkham
Member

I should add that, for lightening deployments, we can look at using nvprune to target things more carefully and cut down on size. This is a little tricky to do in the general case (like how we build in conda-forge), but maybe we can envision final build steps that do this kind of splitting either at the end of a CI job or on a user's machine. There are pros & cons to this, though, that we would want to evaluate carefully.
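For reference, a hedged sketch of what an nvprune-based trimming step might look like (the architecture and file names are purely illustrative; nvprune only operates on files with embedded device code, such as static libraries and relocatable objects):

```bash
# Illustrative only: keep the device code for a single target architecture
# to shrink a static CUDA library before (re)packaging it.
nvprune -arch sm_80 libcufft_static.a -o libcufft_static_sm80.a
```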

@ngam

ngam commented Jun 8, 2022

> We all know that cudatoolkit packages are huge and cause a lot of unneeded traffic, especially in GPU-less CI settings where they wouldn't work anyway, or on HPC systems that provide their own CUDA installation. I'd say it would be beneficial to have cudatoolkit shim packages that are empty and contain only the required metadata.

While I think this is true in general, and such a shim package could indeed be beneficial, I would like to add one point in support of the current solution as described here:

> In issue ( conda-forge/conda-forge.github.io#687 ) we agreed with the broader community on aligning around the cudatoolkit packages and having CUDA versions selected through them. This is something all parties agreed to and do today. If we were to change that, I would like to see the same parties weigh in.

There are potential downsides to having users choose between a conda-forge cudatoolkit and a system cudatoolkit. From my experience with HPC admins, most of them know little about CUDA and don't care to learn about it. For one, they won't update it unless asked (cudatoolkit and co. are not "core" items to them). Even crazier, a lot of these system packages are incomplete: let me just say that, in 2022, I had to explicitly request that the sysadmins include ptxas in their cudatoolkit installation (11.6). I also had access to another HPC system that had cudatoolkit 11.4 but no cudnn whatsoever (not 8.x, not 7.x). So the bottom line is that, with such a weird package (can we even call it that?!) as cudatoolkit, I really think that, for the sake of the community and ecosystem, we should try to keep things simpler, not more complicated.

A power/pro user can of course choose to customize stuff as they wish, and they don't need conda-forge to figure it out for them.

Another aspect of all of this is performance. For tensorflow, I consistently find that the conda-forge installation outperforms any other installation (e.g., it outperforms PyPI by a wide margin, and even outperforms the official NGC containers). Could we guarantee that performance would not suffer if we allowed an external cudatoolkit? I suspect we cannot.

@wesfloyd

To add one more benefit in support of this issue (enabling an "openmm install without cuda"): Docker images with OpenMM grow in size quickly due to the large cuda package size (~800 MB).

@jakirkham
Member

There is a separate issue ( #48 ) around splitting into smaller packages, which can be depended on separately. Also issue ( #62 ) discusses a newer package structure that implements this.

@daveminh

I see this is labeled as complete, but can somebody clarify how to use this?
