cuQuantum multi-node example #91
Comments
Hi @osbama
We have run distributed multi-GPU computations as part of circuit-cutting workloads, where a high-qubit statevector problem is split across many GPUs (in this case, 128). See our paper for more information on how we did this for QAOA optimization problems, or this talk for how it was run on NERSC's Perlmutter supercomputer. For a single distributed statevector, we do plan to add this support natively to PennyLane in future quarters. In addition, for hybrid classical-quantum distributed work, we have a demonstration that ran on AWS Braket. If you have a specific workload in mind that is not a single distributed statevector, we may be able to offer some suggestions on how to approach it with the existing tooling.
Thank you very much for the detailed answer; the references will be extremely useful. We are working towards implementing an extended-Hubbard-like correction to standard density-functional-theory kernels using a QPU and machine learning (at the moment we are exploring classical shadows). A distributed statevector would be great in the future (especially if it is CV), but there are well-known strategies for distributing aspects of this task (e.g. k-point parallelization). I would very much appreciate some examples of how to pass (frozen) segments of the full statevector, or partially contracted observables, to other instances of PennyLane efficiently in an HPC environment, just to estimate how much difference a more "detailed" model Hamiltonian running on a QPU will make to the overall DFT calculation. At the moment I am using mpi4py, but I am not an expert in optimizing communications or Python in HPC. If PennyLane or a module already has an efficient implementation of this, that would save us considerable resources.
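For context, a minimal sketch of the kind of pattern I have in mind with mpi4py: splitting the terms of an observable across ranks, evaluating each share on a local device, and reducing the partial sums. The Hamiltonian, circuit, and device name are illustrative placeholders, not an official PennyLane API.

```python
from mpi4py import MPI
import numpy as np
import pennylane as qml

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_wires = 4
# Illustrative Hamiltonian: 8 (coefficient, observable) pairs.
coeffs = np.arange(1.0, 9.0)
obs = [qml.PauliZ(i % n_wires) for i in range(8)]

# Strided slice of the terms for this rank
# (assumes the number of ranks does not exceed the number of terms).
my_coeffs = coeffs[rank::size]
my_obs = obs[rank::size]

dev = qml.device("lightning.qubit", wires=n_wires)  # or "lightning.gpu" per node

@qml.qnode(dev)
def partial_expvals(weights):
    qml.BasicEntanglerLayers(weights, wires=range(n_wires))
    return [qml.expval(o) for o in my_obs]

weights = np.ones((2, n_wires))
vals = np.asarray(partial_expvals(weights)).reshape(-1)
local = float(np.dot(my_coeffs, vals))

# Sum the partially contracted expectation values on rank 0.
total = comm.reduce(local, op=MPI.SUM, root=0)
if rank == 0:
    print("<H> =", total)
```

This would be launched in the usual way, e.g. `mpirun -n 4 python script.py`, with one device per rank.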
Hi @osbama The paper is here and the example code is https://github.com/XanaduAI/randomized-measurements-circuit-cutting. This may not be a perfect match for your intentions, but it should help to define the needs for distributing the components. It is possible to use mpi4py for this with a little less overhead, but a little more code. However, letting Ray/Dask handle the runtime and distribution of the components allowed us to concentrate on the problem itself, without too much concern for the environment it ran in.
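As a rough illustration of that pattern, here is a minimal sketch of farming out independent circuit evaluations with Dask. The fragment circuit below is an illustrative stand-in for a cut sub-circuit, and it assumes a Dask scheduler is reachable (or a local cluster is acceptable); it is not the exact workflow from the repository above.

```python
from dask.distributed import Client
import numpy as np
import pennylane as qml

def run_fragment(angle):
    # Each task builds its own device, so the work stays self-contained.
    dev = qml.device("lightning.qubit", wires=2)  # "lightning.gpu" on GPU nodes

    @qml.qnode(dev)
    def fragment():
        qml.RY(angle, wires=0)
        qml.CNOT(wires=[0, 1])
        return qml.expval(qml.PauliZ(1))

    return float(fragment())

if __name__ == "__main__":
    # Connect to an existing Dask scheduler, or start a local cluster.
    client = Client()
    angles = np.linspace(0.0, np.pi, 8)
    futures = client.map(run_fragment, list(angles))
    print(client.gather(futures))
```

The same structure works with Ray by swapping the Dask client for remote tasks; in either case the scheduler, not your code, decides which node runs each fragment.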
Issue description
A clarification, and preferably an example, of whether lightning.gpu has cuQuantum's multi-node capability. Can I access multiple GPUs spanning multiple nodes on an HPC system? If so, how?
Additional information
cuQuantum seems to have a multi-node implementation, and the scaling looks quite good. Can I use PennyLane like this?
https://developer.nvidia.com/blog/best-in-class-quantum-circuit-simulation-at-scale-with-nvidia-cuquantum-appliance/