
cuQuantum multi-node example #91

osbama opened this issue Feb 9, 2023 · 3 comments

Comments


osbama commented Feb 9, 2023

Issue description

Clarification, and preferably an example, regarding whether lightning.gpu has cuQuantum multi-node capability. Can I access multiple GPUs spanning multiple nodes on an HPC system? How?

Additional information

cuQuantum appears to provide a multi-node implementation, and the scaling looks quite good. Can I use PennyLane like this?

https://developer.nvidia.com/blog/best-in-class-quantum-circuit-simulation-at-scale-with-nvidia-cuquantum-appliance/


mlxd commented Feb 9, 2023

Hi @osbama
We do not currently support cuStateVec's multi-node capabilities for a single state-vector computation.

lightning.gpu supports batched gradient evaluation for multiple observables over the GPUs on a given node (see the "Parallel adjoint differentiation support" section of the lightning.gpu documentation).
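
As a rough sketch (assuming a node with several visible GPUs and the pennylane-lightning-gpu plugin installed, and a hypothetical circuit and Hamiltonian), this is switched on through the device's `batch_obs` keyword together with adjoint differentiation:

```python
import pennylane as qml
from pennylane import numpy as np

n_wires = 12

# batch_obs=True asks lightning.gpu to split the observable terms of the
# adjoint-differentiation pass over the GPUs visible on this node.
dev = qml.device("lightning.gpu", wires=n_wires, batch_obs=True)

# A sum of Pauli terms; batches of terms can be handled on different GPUs.
obs = [qml.PauliZ(i) @ qml.PauliZ(i + 1) for i in range(n_wires - 1)]
H = qml.Hamiltonian([1.0] * len(obs), obs)

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    for i in range(n_wires):
        qml.RY(params[i], wires=i)
    for i in range(n_wires - 1):
        qml.CNOT(wires=[i, i + 1])
    return qml.expval(H)

params = np.random.uniform(0, np.pi, n_wires, requires_grad=True)
grad = qml.grad(circuit)(params)
```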

We have run distributed multi-GPU computations as part of circuit-cutting workloads, where a given high-qubit-count state-vector problem is run over many GPUs (in this case, 128). See our paper for more information on how we did this for QAOA optimization problems, or this talk for how it was run on NERSC's Perlmutter supercomputer.
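
For reference, the single-process part of that workflow can be sketched with the qml.cut_circuit transform (a minimal example with a hand-placed qml.WireCut; the distribution of the resulting fragment executions over many GPUs is handled separately, as in the paper and talk above):

```python
import pennylane as qml

# A 2-wire device executes the fragments of a 3-wire circuit.
dev = qml.device("default.qubit", wires=2)

@qml.cut_circuit          # cuts the tape at the WireCut and recombines the results
@qml.qnode(dev)
def circuit(x):
    qml.RX(x, wires=0)
    qml.RY(0.9, wires=1)
    qml.RX(0.3, wires=2)
    qml.CZ(wires=[0, 1])
    qml.WireCut(wires=1)  # marks where the circuit is split into fragments
    qml.CZ(wires=[1, 2])
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(2))

print(circuit(0.5))
```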

For a single distributed state vector, we do plan to add this support natively to PennyLane in future quarters. In addition, for hybrid classical-quantum distributed work, we have a demonstration that ran on AWS Braket.

If you have a specific workload in mind that is not a single distributed state vector, we may be able to offer some suggestions on how to approach it with the existing tooling.


osbama commented Feb 9, 2023

Thank you very much for the detailed answer. The references will be extremely useful.

We are working towards implementing an extended-Hubbard-like correction to standard density functional theory kernels using a QPU and machine learning (at the moment we are exploring classical shadows).

A distributed state vector would be great in the future (especially if it is CV), but there are well-known strategies for distributing aspects of this task (e.g. k-point parallelization). I would very much appreciate it if you could provide some examples where I can pass (frozen) segments of the full state vector, or perhaps partially contracted observables, to other instances of PennyLane efficiently in an HPC environment, just to estimate how much difference a more "detailed" model Hamiltonian running on a QPU will make to the overall DFT calculation.

At the moment I am using mpi4py; however, I am not an expert in optimizing the communications, or Python in HPC settings. If PennyLane or a module already has an efficient implementation of this, it will save us considerable resources.
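
Roughly the kind of pattern I mean, as a simplified sketch (hypothetical circuit and observables; each MPI rank builds its own device, evaluates a slice of the observable terms, and the partial sums are reduced at the end):

```python
# Launched with e.g.: mpirun -np 4 python distribute_terms.py
from mpi4py import MPI
import pennylane as qml
from pennylane import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_wires = 8
terms = [qml.PauliZ(i) @ qml.PauliZ(i + 1) for i in range(n_wires - 1)]
coeffs = np.ones(len(terms))

# Round-robin split of the observable terms over the ranks.
my_ids = range(rank, len(terms), size)

# One device per rank; in practice each rank would be pinned to its own GPU,
# e.g. via CUDA_VISIBLE_DEVICES in the job script.
dev = qml.device("lightning.gpu", wires=n_wires)

@qml.qnode(dev)
def expval_term(params, obs):
    for i in range(n_wires):
        qml.RY(params[i], wires=i)
    for i in range(n_wires - 1):
        qml.CNOT(wires=[i, i + 1])
    return qml.expval(obs)

params = np.linspace(0.1, 1.0, n_wires)
local = sum(float(coeffs[i]) * float(expval_term(params, terms[i])) for i in my_ids)

# Reduce the per-rank partial sums onto the root rank.
total = comm.reduce(local, op=MPI.SUM, root=0)
if rank == 0:
    print("total expectation value:", total)
```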


mlxd commented Feb 14, 2023

Hi @osbama
We have had good success using both Ray and Dask Distributed/Dask-CUDA for these task-based workloads. For example, we used circuit cutting (a TN + quantum circuit hybrid) with parameter-shift to partition a large problem space into smaller qubit chunks that fit on an A100 GPU, and ran these chunks concurrently; in our case we used 128 GPUs on NERSC's Perlmutter supercomputer.

The paper is here and the example code is at https://github.com/XanaduAI/randomized-measurements-circuit-cutting. This may not be a perfect match for your intentions, but it should help to define the needs for distributing the components.

It is possible to use mpi4py for this with a little less overhead, but a little more code. However, letting Ray/Dask handle the runtime and the distribution of the components allowed us to concentrate on the problem itself, without too much concern about the environment it ran in.
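
As an illustrative sketch of the Ray pattern (hypothetical circuits standing in for the cut fragments; each remote task is given its own GPU and its own device, and the driver collects the results):

```python
import ray
import pennylane as qml
from pennylane import numpy as np

ray.init()  # or ray.init(address="auto") to attach to an existing cluster

@ray.remote(num_gpus=1)  # Ray schedules each task onto a dedicated GPU
def run_fragment(params, n_wires):
    # The device is created inside the task so nothing GPU-bound is serialized.
    dev = qml.device("lightning.gpu", wires=n_wires)

    @qml.qnode(dev)
    def fragment(p):
        for i in range(n_wires):
            qml.RY(p[i], wires=i)
        for i in range(n_wires - 1):
            qml.CNOT(wires=[i, i + 1])
        return qml.expval(qml.PauliZ(0))

    return float(fragment(params))

n_wires = 10
param_sets = [np.random.uniform(0, np.pi, n_wires) for _ in range(8)]

# Launch all fragment evaluations concurrently; ray.get blocks until all finish.
futures = [run_fragment.remote(p, n_wires) for p in param_sets]
results = ray.get(futures)
print(results)
```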
