
Multi-GPU Question #18

Open

adam-hartshorne opened this issue Oct 27, 2024 · 7 comments

Comments

@adam-hartshorne

I notice you have a stable branch for multi-GPU testing. I was just wondering whether torch2jax actually works out of the box with what I believe is now the standard JAX multi-GPU paradigm of sharding, i.e.

https://jax.readthedocs.io/en/latest/notebooks/Distributed_arrays_and_automatic_parallelization.html
https://jax.readthedocs.io/en/latest/notebooks/shard_map.html

@rdyro
Owner

rdyro commented Oct 28, 2024

The multi-GPU support there refers to correctly calling functions on a particular GPU, but unfortunately not to multi-GPU sharded arrays (the paradigm from JAX). I have to learn more about sharding in torch before I can work out how to support a function over sharded arrays on the torch side.

I'm currently using the NVIDIA C++ functionality for detecting which GPU the data is on, so as long as torch2jax is called from within shard_map (exactly!), it should hopefully work correctly. I'm planning to test this in the coming days. (I'll leave the issue open until I can test it.)
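
For reference, a minimal sketch of the pattern described above (calling the wrapped function once per shard via shard_map). The `jax_fn` here is a pure-JAX stand-in for a torch2jax-wrapped function, so the example runs without torch; the actual wrapping step follows the torch2jax README and is not shown.

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

# stand-in for a JAX-callable produced by torch2jax from a torch function
def jax_fn(x):
    return jnp.tanh(x)

# one mesh axis spanning all local devices
mesh = Mesh(jax.devices(), axis_names=("gpu",))

# each device receives its own shard; jax_fn is invoked once per shard
sharded_fn = shard_map(jax_fn, mesh=mesh, in_specs=P("gpu"), out_specs=P("gpu"))

x = jnp.ones((8 * len(jax.devices()), 16))
y = jax.jit(sharded_fn)(x)
print(y.sharding)  # sharded along the first axis across the "gpu" mesh axis
```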

Tangentially, this weekend, I finished porting torch2jax (in the new-ffi branch) to the new FFI interface, so long-term support should be assured now.

@adam-hartshorne
Author

Great job on getting the FFI interface working. I just tried installing from that branch and doing a fresh recompile on one of my use cases, and everything seems to work seamlessly.

@rdyro
Owner

rdyro commented Nov 1, 2024

Awesome! I'll try to switch over permanently this weekend.

@adam-hartshorne
Author

Just wondering what the state of this is now. I haven't done much JAX-based multi-GPU work, but would something like this work if a torch2jax function were called, say, in the loss function?

https://docs.kidger.site/equinox/examples/parallelism/

@rdyro
Owner

rdyro commented Dec 30, 2024

Good question! I believe it should work on multiple devices when torch2jax is called on each shard (i.e., the sharding approach to parallelism).

pmap is the old way of doing device-level parallelism, but its (rough) successor shard_map should work out of the box. One issue might be performance: each shard could end up running sequentially instead of in parallel. I'm investigating this and will provide a code example soon.
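
A hedged sketch of what a sharded loss along these lines could look like. Assumptions: `jax_fn` is again a pure-JAX stand-in for a torch2jax-wrapped model call (differentiating through a real torch2jax call additionally requires its gradient wrapper), and the batch is sharded along a "batch" mesh axis.

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

def jax_fn(params, x):  # stand-in for a torch2jax-wrapped model call
    return jnp.tanh(x @ params)

mesh = Mesh(jax.devices(), axis_names=("batch",))

@jax.jit
def loss(params, x, y):
    def per_shard(params, x, y):
        pred = jax_fn(params, x)
        local = jnp.mean((pred - y) ** 2)          # mean over the local shard
        return jax.lax.pmean(local, axis_name="batch")  # average across devices

    return shard_map(
        per_shard, mesh=mesh,
        in_specs=(P(), P("batch"), P("batch")),  # params replicated, data sharded
        out_specs=P(),                            # scalar loss, replicated
    )(params, x, y)

params = jnp.ones((16, 1))
n = 8 * len(jax.devices())
x, y = jnp.ones((n, 16)), jnp.zeros((n, 1))
grads = jax.grad(loss)(params, x, y)
```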

@adam-hartshorne
Author

I got an MWE up and running, and performance appeared to be worse than just running it on a single GPU.

@rdyro
Owner

rdyro commented Jan 21, 2025

I've had some time this weekend to implement experimental (performant) multi-GPU support. Can you take a look here: https://github.com/rdyro/torch2jax?tab=readme-ov-file#new-performant-multi-gpu-experimental ?

The biggest takeaway is that pmap doesn't work, but shard_map should work great. Let me know if you want to take a look at the new implementation (main branch) and have any feedback!

Generally, for performance investigations, TensorBoard traces (like here: https://jax.readthedocs.io/en/latest/profiling.html ) are quite useful; if you had those, I'd be happy to take a look too.
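
A short sketch of capturing such a trace, following the linked JAX profiling docs; the traced function and the log directory are placeholders.

```python
import jax
import jax.numpy as jnp

@jax.jit
def step(x):  # placeholder for the real multi-GPU workload
    return jnp.sin(x).sum()

x = jnp.ones((1024, 1024))
# writes a trace viewable with: tensorboard --logdir /tmp/jax-trace
with jax.profiler.trace("/tmp/jax-trace"):
    step(x).block_until_ready()
```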
