Optimizer offloading through weight-only offload #867
base: main
Conversation
Nice!
axlearn/common/optimizers.py (outdated)
Only wrap the optimizer that you actually want to offload with this function, to avoid unnecessary overhead. This is usually the optimizer that occupies the most HBM. For example, when you have chained optimizers:
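For illustration, a hedged sketch of what such a chained setup might look like. All names here are assumptions: `offload_optimizer` stands in for the wrapper added by this PR, and the config constructors may not match the actual axlearn API.

```python
# Hypothetical usage sketch; these names are assumptions and may not match
# the axlearn config API exactly.
from axlearn.common.optimizers import adamw_optimizer, chain, clip_by_global_norm

optimizer = chain(
    clip_by_global_norm(max_norm=1.0),  # negligible state; keep on HBM
    offload_optimizer(                  # Adam moments dominate HBM usage
        adamw_optimizer(learning_rate=1e-3, b1=0.9, b2=0.95, weight_decay=1e-4)
    ),
)
```

Wrapping only the inner AdamW keeps the small clipping state on device, so only the large first/second-moment buffers pay the transfer cost.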
Where does the overhead come from? Is it from the states of clip_by_global_norm
being offloaded? If so, could we use regular expressions to specify which states to offload?
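A purely hypothetical sketch of the regex suggestion above (no such helper or API exists in this PR; it only illustrates the idea):

```python
import re

# Hypothetical: decide per state path whether to offload, so that e.g. only
# Adam's mu/nu moments go to host memory while small states such as
# clip_by_global_norm counters stay on device.
def should_offload(state_path: str, offload_regex: str = r".*/(mu|nu)") -> bool:
    return re.fullmatch(offload_regex, state_path) is not None
```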
A question about device_put...

Before the optimizer can be invoked, the offloaded optimizer states need to be transferred to device memory space. If we remove these device_put calls, we will get errors.
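A minimal sketch of the transfer pattern being described, using jax memory kinds (this assumes a jax version with memory-kind support, per the version constraints below; the shape and single-device sharding are illustrative only):

```python
import jax
import jax.numpy as jnp

dev = jax.devices()[0]
pinned_host = jax.sharding.SingleDeviceSharding(dev, memory_kind="pinned_host")
hbm = jax.sharding.SingleDeviceSharding(dev, memory_kind="device")

# Optimizer state lives in CPU pinned memory between steps.
state = jax.device_put(jnp.zeros((1024, 1024)), pinned_host)

state_on_hbm = jax.device_put(state, hbm)  # host -> HBM before the update
# ... compute the optimizer update against state_on_hbm ...
state = jax.device_put(state_on_hbm, pinned_host)  # HBM -> host afterwards
```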
Thanks for the clarification on the device_put calls. Could you add a comment on why it's necessary? Also, two suggestions...
Force-pushed from 3adee99 to 5ea7bb4
@markblee Can you take a look at the pytype errors in CI? Should I change the Nested type to include tuple/namedtuple, or should I just ignore the errors?
Maybe we can relax the partition fn return type to include named tuples? Changing Nested may have undesirable impact elsewhere.
Thanks. I included the return type in Nested[...]. Just NamedTuple wouldn't work.
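For context, a hedged sketch of the kind of typing change being discussed, against a simplified stand-in for Nested (axlearn's actual definition differs):

```python
from typing import Mapping, Tuple, TypeVar, Union

T = TypeVar("T")
# Simplified illustration: widen the union behind Nested with tuples
# (NamedTuple instances are tuples) so partition fns may return them.
Nested = Union[T, Tuple["Nested[T]", ...], Mapping[str, "Nested[T]"]]
```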
This PR requires jax >= 0.4.34 and != 0.4.35: it works on jax 0.4.34, but is broken on jax 0.4.35 due to a libtpu bug. It also worked on nightly jax 0.4.36 as of 10/30.
This PR represents an effort to enable optimizer offloading. The approach used here is weight-only offloading, built on the same building blocks as activation offloading (aka remat offload). When offloading is enabled, optimizer states are stored in CPU pinned memory. Before the optimizer is applied to compute updates, its states are moved from CPU memory to HBM via jax.device_put, and the new optimizer states are moved back from HBM to CPU afterwards.

An alternative to this PR is host computation, meaning that the optimizer transformations themselves are computed on CPU: gradients and weights are transferred to CPU before the computation, and their new values are transferred back to HBM afterwards. That method has a lower HBM footprint, but it is 2x~3x slower due to slow CPU computation, and it is also quite buggy.
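To make the approach concrete, a minimal sketch in plain optax terms, not the PR's actual implementation (axlearn uses its own partitioned gradient transformations); `pinned_host` and `hbm` are shardings with the corresponding memory kinds, as in the device_put sketch above:

```python
import jax
import optax

def offload_optimizer(inner: optax.GradientTransformation,
                      pinned_host, hbm) -> optax.GradientTransformation:
    """Keeps `inner`'s states in host memory between steps (illustrative)."""

    def init(params):
        # Initialize on device, then park the states in CPU pinned memory.
        return jax.device_put(inner.init(params), pinned_host)

    def update(updates, state, params=None):
        state = jax.device_put(state, hbm)  # host -> HBM before the update
        updates, state = inner.update(updates, state, params)
        return updates, jax.device_put(state, pinned_host)  # HBM -> host

    return optax.GradientTransformation(init, update)
```

Under this scheme only the transfers leave HBM, not the optimizer math itself, which is why it avoids the CPU-compute slowdown of the host-computation alternative.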
TLDR: to be merged after upgrading jax to 0.4.36.