CPU Offload for experts in Runtime #201

mryab · 2021-03-29T07:34:50Z

Given that the server only runs one expert at a time, it should be possible to keep many experts in CPU memory and, as a result, process larger batches in a single step. Since we know the upcoming request queue, we can also prefetch the experts that will be needed next to minimize latency.
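To make the prefetching idea concrete, here is a minimal pure-Python sketch of the scheduling logic only. Everything in it is hypothetical (`ExpertPool`, `gpu_slots`, `run_queue` are illustration names, not part of the existing Runtime), and the "transfers" are just log entries: a real version would issue actual host-to-device copies, ideally asynchronously.

```python
from collections import OrderedDict, deque


class ExpertPool:
    """Toy model: all experts live in CPU memory, at most `gpu_slots` are on the GPU."""

    def __init__(self, expert_names, gpu_slots=2):
        if gpu_slots < 2:
            raise ValueError("need one slot for the running expert and one for the prefetched one")
        self.on_cpu = set(expert_names)
        self.on_gpu = OrderedDict()  # expert name -> True, kept in least-recently-used order
        self.gpu_slots = gpu_slots
        self.transfers = []  # log of ("load" | "offload", name)
        self.stalls = 0  # requests whose expert was not on the GPU when needed

    def _load(self, name, keep=()):
        if name in self.on_gpu:
            self.on_gpu.move_to_end(name)  # mark as recently used
            return
        while len(self.on_gpu) >= self.gpu_slots:
            # Evict the least recently used expert that we are not told to keep.
            victim = next(n for n in self.on_gpu if n not in keep)
            del self.on_gpu[victim]
            self.on_cpu.add(victim)
            self.transfers.append(("offload", victim))
        self.on_cpu.discard(name)
        self.on_gpu[name] = True
        self.transfers.append(("load", name))

    def run_queue(self, requests):
        """Process requests in order, prefetching the next expert before each step."""
        queue = deque(requests)
        processed = []
        while queue:
            current = queue.popleft()
            if current not in self.on_gpu:
                self.stalls += 1  # we would have to wait for this copy
            self._load(current)
            if queue:
                # The queue is known in advance, so start loading the next expert now.
                self._load(queue[0], keep=(current,))
            processed.append(current)  # stand-in for running the expert on its batch
        return processed


pool = ExpertPool(["a", "b", "c"], gpu_slots=2)
pool.run_queue(["a", "b", "a", "c"])
print(pool.transfers, pool.stalls)
```

In this toy run only the very first request stalls; every later expert is already resident by the time its request comes up. Whether that holds in practice depends on the copy being faster than the compute step it overlaps with.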

justheuristic · 2021-04-02T01:02:35Z

As a side note: we could probably also load experts onto multiple GPUs and process them concurrently.

justheuristic · 2021-07-08T15:02:07Z

TIL: torch has a great mechanism for asynchronous offloading: CUDA streams, used via the torch.cuda.stream context manager.

Here's an example of how it works:
https://github.com/facebookresearch/fairscale/blob/8d82db43eca3c6d88f02c60bce5ba80177d2cf12/fairscale/experimental/nn/offload.py#L128-L129
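For reference, a rough sketch of how the same stream pattern could apply to our case: copy the next expert to the GPU on a side stream while the current one is computing. This is not the fairscale code and not an existing Runtime API; `run_with_prefetch` is a hypothetical helper, each expert is assumed to be a plain `nn.Module` used once, and without CUDA it falls back to running on CPU with no overlap.

```python
import torch
import torch.nn as nn


def run_with_prefetch(experts, batches):
    """Run experts one at a time, copying the next expert to the GPU meanwhile."""
    if not torch.cuda.is_available():
        # CPU-only fallback so the sketch still runs; nothing is overlapped here.
        return [expert(batch) for expert, batch in zip(experts, batches)]

    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()  # side stream used only for host-to-device copies

    # Non-blocking copies are only asynchronous when the source is in pinned memory.
    for expert in experts:
        for param in expert.parameters():
            param.data = param.data.pin_memory()

    outputs = []
    experts[0].to(device)  # the first expert has nothing to overlap with
    for i, (expert, batch) in enumerate(zip(experts, batches)):
        # Make the compute stream wait until this expert's copy has finished.
        torch.cuda.current_stream().wait_stream(copy_stream)
        if i + 1 < len(experts):
            # Enqueue the copy of the next expert; it runs alongside the compute below.
            with torch.cuda.stream(copy_stream):
                experts[i + 1].to(device, non_blocking=True)
        outputs.append(expert(batch.to(device)).cpu())
        expert.to("cpu")  # offload again to free GPU memory
    return outputs


if __name__ == "__main__":
    experts = [nn.Linear(4, 4) for _ in range(3)]
    batches = [torch.randn(2, 4) for _ in range(3)]
    with torch.no_grad():
        print([out.shape for out in run_with_prefetch(experts, batches)])
```

Note that moving an expert back with `.to("cpu")` gives ordinary (unpinned) memory, so an expert that is reused would need to be pinned again or kept in a persistent pinned buffer.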

mryab added the enhancement New feature or request label Mar 29, 2021

mryab self-assigned this Mar 29, 2021

mryab changed the title ~~CPU Offload for experts~~ CPU Offload for experts in Runtime Mar 29, 2021
