[RFC] Integration of Distributed Inference into TorchChat #1376
🚀 The feature, motivation and pitch
Overview
The goal of this RFC is to discuss the integration of distributed inference into TorchChat. Distributed inference uses tensor parallelism, pipeline parallelism, or a combination of both to support larger models that do not fit on a single accelerator. With parallelization, each model shard runs in its own worker process. The processes can be spawned either at the script level (e.g. via torchrun) or from within the main script. For online use cases like chat/server, the processes need to coordinate fetching and sharing the user input, and how they do so depends on the point at which they are spawned. Synchronization points between the processes should be minimized for optimal performance.
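To make the input-sharing requirement concrete, here is a minimal sketch of one possible coordination pattern: rank 0 fetches the user input once and forwards it to every other rank, which blocks until it arrives. Everything here is an assumption for illustration. The name run_workers is hypothetical, and threads with queue.Queue stand in for worker processes so the sketch stays self-contained; a real integration would more likely rely on torch.distributed collectives.

```python
import queue
import threading


def run_workers(prompt, world_size=2):
    """Share one prompt from rank 0 with all other ranks (toy stand-in)."""
    # One inbox per rank; threads stand in for worker processes here.
    inboxes = [queue.Queue() for _ in range(world_size)]
    results = queue.Queue()

    def worker(rank):
        if rank == 0:
            # Only rank 0 talks to the user; the prompt is passed in directly.
            text = prompt
            for other in range(1, world_size):
                # One message per peer is the only synchronization point.
                inboxes[other].put(text)
        else:
            # Block until rank 0 has shared the user input.
            text = inboxes[rank].get()
        results.put((rank, text))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Collect what each rank ended up seeing.
    return dict(results.get() for _ in range(world_size))
```

Every rank ends up with the same prompt after a single hand-off per request, which keeps the number of synchronization points small.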
The design goals of the integration are:
Alternatives
Option 1: Integrate at Model Level
While the usage of a tensor parallel model in PyTorch is largely transparent, the current pipeline parallel API differs significantly from the usage of a local model. This option hides distributed inference from the Generator class by placing it inside a torchchat.model.Model derivative. The DistributedModel(torchchat.model.Model) class would implement methods like __call__() and forward() and handle distribution to the worker processes internally.
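A rough sketch of what Option 1 could look like, under heavy assumptions: the Model class below is a minimal stand-in for torchchat.model.Model, not its real interface, and each "stage" is a plain callable standing in for a model shard owned by a worker process. The point is only that the calling code cannot tell a local model from a distributed one.

```python
class Model:
    """Hypothetical minimal stand-in for torchchat.model.Model."""

    def forward(self, tokens):
        raise NotImplementedError

    def __call__(self, tokens):
        return self.forward(tokens)


class DistributedModel(Model):
    """Hides the shards behind the ordinary Model interface."""

    def __init__(self, stages):
        # Each stage stands in for a shard living in its own worker process.
        self.stages = list(stages)

    def forward(self, tokens):
        activations = tokens
        for stage in self.stages:
            # A real implementation would send/receive across processes here.
            activations = stage(activations)
        return activations


def generate_step(model, tokens):
    # Generator-side code: unaware of whether the model is distributed.
    return model(tokens)
```

For example, `generate_step(DistributedModel([stage_a, stage_b]), tokens)` reads exactly like a call on a local model, which is the property this option is after.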
Option 2: Abstract Base Class for Generator
Introduce a base class Generator that contains the common portions of the generation process, such as getting and preparing input from the user. LocalGenerator and DistributedGenerator are introduced to handle the specifics. The split between base class and derivatives can be made at multiple levels, specifically High: Generator.generate, Mid: Generator.decode_n_tokens/prefill, Low: Generator.decode_one_token/prefill.
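As an illustration of the "Low" split, the sketch below keeps the generation loop in an abstract base class and leaves only prefill and decode_one_token to the subclasses. The method signatures and the toy token arithmetic are assumptions made for this example and do not reflect the actual torchchat Generator; the "shards" are plain callables whose partial results are summed.

```python
from abc import ABC, abstractmethod


class Generator(ABC):
    """Shared generation loop; subclasses supply the low-level steps."""

    def generate(self, prompt_tokens, max_new_tokens):
        tokens = list(prompt_tokens)
        if max_new_tokens <= 0:
            return tokens
        # Prefill consumes the prompt and yields the first new token.
        tokens.append(self.prefill(tokens))
        for _ in range(max_new_tokens - 1):
            tokens.append(self.decode_one_token(tokens))
        return tokens

    @abstractmethod
    def prefill(self, tokens): ...

    @abstractmethod
    def decode_one_token(self, tokens): ...


class LocalGenerator(Generator):
    """Toy local model: the next token is the last token plus one."""

    def prefill(self, tokens):
        return tokens[-1] + 1

    def decode_one_token(self, tokens):
        return tokens[-1] + 1


class DistributedGenerator(Generator):
    """Toy distributed model: partial results from shards are summed."""

    def __init__(self, shards):
        self.shards = list(shards)

    def _combine(self, tokens):
        # Stand-in for a cross-process reduction over shard outputs.
        return sum(shard(tokens) for shard in self.shards)

    def prefill(self, tokens):
        return self._combine(tokens)

    def decode_one_token(self, tokens):
        return self._combine(tokens)
```

Moving the split to the Mid or High level would mean making decode_n_tokens or generate itself abstract, at the cost of duplicating more of the loop in each subclass.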
Option 2b: Integrate at Low Level of Generator without base class
This approach skips the creation of a base class: DistributedGenerator(Generator) inherits directly from the existing Generator and adds the functionality for distributed inference in the main generate.py file.
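For comparison, a minimal sketch of Option 2b, again with assumed signatures and toy logic rather than the real generate.py code: the concrete Generator keeps its local behavior, and DistributedGenerator overrides only the low-level hooks while reusing the inherited loop.

```python
class Generator:
    """Stand-in for the existing local Generator (toy logic)."""

    def prefill(self, tokens):
        return tokens[-1] + 1

    def decode_one_token(self, tokens):
        return tokens[-1] + 1

    def generate(self, prompt_tokens, max_new_tokens):
        tokens = list(prompt_tokens)
        if max_new_tokens <= 0:
            return tokens
        tokens.append(self.prefill(tokens))
        for _ in range(max_new_tokens - 1):
            tokens.append(self.decode_one_token(tokens))
        return tokens


class DistributedGenerator(Generator):
    """Overrides only the low-level steps; generate() is inherited."""

    def __init__(self, shards):
        # Each shard is a callable standing in for one pipeline stage.
        self.shards = list(shards)

    def _run_shards(self, tokens):
        value = tokens[-1]
        for shard in self.shards:
            # A real implementation would hand off between worker processes.
            value = shard(value)
        return value

    def prefill(self, tokens):
        return self._run_shards(tokens)

    def decode_one_token(self, tokens):
        return self._run_shards(tokens)
```

The trade-off relative to Option 2 is less restructuring up front, while the local code path remains the implicit default that the distributed class has to keep in sync with.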
cc @Jack-Khuu @byjlw @lessw2020