sequence level processing -> batch level processing (#62)
### sequence level processing -> batch level processing

In this PR the code for preparing the input tensors for the AIU is completely rewritten, based on the assumption that we have to finish the current decoding on the AIU before doing another prefill.

Changes:

* [rewriting](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/cea122c220b18e3de3dce95faa5e03fe3efe0835) `sendnn_model_runner.py`, `sendnn_worker.py`, and `sendnn.py` based on the above constraint
* [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/6869231d83734d3c03ffd15bc6754c1857d063cc) the class variable `self._padded_batch_size`, since a different solution was implemented
* [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/ff9ebf6923fd9ac6c99e64dfffc7763f6c194399) the unused `input_block_ids`, since the AIU does not support paged attention yet
* [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/a6d63899bf3d9fae59edde414b8bd2a3c56bc8c7) some unused function arguments in model loading
* [removing](https://github.ibm.com/ai-foundation/vllm/pull/62/commits/4527300ee9be4dd1fb76007fb6e0862b97d51676) the unused function `_get_model_architecture()` and the global variable `_SENDNN_SUPPORTED_MODELS`

The code has been tested in client/server mode with the `llama 194m` and `granite 3b` models on `AIU` and `CPU`.
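The core constraint ("finish the current decoding before doing another prefill") can be illustrated with a minimal scheduling sketch. All names below are hypothetical and purely illustrative, not actual vLLM or AIU internals: the point is only that once a batch is prefilled, it is decoded to completion before any newly arrived sequence gets a prefill step.

```python
from collections import deque

class StaticBatchScheduler:
    """Batch-level processing sketch: a new prefill is only scheduled
    once the current decode batch has fully finished (illustrative only)."""

    def __init__(self):
        self.waiting = deque()  # sequences not yet prefilled
        self.running = []       # the current decode batch

    def add(self, seq_id):
        self.waiting.append(seq_id)

    def step(self):
        # Decoding takes priority: no new prefill while a batch is in flight.
        if self.running:
            return ("decode", list(self.running))
        if self.waiting:
            # Current batch is done: prefill everything that is waiting.
            self.running = [self.waiting.popleft() for _ in range(len(self.waiting))]
            return ("prefill", list(self.running))
        return ("idle", [])

    def finish(self, seq_id):
        # Called when a sequence hits EOS or its max length.
        self.running.remove(seq_id)
```

A sequence that arrives while a batch is decoding simply waits in `self.waiting` until the whole batch finishes, which is what distinguishes this batch-level scheme from sequence-level processing, where a prefill could be interleaved per sequence.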