This README contains a brief description of the whole project. The READMEs in the cpu-inference and gpu-inference subfolders contain the details of the experiments run on each type of hardware.
Large Language Models for Code have rapidly emerged as powerful assistants for writing and editing code. In this project, we focus on optimizing inference for code-generation LLMs, with the goal of enabling efficient inference so that programmers can run these models on a single GPU or even on a CPU. On the GPU side, we used Flash Attention 2, which speeds up the attention computation, the most time-intensive component of inference. On the CPU side, we used llama.cpp with GGML-format models to perform efficient inference.
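As a minimal sketch of the GPU setup, Flash Attention 2 can be enabled when loading the model with Hugging Face transformers. The checkpoint name, dtype, and prompt below are illustrative assumptions, not necessarily the exact configuration used in our experiments:

```python
# Minimal sketch: loading StarCoder2 with Flash Attention 2 via Hugging Face transformers.
# The model ID, dtype, and prompt are illustrative assumptions, not this project's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-7b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # half-precision weights to fit on a single GPU
    attn_implementation="flash_attention_2",  # use the Flash Attention 2 kernels
    device_map="auto",
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```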
Models Used: Starcoder2 7B
Evaluation Benchmark: HumanEval
In the first half of the experiments, we benchmark inference speed on Starcoder2 with and without Flash Attention 2. We then benchmark quantized versions of Starcoder2 (quantization is performed with bitsandbytes and GPTQ) and evaluate these quantized models using perplexity and pass@1 on HumanEval as metrics. A sketch of the bitsandbytes loading step follows below; more information about the methodology and results is in the gpu-inference README.
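The following sketch shows one way to load a 4-bit bitsandbytes-quantized model for the quantized-GPU benchmarks; the specific quantization settings shown are assumptions, not necessarily the ones reported in the gpu-inference README:

```python
# Minimal sketch: 4-bit quantized loading of StarCoder2 with bitsandbytes.
# The model ID and quantization settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bigcode/starcoder2-7b"  # assumed checkpoint name
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # store weights in 4-bit precision
    bnb_4bit_quant_type="nf4",                # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,    # compute in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```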
In these experiments, we first performed inference in a torch CPU environment with JIT optimization. We then used the GGML model format to run inference with llama.cpp, which gave a large performance improvement. We further improved inference speed with quantization. See the cpu-inference README for further details.
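A minimal sketch of CPU inference through the llama-cpp-python bindings to llama.cpp is shown below; the GGUF file name, thread count, and sampling parameters are assumptions, and the cpu-inference README describes the configurations actually benchmarked:

```python
# Minimal sketch: CPU inference with llama.cpp via the llama-cpp-python bindings.
# The GGUF path and parameters are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="starcoder2-7b-Q4_K_M.gguf",  # assumed local path to a quantized GGUF file
    n_ctx=2048,    # context window
    n_threads=8,   # CPU threads used for inference
)

output = llm("def fibonacci(n):", max_tokens=64, temperature=0.2)
print(output["choices"][0]["text"])
```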
GGML models: https://huggingface.co/Mooizz/starcoder2-gguf