This README contains a brief description of the whole project. The READMEs in the cpu-inference and gpu-inference subfolders contain the details of the experiments run on each type of hardware.
Large Language Models for Code have rapidly emerged as powerful assistants for writing and editing code. In this project, we focus on optimizing inference for code-generation LLMs, with the goal of enabling efficient inference so that programmers can run these models on a single GPU or even on a CPU. On the GPU side, we used Flash Attention 2, which speeds up the attention computation, the most time-intensive component of inference. On the CPU side, we used llama.cpp with GGML-format models to perform efficient inference.
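As a minimal sketch of the GPU setup, Flash Attention 2 can be enabled when loading the model with Hugging Face transformers. The checkpoint name, dtype, and prompt below are illustrative assumptions, not necessarily the exact configuration used in our experiments:

```python
# Minimal sketch: loading StarCoder2 with Flash Attention 2 via Hugging Face transformers.
# The model ID, dtype, and prompt are illustrative assumptions, not this project's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-7b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # half-precision weights to fit on a single GPU
    attn_implementation="flash_attention_2",  # use the Flash Attention 2 kernels
    device_map="auto",
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```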
Models Used: Starcoder2 7B
Evaluation Benchmark: HumanEval
In the first half of the experiments, we benchmark inference speed on Starcoder2 with and without Flash Attention 2. We then benchmark quantized versions of Starcoder2 (quantization is performed with bitsandbytes and GPTQ) and evaluate these quantized models using perplexity and pass@1 on HumanEval as metrics. A sketch of the bitsandbytes loading step follows below; more information about the methodology and results is in the gpu-inference README.
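The following sketch shows one way to load a 4-bit bitsandbytes-quantized model for the quantized-GPU benchmarks; the specific quantization settings shown are assumptions, not necessarily the ones reported in the gpu-inference README:

```python
# Minimal sketch: 4-bit quantized loading of StarCoder2 with bitsandbytes.
# The model ID and quantization settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bigcode/starcoder2-7b"  # assumed checkpoint name
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # store weights in 4-bit precision
    bnb_4bit_quant_type="nf4",                # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,    # compute in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```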
In these experiments, we first performed inference in a torch CPU environment with JIT optimization. We then used the GGML model format to run inference with llama.cpp, which gave a large performance improvement. We further improved inference speed with quantization. See the cpu-inference README for further details.
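A minimal sketch of CPU inference through the llama-cpp-python bindings to llama.cpp is shown below; the GGUF file name, thread count, and sampling parameters are assumptions, and the cpu-inference README describes the configurations actually benchmarked:

```python
# Minimal sketch: CPU inference with llama.cpp via the llama-cpp-python bindings.
# The GGUF path and parameters are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="starcoder2-7b-Q4_K_M.gguf",  # assumed local path to a quantized GGUF file
    n_ctx=2048,    # context window
    n_threads=8,   # CPU threads used for inference
)

output = llm("def fibonacci(n):", max_tokens=64, temperature=0.2)
print(output["choices"][0]["text"])
```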
GGML models: https://huggingface.co/Mooizz/starcoder2-gguf