Polish readme

- init readme
- wip
- test
- looks good
- emoji
- modify
- add wip
- wip

ByronHsu committed Aug 13, 2024
1 parent 34b97fd commit 7fb8130
Showing 3 changed files with 109 additions and 54 deletions.
Readme.md: 163 changes (109 additions, 54 deletions)

# Liger Kernel

**Liger Kernel** is a collection of Triton-native kernels designed specifically for LLM training. It aims to be **performant**, **correct**, and **lightweight**. We welcome contributions from the community to help us enhance and grow this project.

### ✨ Key Features
- **🚀 Performant:** All kernels are written in OpenAI Triton with optimized tuning, increasing multi-GPU training throughput by 20% and reducing memory usage by 60%.
- **✅ Correct:** Each kernel undergoes rigorous unit and convergence testing to ensure accuracy.
- **🌱 Lightweight:** The kernels have minimal dependencies, requiring only Torch and Triton—no extra libraries needed!

### 🎯 Target Audiences

- **Researchers**: Looking to compose models using efficient and reliable kernels for frontier experiments.
- **ML Practitioners**: Focused on maximizing GPU training efficiency with optimal, high-performance kernels.
- **Curious Novices**: Eager to learn how to write reliable Triton kernels to enhance training efficiency.

## 🌟 Overview

### Supercharge Your Model with Liger Kernel

Gain +20% throughput and -60% memory usage, enabling longer context lengths and larger batch sizes.

| ⚡ Speed Up | 💾 Memory Reduction |
|--------------------------|-------------------------|
| ![Speed up](docs/images/e2e-tps.png) | ![Memory](docs/images/e2e-memory.png) |

> **Note:**
> 1. Benchmark conditions: LLaMA 3-8B, Batch Size = 8, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 8 A100s.
> 2. Hugging Face models start to OOM at 4K context length, whereas Liger Kernel scales up to 16K.
> 3. **Fused Linear Cross Entropy Loss** is enabled to significantly reduce memory usage.
### ✨ Utilize Individual Kernels or Enhance Existing Models

| 🛠️ Patch Existing HF Model | 🧩 Compose Your Own Model |
|--------------------------|-------------------------|
| ![Patch](docs/images/patch.gif) | ![Compose](docs/images/compose.gif) |

## 🚀 Features

- +20% throughput and -60% memory usage for multi-GPU training.
- Unlock large vocabulary sizes, long contexts, or multi-head training.
- Minimal dependencies—only `torch` and `triton` are required.
- Hugging Face model compatible—speed up your models with just one line of code.
- Forward and backward passes implemented.
- 0% loss in correctness—kernels are validated through robust unit and convergence tests.
- Compatible with multi-GPU setups (PyTorch FSDP and DeepSpeed).
- Seamless integration with Torch Compile.

## 🔧 Installation

### Dependencies

- `torch >= 2.1.2`
- `triton >= 2.3.0`
- `transformers >= 4.40.1`

To install the stable version:

```bash
$ pip install liger-kernel
```

To install the nightly version:

```bash
$ pip install liger-kernel-nightly
```
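
As an optional sanity check after installation (not part of the official instructions), you can confirm that the package and its two runtime dependencies import cleanly:

```python
import torch
import triton

import liger_kernel.transformers  # should import without errors once installed

print("torch:", torch.__version__)
print("triton:", triton.__version__)
```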

## 🚀 Getting Started

### 1. 🛠️ Patch Existing Hugging Face Models

```python
import transformers
from transformers import Trainer
from liger_kernel.transformers import apply_liger_kernel_to_llama

# This one line monkey-patches the LLaMA modeling code with Liger's optimized kernels
apply_liger_kernel_to_llama()
model = transformers.AutoModelForCausalLM.from_pretrained("<some llama model>")
```
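
After patching, the model trains like any other Hugging Face model. The snippet below is a hypothetical minimal `Trainer` setup continuing from the code above; the dummy dataset and training arguments are illustrative placeholders, not part of Liger Kernel:

```python
import torch
from transformers import TrainingArguments

# Tiny random-token dataset, purely for illustration
class DummyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 16

    def __getitem__(self, idx):
        ids = torch.randint(0, 1000, (128,))
        return {"input_ids": ids, "labels": ids.clone()}

training_args = TrainingArguments(output_dir="./liger-demo", per_device_train_batch_size=2, bf16=True, max_steps=4)
trainer = Trainer(model=model, args=training_args, train_dataset=DummyDataset())  # `model` and `Trainer` from the snippet above
trainer.train()
```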

### 2. 🧩 Compose Your Own Model

```python
from liger_kernel.transformers import LigerFusedLinearCrossEntropyLoss
import torch.nn as nn
import torch

model = nn.Linear(128, 256).to("cuda")

# LigerFusedLinearCrossEntropyLoss fuses the linear projection with the cross-entropy loss and computes them chunk by chunk to reduce peak memory
loss_fn = LigerFusedLinearCrossEntropyLoss()

input = torch.randn(4, 128, requires_grad=True, device="cuda")
target = torch.randint(256, (4,), device="cuda")  # hard-label targets for the 4 rows, one of 256 classes

loss = loss_fn(model.weight, input, target)
loss.backward()
```
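
Conceptually, the fused loss avoids materializing the full `(batch, vocab)` logits tensor by producing logits for one chunk of rows at a time. The sketch below is only an eager-PyTorch illustration of that idea; the actual kernel is written in Triton and also computes gradients during the forward pass:

```python
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(weight, inputs, target, chunk_size=2):
    # Illustration only: at most (chunk_size, vocab) logits exist in memory at any time
    total, n = inputs.new_zeros(()), inputs.shape[0]
    for start in range(0, n, chunk_size):
        chunk_logits = inputs[start:start + chunk_size] @ weight.t()
        total = total + F.cross_entropy(
            chunk_logits, target[start:start + chunk_size], reduction="sum"
        )
    return total / n  # mean reduction over all rows
```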

## ⚙️ Note on ML Compiler

### 1. ⚡ Torch Compile

Since Liger Kernel is 100% Triton-based, it works seamlessly with Torch Compile. In the example below, Liger Kernel further optimizes on top of Torch Compile, reducing memory usage by more than half.

| Configuration | ⚡ Throughput (tokens/sec) | 💾 Memory Reserved (MB) |
|--------------------------------|----------------------------|-------------------------|
| Torch Compile | 3780 | 66358 |
| Torch Compile + Liger Kernel | 3702 | 31000 |

> **Note:**
> 1. **Fused Linear Cross Entropy Loss** is enabled.
> 2. Benchmark conditions: LLaMA 3-8B, Batch Size = 8, Seq Len = 4096, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 8 A100s.
> 3. Tested on torch `2.5.0.dev20240731+cu118`
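
Below is a minimal sketch of combining the two, reusing the placeholder model name from the Getting Started example; treat it as an assumed setup rather than an official recipe:

```python
import torch
import transformers
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Patch first, then compile: the Triton kernels compose with torch.compile
apply_liger_kernel_to_llama()
model = transformers.AutoModelForCausalLM.from_pretrained("<some llama model>").to("cuda")
compiled_model = torch.compile(model)
# Use `compiled_model` in the training loop exactly as you would the original model
```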
### 2. 🌩️ Lightning Thunder

*WIP*

## 📂 Structure

### Source Code

- `ops/`: Core Triton operations.
- `transformers/`: PyTorch `nn.Module` implementations built on Triton operations, compliant with the `transformers` API.

### Tests

- `transformers/`: Correctness tests for the Triton-based layers.
- `convergence/`: Patches Hugging Face models with all kernels, runs multiple iterations, and compares weights, logits, and loss layer by layer.

### Benchmark

- `benchmark/`: Execution time and memory benchmarks compared to Hugging Face layers.

## 🔧 APIs

### Patching

| **Model** | **API** | **Supported Operations** |
|-------------|--------------------------------------------------------------|-------------------------------------------------------------------------|
| LLaMA (2 & 3) | `liger_kernel.transformers.apply_liger_kernel_to_llama` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
| Mistral | `liger_kernel.transformers.apply_liger_kernel_to_mistral` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss |
| Mixtral | `liger_kernel.transformers.apply_liger_kernel_to_mixtral` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss |
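
For example, patching Mistral follows the same one-line pattern as the LLaMA snippet in Getting Started (the model name below is a placeholder):

```python
import transformers
from liger_kernel.transformers import apply_liger_kernel_to_mistral

apply_liger_kernel_to_mistral()  # patches RoPE, RMSNorm, SwiGLU, and CrossEntropyLoss per the table above
model = transformers.AutoModelForCausalLM.from_pretrained("<some mistral model>")
```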

### 🧩 Kernels

| **Kernel** | **API** | **Description** |
|---------------------------|-------------------------------------------------------------|-----------------|
| RMSNorm | `liger_kernel.transformers.LigerRMSNorm` | [RMSNorm Paper](https://arxiv.org/pdf/1910.07467) |
| RoPE | `liger_kernel.transformers.liger_rotary_pos_emb` | [RoPE Paper](https://arxiv.org/pdf/2104.09864) |
| SwiGLU | `liger_kernel.transformers.LigerSwiGLUMLP` | [SwiGLU Paper](https://arxiv.org/pdf/2002.05202) |
| CrossEntropy | `liger_kernel.transformers.LigerCrossEntropyLoss` | [PyTorch CrossEntropyLoss Documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) |
| FusedLinearCrossEntropy | `liger_kernel.transformers.LigerFusedLinearCrossEntropyLoss`| Inspired by [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy), with additional optimizations |
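
As a small usage sketch, assuming `LigerCrossEntropyLoss` follows the standard `torch.nn.CrossEntropyLoss` call convention (logits followed by hard-label targets, mean reduction):

```python
import torch
from liger_kernel.transformers import LigerCrossEntropyLoss

loss_fn = LigerCrossEntropyLoss()

logits = torch.randn(4, 32000, requires_grad=True, device="cuda")  # 32000 is an illustrative vocab size
target = torch.randint(32000, (4,), device="cuda")

loss = loss_fn(logits, target)
loss.backward()
```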


## 🛣️ Roadmap

WIP

## 🤝 Contributing

WIP

## 📜 License

WIP

Binary file added docs/images/e2e-memory.png
Binary file added docs/images/e2e-tps.png
