From 9250546513c8549d51f62284610d04077e9589f4 Mon Sep 17 00:00:00 2001
From: S1ro <54212263+S1ro1@users.noreply.github.com>
Date: Sun, 8 Sep 2024 03:44:04 +0200
Subject: [PATCH] Feat: add kl div to readme (#229)

## Summary
Adds the newly implemented KL divergence loss to the readme. Closes #188 finally.

## Testing Done
No code changes

---------

Co-authored-by: Shao Tang
Co-authored-by: Byron Hsu
---
 README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index ae395fbcc..6d11f99c8 100644
--- a/README.md
+++ b/README.md
@@ -132,7 +132,7 @@ pip install -e .
 ```
 
 ## Getting Started
 
-There are a couple ways to apply Liger kernels, depending on the level of customization required.
+There are a couple of ways to apply Liger kernels, depending on the level of customization required.
 
 ### 1. Use AutoLigerKernelForCausalLM
@@ -242,6 +242,7 @@ loss.backward()
 | GeGLU                   | `liger_kernel.transformers.LigerGEGLUMLP`                   |
 | CrossEntropy            | `liger_kernel.transformers.LigerCrossEntropyLoss`           |
 | FusedLinearCrossEntropy | `liger_kernel.transformers.LigerFusedLinearCrossEntropyLoss`|
+| KLDivergence            | `liger_kernel.transformers.LigerKLDIVLoss`                  |
 
 - **RMSNorm**: [RMSNorm](https://arxiv.org/pdf/1910.07467), which normalizes activations using their root mean square, is implemented by fusing the normalization and scaling steps into a single Triton kernel, and achieves ~3X speedup with ~3X peak memory reduction.
 - **LayerNorm**: [LayerNorm](https://arxiv.org/pdf/1607.06450), which centers and normalizes activations across the feature dimension, is implemented by fusing the centering, normalization and scaling steps into a single Triton kernel, and achieves ~2X speedup.
@@ -255,7 +256,7 @@ $$\text{GeGLU}(x)=\text{GELU}(xW+b)\otimes(xV+c)$$
 
 - **CrossEntropy**: [Cross entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) is implemented by computing both the loss and gradient in the forward pass with inplace replacement of input to reduce the peak memory by avoiding simultaneous materialization of both input logits and gradient. It achieves >2X speedup and >4X memory reduction for common vocab sizes (e.g., 32K, 128K, etc.).
 - **FusedLinearCrossEntropy**: Peak memory usage of cross entropy loss is further improved by fusing the model head with the CE loss and chunking the input for block-wise loss and gradient calculation, a technique inspired by [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy). It achieves >4X memory reduction for 128k vocab size. **This is highly effective for large batch size, large sequence length, and large vocabulary sizes.** Please refer to the [Medusa example](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) for individual kernel usage.
-
+- **KLDivergence**: [KL Divergence](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) is implemented by fusing the forward pass into a single Triton kernel, with the reduction done outside the kernel. It achieves ~1.5X speedup and ~15% memory reduction for 128K vocab size.
 
 ### Experimental Kernels
 
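
For context on the new table row, here is a minimal usage sketch of `LigerKLDIVLoss`. It assumes the class mirrors the `torch.nn.KLDivLoss` interface (log-probabilities as input, probabilities as target), which the README's drop-in table suggests; the shapes and the `reduction` argument below are illustrative and not taken from this PR.

```python
import torch
import torch.nn.functional as F

from liger_kernel.transformers import LigerKLDIVLoss

# Assumed drop-in replacement for torch.nn.KLDivLoss:
# input is log-probabilities, target is probabilities over the vocabulary.
kl_loss = LigerKLDIVLoss(reduction="batchmean")

batch_size, vocab_size = 4, 128 * 1024  # illustrative sizes only
student_logits = torch.randn(batch_size, vocab_size, device="cuda", requires_grad=True)
teacher_logits = torch.randn(batch_size, vocab_size, device="cuda")

loss = kl_loss(
    F.log_softmax(student_logits, dim=-1),  # input: log-probabilities
    F.softmax(teacher_logits, dim=-1),      # target: probabilities
)
loss.backward()
```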