Bring float8 quantization back into the game (#92)
* Bring float8 quantization back into the game
* Expose max_sequence_length parameter for the calibration dataset
* When doing quantization, let's not try to reconvert the model right after
* Use correct sharding / offloading for models not fitting the local GPUs during quantization
* Expose the batch_size to use for calibration, defaulting to 1
* Put the pre/post-processing in the right order
* Change the order in which export and calibration happen
* Let's make sure to clean up the HF model in all cases
* Add some float8 quantization tests
* Let's use the c4-new dataset
* Fix wrong str identifier around a variable name
* Always specify the lm_head quantization schema
* Let's use the Cartesian product of the parameters
* Quality
* Expose device parameter for target datasets
* Quality
* Adapt the quality workflow to match the local setup
* Add missing pip install command
* Make sure to use the right Python version for quality
* Use direct dependencies rather than extras to avoid huge downloads
* Quality
* Add end-to-end float8 calibration flow (a condensed sketch follows the commit stats below)
* Let's split functional / integration tests
* Rename workflow titles
* Quality
* Some more renaming in workflows
* Let's create a temporary folder for unittest
* Fix invalid HF model creation from the auto factory with config
* Fix more issues with invalid layer size
* Once again ...
* Reintroduce use_fp8
* Quality
* Let's make a smaller model and ensure the config values stay in integer repr
* Let's save the tokenizer along with the model for the tests
* Change wording
* Update dependency huggingface_hub with the right naming
* One last dependency update
* Let's make sure we can serialize the qconfig
* Do not serialize the calibration datasets
* Force the config to be forwarded
* Let's uninstall optimum-nvidia from the container
* Again
* Increase shared memory for workflows
* Limit concurrency for integration tests
* Increase verbosity for now to debug
* Update CI image
* Update test concurrency
* Add some more logging to dig
* Fix tqdm import
* Reduce workload for quantization tests
* Update huggingface-hub with the config fix
* Once more
* Remove debugging print statement
* Quality
* Let's just remove all layers and use a single one
* Attempt to give more info in case of failure
* Let's use a few more samples to quantize
* Add a utility to skip tests if the SM requirement is not met
* Increase shm and use tmpfs for tests
* Once more
* Pin huggingface-hub main version
* Quality
* Preinstall dependencies for optimum-nvidia in the dev container
* Let's relax testing for loading from the Hub if no revision is found for the underlying hardware
* Quality
* Disable gemma-2b testing for now
* Let's follow ModelHubMixin guidelines
* Retrieve the device name in parts
* Let's raise an error if the config is None
* Quality
1 parent 97446a8 · commit 22a3a3a
Showing 31 changed files with 1,231 additions and 413 deletions.
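Several of the changes above center on the calibration setup for float8 quantization: which dataset to calibrate on, its maximum sequence length, the number of samples, and the batch size and device used during calibration. The snippet below is a condensed sketch of how those pieces fit together; the full example file added by this commit appears in the diff that follows. The model id and output path are placeholders, and the `device` and `batch_size` keyword names are assumptions inferred from the commit notes rather than confirmed API.

from transformers import AutoTokenizer

from optimum.nvidia import AutoModelForCausalLM
from optimum.nvidia.quantization import AutoQuantizationConfig

# Placeholder model id; any causal LM supported by optimum-nvidia would follow the same flow.
MODEL_ID = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, padding_side="left")
if not tokenizer.pad_token:
    tokenizer.pad_token = tokenizer.eos_token

# float8 weights and activations, calibrated on the c4-new dataset.
# weight/activation, dataset, max_sequence_length and num_samples mirror the example below;
# `batch_size` and `device` are assumed keyword names based on the commit notes.
qconfig = AutoQuantizationConfig.from_description(
    weight="float8",
    activation="float8",
    tokenizer=tokenizer,
    dataset="c4-new",
    max_sequence_length=1024,
    num_samples=512,
    batch_size=1,   # assumed keyword; the commit notes say calibration batch size defaults to 1
    device="cuda",  # assumed keyword for the calibration device
)

# Calibrate and quantize while building the TensorRT engine, then persist it.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, quantization_config=qconfig)
model.save_pretrained("./llama-2-7b-float8")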
@@ -0,0 +1,102 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from argparse import ArgumentParser
from logging import getLogger
from pathlib import Path

from transformers import AutoTokenizer

from optimum.nvidia import AutoModelForCausalLM, setup_logging
from optimum.nvidia.quantization import AutoQuantizationConfig


# Setup logging needs to happen before importing TRT ...
setup_logging(True)

from optimum.nvidia.utils.cli import (
    postprocess_quantization_parameters,
    register_common_model_topology_args,
    register_optimization_profiles_args,
    register_quantization_args,
)


LOGGER = getLogger(__name__)


if __name__ == "__main__":
    parser = ArgumentParser("🤗 Optimum-Nvidia Custom Quantization Example")
    parser.add_argument(
        "--hub-token",
        type=str,
        help="Hugging Face Hub Token to retrieve private weights.",
    )
    register_common_model_topology_args(parser)
    register_optimization_profiles_args(parser)
    register_quantization_args(parser)  # Inject params.quantization_config

    parser.add_argument("model", type=str, help="The model's id or path to use.")
    parser.add_argument(
        "output", type=Path, help="Path to store generated TensorRT engine."
    )
    args = parser.parse_args()
    args = postprocess_quantization_parameters(args)

    if args.hub_token is not None:
        from huggingface_hub import login

        login(args.hub_token)

    tokenizer = AutoTokenizer.from_pretrained(args.model, padding_side="left")
    if not tokenizer.pad_token:
        tokenizer.pad_token = tokenizer.eos_token

    # Quantization Config
    qconfig = AutoQuantizationConfig.from_description(
        weight="float8",
        activation="float8",
        tokenizer=tokenizer,
        dataset="c4-new",
        max_sequence_length=args.max_prompt_length,
        num_samples=1024,
    )

    # Create the model
    model = AutoModelForCausalLM.from_pretrained(
        args.model,
        max_batch_size=args.max_batch_size,
        max_prompt_length=args.max_prompt_length,
        num_beams=args.max_beam_width,
        quantization_config=qconfig,
    )
    model.save_pretrained(args.output)

    prompt = "What is the latest generation of Nvidia GPUs?"
    tokens = tokenizer(prompt, padding=True, return_tensors="pt")
    generated, lengths = model.generate(
        **tokens,
        top_k=40,
        top_p=0.95,
        repetition_penalty=10,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=256,
    )

    generated_text = tokenizer.batch_decode(
        generated.flatten(0, 1), skip_special_tokens=True
    )
    print(generated_text)
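Once the engine has been written out, a natural follow-up is reloading it without re-running calibration. The snippet below is a minimal sketch of that round trip, assuming the directory produced by model.save_pretrained(...) can be passed straight back to AutoModelForCausalLM.from_pretrained; the engine path and tokenizer id are placeholders, and this reload step is not part of the example added by the commit.

from transformers import AutoTokenizer

from optimum.nvidia import AutoModelForCausalLM

# Assumption: the engine directory written by model.save_pretrained(...) above can be
# reloaded directly; "./llama-2-7b-float8" and the tokenizer id are placeholders.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", padding_side="left")
if not tokenizer.pad_token:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained("./llama-2-7b-float8")

tokens = tokenizer(
    "What is the latest generation of Nvidia GPUs?", padding=True, return_tensors="pt"
)
generated, lengths = model.generate(
    **tokens,
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.batch_decode(generated.flatten(0, 1), skip_special_tokens=True))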