v0.0.2: Compilation caching system and inference with Inferentia
Compilation caching system
Compiling a model before training can be a real bottleneck: on small datasets, for example, compilation can take longer than the training itself. To address this, we introduce a caching system directly connected to the Hugging Face Hub.
Before starting compilation, the TrainiumTrainer checks whether the required compilation files are already on the Hub and, if so, fetches them, so users do not have to do it themselves.
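The lookup described above can be sketched as a simple "check the remote cache, otherwise compile" routine. This is only an illustration of the idea, not the TrainiumTrainer implementation: the function name, the cache key format and the injected callables are all hypothetical. In practice the listing and download steps could be backed by huggingface_hub functions such as list_repo_files and hf_hub_download.

```python
from typing import Callable, Iterable, Tuple


def get_compiled_artifact(
    cache_key: str,
    list_remote_files: Callable[[], Iterable[str]],
    download: Callable[[str], bytes],
    compile_locally: Callable[[], bytes],
) -> Tuple[bytes, bool]:
    """Return (artifact, was_cache_hit). All names here are hypothetical."""
    # Cache hit: the compiled file already exists remotely, so fetch it.
    if cache_key in set(list_remote_files()):
        return download(cache_key), True
    # Cache miss: fall back to the (slow) local compilation.
    return compile_locally(), False


# Toy in-memory stand-in for a cache repo on the Hub (file names are made up).
remote = {"bert-seq128-bs16.neff": b"compiled-graph"}

hit = get_compiled_artifact(
    "bert-seq128-bs16.neff", lambda: remote.keys(), remote.__getitem__, lambda: b"fresh"
)
miss = get_compiled_artifact(
    "gpt2-seq64-bs8.neff", lambda: remote.keys(), remote.__getitem__, lambda: b"fresh"
)
```

Passing the remote operations in as callables keeps the sketch runnable without network access.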
Custom cache repo
Users may want their own cache repo, either to push their own compilation files or to keep them private. This is possible by setting the CUSTOM_CACHE_REPO environment variable:
CUSTOM_CACHE_REPO=michaelbenayoun/cache_test python train.py
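For illustration, a training script could resolve the cache repo from this variable roughly as follows. This is a minimal sketch, not the library's code: the resolve_cache_repo helper and the default repo name are assumptions made up for the example.

```python
import os
from typing import Mapping, Optional

# Hypothetical fallback repo, used only when no custom repo is configured.
DEFAULT_CACHE_REPO = "my-org/default-neuron-cache"


def resolve_cache_repo(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the cache repo id, preferring CUSTOM_CACHE_REPO when it is set."""
    env = os.environ if environ is None else environ
    # An unset or empty variable falls back to the default repo.
    return env.get("CUSTOM_CACHE_REPO") or DEFAULT_CACHE_REPO


custom = resolve_cache_repo({"CUSTOM_CACHE_REPO": "michaelbenayoun/cache_test"})
fallback = resolve_cache_repo({})
```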
Neuron export
Adds support for exporting PyTorch models to a serialized TorchScript module compiled by the Neuron compiler (neuron-cc or neuronx-cc), which can then be used on AWS INF2 or INF1.
Example: Export the BERT model with static shapes:
optimum-cli export neuron --help
optimum-cli export neuron --model bert-base-uncased --sequence_length 128 --batch_size 16 bert_neuron/
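Because the model is exported with static shapes, inputs at inference time need to match the batch size and sequence length chosen at export. A minimal sketch of padding token ids to that fixed shape is shown below. The pad_batch helper is hypothetical, the token ids are illustrative, and a padding id of 0 is an assumption.

```python
from typing import List, Sequence


def pad_batch(
    sequences: Sequence[Sequence[int]],
    batch_size: int = 16,
    sequence_length: int = 128,
    pad_id: int = 0,  # assumed padding token id
) -> List[List[int]]:
    """Pad (or truncate) token id lists to a fixed [batch_size, sequence_length] shape."""
    if len(sequences) > batch_size:
        raise ValueError("more sequences than the exported batch size")
    rows = []
    for seq in sequences:
        # Truncate long sequences, then right-pad short ones.
        seq = list(seq)[:sequence_length]
        rows.append(seq + [pad_id] * (sequence_length - len(seq)))
    # Fill the remaining batch slots with all-padding rows.
    while len(rows) < batch_size:
        rows.append([pad_id] * sequence_length)
    return rows


batch = pad_batch([[101, 7592, 102]])
```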
By default, on INF2, matmul operations are cast from fp32 to bf16, and on INF1, all operations are cast to bf16. Use --auto_cast to configure which operations are auto-cast, and --auto_cast_type to define the data type used for auto-casting.
Example: Auto-cast all operations to the fp16 data type (this option can potentially lower precision/accuracy):
optimum-cli export neuron --model bert-base-uncased --auto_cast all --auto_cast_type fp16 bert_neuron/
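To give a feel for the precision trade-off mentioned above, the sketch below rounds a Python float to fp16 and to bf16 using only the standard library. It is a numerical illustration, not what the Neuron compiler does internally. The helper names are hypothetical, and the bf16 helper assumes round-to-nearest-even and does not handle NaN specially.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x: float) -> float:
    """Keep the top 16 bits of the fp32 encoding, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias, then clear the 16 low bits that bf16 drops.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (0.1, 1 / 3):
    print(value, round_to_fp16(value), round_to_bf16(value))
```

fp16 keeps more mantissa bits than bf16 but has a much smaller exponent range, which is one reason the choice of --auto_cast_type can matter.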