
v0.0.2: Compilation caching system and inference with Inferentia


Compilation caching system

Compiling a model before it can be trained can be a real bottleneck: on small datasets, compilation can take longer than the training itself. To address this, we introduce a caching system directly connected to the Hugging Face Hub.

Before starting compilation, the TrainiumTrainer checks whether the needed compilation files are already on the Hub and fetches them if they are, saving users from having to recompile the model themselves.
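
The trainer is used like the regular transformers Trainer. The sketch below is illustrative only: the import path and the Trainer-style arguments are assumptions based on the transformers API, so check the documentation for the exact names.

# Minimal sketch of training with the TrainiumTrainer; import path and
# Trainer-style arguments are assumptions, not confirmed by this release note.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from optimum.neuron import TrainiumTrainer  # assumed import path

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

trainer = TrainiumTrainer(
    model=model,
    args=TrainingArguments(output_dir="bert_trainium", per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=dataset,
)
# Before compiling, the trainer looks for matching compilation files on the Hub
# and downloads them if they exist.
trainer.train()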

Custom cache repo

Since users may want their own cache repository, for instance to push their own compilation artifacts or to keep them private, a custom cache repo can be specified via the CUSTOM_CACHE_REPO environment variable:

CUSTOM_CACHE_REPO=michaelbenayoun/cache_test python train.py
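
The variable can also be set from Python before the trainer is created; a minimal sketch (the repository name is just an example):

import os

# Point the compilation cache at your own Hub repository (example name).
os.environ["CUSTOM_CACHE_REPO"] = "michaelbenayoun/cache_test"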

Neuron export

PyTorch models can now be exported to a serialized TorchScript module compiled by the Neuron compiler (neuron-cc or neuronx-cc), which can then be run on AWS INF2 or INF1 instances.

Example: Export the BERT model with static shapes:

optimum-cli export neuron --help
optimum-cli export neuron --model bert-base-uncased --sequence_length 128 --batch_size 16 bert_neuron/
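
Once exported, the compiled TorchScript module can be loaded and run with PyTorch. The sketch below assumes the output directory contains a file named model.neuron and that the traced module takes (input_ids, attention_mask) positionally; neither detail is confirmed here.

# Minimal inference sketch; the file name and input order are assumptions.
import torch
import torch_neuronx  # registers the Neuron ops on INF2 (use torch_neuron on INF1)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
neuron_model = torch.jit.load("bert_neuron/model.neuron")  # assumed file name

# Inputs must match the static shapes used at export time
# (--batch_size 16 --sequence_length 128).
texts = ["Neuron export example"] * 16
inputs = tokenizer(
    texts, padding="max_length", max_length=128, truncation=True, return_tensors="pt"
)
outputs = neuron_model(inputs["input_ids"], inputs["attention_mask"])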

By default, matmul operations are cast from fp32 to bf16 on INF2, and all operations are cast to bf16 on INF1. Use --auto_cast to configure which operations are auto-cast and --auto_cast_type to set the data type used for auto-casting.

Example: Auto-cast all operations to the fp16 data type (this option can lower precision/accuracy):

optimum-cli export neuron --model bert-base-uncased --auto_cast all --auto_cast_type fp16 bert_neuron/