Project: Fine-tune & quantize DistilBERT for IMDB review classification. Docker, Flask, ONNX, ONNX Runtime, Transformers, Python 3.9
Authors:
- Github: maitrang-ng, NamTran072, quangle, mp1704
- Email: [email protected], [email protected], [email protected], [email protected]
Advisors:
- Github: bangoc123
- Email: [email protected]
- Step 1: Python 3.9 & pipenv are required
- Step 2: cd into sentiment-classification, then use the command line:
  pipenv install -r requirements.txt   to set up the env
  pipenv check   to check the env status
  pipenv shell   to activate the env
- Step 3: cd into src, then in the command line run:
  docker build -t sentiment-classification .
  docker run -dp 5001:80 sentiment-classification
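Once the container is running, the Flask API is reachable on host port 5001. The snippet below is only a hedged client sketch: the /predict route and the "text" JSON field are assumptions, so check the Flask app in src for the real endpoint and payload schema.
# Hypothetical client sketch: the route "/predict" and the field "text"
# are assumptions; verify them against the Flask app in src/.
import requests

resp = requests.post("http://localhost:5001/predict",
                     json={"text": "A wonderful little film."})
print(resp.status_code, resp.text)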
- Datasets must be in CSV format. You can check the dataset folder for more details; a minimal illustration follows below.
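As a hedged illustration only, a review-classification CSV might look like the file built below; the column names "text" and "label" and the label encoding are assumptions, so take the actual layout from the files in the dataset folder.
# Sketch: build a tiny CSV in an assumed layout (columns "text", "label").
import pandas as pd

df = pd.DataFrame(
    {
        "text": ["A wonderful little film.", "Painfully dull from start to finish."],
        "label": [1, 0],  # 1 = positive, 0 = negative (assumed encoding)
    }
)
df.to_csv("../dataset/sample.csv", index=False)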
Training script:
python3 trainer.py -ds ${dataset-path} -r ${ratio-split-dataset} -m ${base-model} -b ${batch-size}
-e ${epochs} -lr ${learning-rate} -wd ${weight-decay} -o ${output-path} -s ${save-model-dir}
Example: you can omit all arguments, in which case the command uses the default settings:
python3 trainer.py -s ../models/DistilBert
dataset-path can accept one or more file paths; the dataset is then split with the ratio you choose. If you provide two or three file paths and no split ratio, the files are interpreted in order as train, test or as train, valid, test. For one file path or more than three, a split ratio is required. The split ratio accepts one or two values. One value: the test-dataset ratio. Two values, in order: the test-dataset ratio and the validation-dataset ratio. Example:
python3 trainer.py -ds ../dataset/train.csv -r 0.2 0.2 -s ../models/DistilBert
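To illustrate the ratio semantics only, a generic scikit-learn sketch of what "-r 0.2 0.2" corresponds to is shown below; this is not the trainer.py implementation, and whether the second ratio is taken from the full dataset or from the remainder depends on trainer.py (the sketch assumes the remainder).
# Generic illustration of "-r 0.2 0.2": hold out 20% for test,
# then 20% of the remainder for validation. Not the trainer.py code itself.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("../dataset/train.csv")
train_valid, test = train_test_split(df, test_size=0.2, random_state=42)
train, valid = train_test_split(train_valid, test_size=0.2, random_state=42)
print(len(train), len(valid), len(test))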
There are some important arguments you should consider when running the script:
-s : the directory to save the model
Predict with torch model:
python3 predictor.py torch -m ${torch-model-path}
Example:
python3 predictor.py torch -m ../models/DistilBert
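For reference, the saved checkpoint can also be loaded directly with Transformers; this is a minimal sketch assuming ../models/DistilBert is a standard Hugging Face checkpoint with tokenizer files, not the predictor.py implementation itself.
# Sketch: load the fine-tuned checkpoint and classify one review.
# Assumes ../models/DistilBert holds config, weights, and tokenizer files;
# the label meanings in the comment are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "../models/DistilBert"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

inputs = tokenizer("A wonderful little film.", return_tensors="pt",
                   truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
print("predicted class:", logits.argmax(dim=-1).item())  # e.g. 1 = positive (assumed)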
Predict with onnx model:
python3 predictor.py onnx -m ${onnx-model-path} -t ${tokenizer-path}
Example:
python3 predictor.py onnx -m ../models/ONNX/distilbert_full_precision.onnx -t ../models/ONNX
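For reference, the exported model can also be run directly with ONNX Runtime; this is a minimal sketch, and the graph input names ("input_ids", "attention_mask") are assumptions to be checked against session.get_inputs() for the actual export.
# Sketch: run the exported ONNX model with ONNX Runtime.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("../models/ONNX")
session = ort.InferenceSession("../models/ONNX/distilbert_full_precision.onnx")

enc = tokenizer("A wonderful little film.", return_tensors="np",
                truncation=True, padding=True)
logits = session.run(None, {"input_ids": enc["input_ids"].astype(np.int64),
                            "attention_mask": enc["attention_mask"].astype(np.int64)})[0]
print("predicted class:", int(logits.argmax(axis=-1)[0]))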
Evaluation is run locally on 1000 examples from the test dataset.
Evaluate the torch model:
python3 evaluation.py torch
Result:
test Acc: 0.9355
PyTorch cpu Inference time = 323.8157 ms
Evaluate the ONNX model:
python3 evaluation.py onnx
Result:
test Acc: 0.9482
ONNX cpu Inference time = 286.9295 ms
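For context, per-example CPU latency figures like those above can be obtained from a simple wall-clock average; the helper below is a generic sketch, not the evaluation.py implementation, and predict_fn and texts are hypothetical placeholders.
# Generic sketch: average wall-clock latency of a prediction function over a list of texts.
# predict_fn and texts are placeholders, not names from this repo.
import time

def mean_latency_ms(predict_fn, texts):
    start = time.perf_counter()
    for text in texts:
        predict_fn(text)
    return (time.perf_counter() - start) * 1000 / len(texts)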
Training notebook with HF dataset: https://colab.research.google.com/drive/1Bxx0w8EHoVo84yJ6I3VqfdED9-YpRtde
Training notebook with TF dataset: https://colab.research.google.com/drive/1PjCCokQnnXdYVKoXD2qcSezBQQmAwqMW#scrollTo=KW8CH-HBCuzN
Quantization notebook: https://colab.research.google.com/drive/1wtAaF8oQhw5vuAjgFtpJjob-kX-STOif#scrollTo=8J6HAKSiE7IK
Model files to download: https://drive.google.com/drive/folders/1hlVSm434UgpwwzROfYKillIxwMc8lWYt?usp=sharing
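The quantization notebook covers quantizing the exported ONNX model; dynamic (post-training) quantization with ONNX Runtime can be sketched as below. The output filename is illustrative only; refer to the notebook for the exact settings used in this project.
# Sketch: dynamic quantization of the full-precision ONNX model.
# The output path is an example name, not necessarily the one used in the notebook.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="../models/ONNX/distilbert_full_precision.onnx",
    model_output="../models/ONNX/distilbert_quantized.onnx",
    weight_type=QuantType.QInt8,
)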