We propose training efficient VLA models based on SLMs such as Qwen2 with a non-autoregressive objective. Our early results show that these models exhibit training characteristics similar to those of much larger counterparts. This repository is a direct fork of Prismatic VLMs and OpenVLA. You can train from scratch, finetune, or test our pre-trained models. See our blog or our report for more details about the architecture.
conda create --name evla python=3.10
conda activate evla
cd evla
pip install -e .
Now add your HF token under `.hf_token` to run models such as Llama 2/3 or Qwen2.
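For example, assuming the token is read as plain text from a `.hf_token` file in the repository root:

```bash
# Write your Hugging Face access token to .hf_token
# (assumed to be a plain-text file at the root of the repo)
echo "<your_hf_token>" > .hf_token
```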
You can either train your own model from scratch or finetune a model on your own dataset. We recommend first running the debug mode to verify that everything works.
CUDA_VISIBLE_DEVICES=0 LOCAL_RANK=0 MASTER_ADDR=localhost MASTER_PORT=1235 python vla-scripts/test.py \
--vla.type "debug" \
--data_root_dir DATA_ROOT_DIR \
--run_root_dir RUN_ROOT_DIR
Full-scale training can be run with the `evla` config from `prismatic/conf/vla.py`.
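A multi-GPU launch might look like the following sketch; the `vla-scripts/train.py` entry point and the GPU count are assumptions, so adapt them to your setup:

```bash
# Hypothetical full-scale run with the `evla` config on 8 GPUs of a single node
torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/train.py \
  --vla.type "evla" \
  --data_root_dir DATA_ROOT_DIR \
  --run_root_dir RUN_ROOT_DIR
```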
- Remove the hardcoded attention setup.
- Export the model to the HF format.
- Add support for LoRA.
@article{kscale2024evla,
title={EdgeVLA: Efficient Vision-Language-Action Models},
author={Paweł Budzianowski and Wesley Maa and Matthew Freed and Jingxiang Mo and Aaron Xie and Viraj Tipnis and Benjamin Bolte},
year={2024}
}