README for downloading and preprocessing the dataset. We includes waymo, argoverse 2.0 and nuscenes dataset in our project.
We've updated the process dataset for:
- Argoverse 2.0: check here. The process script Involved from DeFlow.
- Waymo: check here. The process script was involved from SeFlow.
- nuScenes: done coding, public after review. Will be involved later by another paper.
If you want to use all datasets above, there is a specific process environment in envprocess.yml to install all the necessary packages. As Waymo package have different configuration and conflict with the main environment. Setup through the following command:
conda env create -f envprocess.yml
conda activate dataprocess
Install their download tool:
mamba install s5cmd -c conda-forge
Download the dataset:
# train is really big (750): totally 966 GB
s5cmd --no-sign-request cp "s3://argoverse/datasets/av2/sensor/train/*" sensor/train
# val (150) and test (150): totally 168GB + 168GB
s5cmd --no-sign-request cp "s3://argoverse/datasets/av2/sensor/val/*" sensor/val
s5cmd --no-sign-request cp "s3://argoverse/datasets/av2/sensor/test/*" sensor/test
# for local and online eval mask from official repo
s5cmd --no-sign-request cp "s3://argoverse/tasks/3d_scene_flow/zips/*" .
Then to quickly pre-process the data, we can read these commands on how to generate the pre-processed data for training and evaluation. This will take around 0.5-2 hour for the whole dataset (train & val) based on how powerful your CPU is.
More self-supervised data in AV2 LiDAR only, note: It does not include imagery or 3D annotations. The dataset is designed to support research into self-supervised learning in the lidar domain, as well as point cloud forecasting.
# train is really big (16000): totally 4 TB
s5cmd --no-sign-request cp "s3://argoverse/datasets/av2/lidar/train/*" lidar/train
# val (2000): totally 0.5 TB
s5cmd --no-sign-request cp "s3://argoverse/datasets/av2/lidar/val/*" lidar/val
# test (2000): totally 0.5 TB
s5cmd --no-sign-request cp "s3://argoverse/datasets/av2/lidar/test/*" lidar/test
Dataset | # Total Scene | # Total Frames |
---|---|---|
Sensor/train | 750 | 110071 |
Sensor/val | 150 | 23547 |
Sensor/test | 150 | 23574 |
LiDAR/train | 16000 | - |
LiDAR/val | 2000 | 597590 |
LiDAR/test | 2000 | 597575 |
You need sign up an account at nuScenes to download the dataset from https://www.nuscenes.org/nuscenes#download Full dataset (v1.0), you can choose to download lidar only. Click donwload mini split and unzip the file to the nuscenes
folder if you want to test.
To download the Waymo dataset, you need to register an account at Waymo Open Dataset. You also need to install gcloud SDK and authenticate your account. Please refer to this page for more details.
For cluster without root user, check here sdk tar gz.
Website to check their file: https://console.cloud.google.com/storage/browser/waymo_open_dataset_scene_flow
The thing we need is all things about lidar
, to download the data, you can use the following command:
gsutil -m cp -r "gs://waymo_open_dataset_scene_flow/valid" .
gsutil -m cp -r "gs://waymo_open_dataset_scene_flow/train" .
And flowlabel data can be downloaded here with ground segmentation by HDMap follow the same style of ZeroFlow.
You can download the processed map folder here to free yourself downloaded another type of data again:
wget https://zenodo.org/records/13744999/files/waymo_map.tar.gz
tar -xvf waymo_map.tar.gz -C /home/kin/data/waymo/flowlabel
# you will see there is a `map` folder in the `flowlabel` folder now.
Dataset | # Total Scene | # Total Frames |
---|---|---|
train | 799 | 155687 |
val | 203 | 39381 |
This directory contains the scripts to preprocess the datasets.
extract_av2.py
: Process the datasets in Argoverse 2.0.extract_nus.py
: Process the datasets in nuScenes.extract_waymo.py
: Process the datasets in Waymo.
Example Running command:`
# av2:
python dataprocess/extract_av2.py --av2_type sensor --data_mode train --argo_dir /home/kin/data/av2 --output_dir /home/kin/data/av2/preprocess
# waymo:
python dataprocess/extract_waymo.py --mode train --flow_data_dir /home/kin/data/waymo/flowlabel --map_dir /home/kin/data/waymo/flowlabel/map --output_dir /home/kin/data/waymo/preprocess --nproc 48
All these preprocess scripts will generate the same format .h5
file. The file contains the following in codes:
File: [*:logid].h5
file named in logid. Every timestamp is the key of group (f[key]).
def process_log(data_dir: Path, log_id: str, output_dir: Path, n: Optional[int] = None) :
def create_group_data(group, pc, gm, pose, flow_0to1=None, flow_valid=None, flow_category=None, ego_motion=None):
group.create_dataset('lidar', data=pc.astype(np.float32))
group.create_dataset('ground_mask', data=gm.astype(bool))
group.create_dataset('pose', data=pose.astype(np.float32))
if flow_0to1 is not None:
# ground truth flow information
group.create_dataset('flow', data=flow_0to1.astype(np.float32))
group.create_dataset('flow_is_valid', data=flow_valid.astype(bool))
group.create_dataset('flow_category_indices', data=flow_category.astype(np.uint8))
group.create_dataset('ego_motion', data=ego_motion.astype(np.float32))
After preprocessing, all data can use the same dataloader to load the data. As already in our DeFlow code.
Or you can run testing file to visualize the data.
# view gt flow
python tools/visualization.py --data_dir /home/kin/data/av2/preprocess/sensor/mini --res_name flow
python tools/visualization.py --data_dir /home/kin/data/waymo/preprocess/val --res_name flow