简体中文 | English
We compare our results with some popular frameworks and official releases in terms of speed.
- 8 NVIDIA Tesla V100 (16G) GPUs
- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
- Python 3.7
- PaddlePaddle 2.0
- CUDA 10.1
- CUDNN 7.6.3
- NCCL 2.1.15
- GCC 8.2.0
The reported statistic is the average training time per iteration, including both data processing and model training time; the training speed is measured in ips (instances per second). Note that we skip the first 50 iterations as they may contain device warm-up time.
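As a reference for how such numbers can be obtained, the sketch below shows one way to measure ips while skipping the warm-up iterations. The `dataloader`, `train_step`, and `batch_size` names are placeholders for illustration; the figures in the tables come from PaddleVideo's own timers.

```python
import time

WARMUP_ITERS = 50  # the first 50 iterations are skipped (device warm-up)

def measure_ips(dataloader, train_step, batch_size, warmup=WARMUP_ITERS):
    """Average training speed in ips (instances per second),
    covering data processing plus model training time."""
    num_instances, elapsed = 0, 0.0
    tic = time.perf_counter()
    for i, batch in enumerate(dataloader):  # fetching the next batch counts as data time
        train_step(batch)                   # forward + backward + optimizer step
        toc = time.perf_counter()
        if i >= warmup:                     # only accumulate after warm-up
            elapsed += toc - tic
            num_instances += batch_size
        tic = toc
    return num_instances / elapsed
```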
Here we compare PaddleVideo with other video understanding toolkits under the same data and model settings.
To ensure a fair comparison, all experiments were conducted on the same hardware and with the same dataset. The dataset is generated by the data preparation scripts, and in each model setting the same data preprocessing methods are applied so that the feature input is identical.
Significant improvements can be observed when comparing with other video understanding frameworks, as shown in the table below. In particular, the SlowFast model is nearly 2x faster than its counterparts.
Model | batch size x gpus | PaddleVideo(ips) | Reference(ips) | MMAction2 (ips) | PySlowFast (ips) |
---|---|---|---|---|---|
TSM | 16x8 | 58.1 | 46.04 (temporal-shift-module) | To do | X |
PPTSM | 16x8 | 57.6 | X | X | X |
TSN | 16x8 | 841.1 | To do (tsn-pytorch) | To do | X |
SlowFast | 16x8 | 99.5 | X | To do | 43.2 |
Attention_LSTM | 128x8 | 112.6 | X | X | X |
Model | PaddleVideo (ips) | MMAction2 (ips) | BMN (boundary matching network) (ips) |
---|---|---|---|
BMN | 43.84 | X | X |
This repo also provides a performance and accuracy comparison between classic and popular temporal action segmentation models.
Model | Metrics | Value | Flops(M) | Params(M) | test time(ms) bs=1 | test time(ms) bs=2 | inference time(ms) bs=1 | inference time(ms) bs=2 |
---|---|---|---|---|---|---|---|---|
MS-TCN | F1@0.50 | 38.8% | 791.360 | 0.8 | 170 | - | 10.68 | - |
ASRF | F1@0.50 | 55.7% | 1,283.328 | 1.3 | 190 | - | 16.34 | - |
- Model: model name, for example: PP-TSM
- Metrics: the evaluation metric used in the model test; the dataset used is Breakfast.
- Value: the value of the metric, generally kept to two decimal places.
- Flops(M): the floating-point operations required for one forward pass of the model, computed with the `paddlevideo/tools/summary.py` script (slight modifications may be needed for some models), kept to one decimal place and measured with an input tensor of shape (1, 2048, 1000); see the sketch after this list.
- Params(M): the number of model parameters, computed by the same script together with the FLOPs, kept to one decimal place.
- test time(ms) bs=1: time per sample when testing with the Python script at batch_size=1, kept to two decimal places; the test dataset is Breakfast (see the timing sketch after this list).
- test time(ms) bs=2: time per sample when testing with the Python script at batch_size=2, kept to two decimal places. Temporal action segmentation models are generally fully convolutional networks, so the batch_size for training, testing and inference is 1 and this column is left empty. The test dataset is Breakfast.
- inference time(ms) bs=1: time per sample when testing the inference model on GPU (V100 by default) with batch_size=1, kept to two decimal places; the dataset used for inference is Breakfast.
- inference time(ms) bs=2: time per sample when testing the inference model on GPU (V100 by default) with batch_size=2, kept to two decimal places. As above, the batch_size for training, testing and inference is 1, so this column is left empty. The dataset used for inference is Breakfast.
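As a minimal sketch of how the Flops(M) and Params(M) columns could be reproduced, the snippet below uses PaddlePaddle's built-in `paddle.flops` with an input tensor of shape (1, 2048, 1000). The small `nn.Sequential` network is a hypothetical stand-in; the official numbers are produced by `paddlevideo/tools/summary.py` on the actual MS-TCN / ASRF models.

```python
import numpy as np
import paddle
import paddle.nn as nn

# Hypothetical stand-in for a temporal action segmentation model
# (the real networks are built from PaddleVideo configs).
model = nn.Sequential(
    nn.Conv1D(2048, 64, kernel_size=1),
    nn.ReLU(),
    nn.Conv1D(64, 48, kernel_size=3, padding=1),  # 48 output classes, as an example
)
model.eval()

# FLOPs for one forward pass, measured with an input tensor of shape (1, 2048, 1000)
flops = paddle.flops(model, input_size=[1, 2048, 1000], print_detail=False)

# Parameter count, reported together with the FLOPs
params = sum(int(np.prod(p.shape)) for p in model.parameters())

print(f"Flops(M): {flops / 1e6:.1f}")   # keep one decimal place
print(f"Params(M): {params / 1e6:.1f}")
```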
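Similarly, the per-sample test/inference times could be approximated as sketched below, timing the forward pass at batch_size=1 on GPU. The model argument is any `paddle.nn.Layer` (e.g. the stand-in above), random data is used instead of Breakfast features for illustration, and the official figures come from PaddleVideo's test and inference tools.

```python
import time
import paddle

def avg_time_per_sample_ms(model, feature_dim=2048, seq_len=1000, batch_size=1,
                           warmup=10, repeats=50):
    """Average forward time per sample in milliseconds at the given batch size."""
    model.eval()
    x = paddle.randn([batch_size, feature_dim, seq_len])  # placeholder input features
    with paddle.no_grad():
        for _ in range(warmup):                  # warm-up runs are not timed
            model(x)
        if paddle.device.is_compiled_with_cuda():
            paddle.device.cuda.synchronize()     # make sure pending GPU work is done
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
        if paddle.device.is_compiled_with_cuda():
            paddle.device.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return elapsed / (repeats * batch_size) * 1000.0
```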