Face recognition automatically determines the identity of a face in an image and has a wide range of application scenarios, such as facial payment, identifying hospital ticket scalpers, epidemiological investigation of COVID-19, fugitive tracking, etc. InsightFace, an open-source 2D & 3D deep face analysis toolbox, was previously based mainly on MXNet; OneFlow has now implemented it with strict alignment of the network structure, parameters, and configuration.
This report compares the throughput of the InsightFace model between the oneflow_face and deepinsight repositories. With the same datasets, hardware environment, and algorithms, only speed performance is compared. In conclusion, OneFlow performs better in both InsightFace training throughput and distributed scalability.
- InsightFace Deep Learning Framework Tests Report
Reproduction procedures, introductions, logs, data, and the English reports can be found in the DLPerf repository: https://github.com/Oneflow-Inc/DLPerf
Framework | Version | Source |
---|---|---|
OneFlow | 0.3.4 | oneflow_face |
deepinsight | 2021-01-20 update | deepinsight/insightface |
Rigorous alignment has been completed between OneFlow and MXNet, including:
Parameter | R100(ResNet100) + face_emore | R100(ResNet100) + glint360k | Y1(MobileFaceNet) + face_emore |
---|---|---|---|
fc type | E | FC | GDC |
optimizer | SGD | SGD | SGD |
kernel initializer | random_normal_initializer(mean=0.0, stddev=0.01) | random_normal_initializer(mean=0.0, stddev=0.01) | random_normal_initializer(mean=0.0, stddev=0.01) |
loss type | arcface | cosface | arcface |
regularizer | Step Weight Decay | Step Weight Decay | Step Weight Decay |
lr_step | [100000,160000] | [200000, 400000, 500000, 550000] | [100000,160000,220000] |
scales | [0.1, 0.01] | [0.1, 0.01, 0.001, 0.0001] | [0.1, 0.01, 0.001] |
momentum | 0.9 | 0.9 | 0.9 |
weight decay | 0.0005 | 0.0005 | 0.0005 |
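To make the lr_step/scales alignment above concrete, below is a minimal Python sketch of the piecewise-constant (step decay) learning-rate schedule these values describe. The function name, the base_lr value, and the interpretation of scales as multipliers on the base learning rate are our assumptions for illustration, not code from either repository.

```python
def step_lr(base_lr, lr_step, scales, iteration):
    """Illustrative step-decay schedule (not from oneflow_face or
    deepinsight/insightface): once `iteration` passes lr_step[i],
    the learning rate becomes base_lr * scales[i]."""
    lr = base_lr
    for boundary, scale in zip(lr_step, scales):
        if iteration >= boundary:
            lr = base_lr * scale
    return lr

# R100 + face_emore row of the table above (base_lr = 0.1 is assumed):
# lr stays at 0.1, drops to ~0.01 after 100k and ~0.001 after 160k iterations.
for it in (50_000, 120_000, 200_000):
    print(it, step_lr(0.1, [100_000, 160_000], [0.1, 0.01], it))
```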
In this report, batch size means the number of samples on each device (GPU), abbreviated as bsz (batch size per GPU). The tests report either a fixed batch size or the maximum batch size supported with different numbers of GPUs in each framework.
In this report, num classes means the number of face categories. The tests report either a fixed value or the maximum num classes supported with different numbers of GPUs in each framework.
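For clarity, the relation between the per-device batch size and the global batch size implied by the table columns below can be expressed as a one-line calculation (a hedged sketch; the variable names simply mirror the column headers):

```python
# Global batch size per training step, from the column headers used below.
node_num = 1
gpu_num_per_node = 8
batch_size_per_device = 64  # "bsz" in this report

global_batch_size = node_num * gpu_num_per_node * batch_size_per_device
print(global_batch_size)  # 512 samples per step on a single 8-GPU node
```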
batch_size = 64
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 64 | 245.0 | 241.82 |
1 | 4 | 64 | 923.23 | 655.56 |
1 | 8 | 64 | 1836.8 | 650.8 |
batch_size = max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=96) | MXNet samples/s(max bsz=96) |
---|---|---|---|
1 | 1 | 250.71 | 288.0 |
1 | 4 | 972.8 | 733.1 |
1 | 8 | 1931.76 | 749.42 |
batch_size = 64
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 64 | 245.29 | 233.88 |
1 | 4 | 64 | 938.83 | 651.44 |
1 | 8 | 64 | 1854.15 | 756.96 |
batch_size = max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=115) | MXNet samples/s(max bsz=96) |
---|---|---|---|
1 | 1 | 246.55 | 242.2 |
1 | 4 | 970.1 | 724.26 |
1 | 8 | 1921.87 | 821.06 |
batch_size=64
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 64 | 246.45 | 218.84 |
1 | 4 | 64 | 948.96 | 787.07 |
1 | 8 | 64 | 1872.81 | 1423.12 |
2 | 8 | 64 | 3540.09 | 2612.65 |
4 | 8 | 64 | 6931.6 | 5008.72 |
batch_size=max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=120) | MXNet samples/s(max bsz=104) |
---|---|---|---|
1 | 1 | 256.61 | 229.11 |
1 | 4 | 990.82 | 844.37 |
1 | 8 | 1962.76 | 1584.89 |
2 | 8 | 3856.52 | 2845.97 |
4 | 8 | 7564.74 | 5476.51 |
batch_size = 64
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 64 | 230.22 | - |
1 | 4 | 64 | 847.71 | - |
1 | 8 | 64 | 1688.62 | - |
batch_size = max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=85) | MXNet samples/s(max bsz=?) |
---|---|---|---|
1 | 1 | 229.94 | - |
1 | 4 | 856.61 | - |
1 | 8 | 1707.03 | - |
batch_size = 64
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 64 | 230.33 | - |
1 | 4 | 64 | 912.24 | - |
1 | 8 | 64 | 1808.27 | - |
batch_size = max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=100) | MXNet samples/s(max bsz=?) |
---|---|---|---|
1 | 1 | 231.86 | - |
1 | 4 | 925.85 | - |
1 | 8 | 1844.66 | - |
Note: The MXNet data parallelism and model parallelism results are missing because the scripts under insightface/recognition/ArcFace/ do not support the Glint360k dataset.
batch_size=64
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 64 | 245.12 | 194.01 |
1 | 4 | 64 | 945.44 | 730.29 |
1 | 8 | 64 | 1858.57 | 1359.2 |
batch_size=max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=115) | MXNet samples/s(max bsz=96) |
---|---|---|---|
1 | 1 | 248.01 | 192.18 |
1 | 4 | 973.63 | 811.34 |
1 | 8 | 1933.88 | 1493.51 |
batch_size = 256
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 256 | 1961.52 | 786.94 |
1 | 4 | 256 | 7354.49 | 1055.88 |
1 | 8 | 256 | 14298.02 | 1031.1 |
batch_size = max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=350) | MXNet samples/s(max bsz=368) |
---|---|---|---|
1 | 1 | 1969.66 | 931.88 |
1 | 4 | 7511.53 | 1044.38 |
1 | 8 | 14756.03 | 1026.68 |
batch_size = 256
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 256 | 1963.62 | 984.2 |
1 | 4 | 256 | 7264.54 | 984.88 |
1 | 8 | 256 | 14049.75 | 1030.58 |
batch_size = max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=400) | MXNet samples/s(max bsz=352) |
---|---|---|---|
1 | 1 | 1969.65 | 974.26 |
1 | 4 | 7363.77 | 1017.78 |
1 | 8 | 14436.38 | 1038.6 |
num_classes = max
node_num | gpu_num_per_node | batch_size_per_device | FP16 | Model Parallel | Partial FC | OneFlow num_classes | MXNet num_classes |
---|---|---|---|---|---|---|---|
1 | 1 | 64 | True | True | True | 2000000 | 1800000 |
1 | 8 | 64 | True | True | True | 13500000 | 12000000 |
The above series of tests show that:

- As `batch_size_per_device` increases, the throughput of MXNet struggles to improve further even with Partial FC optimization, while the throughput of OneFlow maintains stable, near-linear growth (see the calculation after this list).
- Under the same conditions, OneFlow supports a larger `batch_size` and `num_classes`. With a batch size of 64 on one machine with 8 GPUs and FP16, model_parallel, and partial_fc enabled, the `num_classes` supported by OneFlow is 1.125 times that supported by MXNet (13,500,000 vs. 12,000,000).
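As an illustration of the scalability claim, the speedups can be recomputed directly from the tables above (a small sketch added for readability; it is not part of the original test scripts):

```python
# Scaling efficiency = throughput(N GPUs) / (N * throughput(1 GPU)),
# using the first batch_size = 64 table in this report.
oneflow_1gpu, oneflow_8gpu = 245.0, 1836.8
mxnet_1gpu, mxnet_8gpu = 241.82, 650.8

print(oneflow_8gpu / (8 * oneflow_1gpu))  # ~0.94, close to linear scaling
print(mxnet_8gpu / (8 * mxnet_1gpu))      # ~0.34

# Max num_classes ratio with FP16 + model parallel + partial fc on 8 GPUs:
print(13_500_000 / 12_000_000)            # 1.125
```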
For more details of the data, please check the OneFlow and MXNet reports in DLPerf.