Face recognition automatically determines the identity of a face in an image and has a wide range of application scenarios, such as facial payment, identifying hospital ticket scalpers, epidemiological investigation of COVID-19, fugitive tracking, etc. InsightFace, an open-source 2D & 3D deep face analysis toolbox, was previously based mainly on MXNet; OneFlow has now implemented it with strict alignment of the network structure, parameters, and configuration.
This report compares the throughput of the InsightFace model between the oneflow_face and deepinsight repositories. With the same datasets, hardware environment, and algorithms, only speed performance is compared. In conclusion, OneFlow performs better in both InsightFace training throughput and distributed scalability.
- InsightFace Deep Learning Framework Tests Report
Reproduction procedures, introductions, logs, data, and the English reports can be found in the DLPerf repository: https://github.com/Oneflow-Inc/DLPerf
Framework | Version | Source |
---|---|---|
OneFlow | 0.3.4 | oneflow_face |
deepinsight | 2021-01-20 update | deepinsight/insightface |
Rigorous alignment has been completed between OneFlow and MXNet, including:
Parameter | R100(ResNet100) + face_emore | R100(ResNet100) + glint360k | Y1(MobileFaceNet) + face_emore |
---|---|---|---|
fc type | E | FC | GDC |
optimizer | SGD | SGD | SGD |
kernel initializer | random_normal_initializer(mean=0.0, stddev=0.01) | random_normal_initializer(mean=0.0, stddev=0.01) | random_normal_initializer(mean=0.0, stddev=0.01) |
loss type | arcface | cosface | arcface |
regularizer | Step Weight Decay | Step Weight Decay | Step Weight Decay |
lr_step | [100000,160000] | [200000, 400000, 500000, 550000] | [100000,160000,220000] |
scales | [0.1, 0.01] | [0.1, 0.01, 0.001, 0.0001] | [0.1, 0.01, 0.001] |
momentum | 0.9 | 0.9 | 0.9 |
weight decay | 0.0005 | 0.0005 | 0.0005 |
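To make the lr_step/scales alignment above concrete, below is a minimal Python sketch of the piecewise-constant (step decay) learning-rate schedule these values describe. The function name, the base_lr value, and the interpretation of scales as multipliers on the base learning rate are our assumptions for illustration, not code from either repository.

```python
def step_lr(base_lr, lr_step, scales, iteration):
    """Illustrative step-decay schedule (not from oneflow_face or
    deepinsight/insightface): once `iteration` passes lr_step[i],
    the learning rate becomes base_lr * scales[i]."""
    lr = base_lr
    for boundary, scale in zip(lr_step, scales):
        if iteration >= boundary:
            lr = base_lr * scale
    return lr

# R100 + face_emore row of the table above (base_lr = 0.1 is assumed):
# lr stays at 0.1, drops to ~0.01 after 100k and ~0.001 after 160k iterations.
for it in (50_000, 120_000, 200_000):
    print(it, step_lr(0.1, [100_000, 160_000], [0.1, 0.01], it))
```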
In this report, batch size means the number of samples on each device (GPU), abbreviated as bsz (batch size per GPU). The tests report either a fixed batch size or the maximum batch size supported with different numbers of GPUs in each framework.
In this report, num classes means the number of face categories. The tests report either a fixed value or the maximum num classes supported with different numbers of GPUs in each framework.
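For clarity, the relation between the per-device batch size and the global batch size implied by the table columns below can be expressed as a one-line calculation (a hedged sketch; the variable names simply mirror the column headers):

```python
# Global batch size per training step, from the column headers used below.
node_num = 1
gpu_num_per_node = 8
batch_size_per_device = 64  # "bsz" in this report

global_batch_size = node_num * gpu_num_per_node * batch_size_per_device
print(global_batch_size)  # 512 samples per step on a single 8-GPU node
```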
batch_size = 64
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 64 | 245.0 | 241.82 |
1 | 4 | 64 | 923.23 | 655.56 |
1 | 8 | 64 | 1836.8 | 650.8 |
batch_size = max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=96) | MXNet samples/s(max bsz=96) |
---|---|---|---|
1 | 1 | 250.71 | 288.0 |
1 | 4 | 972.8 | 733.1 |
1 | 8 | 1931.76 | 749.42 |
batch_size = 64
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 64 | 245.29 | 233.88 |
1 | 4 | 64 | 938.83 | 651.44 |
1 | 8 | 64 | 1854.15 | 756.96 |
batch_size = max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=115) | MXNet samples/s(max bsz=96) |
---|---|---|---|
1 | 1 | 246.55 | 242.2 |
1 | 4 | 970.1 | 724.26 |
1 | 8 | 1921.87 | 821.06 |
batch_size=64
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 64 | 246.45 | 218.84 |
1 | 4 | 64 | 948.96 | 787.07 |
1 | 8 | 64 | 1872.81 | 1423.12 |
2 | 8 | 64 | 3540.09 | 2612.65 |
4 | 8 | 64 | 6931.6 | 5008.72 |
batch_size=max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=120) | MXNet samples/s(max bsz=104) |
---|---|---|---|
1 | 1 | 256.61 | 229.11 |
1 | 4 | 990.82 | 844.37 |
1 | 8 | 1962.76 | 1584.89 |
2 | 8 | 3856.52 | 2845.97 |
4 | 8 | 7564.74 | 5476.51 |
batch_size = 64
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 64 | 230.22 | - |
1 | 4 | 64 | 847.71 | - |
1 | 8 | 64 | 1688.62 | - |
batch_size = max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=85) | MXNet samples/s(max bsz=?) |
---|---|---|---|
1 | 1 | 229.94 | - |
1 | 4 | 856.61 | - |
1 | 8 | 1707.03 | - |
batch_size = 64
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 64 | 230.33 | - |
1 | 4 | 64 | 912.24 | - |
1 | 8 | 64 | 1808.27 | - |
batch_size = max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=100) | MXNet samples/s(max bsz=?) |
---|---|---|---|
1 | 1 | 231.86 | - |
1 | 4 | 925.85 | - |
1 | 8 | 1844.66 | - |
Note: The MXNet data parallelism and model parallelism results are missing because the scripts under insightface/recognition/ArcFace/ do not support the Glint360k dataset.
batch_size=64
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 64 | 245.12 | 194.01 |
1 | 4 | 64 | 945.44 | 730.29 |
1 | 8 | 64 | 1858.57 | 1359.2 |
batch_size=max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=115) | MXNet samples/s(max bsz=96) |
---|---|---|---|
1 | 1 | 248.01 | 192.18 |
1 | 4 | 973.63 | 811.34 |
1 | 8 | 1933.88 | 1493.51 |
batch_size = 256
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 256 | 1961.52 | 786.94 |
1 | 4 | 256 | 7354.49 | 1055.88 |
1 | 8 | 256 | 14298.02 | 1031.1 |
batch_size = max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=350) | MXNet samples/s(max bsz=368) |
---|---|---|---|
1 | 1 | 1969.66 | 931.88 |
1 | 4 | 7511.53 | 1044.38 |
1 | 8 | 14756.03 | 1026.68 |
batch_size = 256
node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
---|---|---|---|---|
1 | 1 | 256 | 1963.62 | 984.2 |
1 | 4 | 256 | 7264.54 | 984.88 |
1 | 8 | 256 | 14049.75 | 1030.58 |
batch_size = max
node_num | gpu_num_per_node | OneFlow samples/s(max bsz=400) | MXNet samples/s(max bsz=352) |
---|---|---|---|
1 | 1 | 1969.65 | 974.26 |
1 | 4 | 7363.77 | 1017.78 |
1 | 8 | 14436.38 | 1038.6 |
num_classes = max
node_num | gpu_num_per_node | batch_size_per_device | FP16 | Model Parallel | Partial FC | OneFlow num_classes | MXNet num_classes |
---|---|---|---|---|---|---|---|
1 | 1 | 64 | True | True | True | 2000000 | 1800000 |
1 | 8 | 64 | True | True | True | 13500000 | 12000000 |
The above series of tests show that:

- As `batch_size_per_device` increases, the throughput of MXNet struggles to improve further even with Partial FC optimization, while the throughput of OneFlow maintains stable, near-linear growth (see the calculation after this list).
- Under the same conditions, OneFlow supports a larger `batch_size` and `num_classes`. With a batch size of 64 on one machine with 8 GPUs and FP16, model_parallel, and partial_fc enabled, the `num_classes` supported by OneFlow is 1.125 times that supported by MXNet (13,500,000 vs. 12,000,000).
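As an illustration of the scalability claim, the speedups can be recomputed directly from the tables above (a small sketch added for readability; it is not part of the original test scripts):

```python
# Scaling efficiency = throughput(N GPUs) / (N * throughput(1 GPU)),
# using the first batch_size = 64 table in this report.
oneflow_1gpu, oneflow_8gpu = 245.0, 1836.8
mxnet_1gpu, mxnet_8gpu = 241.82, 650.8

print(oneflow_8gpu / (8 * oneflow_1gpu))  # ~0.94, close to linear scaling
print(mxnet_8gpu / (8 * mxnet_1gpu))      # ~0.34

# Max num_classes ratio with FP16 + model parallel + partial fc on 8 GPUs:
print(13_500_000 / 12_000_000)            # 1.125
```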
For more details of the data, please check the OneFlow and MXNet reports in DLPerf.