TF-Lite GPU benchmark results? #91

Open
mrgloom opened this issue Nov 16, 2018 · 16 comments

Comments

@mrgloom

mrgloom commented Nov 16, 2018

Are any TF-Lite GPU benchmark results for mobile phones available?

@psyhtest
Contributor

As far as I know, TFLite only provides GPU acceleration via AndroidNN, which is available from Android 8.1. Unfortunately, the latest phones we have only support Android 8.0. If someone has a newer phone, we can provide instructions on how to benchmark TFLite there (specifically, MobileNets we are contributing to MLPerf).
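For anyone with an Android 8.1+ device, here is a rough, untested sketch of how one might request NNAPI acceleration with the TFLite benchmark tool (the binary location and the model file below are purely illustrative):

# Untested sketch: run TFLite's benchmark_model with NNAPI enabled
# (requires Android 8.1+); paths and the model file are illustrative.
adb push mobilenet_v1_0.25_128.tflite /data/local/tmp/
adb push benchmark_model /data/local/tmp/
adb shell chmod +x /data/local/tmp/benchmark_model
adb shell /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/mobilenet_v1_0.25_128.tflite \
    --use_nnapi=true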

@mrgloom
Author

mrgloom commented Nov 16, 2018

Thanks for the clarification.

For example, selecting ARM Mali-T830 in the GPU dropdown shows me benchmarks that are all on CPU and OpenCL (as far as I can see in the Crowd scenario column). Is that a lack of data, or does none of the DNN frameworks support GPU on Android?
http://cknowledge.org/repo/web.php?template=cknowledge&action=index&module_uoa=wfe&native_action=show&native_module_uoa=program.optimization

I have also found this AI benchmark for Android smartphones:
http://ai-benchmark.com/ranking.html#ranking
But information about the GPU and DNN framework is not available (maybe we can loosely assume that >= Android 8.1 uses the GPU).

@gfursin
Member

gfursin commented Nov 18, 2018

Hi @mrgloom.

If I am correct, we had time to add two GPU scenarios: Caffe (OpenCL version) and ArmCL: https://github.com/ctuning/ck-crowd-scenarios/tree/master/experiment.scenario.mobile . Note that our OpenCL versions work exclusively on the GPU (I believe we force it in the scenarios - @psyhtest, can you please confirm?), so if you see OpenCL, you can assume that the scenario ran on the GPU.

I also guess that there is just a lack of data if you don't see many GPU points - this Android app was run by volunteers, but we are not advertising it much at the moment. It was a proof-of-concept project, and we are now trying to build a more user-friendly way of adding scenarios on top of our low-level CK plugins.

However, maybe you can try to run it on your newer mobile and see if these GPU scenarios (Caffe OpenCL and ArmCL) are still working. You can get the Android app here: http://cknowledge.org/android-apps.html . Please tell us whether it works or not - I will be curious to see the results.

Thank you very much for your feedback!

@mrgloom
Author

mrgloom commented Nov 25, 2018

I have successfully run the app on a smartphone with Android 8.0.0.

Here is the list with comments:

  1. ArmCL 18.05 OpenCL: MobileNets v1 0.25 128 (it looks strange that it has a size of 141 MB)
  2. Caffe CPU v2: SqueezeNet 1.1 (36 MB, but in my experiments SqueezeNet v1.1 in Caffe should be 2.9 MB)
  3. Caffe OpenCL: SqueezeNet 1.1
  4. TFLite CPU: MobileNets v1 0.25 128

In my benchmarks, TFLite CPU is faster than ArmCL (for MobileNets v1 0.25 128), and Caffe CPU is faster than Caffe OpenCL (for SqueezeNet 1.1):
http://cknowledge.org/repo/web.php?template=cknowledge&action=index&module_uoa=wfe&native_action=show&native_module_uoa=program.optimization
Another problem is that the frameworks don't all share at least one common model, so I can't compare them directly.

@psyhtest
Contributor

psyhtest commented Nov 26, 2018

Another problem is that the frameworks don't all share at least one common model, so I can't compare them directly.

Now you can! Please take a look at our brand new dashboard functionality for the MobileNets implementations (which we are contributing to MLPerf Inference): http://cknowledge.org/dashboard

The default workflow "MobileNets (highlights)" currently shows MobileNets v1/v2 with TFLite 0.1.7 on the Firefly RK3399 and Linaro HiKey960, as well as the best points for MobileNets v1 with Arm Compute Library v18.08 on the HiKey960 (which can serve as a vendor submission example).

By default, the X dimension shows the minimum execution time per image, while the Y dimension shows the Top-1 accuracy. To the right of the workflow name is an icon that invokes additional settings, where you can filter and customise pretty much everything! For example, the Color dimension shows "Image classification rate (maximum, per second)" by default. The fastest point (MobileNets v1-0.25-128, TFLite, HiKey960) is red as it peaks at 161 images per second. If you change the Color dimension to "Image classification efficiency (maximum, per second per Watt)", you will see three red points at 17-18 images per second per Watt. Interestingly, RK3399 is a bit more efficient than HiKey960 here (at least, with the peak power values that I plucked from thin air for each platform).

The workflow "MobileNets (all)" (select it from the dropdown menu) includes all ArmCL points, exploring the available options for the convolution method, data layout and kernel tuner choices. You can discern these options on the plot thanks to the Marker overlay dimension. In the default workflow, you can only see the convolution method. Conveniently, dots over polygons mark GPU points, which are faster than the corresponding CPU points except for the least accurate models.

Have fun!

... and please let us know if you have any questions or suggestions.

@psyhtest
Contributor

psyhtest commented Nov 27, 2018

ArmCL 18.05 OpenCL: MobileNets v1 0.25 128 (it looks strange that it has a size of 141 MB)

The model itself is only ~2 MB, but we bundle it together with the engine (i.e. the library and the client program). I suspect we include a debug build, as we had issues on Android:

For some reason only debug version of the library can be used with this program on Android. When we use release version, the program gets stuck at stage "Preparing ArmCL graph".

The good news is that the same engine is reused across all ArmCL OpenCL MobileNets samples. This means that if you add any other such sample model, you will only need to download a few MB of extra weights.

/cc @Chunosov

@psyhtest
Contributor

In my benchmarks, TFLite CPU is faster than ArmCL (for MobileNets v1 0.25 128), and Caffe CPU is faster than Caffe OpenCL (for SqueezeNet 1.1)

That's expected for very small models. There's simply not enough work to keep the GPU busy, and CPU caching works well. However, if you look at the MobileNets highlights, most GPU points (with dots) lie on the Pareto-optimal frontier: for any such point, to improve speed (move left), you need to lose accuracy (move down); similarly, to improve accuracy (move up), you need to lose speed (move right).

@mrgloom
Author

mrgloom commented Nov 28, 2018

It seems that the Firefly RK3399 and Linaro HiKey960 are not real consumer phones.

It also seems that Google has benchmark results for a single phone (Pixel 1) for MobileNet variants and ShuffleNet:
https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet
Here is also some comparison between models:
https://www.tensorflow.org/lite/performance/best_practices
TFLite also seems to have its own benchmark tool:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark

@psyhtest
Contributor

While the HiKey960 is a development board, it has the same chip (HiSilicon Kirin 960) that Huawei used in several of their popular phones (including the Mate 9 Pro and P10). I have results from a real Mate 10 Pro too.

The graph in that repo is from the original MobileNets v2 paper, but it's very crude: you can only guess which model is shown and estimate its performance (e.g. ±1 ms) and accuracy (e.g. ±1%). Besides, it's very hard to reproduce: it took us several weeks to understand how to load the weights, how to preprocess the inputs and how to interpret the outputs. But now anyone can run experiments across many platforms, under different conditions, try different datasets and so on.

You would be very welcome to contribute your experimental data to the dashboard.

@psyhtest
Contributor

I've added TFLite results on Huawei Mate 10 Pro (HiSilicon Kirin 970) and Samsung Galaxy S8 US (Qualcomm Snapdragon 835). You may want to filter the results by Library=tflite-0.1.7, Version=1 and set the Color dimension to Platform. If you then look at individual models (e.g. v1-1.00-224), you will see that generally:

  • for less accurate models (0.25,0.50), Linaro HiKey960 < Firefly RK3399 < Huawei Mate 10 Pro < Samsung Galaxy S8 US, while
  • for more accurate models (0.75,1.00), Linaro HiKey960 < Firefly RK3399 < Samsung Galaxy S8 US < Huawei Mate 10 Pro,
    where "<" means "faster than".

Note, however, that the Linux devices (HiKey960 and RK3399) had the CPU frequencies set to the maximum, while the Android devices (Mate 10 Pro and Galaxy S8 US) were non-rooted, so the CPU frequencies were managed automatically.

@mrgloom
Author

mrgloom commented Dec 1, 2018

Looks good, but it would be great if one could share a link to the current 'view' of the dashboard.
Something like http://cknowledge.org/dashboard/mlperf.mobilenets&library=tflite-0.1.7&model=v1-1.00-128

Also, is peak memory usage stored somewhere in the benchmark logs?

Are the .tflite models available for direct download?
I want to test them locally with https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark
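Roughly what I plan to try, as a sketch (untested; the bazel config and device paths are my assumptions):

# Build the benchmark tool for 64-bit Android (assumes bazel is set up
# with the Android NDK) and run it on the device over adb.
bazel build -c opt --config=android_arm64 \
    //tensorflow/lite/tools/benchmark:benchmark_model
adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model /data/local/tmp/
adb push mobilenet_v2_0.35_96.tflite /data/local/tmp/
adb shell /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/mobilenet_v2_0.35_96.tflite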

Update:
It looks like TensorFlow also has a tool to measure accuracy:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/accuracy/README.md

@psyhtest
Contributor

psyhtest commented Dec 3, 2018

it would be great if one could share a link to the current 'view' of the dashboard

Thanks for your feedback! Yes, supporting links with settings is on our roadmap.

Also, is peak memory usage stored somewhere in the benchmark logs?

Not at the moment. Storing it would be easy, but we need to know how to measure it reliably. Do you have any suggestions?
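One idea we could try (untested, and assuming the device's toybox provides pidof) is to sample the peak resident set size that the kernel already tracks for the process:

# VmHWM is the process's peak resident set size ("high water mark");
# the process name benchmark_model is illustrative.
adb shell 'grep VmHWM /proc/$(pidof benchmark_model)/status'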

Are the .tflite models available for direct download?

Of course! The links are provided in the MobileNets-v1 and MobileNets-v2 README files, so you can download them directly, e.g.:

anton@diviniti:/tmp$ wget https://storage.googleapis.com/mobilenet_v2/checkpoints/mobilenet_v2_0.35_96.tgz
--2018-12-03 12:04:40--  https://storage.googleapis.com/mobilenet_v2/checkpoints/mobilenet_v2_0.35_96.tgz
Resolving storage.googleapis.com (storage.googleapis.com)... 216.58.201.16, 2a00:1450:400c:c06::80
Connecting to storage.googleapis.com (storage.googleapis.com)|216.58.201.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 37815375 (36M) [application/x-tar]
Saving to: ‘mobilenet_v2_0.35_96.tgz’

mobilenet_v2_0.35_96.tgz                           100%[================================================================================================================>]  36.06M  18.3MB/s    in 2.0s

2018-12-03 12:04:42 (18.3 MB/s) - ‘mobilenet_v2_0.35_96.tgz’ saved [37815375/37815375]

anton@diviniti:/tmp$ tar xvzf mobilenet_v2_0.35_96.tgz
./
./mobilenet_v2_0.35_96_info.txt
./mobilenet_v2_0.35_96_frozen.pb
./mobilenet_v2_0.35_96_eval.pbtxt
./mobilenet_v2_0.35_96.ckpt.data-00000-of-00001
./mobilenet_v2_0.35_96.ckpt.index
./mobilenet_v2_0.35_96.tflite
./mobilenet_v2_0.35_96.ckpt.meta

As I explained above, however, you then need to perform many manual steps (which CK does behind the scenes).
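For completeness, the CK route looks roughly like this (a sketch from memory; the repository URL and program names are assumptions, so please check the corresponding READMEs):

# Install Collective Knowledge, pull the MLPerf workflows and see
# which TFLite programs are available (names are assumptions).
python -m pip install ck
ck pull repo --url=https://github.com/ctuning/ck-mlperf
ck list program:*tflite*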

Also note that the TFLite Model Benchmarking Tool uses random input data, so it cannot be used to measure accuracy.

@mrgloom
Author

mrgloom commented Dec 18, 2018

Also, a question: are the tflite models benchmarked in single-threaded mode?

@psyhtest
Contributor

psyhtest commented Dec 18, 2018

are the tflite models benchmarked in single-threaded mode?

In the default mode, which happens to be multi-threaded.

By the way, I think part of the variation in the results is due to thread migration between big and LITTLE cores. We are planning to set up thread affinity to reduce the variation.
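For example, something along these lines (untested; the CPU mask assumes the big cores are CPUs 4-7, as on the Kirin SoCs):

# Pin the benchmark to the four big cores via taskset (toybox) and
# fix the number of threads; mask f0 = CPUs 4-7 (an assumption).
adb shell taskset f0 /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/mobilenet_v1_1.0_224.tflite \
    --num_threads=4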

@mrgloom
Author

mrgloom commented Dec 19, 2018

@psyhtest
Contributor

psyhtest commented Dec 19, 2018

Sounds about right. Most high-end mobile chips have 4 big cores, so if the 4 threads get allocated to those, you should get good enough performance.

As I mentioned, tuning the number of threads and how they are pinned to cores (thread affinity) is something we want to do in the future.
