Skip to content

Latest commit

 

History

History
148 lines (116 loc) · 5.97 KB

graph_optimization.md

File metadata and controls

148 lines (116 loc) · 5.97 KB

Graph Optimization

Introduction

Graph optimization is primarily focused on two scenarios, shown below:

  1. FP32 optimization. This is similar to the TensorFlow optimization tool optimize_for_inference while LPOT enables more optimizations (such as common subexpression elimination).

  2. Auto-mixed precision optimization. LPOT generates the optimal model with auto-mixed precision (bfloat16 and FP32) and allows for additional auto-tuning per accuracy requirements.

How to use it

See the following three examples which demonstrate graph optimization API usage.

FP32 Optimization

LPOT runs the graph optimization under FP32 Optimization by default. In other words, the precisions field is explicitly set to fp32:

    from lpot.experimental import Graph_Optimization
    graph_optimizer = Graph_Optimization()
    graph_optimizer.precisions = 'fp32' #Optional, default is 'fp32'
    graph_optimizer.input = 'input'  # Optional
    graph_optimizer.output = 'op_to_store'  # Optional
    graph_optimizer.model = '/path/to/model'
    optimized_model = graph_optimizer()

Auto-mixed Precision Optimization

Default auto-mixed precision

The only difference between this and the default mode (FP32 optimization) is that bf16 must be added to the precisions field.

    from lpot.experimental import Graph_Optimization
    graph_optimizer = Graph_Optimization()
    graph_optimizer.precisions = 'bf16, fp32'
    graph_optimizer.input = 'input'  # Optional
    graph_optimizer.output = 'op_to_store'  # Optional
    graph_optimizer.model = '/path/to/model'
    optimized_model = graph_optimizer()

Note the fp32 is optional when the bf16 is set to precisions field. The below example has the identical action under the hardware platform supports bf16, e.g, the CPX platform.

    from lpot.experimental import Graph_Optimization
    graph_optimizer = Graph_Optimization()
    graph_optimizer.precisions = 'bf16'
    graph_optimizer.model = '/path/to/model'
    optimized_model = graph_optimizer()

For those platforms without bf16 enabling, like CLX. LPOT also could leverage the graph optimization feature to generate the model under bf16 precision.The usage is just adding the FORCE_BF16=1 before the cmd. e.g, FORCE_BF16=1 /path/to/executable_lpot_wrapper. If we don't add such prefix FORCE_BF16=1, the LPOT would exit consequently.

Auto-mixed precision with auto-tuning

LPOT also supports tuning the model in graph optimization mode. The end user must replace the quantization field with graph_optimization parts such as shown below. The precisions field only supports bf16 and fp32.

graph_optimization:
  precisions: ['bf16', 'fp32']

Note that if we remove the evaluation field from the yaml file, the graph optimization will only convert the model depending on the precisions setting.

When the graph_optimization field is set and the evaluation field exists in the yaml file, LPOT executes the similar process like quantization. It means the LPOT converts op into bf16 as much as possible and checks the metric later. If the metric meets the criterion, LPOT exits or it fallbacks one op to fp32 and re-runs the above process until it meets the exit policy setting.

Below is an example of using yaml to trigger graph optimization.

    from lpot.experimental import Graph_Optimization
    graph_optimizer = Graph_Optimization('/path/to/config.yaml')
    graph_optimizer.model = '/path/to/model'
    optimized_model = graph_optimizer()

Examples

FP32 optimization

The below example demonstrate how to speed up the Resnet50 FP32 throughput performance via Graph Optimization.

  1. Download the pre-trained ResNet-50 model with below command.
  wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_6/resnet50_fp32_pretrained_model.pb
  1. Measure the performance on original FP32 model.

First of all, we create the resnet50_measurement.yaml with below settings for leveraging LPOT Benchmark API.

  model:
    name: resnet50_v1
    framework: tensorflow

  evaluation:
    performance:
      configs:
        cores_per_instance: 28
        num_of_instance: 1
      dataloader:
        batch_size: 100
        dataset:
          dummy:
            shape: [1000, 224, 224, 3]

Then, we can leverage the Benchmark API to measure the performance.

from lpot.experimental import Benchmark
evaluator = Benchmark('/path/to/resnet50_measurement.yaml')
evaluator.model = '/path/to/resnet50_fp32_pretrained_model.pb'
evaluator('performance')

We got below performance result under Intel Xeon Scalable processor Cascade Lake 8280.

performance mode benchmark result:
2021-05-28 15:16:11 [INFO] Batch size = 100
2021-05-28 15:16:11 [INFO] Latency: 7.165 ms
2021-05-28 15:16:11 [INFO] Throughput: 139.567 images/sec
  1. Re-Measure the performance on optimized FP32 model.
from lpot.experimental import Graph_Optimization

graph_optimizer = Graph_Optimization()
graph_optimizer.model = '/path/to/resnet50_fp32_pretrained_model.pb'
output_graph = graph_optimizer()
output_graph.save('/path/to/fp32_optimized_model')

Then, We measure the optimized performance via LPOT Benchmark API again.

from lpot.experimental import Benchmark
evaluator = Benchmark('/path/to/resnet50_measurement.yaml')
evaluator.model = '/path/to/fp32_optimized_model'
evaluator('performance')

Now, the throughput has been improved ~2.3x (325.99 vs 139.56) compared with the initial data.

performance mode benchmark result:
2021-05-28 15:16:41 [INFO] Batch size = 100
2021-05-28 15:16:41 [INFO] Latency: 3.068 ms
2021-05-28 15:16:41 [INFO] Throughput: 325.992 images/sec