Finding where TensorRT is installed: a TensorRT usage tutorial

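TensorFlow-TensorRT (TF-TRT) is an integration of TensorFlow and TensorRT that optimizes inference on NVIDIA GPUs within the TensorFlow ecosystem. Its simple API makes it easy to obtain large performance gains on NVIDIA GPUs. The integration applies TensorRT optimizations to the supported parts of a model and falls back to native TensorFlow for any operators that TensorRT does not support.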

TensorRT
https://developer.nvidia.com/tensorrt

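An earlier post on the TF-TRT integration covered the workflow for TensorFlow 1.13 and earlier. This post covers the TensorRT integration in TensorFlow 2.x and walks through an example workflow with the latest API. If you are new to the integration, that is fine: this post contains everything you need to get started. On an NVIDIA T4 GPU, the TensorRT integration can improve inference performance by up to 2.4x compared with native TensorFlow.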

Post
https://blog.tensorflow.org/2019/06/high-performance-inference-with-TensorRT.html

TF-TRT integration

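When TF-TRT is enabled, the first step parses the trained model and partitions the graph into subgraphs that TensorRT supports and subgraphs that it does not. Each TensorRT-supported subgraph is then wrapped in a special TensorFlow operation (TRTEngineOp). In the second step, an optimized TensorRT engine is built for each TRTEngineOp node. Subgraphs that TensorRT does not support are left unchanged and are handled by the TensorFlow runtime. Figure 1 illustrates this.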

TF-TRT
https://github.com/tensorflow/tensorrt
TensorRT-supported
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html?ncid=partn-31097#supported-ops

TF-TRT keeps the flexibility of TensorFlow while applying TensorRT optimizations to the supported subgraphs. TensorRT optimizes and executes only part of the graph; TensorFlow executes the rest.

In the inference example in Figure 1, TensorFlow executes the Reshape and Cast operations. TensorFlow then hands execution of the prebuilt TensorRT engine, TRTEngineOp_0, to the TensorRT runtime.

Figure 1: An example of graph partitioning and building TRT engines in TF-TRT
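The partitioning step can be pictured with a toy model. The sketch below is not TF-TRT's actual algorithm: it treats the graph as a linear list of op names rather than a real dataflow graph, and the `SUPPORTED` set and the `partition` helper are hypothetical names introduced only for illustration. It groups consecutive supported ops into a "TRT" segment and leaves everything else, including supported runs shorter than `minimum_segment_size`, to TensorFlow.

```python
from itertools import groupby

# Hypothetical set of op types treated as TensorRT-supported in this toy example.
SUPPORTED = {"Conv2D", "BiasAdd", "Relu", "MaxPool", "MatMul"}


def partition(ops, supported=SUPPORTED, minimum_segment_size=3):
    """Split a linear op sequence into ("TRT" | "TF", [ops]) segments."""
    segments = []
    for is_supported, run in groupby(ops, key=lambda op: op in supported):
        run = list(run)
        # Supported runs that are too short stay in TensorFlow.
        big_enough = len(run) >= minimum_segment_size
        target = "TRT" if is_supported and big_enough else "TF"
        if segments and segments[-1][0] == target:
            # Merge with the previous segment when both go to the same runtime.
            segments[-1][1].extend(run)
        else:
            segments.append((target, run))
    return segments


ops = ["Reshape", "Cast", "Conv2D", "BiasAdd", "Relu", "MaxPool", "ArgMax", "MatMul"]
for target, run in partition(ops):
    print(target, run)
```

In this toy run, the four consecutive supported ops form one TRT segment, while the lone trailing MatMul is too short to justify an engine and stays with TensorFlow.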

Workflow

In this section, we walk through a typical TF-TRT workflow using an example.

Figure 2: Workflow for running inference in TensorFlow only, and for running inference in TensorFlow-TensorRT with a converted SavedModel

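Figure 2 shows the standard inference workflow in native TensorFlow and contrasts it with the TF-TRT workflow. The SavedModel format contains all the information needed to share or deploy a trained model. In native TensorFlow, the workflow typically involves loading the saved model and running inference with the TensorFlow runtime. TF-TRT adds a few steps: applying TensorRT optimizations to the TensorRT-supported subgraphs of the model and, optionally, building the TensorRT engines ahead of time.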

SavedModel
https://tensorflow.google.cn/guide/saved_model

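First, create an object to hold the conversion parameters, including a precision mode. The precision mode indicates the minimum precision (for example FP32, FP16, or INT8) that TF-TRT can use to implement the TensorFlow operations. Then create a converter object, which takes the conversion parameters and the input from a saved model. Note that in TensorFlow 2.x, TF-TRT only supports models saved in the TensorFlow SavedModel format.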

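Next, when we call the converter's convert() method, TF-TRT converts the graph by replacing the TensorRT-compatible portions with TRTEngineOps. For better performance at runtime, you can use the converter's build() method to create the TensorRT execution engines ahead of time. The build() method requires the input data shapes to be known before the optimized TensorRT execution engines are built. If the input data shapes are not known, the TensorRT execution engines can be built at runtime once the input data is available. The engines must be built on the same type of GPU that will run inference, because the build process is GPU specific. For example, an execution engine built for an NVIDIA A100 GPU will not run on an NVIDIA T4 GPU.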

build()
https://tensorflow.google.cn/api_docs/python/tf/experimental/tensorrt/Converter
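Because prebuilt engines are tied to the GPU type they were built on, one way to avoid loading a model built for the wrong device is to record the GPU name next to the saved model. The sketch below is only bookkeeping and is not part of the TF-TRT API: `gpu_tag` and `can_reuse` are hypothetical helper names, and the GPU name is assumed to come from a tool such as nvidia-smi.

```python
import re


def gpu_tag(gpu_name, precision):
    """Build a directory-friendly tag such as 'tesla-t4_fp16' (hypothetical scheme)."""
    slug = re.sub(r"[^a-z0-9]+", "-", gpu_name.lower()).strip("-")
    return f"{slug}_{precision.lower()}"


def can_reuse(built_for_tag, current_gpu_name, precision):
    """Return True only if the engines were built for this GPU type and precision."""
    return built_for_tag == gpu_tag(current_gpu_name, precision)


# Engines built on an A100 should not be reused on a T4.
a100_tag = gpu_tag("NVIDIA A100", "FP16")
print(a100_tag)
print(can_reuse(a100_tag, "Tesla T4", "FP16"))
```

When the tags do not match, the safer choice is to skip the prebuilt engines and let TF-TRT build them at runtime on the target GPU.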

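Finally, call the save method to write the TF-TRT converted model to disk. The following code block shows the code for the workflow steps described in this section: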

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Conversion Parameters 
conversion_params = trt.TrtConversionParams(
    precision_mode=trt.TrtPrecisionMode.FP16)  # or trt.TrtPrecisionMode.FP32

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=input_saved_model_dir,
    conversion_params=conversion_params)

# Converter method used to partition and optimize TensorRT compatible segments
converter.convert()

# Optionally, build TensorRT engines before deployment to save time at runtime
# Note that this is GPU specific, and as a rule of thumb, we recommend building at runtime
converter.build(input_fn=my_input_fn)

# Save the model to the disk 
converter.save(output_saved_model_dir)

save
https://tensorflow.google.cn/api_docs/python/tf/experimental/tensorrt/Converter

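As the code example above shows, the build() method needs an input function that corresponds to the shapes of the input data. An example input function is shown below: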

import numpy as np

# input_fn: a generator function that yields input data as a list or tuple,
# which will be used to execute the converted signature to generate TensorRT
# engines. Example:
def my_input_fn():
    # Let's assume a network with 2 input tensors. We generate 3 sets
    # of dummy input data:
    input_shapes = [[(1, 16), (2, 16)],  # min and max range for 1st input list
                    [(2, 32), (4, 32)],  # min and max range for 2nd list of two tensors
                    [(4, 32), (8, 32)]]  # 3rd input list
    for shapes in input_shapes:
        # return a list of input tensors
        yield [np.zeros(x).astype(np.float32) for x in shapes]

INT8 support

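Compared with FP32 and FP16, INT8 requires additional calibration data to determine the best quantization thresholds. When the precision mode in the conversion parameters is INT8, you need to pass an input function to the convert() call. This input function is similar to the one passed to the build() method. In addition, the calibration data generated by the input function passed to convert() should be statistically similar to the actual data seen during inference.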

from tensorflow.python.compiler.tensorrt import trt_convert as trt

conversion_params = trt.TrtConversionParams(
    precision_mode=trt.TrtPrecisionMode.INT8)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=input_saved_model_dir,
    conversion_params=conversion_params)

# requires some data for calibration
converter.convert(calibration_input_fn=my_input_fn)

# Optionally build TensorRT engines before deployment.
# Note that this is GPU specific, and as a rule of thumb we recommend building at runtime
converter.build(input_fn=my_input_fn)

converter.save(output_saved_model_dir)
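To see why the calibration data should resemble real inference data, it helps to look at what a quantization threshold does. The sketch below uses a simple symmetric max-abs rule, which is an assumption made for illustration; TensorRT's actual calibration is more sophisticated. The helper names `int8_scale`, `quantize`, and `dequantize` are hypothetical.

```python
def int8_scale(calibration_values):
    """Pick a scale so the largest calibration magnitude maps to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0


def quantize(x, scale):
    """Map a float to an integer in [-127, 127], saturating outside the range."""
    q = round(x / scale)
    return max(-127, min(127, q))


def dequantize(q, scale):
    """Map the integer code back to an approximate float value."""
    return q * scale


# Made-up calibration samples; the threshold ends up at 2.54.
calibration = [-2.0, -0.5, 0.1, 1.0, 2.54]
scale = int8_scale(calibration)

print(quantize(1.0, scale))                      # value inside the calibrated range
print(quantize(10.0, scale))                     # value outside the range saturates
print(dequantize(quantize(10.0, scale), scale))  # recovered value is clipped to the threshold
```

A value far outside the calibrated range is clipped to the threshold, so calibration data that does not match the inference data can cost accuracy.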

Example: ResNet-50

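The rest of this post takes a TensorFlow 2.x ResNet-50 model through the workflow: training it, saving it, optimizing it with TF-TRT, and finally deploying it for inference. We also compare inference throughput between native TensorFlow and TF-TRT in three precision modes: FP32, FP16, and INT8.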

Prerequisites for the example

Ubuntu OS
Docker (https://docs.docker.com/get-docker)
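The latest TensorFlow 2.x container:
- docker pull tensorflow/tensorflow:latest-gpu
NVIDIA Container Toolkit (https://github.com/NVIDIA/NVIDIA-docker), which lets you use NVIDIA GPUs inside a Docker container.
NVIDIA Driver >= 450 installed on the host (the requirement at the time of writing; check the requirements of the latest tensorflow container). You can check the version currently installed on your machine with: nvidia-smi | grep "Driver Version:"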
NVIDIA Driver >= 450
https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html
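The driver check can also be scripted. The sketch below parses the text printed by nvidia-smi and compares the major driver version with the 450 minimum given above. The helper names are hypothetical, and the sample line uses a made-up version number purely for illustration.

```python
import re
import subprocess


def driver_major_version(nvidia_smi_text):
    """Extract the major driver version from nvidia-smi output, or None if absent."""
    match = re.search(r"Driver Version:\s*(\d+)", nvidia_smi_text)
    return int(match.group(1)) if match else None


def meets_requirement(nvidia_smi_text, minimum=450):
    """Return True if a driver version was found and is at least `minimum`."""
    major = driver_major_version(nvidia_smi_text)
    return major is not None and major >= minimum


def read_nvidia_smi():
    """Run nvidia-smi on the host and return its text output (needs an NVIDIA driver)."""
    return subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=True
    ).stdout


# Made-up sample line in the same layout as the nvidia-smi header.
sample = "| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0 |"
print(driver_major_version(sample), meets_requirement(sample))
```

On a real host, pass `read_nvidia_smi()` instead of `sample`.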

Training ResNet-50 using the TensorFlow 2.x container

First, download the latest version of the ResNet-50 model from the TensorFlow GitHub repository:

# Clone the latest commit of the repository into the current directory
$ git clone --depth 1 https://github.com/tensorflow/models.git .

# List the files and directories present in our working directory
$ ls -al

rwxrwxr-x user user 4 KiB Wed Sep 30 15:31:05 2020 ./
rwxrwxr-x user user 4 KiB Wed Sep 30 15:30:45 2020 ../
rw-rw-r-- user user 337 B Wed Sep 30 15:31:05 2020 AUTHORS
rw-rw-r-- user user 1015 B Wed Sep 30 15:31:05 2020 CODEOWNERS
rwxrwxr-x user user 4 KiB Wed Sep 30 15:31:05 2020 community/
rw-rw-r-- user user 390 B Wed Sep 30 15:31:05 2020 CONTRIBUTING.md
rwxrwxr-x user user 4 KiB Wed Sep 30 15:31:15 2020 .git/
rwxrwxr-x user user 4 KiB Wed Sep 30 15:31:05 2020 .github/
rw-rw-r-- user user 1 KiB Wed Sep 30 15:31:05 2020 .gitignore
rw-rw-r-- user user 1 KiB Wed Sep 30 15:31:05 2020 ISSUES.md
rw-rw-r-- user user 11 KiB Wed Sep 30 15:31:05 2020 LICENSE
rwxrwxr-x user user 4 KiB Wed Sep 30 15:31:05 2020 official/
rwxrwxr-x user user 4 KiB Wed Sep 30 15:31:05 2020 orbit/
rw-rw-r-- user user 3 KiB Wed Sep 30 15:31:05 2020 README.md
rwxrwxr-x user user 4 KiB Wed Sep 30 15:31:06 2020 research/

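As mentioned in the previous section, this example uses the latest TensorFlow container from the Docker repository. Because the container already includes the TensorRT integration, no additional installation steps are needed. Pull and start the container as follows: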

$ docker pull tensorflow/tensorflow:latest-gpu

# Please ensure that the NVIDIA Container Toolkit is installed before running the following command.
# Replace </path/to/save/data/> with the host path that will hold the training data.
$ docker run -it --rm \
   --gpus="all" \
   --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
   --workdir /workspace/ \
   -v "$(pwd):/workspace/" \
   -v "</path/to/save/data/>:/data/" \
   tensorflow/tensorflow:latest-gpu

Inside the container, you can then verify that you have access to the relevant files and to the NVIDIA GPU you want to target:

# Let's first test that we can access the ResNet-50 code that we previously downloaded
$ ls -al
drwxrwxr-x 8 1000 1000 4096 Sep 30 22:31 .git
drwxrwxr-x 3 1000 1000 4096 Sep 30 22:31 .github
-rw-rw-r-- 1 1000 1000 1104 Sep 30 22:31 .gitignore
-rw-rw-r-- 1 1000 1000 337 Sep 30 22:31 AUTHORS
-rw-rw-r-- 1 1000 1000 1015 Sep 30 22:31 CODEOWNERS
-rw-rw-r-- 1 1000 1000 390 Sep 30 22:31 CONTRIBUTING.md
-rw-rw-r-- 1 1000 1000 1115 Sep 30 22:31 ISSUES.md
-rw-rw-r-- 1 1000 1000 11405 Sep 30 22:31 LICENSE
-rw-rw-r-- 1 1000 1000 3668 Sep 30 22:31 README.md
drwxrwxr-x 2 1000 1000 4096 Sep 30 22:31 community
drwxrwxr-x 12 1000 1000 4096 Sep 30 22:31 official
drwxrwxr-x 3 1000 1000 4096 Sep 30 22:31 orbit
drwxrwxr-x 23 1000 1000 4096 Sep 30 22:31 research

# Let's verify we can see our GPUs:
$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.XX.XX Driver Version: 450.XX.XX CUDA Version: 11.X |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:1A:00.0 Off | Off |
| 38% 52C P8 14W / 70W | 1MiB / 16127MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

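Next, train ResNet-50. To avoid spending a long time training a deep learning model, this post uses the small MNIST dataset. The workflow is the same for more advanced datasets such as ImageNet.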

# Install dependencies
$ pip install tensorflow_datasets tensorflow_model_optimization

# Download MNIST data and Train
$ python -m "official.vision.image_classification.mnist_main" \
   --model_dir=./checkpoints \
   --data_dir=/data \
   --train_epochs=10 \
   --distribution_strategy=one_device \
   --num_gpus=1 \
   --download

# Let's verify that we have the trained model saved on our machine.
$ ls -al checkpoints/

-rw-r--r-- 1 root root 87 Sep 30 22:34 checkpoint
-rw-r--r-- 1 root root 6574829 Sep 30 22:34 model.ckpt-0001.data-00000-of-00001
-rw-r--r-- 1 root root 819 Sep 30 22:34 model.ckpt-0001.index
[...]
-rw-r--r-- 1 root root 6574829 Sep 30 22:34 model.ckpt-0010.data-00000-of-00001
-rw-r--r-- 1 root root 819 Sep 30 22:34 model.ckpt-0010.index
drwxr-xr-x 4 root root 4096 Sep 30 22:34 saved_model
drwxr-xr-x 3 root root 4096 Sep 30 22:34 train
drwxr-xr-x 2 root root 4096 Sep 30 22:34 validation

Obtaining a SavedModel for TF-TRT to use

After training, Google's ResNet-50 code exports the model in SavedModel format at the following path: checkpoints/saved_model/.

You can use the following sample code as a reference for exporting your own trained model as a TensorFlow SavedModel.

import numpy as np

import tensorflow as tf
from tensorflow import keras

def get_model():
    # Create a simple model.
    inputs = keras.Input(shape=(32,))
    outputs = keras.layers.Dense(1)(inputs)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

model = get_model()

# Train the model.
test_input = np.random.random((128, 32))
test_target = np.random.random((128, 1))
model.fit(test_input, test_target)

# Calling `save('my_model')` creates a SavedModel folder `my_model`.
model.save("my_model")

Code
https://tensorflow.google.cn/guide/keras/save_and_serialize#savedmodel_format

We can verify that the SavedModel produced by Google's ResNet-50 script is readable and correct:

$ ls -al checkpoints/saved_model

drwxr-xr-x 2 root root 4096 Sep 30 22:49 assets
-rw-r--r-- 1 root root 118217 Sep 30 22:49 saved_model.pb
drwxr-xr-x 2 root root 4096 Sep 30 22:49 variables

$ saved_model_cli show --dir checkpoints/saved_model/ --tag_set serve --signature_def serving_default

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

The given SavedModel SignatureDef contains the following input(s):
inputs['input_1'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 28, 28, 1)
name: serving_default_input_1:0
The given SavedModel SignatureDef contains the following output(s):
outputs['dense_1'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 10)
name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict
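The shapes reported by saved_model_cli are what an input function for build() or calibration has to match. As a rough sketch, the text output can be parsed to recover each tensor's name, dtype, and shape. The `parse_signature` helper and its regular expression are assumptions that fit the output shown above; they are not an official parser, and other output layouts (for example tensors of unknown rank) are not handled.

```python
import re

# Excerpt of the saved_model_cli output shown above.
SAMPLE = """\
The given SavedModel SignatureDef contains the following input(s):
  inputs['input_1'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 28, 28, 1)
      name: serving_default_input_1:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['dense_1'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 10)
      name: StatefulPartitionedCall:0
"""

PATTERN = re.compile(
    r"(inputs|outputs)\['([^']+)'\] tensor_info:\s*"
    r"dtype: (\w+)\s*"
    r"shape: \(([^)]*)\)"
)


def parse_signature(text):
    """Return {'inputs': {name: (dtype, shape)}, 'outputs': {...}}."""
    tensors = {"inputs": {}, "outputs": {}}
    for kind, name, dtype, shape in PATTERN.findall(text):
        dims = tuple(int(d) for d in shape.split(",") if d.strip())
        tensors[kind][name] = (dtype, dims)
    return tensors


signature = parse_signature(SAMPLE)
print(signature["inputs"])
print(signature["outputs"])
```

The leading -1 is the batch dimension, which is left unspecified in the SavedModel.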

After verifying that the SavedModel was saved correctly, we can load it with TF-TRT and start running inference.

Inference

Running ResNet-50 inference with TF-TRT

This section shows how to deploy the saved ResNet-50 model on an NVIDIA GPU using TF-TRT. As described earlier, first convert the SavedModel to a TF-TRT model with the convert method, then load the model.

import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert the SavedModel
converter = trt.TrtGraphConverterV2(input_saved_model_dir=path)
converter.convert()

# Save the converted model
converter.save(converted_model_path)

# Load converted model and infer
model = tf.saved_model.load(converted_model_path)
func = model.signatures['serving_default']  # look up the signature on the loaded model
output = func(input_tensor)

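For simplicity, we run inference with a script (tf2_inference.py). Download the script from github.com and place it in the working directory "/workspace/" of the same Docker container used earlier. You can then run the script: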

$ wget https://raw.githubusercontent.com/tensorflow/tensorrt/master/tftrt/blog_posts/Leveraging%20TensorFlow-TensorRT%20integration%20for%20Low%20latency%20Inference/tf2_inference.py

$ ls
AUTHORS CONTRIBUTING.md LICENSE checkpoints data orbit tf2_inference.py
CODEOWNERS ISSUES.md README.md community official research

$ python tf2_inference.py --use_tftrt_model --precision fp16

=========================================
Inference using: TF-TRT …
Batch size: 512
Precision: fp16
=========================================

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TrtConversionParams(rewriter_config_template=None, max_workspace_size_bytes=8589934592, precision_mode='FP16', minimum_segment_size=3, is_dynamic_op=True, maximum_cached_engines=100, use_calibration=True, max_batch_size=512, allow_build_at_runtime=True)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Processing step: 0100 ...
Processing step: 0200 ...
[...]
Processing step: 9900 ...
Processing step: 10000 ...

Average step time: 2.1 msec
Average throughput: 244248 samples/sec

tf2_inference.py
https://github.com/tensorflow/tensorrt/blob/master/tftrt/blog_posts/Leveraging%20TensorFlow-TensorRT%20integration%20for%20Low%20latency%20Inference/tf2_inference.py
github.com
https://raw.githubusercontent.com/tensorflow/tensorrt/master/tftrt/blog_posts/Leveraging%20TensorFlow-TensorRT%20integration%20for%20Low%20latency%20Inference/tf2_inference.py

Similarly, we can run inference for INT8 and FP32:

$ python tf2_inference.py --use_tftrt_model --precision int8

$ python tf2_inference.py --use_tftrt_model --precision fp32

Running inference with native TensorFlow (GPU) FP32

You can also run the unmodified SavedModel without TF-TRT acceleration.

$ python tf2_inference.py --use_native_tensorflow

=========================================
Inference using: Native TensorFlow …
Batch size: 512
=========================================

Processing step: 0100 ...
Processing step: 0200 ...
[...]
Processing step: 9900 ...
Processing step: 10000 ...

Average step time: 4.1 msec
Average throughput: 126328 samples/sec

This run was executed on an NVIDIA T4 GPU. The same workflow runs on any NVIDIA GPU.
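The two numbers reported by the script, average step time and average throughput, can be measured for any callable with a small timing loop. The sketch below is an assumption about how such a measurement might look and is not the implementation in tf2_inference.py; `benchmark` and its parameters are hypothetical names, and a plain Python function stands in for the model.

```python
import time


def benchmark(infer_fn, batch, batch_size, warmup_steps=10, timed_steps=100):
    """Return (average step time in ms, throughput in samples/sec) for infer_fn."""
    # Warm-up steps are excluded so one-time costs do not skew the average.
    for _ in range(warmup_steps):
        infer_fn(batch)

    start = time.perf_counter()
    for _ in range(timed_steps):
        infer_fn(batch)
    elapsed = time.perf_counter() - start

    step_time = elapsed / timed_steps
    return step_time * 1000.0, batch_size / step_time


# Stand-in workload: summing a list of 512 numbers plays the role of a model call.
dummy_batch = list(range(512))
step_ms, throughput = benchmark(sum, dummy_batch, batch_size=512)
print(f"Average step time: {step_ms:.4f} msec")
print(f"Average throughput: {throughput:.0f} samples/sec")
```

When timing a real GPU model, make sure `infer_fn` waits for the result (for example by converting the output to a NumPy array) so that the loop measures completed work.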

Native TF 2.x vs. TF-TRT inference performance

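With TF-TRT, a few code changes are enough to improve performance significantly. For example, using the inference script from this post with a batch size of 512 on an NVIDIA T4 GPU, we observed that TF-TRT FP16 was almost 2x faster than native TensorFlow, and TF-TRT INT8 was 2.4x faster. Actual speedups vary with factors such as the model used, the batch size, the size and format of the images in the dataset, and any CPU bottlenecks.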
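The FP16 figure can be checked directly against the throughput numbers printed by the two runs shown earlier. The short calculation below uses only those reported values; the helper name `step_time_ms` is hypothetical.

```python
# Throughput values reported by the runs shown earlier (samples/sec, batch size 512).
BATCH_SIZE = 512
TFTRT_FP16_THROUGHPUT = 244248
NATIVE_TF_THROUGHPUT = 126328


def step_time_ms(batch_size, throughput):
    """Time per batch in milliseconds implied by a throughput in samples/sec."""
    return 1000.0 * batch_size / throughput


speedup = TFTRT_FP16_THROUGHPUT / NATIVE_TF_THROUGHPUT
print(f"TF-TRT FP16 speedup over native TensorFlow: {speedup:.2f}x")
print(f"TF-TRT FP16 step time: {step_time_ms(BATCH_SIZE, TFTRT_FP16_THROUGHPUT):.1f} msec")
print(f"Native TF step time:   {step_time_ms(BATCH_SIZE, NATIVE_TF_THROUGHPUT):.1f} msec")
```

The ratio comes out at about 1.93, matching the "almost 2x" statement, and the implied step times agree with the 2.1 msec and 4.1 msec printed by the script.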

In this post, we showed the acceleration that TF-TRT provides. In addition, TF-TRT lets us use the full TensorFlow Python API and interactive environments such as Jupyter Notebook or Google Colab.

Supported operators

The TF-TRT user guide lists the operators supported in TensorRT-compatible subgraphs. Operators not on the list are executed by the native TensorFlow runtime.
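One way to see how much of a converted model actually runs in TensorRT is to count the TRTEngineOp nodes in its graph. The sketch below is a minimal illustration with hypothetical helper names: `trt_coverage` works on a plain list of op type names, and `op_types_in_saved_model` shows one assumed way to collect them from a SavedModel, which requires TensorFlow and is not run here.

```python
from collections import Counter


def trt_coverage(op_types):
    """Return (number of TRTEngineOp nodes, total number of nodes)."""
    counts = Counter(op_types)
    return counts.get("TRTEngineOp", 0), sum(counts.values())


def op_types_in_saved_model(saved_model_dir, signature="serving_default"):
    """Collect op type names from a SavedModel signature (requires TensorFlow)."""
    import tensorflow as tf  # imported lazily so trt_coverage works without TensorFlow

    model = tf.saved_model.load(saved_model_dir)
    graph_def = model.signatures[signature].graph.as_graph_def()
    ops = [node.op for node in graph_def.node]
    # Ops can also live inside the function library of the graph.
    for function in graph_def.library.function:
        ops.extend(node.op for node in function.node_def)
    return ops


# Toy op list loosely modeled on the Figure 1 description.
toy_ops = ["Reshape", "Cast", "TRTEngineOp", "ArgMax"]
engines, total = trt_coverage(toy_ops)
print(f"{engines} TRTEngineOp node(s) out of {total} nodes")
```

A count of zero after conversion suggests that no segment met the conversion criteria and that the whole model still runs in native TensorFlow.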