2023 June 27 tvm

TVMC 编译和优化模型

官方文档

运行环境

Windows 10 家庭中文版 22H2
LLVM 8.0.0
TVM 0.11.0

使用 TVMC 编译和优化模型

使用 TVMC，即 TVM 命令行驱动程序。TVMC 工具，它暴露了 TVM 的功能，如 auto-tuning、编译、profiling 和通过命令行界面执行模型。

在完成本节内容后，将使用 TVMC 来完成以下任务：

为 TVM 运行时编译预训练 ResNet-50 v2 模型。
通过编译后的模型运行真实图像，并解释输出和模型的性能。
使用 TVM 在 CPU 上调优模型。
使用 TVM 收集的调优数据重新编译优化模型。
通过优化后的模型运行图像，并比较输出和模型的性能。

本节的目的是让你了解 TVM 和 TVMC 的能力，并为理解 TVM 的工作原理奠定基础。

将 ONNX 模型编译到 TVM Runtime

resnet50-v2-7.onnx

tvmc compile --target

metal
llvm
vulkan
stackvm
c
cuda
opencl
rocm
hexagon
nvptx
webgpu
sdaccel
aocl
example_target_hook
ccompiler
aocl_sw_emu
ext_dev
hybrid
composite
test
cudnn

# 将 ONNX 模型编译到 TVM Runtime
tvmc compile --target "llvm" --input-shapes "data:[1,3,224,224]" --output ./build/resnet50-v2-7-llvm.tar ./params/resnet50-v2-7.onnx

tvmc compile --target "llvm -mcpu=cascadelake" --input-shapes "data:[1,3,224,224]" --output ./build/resnet50-v2-7-cascadelake.tar ./params/resnet50-v2-7.onnx

tvmc compile --target "opencl" --input-shapes "data:[1,3,224,224]" --output ./build/resnet50-v2-7-opencl.tar ./params/resnet50-v2-7.onnx

tvmc compile --target "cuda" --input-shapes "data:[1,3,224,224]" --output ./build/resnet50-v2-7-cuda.tar ./params/resnet50-v2-7.onnx

mkdir models
tar -xvf ./build/resnet50-v2-7-tvm.tar -C models
ls models

mod.so 是可被 TVM runtime 加载的模型，表示为 C++ 库。
mod.json 是 TVM Relay 计算图的文本表示。

mod.params 是包含预训练模型参数的文件。

  定义正确的 target

  指定正确的目标（选项 --target）可以对编译后的模块的性能产生巨大的影响，因为它可以利用目标上可用的硬件特性。

使用 TVMC 运行来自编译模块的模型

输入预处理

#!python ./preprocess.py
from tvm.contrib.download import download_testdata
from PIL import Image
import numpy as np

# https://s3.amazonaws.com/model-server/inputs/kitten.jpg
img_path = "./data/kitten.jpg"

# 重设大小为 224x224
resized_image = Image.open(img_path).resize((224, 224))
img_data = np.asarray(resized_image).astype("float32")

# ONNX 需要 NCHW 输入, 因此对数组进行转换
img_data = np.transpose(img_data, (2, 0, 1))

# 根据 ImageNet 进行标准化
imagenet_mean = np.array([0.485, 0.456, 0.406])
imagenet_stddev = np.array([0.229, 0.224, 0.225])
norm_img_data = np.zeros(img_data.shape).astype("float32")
for i in range(img_data.shape[0]):
      norm_img_data[i, :, :] = (img_data[i, :, :] / 255 - imagenet_mean[i]) / imagenet_stddev[i]

# 添加 batch 维度
img_data = np.expand_dims(norm_img_data, axis=0)

# 保存为 .npz（输出 imagenet_cat.npz）
np.savez("./build/imagenet_cat", data=img_data)

运行编译模块

tvmc run --device

cpu
cuda
cl
metal
vulkan
rocm
micro

tvmc run --device cpu --inputs ./build/imagenet_cat.npz --output ./build/predictions-llvm.npz ./build/resnet50-v2-7-llvm.tar

tvmc run --device cpu --inputs ./build/imagenet_cat.npz --output ./build/predictions-cascadelake.npz ./build/resnet50-v2-7-cascadelake.tar

tvmc run --device cl --inputs ./build/imagenet_cat.npz --output ./build/predictions-cl.npz ./build/resnet50-v2-7-opencl.tar

tvmc run --device cuda --inputs ./build/imagenet_cat.npz --output ./build/predictions-cuda.npz ./build/resnet50-v2-7-cuda.tar

.tar 模型文件包括 C++ 库，对 Relay 模型的描述，以及模型的参数。TVMC 包括 TVM 运行时，它可以加载模型并根据输入进行预测。当运行上述命令时，TVMC 会输出新文件，predictions.npz，其中包含 NumPy 格式的模型输出张量。

输出后处理

每个模型都会有自己的特定方式来提供输出张量。需要运行一些后处理，利用为模型提供的查找表，将 ResNet-50 v2 的输出渲染成人类可读的形式。

#!python ./postprocess.py
import os.path
import numpy as np

from scipy.special import softmax

from tvm.contrib.download import download_testdata

# 下载标签列表
# https://s3.amazonaws.com/onnx-model-zoo/synset.txt
labels_path = "./data/synset.txt"
with open(labels_path, "r") as f:
    labels = [l.rstrip() for l in f]

output_file = "./build/predictions.npz"

# 打开并读入输出张量
if os.path.exists(output_file):
    with np.load(output_file) as data:
        scores = softmax(data["output_0"])
        scores = np.squeeze(scores)
        ranks = np.argsort(scores)[::-1]

        for rank in ranks[0:5]:
            print("class='%s' with probability=%f" % (labels[rank], scores[rank]))

# class='n02123045 tabby, tabby cat' with probability=0.610552
# class='n02123159 tiger cat' with probability=0.367180
# class='n02124075 Egyptian cat' with probability=0.019365
# class='n02129604 tiger, Panthera tigris' with probability=0.001273
# class='n04040759 radiator' with probability=0.000261

自动调优 ResNet 模型

在某些情况下，当使用编译模块运行推理时，可能无法获得预期的性能。在这种情况下，可以利用自动调优器，为模型找到更好的配置，获得性能的提升。TVM 中的调优是指对模型进行优化以在给定目标上更快地运行的过程。这与训练或微调不同，因为它不影响模型的准确性，而只影响运行时的性能。作为调优过程的一部分，TVM 将尝试运行许多不同的算子实现变体，以观察哪些算子表现最佳。这些运行的结果被存储在调优记录文件中，这最终是 tune 子命令的输出。

在最简单的形式下，调优要求你提供三样东西：

你打算在这个模型上运行的设备的目标规格
输出文件的路径，调优记录将被保存在该文件中
最后是要调优的模型的路径。

默认搜索算法需要 xgboost。

tvmc tune --target "llvm" --output ./build/resnet50-v2-7-llvm-autotuner_records.json ./params/resnet50-v2-7.onnx

tvmc tune --target "opencl" --output ./build/resnet50-v2-7-opencl-autotuner_records.json ./params/resnet50-v2-7.onnx

tvmc tune --target "cuda" --output ./build/resnet50-v2-7-cuda-autotuner_records.json ./params/resnet50-v2-7.onnx

备注：直接运行调优可能会跑不通，参考 issuue 13431 解决 tvmc tune resnet50 ERROR 的问题。

import onnx

onnx_model = onnx.load_model('params/resnet50-v2-7.onnx')
onnx_model.graph.input[0].type.tensor_type.shape.dim[0].dim_value = 1
onnx_model.graph.output[0].type.tensor_type.shape.dim[0].dim_value = 1
onnx.checker.check_model(onnx_model)
onnx.save(onnx_model, 'params/resnet50-v2-7-frozen.onnx')

此例中，为 –target 标志指定更具体的 target 时，会得到更好的结果。例如，在 Intel i7 处理器上，可用 –target llvm -mcpu=skylake。这个调优示例把 LLVM 作为指定架构的编译器，在 CPU 上进行本地调优。

TVMC 针对模型的参数空间进行搜索，为算子尝试不同的配置，然后选择平台上运行最快的配置。虽然这是基于 CPU 和模型操作的引导式搜索，但仍需要几个小时才能完成搜索。搜索的输出将保存到 resnet50-v2-7-autotuner_records.json 文件中，该文件之后会用于编译优化模型。

定义调优搜索算法

这个搜索算法默认用 XGBoost Grid 算法进行引导。根据模型复杂度和可用时间，可选择不同的算法。完整列表可查看 tvmc tune –help。

对于消费级的 Skylake CPU，输出如下：

tvmc tune --target "llvm -mcpu=cascadelake" --output ./build/resnet50-v2-7-autotuner_records.json ./params/resnet50-v2-7.onnx

调优 session 需要很长时间，因此 tvmc tune 提供了许多选项来自定义调优过程，包括重复次数（例如 –repeat 和 –number）、要用的调优算法等。查看 tvmc tune –help 了解更多信息。

使用调优数据编译优化模型

从上述调优过程的输出文件 `resnet50-v2-7-autotuner_records.json 可获取调优记录。该文件可用来：

作为进一步调优的输入（通过 tvmc tune –tuning-records）
作为编译器的输入

执行 tvmc compile –tuning-records 命令让编译器利用这个结果为指定 target 上的模型生成高性能代码。查看 tvmc compile –help 来获取更多信息。

模型的调优数据收集到后，可用优化的算子重新编译模型来加快计算速度。

tvmc compile --target "llvm" --tuning-records resnet50-v2-7-autotuner_records.json --output resnet50-v2-7-tvm_autotuned.tar resnet50-v2-7.onnx

验证优化模型是否运行并产生相同结果：

tvmc run --inputs imagenet_cat.npz --output predictions.npz resnet50-v2-7-tvm_autotuned.tar

python postprocess.py

验证预测值是否相同：

# class='n02123045 tabby, tabby cat' with probability=0.610550
# class='n02123159 tiger cat' with probability=0.367181
# class='n02124075 Egyptian cat' with probability=0.019365
# class='n02129604 tiger, Panthera tigris' with probability=0.001273
# class='n04040759 radiator' with probability=0.000261

比较调优和未调优的模型

TVMC 提供了模型之间的基本性能评估工具。可指定重复次数，也可指定 TVMC 报告模型的运行时间（独立于 runtime 启动）。可大致了解调优对模型性能的提升程度。例如，对 Intel i7 系统进行测试时，调优后的模型比未调优的模型运行速度快 47%：

tvmc run --inputs imagenet_cat.npz --output predictions.npz --print-time --repeat 100 resnet50-v2-7-tvm_autotuned.tar

# Execution time summary:
# mean (ms)   max (ms)    min (ms)    std (ms)
#     92.19     115.73       89.85        3.15

tvmc run --inputs imagenet_cat.npz --output predictions.npz  --print-time --repeat 100 resnet50-v2-7-tvm.tar

# Execution time summary:
# mean (ms)   max (ms)    min (ms)    std (ms)
#    193.32     219.97      185.04        7.11