[Reference translation] AM62A3: Models compiled with edgeai-tidl-tools cause segmentation fault on AM62A

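Please note that this content was machine translated and may contain grammatical or other translation errors; it is provided for reference only. For the accurate content, refer to the English original at the link below or translate it yourself.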

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1453488/am62a3-models-compiled-with-edgeai-tidl-tools-cause-segmentation-fault-on-am62a

Part Number: AM62A3

Tool/software:

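Hi all, I am using an AM62A target device, and offloading a model to the accelerator results in a segmentation fault.

- I have verified the "out-of-box" examples inside the Docker container that I use to compile the artifacts

- I successfully compiled the osrt_python/tfl and osrt_python/ort examples in the container (these are the files I transferred to the target)

- I have tested inference inside the Docker container without offloading

- I have tested inference on the target device without offloading

The segmentation fault appears only when I enable offloading on the AM62A device. This is the case for both the osrt_python/tfl and the osrt_python/ort models.

- I use release tag 10_00_07_00 both in the Docker container and on the target device

- I recently updated the TIDL version on the target device, following the instructions in edgeai-tidl-tools/docs/backward_compatibility.md at 10_00_07_00 · TexasInstruments/edgeai-tidl-tools

On a side note: are the instructions in edgeai-tidl-tools/docs/backward_compatibility.md (at 10_00_07_00 · TexasInstruments/edgeai-tidl-tools) sufficient to upgrade/install edgeai-tidl-tools on the target device? At some point I read something about an RTOS SDK version. Is that something I need to upgrade as well? If so, how?

What should I do about this, and how can I fix it? Thanks for your help!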

11 months ago

admin, 11 months ago

Output of python tflrt_delegate.py:

root@am62dl:/opt/edgeai-tidl-tools/examples/osrt_python/tfl# python3 tflrt_delegate.py
Running 4 Models - ['cl-tfl-mobilenet_v1_1.0_224', 'ss-tfl-deeplabv3_mnv2_ade20k_float', 'od-tfl-ssd_mobilenet_v2_300_float', 'od-tfl-ssdlite_mobiledet_dsp_320x320_coco']

Running_Model :  cl-tfl-mobilenet_v1_1.0_224

Number of subgraphs:1 , 34 nodes delegated out of 34 nodes

Segmentation fault (core dumped)

admin, 11 months ago

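Hi Stefan,

Let's figure out where this is coming from. My first suspicion is related to the tools version.

Stefan said:

- I use release tag 10_00_07_00 both in the Docker container and on the target device

The odd-numbered TIDL tools version is designated as the one that is back-portable to the previous SDK, which in this case is 9.2. Is that the SDK version you are using on your AM62A installation?

- You should have $EDGEAI_SDK_VERSION set to a string like 09_02 or 9.2 in your Linux environment (defined by the autorun script at login)
- You should have updated the firmware, the OSRT components (e.g. the TFLite and ONNXRT libraries), and other TI libraries (e.g. libtivision_apps.so) using the steps in the backward_compatibility.md document
  - It sounds like you have done this already, but I am just verifying here
- Make sure the $SOC environment variable is set to "am62a". This should also be handled by the autorun script at login

If you see a segfault on any model, then my guess is that something was not updated correctly. Your approach of testing the different components isolates the problem to the TIDL stack, so that is helpful.

I am also curious why your device's hostname is root@am62dl, but that may be intentional and we can ignore it.

Suggested steps to collect more information/logs:

- Pass debug_level=2 to the runtime when creating the model. Setting this value in examples/osrt_python/common_utils.py should be sufficient
- On the target, run /opt/vx_app_arm_remote_log.out in the background before starting the script
- Run the Python application from gdb and check the backtrace of the thread that segfaulted
- Run `pip3 freeze | grep -i "tflite\|onnx\|tidl"` and share the package versions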
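
The environment checks and the package listing above can also be collected from Python in one go. This is a minimal sketch and not part of edgeai-tidl-tools: the function names are made up for illustration, and only the variable names SOC and EDGEAI_SDK_VERSION and the three search keywords come from this thread. Unlike the grep command, it matches on the package name only.

```python
import os
from importlib import metadata

# Keywords mirror the grep pattern suggested in the thread.
KEYWORDS = ("tflite", "onnx", "tidl")


def environment_report(env=None):
    """Return the two environment variables the login script is expected to set."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in ("SOC", "EDGEAI_SDK_VERSION")}


def matching_packages(keywords=KEYWORDS):
    """Return {package name: version} for installed packages whose name contains a keyword."""
    found = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = (meta.get("Name", "") if meta is not None else "") or ""
        if any(keyword in name.lower() for keyword in keywords):
            found[name] = dist.version
    return found


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}={value!r}")
    for name, version in sorted(matching_packages().items()):
        print(f"{name}=={version}")
```

Running it both in the Docker container and on the target makes it easy to compare the two sides; a value of None for either variable would suggest the login script did not run.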

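Stefan said:
On a side note: are the instructions in edgeai-tidl-tools/docs/backward_compatibility.md (at 10_00_07_00 · TexasInstruments/edgeai-tidl-tools) sufficient to upgrade/install edgeai-tidl-tools on the target device? At some point I read something about an RTOS SDK version. Is that something I need to upgrade as well? If so, how?

Yes, the instructions there are sufficient to upgrade the TIDL stack (not just edgeai-tidl-tools) on the previous SDK with the latest bug fixes and changes, with one caveat: the memory map must be compatible between the EVM and your hardware platform. If you are on the starter kit EVM, you can ignore this.

I do not think the RTOS SDK (probably PSDK RTOS) is needed here, but if you come across it again, please point it out to me. It matters if you need to change the memory map for custom hardware. Note that for AM62A we have a "firmware builder" tool that serves the same function as the PSDK RTOS SDK.

BR,
Reese

admin, 10 months ago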

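Hi Stefan,

Thanks for all the information here; it is much appreciated and very helpful. I see the issue.

Unfortunately, yes, this is likely an incompatible combination. Please see the version_compatibility document. We started this form of backward compatibility with the 10.0 SDK and maintained compatibility for the 9.2 SDK (the same steps you found). This does not apply to the 9.0 SDK.

So this is a version compatibility problem: you are applying 10.0.0.7 firmware, which is compatible with the 9.2 SDK, on what is actually a 9.0 SDK installation.

Are you able to move your SDK to 9.2 or 10.0? Notably, the 10.1 SDK will be released within the next few weeks. Otherwise, you would need to use edgeai-tidl-tools from 09_00_XX_YY.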
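
The rule described here can be written down as a small lookup. This is only a sketch of the combinations stated in this thread (10_00_XX_YY tools on the 10.0 and 9.2 SDKs, 09_00_XX_YY tools on the 9.0 SDK); the table and function names are hypothetical, and the version_compatibility document remains the authoritative source for anything not listed.

```python
def series(version: str) -> str:
    """Reduce '10_00_07_00', '09_02' or '9.2' to a 'major.minor' string."""
    parts = version.replace("_", ".").split(".")
    return f"{int(parts[0])}.{int(parts[1])}"


# Assumption: only the pairings mentioned in this thread are listed.
SUPPORTED_SDKS = {
    "10.0": {"10.0", "9.2"},  # 10.0 tools, back-portable to the 9.2 SDK
    "9.0": {"9.0"},           # 9.0 SDK needs tools from 09_00_XX_YY
}


def is_supported(tools_tag: str, sdk_version: str) -> bool:
    """True if the thread states that this tools tag works on this SDK."""
    return series(sdk_version) in SUPPORTED_SDKS.get(series(tools_tag), set())


if __name__ == "__main__":
    print(is_supported("10_00_07_00", "9.0"))   # the failing combination here
    print(is_supported("10_00_07_00", "09_02"))
```

Combinations that are absent from the table are reported as unsupported, which in this sketch simply means "not confirmed in this thread".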

BR,
Reese

admin, 10 months ago

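Hi Reese,

I am a colleague of Stefan, and we are working on the same board (so all the information Stefan provided applies to this post as well).

Unknown said:
Otherwise, you would need to use edgeai-tidl-tools from 09_00_XX_YY

I tried your suggestion and changed the TIDL tools version in the devcontainer to 09_00_00_06:

tidl-model-compilation/edgeai-tidl-tools$ git st
HEAD detached at 09_00_00_06

We also had some problems updating the SDK on the board; Stefan tried to update to 10_00_07_00, but I think that did not succeed (see above for what Stefan tried).

However, when I try the example compilation, the Python script hangs (see the error message after Ctrl-C at the end):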

root@c9f23fa83205:/opt/edgeai-tidl-tools/examples/osrt_python/tfl# python tflrt_delegate.py -c
Running 4 Models - ['cl-tfl-mobilenet_v1_1.0_224', 'ss-tfl-deeplabv3_mnv2_ade20k_float', 'od-tfl-ssd_mobilenet_v2_300_float', 'od-tfl-ssdlite_mobiledet_dsp_320x320_coco']


Running_Model :  cl-tfl-mobilenet_v1_1.0_224

Running_Model :  ss-tfl-deeplabv3_mnv2_ade20k_float

Running_Model :
Running_Model :  od-tfl-ssdlite_mobiledet_dsp_320x320_coco
 od-tfl-ssd_mobilenet_v2_300_float
Number of OD backbone nodes = 89
Size of odBackboneNodeIds = 89
TIDL Meta PipeLine (Proto) File  : ../../../models/public/ssdlite_mobiledet_dsp_320x320_coco_20200519.prototxt
Number of OD backbone nodes = 112
Size of odBackboneNodeIds = 112

 Preliminary number of subgraphs:1 , 81 nodes delegated out of 81 nodes


 Preliminary number of subgraphs:1 , 34 nodes delegated out of 34 nodes


 Preliminary number of subgraphs:1 , 129 nodes delegated out of 129 nodes

TF Meta PipeLine (Proto) File  : ../../../models/public/ssdlite_mobiledet_dsp_320x320_coco_20200519.prototxt
num_classes : 91
y_scale : 10.000000
x_scale : 10.000000
w_scale : 5.000000
h_scale : 5.000000
num_keypoints : 5.000000
score_threshold : 0.600000
iou_threshold : 0.450000
max_detections_per_class : 200
max_total_detections : 100
      scales, height_stride, width_stride, height_offset, width_offset
   0.2000000,   -1.0000000,   -1.0000000,   -1.0000000,   -1.0000000
   0.3500000,   -1.0000000,   -1.0000000,   -1.0000000,   -1.0000000
   0.5000000,   -1.0000000,   -1.0000000,   -1.0000000,   -1.0000000
   0.6500000,   -1.0000000,   -1.0000000,   -1.0000000,   -1.0000000
   0.8000000,   -1.0000000,   -1.0000000,   -1.0000000,   -1.0000000
   0.9500000,   -1.0000000,   -1.0000000,   -1.0000000,   -1.0000000
aspect_ratios
   1.0000000
   2.0000000
   0.5000000
   3.0000000
   0.3333000

 Preliminary number of subgraphs:1 , 107 nodes delegated out of 107 nodes

Warning : Requested Output Data Convert Layer is not Added to the network, It is currently not Optimal
Warning : Requested Output Data Convert Layer is not Added to the network, It is currently not Optimal

 ************** Frame index 1 : Running float import *************

 ************** Frame index 1 : Running float import *************
****************************************************
**                ALL MODEL CHECK PASSED          **
****************************************************

INFORMATION: [TIDL_ResizeLayer] ResizeBilinear_TIDL_0 Any resize ratio which is power of 2 and greater than 4 will be placed by combination of 4x4 resize layer and 2x2 resize layer. For example a 8x8 resize will be replaced by 4x4 resize followed by 2x2 resize.
INFORMATION: [TIDL_ResizeLayer] ResizeBilinear_TIDL_1 Any resize ratio which is power of 2 and greater than 4 will be placed by combination of 4x4 resize layer and 2x2 resize layer. For example a 8x8 resize will be replaced by 4x4 resize followed by 2x2 resize.
INFORMATION: [TIDL_ResizeLayer] ResizeBilinear Any resize ratio which is power of 2 and greater than 4 will be placed by combination of 4x4 resize layer and 2x2 resize layer. For example a 8x8 resize will be replaced by 4x4 resize followed by 2x2 resize.
INFORMATION: [TIDL_ResizeLayer] decoder/ResizeBilinear Any resize ratio which is power of 2 and greater than 4 will be placed by combination of 4x4 resize layer and 2x2 resize layer. For example a 8x8 resize will be replaced by 4x4 resize followed by 2x2 resize.
INFORMATION: [TIDL_ResizeLayer] ResizeBilinear_1 Any resize ratio which is power of 2 and greater than 4 will be placed by combination of 4x4 resize layer and 2x2 resize layer. For example a 8x8 resize will be replaced by 4x4 resize followed by 2x2 resize.
****************************************************
**          5 WARNINGS          0 ERRORS          **
****************************************************
The soft limit is 2048
The soft limit is 2048
The hard limit is 2048
The hard limit is 2048
MEM: Init ... !!!
MEM: Init ... !!!
MEM: Init ... Done !!!
MEM: Init ... Done !!!
Warning : Requested Output Data Convert Layer is not Added to the network, It is currently not Optimal
Warning : Requested Output Data Convert Layer is not Added to the network, It is currently not Optimal
Warning : Requested Output Data Convert Layer is not Added to the network, It is currently not Optimal
Warning : Requested Output Data Convert Layer is not Added to the network, It is currently not Optimal
 0.0s:  VX_ZONE_INIT:Enabled
 0.3s:  VX_ZONE_ERROR:Enabled
 0.4s:  VX_ZONE_WARNING:Enabled
 0.0s:  VX_ZONE_INIT:Enabled
 0.10s:  VX_ZONE_ERROR:Enabled
 0.11s:  VX_ZONE_WARNING:Enabled
 0.2903s:  VX_ZONE_INIT:[tivxInit:185] Initialization Done !!!
 0.3187s:  VX_ZONE_INIT:[tivxInit:185] Initialization Done !!!

 ************** Frame index 1 : Running float import *************
Warning : Requested Output Data Convert Layer is not Added to the network, It is currently not Optimal
Warning : Requested Output Data Convert Layer is not Added to the network, It is currently not Optimal
Warning : Requested Output Data Convert Layer is not Added to the network, It is currently not Optimal
Warning : Requested Output Data Convert Layer is not Added to the network, It is currently not Optimal

 ************** Frame index 1 : Running float import *************
****************************************************
**                ALL MODEL CHECK PASSED          **
****************************************************

The soft limit is 2048
The hard limit is 2048
MEM: Init ... !!!
MEM: Init ... Done !!!
 0.0s:  VX_ZONE_INIT:Enabled
 0.9s:  VX_ZONE_ERROR:Enabled
 0.10s:  VX_ZONE_WARNING:Enabled
 0.1993s:  VX_ZONE_INIT:[tivxInit:185] Initialization Done !!!
****************************************************
**                ALL MODEL CHECK PASSED          **
****************************************************

The soft limit is 2048
The hard limit is 2048
MEM: Init ... !!!
MEM: Init ... Done !!!
 0.0s:  VX_ZONE_INIT:Enabled
 0.12s:  VX_ZONE_ERROR:Enabled
 0.21s:  VX_ZONE_WARNING:Enabled
 0.2402s:  VX_ZONE_INIT:[tivxInit:185] Initialization Done !!!
 
 
 after CTRL-c:
 ^CTraceback (most recent call last):
  File "/opt/edgeai-tidl-tools/examples/osrt_python/tfl/tflrt_delegate.py", line 274, in <module>
    nthreads = join_one(nthreads)
  File "/opt/edgeai-tidl-tools/examples/osrt_python/tfl/tflrt_delegate.py", line 256, in join_one
    sem.acquire()
KeyboardInterrupt

This is what gets created:

root@c9f23fa83205:/opt/edgeai-tidl-tools/models/public# l
total 117M
8.7M -rw-r--r-- 1 root root 8.7M Dec 20 10:24 deeplabv3_mnv2_ade20k_float.tflite
 17M -rw-r--r-- 1 root root  17M Dec 20 10:24 mobilenet_v1_1.0_224.tflite
 28M -rw-r--r-- 1 root root  28M Dec 20 10:24 ssdlite_mobiledet_dsp_320x320_coco_20200519.tflite
4.0K -rw-r--r-- 1 root root 2.9K Dec 20 10:24 ssdlite_mobiledet_dsp_320x320_coco_20200519.prototxt
 65M -rw-r--r-- 1 root root  65M Dec 20 10:24 ssd_mobilenet_v2_300_float.tflite
 
 (3.10.16) root@c9f23fa83205:/opt/edgeai-tidl-tools/model-artifacts/cl-tfl-mobilenet_v1_1.0_224/tempDir# l
total 20M
 12K -rw-r--r-- 1 root root 8.8K Dec 20 11:43 86_tidl_net.bin_netLog.txt
 19M -rw-r--r-- 1 root root  19M Dec 20 11:43 86_tidl_net.bin
 40K -rw-r--r-- 1 root root  37K Dec 20 11:43 86_tidl_io_1.bin
4.0K -rw-r--r-- 1 root root 1.8K Dec 20 11:43 86_tidl_net.bin.layer_info.txt
236K -rw-r--r-- 1 root root 236K Dec 20 11:43 86_tidl_net.bin.svg

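When I run it locally, without the board, it tells me that "allowedNode.txt" is missing.

What is going wrong here?

(Note: not urgent, I will be back from the Christmas holidays in mid-January)

admin, 10 months ago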

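Hi Dominic,

Okay, noted that you are now compiling with the 9.0 SDK tools. That is the right approach if upgrading the SDK is not feasible.

Dominic said:

However, when I try the example compilation, the Python script hangs (see the error message after Ctrl-C at the end)

I see you are running the default example set of models. This creates multiple forked processes, and the script can hang while waiting for one of them to return. Perhaps one of them failed; it is hard to tell from the logs.

I will note that on my side, od-tfl-ssd_mobilenet_v2_300_float hit a segfault during compilation.
- If one model does not finish and hangs, the whole process hangs, because the main thread waits for the worker threads to complete.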
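
One generic way to keep a crashed or stuck worker from blocking the parent forever is to run each job as a separate process with a timeout and inspect its return code. The sketch below illustrates that pattern with the Python standard library only; it is not how tflrt_delegate.py is implemented, and run_isolated is a hypothetical helper name.

```python
import signal
import subprocess
import sys


def run_isolated(argv, timeout_s):
    """Run one command and classify how it ended instead of waiting forever."""
    try:
        proc = subprocess.run(argv, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising on timeout.
        return "timeout"
    if proc.returncode < 0:
        # On POSIX a negative return code means the child died from a signal,
        # e.g. SIGSEGV for a segmentation fault or SIGBUS for a bus error.
        return "killed by " + signal.Signals(-proc.returncode).name
    return "ok" if proc.returncode == 0 else f"exit code {proc.returncode}"


if __name__ == "__main__":
    # Stand-in job; replace with the command that compiles a single model.
    print(run_isolated([sys.executable, "-c", "print('compiling...')"], timeout_s=600))
```

With this kind of wrapper, a model that segfaults shows up as "killed by SIGSEGV" rather than as a silent hang in the parent.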

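Are you interested in one particular model, or are you just trying out the tools?

You can run a single model by adding "-m model_config_name" to the command-line args, where model_config_name is one of the keys in examples/osrt_python/model_configs.py. One of them is "cl-tfl-mobilenet_v1_1.0_224".

The files in tempDir look correct, but these are intermediate files (plus some debug information). The important artifact files are in the directory that contains tempDir. There should be 2 binary files, a model file, and a few supporting files such as allowedNode.txt.

I suggest increasing the debug_level parameter to 1. You can change this globally in the common_utils.py file, or by adding 'debug_level':1 to the additional "optional_options" dictionary in the model_configs.py dict entry. Most likely one model is failing and causing the whole script to hang.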
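
As a rough illustration of the per-model variant, the helper below adds debug_level to the "optional_options" dictionary of one config entry. The shape of the entry (a dict keyed by model name whose value may hold an "optional_options" dict) is an assumption based on the description above, the "model_path" key and its value are placeholders, and with_debug_level is a hypothetical name.

```python
import copy


def with_debug_level(model_config: dict, level: int) -> dict:
    """Return a copy of one model config entry with optional_options['debug_level'] set."""
    updated = copy.deepcopy(model_config)
    updated.setdefault("optional_options", {})["debug_level"] = level
    return updated


if __name__ == "__main__":
    # Placeholder entry; the real entries in model_configs.py carry more fields.
    models_configs = {
        "cl-tfl-mobilenet_v1_1.0_224": {"model_path": "path/to/model.tflite"},
    }
    name = "cl-tfl-mobilenet_v1_1.0_224"
    models_configs[name] = with_debug_level(models_configs[name], 1)
    print(models_configs[name]["optional_options"])
```

Editing the dictionary entry by hand in model_configs.py achieves the same thing; the helper only makes the intended structure explicit.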

BR,
Reese

admin, 10 months ago

Hi Dominic,

Reese is out for the rest of this week and will be able to reply next week.

Regards,

Jianzhong

admin, 10 months ago

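Hi Dominic,

Thanks for your patience while I was away.

I realized that the CLI option -m that I recommended is not in this version of the tools. Sorry, I forgot that the tools for this SDK require the set of models to be defined within the script itself.

I see you are getting the (particularly opaque) "Bus error" when the script fails. That happens for all the models you tried, correct? Typically there is an easy fix when TIDL fails with a bus error. It occurs when some shared memory under /dev/shm could not be cleared and no more can be allocated, which leads to the error. Try the line below to clear the /dev/shm files that TIDL creates:

rm /dev/shm/vashm*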
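
If the cleanup needs to happen from a Python script (for example right before starting a compilation run), the same step can be sketched as follows. The directory and the vashm* pattern are taken from the command above; the function name is hypothetical, and the sketch assumes no other TIDL process is still using those files.

```python
from pathlib import Path


def clear_tidl_shm(shm_dir="/dev/shm", pattern="vashm*"):
    """Delete files matching the pattern in shm_dir and return the names removed."""
    removed = []
    for path in sorted(Path(shm_dir).glob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed


if __name__ == "__main__":
    print("removed:", clear_tidl_shm())
```

Printing the list of removed names gives a quick indication of whether stale shared-memory files were present at all.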

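I noticed that my compilation had problems in the final stages (later than your logs) for the models "od-tfl-ssdlite_mobiledet_dsp_320x320_coco" and "od-tfl-ssd_mobilenet_v2_300_float", but the other two compiled without issue and gave reasonable output. Frankly, the 9.0 SDK is the least stable release between 8.6 and the current 10.1; I would recommend upgrading if possible.

I think your container is fine. Ubuntu 22.04 is correct. SDK 9.0 had no GPU-based tools to accelerate compilation, so the GPU information/status should not matter here.

BR,
Reese

admin, 9 months ago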

Dear Reese,

I tried your suggestion; unfortunately, I get the same bus error output a bit later on.

************ in TIDL_subgraphRtCreate ************ 
 The soft limit is 2048
The hard limit is 2048
MEM: Init ... !!!
MEM: Init ... Done !!!
 0.0s:  VX_ZONE_INIT:Enabled
 0.4s:  VX_ZONE_ERROR:Enabled
 0.5s:  VX_ZONE_WARNING:Enabled
 0.1520s:  VX_ZONE_INIT:[tivxInit:185] Initialization Done !!!
************ TIDL_subgraphRtCreate done ************ 
 tidl_tfLiteRtImport_delegate.cpp Invoke 478 
*******   In TIDL_subgraphRtInvoke  ******** 
Bus error (core dumped)

Container resources should be fine (freshly started container, and it is the only container running):

CONTAINER ID   NAME            CPU %     MEM USAGE / LIMIT     MEM %     NET I/O        BLOCK I/O   PIDS
f21fbd32371b   loving_mclean   6.04%     1.688GiB / 31.19GiB   5.41%     86MB / 1.4MB   0B / 0B     106

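Unknown said:
Frankly, the 9.0 SDK is the least stable release between 8.6 and the current version (10.1); I would recommend upgrading if possible.

I think that is exactly what we will do. In the meantime, Stefan has managed to deploy a custom-trained YOLOX model on version 10.

Related: I noticed that we may meet on February 10, since I will attend the SICK workshop that you and Manuel are signed up for. Would it make sense for us to compile a list of questions/topics for you beforehand?

Regards,

Dominic

admin, 9 months ago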

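Hi Dominic,

Hmm, still hitting that bus error. I am surprised that clearing shared memory did not resolve it, especially when compiling a small model as an initial test.

I think your experience will be much better on 10.0 or newer.

I will note that a model like "od-tfl-ssdlite_mobiledet_dsp_320x320_coco" has an issue on 10.0/10.1. I confirmed the fix just last week, and the bugfix release that includes it will go live in the next week or so (10.1.0.4 is the corresponding version string). I mention this as a quick warning in case you see the "ValueError: basic_string::_M_create" error.

Dominic said:

I think that is exactly what we will do. In the meantime, Stefan has managed to deploy a custom-trained YOLOX model on version 10.

Okay, glad to hear it.

Unknown said:
Would it make sense for us to compile a list of questions/topics for you beforehand?

Yes! That would be very helpful; then we can make the content and discussion as relevant as possible. Please send me/Manuel a list of questions so we can review and prepare. Looking forward to meeting you!

BR,
Reese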
