PROCESSOR-SDK-AM62X: Kernel IRQ of 485c0100.dma-controller crash issue

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1430952/processor-sdk-am62x-kernel-irq-of-485c0100-dma-controller-crash-issue

Part Number: PROCESSOR-SDK-AM62X

Tool/software:

Hi TI experts,

While running arecord, we observed that the system hits the following kernel IRQ crash, with these error messages:

[ 5637.696595] ti-udma 485c0100.dma-controller: chan2 teardown timeout!
Recording WAVE 'stdin' : Signed 16 bit Little Endian, Rate 48000 Hz, Channels 4
[ 5660.061570] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 5660.061607] rcu:   0-....: (5457 ticks this GP) idle=7acc/1/0x4000000000000000 softirq=0/0 fqs=2574 rcuc=5455 jiffies(starved)
[ 5660.061623]           (t=5251 jiffies g=35081 q=283 ncpus=4)
[ 5660.061644] CPU: 0 PID: 901 Comm: irq/95-485c0100 Tainted: G           O       6.1.33-rt11-g685e771524 #1
[ 5660.061654] Hardware name: Texas Instruments AM625 SK (DT)
[ 5660.061661] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 5660.061669] pc : _raw_spin_unlock_irq+0x18/0x70
[ 5660.061716] lr : irq_finalize_oneshot.part.0+0x68/0x110
[ 5660.061755] sp : ffff800009b5bd90
[ 5660.061760] x29: ffff800009b5bd90 x28: 0000000000000000 x27: 0000000000000000
[ 5660.061792] x26: ffff8000080ac380 x25: ffff8000080ac5d0 x24: ffff000004397980
[ 5660.061805] x23: ffff000001344c00 x22: ffff000001344c60 x21: ffff000001344cdc
[ 5660.061816] x20: ffff000004397980 x19: ffff000001344c00 x18: 0000000000000000
[ 5660.061827] x17: 0000000000000001 x16: 0000000000000001 x15: 0000b6762e2321a2
[ 5660.061838] x14: 02da0938abd6dca0 x13: 000058fa38cd37d2 x12: 01641952b170bc26
[ 5660.061850] x11: 00000000000002ef x10: 000000000000b67e x9 : 0000000000000001
[ 5660.061861] x8 : ffff00000719cc58 x7 : ffff00000719cc68 x6 : ffffffffffffffe0
[ 5660.061871] x5 : ffff000001372680 x4 : 0000000000000000 x3 : ffffffffffffffe0
[ 5660.061883] x2 : ffff800009f00000 x1 : ffff00000719c880 x0 : 0000000100000001
[ 5660.061900] Call trace:
[ 5660.061916]  _raw_spin_unlock_irq+0x18/0x70
[ 5660.061930]  irq_forced_thread_fn+0x70/0xd0
[ 5660.061945]  irq_thread+0x178/0x23c
[ 5660.061953]  kthread+0x124/0x12c
[ 5660.061975]  ret_from_fork+0x10/0x20

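Operating environment: SK-AM62x EVM
SDK version: TI-PROCESSOR-SDK-LINUX-RT-am62xx-evm-09.00.00.03.tgz
Boot method: SD card created from tisdk-default-image-am62xx-evm.wic.xz in the SDK and used to boot the EVM

Steps to reproduce the kernel crash:

1. To simplify the test conditions, we first disabled the following services after booting from the SD card, and then rebooted.

systemctl stop irqbalanced.service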
systemctl disable irqbalanced.service
systemctl stop weston.service
systemctl disable weston.service
systemctl stop docker.service
systemctl disable docker.service
systemctl stop startwlanap.service
systemctl stop startwlansta.service
systemctl stop strongswan-starter.service
systemctl disable strongswan-starter.service
systemctl disable startwlansta.service
systemctl disable startwlanap.service
systemctl stop atd.service
systemctl disable atd.service
systemctl stop bluetooth.service
systemctl stop bt-enable.service
systemctl disable bluetooth.service
systemctl disable bt-enable.service
systemctl stop ti-apps-launcher.service
systemctl disable ti-apps-launcher.service

2. Created an audio test file, audio.sh, with the following content:

#!/bin/bash

while :
do
    arecord -f S16_LE -r 48000 -c 4 > /dev/null
done

3. Created a crash test file, test.sh, with the following content:

#!/bin/bash

modprobe -r tidss

./audio.sh &
sleep 3
memtester 256M > /dev/null &
sleep 3
memtester 256M > /dev/null &
sleep 3
memtester 256M > /dev/null &
sleep 3
memtester 256M > /dev/null &
sleep 3
memtester 256M > /dev/null &

4. Ran the following commands:

chmod 777 test.sh audio.sh
./test.sh &

5. The crash occurs after about three hours.
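
For anyone re-running the reproduction with a different number of memtester instances, the staggered launches in test.sh (step 3) could be expressed as a loop. This is only a sketch: launch_staggered is a hypothetical helper name, not part of the original scripts, and it assumes the same pattern as test.sh (sleep first, then start one background instance with its output discarded).

```bash
#!/bin/bash
# Hypothetical helper (not in the original test.sh): start COUNT background
# copies of a command, sleeping DELAY seconds before each one.
# Usage: launch_staggered COUNT DELAY COMMAND [ARGS...]
launch_staggered() {
    local count="$1" delay="$2"
    shift 2
    local i=0
    while [ "$i" -lt "$count" ]; do
        sleep "$delay"
        "$@" > /dev/null &      # same redirection as the memtester lines in test.sh
        i=$((i + 1))
    done
}

# Equivalent of the original test.sh body (left commented out here):
#   modprobe -r tidss
#   ./audio.sh &
#   launch_staggered 5 3 memtester 256M
```

The count and delay are arguments so that the load can be varied without editing the script; the values 5 and 3 are the ones used in the original test.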

Full log:

e2e.ti.com/.../am62x_5F00_evm_5F00_kernel_5F00_irq_5F00_crash.log

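Running five memtester instances is intended to speed up reproduction of the issue, and removing the tidss driver is intended to avoid the following issues, which could otherwise complicate the analysis:

AM625: Issue about tidss rcu_preempt self-detected stall on CPU - Processors forum - Processors - TI E2E support forums

AM625: rcu_preempt self-detected stall on CPU error at runtime - Processors forum - Processors - TI E2E support forums

We noticed that a recent patch appears to address the tidss IRQ flooding issue mentioned above: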

https://lore.kernel.org/lkml/20241021-tidss-irq-fix-v1-4-82ddaec94e4a@ideasonboard.com/T/#mfadbc7283ea4db24ee390b2322c39df34faba7b5

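However, we are not using any tidss-related functions and the tidss driver is not loaded, yet we still run into a similar issue.

In our tests, this IRQ crash does not occur if arecord is not running.

In addition, when the crash occurs, if you manage to run cat /proc/interrupts successfully, you can see that the interrupt count of the IRQ related to the crash has increased significantly.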
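
One possible way to quantify that increase is to sample /proc/interrupts twice and compare the totals. The sketch below is an illustration, not something from the original report: irq_total and irq_rate are hypothetical helper names, and it assumes the usual /proc/interrupts layout (a header row of CPUn columns, then one row per IRQ with the per-CPU counts following the "NN:" label). The pattern is an awk regular expression matched against the whole row, so the sum covers every row that mentions the device.

```bash
#!/bin/bash
# Hypothetical helpers for watching an interrupt count grow.

# Sum the per-CPU counts of every row that matches PATTERN.
# Usage: irq_total PATTERN [FILE]      (FILE defaults to /proc/interrupts)
irq_total() {
    local pattern="$1" file="${2:-/proc/interrupts}"
    awk -v pat="$pattern" '
        NR == 1 { ncpu = NF; next }          # header row: one "CPUn" column per CPU
        $0 ~ pat {
            for (i = 2; i <= ncpu + 1; i++)  # field 1 is the "NN:" label
                sum += $i
        }
        END { print sum + 0 }                # prints 0 when nothing matched
    ' "$file"
}

# Print how much the matching count grew over INTERVAL seconds.
# Usage: irq_rate PATTERN [INTERVAL] [FILE]
irq_rate() {
    local pattern="$1" interval="${2:-1}" file="${3:-/proc/interrupts}"
    local before after
    before=$(irq_total "$pattern" "$file")
    sleep "$interval"
    after=$(irq_total "$pattern" "$file")
    echo $((after - before))
}
```

For example, irq_rate '485c0100.dma-controller' 5 would print how many interrupts from that controller were counted over five seconds. Logging this periodically during the three-hour run might show whether the count starts climbing before the stall is reported.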

Could you help check what is causing this crash? Thank you.