
PROCESSOR-SDK-AM62X: Testing the rpmsg_char_simple and rpmsg_char_zerocopy examples

Part Number: PROCESSOR-SDK-AM62X

I would like to know which of these two inter-core communication examples transfers data more efficiently. My test results are shown below; the rpmsg_char_simple results are similar to those given in the official documentation.

Is it correct to conclude that rpmsg_char_zerocopy is slower than rpmsg_char_simple?

Why is the rpmsg_char_zerocopy approach a bit slower?

  • My test results are shown below; the rpmsg_char_simple results are similar to those given in the official documentation.

    How was the test performed? And which official document are you referring to?

  • In the Linux test program, I measure the time interval between the entry and exit of the send and receive functions (roughly in the style of the sketch after this reply).

    The RPMsg documentation I referenced is this:

    dev.ti.com/.../node
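
    A minimal sketch of that style of measurement, assuming a POSIX environment and an already-created rpmsg endpoint character device; the /dev/rpmsg0 path, message size, and overall structure are illustrative assumptions, not the original test program:

    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define MSG_SIZE 496  /* max RPMsg payload per message */

    static double elapsed_s(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        char buf[MSG_SIZE] = "hello from Linux";
        struct timespec t0, t1, t2;
        int fd = open("/dev/rpmsg0", O_RDWR);  /* assumed endpoint device */

        if (fd < 0) {
            perror("open");
            return 1;
        }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (write(fd, buf, MSG_SIZE) != MSG_SIZE)  /* "send" entry..exit */
            perror("write");
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (read(fd, buf, MSG_SIZE) < 0)           /* "receive" entry..exit */
            perror("read");
        clock_gettime(CLOCK_MONOTONIC, &t2);

        printf("send: %.9f s, receive: %.9f s\n",
               elapsed_s(t0, t1), elapsed_s(t1, t2));
        close(fd);
        return 0;
    }

    Note that, as the reply below explains, intervals captured this way include more than just the RPMsg transfer itself.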

  • Thank you for your interest in TI products! Since this question is fairly complex, it has been posted on the English E2E technical forum, where a senior forum engineer will help you. You can also follow the progress via the link below:
    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1330423/am625-rpmsg_char_zerocopy-performance

  • Please use the method the engineer suggests below to measure how long each of the two demos takes to transfer data.

    The engineer has given a detailed explanation.

    The output numbers the customer provided do not match the code snippet that you attached. I am not sure how they generated their numbers, so I cannot tell you exactly what is going on.

    Does the code snippet from rpmsg_char_zerocopy actually measure latency? No. 

    t measures the time it takes to execute the function send_msg. It does NOT measure the time between when Linux sends the RPMsg and when the remote core receives and processes it.

    t2 measures the time it takes for (the remote core to receive the RPMsg, process the RPMsg, execute all the other code it wants to execute, rewrite the shared memory, send an RPMsg to Linux, have Linux receive and process the RPMsg) minus (the time it takes to go through all the print statements on the Linux side).

    Neither t, nor t2, actually measures RPMsg latency.

    Ok, so what could the customer do if they DO want to benchmark the zerocopy example? 

    The whole point of the zerocopy example is to show how to move large amounts of data between cores. This is NOT an example of how to minimize latency. It is an example of how to maximize THROUGHPUT.

    If I were a customer benchmarking the zerocopy example, I would compare code like this:

    rpmsg_char_simple:

    // Define 1MB of data to send between Linux & the remote core.
    // It is probably easiest to just run the RPMsg_echo example:
    // 1,048,576 bytes / 496 bytes = 2,115 messages,
    // i.e., 4,230 total messages get sent, 2,115 in each direction.
    // The ping-pong pattern avoids potential issues like overflow, which could
    // occur if we sent 2,115 messages back-to-back from Linux to the remote
    // core, or vice versa.
    start_time
    send & receive messages
    end_time
    total_time = end_time - start_time
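
    A hedged C sketch of that ping-pong benchmark, assuming the rpmsg endpoint character device has already been created and the remote core is running the echo firmware; /dev/rpmsg0 and the constants are assumptions, not the SDK example itself:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define MSG_SIZE   496               /* max RPMsg payload per message */
    #define TOTAL_SIZE (1024 * 1024)     /* 1MB of data in each direction */

    int main(void)
    {
        char buf[MSG_SIZE];
        struct timespec t0, t1;
        int iterations = (TOTAL_SIZE + MSG_SIZE - 1) / MSG_SIZE;  /* 2,115 */
        int fd = open("/dev/rpmsg0", O_RDWR);  /* assumed endpoint device */

        if (fd < 0) {
            perror("open");
            return 1;
        }

        memset(buf, 0xA5, sizeof(buf));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iterations; i++) {
            /* one message out, one echo back: the ping-pong keeps the
             * vrings from overflowing */
            if (write(fd, buf, MSG_SIZE) != MSG_SIZE) { perror("write"); break; }
            if (read(fd, buf, MSG_SIZE) < 0)          { perror("read");  break; }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double total_time = (t1.tv_sec - t0.tv_sec) +
                            (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d messages each way in %.6f s\n", iterations, total_time);

        close(fd);
        return 0;
    }

    Dividing the 2 x 1MB moved by total_time gives the effective round-trip throughput of the message-based path.
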
    Then I would run the zerocopy example like this:

    // define a 1MB region of shared memory
    start_time
    write 1MB of data to the shared memory region
    send RPMsg
    wait for RPMsg reply
    // while we are waiting:
    // remote core receives the RPMsg, reads in the 1MB of data, writes 1MB of
    // data, then sends an RPMsg back to Linux
    read 1MB of data from the shared memory region
    end_time
    total_time = end_time - start_time
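
    And a matching hedged skeleton for the zerocopy side. The dma-buf allocation, sharing of the buffer with the remote core, and the cache-maintenance (DMA_BUF_IOCTL_SYNC) ioctls are omitted here and assumed to be set up as in the SDK's rpmsg_char_zerocopy example; the fd, shm, src, dst parameters and the helper name are illustrative assumptions:

    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define SHM_SIZE (1024 * 1024)  /* 1MB shared region */

    /* fd: rpmsg endpoint used only as a doorbell; shm: mmap'ed view of the
     * shared buffer. In the real example the buffer is a dma-buf, and the
     * memcpy calls must be bracketed with DMA_BUF_IOCTL_SYNC for cache
     * maintenance; that is omitted in this sketch. */
    double time_zerocopy_round(int fd, unsigned char *shm,
                               const unsigned char *src, unsigned char *dst)
    {
        struct timespec t0, t1;
        char kick = 1;

        clock_gettime(CLOCK_MONOTONIC, &t0);

        memcpy(shm, src, SHM_SIZE);            /* write 1MB to shared memory */
        (void)write(fd, &kick, sizeof(kick));  /* send RPMsg: data is ready  */
        (void)read(fd, &kick, sizeof(kick));   /* wait for the remote core   */
        memcpy(dst, shm, SHM_SIZE);            /* read back the rewritten 1MB */

        clock_gettime(CLOCK_MONOTONIC, &t1);

        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    Running both sketches over the same 1MB payload gives the apples-to-apples throughput comparison described above.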

    Why else would I want to use the zerocopy example? 

    One use case I have seen: when customers try to send data 496 bytes at a time through RPMsg in a single direction, they can reach an overflow situation where the sending core produces data faster than the receiving core can consume it. Eventually they start losing data.

    Instead of interrupting the receiving core for every 496 bytes of data, a shared memory example allows the sending core to send fewer interrupts to transmit the same amount of data (for the 1MB example above, roughly 2,115 interrupts versus one). That can help the receiving core run more efficiently: it is interrupted less, so it has to context switch less.

    What if I am trying to minimize latency between Linux and a remote core? 

    First of all, make sure that you ACTUALLY want Linux to be in a critical control path where low, deterministic latency is required. Even RT Linux is NOT a true real-time OS, so there is always the risk that Linux will miss timing eventually. Refer here for more details: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1085663/faq-sitara-multicore-system-design-how-to-ensure-computations-occur-within-a-set-cycle-time 

    Additionally, Linux RPMsg is NOT currently designed to be deterministic. The average latency may be on the order of tens of microseconds to 100 microseconds, but in Linux kernel 6.1 and earlier the worst-case latency CAN occasionally spike to 1 ms or more.

    RPMsg is the TI-supported IPC between Linux and a remote core. If you do not actually need to send 496 bytes of data with each message, you could implement your own IPC, like something that just used mailboxes. Keep in mind that TI does NOT provide support for mailbox communication between Linux userspace and remote cores. If the customer decides to develop their own mailbox IPC, we will NOT be able to support that development.

  • Thank you very much, this answer resolved my question.

  • You're welcome! Happy to help.