TDA4VM: UDMA concurrency issue

Part Number: TDA4VM


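Dear TI experts,

I am using UDMA to optimize a DSP program on the TDA4 and have run into a problem with the number of DMA transfers that can run concurrently.

The TDA4 has two C66 DSPs. I have both DSPs run their own tasks at the same time, and both tasks use UDMA to speed up the DSP code.

For example, I run the same DSP program on each core; after optimizing it with UDMA, it takes 5 ms to execute.

If each DSP uses 2 independent UDMA channels, for example DSP1 uses ch8 and ch9 and DSP2 uses ch12 and ch13, each of the two tasks takes 5 ms, so the programs are clearly running in parallel.

However, if each DSP uses 4 independent UDMA channels, for example DSP1 uses ch8-ch11 and DSP2 uses ch12-ch15, each of the two tasks takes 10 ms, which clearly suggests that they are being serialized.

In fact, from the related documentation I understand that UDMA has 16 channels in total, divided into block copy and DRU, corresponding to ch0-ch7 and ch8-ch15 respectively. Up to 8 DMA transfers can be issued at the same time within each group.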
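
The channel split described above can be written down as a small helper. This is only a sketch of my reading of the documentation: the names `udma_group_of` and `channels_in_group` are hypothetical, and the ch0-ch7 / ch8-ch15 boundary is the assumption stated above, not something checked against the TRM. It shows that in both experiments every channel used by the two DSPs falls in the same ch8-ch15 range.

```c
#include <assert.h>

/* Channel split as described in this post. This is an assumption taken
 * from my reading of the documentation, not a verified hardware fact. */
enum {
    UDMA_GROUP_BLOCK_COPY = 0,  /* ch0-ch7  */
    UDMA_GROUP_DRU        = 1,  /* ch8-ch15 */
    UDMA_NUM_CHANNELS     = 16
};

/* Hypothetical helper: maps a channel number to its group. */
static int udma_group_of(int ch)
{
    assert(ch >= 0 && ch < UDMA_NUM_CHANNELS);
    return (ch < 8) ? UDMA_GROUP_BLOCK_COPY : UDMA_GROUP_DRU;
}

/* Hypothetical helper: counts how many of the n channels in chs
 * belong to the given group. */
static int channels_in_group(const int *chs, int n, int group)
{
    int i;
    int count = 0;
    for (i = 0; i < n; i++) {
        if (udma_group_of(chs[i]) == group) {
            count++;
        }
    }
    return count;
}
```

For the 2-channel experiment (ch8, ch9, ch12, ch13) `channels_in_group` gives 4 for the DRU range, and for the 4-channel experiment (ch8-ch15) it gives 8, which is the per-group limit quoted above.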

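My questions are:

1) What is the difference between block copy and DRU? How do they differ in application?

2) What does "group" mean for UDMA?

3) Based on the above, the transfers are clearly already serialized when 8 copies run at the same time. This does not match TI's documentation. How should I understand this?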
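
To make the 5 ms versus 10 ms observation concrete, here is a toy timing model. It is a sketch under loud assumptions: the function names, the block count, and the per-block times are all made up, and the idea that two cores evenly share one transfer resource is a hypothesis, not a statement about how the TDA4VM hardware behaves.

```c
#include <assert.h>

/* Toy model only. Nothing here is measured or taken from TI documentation. */

/* In a double-buffered pipeline, compute and DMA overlap, so the slower
 * of the two sets the time per block. */
static double block_time_ms(double compute_ms, double dma_ms)
{
    return (compute_ms > dma_ms) ? compute_ms : dma_ms;
}

/* Wall time for `blocks` blocks when `sharers` cores split one transfer
 * resource evenly, so each core sees its DMA time multiplied by `sharers`. */
static double wall_time_ms(int blocks, double compute_ms, double dma_ms,
                           int sharers)
{
    assert(blocks >= 0 && sharers >= 1);
    return blocks * block_time_ms(compute_ms, dma_ms * sharers);
}
```

With the made-up values `wall_time_ms(100, 0.02, 0.05, 1)` the model gives about 5 ms, and with `wall_time_ms(100, 0.02, 0.05, 2)` it gives about 10 ms. That has the same shape as the measurements, but it does not show that this is what the hardware actually does.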

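  • Hello, we have received your question and escalated it to the English forum for help. The link is below, and we will reply as soon as there is an answer: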

    e2e.ti.com/.../tda4vm-parallel-issue-of-udm

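  • Hello,

    From the related documentation I understand that UDMA has 16 channels in total, divided into block copy and DRU, corresponding to ch0-ch7 and ch8-ch15 respectively. Up to 8 DMA transfers can be issued at the same time within each group.

    This actually depends on how the DMA is programmed and triggered. Could you provide more information about your setup and how the channels are triggered?

    1) What is the difference between block copy and DRU? How do they differ in application?

    Block copy uses a pair of channels, TX and RX, and transfers data from the source on TX to the destination on RX. DRU is another DMA engine in the SoC.

    2) What does "group" mean for UDMA?

    I do not quite understand this question. Could you elaborate?

    3) Based on the above, the transfers are clearly already serialized when 8 copies run at the same time. This does not match TI's documentation. How should I understand this?

    We need to check your setup before we can look into this issue.

    For details, please see the reply on the English forum.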

  • Hello Cherry:

    I have registered a corporate email address, but the English forum still says that I am missing one, so I cannot reply there.

    I am therefore replying here on the Chinese forum:

    First of all, thank you for your reply.

    Let me explain some details:

    1. About "What is the definition of a UDMA group?"

    I do not know exactly what "group" means, but I found the term in the vision_apps code:

    ti-processor-sdk-rtos-j721e-evm-08_04_00_06/vision_apps/kernels/img_proc/c66/vx_dma_transfers.h, line 71:

    #define DMA_MAX_NUM_TRANSFERS_PER_GROUP (8)


    2. About the UDMA parallelism issue

    I looked into the problem further and found that it occurs not only with 8 channels, but also with 4 channels.

    The code is attached below; please take a look.
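
    The DMA variants in the code below (for example normalize_image_uc1_f1_dma) follow a ping-pong schedule: fetch one block, then process the current block while the next one is being fetched. The following host-side sketch shows only that schedule. It is an illustration under assumptions: `memcpy` stands in for a transfer that the real code starts with dma_cpy_start() and completes with dma_cpy_wait(), and the names `process_block`, `normalize_pingpong`, `BLOCK` and `NBLOCKS` are hypothetical.

    ```c
    #include <assert.h>
    #include <string.h>

    /* Hypothetical sizes for the sketch. */
    enum { BLOCK = 8, NBLOCKS = 4 };

    /* Per-block work: the same (x - mean) * scale as the scalar tail loop. */
    static void process_block(const unsigned char *in, float *out,
                              float mean, float scale)
    {
        int i;
        for (i = 0; i < BLOCK; i++) {
            out[i] = ((float)in[i] - mean) * scale;
        }
    }

    /* Normalizes NBLOCKS * BLOCK bytes from src into dst through two staging
     * buffers. memcpy is synchronous, so nothing overlaps here; with a real
     * DMA engine the copies would run while process_block() executes. */
    static void normalize_pingpong(const unsigned char *src, float *dst,
                                   float mean, float scale)
    {
        unsigned char in_buf[2][BLOCK];
        float         out_buf[2][BLOCK];
        int           blk;
        int           cur = 0;

        assert(src != NULL && dst != NULL);

        memcpy(in_buf[cur], src, BLOCK);              /* prologue: fetch block 0 */
        for (blk = 0; blk < NBLOCKS; blk++) {
            if (blk + 1 < NBLOCKS) {                  /* "start" fetch of next block */
                memcpy(in_buf[1 - cur], src + (blk + 1) * BLOCK, BLOCK);
            }
            process_block(in_buf[cur], out_buf[cur], mean, scale);
            memcpy(dst + blk * BLOCK, out_buf[cur],   /* write the result back */
                   sizeof(out_buf[cur]));
            cur = 1 - cur;                            /* swap ping and pong */
        }
    }
    ```

    The real functions split the write-back into its own start/wait pair so that it also overlaps with processing; the sketch keeps it inline to stay short.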

    #include "../cv_kernel_implement.h"
    #include <TI/TI_platforms.h>
    #include <stdint.h>   /* uint32_t, int64_t used by the intrinsic kernels */
    #include <stdio.h>
    #include <string.h>
    
    //#define DEBUG_OUTPUT
    
    #ifdef DEBUG_OUTPUT
    static unsigned char validate_buf0[1920 * 5] = { 0 };
    static unsigned char validate_buf1[1920 * 5] = { 0 };
    static char          log_buf[8192]           = { 0 };
    #endif
    
    static void normalize_image_row_proc_4x(unsigned char const* restrict src,
                                            float const* restrict         mean,
                                            float const* restrict         scale,
                                            float* restrict               dst,
                                            int                           step){
      int         i  = 0;
      uint32_t    in = 0;
      __float2_t  mean_f2;
      int64_t     in_0123_us4;
      __float2_t  in01_f2, in23_f2;
      __float2_t  res01, res23;
    
      _nassert((((unsigned int)src)   & 0x7) == 0) ;
      _nassert((((unsigned int)mean)  & 0x7) == 0) ;
      _nassert((((unsigned int)scale) & 0x7) == 0) ;
      _nassert((((unsigned int)dst)   & 0x7) == 0) ;
    
      for(i = 0 ; i < step ; i ++){
        in          = _amem4_const(   src   + 4 * i);
        mean_f2     = _amem8_f2_const(mean  + 4 * i);
        in_0123_us4 = _unpkbu4(in);
        in01_f2     = _dinthspu(_loll(in_0123_us4));
        res01       = _dsubsp(in01_f2, mean_f2);
        mean_f2     = _amem8_f2_const(mean  + 4 * i + 2);
        in23_f2     = _dinthspu(_hill(in_0123_us4));
        res23       = _dsubsp(in23_f2, mean_f2);
        __float2_t scale_f4_lo  = _amem8_f2_const(scale + 4 * i);
        __float2_t scale_f4_hi  = _amem8_f2_const(scale + 4 * i + 2);
        __x128_t   m_f4_0       = _f2to128(res23,       res01);
        __x128_t   m_f4_1       = _f2to128(scale_f4_hi, scale_f4_lo);
        __x128_t   res          = _qmpysp(m_f4_0, m_f4_1);
        _amem8_f2(dst + 4 * i)     = _lof2_128(res);
        _amem8_f2(dst + 4 * i + 2) = _hif2_128(res);
      }
    }
    
    static void normalize_image_row_proc_tail(unsigned char const* src,
                                              float const*         mean,
                                              float const*         scale,
                                              float*               dst,
                                              int                  step){
      int   i  = 0;
      for(i = 0 ; i < step ; i ++){
        dst[i] = (((float)src[i]) - mean[i]) * scale[i];
      }
    }
    
    static void normalize_image_row_proc(unsigned char const* src,
                                         float const*         mean,
                                         float const*         scale,
                                         float*               dst,
                                         int                  aligned_step,
                                         int                  tail_step){
      
      normalize_image_row_proc_4x(src, mean, scale, dst, aligned_step >> 2);
      if(tail_step > 0){
        normalize_image_row_proc_tail(src   + aligned_step, 
                                      mean  + aligned_step, 
                                      scale + aligned_step, 
                                      dst   + aligned_step, 
                                      tail_step);
      }
    }
    
    static void normalize_image_sram_row_proc(unsigned char const* src,
                                              float const*         mean,
                                              float const*         scale,
                                              float*               dst,
                                              int                  aligned_step,
                                              int                  row,
                                              int                  src_line_ele,
                                              int                  dst_line_ele){
      unsigned char const*  src_ptr = src;
      float*                dst_ptr = dst;
      int                   i       = 0;
      for(i = 0 ; i < row ; i ++){
        normalize_image_row_proc_4x(src_ptr, mean, scale, dst_ptr, aligned_step >> 2);
        src_ptr += src_line_ele;
        dst_ptr += dst_line_ele;
      }
    }
    
    static void normalize_image_uc1_f1_implement(unsigned char const*    src,
                                                 unsigned int            width,
                                                 unsigned int            height,
                                                 unsigned int            src_line_bytes,
                                                 float*                  dst,
                                                 unsigned int            dst_line_size,
                                                 float const*            mean_row,
                                                 float const*            scale_row){
      int i = 0 ;
      unsigned char const*  src_ptr      = src;
      float*                dst_ptr      = dst;
      int                   aligned_step = (width >> 2) << 2;
      int                   tail_step    = width - aligned_step;
      for(i = 0 ; i < height ; i ++){
        normalize_image_row_proc(src_ptr, 
                                 mean_row, 
                                 scale_row, 
                                 dst_ptr, 
                                 aligned_step, 
                                 tail_step);
        src_ptr += src_line_bytes;
        dst_ptr += dst_line_size;
      }
    }
    
    static void normalize_image_uc3_f3_implement(unsigned char const*    src,
                                                 unsigned int            width,
                                                 unsigned int            height,
                                                 unsigned int            src_line_bytes,
                                                 float*                  dst,
                                                 unsigned int            dst_line_size,
                                                 float const*            mean_row,
                                                 float const*            scale_row){
      int i = 0 ;
      unsigned char const*  src_ptr      = src;
      float*                dst_ptr      = dst;
      int                   aligned_step = ((3 * width) >> 2) << 2;
      int                   tail_step    = 3 * width - aligned_step;
      for(i = 0 ; i < height ; i ++){
        normalize_image_row_proc(src_ptr, 
                                 mean_row, 
                                 scale_row, 
                                 dst_ptr, 
                                 aligned_step, 
                                 tail_step);
        src_ptr += src_line_bytes;
        dst_ptr += dst_line_size;
      }
    }
    
    static void normalize_image_3planes_row_proc_4x(unsigned char const* restrict src,
                                                    float const* restrict         mean,
                                                    float const* restrict         scale,
                                                    float* restrict               dstB,
                                                    float* restrict               dstG,
                                                    float* restrict               dstR,
                                                    int                           step){
      int         i  = 0;
      uint32_t    in = 0;
      __float2_t  mean_f2;
      int64_t     in_0123_us4;
      __float2_t  in01_f2, in23_f2;
      __float2_t  res01, res23;
    
      _nassert((((unsigned int)src)  & 0x7) == 0) ;
      _nassert((((unsigned int)dstB) & 0x7) == 0) ;
      _nassert((((unsigned int)dstG) & 0x7) == 0) ;
      _nassert((((unsigned int)dstR) & 0x7) == 0) ;
    
      for(i = 0 ; i < step ; i ++){
        in          = _amem4_const(   src   + 12 * i);
        mean_f2     = _amem8_f2_const(mean);
        in_0123_us4 = _unpkbu4(in);
        in01_f2     = _dinthspu(_loll(in_0123_us4));
        res01       = _dsubsp(in01_f2, mean_f2);
        mean_f2     = _amem8_f2_const(mean  + 2);
        in23_f2     = _dinthspu(_hill(in_0123_us4));
        res23       = _dsubsp(in23_f2, mean_f2);
        __float2_t scale_f4_lo   = _amem8_f2_const(scale);
        __float2_t scale_f4_hi   = _amem8_f2_const(scale + 2);
        __x128_t   m_f4_0        = _f2to128(res23,       res01);
        __x128_t   m_f4_1        = _f2to128(scale_f4_hi, scale_f4_lo);
        __x128_t   res0          = _qmpysp(m_f4_0, m_f4_1);
        __float2_t b01           = _ftof2(_get32f_128(res0, 3), _get32f_128(res0, 0));
        _amem8_f2(dstB + 4 * i)  = b01;
    
        in            = _amem4_const(   src   + 12 * i + 4);
        mean_f2       = _amem8_f2_const(mean + 4);
        in_0123_us4   = _unpkbu4(in);
        in01_f2       = _dinthspu(_loll(in_0123_us4));
        res01         = _dsubsp(in01_f2, mean_f2);
        mean_f2       = _amem8_f2_const(mean  + 6);
        in23_f2       = _dinthspu(_hill(in_0123_us4));
        res23         = _dsubsp(in23_f2, mean_f2);
        scale_f4_lo   = _amem8_f2_const(scale + 4);
        scale_f4_hi   = _amem8_f2_const(scale + 6);
        m_f4_0        = _f2to128(res23,       res01);
        m_f4_1        = _f2to128(scale_f4_hi, scale_f4_lo);
        __x128_t   res1 = _qmpysp(m_f4_0, m_f4_1);
        __float2_t g01  = _ftof2(_get32f_128(res1, 0), _get32f_128(res0, 1));
        __float2_t r01  = _ftof2(_get32f_128(res1, 1), _get32f_128(res0, 2));
        _amem8_f2(dstG + 4 * i) = g01;
        _amem8_f2(dstR + 4 * i) = r01;
    
        in            = _amem4_const(   src   + 12 * i + 8);
        mean_f2       = _amem8_f2_const(mean + 8);
        in_0123_us4   = _unpkbu4(in);
        in01_f2       = _dinthspu(_loll(in_0123_us4));
        res01         = _dsubsp(in01_f2, mean_f2);
        mean_f2       = _amem8_f2_const(mean  + 10);
        in23_f2       = _dinthspu(_hill(in_0123_us4));
        res23         = _dsubsp(in23_f2, mean_f2);
        scale_f4_lo   = _amem8_f2_const(scale + 8);
        scale_f4_hi   = _amem8_f2_const(scale + 10);
        m_f4_0        = _f2to128(res23,       res01);
        m_f4_1        = _f2to128(scale_f4_hi, scale_f4_lo);
        __x128_t res2 = _qmpysp(m_f4_0, m_f4_1);
        __float2_t b23 = _ftof2(_get32f_128(res2, 1), _get32f_128(res1, 2));
        __float2_t g23 = _ftof2(_get32f_128(res2, 2), _get32f_128(res1, 3));
        __float2_t r23 = _ftof2(_get32f_128(res2, 3), _get32f_128(res2, 0));
        _amem8_f2(dstB + 4 * i + 2) = b23;
        _amem8_f2(dstG + 4 * i + 2) = g23;
        _amem8_f2(dstR + 4 * i + 2) = r23;
      }
    }
    
    static void normalize_image_3planes_row_proc_tail(unsigned char const* restrict src,
                                                      float const* restrict         mean,
                                                      float const* restrict         scale,
                                                      float* restrict               dstB,
                                                      float* restrict               dstG,
                                                      float* restrict               dstR,
                                                      int                           step){
      int i  = 0;
      for(i = 0 ; i < step ; i ++){
        dstB[i] = (((float)src[3 * i])     - mean[0]) * scale[0];
        dstG[i] = (((float)src[3 * i + 1]) - mean[1]) * scale[1];
        dstR[i] = (((float)src[3 * i + 2]) - mean[2]) * scale[2];
      }
    }
    
    static void normalize_image_uc3_planes_row_proc(unsigned char const* src,
                                                    float const*         mean,
                                                    float const*         scale,
                                                    float*               dstB,
                                                    float*               dstG,
                                                    float*               dstR,
                                                    int                  aligned_step,
                                                    int                  tail_step){
      normalize_image_3planes_row_proc_4x(src, 
                                          mean, 
                                          scale, 
                                          dstB, 
                                          dstG, 
                                          dstR, 
                                          aligned_step >> 2);
      if(tail_step > 0){
        normalize_image_3planes_row_proc_tail(src   + 3 * aligned_step, 
                                              mean, 
                                              scale, 
                                              dstB  + aligned_step, 
                                              dstG  + aligned_step, 
                                              dstR  + aligned_step, 
                                              tail_step);
      }
    }
    
    static void normalize_image_sram_3planes_row_proc(unsigned char const* src,
                                                      float const*         mean,
                                                      float const*         scale,
                                                      float**              dst,
                                                      int                  aligned_step,
                                                      int                  row,
                                                      int                  src_line_ele,
                                                      int                  dst_line_ele){
      unsigned char const*  src_ptr  = src;
      float*                dstB_ptr = dst[0];
      float*                dstG_ptr = dst[1];
      float*                dstR_ptr = dst[2];
      int                   i       = 0;
      for(i = 0 ; i < row ; i ++){
        normalize_image_3planes_row_proc_4x(src_ptr, 
                                            mean, 
                                            scale, 
                                            dstB_ptr, 
                                            dstG_ptr,
                                            dstR_ptr,
                                            aligned_step >> 2);
        src_ptr  += src_line_ele;
        dstB_ptr += dst_line_ele;
        dstG_ptr += dst_line_ele;
        dstR_ptr += dst_line_ele;
      }
    }
    
    static void normalize_image_uc3_f3_planes_implement(unsigned char const*    src,
                                                        unsigned int            width,
                                                        unsigned int            height,
                                                        unsigned int            src_line_bytes,
                                                        float*                  dstB,
                                                        float*                  dstG,
                                                        float*                  dstR,
                                                        unsigned int            dst_line_size,
                                                        float const*            mean_row,
                                                        float const*            scale_row){
      int i = 0 ;
      unsigned char const*  src_ptr      = src;
      float*                dstB_ptr     = dstB;
      float*                dstG_ptr     = dstG;
      float*                dstR_ptr     = dstR;
      int                   aligned_step = (width >> 2) << 2;
      int                   tail_step    = width - aligned_step;
      for(i = 0 ; i < height ; i ++){
        normalize_image_uc3_planes_row_proc(src_ptr, 
                                            mean_row, 
                                            scale_row, 
                                            dstB_ptr, 
                                            dstG_ptr,
                                            dstR_ptr,
                                            aligned_step, 
                                            tail_step);
        src_ptr  += src_line_bytes;
        dstB_ptr += dst_line_size;
        dstG_ptr += dst_line_size;
        dstR_ptr += dst_line_size;
      }
    }
    
    void normalize_image_uc1_f1(unsigned char const*         src,
                                unsigned int                 width,
                                unsigned int                 height,
                                unsigned int                 src_line_bytes,
                                float*                       dst,
                                unsigned int                 dst_line_bytes,
                                SEVX_NORMALIZE_IMAGE_PARAM*  norm_param){
      normalize_image_uc1_f1_implement(src, 
                                       width, 
                                       height, 
                                       src_line_bytes, 
                                       dst, 
                                       dst_line_bytes >> 2, 
                                       norm_param->mean_row_buf,
                                       norm_param->scale_row_buf);
    }
    
    void normalize_image_uc3_f3(unsigned char const*         src,
                                unsigned int                 width,
                                unsigned int                 height,
                                unsigned int                 src_line_bytes,
                                float*                       dst,
                                unsigned int                 dst_line_bytes,
                                SEVX_NORMALIZE_IMAGE_PARAM*  norm_param){
      normalize_image_uc3_f3_implement(src, 
                                       width,
                                       height,
                                       src_line_bytes,
                                       dst,
                                       dst_line_bytes >> 2,
                                       norm_param->mean_row_buf,
                                       norm_param->scale_row_buf);
    }
    
    void normalize_image_uc3_f3_planes(unsigned char const*         src,
                                       unsigned int                 width,
                                       unsigned int                 height,
                                       unsigned int                 src_line_bytes,
                                       float*                       dstB,
                                       float*                       dstG,
                                       float*                       dstR,
                                       unsigned int                 dst_line_bytes,
                                       SEVX_NORMALIZE_IMAGE_PARAM*  norm_param){
      normalize_image_uc3_f3_planes_implement(src, 
                                              width, 
                                              height, 
                                              src_line_bytes, 
                                              dstB, 
                                              dstG,
                                              dstR,
                                              dst_line_bytes >> 2, 
                                              norm_param->mean_row_buf,
                                              norm_param->scale_row_buf);
    }
    
    void normalize_image_uc1_f1_dma(unsigned char const*         src,
                                    unsigned int                 width,
                                    unsigned int                 height,
                                    unsigned int                 src_line_bytes,
                                    float*                       dst,
                                    unsigned int                 dst_line_bytes,
                                    SEVX_NORMALIZE_IMAGE_PARAM*  norm_param){
      int i = 0 ;
      unsigned char const*  src_ptr[2]   = { 
                                             norm_param->pingpong_in_a,
                                             norm_param->pingpong_in_b
                                           };
      float*                dst_ptr[2]   = { 
                                             norm_param->pingpong_out_a,
                                             norm_param->pingpong_out_b
                                           };
      int   aligned_step = (width >> 2) << 2;
      int   y_size       = norm_param->y_size;
      int   buf_idx      = 0;
      int   sram_src_line_ele = norm_param->pingpong_in_size / y_size;
      int   sram_dst_line_ele = (norm_param->pingpong_out_size / y_size) / sizeof(float);
      DMA_CPY_TRIGGER dma_cpy_start  = norm_param->dma_cpy_start;
      DMA_CPY_WAIT    dma_cpy_wait   = norm_param->dma_cpy_wait;
      void*           dma_handle_in  = norm_param->dma_handle[0];
      void*           dma_handle_out = norm_param->dma_handle[1];
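      /* From here on, height counts blocks of y_size rows; remainder rows are not processed. */
      height = height / y_size;
      /* Prologue: bring in the first input block and wait for it. */
      dma_cpy_start(dma_handle_in);
      dma_cpy_wait(dma_handle_in);
      /* Start the next input transfer so that it overlaps with the processing below. */
      dma_cpy_start(dma_handle_in);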
      normalize_image_sram_row_proc(src_ptr[buf_idx], 
                                    norm_param->mean_row_buf, 
                                    norm_param->scale_row_buf, 
                                    dst_ptr[buf_idx], 
                                    aligned_step, 
                                    y_size,
                                    sram_src_line_ele,
                                    sram_dst_line_ele);
    
      buf_idx = 1 - buf_idx;
      dma_cpy_wait(dma_handle_in);
      /* Cast: height is unsigned, so height - 2 would wrap around with fewer than 2 blocks. */
      for(i = 0 ; i < (int)height - 2 ; i ++){
        dma_cpy_start(dma_handle_in);
        dma_cpy_start(dma_handle_out);
        normalize_image_sram_row_proc(src_ptr[buf_idx], 
                                      norm_param->mean_row_buf, 
                                      norm_param->scale_row_buf, 
                                      dst_ptr[buf_idx], 
                                      aligned_step, 
                                      y_size,
                                      sram_src_line_ele,
                                      sram_dst_line_ele);
        buf_idx = 1 - buf_idx;
        dma_cpy_wait(dma_handle_in);
        dma_cpy_wait(dma_handle_out);
      }
      dma_cpy_start(dma_handle_out);
      normalize_image_sram_row_proc(src_ptr[buf_idx], 
                                    norm_param->mean_row_buf, 
                                    norm_param->scale_row_buf, 
                                    dst_ptr[buf_idx], 
                                    aligned_step, 
                                    y_size,
                                    sram_src_line_ele,
                                    sram_dst_line_ele);
      buf_idx = 1 - buf_idx;
      dma_cpy_wait(dma_handle_out);
      dma_cpy_start(dma_handle_out);
      dma_cpy_wait(dma_handle_out);
    }
    
    void normalize_image_uc3_f3_dma(unsigned char const*         src,
                                    unsigned int                 width,
                                    unsigned int                 height,
                                    unsigned int                 src_line_bytes,
                                    float*                       dst,
                                    unsigned int                 dst_line_bytes,
                                    SEVX_NORMALIZE_IMAGE_PARAM*  norm_param){
      int i = 0 ;
      unsigned char const*  src_ptr[2]   = { 
                                             norm_param->pingpong_in_a,
                                             norm_param->pingpong_in_b
                                           };
      float*                dst_ptr[2]   = { 
                                             norm_param->pingpong_out_a,
                                             norm_param->pingpong_out_b
                                           };
      int   aligned_step = ((3 * width) >> 2) << 2;
      int   y_size       = norm_param->y_size;
      int   buf_idx      = 0;
      int   sram_src_line_ele = norm_param->pingpong_in_size / y_size;
      int   sram_dst_line_ele = (norm_param->pingpong_out_size / y_size) / sizeof(float);
      DMA_CPY_TRIGGER dma_cpy_start  = norm_param->dma_cpy_start;
      DMA_CPY_WAIT    dma_cpy_wait   = norm_param->dma_cpy_wait;
      void*           dma_handle_in  = norm_param->dma_handle[0];
      void*           dma_handle_out = norm_param->dma_handle[1];
      height = height / y_size;
      dma_cpy_start(dma_handle_in);
      dma_cpy_wait(dma_handle_in);
      dma_cpy_start(dma_handle_in);
      normalize_image_sram_row_proc(src_ptr[buf_idx], 
                                    norm_param->mean_row_buf, 
                                    norm_param->scale_row_buf, 
                                    dst_ptr[buf_idx], 
                                    aligned_step, 
                                    y_size,
                                    sram_src_line_ele,
                                    sram_dst_line_ele);
      buf_idx = 1 - buf_idx;
      dma_cpy_wait(dma_handle_in);
      /* Cast: height is unsigned, so height - 2 would wrap around with fewer than 2 blocks. */
      for(i = 0 ; i < (int)height - 2 ; i ++){
        dma_cpy_start(dma_handle_in);
        dma_cpy_start(dma_handle_out);
        normalize_image_sram_row_proc(src_ptr[buf_idx], 
                                      norm_param->mean_row_buf, 
                                      norm_param->scale_row_buf, 
                                      dst_ptr[buf_idx], 
                                      aligned_step, 
                                      y_size,
                                      sram_src_line_ele,
                                      sram_dst_line_ele);
        buf_idx = 1 - buf_idx;
        dma_cpy_wait(dma_handle_in);
        dma_cpy_wait(dma_handle_out);
      }
      dma_cpy_start(dma_handle_out);
      normalize_image_sram_row_proc(src_ptr[buf_idx], 
                                    norm_param->mean_row_buf, 
                                    norm_param->scale_row_buf, 
                                    dst_ptr[buf_idx], 
                                    aligned_step, 
                                    y_size,
                                    sram_src_line_ele,
                                    sram_dst_line_ele);
      buf_idx = 1 - buf_idx;
      dma_cpy_wait(dma_handle_out);
      dma_cpy_start(dma_handle_out);
      dma_cpy_wait(dma_handle_out);
    }
    
    void normalize_image_uc3_f3_planes_dma(unsigned char const*         src,
                                           unsigned int                 width,
                                           unsigned int                 height,
                                           unsigned int                 src_line_bytes,
                                           float*                       dstB,
                                           float*                       dstG,
                                           float*                       dstR,
                                           unsigned int                 dst_line_bytes,
                                           SEVX_NORMALIZE_IMAGE_PARAM*  norm_param){
      int i = 0 ;
      unsigned char const*  src_ptr[2]   = { 
                                             norm_param->pingpong_in_a,
                                             norm_param->pingpong_in_b
                                           };
      unsigned int  pingpong_out_size    = (norm_param->pingpong_out_size / 3) >> 2;
      float*                dst_ptr[6]   = { 
                                             norm_param->pingpong_out_a,                         //B
                                             norm_param->pingpong_out_a + 2 * pingpong_out_size, //G
                                             norm_param->pingpong_out_a + 4 * pingpong_out_size, //R
                                             norm_param->pingpong_out_a +     pingpong_out_size,
                                             norm_param->pingpong_out_a + 3 * pingpong_out_size,
                                             norm_param->pingpong_out_a + 5 * pingpong_out_size,
                                           };
      int   aligned_step = ((width + 3) >> 2) << 2;
      int   y_size       = norm_param->y_size;
      int   buf_idx      = 0;
      int   sram_src_line_ele = norm_param->pingpong_in_size / y_size;
      int   sram_dst_line_ele = (norm_param->pingpong_out_size / (3 * y_size)) / sizeof(float);
      DMA_CPY_TRIGGER dma_cpy_start  = norm_param->dma_cpy_start;
      DMA_CPY_WAIT    dma_cpy_wait   = norm_param->dma_cpy_wait;
      void*           dma_handle_in  = norm_param->dma_handle[0];
      void*           dma_handle_B   = norm_param->dma_handle[1];
      void*           dma_handle_G   = norm_param->dma_handle[2];
      void*           dma_handle_R   = norm_param->dma_handle[3];
      height = height / y_size;
      dma_cpy_start(dma_handle_in);
      dma_cpy_wait(dma_handle_in);
      dma_cpy_start(dma_handle_in);
      normalize_image_sram_3planes_row_proc(src_ptr[buf_idx], 
                                    norm_param->mean_row_buf, 
                                    norm_param->scale_row_buf, 
                                    dst_ptr + 3 * buf_idx, 
                                    aligned_step, 
                                    y_size,
                                    sram_src_line_ele,
                                    sram_dst_line_ele);
      buf_idx = 1 - buf_idx;
      dma_cpy_wait(dma_handle_in);
      /* Cast: height is unsigned, so height - 2 would wrap around with fewer than 2 blocks. */
      for(i = 0 ; i < (int)height - 2 ; i ++){
        dma_cpy_start(dma_handle_in);
        dma_cpy_start(dma_handle_B);
        dma_cpy_start(dma_handle_G);
        dma_cpy_start(dma_handle_R);
        normalize_image_sram_3planes_row_proc(src_ptr[buf_idx], 
                                      norm_param->mean_row_buf, 
                                      norm_param->scale_row_buf, 
                                      dst_ptr + 3 * buf_idx, 
                                      aligned_step, 
                                      y_size,
                                      sram_src_line_ele,
                                      sram_dst_line_ele);
        buf_idx = 1 - buf_idx;
        dma_cpy_wait(dma_handle_in);
        dma_cpy_wait(dma_handle_B);
        dma_cpy_wait(dma_handle_G);
        dma_cpy_wait(dma_handle_R);
      }
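      /* Epilogue: no input left to fetch; write back the previous block while the last one is processed. */
      dma_cpy_start(dma_handle_B);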
      dma_cpy_start(dma_handle_G);
      dma_cpy_start(dma_handle_R);
      normalize_image_sram_3planes_row_proc(src_ptr[buf_idx], 
                                    norm_param->mean_row_buf, 
                                    norm_param->scale_row_buf, 
                                    dst_ptr + 3 * buf_idx, 
                                    aligned_step, 
                                    y_size,
                                    sram_src_line_ele,
                                    sram_dst_line_ele);
      buf_idx = 1 - buf_idx;
      dma_cpy_wait(dma_handle_B);
      dma_cpy_wait(dma_handle_G);
      dma_cpy_wait(dma_handle_R);
      /* Flush the output planes of the last block. */
      dma_cpy_start(dma_handle_B);
      dma_cpy_start(dma_handle_G);
      dma_cpy_start(dma_handle_R);
      dma_cpy_wait(dma_handle_B);
      dma_cpy_wait(dma_handle_G);
      dma_cpy_wait(dma_handle_R);
    }
    
    #include "sevx_cv_kernel_normalize_image_target.h"
    #include <TI/tivx.h>
    #include <TI/tivx_target_kernel.h>
    #include <tivx_kernels_target_utils.h>
    #include <sevx_cv_kernel/sevx_cv_kernel_normalize_image.h>
    #include <app_ipc.h>
    #include <app_udma.h>
    #include <math.h>
    #include <stdint.h>   /* uintptr_t, uint64_t */
    #include <string.h>   /* memset, memcpy */
    #include "cv_kernel_implement.h"
    
    static tivx_target_kernel sevx_normalize_image_target_kernel = NULL;
    
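    #define  DMA_DATA_ALIGNMENT       128   /* row alignment of the SRAM ping-pong buffers, in bytes */
    #define  DMA_DATA_ALIGNMENT_BITS  7     /* log2(DMA_DATA_ALIGNMENT) */
    #define  PINGPONG_NUM             2     /* number of buffers in a ping-pong pair */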
    //static const unsigned int  FIXPT_BITS    = 11;
    //static const unsigned int  FIXPT_ONE_VAL = 2048;//(1 << FIXPT_BITS);
    
    typedef struct tag_sevx_norm_image_inst{
      SEVX_NORMALIZE_IMAGE_PARAM  param;
      app_udma_copy_nd_prms_t     dma_param_in;
      app_udma_copy_nd_prms_t     dma_param_out;
    
      app_udma_copy_nd_prms_t     dma_param_R;
      app_udma_copy_nd_prms_t     dma_param_G;
      app_udma_copy_nd_prms_t     dma_param_B;
    }SEVX_NORM_IMAGE_INST, *PSEVX_NORM_IMAGE_INST;
    
    static void print_alg_log(char const* msg){
      VX_PRINT(VX_ZONE_INIT, "%s", msg);
    }
    
    /* Thin adapters so the algorithm code can trigger and wait on a channel
     * through function pointers without calling the app_udma API directly. */
    static int dma_cpy_start(void* dma_handle)
    {
      return appUdmaCopyNDTrigger(dma_handle);
    }
    
    static int dma_cpy_wait(void* dma_handle)
    {
      return appUdmaCopyNDWait(dma_handle);
    }
    
    static vx_status dma_create(SEVX_NORMALIZE_IMAGE_PARAM* instance, 
                                unsigned int dma_idx,
                                unsigned int dma_ch)
    {
      vx_status status = VX_SUCCESS;
      instance->dma_channel[dma_idx] = dma_ch;
      instance->dma_handle[dma_idx]  = appUdmaCopyNDGetHandle(dma_ch);
      if(NULL == instance->dma_handle[dma_idx]){
        VX_PRINT(VX_ZONE_ERROR, "Unable to create DMA handle %d\n", instance->dma_channel[dma_idx]);
        status = VX_FAILURE;
      }
      instance->dma_cpy_start = dma_cpy_start;
      instance->dma_cpy_wait  = dma_cpy_wait;
      return status;
    }
    
    static vx_status dma_delete(SEVX_NORMALIZE_IMAGE_PARAM* instance, int dma_idx)
    {
      vx_status status = VX_SUCCESS;
      if(NULL != instance->dma_handle[dma_idx]){
        instance->dma_handle[dma_idx] = NULL;
        int32_t retVal = appUdmaCopyNDReleaseHandle(instance->dma_channel[dma_idx]);
        if(retVal != 0){
          VX_PRINT(VX_ZONE_ERROR, "Unable to release DMA handle %d\n", instance->dma_channel[dma_idx]);
          status = VX_FAILURE;
        }
      }
      return status;
    }
    
    static vx_status dma_init(SEVX_NORMALIZE_IMAGE_PARAM* instance, 
                              int dma_idx, 
                              app_udma_copy_nd_prms_t* xfer_param){
      vx_status status = VX_SUCCESS;
      status = appUdmaCopyNDInit(instance->dma_handle[dma_idx], xfer_param);
      if ((vx_status)VX_SUCCESS != status){
        VX_PRINT(VX_ZONE_ERROR, "dma_init failed\n");
      }
      return status;
    }
    
    static vx_status dma_deinit(SEVX_NORMALIZE_IMAGE_PARAM* instance, 
                                int dma_idx)
    {
      vx_status status = VX_SUCCESS;
      status = appUdmaCopyNDDeinit(instance->dma_handle[dma_idx]);
      return status;
    }
    
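    /* Set up two 4-D transfers for the interleaved (non-planar) path:
     * handle 0 moves blocks of y_size source rows from `src` into the SRAM input ping-pong buffer,
     * handle 1 moves blocks of float rows from the SRAM output ping-pong buffer to `dst`.
     * icnt0/icnt1 describe one block, icnt2 alternates between ping and pong,
     * and icnt3 counts the ping-pong pairs needed to cover the image. */
    static vx_status dma_preproc(SEVX_NORM_IMAGE_INST*  norm_instance,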
                                 unsigned char const*   src,
                                 unsigned int           width,
                                 unsigned int           height,
                                 unsigned int           channel,
                                 unsigned int           src_line_bytes,
                                 float*                 dst,
                                 unsigned int           dst_line_bytes){
      SEVX_NORMALIZE_IMAGE_PARAM* param = &(norm_instance->param);
      unsigned int in_blk_width    = channel * width;
      unsigned int in_blk_height   = param->y_size;
      unsigned int out_blk_width   = channel * width * sizeof(float);
      unsigned int out_blk_height  = param->y_size;
      int          res             = 0;
      app_udma_copy_nd_prms_t*  dma_param_in  = &(norm_instance->dma_param_in);
      app_udma_copy_nd_prms_t*  dma_param_out = &(norm_instance->dma_param_out);
      memset(dma_param_in,  0, sizeof(app_udma_copy_nd_prms_t));
      memset(dma_param_out, 0, sizeof(app_udma_copy_nd_prms_t));
      dma_param_in->copy_mode  = 2;
      dma_param_in->src_addr   = (uint64_t)src;
      dma_param_in->dest_addr  = (((uintptr_t)param->pingpong_in_a) + param->sram_global_base);
      dma_param_in->icnt0      = in_blk_width;
      dma_param_in->icnt1      = in_blk_height;
      dma_param_in->icnt2      = PINGPONG_NUM; /* Ping-pong */
      dma_param_in->icnt3      = (height / in_blk_height) / PINGPONG_NUM;
      dma_param_in->dim1       = src_line_bytes;
      dma_param_in->dim2       = (in_blk_height * src_line_bytes);
      dma_param_in->dim3       = (in_blk_height * src_line_bytes * PINGPONG_NUM);
    
      dma_param_in->dicnt0     = dma_param_in->icnt0;
      dma_param_in->dicnt1     = dma_param_in->icnt1;
      dma_param_in->dicnt2     = PINGPONG_NUM; /* Ping-pong */
      dma_param_in->dicnt3     = (height / in_blk_height) / PINGPONG_NUM;
      dma_param_in->ddim1      = param->pingpong_in_size / in_blk_height;
      dma_param_in->ddim2      = param->pingpong_in_size;
      dma_param_in->ddim3      = 0;
      res = dma_init(param, 0, dma_param_in);
    
      dma_param_out->copy_mode = 2;
      dma_param_out->src_addr  = (((uintptr_t)param->pingpong_out_a) + param->sram_global_base);
      dma_param_out->dest_addr = (uint64_t)dst;
      dma_param_out->icnt0     = out_blk_width;
      dma_param_out->icnt1     = out_blk_height;
      dma_param_out->icnt2     = PINGPONG_NUM; /* Double buffer */
      dma_param_out->icnt3     = (height / out_blk_height) / PINGPONG_NUM;
      dma_param_out->dim1      = param->pingpong_out_size / out_blk_height;
      dma_param_out->dim2      = param->pingpong_out_size;
      dma_param_out->dim3      = 0;
    
      dma_param_out->dicnt0    = dma_param_out->icnt0;
      dma_param_out->dicnt1    = dma_param_out->icnt1;
      dma_param_out->dicnt2    = PINGPONG_NUM;
      dma_param_out->dicnt3    = (height / out_blk_height) / PINGPONG_NUM;
      dma_param_out->ddim1     = dst_line_bytes;
      dma_param_out->ddim2     = (out_blk_height * dst_line_bytes);
      dma_param_out->ddim3     = (out_blk_height * dst_line_bytes * PINGPONG_NUM);
      res |= dma_init(param, 1, dma_param_out);
    
      return res;
    }
    
    static vx_status dma_postproc(SEVX_NORM_IMAGE_INST* norm_instance){
      vx_status status = VX_SUCCESS;
      status  = dma_deinit(&(norm_instance->param), 0);
      status |= dma_deinit(&(norm_instance->param), 1);
      return status;
    }
    
    static vx_status dma_preproc_3planes(SEVX_NORM_IMAGE_INST*  norm_instance,
                                         unsigned char const*   src,
                                         unsigned int           width,
                                         unsigned int           height,
                                         unsigned int           src_line_bytes,
                                         float*                 dstB,
                                         float*                 dstG,
                                         float*                 dstR,
                                         unsigned int           dst_line_bytes,
                                         unsigned int           swap_channel){
      SEVX_NORMALIZE_IMAGE_PARAM* param = &(norm_instance->param);
      unsigned int in_blk_width      = 3 * width;
      unsigned int in_blk_height     = param->y_size;
      unsigned int out_blk_width     = width * sizeof(float);
      unsigned int out_blk_height    = param->y_size;
      int          res               = 0;
      unsigned int pingpong_out_size = param->pingpong_out_size / 3;
      unsigned char* pingpong_out_B  = (unsigned char*)(param->pingpong_out_a);
      unsigned char* pingpong_out_G  = pingpong_out_B + (pingpong_out_size << 1);
      unsigned char* pingpong_out_R  = pingpong_out_G + (pingpong_out_size << 1);
    
      app_udma_copy_nd_prms_t*  dma_param_in = &(norm_instance->dma_param_in);
      app_udma_copy_nd_prms_t*  dma_param_B  = &(norm_instance->dma_param_B);
      app_udma_copy_nd_prms_t*  dma_param_G  = &(norm_instance->dma_param_G);
      app_udma_copy_nd_prms_t*  dma_param_R  = &(norm_instance->dma_param_R);
      memset(dma_param_in,  0, sizeof(app_udma_copy_nd_prms_t));
      memset(dma_param_B,   0, sizeof(app_udma_copy_nd_prms_t));
      memset(dma_param_G,   0, sizeof(app_udma_copy_nd_prms_t));
      memset(dma_param_R,   0, sizeof(app_udma_copy_nd_prms_t));
    
      dma_param_in->copy_mode  = 2;
      dma_param_in->src_addr   = (uint64_t)src;
      dma_param_in->dest_addr  = (((uintptr_t)param->pingpong_in_a) + param->sram_global_base);
      dma_param_in->icnt0      = in_blk_width;
      dma_param_in->icnt1      = in_blk_height;
      dma_param_in->icnt2      = PINGPONG_NUM; /* Ping-pong */
      dma_param_in->icnt3      = (height / in_blk_height) / PINGPONG_NUM;
      dma_param_in->dim1       = src_line_bytes;
      dma_param_in->dim2       = (in_blk_height * src_line_bytes);
      dma_param_in->dim3       = (in_blk_height * src_line_bytes * PINGPONG_NUM);
    
      dma_param_in->dicnt0     = dma_param_in->icnt0;
      dma_param_in->dicnt1     = dma_param_in->icnt1;
      dma_param_in->dicnt2     = PINGPONG_NUM; /* Ping-pong */
      dma_param_in->dicnt3     = (height / in_blk_height) / PINGPONG_NUM;
      dma_param_in->ddim1      = param->pingpong_in_size / in_blk_height;
      dma_param_in->ddim2      = param->pingpong_in_size;
      dma_param_in->ddim3      = 0;
      res = dma_init(param, 0, dma_param_in);
    
      dma_param_B->copy_mode   = 2;
      dma_param_B->src_addr    = (((uintptr_t)pingpong_out_B) + param->sram_global_base);
      dma_param_B->dest_addr   = (uint64_t)dstB;
      dma_param_B->icnt0       = out_blk_width;
      dma_param_B->icnt1       = out_blk_height;
      dma_param_B->icnt2       = PINGPONG_NUM; /* Double buffer */
      dma_param_B->icnt3       = (height / out_blk_height) / PINGPONG_NUM;
      dma_param_B->dim1        = pingpong_out_size / out_blk_height;
      dma_param_B->dim2        = pingpong_out_size;
      dma_param_B->dim3        = 0;
    
      dma_param_B->dicnt0      = dma_param_B->icnt0;
      dma_param_B->dicnt1      = dma_param_B->icnt1;
      dma_param_B->dicnt2      = PINGPONG_NUM;
      dma_param_B->dicnt3      = (height / out_blk_height) / PINGPONG_NUM;
      dma_param_B->ddim1       = dst_line_bytes;
      dma_param_B->ddim2       = (out_blk_height * dst_line_bytes);
      dma_param_B->ddim3       = (out_blk_height * dst_line_bytes * PINGPONG_NUM);
    
      memcpy(dma_param_G, dma_param_B, sizeof(app_udma_copy_nd_prms_t));
      memcpy(dma_param_R, dma_param_B, sizeof(app_udma_copy_nd_prms_t));
      dma_param_G->src_addr  = (((uintptr_t)pingpong_out_G) + param->sram_global_base);
      dma_param_G->dest_addr = (uint64_t)dstG;
      dma_param_R->src_addr  = (((uintptr_t)pingpong_out_R) + param->sram_global_base);
      dma_param_R->dest_addr = (uint64_t)dstR;
      if(0 != swap_channel){
        dma_param_B->dest_addr = (uint64_t)dstR;
        dma_param_R->dest_addr = (uint64_t)dstB;
      }
      /* Accumulate with |= so a failure of the input channel init above is not overwritten. */
      res |= dma_init(param, 1, dma_param_B);
      res |= dma_init(param, 2, dma_param_G);
      res |= dma_init(param, 3, dma_param_R);
    
      return res;
    }
    
    static vx_status dma_postproc_3planes(SEVX_NORM_IMAGE_INST* norm_instance){
      vx_status status = VX_SUCCESS;
      status  = dma_deinit(&(norm_instance->param), 0);
      status |= dma_deinit(&(norm_instance->param), 1);
      status |= dma_deinit(&(norm_instance->param), 2);
      status |= dma_deinit(&(norm_instance->param), 3);
      return status;
    }
    
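    /* Validate the destination tensor against the source image and report its geometry.
     * Accepted layouts: 1 x W x H or W x H for single-channel input; 3 x W x H (interleaved)
     * or W x H x 3 (separate planes) for 3-channel input. Any further dimensions must be 1.
     * Returns 0 on success, -1 if the layout does not match. */
    static int  getOutTensorDims(unsigned int                  src_w,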
                                 unsigned int                  src_h,
                                 unsigned int                  src_c,
                                 tivx_obj_desc_tensor_t const* dst_tensor,
                                 unsigned int*                 dst_w, 
                                 unsigned int*                 dst_h,
                                 unsigned int*                 dst_c,
                                 unsigned int*                 dst_line_bytes,
                                 unsigned int*                 plane_bytes,
                                 unsigned int                  out_sperate_planes){
      unsigned int dst_dims = dst_tensor->number_of_dimensions;
      unsigned int i = 0;
      if(1 == src_c){
        if(1 == dst_tensor->dimensions[0]){
          if(src_w == dst_tensor->dimensions[1] &&
             src_h == dst_tensor->dimensions[2]){
            for(i = 3 ; i < dst_dims ; i ++){
              if(1 != dst_tensor->dimensions[i]){
                return -1;
              }
            }
            *dst_c = 1;
            *dst_w = dst_tensor->dimensions[1];
            *dst_h = dst_tensor->dimensions[2];
            *dst_line_bytes = dst_tensor->stride[2];
            *plane_bytes    = dst_tensor->stride[2] * src_h;
          }else{
            return -1;
          }
        }else{
          if(src_w == dst_tensor->dimensions[0] &&
             src_h == dst_tensor->dimensions[1]){
            for(i = 2 ; i < dst_dims ; i ++){
              if(1 != dst_tensor->dimensions[i]){
                return -1;
              }
            }
            *dst_c = 1;
            *dst_w = dst_tensor->dimensions[0];
            *dst_h = dst_tensor->dimensions[1];
            *dst_line_bytes = dst_tensor->stride[1];
            *plane_bytes    = dst_tensor->stride[1] * src_h;
          }else{
            return -1;
          }
        }
      }else{
        if(0 == out_sperate_planes){
          if(    3 == dst_tensor->dimensions[0] &&
             src_w == dst_tensor->dimensions[1] &&
             src_h == dst_tensor->dimensions[2]){
            for(i = 3 ; i < dst_dims ; i ++){
              if(1 != dst_tensor->dimensions[i]){
                return -1;
              }
            }
            *dst_c = 3;
            *dst_w = dst_tensor->dimensions[1];
            *dst_h = dst_tensor->dimensions[2];
            *dst_line_bytes = dst_tensor->stride[2];
            *plane_bytes    = dst_tensor->stride[2] * src_h;
          }else{
            return -1;
          }
        }else{
          if(src_w == dst_tensor->dimensions[0] &&
             src_h == dst_tensor->dimensions[1] &&
             3     == dst_tensor->dimensions[2]){
            for(i = 3 ; i < dst_dims ; i ++){
              if(1 != dst_tensor->dimensions[i]){
                return -1;
              }
            }
            *dst_c = 3;
            *dst_w = dst_tensor->dimensions[0];
            *dst_h = dst_tensor->dimensions[1];
            *dst_line_bytes = dst_tensor->stride[1];
            *plane_bytes    = dst_tensor->stride[2];
          }else{
            return -1;
          }
        }
      }
    
      return 0;
    }
    
    static vx_status VX_CALLBACK sevxKernelCVNormalizeImageProcess(
        tivx_target_kernel_instance kernel, tivx_obj_desc_t *obj_desc[],
        uint16_t num_params, void *priv_arg){
      vx_status status = (vx_status)VX_SUCCESS;
      vx_uint32   src_w;
      vx_uint32   src_h;
      vx_uint32   src_c;
      vx_uint32   dst_w = 0;
      vx_uint32   dst_h = 0;
      vx_uint32   dst_c = 0 ;
      vx_df_image src_fmt;
      tivx_obj_desc_image_t  *src_desc;
      SEVX_NORM_IMAGE_INST   *instance = NULL;
      tivx_obj_desc_tensor_t *dst_desc;
      uint32_t instance_size = 0;
      int      i             = 0;
      vx_uint32  src_line_bytes  = 0;
      vx_uint32  dst_line_bytes  = 0;
      vx_uint32  dst_plane_bytes = 0;
      //uint32_t loop_test  = 0;
      if ((num_params != SEVX_KERNEL_NORMALIZE_IMAGE_MAX_PARAMS)
            || (NULL == obj_desc[SEVX_KERNEL_NORMALIZE_IMAGE_SRC_IDX])
            || (NULL == obj_desc[SEVX_KERNEL_NORMALIZE_CFG_IDX])
            || (NULL == obj_desc[SEVX_KERNEL_NORMALIZE_IMAGE_DST_IDX])
          )
      {
        VX_PRINT(VX_ZONE_ERROR, "invalid parameter count or NULL descriptor\n");
        status = (vx_status)VX_ERROR_INVALID_PARAMETERS;
      }
    
      if ((vx_status)VX_SUCCESS == status)
      {
        src_desc = (tivx_obj_desc_image_t*)obj_desc[SEVX_KERNEL_NORMALIZE_IMAGE_SRC_IDX];
        dst_desc = (tivx_obj_desc_tensor_t*)obj_desc[SEVX_KERNEL_NORMALIZE_IMAGE_DST_IDX];
      }
    
      if((vx_status)VX_SUCCESS == status){
        status = tivxGetTargetKernelInstanceContext(kernel,
            (void **)&instance, &instance_size);
        if (((vx_status)VX_SUCCESS != status) || (NULL == instance) ||
            (sizeof(SEVX_NORM_IMAGE_INST) != instance_size))
        {
          VX_PRINT(VX_ZONE_ERROR, "invalid kernel instance context\n");
          status = (vx_status)VX_FAILURE;
        }
    
        if((vx_status)VX_SUCCESS == status){
          src_w   = src_desc->imagepatch_addr[0U].dim_x;
          src_h   = src_desc->imagepatch_addr[0U].dim_y;
          src_line_bytes = src_desc->imagepatch_addr[0U].stride_y;
          src_fmt = src_desc->format;
          src_c   = (((vx_df_image)VX_DF_IMAGE_U8) == src_fmt ? 1 : 3);
          int dims_res = getOutTensorDims(src_w, 
                           src_h, 
                           src_c, 
                           dst_desc, 
                           &dst_w, 
                           &dst_h, 
                           &dst_c, 
                           &dst_line_bytes,
                           &dst_plane_bytes,
                           instance->param.cfg.out_seperate_plane);
          if(0 != dims_res){
            VX_PRINT(VX_ZONE_ERROR, "dst tensor layout does not match src: src (c, w, h) = (%d, %d, %d), dst (c, w, h) = (%d, %d, %d)\n",
                     src_c, src_w, src_h, dst_c, dst_w, dst_h);
            status = (vx_status)VX_ERROR_INVALID_PARAMETERS;
          }else{
            if((0 != (dst_w & 3)) && 
                0 == instance->param.use_dma){
              VX_PRINT(VX_ZONE_ERROR, "width of 'dst_tensor' is %d, which is not a multiple of 4\n", dst_w);
              status = (vx_status)VX_ERROR_INVALID_PARAMETERS;
            }
          }
        }
      }
    
      if((vx_status)VX_SUCCESS == status)
      {
        unsigned char  *src_desc_target_ptr;
        unsigned char  *dst_desc_target_ptr;
    
        src_desc_target_ptr = tivxMemShared2TargetPtr(&src_desc->mem_ptr[0]);
        dst_desc_target_ptr = tivxMemShared2TargetPtr(&dst_desc->mem_ptr);
    
        memset(instance->param.work_buf, 0, instance->param.work_buf_len_aligned);
        if(1 == dst_c){
          for(i = 0 ; i < src_w ; i ++){
            instance->param.mean_row_buf[i]  = instance->param.cfg.mean[0];
            instance->param.scale_row_buf[i] = instance->param.cfg.scale[0];
          }
        }else{
          if(0 == instance->param.cfg.out_seperate_plane){
            for(i = 0 ; i < src_w ; i ++){
              instance->param.mean_row_buf[3 * i]      = instance->param.cfg.mean[0];
              instance->param.mean_row_buf[3 * i + 1]  = instance->param.cfg.mean[1];
              instance->param.mean_row_buf[3 * i + 2]  = instance->param.cfg.mean[2];
              instance->param.scale_row_buf[3 * i]     = instance->param.cfg.scale[0];
              instance->param.scale_row_buf[3 * i + 1] = instance->param.cfg.scale[1];
              instance->param.scale_row_buf[3 * i + 2] = instance->param.cfg.scale[2];
            }
          }else{
            for(i = 0 ; i < 4 ; i ++){
              instance->param.mean_row_buf[3 * i]      = instance->param.cfg.mean[0];
              instance->param.mean_row_buf[3 * i + 1]  = instance->param.cfg.mean[1];
              instance->param.mean_row_buf[3 * i + 2]  = instance->param.cfg.mean[2];
              instance->param.scale_row_buf[3 * i]     = instance->param.cfg.scale[0];
              instance->param.scale_row_buf[3 * i + 1] = instance->param.cfg.scale[1];
              instance->param.scale_row_buf[3 * i + 2] = instance->param.cfg.scale[2];
            }
          }
        }
    
        if(0 == instance->param.use_dma
           || (src_h < 2)
           || (0 != (src_h & 1))
           ){
          VX_PRINT(VX_ZONE_INIT, "no dma\n");
          VX_PRINT(VX_ZONE_INIT, "use_dma: %d, src_line_bytes: %d, dst_line_bytes: %d\n", 
                   instance->param.use_dma,
                   src_line_bytes,
                   dst_line_bytes);
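          /* Cast explicitly: passing a raw pointer for %lx is undefined behavior. */
          VX_PRINT(VX_ZONE_INIT, "src: %08lx, dst: %08lx\n",
                   (unsigned long)(uintptr_t)src_desc_target_ptr,
                   (unsigned long)(uintptr_t)dst_desc_target_ptr);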
          /* Map all buffers, which invalidates the cache */
          tivxCheckStatus(&status, tivxMemBufferMap(src_desc_target_ptr,
                src_desc->mem_size[0], (vx_enum)VX_MEMORY_TYPE_HOST,
                (vx_enum)VX_READ_ONLY));
          tivxCheckStatus(&status, tivxMemBufferMap(dst_desc_target_ptr,
                dst_desc->mem_size, (vx_enum)VX_MEMORY_TYPE_HOST,
                (vx_enum)VX_WRITE_ONLY));
          if(1 == dst_c){
            normalize_image_uc1_f1(src_desc_target_ptr, 
                                  src_w,
                                  src_h,
                                  src_line_bytes,
                                  (float*)dst_desc_target_ptr,
                                  dst_line_bytes,
                                  &(instance->param));
          }else{
            normalize_image_uc3_f3(src_desc_target_ptr, 
                                   src_w,
                                   src_h,
                                   src_line_bytes,
                                   (float*)dst_desc_target_ptr,
                                   dst_line_bytes,
                                   &(instance->param));
          }
          tivxCheckStatus(&status, tivxMemBufferUnmap(src_desc_target_ptr,
              src_desc->mem_size[0], (vx_enum)VX_MEMORY_TYPE_HOST,
              (vx_enum)VX_READ_ONLY));
          tivxCheckStatus(&status, tivxMemBufferUnmap(dst_desc_target_ptr,
              dst_desc->mem_size, (vx_enum)VX_MEMORY_TYPE_HOST,
              (vx_enum)VX_WRITE_ONLY));
        }else{
          tivxCheckStatus(&status, tivxMemBufferMap(src_desc_target_ptr,
                src_desc->mem_size[0], (vx_enum)TIVX_MEMORY_TYPE_DMA,
                (vx_enum)VX_READ_ONLY));
          tivxCheckStatus(&status, tivxMemBufferMap(dst_desc_target_ptr,
                dst_desc->mem_size, (vx_enum)TIVX_MEMORY_TYPE_DMA,
                (vx_enum)VX_WRITE_ONLY));
          if(3 != dst_c ||
             0 == instance->param.cfg.out_seperate_plane){
            /* Propagate DMA setup failures instead of discarding them. */
            tivxCheckStatus(&status,
                            dma_preproc(instance,
                                        src_desc_target_ptr,
                                        src_w,
                                        src_h,
                                        dst_c,
                                        src_line_bytes,
                                        (float*)dst_desc_target_ptr,
                                        dst_line_bytes));
          }else{
            /* Propagate DMA setup failures instead of discarding them. */
            tivxCheckStatus(&status,
                            dma_preproc_3planes(instance,
                                                src_desc_target_ptr,
                                                src_w,
                                                src_h,
                                                src_line_bytes,
                                                (float*)dst_desc_target_ptr,
                                                (float*)(dst_desc_target_ptr + dst_plane_bytes),
                                                (float*)(dst_desc_target_ptr + (dst_plane_bytes << 1)),
                                                dst_line_bytes,
                                                instance->param.cfg.swap_channel));
          }
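          if((vx_status)VX_SUCCESS != status){
            /* An earlier step failed; do not trigger any transfers. */
            VX_PRINT(VX_ZONE_ERROR, "setup failed, skipping DMA normalization\n");
          }else if(1 == dst_c){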
            normalize_image_uc1_f1_dma(src_desc_target_ptr, 
                                       src_w,
                                       src_h,
                                       src_line_bytes,
                                       (float*)dst_desc_target_ptr,
                                       dst_line_bytes,
                                       &(instance->param));
          }else{
            if(0 == instance->param.cfg.out_seperate_plane){
              normalize_image_uc3_f3_dma(src_desc_target_ptr, 
                                        src_w,
                                        src_h,
                                        src_line_bytes,
                                        (float*)dst_desc_target_ptr,
                                        dst_line_bytes,
                                        &(instance->param));
            }else{
              normalize_image_uc3_f3_planes_dma(src_desc_target_ptr, 
                                                src_w,
                                                src_h,
                                                src_line_bytes,
                                                (float*)dst_desc_target_ptr,
                                                (float*)(dst_desc_target_ptr + dst_plane_bytes),
                                                (float*)(dst_desc_target_ptr + (dst_plane_bytes << 1)),
                                                dst_line_bytes,
                                                &(instance->param));
            }
          }
          if(3 != dst_c ||
             0 == instance->param.cfg.out_seperate_plane){
            tivxCheckStatus(&status, dma_postproc(instance));
          }else{
            tivxCheckStatus(&status, dma_postproc_3planes(instance));
          }
          tivxCheckStatus(&status, tivxMemBufferUnmap(src_desc_target_ptr,
              src_desc->mem_size[0], (vx_enum)TIVX_MEMORY_TYPE_DMA,
              (vx_enum)VX_READ_ONLY));
          tivxCheckStatus(&status, tivxMemBufferUnmap(dst_desc_target_ptr,
              dst_desc->mem_size, (vx_enum)TIVX_MEMORY_TYPE_DMA,
              (vx_enum)VX_WRITE_ONLY));
        }
      }
      return status;
    }
    
    static void sevxNormalizeImageFreeInstance(SEVX_NORM_IMAGE_INST *instance)
    {
      if (NULL != instance)
      {
        if (NULL != instance->param.work_buf){
          if(NULL != instance->param.sram){
            tivxMemFree(instance->param.sram, 
                        instance->param.work_buf_len_aligned, 
                        instance->param.sram_heap_id);
          }else{
            tivxMemFree(instance->param.work_buf, 
                        instance->param.work_buf_len_aligned, 
                        (vx_enum)TIVX_MEM_EXTERNAL);
          }
          instance->param.sram     = NULL;
          instance->param.work_buf = NULL;
        }
        dma_delete(&(instance->param), 0);
        dma_delete(&(instance->param), 1);
        dma_delete(&(instance->param), 2);
        dma_delete(&(instance->param), 3);
        tivxMemFree(instance, 
                    sizeof(SEVX_NORM_IMAGE_INST), 
                    (vx_enum)TIVX_MEM_EXTERNAL);
      }
    }
    
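    /* Pick the number of rows per DMA block: start from the largest y_size for which two
     * input and two output blocks fit in available_ram_size, then shrink it until height
     * is a multiple of 2 * y_size. Falls back to 1 if no such size exists, and returns 0
     * if not even one set of rows fits, so callers should check the result before dividing by it. */
    static unsigned int calc_dma_y_size(unsigned int available_ram_size,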
                                        unsigned int height,
                                        unsigned int in_x_size,
                                        unsigned int out_x_size){
      unsigned int y_size = 1;
      unsigned int pingpong_in_size  = (((in_x_size + DMA_DATA_ALIGNMENT - 1) >> 
                                       DMA_DATA_ALIGNMENT_BITS) << DMA_DATA_ALIGNMENT_BITS);
      unsigned int pingpong_out_size = (((out_x_size + DMA_DATA_ALIGNMENT - 1) >> 
                                       DMA_DATA_ALIGNMENT_BITS) << DMA_DATA_ALIGNMENT_BITS);
      unsigned int pingpong_size = (pingpong_in_size + pingpong_out_size) * 2;
      y_size = available_ram_size / pingpong_size;
      y_size = y_size > height ? height : y_size;
      while(y_size > 1 &&
            (0 != (height % (2 * y_size)))){
        y_size --;
      }
      return y_size;
    }
    
    static unsigned int calc_dma_y_size_3planes(unsigned int available_ram_size,
                                                unsigned int width,
                                                unsigned int height){
      unsigned int y_size = 1;
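      /* Assumes width is in pixels: an interleaved RGB input row is 3 bytes per pixel,
       * matching the pingpong_in computation in the create callback. */
      unsigned int pingpong_in_size  = (((3 * width + DMA_DATA_ALIGNMENT - 1) >>
                                       DMA_DATA_ALIGNMENT_BITS) << DMA_DATA_ALIGNMENT_BITS);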
      unsigned int out_x_size        = width * sizeof(float);
      unsigned int pingpong_out_size = (((out_x_size + DMA_DATA_ALIGNMENT - 1) >> 
                                       DMA_DATA_ALIGNMENT_BITS) << DMA_DATA_ALIGNMENT_BITS);
      pingpong_out_size = pingpong_out_size * 3;
      unsigned int pingpong_size = (pingpong_in_size + pingpong_out_size) * 2;
      y_size = available_ram_size / pingpong_size;
      y_size = y_size > height ? height : y_size;
      while(y_size > 1 &&
            (0 != (height % (2 * y_size)))){
        y_size --;
      }
      return y_size;
    }
    
    static vx_status VX_CALLBACK sevxKernelCVNormalizeImageCreate(
        tivx_target_kernel_instance kernel, tivx_obj_desc_t *obj_desc[],
      uint16_t num_params, void *priv_arg){
      vx_status status = (vx_status)VX_SUCCESS;
      tivx_obj_desc_image_t            *in_img   = NULL;
      tivx_obj_desc_user_data_object_t *cfg_desc = NULL;
      SEVX_NORM_IMAGE_INST             *instance = NULL;
      int i = 0;
      if (num_params != SEVX_KERNEL_NORMALIZE_IMAGE_MAX_PARAMS){
        status = (vx_status)VX_FAILURE;
      }else{
        /* Every parameter descriptor must be present. */
        for (i = 0; i < SEVX_KERNEL_NORMALIZE_IMAGE_MAX_PARAMS; i ++)
        {
          if (NULL == obj_desc[i]){
            status = (vx_status)VX_FAILURE;
            break;
          }
        }
      }
    
      if ((vx_status)VX_SUCCESS == status)
      {
        in_img   = (tivx_obj_desc_image_t *)obj_desc[
            SEVX_KERNEL_NORMALIZE_IMAGE_SRC_IDX];
        cfg_desc = (tivx_obj_desc_user_data_object_t *)obj_desc[
            SEVX_KERNEL_NORMALIZE_CFG_IDX];
        instance = tivxMemAlloc(sizeof(SEVX_NORM_IMAGE_INST), (vx_enum)TIVX_MEM_EXTERNAL);
        if (NULL != instance) {
          memset(instance, 0, sizeof(SEVX_NORM_IMAGE_INST));
          unsigned int   src_width        = in_img->imagepatch_addr[0U].dim_x;
          unsigned int   src_height       = in_img->imagepatch_addr[0U].dim_y;
          unsigned int   src_line_size    = in_img->imagepatch_addr[0U].stride_y;
          vx_df_image    src_fmt          = in_img->format;
          unsigned int   channel          = 1;
          unsigned int   mean_scale_len   = 0;
          unsigned int   pingpong_in      = 0;
          unsigned int   pingpong_out     = 0;
          unsigned int   dma_work_buf_len = 0;
          unsigned int   y_size           = 1;
          
          vx_uint32      avail_size   = 0;
          tivx_mem_stats sram_stats;
    
          if ((vx_df_image)VX_DF_IMAGE_U8 != src_fmt &&
              (vx_df_image)VX_DF_IMAGE_RGB != src_fmt)
          {
            status = (vx_status)VX_ERROR_INVALID_PARAMETERS;
            VX_PRINT(VX_ZONE_ERROR, "'input' should be an image of type:\n VX_DF_IMAGE_U8 \n VX_DF_IMAGE_RGB \n");
          }else{
            channel = (((vx_df_image)VX_DF_IMAGE_U8) == src_fmt ? 1 : 3);
          }
    
          NORM_IMAGE_CFG* cfg_mem = (NORM_IMAGE_CFG*)tivxMemShared2TargetPtr(&cfg_desc->mem_ptr);
          tivxCheckStatus(&status, tivxMemBufferMap(cfg_mem,
                cfg_desc->mem_size, (vx_enum)VX_MEMORY_TYPE_HOST,
                (vx_enum)VX_READ_ONLY));
          memcpy(&(instance->param.cfg), cfg_mem, sizeof(NORM_IMAGE_CFG));
          tivxCheckStatus(&status, tivxMemBufferUnmap(cfg_mem,
                cfg_desc->mem_size, (vx_enum)VX_MEMORY_TYPE_HOST,
                (vx_enum)VX_READ_ONLY));
    
          if ((vx_status)VX_SUCCESS == status){
            tivxMemFree(NULL, 0, (vx_enum)TIVX_MEM_INTERNAL_L2);
            tivxMemStats(&sram_stats, (vx_enum)TIVX_MEM_INTERNAL_L2);
    
            int src_width4 = (((src_width + 3) >> 2) << 2);
            if(3 == channel && 0 != instance->param.cfg.out_seperate_plane){
              mean_scale_len  = (DMA_DATA_ALIGNMENT << 1);
              pingpong_in     = (((3 * src_width4 + DMA_DATA_ALIGNMENT - 1) >> 
                                    DMA_DATA_ALIGNMENT_BITS) << DMA_DATA_ALIGNMENT_BITS);
              pingpong_out  = src_width4 * sizeof(float);
              pingpong_out  = ((pingpong_out + DMA_DATA_ALIGNMENT - 1) >> 
                                DMA_DATA_ALIGNMENT_BITS) << 
                                DMA_DATA_ALIGNMENT_BITS;
              pingpong_out  = 3 * pingpong_out;
              dma_work_buf_len  = mean_scale_len + (pingpong_in + pingpong_out) * 2;
              
            }else{
              mean_scale_len  = channel * src_width * sizeof(float);
              mean_scale_len  = (((mean_scale_len +  DMA_DATA_ALIGNMENT - 1) >> 
                                DMA_DATA_ALIGNMENT_BITS) << 
                                DMA_DATA_ALIGNMENT_BITS);
              mean_scale_len  = mean_scale_len * 2;
              pingpong_in   = (((channel * src_width4 + DMA_DATA_ALIGNMENT - 1) >> 
                                    DMA_DATA_ALIGNMENT_BITS) << DMA_DATA_ALIGNMENT_BITS);
              pingpong_out  = channel * src_width4 * sizeof(float);
              pingpong_out  = ((pingpong_out + DMA_DATA_ALIGNMENT - 1) >> 
                                DMA_DATA_ALIGNMENT_BITS) << 
                                DMA_DATA_ALIGNMENT_BITS;
              dma_work_buf_len  = mean_scale_len + (pingpong_in + pingpong_out) * 2;
            }
            instance->param.channel = channel;
    
    #ifdef CORE_C6XX
            avail_size = sram_stats.free_size;
    #endif
            if(avail_size > dma_work_buf_len){
              instance->param.sram_heap_id     = TIVX_MEM_INTERNAL_L2; /* TIVX_MEM_INTERNAL_L2 or TIVX_MEM_EXTERNAL */
              instance->param.sram_global_base = 0;
              if(appIpcGetSelfCpuId()==APP_IPC_CPU_C6x_1){
                instance->param.sram_global_base = 0x4D80000000;
              }else{
                instance->param.sram_global_base = 0x4D81000000;
              }
              if(3 == channel && 0 != instance->param.cfg.out_seperate_plane){
                y_size = calc_dma_y_size_3planes(avail_size - mean_scale_len, 
                                                 src_width4,
                                                 src_height);
              }else{
                y_size = calc_dma_y_size(avail_size - mean_scale_len, 
                                         src_height, 
                                         channel * src_width4, 
                                         channel * src_width4 * sizeof(float));
              }
              pingpong_in       = y_size * pingpong_in;
              pingpong_out      = y_size * pingpong_out;
              dma_work_buf_len  = mean_scale_len + (pingpong_in + pingpong_out) * 2;
              instance->param.work_buf_len_aligned = dma_work_buf_len;
              VX_PRINT(VX_ZONE_INIT, "y_size: %u, dma_work_buf_len: %u, out_seperate_plane: %d\n", 
                                     y_size, dma_work_buf_len, instance->param.cfg.out_seperate_plane);
              instance->param.sram = tivxMemAlloc(instance->param.work_buf_len_aligned, 
                                                  instance->param.sram_heap_id);
              if(instance->param.sram == NULL){
                VX_PRINT(VX_ZONE_ERROR, "Unable to allocate sram scratch!\n");
                status = (vx_status)VX_ERROR_NO_MEMORY;
              }else{
                instance->param.use_dma  = 1;
                instance->param.work_buf = instance->param.sram;
                /* Create the DMA channels only once the L2 scratch buffer exists,
                 * so that an allocation failure above is not overwritten.
                 * C66x_1 uses channels 8-11, C66x_2 uses channels 12-15. */
                if(appIpcGetSelfCpuId()==APP_IPC_CPU_C6x_1){
                  status  = dma_create(&(instance->param), 0, 8);
                  status |= dma_create(&(instance->param), 1, 9);
                  if(3 == channel && 0 != instance->param.cfg.out_seperate_plane){
                    status |= dma_create(&(instance->param), 2, 10);
                    status |= dma_create(&(instance->param), 3, 11);
                  }
                }else{
                  status  = dma_create(&(instance->param), 0, 12);
                  status |= dma_create(&(instance->param), 1, 13);
                  if(3 == channel && 0 != instance->param.cfg.out_seperate_plane){
                    status |= dma_create(&(instance->param), 2, 14);
                    status |= dma_create(&(instance->param), 3, 15);
                  }
                }
                if ((vx_status)VX_SUCCESS != status){
                  VX_PRINT(VX_ZONE_ERROR, "Unable to allocate dma!\n");
                  status = (vx_status)VX_ERROR_NO_MEMORY;
                }
              }
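            }else{
              /* Not enough L2 scratch: fall back to external memory without DMA. */
              instance->param.work_buf_len_aligned = mean_scale_len;
              instance->param.work_buf = (unsigned char*)tivxMemAlloc(instance->param.work_buf_len_aligned, 
                                    (vx_enum)TIVX_MEM_EXTERNAL);
              if(NULL == instance->param.work_buf){
                VX_PRINT(VX_ZONE_ERROR, "Unable to allocate work buffer!\n");
                status = (vx_status)VX_ERROR_NO_MEMORY;
              }
            }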
            if ((vx_status)VX_SUCCESS == status){
              if(NULL == instance->param.sram){
                instance->param.work_buf_aligned = (unsigned char*)(((((size_t)instance->param.work_buf) + 31) >> 5) << 5);
                instance->param.mean_row_buf     = (float*)(instance->param.work_buf_aligned);
                instance->param.scale_row_buf    = (float*)(instance->param.work_buf_aligned + (mean_scale_len >> 1));
              }else{
                instance->param.work_buf_aligned = instance->param.sram;
                instance->param.mean_row_buf     = (float*)(instance->param.work_buf_aligned);
                instance->param.scale_row_buf    = (float*)(instance->param.work_buf_aligned + (mean_scale_len >> 1));
                instance->param.pingpong_in_a    = (((unsigned char*)instance->param.scale_row_buf) + (mean_scale_len >> 1)) ;
                instance->param.pingpong_in_b    = (unsigned char*)(instance->param.pingpong_in_a + pingpong_in) ;
                instance->param.pingpong_out_a   = (float*)(instance->param.pingpong_in_b  + pingpong_in) ;
                instance->param.pingpong_out_b   = (float*)(instance->param.pingpong_out_a + (pingpong_out >> 2)) ;
                instance->param.pingpong_in_size  = pingpong_in;
                instance->param.pingpong_out_size = pingpong_out;
                instance->param.y_size            = y_size;
              }
              instance->param.pfn_log          = print_alg_log;
            }
          }
        }
        else{
          status = (vx_status)VX_ERROR_NO_MEMORY;
        }
    
        if ((vx_status)VX_SUCCESS == status){
          tivxSetTargetKernelInstanceContext(kernel, instance,
              sizeof(SEVX_NORM_IMAGE_INST));
        }else{
          if (NULL != instance)
            sevxNormalizeImageFreeInstance(instance);
        }
      }
      return (status);
    }
    
    static vx_status VX_CALLBACK sevxKernelCVNormalizeImageDelete(
        tivx_target_kernel_instance kernel, tivx_obj_desc_t *obj_desc[],
        uint16_t num_params, void *priv_arg){
      vx_status status = (vx_status)VX_SUCCESS;
    
      uint32_t instance_size;
      SEVX_NORM_IMAGE_INST *instance = NULL;
    
      if (num_params != SEVX_KERNEL_NORMALIZE_IMAGE_MAX_PARAMS)
      {
        status = (vx_status)VX_FAILURE;
      }
    
      if ((vx_status)VX_SUCCESS == status)
      {
        status = tivxGetTargetKernelInstanceContext(kernel,
            (void **)&instance, &instance_size);
    
        if (((vx_status)VX_SUCCESS == status) && (NULL != instance) &&
            (sizeof(SEVX_NORM_IMAGE_INST) == instance_size))
        {
          sevxNormalizeImageFreeInstance(instance);
        }
      }
    
      return status;
    }
    
    vx_status sevxAddTargetKernelCVNormalizeImage(const char* rt_path){
      vx_status status = (vx_status)VX_SUCCESS;
      char target_name[TIVX_TARGET_MAX_NAME];
      status = tivxKernelsTargetUtilsAssignTargetNameDsp(target_name);
      if( (vx_status)VX_SUCCESS == status)
      {
        sevx_normalize_image_target_kernel = tivxAddTargetKernelByName(
                SEVX_CV_KERNEL_NORMALIZE_IMAGE_NAME,
                target_name,
                sevxKernelCVNormalizeImageProcess,
                sevxKernelCVNormalizeImageCreate,
                sevxKernelCVNormalizeImageDelete,
                NULL,
                NULL);
      }
      return status;
    }
    
    vx_status sevxRemoveTargetKernelCVNormalizeImage(){
      if(NULL != sevx_normalize_image_target_kernel){
        tivxRemoveTargetKernel(sevx_normalize_image_target_kernel);
      }
      sevx_normalize_image_target_kernel = NULL;
      return (vx_status)VX_SUCCESS;
    }
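
    The block-height selection used in the listing above can be tried in isolation. This is only a minimal sketch: the names align_up and pick_y_size are hypothetical, and the alignment of 128 bytes is an assumption, since the real DMA_DATA_ALIGNMENT is defined in a header that is not part of this post. It rounds each row up to the alignment, reserves two ping-pong sets, and then shrinks the row count until the image height splits into an even number of blocks.

```c
#include <assert.h>

/* Assumed alignment; the real DMA_DATA_ALIGNMENT lives in another header. */
#define ALIGN_BITS 7u
#define ALIGN      (1u << ALIGN_BITS)

/* Round n up to the next multiple of ALIGN. */
static unsigned int align_up(unsigned int n)
{
    return ((n + ALIGN - 1u) >> ALIGN_BITS) << ALIGN_BITS;
}

/* Hypothetical helper: largest number of rows per block such that
 * two ping-pong sets (input + output) fit into ram_size and the
 * image height is a multiple of 2 * rows. */
static unsigned int pick_y_size(unsigned int ram_size, unsigned int height,
                                unsigned int in_row_bytes,
                                unsigned int out_row_bytes)
{
    unsigned int per_row = (align_up(in_row_bytes) + align_up(out_row_bytes)) * 2u;
    unsigned int y;

    assert(per_row > 0u);
    y = ram_size / per_row;
    if (y > height) {
        y = height;
    }
    /* Shrink until the height splits into an even number of blocks. */
    while (y > 1u && (height % (2u * y)) != 0u) {
        y--;
    }
    return y;
}
```

    For example, with 200 KiB of scratch and a 1920x1080 RGB input converted to float, pick_y_size(200u * 1024u, 1080u, 3u * 1920u, 3u * 1920u * 4u) gives 3 rows per block under these assumptions.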
    

    My UDMA setup is very simple, because I only call the existing interfaces in vision_apps.

    Let me walk through the code:

    1. The UDMA setup is in sevx_cv_kernel_normalize_image_target.c, in the functions prefixed with "dma_": dma_create, dma_preproc, dma_postproc, and so on.

    2. The ping-pong operation is in cv_normalize_image_implement.c; see the function normalize_image_uc3_f3_planes_dma.
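
    For readers without access to those files, the ping-pong idea can be sketched roughly as follows. This is not the code from cv_normalize_image_implement.c: fake_dma_copy, normalize_pingpong and BLOCK are hypothetical names, and a plain memcpy stands in for the UDMA transfer, so nothing actually overlaps here. It only shows how two input and two output scratch buffers alternate between "being transferred" and "being processed".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 4u  /* elements per block (hypothetical) */

/* Stand-in for a DMA transfer. A real kernel would submit a UDMA
 * request here and wait for its completion later. */
static void fake_dma_copy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Normalize n u8 samples to float using two input and two output
 * scratch buffers that swap roles on every block. */
static void normalize_pingpong(const unsigned char *src, float *dst, size_t n,
                               float mean, float scale)
{
    unsigned char in[2][BLOCK];
    float out[2][BLOCK];
    size_t blocks = n / BLOCK;
    size_t b, i;

    assert(n % BLOCK == 0u);   /* the sketch handles whole blocks only */
    if (blocks == 0u) {
        return;
    }

    /* Prefetch block 0 into the "ping" buffer. */
    fake_dma_copy(in[0], src, BLOCK);

    for (b = 0; b < blocks; b++) {
        size_t cur = b & 1u;
        size_t nxt = cur ^ 1u;

        /* Fetch the next block into the other buffer. With real DMA
         * this transfer would run while the CPU computes below. */
        if (b + 1u < blocks) {
            fake_dma_copy(in[nxt], src + (b + 1u) * BLOCK, BLOCK);
        }

        /* Compute on the current buffer. */
        for (i = 0; i < BLOCK; i++) {
            out[cur][i] = ((float)in[cur][i] - mean) * scale;
        }

        /* Write the finished block back to external memory. */
        fake_dma_copy(dst + b * BLOCK, out[cur], BLOCK * sizeof(float));
    }
}
```

    In a real kernel each fake_dma_copy call would be split into a trigger and a later wait, which is the part where the number of channels in use matters.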

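  • OK, we have passed this on to the engineers. Also, since the holidays are approaching, replies on the English forum may be delayed. Thank you for your patience!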

  • Hi,

    1. About "What is the definition of a UDMA group?"

    I don't know what a "group" is, but I found the term in the vision_apps code:

    ti-processor-sdk-rtos-j721e-evm-08_04_00_06/vision_apps/kernels/img_proc/c66/vx_dma_transfers.h, line 71:

    #define DMA_MAX_NUM_TRANSFERS_PER_GROUP (8)

    We don't see this macro being used anywhere, so increasing it does not look like it would have any effect.

    Looking at the source code, we cannot tell which API actually triggers the 4 or 8 DMA transfers. Could you help us understand this? How do you trigger the DMA requests back to back for 4 or 8 channels? Are the descriptors already in memory before the transfers are triggered?

    Thanks and regards,

    Cherry

  • I replied directly on the English forum a long time ago. It has been almost a month and there is still no response.

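  • Sorry, replies for this product line are usually slower. We will follow up with the engineers on the English forum; please watch for their reply there.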