NN Deployment

AmebaPro2 has an NN H/W engine to accelerate the neural network inference process. NN models obtained from different AI frameworks, such as Keras, TensorFlow, TensorFlow Lite, PyTorch, Caffe, ONNX, and Darknet, can be converted into a network binary graph file by Verisilicon's Acuity Toolkit. The NN model can then be deployed on AmebaPro2 easily. The following is the workflow of model deployment:

../../_images/image2.png

Fig. 2 NN model workflow


Using customized NN model

This section will demonstrate how to deploy a pre-trained model, taking the yolov4-tiny pre-trained model as an example; Fig. 3 is the flowchart of the whole procedure:

../../_images/image3.png

Fig. 3 yolov4-tiny deployment workflow


Setup Acuity toolkit on PC

The Acuity toolkit is required to generate the NN network binary file from a pre-trained model. The following documents and tools are provided by Verisilicon; please refer to its user guide to set up the PC environment.

Please refer to Acuity Toolkit Installation for how to install Verisilicon's Acuity Toolkit.

Table 1 Acuity tool and document

Acuity 5.21.1
  • Vivante.VIP.ACUITY.Toolkit.User.Guide-v0.80-20210326.pdf: Acuity toolkit user guide document
  • acuity_toolkit_binary_5.21.1.zip / acuity-toolkit-whl-5.21.1.zip: Acuity binary/Python version toolkit (please check the installation steps in chapter 2 of Vivante.VIP.ACUITY.Toolkit.User.Guide.pdf)
  • acuity-examples.zip: Acuity examples and scripts
  • Verisilicon_SW_VIP_NBInfo_v1.1.10_20210331.tgz: memory evaluation tool (please check the usage guidelines in its readme file)
  • VivanteIDE5.3.0_cmdtools.zip: command line tool to export the network binary file

Acuity 6.6.1
  • Verisilicon_Tool_Acuity_Toolkit_6.6.1_Binary_Whl_Src_20220505.tgz: Acuity toolkit user guide document, Acuity binary/Python version toolkit (please check the installation steps in chapter 2 of Vivante.VIP.ACUITY.Toolkit.User.Guide.pdf), and Acuity examples and scripts
  • Verisilicon_SW_NBInfo_1.2.4_20220505.tgz: memory evaluation tool (please check the usage guidelines in its readme file)
  • Verisilicon_Tool_VivanteIDE_v5.7.0: command line tool to export the network binary file

Acuity 6.18.0
  • Verisilicon_Tool_Acuity_Toolkit_6.18.0_Binary_Whl_Src_20230331.tgz: Acuity toolkit user guide document, Acuity binary/Python version toolkit (please check the installation steps in chapter 2 of Vivante.VIP.ACUITY.Toolkit.User.Guide.pdf), and Acuity examples and scripts
  • Verisilicon_SW_NBInfo_1.2.17_20230412.tgz: memory evaluation tool (please check the usage guidelines in its readme file)
  • Verisilicon_Tool_VivanteIDE_v5.8.1: command line tool to export the network binary file

Note

The Viplite driver is the NN driver on AmebaPro2 that works with the NN engine. A model converted by a newer Acuity Toolkit must be used with a newer Viplite driver, while a model converted by an older Acuity Toolkit can still be used with a newer Viplite driver. Therefore, users do not need to re-convert their models with a newer tool when they upgrade the Viplite driver (the NN library in the SDK: libnn.a). The compatibility matrix is:

AcuityToolkit version    5.21.1    6.6.1    6.18.0
Viplite driver 1.3.4     V         X        X
Viplite driver 1.8.0     V         V        X
Viplite driver 1.12.0    V         V        V/X
Viplite driver 2.0.0     V         V        V


Steps for customized model conversion

Users can refer to the following Acuity toolkit instructions to generate their own model binary. The necessary scripts are in "acuity-examples/Script". Taking yolov4 as an example, users can download yolov4-tiny.cfg and yolov4-tiny.weights from https://github.com/AlexeyAB/darknet#pre-trained-models. If the model is converted correctly, the generated yolov4_tiny.nb should be the same as the one provided in the SDK.

  1. Import the model:

$ ./pegasus_import.sh yolov4_tiny

  2. Modify the "mean" and "scale" in yolov4_tiny_inputmeta.yml, e.g. scale: 0.00392156 (1/255).

  3. Quantize the model:

$ ./pegasus_quantize.sh yolov4_tiny uint8

  4. Add the following options to the command in pegasus_export_ovx.sh:

For Acuity 5.21.1:

--optimize 'VIP8000NANONI_PID0XAD' \
--pack-nbg-viplite \
--viv-sdk 'home/Acuity/VivanteIDE5.3.0_cmdtools/cmdtools' \

For Acuity 6.6.1 or 6.18.0:

--optimize 'VIP8000NANONI_PID0XAD' \
--pack-nbg-unify \
--viv-sdk 'home/Acuity/VivanteIDE5.7.0_cmdtools/cmdtools' \

  5. Export the NBG file:

$ ./pegasus_export_ovx.sh yolov4_tiny uint8

Then, a network_binary.nb (yolov4_tiny.nb) will be generated.

Note

After the model conversion is completed, the model requires the corresponding pre-processing and post-processing to be functional. The Acuity tool will not generate pre-processing and post-processing files automatically; users can refer to the pre- and post-processing files of the existing NN models. In addition, users can check the inference output results with the Acuity script "./pegasus_inference.sh".


Supported model quantization type on NN accelerator

In order to run the NN model with the full capability of the HW accelerator, users have to quantize the model with specific quantization types. For the NN hardware on Pro2, the combination of quantizer and qtype in Acuity's quantization script should be "asymmetric_affine uint8" or "dynamic_fixed_point int8/int16".

In addition, the NN hardware does not have good support for per-channel quantization. Per-tensor quantization uses one quantization parameter pair (scale, zero_point) for the whole tensor, while per-channel quantization uses a different parameter pair for each channel of the weights. The NPU on Pro2 does not support a separate (scale, zero_point) per channel, so please use the per-tensor method instead.
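
For reference, with asymmetric affine quantization each real value is reconstructed as real = scale * (quantized_value - zero_point), so the per-tensor method stores exactly one (scale, zero_point) pair for the whole tensor. This matches the deployment log shown later in this document, where the input tensor reports a single scale=0.003660 and zero_point=0.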

Table 2 NPU HW supported quantization type

quantizer            qtype       per-channel or per-tensor
asymmetric_affine    uint8       only per-tensor
dynamic_fixed_point  int8/int16  only per-tensor

The NPU has three main units: NN, TP and PPU (SHADER). NN and TP are HW accelerators; the PPU (SHADER) is a general programmable unit. NN/TP only support the "dynamic fixed point int8/int16" and "asymmetric affine uint8" quantization types. Other quantization types will run on the PPU, which is much slower than running on NN/TP. Users can use the NBinfo tool to check whether the operations of an exported model will run on NN, TP or PPU (SHADER).

Note

The vendor suggests importing the original float32 model into the Acuity Toolkit and doing the quantization with Acuity's quantization script. Users can also quantize the model with their training framework (e.g. TensorFlow), but they should ensure that the supported quantization types are used.

SDK configuration for customized NN model

The model binary file (.nb) was obtained in the previous section. This section introduces how to add the model binary file to the SDK and implement the necessary pre-processing and post-processing.


Add customized model network binary to SDK

After yolov4_tiny.nb is generated, add the file to the SDK folder "project/realtek_amebapro2_v0_example/src/test_model/model_nb". All model network binary files are placed here; the structure should be:

project/realtek_amebapro2_v0_example/src/test_model/
|-- model_nb/
|   |-- yolov3_tiny.nb  --> yolov3-tiny network binary graph file
|   |-- yolov4_tiny.nb  --> yolov4-tiny network binary graph file
|   |-- yolov7_tiny.nb  --> yolov7-tiny network binary graph file
|-- model_yolo.c  --> implementation of pre-process & post-process of yolov3,yolov4,yolov7
|-- model_yolo.h

As another example, if you have a converted customized model named "AmebaNet.nb", add it to "test_model/model_nb/" and create two files, model_AmebaNet.c and model_AmebaNet.h. Implement the pre-process and post-process for AmebaNet in model_AmebaNet.c. The folder should then look like:

project/realtek_amebapro2_v0_example/src/test_model/
|-- model_nb/
|   |-- yolov3_tiny.nb
|   |-- yolov4_tiny.nb
|   |-- yolov7_tiny.nb
|   |-- AmebaNet.nb
|-- model_yolo.c
|-- model_yolo.h
|-- model_AmebaNet.c  --> implementation of pre-process & post-process of AmebaNet
|-- model_AmebaNet.h

Note

Remember to add your model_AmebaNet.c to "project/realtek_amebapro2_v0_example/GCC-RELEASE/application/application.cmake". Additionally, check that the configured flash size and DDR size are sufficient for the NN model; please refer to the NN memory and flash usage evaluation section below.

Next, add the model to the model list.

Go to “project/realtek_amebapro2_v0_example/GCC-RELEASE/mp/amebapro2_fwfs_nn_models.json” and add AmebaNet.nb to this list:

{
    "msg_level":3,

    "PROFILE":["FWFS"],
    "FWFS":{
        "files":[
            "MODEL0",
            "MODEL1"
        ]
    },
    "MODEL0":{
        "name" : "yolov4_tiny.nb",
        "source":"binary",
        "file":"yolov4_tiny.nb"
    },
    "MODEL1":{
        "name" : "AmebaNet.nb",
        "source":"binary",
        "file":" AmebaNet.nb"
    }
}

Note

If you only want to use AmebaNet.nb, just keep "MODEL1" in "FWFS" - "files". Otherwise, the final image will become very large since it contains unused model binary files.


Create a model object that can be used by the VIPNN module

The VIPNN module uses the model object to deploy the model, run the model pre-process, trigger model inference, and run the model post-process.

Therefore, we should create an "nnmodel_t AmebaNet" object in model_AmebaNet.c. The following are the functions used by the VIPNN module, so register these function pointers in the AmebaNet object after implementing them:

nnmodel_t AmebaNet = {
    .nb         = AmebaNet_get_network_filename,
    .preprocess     = AmebaNet_preprocess,
    .postprocess    = AmebaNet_postprocess,
    .model_src  = MODEL_SRC_FILE,
    .name = "AmebaNet"
};
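
For reference, the matching declarations in model_AmebaNet.h might look like the sketch below; the function signatures follow the implementations shown in this section, while the exact header defining nnmodel_t and the tensor types (assumed here to be module_vipnn.h) should be checked in the SDK:

/* model_AmebaNet.h: a sketch; nnmodel_t, nn_data_param_t and nn_tensor_param_t
   come from the VIPNN module headers (header name assumed) */
#include "module_vipnn.h"

void *AmebaNet_get_network_filename(void);
int AmebaNet_preprocess(void *data_in, nn_data_param_t *data_param, void *tensor_in, nn_tensor_param_t *tensor_param);
int AmebaNet_postprocess(void *tensor_out, nn_tensor_param_t *param, void *res);

extern nnmodel_t AmebaNet;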

Set the NN model file name used by NN driver

The model file name needs to be set so that the NN driver can open and load the network binary file via the file system during runtime deployment.

void *AmebaNet_get_network_filename(void)
{
    return (void *) "NN_MDL/AmebaNet.nb";
}

Note

The NN driver uses the firmware file system (component/file_system/fwfs) to open and read the model from flash by default. For further information, refer to the "NN file operation layer" used by the NN driver: component/file_system/nn/nn_file_op.c.


Implement customized pre-process and post-process

Users can run their customized pre-process on the image before passing it to NN model inference; in addition, they can run their customized post-process to decode the output tensors from the inference result.

Implement pre-process in model_AmebaNet.c:

int AmebaNet_preprocess(void *data_in, nn_data_param_t *data_param, void *tensor_in, nn_tensor_param_t *tensor_param)
{
    void **tensor = (void **)tensor_in;
    uint8_t *src = (uint8_t *)data_in;    //input frame from the video source
    uint8_t *dst = (uint8_t *)tensor[0];  //input tensor buffer of the NN model

    //do the pre-processing here (e.g. color conversion, resize, quantization);
    //use tensor[1], tensor[2], ... if the model has multiple input tensors;
    //refer to model_yolo.c for a complete implementation
    uint32_t data_length = 0;  //set to the number of bytes written into tensor[0]
    //...

    //clean the cache since the data will be accessed by the NN engine directly
    dcache_clean_by_addr((uint32_t *)tensor[0], data_length);

    return 0;
}
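
When the RGB video channel is configured to exactly the model input size (as in the RTSP example later) and the model's uint8 input consumes raw 0-255 pixel values, the pre-process can reduce to a copy plus a cache clean. A minimal sketch under these assumptions:

//inside AmebaNet_preprocess(), when data_in already matches the input tensor layout
uint32_t data_length = 416 * 416 * 3;  //assumed 416x416 RGB888 input tensor
memcpy(tensor[0], data_in, data_length);
dcache_clean_by_addr((uint32_t *)tensor[0], data_length);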

Implement post-process in model_AmebaNet.c:

int AmebaNet_postprocess(void *tensor_out, nn_tensor_param_t *param, void *res)
{
    void **tensor = (void **)tensor_out;
    int output_count = param->count;

    //decode the tensor data into res_box[]/box_idx here;
    //refer to model_yolo.c for a complete implementation
    for (int n = 0; n < output_count; n++) {
        uint8_t *out = (uint8_t *)tensor[n];  //n-th output tensor of the model
        //...
    }

    //fill the result (res_box[] and box_idx come from the decoding step above)
    int od_num = 0;
    objdetect_res_t *od_res = (objdetect_res_t *)res;
    for (int i = 0; i < box_idx; i++) {
        box_t *obj = &res_box[i];

        if (obj->invalid == 0) {
            od_res[od_num].result[0] = obj->class_idx;
            od_res[od_num].result[1] = obj->prob;
            od_res[od_num].result[2] = obj->x;  // top_x
            od_res[od_num].result[3] = obj->y;  // top_y
            od_res[od_num].result[4] = obj->x + obj->w; // bottom_x
            od_res[od_num].result[5] = obj->y + obj->h; // bottom_y
            od_num++;
        }
    }

    //return the number of results
    return od_num;
}

Section 1.4 will demonstrate how to build the NN object detection example (mmf2_video_example_vipnn_rtsp_init.c) with yolov4-tiny. Users who would like to use their customized model can replace the default model object "yolov4_tiny" with "AmebaNet", as shown in the sketch below.
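
For reference, the swap in mmf2_video_example_vipnn_rtsp_init.c might look like the following sketch, assuming the CMD_VIPNN_SET_MODEL control and module handles used by the SDK's VIPNN examples:

// in mmf2_video_example_vipnn_rtsp_init.c
mm_context_t *vipnn_ctx = mm_module_open(&vipnn_module);
if (vipnn_ctx) {
    //mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_MODEL, (int)&yolov4_tiny);  //default model object
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_MODEL, (int)&AmebaNet);       //customized model object
}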


NN memory and flash usage evaluation

This section introduces how to evaluate the NN model size and DDR usage. The following table shows the memory information of the existing models provided in the SDK:

Table 3 Model memory and size

Category              Model           Input size  Quantized  DDR memory                 File size
Object detection      Yolov3-tiny     416x416     uint8      6.9 MB (6,946,128 bytes)   5.6 MB (5,568,384 bytes)
                      Yolov4-tiny     416x416     uint8      7.7 MB (7,712,412 bytes)   4.1 MB (4,131,712 bytes)
                      Yolov4-tiny     576x320     uint8      7.48 MB (7,840,836 bytes)  3.85 MB (4,043,136 bytes)
                      Yolov7-tiny     416x416     uint8      8.2 MB (8,597,072 bytes)   4.44 MB (4,664,512 bytes)
                      NanoDet-Plus-m  416x416     uint8      4.33 MB (4,542,016 bytes)  1.86 MB (1,959,040 bytes)
                      NanoDet-Plus-m  576x320     uint8      4.53 MB (4,746,556 bytes)  1.83 MB (1,924,096 bytes)
Face detection        SCRFD           640x640     uint8      4.1 MB (4,291,200 bytes)   0.68 MB (715,584 bytes)
                      SCRFD           576x320     uint8      2.6 MB (2,753,864 bytes)   0.56 MB (583,232 bytes)
Face recognition      MobileFaceNet   112x112     int8       1.72 MB (1,799,716 bytes)  0.86 MB (904,576 bytes)
                      MobileFaceNet   112x112     int16      5.1 MB (5,343,948 bytes)   3.42 MB (3,590,656 bytes)
Sound classification  YAMNet          15600x1     fp16       9.2 MB (9,172,348 bytes)   8.7 MB (8,669,888 bytes)
                      YAMNet_s        96x64       hybrid     0.73 MB (729,608 bytes)    0.67 MB (678,336 bytes)

Evaluate memory usage of model

Please use the Verisilicon_SW_VIP_NBInfo tool to evaluate the DDR memory usage of the model on the PC. Taking yolov4-tiny as an example, it requires at least 8MB of DDR memory:

********************************************************************************
Memory Info
********************************************************************************
Total Read Only Memory (bytes):                                   3737536
Total Command buffer (bytes):                                     167552
Total Load States (bytes):                                        34176
Total NN and TP instruction (bytes):                              132864
Total PPU instruction (bytes):                                    512
********************************************************************************
Total Operation Memory (bytes):                                   3939264
Total Input Memory (bytes):                                       519168
Total Output Memory (bytes):                                      215552
Memory Pool (bytes):                                              2769920
Video memory heap node reserved (bytes):                          20480
********************************************************************************
Total Video Memory (bytes):                                       7464448
Total System Memory (bytes):                                      247964
********************************************************************************

Therefore, we have to make sure the NN DDR region in the linker script is large enough for this model; for yolov4-tiny, Total Video Memory (7,464,448 bytes) plus Total System Memory (247,964 bytes) is about 7.7MB, which fits in the 16MB NN region below.

Check and modify it in "project/realtek_amebapro2_v0_example/GCC-RELEASE/application/rtl8735b_ram.ld":

/* DDR memory */

  VOE    (rwx)    : ORIGIN = 0x70000000, LENGTH = 0x70100000 - 0x70000000  /*  1MB */
  DDR    (rwx)    : ORIGIN = 0x70100000, LENGTH = 0x73000000 - 0x70100000  /* 49MB */
  NN     (rwx)    : ORIGIN = 0x73000000, LENGTH = 0x74000000 - 0x73000000  /* 16MB */

Note

Please also modify project/realtek_amebapro2_v0_example/GCC-RELEASE/bootloader/rtl8735b_boot_mp.ld to keep the NN DDR region consistent with rtl8735b_ram.ld. In addition, if building a TrustZone project, rtl8735b_ram_ns.ld should be modified instead of rtl8735b_ram.ld.


Evaluate model size on flash

Please make sure the NN region in partition table is larger than your model size, so that the model can be downloaded to flash correctly.

One model

Take yolov4-tiny for example; the model size is 4MB.

../../_images/image5.png

Fig. 4 model network binary

The nn region length in "project/realtek_amebapro2_v0_example/GCC-RELEASE/mp/amebapro2_partitiontable.json" should not be less than 4MB:

"nn":{
            "start_addr" : "0x770000",
            "length" : "0x700000",   --> 7MB > yolov4-tiny(4MB)
            "type": "PT_NN_MDL",
            "valid": true
      },

Multiple models

If multiple models are used, please add up the size of each model. For example, if a user wants to deploy 4 models (yolov4-tiny, yamnet-s, mobilefacenet and centerface), the content of "amebapro2_fwfs_nn_models.json" becomes:

{
    "msg_level":3,

    "PROFILE":["FWFS"],
    "FWFS":{
        "files":[
            "MODEL0",
            "MODEL1",
            "MODEL2",
            "MODEL3"
        ]
    },
    "MODEL0":{
        "name" : "yolov4_tiny.nb",
        "source":"binary",
        "file":"yolov4_tiny.nb"

    },
    "MODEL1":{
        "name" : "yamnet_s.nb",
        "source":"binary",
        "file":"yamnet_s.nb"

    },
    "MODEL2":{
        "name" : "mobilefacenet_int16.nb",
        "source":"binary",
        "file":"mobilefacenet_int16.nb"

    },
    "MODEL3":{
        "name" : "centerface_uint8.nb",
        "source":"binary",
        "file":"centerface_uint8.nb"

    }
}

Check each model size and calculate the total: 1,535KB + 3,507KB + 663KB + 4,053KB = 9,758KB. This requires at least 10MB of flash for NN.

../../_images/image6.png

Fig. 5 model network binary size

Therefore, the nn region length in "project/realtek_amebapro2_v0_example/GCC-RELEASE/mp/amebapro2_partitiontable.json" should not be less than 10MB:

"nn":{
            "start_addr" : "0x770000",
            "length" : "0xA00000",   --> 10MB > total size(9,740KB)
            "type": "PT_NN_MDL",
            "valid": true
      },

Using the NN MMF example with VIPNN module

The yolo NN example is part of the MMF video joint examples. Please uncomment the example you want to execute in

(project/realtek_amebapro2_v0_example/src/mmfv2_video_example/video_example_media_framework.c)

mmf2_video_example_vipnn_rtsp_init();  // yolov4-tiny object detection

The content of this example is located in “mmf2_video_example_vipnn_rtsp_init.c”.

Table 4 NN examples

Example: mmf2_video_example_vipnn_rtsp_init
Description:
  CH1 Video -> H264/HEVC -> RTSP
  CH4 Video -> RGB -> NN
Result: RTSP video stream over the network; NN does object detection and draws the bounding boxes on the RTSP channel.


Set RGB video resolution as model input size

Setting the RGB resolution according to the NN model input tensor shape avoids software image resizing and saves pre-processing time.

Open “mmf2_video_example_vipnn_rtsp_init.c” and set NN_WIDTH and NN_HEIGHT to 416 (same as yolov4-tiny input size).

#define YOLO_MODEL              1
#define USE_NN_MODEL            YOLO_MODEL

#if (USE_NN_MODEL==YOLO_MODEL)
#define NN_WIDTH    416
#define NN_HEIGHT   416
static float nn_confidence_thresh = 0.4;
static float nn_nms_thresh = 0.3;
#else
#error Please set model correctly. (YOLO_MODEL)
#endif
…
static video_params_t video_v4_params = {
    .stream_id       = NN_CHANNEL,
    .type            = NN_TYPE,
    .resolution      = NN_RESOLUTION,
    .width           = NN_WIDTH,
    .height          = NN_HEIGHT,
    .bps             = NN_BPS,
    .fps             = NN_FPS,
    .gop             = NN_GOP,
    .direct_output   = 0,
    .use_static_addr = 1
};

Choose NN model

Please check that the desired models are selected in amebapro2_fwfs_nn_models.json. For example, to use yolov4_tiny, go to "project/realtek_amebapro2_v0_example/GCC-RELEASE/mp/amebapro2_fwfs_nn_models.json" and set the yolov4_tiny model, "MODEL0", to be used:

{
    "msg_level":3,

    "PROFILE":["FWFS"],
    "FWFS":{
         "files":[
            "MODEL0"
]
    },
    "MODEL0":{
        "name" : "yolov4_tiny.nb",
        "source":"binary",
        "file":"yolov4_tiny.nb"

    },
    "MODEL1":{
        "name" : "yamnet_fp16.nb",
        "source":"binary",
        "file":"yamnet_fp16.nb"

    },
    "MODEL2":{
        "name" : "yamnet_s.nb",
        "source":"binary",
        "file":"yamnet_s.nb"

    }
}

Note

After choosing the models, users have to check the DDR memory and flash usage of the chosen models.

Build NN example

Since it is part of the video MMF examples, users should use the following command to generate the makefile.

Generate the makefile for the NN project:

cmake .. -G"Unix Makefiles" -DCMAKE_TOOLCHAIN_FILE=../toolchain.cmake -DVIDEO_EXAMPLE=ON

Then, use the following command to generate an image with NN model inside:

cmake --build . --target flash_nn

After running the command above, you will get flash_ntz.nn.bin (including the model) in "project/realtek_amebapro2_v0_example/GCC-RELEASE/build".

../../_images/image7.png

Fig. 6 image with NN model

Then, use the image tool to download it to AmebaPro2:

NOR flash

$ ./uartfwburn.linux -p /dev/ttyUSB? -f ./flash_ntz.nn.bin -b 3000000

NAND flash

$ ./uartfwburn.linux -p /dev/ttyUSB? -f ./flash_ntz.nn.bin -b 3000000 -n pro2

Update NN model on flash

If users just want to update the NN model instead of the whole firmware, the following command can be used to partially update the NN section on flash:

NAND flash

$ ./uartfwburn.linux -p /dev/ttyUSB? -f ./flash_ntz.nn.bin -b 3000000 -n pro2 -t 0x81cf

Validate NN example

Refer to the following section to validate the NN examples.

Object detection example – Yolov4-tiny

While running the example, you may need to configure the WiFi connection using these commands in the UART terminal:

ATW0=<WiFi_SSID> : Set the WiFi AP to be connected
ATW1=<WiFi_Password> : Set the WiFi AP password
ATWC : Initiate the connection

If everything works fine, you should see the following logs:

[VOE]RGB3 640x480 1/5
[VOE]Start Mem Used ISP/ENC:     0 KB/    0 KB Free=  701
hal_rtl_sys_get_clk 2
GCChipRev data = 8020
GCChipDate data = 20190925
queue 20121bd8 queue mutex 71691380
npu gck vip_drv_init, video memory heap base: 0x71B00000, size: 0x01300000
yuv in 0x714cee00
[VOE][process_rgb_yonly_irq][371]Errrgb ddr frame count overflow : int_status 0x00000008 buf_status 0x00000010 time 15573511 cnt 0
input 0 dim 416 416 3 1, data format=2, quant_format=2, scale=0.003660, zero_point=0
ouput 0 dim 13 13 255 1, data format=2, scale=0.092055, zero_point=216
ouput 1 dim 26 26 255 1, data format=2, scale=0.093103, zero_point=216
---------------------------------
input count 1, output count 2
input param 0
        data_format  2
        memory_type  0
        num_of_dims  4
        quant_format 2
        quant_data  , scale=0.003660, zero_point=0
        sizes        1a0 1a0 3 1 0 0
output param 0
        data_format  2
        memory_type  0
        num_of_dims  4
        quant_format 2
        quant_data  , scale=0.092055, zero_point=216
        sizes        d d ff 1 0 0
output param 1
        data_format  2
        memory_type  0
        num_of_dims  4
        quant_format 2
        quant_data  , scale=0.093103, zero_point=216
        sizes        1a 1a ff 1 0 0
---------------------------------
in 0, size 416 416
VIPNN opened
siso_array_vipnn started
nn tick[0] = 47
object num = 0
nn tick[0] = 46
object num = 0

Then, open VLC and create a network stream with URL: rtsp://192.168.x.xx:554

If everything works fine, you should see the object detection result in the VLC player.

../../_images/image8.png

Fig. 7 VLC validation


How to add a pre-process node to a customized model (optional)

Sometimes users may need to pre-process the data before inference. The Acuity Toolkit provides an auto-generated pre-process node that can be added to the beginning of a customized network, so the data pre-processing can run on the NN engine and offload the CPU. The pre-process node can handle color space conversion, scaling and cropping. Users can configure and enable the auto-generated pre-process node by setting "inputmeta.yml" as follows:

 input_meta:
   databases:
   - path: dataset.txt
     type: TEXT
     ports:
     - lid: input.1_137
       category: image
       dtype: float32
       sparse: false
       tensor_name:
       layout: nchw
       shape:
       - 1
       - 3
       - 320
       - 576
       fitting: scale
       preprocess:
         reverse_channel: false
         mean:
         - 127.5
         - 127.5
         - 127.5
         scale: 0.0078125
         preproc_node_params:
           add_preproc_node: true
           preproc_type: IMAGE_NV12
           preproc_image_size:
           - 576
           - 320

Note

To use this feature, please use Acuity 6.18.0 and VIPLite driver 1.12.0; older versions do not support it well.


How to load model from SD card instead of flash (optional)

Downloading a model to flash may take a long time. Therefore, during development, developers can store the model on an SD card and let the Viplite driver load the network binary file from the SD card instead of flash. Here are the steps:

  1. Prepare an SD card, create a folder named "NN_MDL", and copy your model into this folder.

  2. Go to "component/file_system/nn/nn_file_op.c" and define MODEL_SRC as MODEL_FROM_SD.

#define MODEL_FROM_FLASH 0x01
#define MODEL_FROM_SD 0x02
#define MODEL_SRC MODEL_FROM_SD

  3. Build the NN example with the following command:

cmake --build . --target flash

How to modify the customized model name after conversion (optional)

After running the export script in Acuity, users get their customized model binary. The binary graph format is shown in the following table; the network name is located at a 12-byte offset from the head, with a length of 64 bytes.

Table 5 binary graph format

Section   Field         Data Type  Count  Size in Bytes  Meaning
Header    Magic         CHAR       4      4              A magic number for a valid binary graph file; must be "VPMN".
Header                  UINT32     1      4
Header                  UINT32     1      4
Header    Network_name  CHAR       64     64             Indicates the name of a network.
Header                  UINT32     1      4

Therefore, users can edit these 64 bytes in the network binary file with any hex editor.

../../_images/image9.png

Fig. 8 use hex editor to modify model name
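
Alternatively, the name field can be patched programmatically; a minimal C sketch that rewrites the 64-byte Network_name field at offset 12, per Table 5 (the function name is illustrative):

#include <stdio.h>
#include <string.h>

//overwrite the 64-byte network name field located at offset 12 of a .nb file
int patch_nb_network_name(const char *nb_path, const char *new_name)
{
    char name[64] = {0};
    strncpy(name, new_name, sizeof(name) - 1);   //NUL-padded, at most 63 characters

    FILE *fp = fopen(nb_path, "r+b");
    if (fp == NULL) {
        return -1;
    }
    int ok = (fseek(fp, 12, SEEK_SET) == 0) &&
             (fwrite(name, 1, sizeof(name), fp) == sizeof(name));
    fclose(fp);
    return ok ? 0 : -1;
}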

A network name query API is used in module_vipnn.c to get the model name during deployment:

vip_query_network(ctx->network, VIP_NETWORK_PROP_NETWORK_NAME, ctx->network_name);
dprintf(LOG_INF, "network name:%s\n\r", ctx->network_name);

After the modification, you should see the following log (you may need to change the debug log level to LOG_INF to see this information in the console):

../../_images/image10.png

Fig. 9 model name


Model Security

Some customers have in-house self-trained models, so model security is important to protect their intellectual property. The SDK now supports model authentication and encryption, so customers can deploy and run their models on the device securely.

  • Support model graph binary authentication

  • Support model graph binary encryption

The authentication and decryption are processed by the cryptographic hardware acceleration engine. The key used for decryption is stored in the on-chip eFuse OTP.

Table 6 model security algorithm and its key management

Model Authentication
  • Algorithm support: hash: SHA-256; signature: EdDSA_ED25519
  • Key management: use the private key to sign the model on a PC or server, and the public key to verify the model signature. Please use the FW signing key to sign the model; the NN module will use the public key in the FW manifest to verify the signature at runtime.
    Note: users must enable the trusted boot feature so that the public key can be verified by the chain of trust.

Model Encryption
  • Algorithm support: AES_256_CBC
  • Key management: use an AES-256 key to encrypt the model on a PC or server. Please use the "user eFuse OTP KEY 0" to encrypt the model; the NN module will use this key to decrypt the model at runtime.
    Note: if there is no "user eFuse OTP KEY 0" on your device, inject the key with the eFuse API efuse_crypto_key_write(key, 0, 1). This key is one-time-programmable, so please discuss with your team before writing it.
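
For reference, a minimal sketch of injecting the key with the API named above; the argument meanings (key buffer, key index 0, third parameter) are assumptions inferred from the call shown in the table, so double-check the SDK's eFuse API documentation before burning anything:

#include <stdint.h>
//efuse_crypto_key_write() comes from the SDK's eFuse API (header name assumed)
#include "efuse_api.h"

static const uint8_t model_aes_key[32] = {
    /* your 32-byte AES-256 model encryption key */
};

void inject_model_key_once(void)
{
    //write the key into user eFuse OTP KEY 0; this is one-time-programmable
    efuse_crypto_key_write((uint8_t *)model_aes_key, 0, 1);
}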


Model Authentication – Hash and Signature check

Model authentication includes an integrity check and a trust check. After the model is signed by the signing tool, a SHA-256 hash and a signature are appended at the end of the model, as shown in Fig. 10 (b). The signature is used to verify the trust of the hash, and the SHA-256 hash is used to verify the integrity of the model. Therefore, we can guarantee that the model has not been tampered with and comes from a trusted source.

../../_images/image11.png

Fig. 10 signed model format: (a) encrypted only, (b) signed only, (c) signed + encrypted


Model Encryption – Cipher Text Decryption

Users can also encrypt the model graph binary (.nb file) to prevent it from being parsed; that is, the model should not be stored as plain text on flash. Usually, users only need to encrypt the "fixed header" part of the model, which is the first 512 bytes of the network graph binary.

Table 7 fixed header in binary graph

Section            Size in Bytes
Header and Tables  512 (fixed)
Data Sections      dynamic

Before model deployment, however, the model header must be decrypted so that the NN driver can create the network correctly. Currently, the SDK supports decrypting the encrypted model header in AES-256-CBC mode with the user OTP eFuse key.

After the model is encrypted by the signing tool, a randomly generated IV is added to the encryption info, as shown in Fig. 10 (a). Users can also both sign and encrypt the model; then the IV, hash and signature are all appended, as shown in Fig. 10 (c).
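
Conceptually, the decryption recovers the 512-byte fixed header before the driver parses it. On the device this is done by the HW crypto engine with the OTP key; an illustrative PC-side sketch with mbedTLS (not the SDK's actual implementation):

#include <mbedtls/aes.h>
#include <stdint.h>
#include <string.h>

//decrypt the 512-byte fixed header of an encrypted .nb image (illustrative only)
int decrypt_nb_header(const uint8_t key[32], const uint8_t iv_in[16],
                      const uint8_t enc_header[512], uint8_t dec_header[512])
{
    mbedtls_aes_context aes;
    uint8_t iv[16];
    memcpy(iv, iv_in, sizeof(iv));   //mbedtls_aes_crypt_cbc() updates the IV in place

    mbedtls_aes_init(&aes);
    mbedtls_aes_setkey_dec(&aes, key, 256);
    int ret = mbedtls_aes_crypt_cbc(&aes, MBEDTLS_AES_DECRYPT, 512, iv, enc_header, dec_header);
    mbedtls_aes_free(&aes);
    return ret;   //0 on success
}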

Note

Hardware crypto engine on the device can speed up the decryption process.


SDK Configuration

The model signature validation and decryption features are both disabled in the SDK by default. Users should enable them in the platform configuration file "project/realtek_amebapro2_v0_example/inc/platform_opts.h":

/* For NN configuration */
#define CONFIG_NN_AES_ENCRYPTION 1
#define CONFIG_NN_HASH_SIGNATURE_CHECK 1

Note

These two features can be enabled independently.


Secure Deployment Flow

After enabling the NN decryption or hash/signature check features in the SDK, Pro2 will deploy the model securely according to the flow shown in Fig. 11.

../../_images/image12.png

Fig. 11 secure NN model deployment


Signing and Encryption PC Tool

Users can find the tool at "project/realtek_amebapro2_v0_example/src/test_model/model_nb/model_signature/model_sign_ed25519.py". It is implemented as a Python script.

Before using the tool, users may need to install the following packages (PyNaCl, PyCryptodome):

$ pip install pynacl
$ pip install pycryptodome

Each key is a hex string file. For example, the content of a 32-byte signing key (model-sign-key) looks like:

104008de9c2fed8fbb20139ea3eafb6b60e8fb8a603b488c90586e2750b7f3ae

The following are the commands to sign or encrypt the model. The tool also provides the corresponding verification commands.

Sign only

Sign the model with the sign key (ED25519 secret key):

$ python3 model_sign_ed25519.py --sign-key "model-sign-key" --model "../yolov4_tiny.nb"

Verify the model with the verify key (ED25519 public key):

$ python3 model_sign_ed25519.py --verify-key "model-verify-key" --signed-model "../yolov4_tiny.nb.sig"

Note

After signing, users get the signed model, yolov4_tiny.nb.sig. Download this model to the flash partition or file system.

Encrypt only

Encrypt the model with the AES key (AES-256-CBC symmetric key); the IV will be generated randomly by the tool:

$ python3 model_sign_ed25519.py --model "../yolov4_tiny.nb" --enc-key "model-enc-key"

Decrypt the model with the same AES key:

$ python3 model_sign_ed25519.py --signed-model "../yolov4_tiny.nb.enc" --enc-key "model-enc-key"

Note

After encrypting, users get the encrypted model, yolov4_tiny.nb.enc. Download this model to the flash partition or file system.

Sign and Encrypt

Sign and encrypt the model with the sign key and encryption key:

$ python3 model_sign_ed25519.py --sign-key "model-sign-key" --model "../yolov4_tiny.nb" --enc-key "model-enc-key"

Decrypt the model and verify its signature with the verify key and encryption key:

$ python3 model_sign_ed25519.py --verify-key "model-verify-key" --signed-model "../yolov4_tiny.nb.enc.sig" --enc-key "model-enc-key"

Note

After signing and encrypting, users get the encrypted model with signature, yolov4_tiny.nb.enc.sig. Download this model to the flash partition or file system.


Performance Test

Table 8 shows the performance test results on Pro2 using the yolov4-tiny 416x416 model:

Table 8 performance test result

Security feature                   Time (ms)  Remark
Authentication: signature check    3          Checks the signature of the 32-byte model hash; time is fixed.
Authentication: hash check         38         Depends on the model size (yolov4-tiny: 4MB).
Decryption: cipher text decrypt    3          Always decrypts the 512-byte fixed header; time is fixed.


Post-process PC Development Tool

Users can develop their post-processing on a PC and check that the decoding result is correct. After running the inference script in the Acuity Toolkit, users can get the output tensors of the model. The tensor decoding process can then be developed on the PC to obtain comprehensible results such as object class, probability and bounding box.

The post-process API interface used by the development tool and by Pro2 is the same, so it is easier for users to deploy their model on the device.

Taking Yolov4 as an example, the following are the steps to use the tool:

  1. Develop the post-process in model_yolo_sim.c.

  2. Set up the tensor parameters from NBinfo. After running the export script in the Acuity Toolkit, users get a model binary file. The required model information for the PC tool is configured automatically from this model binary file; users need to set the model binary path in main.c:

static void yolo_pc_configure_tensor_param(nn_tensor_param_t *input_param, nn_tensor_param_t *output_param)
{
    /* Configure the model parameter from nb file */
    char *nbg_filename = "../../test_model/model_nb/yolov4_tiny.nb";
    config_param_from_nb_file(nbg_filename, input_param, output_param);
}

int yolo_simulation(void)
{
    // configure tensor param
    nn_tensor_param_t input_param, output_param;
    yolo_pc_configure_tensor_param(&input_param, &output_param);
    // …
}

  3. Get the output tensors from Acuity inference. After running the inference script in the Acuity Toolkit, the output tensors of the model can be obtained. Users also have to set the paths of these output tensors:

int yolo_simulation(void)
{
    // …
    // prepare Acuity pre-generated output tensor from file
    char *acuity_tensor_name[16];
    acuity_tensor_name[0] = "../data/yolo_data/iter_0_output_30_65_out0_1_255_13_13.tensor";
    acuity_tensor_name[1] = "../data/yolo_data/iter_0_output_37_76_out0_1_255_26_26.tensor";
    void *pp_tensor_out[16];
    memset(pp_tensor_out, 0, sizeof(pp_tensor_out));
    acuity_output_tensor_conversion(acuity_tensor_name, pp_tensor_out, &output_param);
    // …
}

  4. Build the project with the following commands:

mkdir build && cd build
cmake .. -G"Unix Makefiles"
make -j4

  5. Execute the program to run your post-process:

./nn_postprocess

  6. Check the result. An image with bounding boxes will be saved to data/yolo_data/prediction.jpg.

../../_images/image13.jpg

Fig. 12 detection result


Appendix A. Acuity Supported Operation Layer

ONNX to ACUITY Operation Mapping

ONNX Operation                                        ACUITY Operation
Abs                                                   abs
Add                                                   add
And                                                   logical_and
ArgMax                                                argmax
ArgMin                                                argmin
Atan                                                  atan
Atanh                                                 atanh
BatchNormalization                                    batchnormalize
Cast                                                  cast
CastLike                                              cast
Ceil                                                  ceil
Celu                                                  celu
Clip                                                  clipbyvalue
Concat                                                concat
Conv                                                  conv1d/group_conv1d/depthwise_conv1d/convolution/conv2d_op/depthwise_conv2d_op/conv3d
ConvTranspose                                         deconvolution/deconvolution1d
Cos                                                   cos
Cumsum                                                cumsum
DepthToSpace                                          depth2space
DequantizeLinear                                      dequantize
DFT                                                   dft
Div                                                   divide
Dropout                                               dropout
Einsum                                                einsum
Elu                                                   elu
Equal                                                 equal
Erf                                                   erf
Exp                                                   exp
Expand                                                expand_broadcast
Floor                                                 floor
Gather                                                gather
GatherElements                                        gather_elements
GatherND                                              gathernd
Gemm                                                  matmul/fullconnect
Greater                                               greater
GreaterOrEqual                                        greater_equal
GridSample                                            gridsample
GRU                                                   gru
HammingWindow                                         hammingwindow
HannWindow                                            hannwindow
HardSigmoid                                           hard_sigmoid
HardSwish                                             hard_swish
InstanceNormalization                                 instancenormalize
LeakyRelu                                             leakyrelu
Less                                                  less
LessOrEqual                                           less_equal
Log                                                   log
Logsoftmax                                            log_softmax
LRN                                                   localresponsenormalization
LSTM                                                  lstm
MatMul                                                matmul/fullconnect
Max                                                   eltwise(MAX)
MaxPool/AveragePool/GlobalAveragePool/GlobalMaxPool   pooling/pool1d/pool3d
MaxRoiPool                                            roipooling
Mean                                                  eltwise(MEAN)
MeanVarianceNormalization                             instancenormalize
Min                                                   eltwise(MIN)
Mish                                                  mish
Mod                                                   mod
Mul                                                   multiply
Neg                                                   neg
NonZero                                               nonzero
OneHot                                                onehot
Or                                                    logical_or
Pad                                                   pad
Pow                                                   pow
Prelu                                                 prelu
QLinearConv                                           convolution/conv1d
QLinearMatMul                                         matmul
QuantizeLinear                                        quantize
Reciprocal                                            variable+divide
ReduceL1                                              abs+reducesum
ReduceL2                                              reducesum+multiply+sqrt
ReduceLogSum                                          reducesum+log
ReduceLogSumExp                                       exp+reducesum+log
ReduceMax                                             reducemax
ReduceMean                                            reducemean
ReduceMin                                             reducemin
ReduceProd                                            reduceprod
ReduceSum                                             reducesum
ReduceSumSquare                                       multiply+reducesum
Relu                                                  relu
Reshape/Squeeze/Unsqueeze/Flatten                     reshape
Resize                                                image_resize
ReverseSequence                                       reverse_sequence
Round                                                 round
ScatterND                                             scatter_nd_update
Selu                                                  selu
Shape                                                 shapelayer
Sigmoid                                               sigmoid
Sign                                                  sign
Silu                                                  swish
Sin                                                   sin
Size                                                  size
Slice                                                 slice/stridedslice
Softmax                                               softmax
Softplus                                              softrelu
Softsign                                              abs+add+divide+variable
SpaceToDepth                                          space2depth
Split                                                 split/slice
Sqrt                                                  sqrt
Squeeze                                               squeeze
STFT                                                  stft
Sub                                                   subtract
Sum                                                   eltwise(SUM)
Tanh                                                  tanh
Tile                                                  tile
TopK                                                  topk
Transpose                                             permute
Unsqueeze                                             reshape
Upsample                                              image_resize
Where                                                 where
Xor                                                   not_equal

Darknet to ACUITY Operation Mapping

Darknet Operation         ACUITY Operation
avgpool                   pooling
batch_normalize           batchnormalize
connected                 fullconnect
convolutional             convolution
depthwise_convolutional   convolution
leaky                     leakyrelu
logistic                  sigmoid
maxpool                   pooling
mish                      mish
region                    region
relu                      relu
reorg                     reorg
route                     concat/slice
scale_channels            multiply
shortcut                  add/slice+add/pad+add
softmax                   softmax
swish                     swish
upsample                  upsampling
yolo                      yolo