I've been struggling with this one for a while, so here we go: I've been trying to match the inference speed of an ML model I generated with Edge Impulse, originally for Arduino and then ported to ESP-IDF, on my ESP32-CAM device.
The algorithm takes ~1300 ms to run on Arduino and ~6600 ms on ESP-IDF, with -Os optimization in both cases. The closest I got was by setting the compiler optimization to -O2, which brought it down to around 2000 ms on ESP-IDF.
In both cases the CPU frequency is set to 240 MHz. I tried to figure out exactly how Arduino compiles the project and to mimic it to see what I could be missing, but I can't figure it out.
I verified with a test sample that does matrix calculations with both volatile floats and integers, to ensure that the CPU's computing capacity is the same in both environments, and I got:
I have similar results on both projects, and I logged everything thread-related and it matches (runs on core 1, same priority, same CPU speed).
I ensured that memory allocation is static in both TensorFlow Lite libraries with the flag -DTF_LITE_STATIC_MEMORY.
I ensured that there are no parallelism shenanigans and that OPEN_MP is disabled in both cases.
I switched compilers to check whether the problem comes from the compiler's libc or something similar.
I tried to get as close as possible to Arduino's compiler arguments.
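For reference, a sanity check like the one described above (matrix arithmetic on volatile floats and integers) could be sketched roughly as follows. This is not the code that was actually run: `time_matmul`, the matrix size and the fill pattern are all hypothetical, and `std::chrono::steady_clock` is used only to keep the sketch portable. On the ESP32 itself, `esp_timer_get_time()` would be an obvious alternative.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: multiplies two N x N matrices held in volatile storage,
// so the compiler cannot fold the arithmetic away, and returns the elapsed
// time in microseconds. The checksum is returned so the result is observable.
template <typename T, int N>
std::int64_t time_matmul(T *checksum_out) {
    static volatile T a[N][N], b[N][N], c[N][N];

    // Fill the inputs with small, deterministic values.
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            a[i][j] = static_cast<T>(i + j + 1);
            b[i][j] = static_cast<T>(i - j + 2);
        }
    }

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            T acc = 0;
            for (int k = 0; k < N; ++k) {
                acc += a[i][k] * b[k][j];  // every operand is a volatile read
            }
            c[i][j] = acc;
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    // Sum the result matrix so both builds can be compared for correctness too.
    T checksum = 0;
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            checksum += c[i][j];
        }
    }
    *checksum_out = checksum;

    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling `time_matmul<float, 32>(&sum)` and `time_matmul<std::int32_t, 32>(&sum)` in both the Arduino and the ESP-IDF build gives two timings to compare. Similar numbers only show that raw arithmetic is equally fast in both builds. They say nothing about which kernels the inference library was compiled with.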
Here is a dump of the Arduino compile arguments:
COLLECT_GCC_OPTIONS='-c' '-mlongcalls' '-Wno-frame-address' '-ffunction-sections' '-fdata-sections' '-Wno-error=unused-function' '-Wno-error=unused-variable' '-Wno-error=unused-but-set-variable' '-Wno-error=deprecated-declarations' '-Wno-unused-parameter' '-Wno-sign-compare' '-Wno-enum-conversion' '-gdwarf-4' '-ggdb' '-freorder-blocks' '-Wwrite-strings' '-fstack-protector' '-fstrict-volatile-bitfields' '-fno-jump-tables' '-fno-tree-switch-conversion' '-std=gnu++23' '-fexceptions' '-fno-rtti' '-w' '-Os' '-v' '-w' '-E' '-CC' '-D' 'F_CPU=240000000L' '-D' 'ARDUINO=10607' '-D' 'ARDUINO_ESP32_DEV' '-D' 'ARDUINO_ARCH_ESP32' '-D' 'ARDUINO_BOARD="ESP32_DEV"' '-D' 'ARDUINO_VARIANT="esp32"' '-D' 'ARDUINO_PARTITION_huge_app' '-D' 'ARDUINO_HOST_OS="windows"' '-D' 'ARDUINO_FQBN="esp32:esp32:esp32cam:CPUFreq=240,FlashFreq=80,FlashMode=qio,PartitionScheme=huge_app,DebugLevel=none,EraseFlash=none"' '-D' 'ESP32' '-D' 'CORE_DEBUG_LEVEL=0' '-D' 'BOARD_HAS_PSRAM' '-mfix-esp32-psram-cache-issue' '-mfix-esp32-psram-cache-strategy=memw' '-D' 'ARDUINO_USB_CDC_ON_BOOT=0' '-D' 'ESP_PLATFORM' '-D' 'IDF_VER="v5.1.4-497-gdc859c1e67-dirty"' '-D' 'MBEDTLS_CONFIG_FILE="mbedtls/esp_config.h"' '-D' 'SOC_MMU_PAGE_SIZE=CONFIG_MMU_PAGE_SIZE' '-D' 'UNITY_INCLUDE_CONFIG_H' '-D' '_GNU_SOURCE' '-D' '_POSIX_READER_WRITER_LOCKS' '-D' 'configENABLE_FREERTOS_DEBUG_OCDAWARE=1' '-D' 'TF_LITE_STATIC_MEMORY' '-I'
Here is my compile line on ESP-IDF:
C:\Espressif\tools\xtensa-esp32-elf\esp-12.2.0_20230208\xtensa-esp32-elf\bin\xtensa-esp32-elf-g++.exe -mlongcalls -Wno-frame-address -DNDEBUG -fdiagnostics-color=always -Wno-unused-variable -Wno-deprecated-declarations -Wno-missing-field-initializers -Wno-maybe-uninitialized -Wno-error=uninitialized -DTF_LITE_STATIC_MEMORY -mlongcalls -ffunction-sections -fdata-sections -fstrict-volatile-bitfields -fno-jump-tables -fno-tree-switch-conversion -fno-rtti -w -Wall -Werror=all -Wno-error=unused-function -Wno-error=unused-variable -Wno-error=unused-but-set-variable -Wno-error=deprecated-declarations -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-enum-conversion -gdwarf-4 -ggdb -mfix-esp32-psram-cache-issue -mfix-esp32-psram-cache-strategy=memw -Os -freorder-blocks -fmacro-prefix-map=path -fmacro-prefix-map=other_path -DconfigENABLE_FREERTOS_DEBUG_OCDAWARE=1 -std=gnu++2b -fno-exceptions -DESP32=ESP32 -MD -MT file.cpp.obj -MF file.cpp.obj.d -o file.cpp.obj
-c file.cpp
What I find strangest is that -O2 makes such a difference in my ESP-IDF case, while Arduino is still faster with just -Os...
Anyway, any help would be greatly appreciated. Have a good day everyone, and thanks for reading,
Aloïs
asked Nov 19, 2024 at 9:20 by A.KYROU, edited Nov 19, 2024 at 9:29
- Profile it. Find the bottleneck. – 0___________ Commented Nov 19, 2024 at 9:29
- Is the code executed from IRAM or from serial flash? Is the FPU enabled? – 0___________ Commented Nov 19, 2024 at 9:42
- Thanks for your answer, I'm adding profiling right now. As for the FPU: the ESP32 is based on the Xtensa architecture with LX6 cores, which do not have a dedicated FPU. As for IRAM, I did nothing to place functions in it, but I don't think Arduino did either. Do you have any advice on how to check it? Straight from the map file? – A.KYROU Commented Nov 19, 2024 at 10:02
- ESP32-S3 has an FPU – 0___________ Commented Nov 19, 2024 at 10:08
- Ha, OK, sorry, but I work with an older generation, an ESP-32S module – A.KYROU Commented Nov 19, 2024 at 10:18
1 Answer
Long story short, it was the TFLite kernels' accelerated math functions that weren't compiled, because the flag that defines the board wasn't always passed by the CMake file.
If your problem looks similar, look for the EI_CLASSIFIER_TFLITE_ENABLE_ESP_NN flag and ensure your board is always defined.
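A quick way to check this from inside the build is a small diagnostic like the one below. It is only a sketch and not part of the Edge Impulse SDK: the function names are invented, and using `ESP32` as the board macro is an assumption based on the `-D ESP32` entries in the compile lines above. If the SDK derives the flag in its own headers, the check only makes sense after those headers have been included.

```cpp
#include <string>

// Hypothetical diagnostic: reports how the ESP-NN feature flag looks from
// inside the translation unit that includes this code.
inline const char *esp_nn_flag_state() {
#if !defined(EI_CLASSIFIER_TFLITE_ENABLE_ESP_NN)
    return "undefined";
#elif (EI_CLASSIFIER_TFLITE_ENABLE_ESP_NN + 0)
    // The "+ 0" keeps the expression valid even if the macro is defined empty.
    return "enabled";
#else
    return "disabled";
#endif
}

// Hypothetical diagnostic: true if the board macro reached this translation
// unit. ESP32 as the macro name is an assumption taken from the compile lines.
inline bool board_macro_defined() {
#if defined(ESP32)
    return true;
#else
    return false;
#endif
}
```

Logging both values from a file that belongs to the TFLite kernels component, not only from the application's main file, shows whether every target in the CMake project receives the definition. A result of "undefined" or "disabled" there would match the slowdown described above.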
Source: "c - ML model from Edge Impulse runs 5 times slower after porting it from Arduino IDE to ESP-IDF on ESP32", Stack Overflow.