Fatal error when trying to set wifi credentials for combined application Golioth

Description

Hello everyone,

I made a public repo that tries to combine the reference-design-template with the certificate_provisioning example, in hopes to create a production ready Golioth+Zephyr base application that works out-of-the-box for an ESP32 based product… My goal for this is to be a standalone repo that does not reference any samples/common.

In my specific case, I made a custom PCB using the ESP32C6-WROOM-N4, so it has 4MB of flash instead of the default 8MB, but that’s the only difference between that and the ESP32C6 DevkitC…

So far, I am able to build mcuboot with the default 4MB partition table for the ESP32C6, as outlined in my custom board overlay, as well as the main Zephyr application that is code combined from the reference design template and certificate provisioning example.

I can flash successfully, and mcuboot seems to start and jump to the main application just fine

Build:Sep 19 2022
rst:0x1 (POWERON),boot:0x68 (SPI_FAST_FLASH_BOOT)
SPIWP:0xee
mode:DIO, clock div:2
load:0x40851d08,len:0x1bb4
load:0x4085be00,len:0x7b2c
load:0x40867d08,len:0x3c8
entry 0x4085e594
I (30) soc_init: chip revision: v0.2
I (30) flash_init: SPI Speed      : 80MHz
I (30) flash_init: SPI Mode       : DIO
I (30) flash_init: SPI Flash Size : 4MB
*** Booting MCUboot v2.2.0-54-g4eba8087fa60 ***
*** Using Zephyr OS build v4.2.1 ***
I: Starting bootloader
I (67) boot: Image index: 0, Swap type: none
I: Bootloader chainload address offset: 0x20000
I: Image version: v2.8.0
I: Jumping to the first image slot
I: br_image_off = 0x20000

I: ih_hdr_size = 0x20

I (1815) boot: Loading image 0 - slot 0 from flash, area id: 2
I (1820) boot: Application start=40804524h
I (1824) boot: DRAM segment: paddr=0002e894h, vaddr=4080e830h, size=0569ch ( 22172) load
I (1835) boot: IRAM segment: paddr=00020080h, vaddr=40800000h, size=0e814h ( 59412) load
I (1852) boot: IROM segment: paddr=00040000h, vaddr=42000000h, size=B00D8h (721112) map
I (1852) boot: DROM segment: paddr=00100000h, vaddr=42800000h, size=25590h (152976) map
I (1867) boot: libc heap size 261 kB.
I (1867) spi_flash: detected chip: gd
I (1867) spi_flash: flash io: dio

[00:00:00.005,000] <inf> wifi_init: rx ba win: 6
[00:00:00.005,000] <inf> littlefs: littlefs partition at /lfs1
[00:00:00.005,000] <inf> littlefs: LittleFS version 2.11, disk version 2.1
[00:00:00.005,000] <inf> littlefs: FS at flash-controller@60002000:0x3b0000 is 48 0x1000-byte blocks with 512 cycle
[00:00:00.005,000] <inf> littlefs: partition sizes: rd 16 ; pr 16 ; ca 64 ; la 32
[00:00:00.006,000] <inf> littlefs: Automount /lfs1 succeeded
[00:00:00.017,000] <err> fs: file open error (-2)
*** Booting Zephyr OS build v4.2.1 ***
*** Golioth Firmware SDK v0.21.1 ***
[00:00:00.017,000] <dbg> golioth_rd_template: main: Start Reference Design Template sample
[00:00:00.017,000] <inf> golioth_rd_template: Firmware version: 2.8.0
[00:00:00.017,000] <wrn> golioth_rd_template: Could not stat /lfs1/credentials/crt.der, err: -2
[00:00:05.019,000] <wrn> golioth_rd_template: Could not stat /lfs1/credentials/crt.der, err: -2
[00:00:10.020,000] <wrn> golioth_rd_template: Could not stat /lfs1/credentials/crt.der, err: -2

I get an error initially stating <err> fs: file open error (-2), but if I do fs ls, I do see the /lfs1 folder there:

uart:~$ fs ls
lfs1/

If I then try to do fs mkdir /lfs1/credentials I get a fatal error:

uart:~$ fs mkdir /lfs1/credentials
[00:01:50.040,000] <wrn> golioth_rd_template: Could not stat /lfs1/credentials/crt.der, err: -2
[00:01:51.040,000] <err> os:
[00:01:51.040,000] <err> os:  mcause: 1, Instruction Access fault
[00:01:51.040,000] <err> os:   mtval: 2
[00:01:51.040,000] <err> os:      a0: 00000000    t0: 408134b4
[00:01:51.040,000] <err> os:      a1: 4082fd28    t1: 400008c6
[00:01:51.040,000] <err> os:      a2: 01000000    t2: 00000000
[00:01:51.040,000] <err> os:      a3: 4082fd28    t3: aaaaaaaa
[00:01:51.040,000] <err> os:      a4: 4082dec0    t4: aaaaaaaa
[00:01:51.040,000] <err> os:      a5: 4082dec0    t5: aaaaaaaa
[00:01:51.040,000] <err> os:      a6: 00000001    t6: aaaaaaaa
[00:01:51.040,000] <err> os:      a7: 0000003e
[00:01:51.040,000] <err> os:      sp: 40814490
[00:01:51.040,000] <err> os:      ra: 00000002
[00:01:51.040,000] <err> os:    mepc: 00000002
[00:01:51.040,000] <err> os: mstatus: 00001880
[00:01:51.040,000] <err> os:
[00:01:51.040,000] <err> os: call trace:
[00:01:51.040,000] <err> os:
[00:01:51.040,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:01:51.040,000] <err> os: Current thread: 0x4082dec0 (logging)
[00:01:51.209,000] <err> os: Halting system

Anyways, If I go ahead and do log halt that works fine, and so then I proceed to uploading my certificate and key that was generated by the Golioth Console (just for testing), and these seem to work fine:

ubuntu@Megalodon:~/esp32c6-golioth-zephyr-starter$ mcumgr --conntype serial --connstring=dev=/dev/ttyUSB1,baud=115200 fs upload ivory-mild-aardwolf.crt.der /lfs1/credentials/crt.der
310
371
Done

ubuntu@Megalodon:~/esp32c6-golioth-zephyr-starter$ mcumgr --conntype serial --connstring=dev=/dev/ttyUSB1,baud=115200 fs upload ivory-mild-aardwolf.key.der /lfs1/credentials/key.der
138
Done

So when I go back to connecting to the device, it looks like it reads the certificate and key just fine, but it cannot connect, because there is no WiFi yet. So far so good, it’s working as I would expect…

[00:00:00.005,000] <inf> wifi_init: rx ba win: 6
[00:00:00.005,000] <inf> littlefs: littlefs partition at /lfs1
[00:00:00.005,000] <inf> littlefs: LittleFS version 2.11, disk version 2.1
[00:00:00.005,000] <inf> littlefs: FS at flash-controller@60002000:0x3b0000 is 48 0x1000-byte blocks with 512 cycle
[00:00:00.005,000] <inf> littlefs: partition sizes: rd 16 ; pr 16 ; ca 64 ; la 32
[00:00:00.007,000] <inf> littlefs: Automount /lfs1 succeeded
[00:00:00.018,000] <err> fs: file open error (-2)
*** Booting Zephyr OS build v4.2.1 ***
*** Golioth Firmware SDK v0.21.1 ***
[00:00:00.018,000] <dbg> golioth_rd_template: main: Start Reference Design Template sample
[00:00:00.018,000] <inf> golioth_rd_template: Firmware version: 2.8.0
[00:00:00.020,000] <inf> golioth_rd_template: Read 371 bytes from /lfs1/credentials/crt.der
[00:00:00.022,000] <inf> golioth_rd_template: Read 138 bytes from /lfs1/credentials/key.der
[00:00:00.022,000] <inf> golioth_mbox: Mbox created, bufsize: 1408, num_items: 10, item_size: 128
[00:00:00.022,000] <inf> golioth_fw_update: Current firmware version: main - 2.8.0
[00:00:00.023,000] <inf> golioth_fw_update: Waiting for golioth client to connect before cancelling rollback
[00:00:00.024,000] <err> golioth_coap_client_zephyr: Fail to get address (coap.golioth.io 5684) -11
[00:00:00.024,000] <err> golioth_coap_client_zephyr: Failed to connect: -11
[00:00:00.024,000] <wrn> golioth_coap_client_zephyr: Failed to connect: -11

Now the real problem starts when I try to assign a WiFi credential and then try to make it auto-connect:

uart:~$ wifi cred add -k 1 -s test_ssid -p test_pass
[00:00:25.030,000] <err> golioth_coap_client_zephyr: Fail to get address (coap.golioth.io 5684) -11
[00:00:25.030,000] <err> golioth_coap_client_zephyr: Failed to connect: -11
[00:00:25.030,000] <wrn> golioth_coap_client_zephyr: Failed to connect: -11
[00:00:26.030,000] <err> os:
[00:00:26.030,000] <err> os:  mcause: 5, Load access fault
[00:00:26.030,000] <err> os:   mtval: 822261e
[00:00:26.030,000] <err> os:      a0: 0822261e    t0: 408134b4
[00:00:26.030,000] <err> os:      a1: ffffffff    t1: 0000000f
[00:00:26.030,000] <err> os:      a2: 00000020    t2: 00000000
[00:00:26.030,000] <err> os:      a3: 4081449c    t3: aaaaaaaa
[00:00:26.030,000] <err> os:      a4: 408144e0    t4: aaaaaaaa
[00:00:26.030,000] <err> os:      a5: 00000000    t5: aaaaaaaa
[00:00:26.030,000] <err> os:      a6: 00000001    t6: aaaaaaaa
[00:00:26.030,000] <err> os:      a7: 0000003e
[00:00:26.030,000] <err> os:      sp: 40814490
[00:00:26.030,000] <err> os:      ra: 4080d618
[00:00:26.030,000] <err> os:    mepc: 4080d620
[00:00:26.030,000] <err> os: mstatus: 00001880
[00:00:26.030,000] <err> os:
[00:00:26.030,000] <err> os: call trace:
[00:00:26.030,000] <err> os:
[00:00:26.030,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:26.030,000] <err> os: Current thread: 0x4082dec0 (logging)
[00:00:26.221,000] <err> os: Halting system

So now, if I do wifi cred auto_connect, it just immediately reboots, and then gets stuck after I (1868) spi_flash: flash io: dio, and it never proceeds to do [00:00:00.005,000] <inf> wifi_init: rx ba win: 6 as it did before…

*** Booting Zephyr OS build v4.2.1 ***
*** Golioth Firmware SDK v0.21.1 ***
[00:00:00.019,000] <dbg> golioth_rd_template: main: Start Reference Design Template sample
[00:00:00.019,000] <inf> golioth_rd_template: Firmware version: 2.8.0
[00:00:00.022,000] <inf> golioth_rd_template: Read 371 bytes from /lfs1/credentials/crt.der
[00:00:00.023,000] <inf> golioth_rd_template: Read 138 bytes from /lfs1/credentials/key.der
[00:00:00.024,000] <inf> golioth_mbox: Mbox created, bufsize: 1408, num_items: 10, item_size: 128
[00:00:00.024,000] <inf> golioth_fw_update: Current firmware version: main - 2.8.0
[00:00:00.025,000] <inf> golioth_fw_update: Waiting for golioth client to connect before cancelling rollback
[00:00:00.025,000] <err> golioth_coap_client_zephyr: Fail to get address (coap.golioth.io 5684) -11
[00:00:00.025,000] <err> golioth_coap_client_zephyr: Failed to connect: -11
[00:00:00.025,000] <wrn> golioth_coap_client_zephyr: Failed to connect: -11
uart:~$ wifi cred auto_connect <---- SET auto_connect
ESP-ROM:esp32c6-20220919
Build:Sep 19 2022
rst:0xf (LP_BOD_SYS),boot:0x8 (SPI_FAST_FLASH_BOOT)
SPIWP:0xee
mode:DIO, clock div:2
load:0x40851d08,len:0x1bb4
load:0x4085be00,len:0x7b2c
load:0x40867d08,len:0x3c8
entry 0x4085e594
I (30) soc_init: chip revision: v0.2
I (30) flash_init: SPI Speed      : 80MHz
I (30) flash_init: SPI Mode       : DIO
I (30) flash_init: SPI Flash Size : 4MB
*** Booting MCUboot v2.2.0-54-g4eba8087fa60 ***
*** Using Zephyr OS build v4.2.1 ***
I: Starting bootloader
I (67) boot: Image index: 0, Swap type: none
I: Bootloader chainload address offset: 0x20000
I: Image version: v2.8.0
I: Jumping to the first image slot
I: br_image_off = 0x20000
I: ih_hdr_size = 0x20
I (1815) boot: Loading image 0 - slot 0 from flash, area id: 2
I (1820) boot: Application start=40804524h
I (1824) boot: DRAM segment: paddr=0002e894h, vaddr=4080e830h, size=0569ch ( 22172) load
I (1835) boot: IRAM segment: paddr=00020080h, vaddr=40800000h, size=0e814h ( 59412) load
I (1852) boot: IROM segment: paddr=00040000h, vaddr=42000000h, size=B00D8h (721112) map
I (1852) boot: DROM segment: paddr=00100000h, vaddr=42800000h, size=25590h (152976) map
I (1867) boot: libc heap size 261 kB.
I (1867) spi_flash: detected chip: gd
I (1868) spi_flash: flash io: dio <--- STUCK HERE

I am not sure how to diagnose this problem now… everything seems to look good on the Zephyr side, in terms of the overlay, the partitions, and the main code is fairly simple and hasn’t been modified much from the reference design… Have you guys seen this behavior before? Since I am brand new to Zephyr, MCUboot, and Golioth, I am not sure who is to blame for this error, and what path to start going down in terms of troubleshooting…

Thank you, and I appreciate any help on this!

Nicolas

Steps to Reproduce

Follow the instructions on my repo to get the source code I am running from, and then follow the instructions from the certificate_provisioning repo. You will need an ESP32C6 devkit.

Expected Behavior

Successful connection to Golioth Console.

Actual Behavior

Fatal errors and load access faults when trying to set wifi credentials.. also getting the same error when trying to do fs mkdir /lfs1/credentials

Impact

Cannot continue project development.

Environment

Same as that used in the reference-design-template, just modified to also incorporate the uploading of certificates, so no sample code is used, and is closer to a production ready project.

Logs and Console Output

[00:00:26.030,000] os:
[00:00:26.030,000] os: mcause: 5, Load access fault
[00:00:26.030,000] os: mtval: 822261e
[00:00:26.030,000] os: a0: 0822261e t0: 408134b4
[00:00:26.030,000] os: a1: ffffffff t1: 0000000f
[00:00:26.030,000] os: a2: 00000020 t2: 00000000
[00:00:26.030,000] os: a3: 4081449c t3: aaaaaaaa
[00:00:26.030,000] os: a4: 408144e0 t4: aaaaaaaa
[00:00:26.030,000] os: a5: 00000000 t5: aaaaaaaa
[00:00:26.030,000] os: a6: 00000001 t6: aaaaaaaa
[00:00:26.030,000] os: a7: 0000003e
[00:00:26.030,000] os: sp: 40814490
[00:00:26.030,000] os: ra: 4080d618
[00:00:26.030,000] os: mepc: 4080d620
[00:00:26.030,000] os: mstatus: 00001880
[00:00:26.030,000] os:
[00:00:26.030,000] os: call trace:
[00:00:26.030,000] os:
[00:00:26.030,000] os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:26.030,000] os: Current thread: 0x4082dec0 (logging)
[00:00:26.221,000] os: Halting system

Thanks for the detailed investigation. Looking at the different register dumps you’ve posted, the crashes aren’t coming from a specific line of code in your application, but from corrupted state inside Zephyr’s logging thread.

In the first case, where fs mkdir /lfs1/credentials causes a fault, the key part of the dump is:

mepc = 0x00000002
mcause = Instruction Access fault
Current thread: 0x4082dec0 (logging)

That means the CPU tried to execute code at address 0x00000002, which can’t ever be valid. This happens when a return address or function pointer gets overwritten, usually a stack overflow or memory corruption. The fact that the active thread is the logging thread reinforces that the logging stack ran out of room and courpted its own state. This is further confirmed by the fact that once you halted shell logging with log halt, everything started working normally.

In the second case, when you add Wi-Fi credentials and Zephyr faults with:

mcause = Load access fault
mepc   = 0x4080d620
Current thread: 0x4082dec0 (logging)

you’re seeing a different symptom but the same root cause: the logging thread is still the one crashing.

After increasing:

CONFIG_LOG_PROCESS_THREAD_STACK_SIZE=4096
CONFIG_LOG_BUFFER_SIZE=4096

the crashes disappear, which confirms that the underlying issue wasn’t FS, Wi-Fi, or Golioth, at least for now.

Regarding the hang here:

I (1815) boot: Loading image 0 - slot 0 from flash, area id: 2
I (1820) boot: Application start=40804524h
I (1824) boot: DRAM segment: ...
...
I (1868) spi_flash: flash io: dio   <-- stuck here

This is the point where the bootloader has finished loading your Zephyr image and is about to jump into the application. If the board stalls right after flash io: dio, it usually means something is crashing very early in Zephyr’s startup path, before the console, logging, or drivers are fully up. At that stage it’s difficult to tell exactly what’s going on without doing deeper debugging.

Everything you’re seeing here points to application level issues and misconfiguration. At this stage, the application needs more detailed debugging on the Zephyr side to understand what’s failing so early and what is wrong with the FS and WiFi.

Since these issues aren’t related to Golioth and are specific to the application, the best approach is to start stripping things back and figure out where it went wrong. Revert to an earlier known good commit, clean up the project configuration, and then add pieces back in gradually until you pinpoint what’s causing the failure. For anything that looks like a Zephyr issue, it’s best to reach out directly to the Zephyr community. And of course, if you hit anything Golioth-specific along the way, we’re always happy to help on that front.

1 Like