OTA firmware update loop using excessive data

I was updating my device over OTA from version 0.0.2 to 0.0.4 and it continuously updated and used >10MB of data in an hour. It appears to download the data but doesn’t successfully update it. Any idea what would cause this?

here is the serial console dump for one of the loops.

[00:03:07.903,198] <dbg> app_dfu: data_received: Received 1024 bytes at offset 423936
[00:03:08.562,927] <dbg> app_dfu: data_received: Received 522 bytes at offset 424960 (last)
[00:03:09.375,701] <inf> app_dfu: Requesting upgrade
[00:03:09.383,728] <inf> app_dfu: Rebooting in 1 second(s)
*** Booting Zephyr OS build v3.3.99-ncs1-1 ***
[00:00:00.529,235] <inf> golioth_system: Initializing
[00:00:00.543,457] <inf> fs_nvs: 2 Sectors of 4096 bytes
[00:00:00.551,544] <inf> fs_nvs: alloc wra: 0, fb8
[00:00:00.559,082] <inf> fs_nvs: data wra: 0, 60
Steadfast SM-IOT Foundation Monitor. Copyright 2023. version: v0.23 10/4/2023 (Oct  4 2023 09:58:26) MCUboot Ver:0.0.2
[00:00:00.590,423] <inf> sm_iot: Connecting to LTE, this may take some time...
[00:00:03.057,495] <inf> sm_iot: Network: Searching
[00:00:17.632,202] <inf> sm_iot: Network: Registered (roaming)
[00:00:17.641,265] <inf> golioth_system: Starting connect
[00:00:17.774,383] <dbg> sm_iot: main: Steadfast SM-IOT Foundation Monitor. Copyright 2023. version: v0.23 10/4/2023 (Oct  4 2023 09:58:26) MCUboot Ver: 0.0.2
[00:00:18.986,358] <inf> sm_iot: Golioth connected. Registering observations
[00:00:19.000,091] <inf> golioth_system: Client connected!
[00:00:19.434,600] <dbg> app_dfu: golioth_desired_update: Desired
                                  a3 01 1a 65 25 6d 16 02  78 40 34 31 37 66 36 34 |...e%m.. x@417f64
                                  61 30 31 64 37 36 37 34  39 38 38 62 32 64 33 35 |a01d7674 988b2d35
                                  63 39 38 64 39 65 38 35  34 31 34 35 64 31 34 32 |c98d9e85 4145d142
                                  32 33 37 31 61 38 38 35  32 62 65 65 30 66 37 34 |2371a885 2bee0f74
                                  61 36 64 61 39 65 64 35  66 38 03 81 a6 01 64 6d |a6da9ed5 f8....dm
                                  61 69 6e 02 65 30 2e 30  2e 34 03 78 40 61 66 39 |ain.e0.0 .4.x@af9
                                  31 34 31 39 63 31 36 34  62 30 31 64 35 34 39 65 |1419c164 b01d549e
                                  66 66 34 39 65 64 63 31  64 32 30 36 64 62 39 66 |ff49edc1 d206db9f
                                  32 66 32 30 38 35 61 38  35 39 64 34 61 64 39 36 |2f2085a8 59d4ad96
                                  63 35 31 30 38 66 65 65  63 36 64 62 35 04 1a 00 |c5108fee c6db5...
                                  06 7e 0a 05 70 2f 2e 75  2f 63 2f 6d 61 69 6e 40 |.~..p/.u /c/main@
                                  30 2e 30 2e 34 06 67 6d  63 75 62 6f 6f 74       |0.0.4.gm cuboot
[00:00:19.556,579] <inf> app_settings: Set DEBUG_LEVEL to 20
[00:00:19.571,319] <dbg> app_state: async_handler: State successfully set
[00:00:19.590,362] <inf> sm_iot: nRF9160 hardwire ID :            0x54c69ae8a26120e9
[00:00:19.600,921] <inf> sm_iot: Cellular Modem IMEI number:      350457798708960
[00:00:19.611,236] <dbg> sm_iot: main: Cellular Modem ICCID number:     8944501207217644380
[00:00:19.623,138] <dbg> sm_iot: main: Cellular Modem Signal Strength:  -81db
[00:00:19.642,608] <dbg> sm_iot: main: OS time and date: 10-10-23 16:38:06
[00:00:20.073,577] <dbg> app_dfu: data_received: Received 1024 bytes at offset 0
[00:00:20.085,021] <inf> mcuboot_util: Swap type: test
[00:00:20.093,811] <inf> golioth_dfu: swap type: test
[00:00:20.677,337] <dbg> app_dfu: data_received: Received 1024 bytes at offset 1024
[00:00:21.126,434] <dbg> app_dfu: data_received: Received 1024 bytes at offset 2048

I created a 0.0.3 version and programmed it manually using SWD. it did the update the first time and then reported “00:00:09.254,486] app_dfu: Desired version (0.0.4) matches current firmware version!” on the second boot like I would expect so I’ve fixed it for now.

Looking at the golioth_dfu log it went from UPDATING to IDLE with no other status when it was failing.

2023-10-10T16:38:06.855311378Z,DIAGNOSTICS,INFO,golioth_dfu,main(IDLE): ready state,main,ready state,0,IDLE,0,2023-10-10T16:38:06.855311378Z,0.0.2,651ce9fc5d4d6a17c7d39011,
2023-10-10T16:37:45.668010156Z,DIAGNOSTICS,INFO,golioth_dfu,main(UPDATING): ready state,main,ready state,0,UPDATING,3,0.0.4,2023-10-10T16:37:45.668010156Z,0.0.2,651ce9fc5d4d6a17c7d39011,

When I manually re-programmed it to v0.03 it passed with:

2023-10-10T16:46:42.827268176Z,DIAGNOSTICS,INFO,golioth_dfu,main(IDLE): firmware updated successfully,main,firmware updated successfully,1,IDLE,0,2023-10-10T16:46:42.827268176Z,0.0.4,651ce9fc5d4d6a17c7d39011
2023-10-10T16:45:49.514255380Z,DIAGNOSTICS,INFO,golioth_dfu,main(UPDATING): ready state,main,ready state,0,UPDATING,3,0.0.4,2023-10-10T16:45:49.51425538Z,0.0.3,651ce9fc5d4d6a17c7d39011

I’m worried that it will happen in the field and consume vast amounts of data in the field so I need to know what went wrong. Any ideas?

Hey John,

It looks to me that we have two issues here: first, the device uses too much data as it continuously downloads the same FW image, and the second issue is the device not running the downloaded v0.0.4 version of the firmware.

The first issue could be addressed in the application; that being said, it would be a future good feature in the firmware SDK. A temporary way to solve it could be to use something like LightDB State to track the number of times an update has been attempted and stop trying to perform it after some number of tries.

The second issue could be a mcuboot issue. Between v0.0.2 and v0.0.4, have you made any changes to app_dfu or flash files, or project configuration files? Have you been customizing the bootloader or making changes to the signing keys?
Have you seen behavior like this before?

I haven’t modified app_dfu, flash files, or bootloader at all. They are straight from the reference code. No change to signing keys recently either. There are always changes to the project config file but manually flashing the same code with a 0.0.3 revision on it using SWD fixed it.
This was the first update I’ve ever done on my code base. I did create a 0.0.5 version and it updated from 0.0.4 correctly though.

This has happened twice more. It continuously tries to update averaging 18 attempts an hour at 520K so it will use ~10MB/hour until I disable the update. Firmware updates are all library functions so I’d have to go into the library functions to add a lightDB state. Surely this has been seen before.

Unfortunately, we haven’t had a situation like this before, and it’s hard to tell what could be the reason for this behavior without being able to reproduce it ourselves.

One thing that could help is a more verbose bootloader to show which application image is running, whether it has done the image swap, etc. You can try executing mcuboot commands in the Zephyr shell in order to figure out what the versions of the images are. Running mcuboot -h in the Zephyr shell will output a list of available subcommands; for example, you can execute mcuboot command during the second download to get information on images in the primary and secondary area.

In order to move forward on our side, we would need to reproduce the issue on the nRF9160 DK to verify if the problem lies with Golioth SDK, MCUBoot, or if it’s something else.
If the mcuboot shell commands don’t give the desired outcome, i.e. we figure out what’s the issue, would you be open to sending me firmware images in order to reproduce the behavior?

Looking back at the logs here is what I see when running the shell mcuboot command

uart:~$ mcuboot
swap type: none
confirmed: 1

primary area (3):
  version: 0.0.10+0
  image size: 437580
  magic: unset
  swap type: none
  copy done: unset
  image ok: unset

failed to read secondary area (9) header: -5
[00:01:01.213,256] <inf> mcuboot_util: Image index: 0, Swap type: none

It looks like there was some issue and it couldn’t read the secondary area.
I’m not sure what I changed but I got it out of that mode and now it’s updating properly.

primary area (3):
  version: 0.0.10+0
  image size: 437580
  magic: good
  swap type: test
  copy done: set
  image ok: set

secondary area (9):
  version: 0.0.7+0
  image size: 437692
  magic: unset
  swap type: none
  copy done: unset
  image ok: unset

I’ll update the post later if I run into it again or have any more insights.

I have reproduced this easily. When using a new signing key the system will download and try to update the firmware in an endless loop but never succeed. This can easily generate GB of data usage per device in a single day.

My initial failure wasn’t using a different key so there are definitely other failure methods that will generate this excessive data.
Any IOT system needs to be robust enough to not allow these kinds of data intensive loops and this is the third one I’ve come across. The other two are: