Golioth_system: Receive timeout

Hi,

I’m testing the lightdb_stream sample program using SparkFun Thing Plus nRF9160 and was running fine for a few hours. But then I got “golioth_system: Receive timeout” warning message and it stopped sending data.

I changed the uplink frequency from 5sec to 30min and repeated the test. This time, it sent the new data after 30min, but then 20 min later I got the same warning message and it won’t recover until I reset the board.

Has anyone experience similar issues? If yes, please help!

Also, is there an example on how to implement a watchdog?

Thanks!

Hey @invosa welcome to the forum!

Do you call out a specific version of the Golioth Zephyr SDK in your manifest? If not, when did you pull the Golioth code into your project?

Chris

Hi @ChrisGammell, thanks!

No, I did not call out a specific version. I just used the standard sample program which I first downloaded on 10/31, but I’ve done west update before compiling it today.

Btw, I reset the board and it’s currently been running for 3hrs. I’ll let you know when it stops again.

Hi invosa! Will you please send the hash of the Golioth SDK? You can do this from the modules/lib/golioth folder as follows:

➜ cd ~/golioth-ncs-workspace/modules/lib/golioth/
➜ git rev-parse HEAD
fa816c5c74848dfdca0827e25a6562a1bc237e94

Hi @mike, here’s the hash of the Golioth SDK: e77f9355307d1177c3b569cbda3ea796f59955a7

After the last reset, so far it’s been running for almost 12hrs with no issue (only sending new values every 30min). I plan to run it for several days.

FYI, it stopped again after 22.5 hrs. See below logs:

[22:00:46.187,683] golioth_lightdb_stream: main: Sending temperature 22.000000
[22:00:46.831,207] golioth_lightdb_stream: temperature_push_sync: Temperature successfully sync pushed
[22:00:46.831,207] golioth_lightdb_stream: main: Sending humidity 72.000000
[22:00:47.066,619] golioth_lightdb_stream: humidity_push_sync: Humidity successfully sync pushed
[22:30:47.066,680] golioth_lightdb_stream: main: Sending temperature 22.500000
[22:30:49.357,482] golioth_lightdb_stream: temperature_push_sync: Temperature successfully sync pushed
[22:30:49.357,513] golioth_lightdb_stream: main: Sending humidity 72.500000
[22:30:49.606,811] golioth_lightdb_stream: humidity_push_sync: Humidity successfully sync pushed
[22:50:43.360,839] golioth_system: Receive timeout
[23:00:49.606,872] golioth_lightdb_stream: main: Sending temperature 23.000000

No more messages after that and the device went offline.

Thanks for reporting and sending over the hash. I’ve checkout out the same version of Golioth and have my Sparkfun Thing Plus nrf9160 running the lightdb_stream sample to see if I can replicate the same behavior you are seeing.

As a workaround while we troubleshoot this, the watchdog timer you mentioned isn’t a bad idea. Zephyr has a watchdog sample that you can use as an implementation guide. I checked and this board already has a watchdog0 node in the Devicetree.

Thank you @mike, I’ll try the watchdog timer. Btw, the board didn’t reset automatically after flashing so I wonder if the reset pin was set correctly. Where is this being defined?

FYI, the issue is intermittent. Over the weekend it was running for more than 29 hrs before I reset it accidentally. But today it stopped again after ~16 hrs.

uart:~$ [15:37:23.330,383] golioth_lightdb_stream: main: Sending temperature 22.000000
uart:~$ [15:37:23.602,203] golioth_lightdb_stream: temperature_push_sync: Temperature successfully sync pushed
uart:~$ [15:37:23.602,233] golioth_lightdb_stream: main: Sending humidity 62.000000
uart:~$ [15:37:23.843,231] golioth_lightdb_stream: humidity_push_sync: Humidity successfully sync pushed
uart:~$ uart:~$ [15:40:25.862,945] golioth_system: Receive timeout
uart:~$ uart:~$ [16:07:23.843,292] golioth_lightdb_stream: main: Sending temperature 22.500000

So far I’m 22 hours in and haven’t encountered the issue yet.

[22:15:11.331,420] <dbg> golioth_lightdb_stream: main: Sending temperature 24.000000
[22:15:11.977,966] <dbg> golioth_lightdb_stream: temperature_push_sync: Temperature successfully pushed
[22:15:16.978,027] <dbg> golioth_lightdb_stream: main: Sending temperature 24.500000
[22:15:18.063,171] <dbg> golioth_lightdb_stream: temperature_push_handler: Temperature successfully pushed
[22:15:21.978,759] <dbg> golioth_lightdb_stream: main: Sending temperature 25.000000

Do you get any other messages after the Receive Timeout? I would expect the stream to keep trying to send data and spitting out failure messages.

Regarding the reset pin: I also have the issue that the board needs manual reset after flashing. I don’t know if this is a problem with the board mapping or not. The workaround I’ve used is to flash it using:

west flash -r jlink --reset-after-load

I’ve tried putting the reset-after-load flag into the board_runner_args in board.cmake but haven’t have success with that yet.

Mine is still running without issue. It did reset the timestamp after 36 hours and 24 minutes but it didn’t hang or reboot:

[36:24:07.232,482] <dbg> golioth_lightdb_stream: main: Sending temperature 26.000000                                                                                                           
[36:24:07.646,057] <dbg> golioth_lightdb_stream: temperature_push_sync: Temperature successfully pushed                                                                                        
[36:24:12.646,148] <dbg> golioth_lightdb_stream: main: Sending temperature 26.500000                                                                                                           
[36:24:13.822,265] <dbg> golioth_lightdb_stream: temperature_push_handler: Temperature successfully pushed                                                                                     
[36:24:17.646,881] <dbg> golioth_lightdb_stream: main: Sending temperature 27.000000                                                                                                           
[36:24:18.294,433] <dbg> golioth_lightdb_stream: temperature_push_sync: Temperature successfully pushed                                                                                        
[36:24:23.294,525] <dbg> golioth_lightdb_stream: main: Sending temperature 27.500000                                                                                                           
[36:24:24.362,640] <dbg> golioth_lightdb_stream: temperature_push_handler: Temperature successfully pushed                                                                                     
[36:24:28.295,257] <dbg> golioth_lightdb_stream: main: Sending temperature 28.000000                                                                                                           
[36:24:28.837,829] <dbg> golioth_lightdb_stream: temperature_push_sync: Temperature successfully pushed                                                                                        
[00:00:01.837,890] <dbg> golioth_lightdb_stream: main: Sending temperature 28.500000                                                                                                           
[00:00:02.916,046] <dbg> golioth_lightdb_stream: temperature_push_handler: Temperature successfully pushed                                                                                     
[00:00:06.838,623] <dbg> golioth_lightdb_stream: main: Sending temperature 29.000000                                                                                                           
[00:00:07.396,209] <dbg> golioth_lightdb_stream: temperature_push_sync: Temperature successfully pushed                                                                                        
[00:00:12.396,301] <dbg> golioth_lightdb_stream: main: Sending temperature 29.500000                                                                                                           
[00:00:13.668,426] <dbg> golioth_lightdb_stream: temperature_push_handler: Temperature successfully pushed                                                                                     
[00:00:17.397,033] <dbg> golioth_lightdb_stream: main: Sending temperature 20.000000                                                                                                           
[00:00:18.143,585] <dbg> golioth_lightdb_stream: temperature_push_sync: Temperature successfully pushed   

Okay, good news, I was able to replicate the issue and we are working on it now. I’ll post back when I have more information:

[12:02:41.496,551] <dbg> golioth_lightdb_stream: main: Sending temperature 26.500000                                                                                                                                 
[12:02:42.577,697] <dbg> golioth_lightdb_stream: temperature_push_handler: Temperature successfully pushed                                                                                                           
[12:02:46.497,283] <dbg> golioth_lightdb_stream: main: Sending temperature 27.000000                                                                                                                                 
[12:02:47.055,847] <dbg> golioth_lightdb_stream: temperature_push_sync: Temperature successfully pushed                                                                                                              
[12:02:52.055,938] <dbg> golioth_lightdb_stream: main: Sending temperature 27.500000                                                                                                                                 
[12:02:53.226,074] <dbg> golioth_lightdb_stream: temperature_push_handler: Temperature successfully pushed                                                                                                           
[12:02:57.056,671] <dbg> golioth_lightdb_stream: main: Sending temperature 28.000000                                                                                                                                 
[12:02:59.563,751] <wrn> golioth: Resending request 0x20020e78 (reply 0x20020eac) (retries 2)                                                                                                                        
[12:03:00.587,280] <dbg> golioth_lightdb_stream: temperature_push_sync: Temperature successfully pushed                                                                                                              
[12:03:05.587,402] <dbg> golioth_lightdb_stream: main: Sending temperature 28.500000                                                                                                                                 
[12:03:07.803,833] <wrn> golioth: Resending request 0x20020e78 (reply 0x20020eac) (retries 2)                                                                                                                        
[12:03:10.588,134] <dbg> golioth_lightdb_stream: main: Sending temperature 29.000000                                                                                                                                 
[12:03:12.208,221] <wrn> golioth: Resending request 0x20020e78 (reply 0x20020eac) (retries 1)                                                                                                                        
[12:03:12.784,606] <wrn> golioth: Resending request 0x20020ed8 (reply 0x20020f0c) (retries 2)                                                                                                                        
[12:03:14.407,836] <dbg> golioth_lightdb_stream: temperature_push_handler: Temperature successfully pushed                                                                                                           
[12:03:15.368,804] <dbg> golioth_lightdb_stream: temperature_push_sync: Temperature successfully pushed                                                                                                              
[12:03:20.368,896] <dbg> golioth_lightdb_stream: main: Sending temperature 29.500000                                                                                                                                 
[12:03:22.794,311] <wrn> golioth: Resending request 0x20020f70 (reply 0x20020fa4) (retries 2)                                                                                                                        
[12:03:25.369,628] <dbg> golioth_lightdb_stream: main: Sending temperature 20.000000                                                                                                                                 
[12:03:27.624,145] <wrn> golioth: Resending request 0x20020f70 (reply 0x20020fa4) (retries 1)                                                                                                                        
[12:03:28.343,505] <wrn> golioth: Resending request 0x20020e78 (reply 0x20020eac) (retries 2)                                                                                                                        
[12:03:34.262,664] <wrn> golioth: Resending request 0x20020e78 (reply 0x20020eac) (retries 1)                                                                                                                        
[12:03:37.275,573] <wrn> golioth: Resending request 0x20020f70 (reply 0x20020fa4) (retries 0)                                                                                                                        
[12:03:46.119,018] <wrn> golioth: Resending request 0x20020e78 (reply 0x20020eac) (retries 0)                                                                                                                        
[12:03:54.888,458] <wrn> golioth_system: Receive timeout                                                                                                                                                             
uart:~$

Hi @mike, I’m glad that you’re able to replicate the issue. On my setup, I get one last attempt of sending data (see below) after the Receive Timeout before it got hang. No failure messages after that. Is it the same in your setup?

uart:~$ [15:37:23.602,233] golioth_lightdb_stream: main: Sending humidity 62.000000
uart:~$ [15:37:23.843,231] golioth_lightdb_stream: humidity_push_sync: Humidity successfully sync pushed
uart:~$ uart:~$ [15:40:25.862,945] golioth_system: Receive timeout
uart:~$ uart:~$ [16:07:23.843,292] golioth_lightdb_stream: main: Sending temperature 22.500000

Likely the loop was still running when the timeout occurred and you get one more send attempt before the synchronous call halts main thread flow.

To us the Receive Timeout appears to be the correct failure mode, but the system client should try to reconnect automatically, which isn’t happening. Thanks for reporting this, it will be nice to get it fixed!

Hi @mike, do you know what is causing the Receive Timeout, and were you able to find the root cause of why it wasn’t reconnect automatically?

Hi @invosa, sorry for the long delay without an update.

We have continued to troubleshoot this problem and have narrowed it down to the disconnect action not returning when the timeout happens. We’re continuing to drill down to pinpoint the problem (I have another test running now) with the time between replicating the failure being the blocker.

We’ll keep on it and post back here with updates. Apologies that I don’t have more concrete info for you right now.

Hi @invosa,

Our Firmware team just merged a fix for the Receive Timeout issue you reported. You can pull your Golioth SDK to the tip of main to incorporate the fix. If you’re curious, here is the actual commit.

Thank you again for reporting this issue. It was a bit difficult to replicate the failure and the info you sent over about it made it much easier for us to troubleshoot.

Hi @mike, thank you for the follow-up! I’ll pull the latest SDK with the fix and do the long-run test again. Will let you know if I see any other issues.

Hi @mike, I’m just curious. After I did west update, I checked the SDK hash but found it didn’t change and confirmed that the sockets_offload_poll_wrapper.c remains unchanged. I had to do git stash followed by manual git pull to get the latest fix. Isn’t west update supposed to do that for me?

Hi @invosa, this is one of those “it depends” answers :smiley:

If you installed your Zephr workspace following our quickstart guide, then the Golioth SDK is outside of the scope of west.

The west update command uses a west.yml manifest file to make changes. In this case, that file is called west-ncs.yml part of the Golioth Zephyr SDK. So the manual git pull updates the SDK and if there were changes to the manifest file (what version of NCS we’ve validated) then the west update will make sure to reset your workspace to match.

In contrast, if you used what we have been calling a “standalone repo”, we include Golioth via the manifest file. Looking at the contents of one such file, you can see that we specify the revision for Golioth, then import the west-zephyr.yml from the Golioth SDK. In this case, west update will pull the entire workspace, including Golioth.

The topic is further complicated because the Golioth Zephyr SDK can be used for both “vanilla” Zephyr and Nordic’s NCS flavor of Zephyr. This is why the Golioth SDK contains more than one west manifest.

Thanks @mike, that make sense.

Btw, I’ve got an update for the test with the fix. The “Receive timeout” still occurs (5 times over the course of 16 hours). But the good news is, it doesn’t get stuck anymore. However, just noticed that there’s a period where it stopped sending data as you can see from the log below, although it was able to recover about 4 hours later. I have also attached a graph that shows when this happened.

[2022-12-09 10:13:23.677] uart:~$ [11:02:12.197,723] golioth_lightdb_stream: humidity_push_sync: Humidity successfully sync pushed
[2022-12-09 10:13:23.693] uart:~$ uart:~$ [11:03:12.197,784] golioth_lightdb_stream: main: Sending temperature 27.500000
[2022-12-09 10:14:24.211] uart:~$ uart:~$ [11:03:14.663,604] golioth: Resending request 0x20020f90 (reply 0x20020fc8) (retries 2)
[2022-12-09 10:14:26.686] uart:~$ uart:~$ [11:03:19.567,169] golioth: Resending request 0x20020f90 (reply 0x20020fc8) (retries 1)
[2022-12-09 10:14:31.574] uart:~$ uart:~$ [11:03:29.128,234] golioth_system: Receive timeout
[2022-12-09 10:14:41.147] uart:~$ [11:03:29.133,239] golioth_system: Starting connect
[2022-12-09 10:14:41.147] uart:~$ uart:~$ [15:04:24.834,564] net_sock_tls: TLS handshake error: -4c
[2022-12-09 14:15:36.188] uart:~$ [15:04:24.842,834] golioth_system: Failed to connect: -113
[2022-12-09 14:15:36.204] uart:~$ [15:04:24.842,864] golioth_system: Failed to connect: -113
[2022-12-09 14:15:36.204] uart:~$ uart:~$ [15:04:29.842,956] golioth_system: Starting connect
[2022-12-09 14:15:41.198] uart:~$ uart:~$ [15:04:31.499,572] golioth_system: Client connected!
[2022-12-09 14:15:42.862] uart:~$ [15:04:31.499,633] golioth: Packet 0x20020f90 (reply 0x20020fc8) was not replied to
[2022-12-09 14:15:42.862] uart:~$ [15:04:31.499,664] golioth: req_sync finished with error -116
[2022-12-09 14:15:42.878] uart:~$ [15:04:31.499,664] golioth_lightdb_stream: Failed to push temperature: -116
[2022-12-09 14:15:42.878] uart:~$ [15:04:31.499,694] golioth_lightdb_stream: main: Sending humidity 77.500000
[2022-12-09 14:15:42.893] uart:~$ [15:04:31.795,898] golioth_lightdb_stream: humidity_push_sync: Humidity successfully sync pushed
[2022-12-09 14:15:42.910] uart:~$ uart:~$ [15:05:31.795,989] golioth_lightdb_stream: main: Sending temperature 28.000000