Golioth_system: Receive timeout

Hi @mike, I repeated the test and got another TLS handshake error. Unfortunately, this time it never recovered. It didn’t freeze, just repeating the same error messages for almost 3 days until I reset the board. Please see the error message below.

[2022-12-12 21:04:31.575] uart:~$ [09:49:23.261,749] golioth_lightdb_stream: temperature_push_sync: Temperature successfully sync pushed
[2022-12-12 21:04:31.591] uart:~$ [09:49:23.261,749] golioth_lightdb_stream: main: Sending humidity 70.500000
[2022-12-12 21:04:31.591] uart:~$ [09:49:23.504,760] golioth_lightdb_stream: humidity_push_sync: Humidity successfully sync pushed
[2022-12-12 21:04:31.607] uart:~$ uart:~$ [09:50:23.504,821] golioth_lightdb_stream: main: Sending temperature 21.000000
[2022-12-12 21:05:32.136] uart:~$ [09:50:23.833,801] golioth_lightdb_stream: temperature_push_sync: Temperature successfully sync pushed
[2022-12-12 21:05:32.152] uart:~$ [09:50:23.833,831] golioth_lightdb_stream: main: Sending humidity 71.000000
[2022-12-12 21:05:32.152] uart:~$ [09:50:24.073,852] golioth_lightdb_stream: humidity_push_sync: Humidity successfully sync pushed
[2022-12-12 21:05:32.167] uart:~$ uart:~$ [09:50:29.484,313] nrf_modem: Modem error: 0x10, PC: 0x75d1e
[2022-12-12 21:05:38.111] uart:~$ uart:~$ [09:50:42.080,017] net_sock_tls: TLS handshake error: -4e
[2022-12-12 21:05:50.714] uart:~$ uart:~$ [09:50:51.085,479] net_sock_tls: TLS handshake error: -4e
[2022-12-12 21:05:59.722] uart:~$ uart:~$ [09:50:54.073,699] golioth_system: Receive timeout
[2022-12-12 21:06:02.710] uart:~$ [09:50:54.077,789] golioth_system: Starting connect
[2022-12-12 21:06:02.710] uart:~$ [09:50:54.077,850] golioth: Fail to get address (coap.golioth.io 5684) -11
[2022-12-12 21:06:02.725] uart:~$ [09:50:54.077,850] golioth_system: Failed to connect: -11
[2022-12-12 21:06:02.725] uart:~$ [09:50:54.077,880] golioth_system: Failed to connect: -11
[2022-12-12 21:06:02.741] uart:~$ uart:~$ [09:50:59.077,972] golioth_system: Starting connect
[2022-12-12 21:06:07.714] uart:~$ [09:50:59.078,033] golioth: Fail to get address (coap.golioth.io 5684) -11
[2022-12-12 21:06:07.714] uart:~$ [09:50:59.078,063] golioth_system: Failed to connect: -11
[2022-12-12 21:06:07.730] uart:~$ [09:50:59.078,063] golioth_system: Failed to connect: -11
[2022-12-12 21:06:07.730] uart:~$ uart:~$ [09:51:04.078,186] golioth_system: Starting connect

Hi @invosa, sorry you keep running into these issue, but I appreciate you reporting them so that we can take a look.

For now, would you be able to pull your Golioth SDK revision up to v0.5.0? I’m unsure if it will fix this specific issue, but there are networking improvements in that release that may help.

Hi @mike, I have updated the firmware with SDK v0.5.0 and re-running it.

Btw, could you explain what the error messages mean, for both the initial receive timeout TLS handshake error?

These numbers are all based on the Zephyr error number definitions: Zephyr API Documentation: Error numbers

Thanks @mike. I looked into the error codes but it’s still not clear whether the error is caused by the SDK, the board, or the AT&T connection. There’s no error code for the ‘Receive timeout’, what kind of timeout is it?

The test with SDK v0.5.0 also failed, now with a different error code which seems to be a network issue.

uart:~$ [12:15:20.779,815] golioth_lightdb_stream: humidity_push_sync: Humidity successfully sync pushed
rt:~$ uart:~$ [12:16:00.276,580] golioth_system: Receive timeout
uart:~$ [12:16:00.290,069] golioth_system: Starting connect
uart:~$ uart:~$ [12:16:20.779,907] golioth_lightdb_stream: main: Sending temperature 25.500000
uart:~$ [12:16:20.780,609] golioth: Failed to make CoAP request: -115
uart:~$ [12:16:20.780,609] golioth_lightdb_stream: Failed to push temperature: -115
uart:~$ [12:16:20.780,609] golioth_lightdb_stream: main: Sending humidity 75.500000
uart:~$ [12:16:20.781,280] golioth: Failed to make CoAP request: -115
uart:~$ [12:16:20.781,280] golioth_lightdb_stream: Failed to push humidity: -115

Somehow, I think all of these errors are still related with the initial problem that I reported which is the ‘Receive timeout’. What do you think?

Have you done long run tests on other boards, and if yes, do you see it happening as well or is it just particular to this board?

Happy New Year @invosa!

You’re right, it would be helpful if we had more information on exactly what those error codes are indicating. We’ll work on a troubleshooting section for our SDK that explains common causes of these errors and how to fix them.

I’ve set up an nrf9160dk this morning to see if I can replicate the issues you’ve been having with
CoAP requests. I’ll keep an eye on it and post my results here.

Hi @Invosa,

I was able to reproduce the issue on an nRF9160dk. I’m seeing the Failed to make CoAP request -115 after running the LightDB sample for about 43 hours.

We’re looking into this to see why the client isn’t reestablishing a connection. I’ll post updates as we have more info. Thanks!

Hi @mike, thanks for looking into this. Have you figured out what the root cause is and hopefully a permanent solution?

Hi @invosa, we are still working on this issue and don’t have a definitive root cause. From our research we suspect there may be an LTE issue and are investigating that area but we don’t yet have evidence to pinpoint it. We are continuing to troubleshoot.