Golioth client connects after a client stop is queued, leading to a thread lock

Description

In my application, I have implemented a timeout for starting the Golioth client so that a failed start does not block it indefinitely. When the timeout expires, I call golioth_stop_client(). This all works. However, if the timeout expires while the client is still connecting, the stop call seems to have no effect, and a few seconds later the log reports that the client is connected. Because my application expects the client to be off at this point, this leads to a softlock.

In the log I posted, the timeout was 10 milliseconds, but the same thing happened with a more normal timeout of 10 seconds. It was just easier to reproduce with a very short timeout.

Steps to Reproduce

Stopping the client while it’s still trying to connect to Golioth.

Expected Behavior

The client is stopped.

Actual Behavior

The client is not stopping, and connects eventually.

Impact

Softlocks the device.

Environment

Golioth SDK 0.15.0, NCS 2.5.2, Zephyr 3.6.99

Logs and Console Output

[00:00:05.656,890] libcetus_datatransfer_rq_q: network_dtrqq: Queue provider connected, flushing 1 datatransfers
[00:00:05.657,257] libcetus_datatransfer_rq_q: underlying_request_cb: golioth_dtrqq: Underlying queue (network_dtrqq) called our request
[00:00:05.657,287] libcetus_datatransfer_rq_q: flush: golioth_dtrqq: Flushing the queue now
[00:00:05.657,318] golioth_client: start: Starting the Golioth client
[00:00:05.658,020] golioth_mbox: Mbox created, bufsize: 3696, num_items: 32, item_size: 112
[00:00:05.659,698] golioth_fw_update: Current firmware version: main - 0.2.14
[00:00:05.659,851] golioth_client: Failed to register reboot RPC
[00:00:05.669,982] libcetus_datatransfer_rq_q: golioth_dtrqq: Connection attempt timed out
[00:00:05.670,043] libcetus_datatransfer_rq_q: golioth_dtrqq: Queue provider failed to connect
[00:00:05.670,104] libcetus_datatransfer_rq_q: disconnect_provider: Finishing dtrq on underlying queue (network_dtrqq)
[00:00:05.670,135] libcetus_datatransfer_rq_q: network_dtrqq: All datatransfers completed, timed out or cancelled
[00:00:05.670,196] libcetus_datatransfer_rq_q: golioth_dtrqq: Disconnecting provider
[00:00:05.670,227] golioth_coap_client_zephyr: Attempting to stop client
[00:00:08.300,201] golioth_coap_client_zephyr: Golioth CoAP client connected
[00:00:08.300,262] golioth_client: Golioth client connected

Hey @kolozspe,

From an initial review, it seems that because the application is managing the lifecycle of the Golioth client (i.e., when it starts and stops), it’s also responsible for ensuring that those transitions happen in a safe state. In this case, calling golioth_stop_client() while the client is still in the middle of initializing or connecting may not be fully canceling the internal connection process, which leads to the unexpected behavior you’re seeing.
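
One way an application could keep its own view of the client consistent is a small guard that never forwards a stop while a connection attempt is in flight. The sketch below is an assumption about how this might be handled on the application side, not how the SDK behaves: every name in it (app_client_guard, guard_request_stop, demo_start, and so on) is hypothetical, and the do_start / do_stop callbacks are stand-ins for the real SDK start and stop calls. It also assumes that stopping an already connected client works reliably.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical application-side bookkeeping; none of this is Golioth SDK API. */
enum app_client_state {
    APP_CLIENT_STOPPED,
    APP_CLIENT_CONNECTING,
    APP_CLIENT_CONNECTED,
};

typedef void (*app_client_action)(void *ctx);

struct app_client_guard {
    enum app_client_state state;
    bool stop_pending;          /* stop was requested while still connecting */
    app_client_action do_start; /* stand-in for the SDK start call */
    app_client_action do_stop;  /* stand-in for the SDK stop call */
    void *ctx;
};

void guard_init(struct app_client_guard *g, app_client_action do_start,
                app_client_action do_stop, void *ctx)
{
    assert(g != NULL && do_start != NULL && do_stop != NULL);
    g->state = APP_CLIENT_STOPPED;
    g->stop_pending = false;
    g->do_start = do_start;
    g->do_stop = do_stop;
    g->ctx = ctx;
}

void guard_start(struct app_client_guard *g)
{
    if (g->state != APP_CLIENT_STOPPED) {
        return; /* already connecting or connected */
    }
    g->stop_pending = false;
    g->state = APP_CLIENT_CONNECTING;
    g->do_start(g->ctx);
}

/* Stop now if connected; while connecting, only remember the request. */
void guard_request_stop(struct app_client_guard *g)
{
    if (g->state == APP_CLIENT_CONNECTING) {
        g->stop_pending = true;
    } else if (g->state == APP_CLIENT_CONNECTED) {
        g->do_stop(g->ctx);
        g->state = APP_CLIENT_STOPPED;
    }
}

/* Call from whatever handler reports that the client has connected. */
void guard_on_connected(struct app_client_guard *g)
{
    if (g->state != APP_CLIENT_CONNECTING) {
        return; /* not expecting a connection; ignored in this sketch */
    }
    g->state = APP_CLIENT_CONNECTED;
    if (g->stop_pending) {
        g->stop_pending = false;
        g->do_stop(g->ctx); /* honor the deferred stop */
        g->state = APP_CLIENT_STOPPED;
    }
}

bool guard_is_off(const struct app_client_guard *g)
{
    return g->state == APP_CLIENT_STOPPED;
}

/* Demo stand-ins that only count how often start and stop were called. */
struct demo_counters {
    int starts;
    int stops;
};

void demo_start(void *ctx) { ((struct demo_counters *)ctx)->starts++; }
void demo_stop(void *ctx) { ((struct demo_counters *)ctx)->stops++; }
```

With this shape, the rest of the application would check guard_is_off() instead of assuming the client is off as soon as it has asked for a stop. A connection attempt that never completes is not covered here and still needs a separate failsafe.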

That said, I’ll flag this internally to the firmware team so we can take a closer look at how golioth_stop_client() behaves during the connection phase and whether improvements are needed around cancellation or state handling. I’ll follow up once we’ve had a chance to investigate further.

Thanks for reporting this!

Hello Marko,

Thanks for the quick reply! So am I understanding correctly that it’s not a good idea to try to stop the client before it connects? If so, is there any way to safely implement a timeout of some sort? We have had scenarios in the past (and present) where the connection didn’t work at all, even with an LTE connection present, so we definitely need a failsafe for these cases. Maybe this could even be a feature on your side in the future?

Another question would be the length of the timeout. In my development process and testing, I have never had a situation where the connection didn’t happen within 5 seconds, if it did happen. I tested this with up to 2 minutes of waiting time. Do you think a timeout of 10 seconds is reasonable?

EDIT: I just realized that the error first happened with a timeout of 10 seconds because the connection happens after 14 or so seconds, so it’s obviously not long enough. Maybe 20 seconds?

Hey @kolozspe,

You’re right: trying to stop the client before it connects can be tricky, mostly because the connection process isn’t really designed to be aborted cleanly mid-flight. That said, adding a failsafe like a timeout definitely makes sense, especially for robustness in production environments where LTE can be flaky or behave inconsistently.

In our experience, connection times can vary quite a bit depending on network conditions, SIM card, carrier, etc. So while 5–10 seconds might work in ideal dev/test setups, in the field it’s not uncommon to see it take considerably longer than that.

If you want to implement a timeout, one approach could be to run the connection logic in a separate thread or work queue and then use a timer to abort or soft-reset the system if it exceeds your limit. It’s not perfect, but it gives you a safety net.
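
As a rough illustration of the timer idea above, here is a minimal, platform-independent sketch of a connection watchdog. All names in it (connect_watchdog, watchdog_tick, demo_count_expiry, and so on) are hypothetical and not part of the Golioth SDK or Zephyr. It assumes something calls watchdog_tick() periodically with the elapsed time; on Zephyr that could be a k_timer or a k_work_delayable item, and the expiry handler could be a soft reset such as sys_reboot(). Here the handler is just a callback so the logic can be tried on a host machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum watchdog_verdict {
    WD_WAITING,   /* still within the limit, not connected yet */
    WD_CONNECTED, /* connection was reported before the limit */
    WD_EXPIRED,   /* limit exceeded, expiry handler has run */
};

struct connect_watchdog {
    uint32_t limit_ms;
    uint32_t elapsed_ms;
    bool connected;
    bool fired;
    void (*on_expired)(void *ctx); /* e.g. abort or soft-reset (assumption) */
    void *ctx;
};

void watchdog_arm(struct connect_watchdog *wd, uint32_t limit_ms,
                  void (*on_expired)(void *ctx), void *ctx)
{
    assert(wd != NULL && on_expired != NULL);
    wd->limit_ms = limit_ms;
    wd->elapsed_ms = 0;
    wd->connected = false;
    wd->fired = false;
    wd->on_expired = on_expired;
    wd->ctx = ctx;
}

/* Call when the client reports a connection. Ignored after expiry. */
void watchdog_notify_connected(struct connect_watchdog *wd)
{
    if (!wd->fired) {
        wd->connected = true;
    }
}

/* Advance the watchdog by delta_ms; runs the expiry handler at most once. */
enum watchdog_verdict watchdog_tick(struct connect_watchdog *wd,
                                    uint32_t delta_ms)
{
    if (wd->connected) {
        return WD_CONNECTED;
    }
    if (wd->fired) {
        return WD_EXPIRED;
    }
    /* Saturating add so a long-running wait cannot wrap around. */
    if (delta_ms > UINT32_MAX - wd->elapsed_ms) {
        wd->elapsed_ms = UINT32_MAX;
    } else {
        wd->elapsed_ms += delta_ms;
    }
    if (wd->elapsed_ms >= wd->limit_ms) {
        wd->fired = true;
        wd->on_expired(wd->ctx);
        return WD_EXPIRED;
    }
    return WD_WAITING;
}

/* Demo expiry handler: counts invocations instead of resetting anything. */
void demo_count_expiry(void *ctx)
{
    (*(int *)ctx)++;
}
```

The point of this design is that the watchdog never tries to cancel the connection attempt itself. It only decides when the limit has been exceeded and hands control to a recovery action, which avoids calling stop in the middle of the connection phase.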

I also agree that this would be a good candidate for a future feature that handles edge cases more gracefully. The firmware team is actively looking into this and will follow up on this thread.
