Golioth_client_stop is not stopping the client. Locks the thread and reset by watchdog timer

Description

This issue occurs when the golioth_client_stop(client); function is called after a connection attempt has timed out. The function gets stuck in while loop, causing a thread lock that is eventually reset by the Watchdog Timer (WDT). The last log message observed is "Attempting to stop client".

Steps to Reproduce

I need some guide here, how can I reproduce a connection timeout with/without changing client parameters?
I don’t want to disconnect LTE, also not reduce connection timeout as normally the connection works in 2-5 seconds and keeping the timeout less than 3 seconds lead to other issues of getting the connection after trying to stop the client at timeout which was already reported back in time. However I have attached the logs of last night when this issue happened.

Expected Behavior

The golioth_client_stop() function should successfully exit the while loop, shut down the client, and return GOLIOTH_OK as it does in normal scenerio and print the log msg after golioth_client_stop(client);

static void stop() {
    LOG_INF("Golioth client stopping");
     golioth_client_stop(client);
     LOG_INF("Golioth client stopped");
}

Actual Behavior

LOG_INF("Golioth client stopped");, this msgs is never printed and the firmware is rebooted by watchdog timer after the wdt timeout.

Impact

The device stops responding and is reset by WDT.

Environment

The hardware is nrf9160 based custom board working completely fine in normal cases.
Golioth sdk: 0.15.0
NCS: 2.7.0
zephyr os: v3.6.99-ncs2 (nordic fork)

Connection timeout is set to 30 seconds.

Logs and Console Output

Normal output when everything works fine:

[02:50:01.262,512] <inf> buffer: Publishing 1 datapoint
[02:50:01.615,203] <inf> rf_manager: RRC mode: connected
[02:50:14.018,310] <inf> golioth_coap_client_zephyr: Golioth CoAP client connected
[02:50:14.018,341] <inf> golioth_client: Golioth client connected
[02:50:14.020,233] <inf> golioth_coap_client_zephyr: Entering CoAP I/O loop
[02:50:15.240,844] <inf> golioth_settings: Setting ACCELERATION_OBA_INTERRUPT_THRESHOLD = 2.000000
[02:50:15.343,231] <inf> golioth_rpc: RPC observation established
[02:50:15.413,024] <inf> golioth_fw_update: Received OTA manifest
[02:50:15.413,024] <inf> golioth_fw_update: Manifest does not contain different firmware version. Nothing to do.
[02:50:15.413,024] <inf> golioth_fw_update: Waiting to receive OTA manifest
[02:50:15.624,023] <inf> golioth_uplink: Successfully published data
[02:50:15.624,145] <inf> golioth_dtrqq: All datatransfers completed
[02:50:17.040,100] <inf> golioth_dtrqq: Disconnecting provider
[02:50:17.040,130] <wrn> golioth_client: Golioth client stopping
[02:50:17.040,161] <inf> golioth_coap_client_zephyr: Attempting to stop client
[02:50:17.040,252] <inf> golioth_coap_client_zephyr: Stop request
[02:50:17.040,252] <inf> golioth_coap_client_zephyr: Ending session
[02:50:17.040,252] <inf> golioth_client: Golioth client disconnected
[02:50:18.140,930] <wrn> golioth_client: Golioth client stopped
[02:50:30.015,655] <inf> rf_manager: RRC mode: idle

Logs when it is not working properly:

[02:51:15.262,512] <inf> buffer: Publishing 1 datapoint
[02:51:42.338,897] <inf> rf_manager: RRC mode: idle
[02:51:45.262,786] <wrn> golioth_dtrqq: Connection attempt timed out
[02:51:45.263,061] <wrn> golioth_dtrqq: Queue provider failed to connect
[02:51:45.263,122] <inf> golioth_dtrqq: All datatransfers timed out or cancelled
[02:51:45.263,305] <inf> golioth_dtrqq: Disconnecting provider
[02:51:45.263,305] <wrn> golioth_client: Golioth client stopping
[02:51:45.263,305] <inf> golioth_coap_client_zephyr: Attempting to stop client
[00:00:00.518,157] <inf> LIS2DW12: lis2dw12@19: int-pin is on INT1
[00:00:00.518,218] <inf> LIS2DW12: lis2dw12@19: int on [email protected]

Hi @shahab.amjad, thank you for reporting! I believe this report may match one from your colleague: Golioth client connects after a client stop is queued, leading to a thread lock

Can you confirm that this is the same behavior you are experiencing?

Hi hasheddan,

That was a different issue:

keeping the timeout less than 3 seconds lead to other issues of getting the connection after trying to stop the client at timeout which was already reported back in time

This is not the issue I am reporting right now, I just mentioned it to ask what can I do to reproduce the connection issue because setting the timeout to very low value leads to that issue mentioned in other thread.

What I am talking about now is that we implemented 30 seconds timeout and even 60 seconds and sometimes when the golioth connection is not available, after this 30-60 seconds timeout we call the golioth_client_stop(client); function and this locks the thread and the last message we see is Attempting to stop client

Why that is happening and what is the fix for it?

Should I implement the golioth_client_destroy(client); and then recreate the client when the connection times out?

EDIT: I used the golioth_client_destroy(client); last night for testing and the same issue happens. This function has the golioth_client_stop(client); inside as well.

And what can I do to reproduce the issue that golioth connection is not available to the device so I can do some other workaround and testing because this is happening to some of the devices and we are loosing data.

Like for testing the transfer timeout what I do is , I disable the device from golioth, in that way the device gets the connection but cannot transfer data so this is a good testing for transfer failures. Similar to this what can I do to prevent the golioth connection without disabling LTE?, may be manipulating some parameters of the static struct golioth_client *client = NULL;

@shahab.amjad, we tried to reproduce the issue on our side, but could not do so. We tested with SDK v0.15.0, v0.18.0, and the tip of main, and in all cases the client stopped as expected. The client also stops correctly while the modem is still attaching to the network, regardless of whether the device is enabled or disabled in the console. Additionally, disabling the device in the console while it was connected to Golioth, and then stopping the client, did not reproduce the issue either.

Could you share more about your testing scenario and how your application manages the client, network and modem control? Do you perhaps have a minimal application we could use to reproduce the behavior on our side?

If possible, please also test with this branch: GitHub - golioth/golioth-firmware-sdk at test_client_stop_hang

On our side, the client stopped correctly when using it, so we’d like to know if you observe the same behavior.

Hi Marko,

I am a bit busy and can’t find time to make a small application but I can do it next week for you. However, this is the part of the code:

// NOTE: this is not the whole code, just the relevent part.
static struct connect_trigger_result
start(connecting_result_cb_fn result_callback) {

struct connect_trigger_result result = {0};

// connect_result_cb to be called in 'on_client_event' to stop the timeout timer and start datatransfer
connect_result_cb = result_callback;

LOG_INF("Starting the Golioth client");

if (client == NULL) {
config.credentials = *golioth_get_credentials();
client = golioth_client_create(&config);

   if (client == NULL) {
       LOG_ERR("Failed to create Golioth client");
       result.status = CONNECTION_STATUS_UNABLE_TO_CONNECT;
       return result;
   }

golioth_client_register_event_callback(client, on_client_event, result_callback);

golioth_settings_init(client);

#if defined(CONFIG_GOLIOTH_FW_UPDATE)
golioth_fw_update_init(client, APP_VERSION_STRING);
golioth_fw_update_register_state_change_callback(on_fw_update_state_change, NULL);
#endif /* defined(CONFIG_GOLIOTH_FW_UPDATE) */

#if defined(CONFIG_REBOOT)
struct golioth_rpc *rpc = golioth_rpc_init(client);
if (!golioth_rpc_register(rpc, "reboot", on_reboot, NULL)) {
LOG_ERR("Failed to register reboot RPC");
}
#endif /* defined(CONFIG_REBOOT) */

    } else {

golioth_client_start(client);

    }

// This return status will start the timer and will submit the work to call the stop function
//  on success this timer is stopped in the 'connect_result_cb'

result.connect_timeout = CLIENT_START_TIMEOUT; // 30 seconds
result.status = CONNECTION_STATUS_CONNECTING;
return result;
}

static void stop() {

LOG_INF("Golioth client stopping");
golioth_client_stop(client);
LOG_INF("Golioth client stopped");

}

These are start and stop functions for golioth, now to reproduce the issue what I did was to change the port for golioth client as follows:

Golioth sdk: 0.15.0

NCS: 2.7.0

zephyr os: v3.6.99-ncs2 (nordic fork)

//      file: coap_client_zephyr.c line 997

static int golioth_connect(struct golioth_client *client)

{
char uri[] = CONFIG_GOLIOTH_COAP_HOST_URI;
char *host = &uri[sizeof("coaps://") - 1];
char ipv6_addr[40];
const char *port = "5684";  // line 997
char *colon;
int err;

static uint8_t counter = 0;

   if (counter < 2) {
      counter++;
      port = "5684"; // right
   } else {
      counter = 0;
      port = "5685"; // wrong port
   }

//rest of the code below unchanged

This would make the golioth client to timeout every 3rd time it tries to connect and at timeout when the stop function is called it never comes out the while loop as mentioned before

enum golioth_status golioth_client_stop(struct golioth_client *client)
{
   if (!client)
   {
     return GOLIOTH_ERR_NULL;
   }
GLTH_LOGI(TAG, "Attempting to stop client");

k_sem_take(&client->run_sem, K_NO_WAIT);
atomic_set_bit(&flags, FLAG_STOP_CLIENT);
golioth_client_wakeup(client);

   // Wait for client to be fully stopped
   while (golioth_client_is_running(client))
   {
      // gets stuck here, confirmed by log msgs
      LOG_ERR("stuck");
      golioth_sys_msleep(100);
   }
   return GOLIOTH_OK;
}

@shahab.amjad, I ran some tests with the SDK changes you shared. Even when I change the port number in coap_client_zephyr.c, the client still stops after the DTLS handshake times out.

If my diagnosis is correct, what’s happening is that the CoAP thread doesn’t check the stop flag until after it finishes trying to connect, so it blocks until the handshake fails, and if your watchdog timer fires earlier than that, the MCU resets before zsock_connect has a chance to return. That would explain why the last log you see is “Attempting to stop client” before the MCU resets.

To verify this, I’d recommend temporarily increasing your WDT timeout so the DTLS handshake has time to fail and the client can stop cleanly before the WDT triggers.

Let us know what you find after adjusting the WDT.

Hi Marko,

Any idea what time should I set for WDT, I had previously 30 seconds and even 60 seconds and it still fails to stop.

But anyways I will test it with even larger value later.

For testing purposes, I recommend setting the WDT to 120 seconds, or even up to 300 seconds (5 minutes), so we can confirm that once the DTLS handshake times out, the client stops when golioth_client_stop() is called. Alternatively, you can disable the WDT temporarily. Based on my testing, the client should stop within about 120 seconds, but extending or disabling the WDT will help verify that behavior.

Hey @shahab.amjad, circling back on this topic, did increasing the WDT timeout help to give the DTLS handshake enough time to fail gracefully, allowing the client to stop cleanly?

Hi marko,

Sorry I did not run the test yet, running busy these days. Thanks for the reminder, I will do the test in the next 2 days, may be tomorrow and let you know.

Hi Marko,

I ran the tests and the client is stopping after 2 minutes and 2-5 seconds.

Now what is the solution to this. 2 minutes are okay for testing purposes but I would not like to wait for 2 minutes every time the connection fails as some of our products need frequent data publishing, sometimes every minute and a golioth connection failure there is a mess.

Currently a work around for golioth connection failure is that I disconnect lte and connect it again instead of even trying to disconnect the golioth client directly. Currently this is working for us but this is a dirty solution and a proper solution from golioth sdk would be the best thing.

Hey @shahab.amjad,

The DTLS handshake timeout in the Zephyr tree is what’s blocking golioth_client_stop() function. Unfortunately, the handshake process can’t be interrupted once it starts, and the timeout value can’t be set to zero. However, we could introduce a new Kconfig symbol to make the timeout configurable, which you could then adjust according to your preference.

For the products that need to stream data every minute, is that a strict requirement that the data in the cloud must always be less than one minute old, or could you batch and send updates less frequently?

Also, just to clarify: you’re able to connect to the network, but the connection to Golioth fails during the DTLS handshake? If so, how reliable is the cellular connection?

Hi Marko,

Yes the Kconfig option would be good.

We have both settings, sampling internal and sending interval and usually the better option is to sample frequently and send after hours so we save alot of power. But some customers require frequent sending as well for live location purposes. That is the only area we currently need this one minute sending interval for specific customers.

The golioth connection failure happens when lte is relatively weak at some locations. So when the device tries to send data, the golioth connection fails, that is understandable but it should stop properly at timeout as well so that the device is not locked. But for now we can set WDT timeout for more than 2 minutes and wait for the Kconfig option.

Thanks for your help.

Hey @shahab.amjad,

We’ve added new Kconfig symbols to the SDK to make the DTLS handshake timeout configurable (based on MBEDTLS_SSL_DTLS_TIMEOUT_DFL from mbedTLS).

The new symbols are:

  • GOLIOTH_ZEPHYR_DTLS_HANDSHAKE_TIMEOUT_MIN_MS
  • GOLIOTH_ZEPHYR_DTLS_HANDSHAKE_TIMEOUT_MAX_MS

By default, both are set to -1, which means the Zephyr defaults will be used (1000 and 60000 ms, respectively).

This update applies only to the Zephyr port, and you can find more details in this PR.

Hi Marko,

Thanks for the update. Great to hear that I will implement it in my firmware soon.

1 Like