System hangs calling golioth_client_stop(client)

In Zephyr SDK 0.7.1 I called golioth_system_client_stop(client) and it worked properly.
Porting to Software SDK 0.11.1 (west-ncs.yml) the call is now golioth_client_stop(client) and my system will hang until the WDT resets it whenever I call it.
The documentation states that “This function will block until the client thread is actually stopped” and I’ve confirmed that it’s hanging in coap_client.c while loop.

    while (golioth_client_is_running(client))
    {
        golioth_sys_msleep(1000);
    }

What would cause the golioth client thread to refuse to stop?

Hey @john.stuewe, that is indeed a different behavior than the one in Zephyr SDK. I’ll run this by the Firmware Team and get back to you.

Let me know what you find out. I haven’t found a way around this.
With Golioth being an IOT company and IOT needing to run from battery I’m very surprised there aren’t more examples turning the radio and Golioth connection on and off.

@john.stuewe, we confirmed that the golioth_client_stop() isn’t working as expected, and we are treating it as a bug. We’ll include this fix in our next SDK release, which we expect to happen in the very near future.

I experience the same issue. I agree with @john.stuewe about the low power examples. It would be really nice to have a good example how to handle sleep modes. I opened an issue on the Reference design template github in december last year but no comments yet. (Add support for low power applications · Issue #71 · golioth/reference-design-template · GitHub)

@john.stuewe, @thijs, we agree that it would be nice to have more examples of low power and sleep modes and we have made some strides in that direction with a series of blog posts to show users how to start with that endeavour. Low power and sleep modes are part of the application layer and it’s up to the user to design their application and product with power constraints in mind and should seek information in relevant vendor documentation for the MCU as well as the example code.

We will continue to make new content for “sleepy” devices and incorporate low power modes into the Reference Design Template as soon as our resources allow. Until then, we welcome every contribution to the RD Template and we’ll gladly collaborate with the open-source community to help us achieve that goal.

Thanks! If the golioth_client_stop() function is working again in the coming update I can provide some examples how I would use it. Until now I did not find another way to stop the traffic from the application to the server. Because of this PSM mode will never be activated.

If you have some tips other than stopping the golioth service I would be happy to test them.

I played a bit with the CONFIG_GOLIOTH_COAP_KEEPALIVE_INTERVAL_S settings but that does not show any difference for me.

This is a blocking issue for me. Can you provide any workaround while we’re waiting on the new SDK?

1 Like

I don’t see this fix in the release notes for Release v0.12.0 · golioth/golioth-firmware-sdk · GitHub. Does this release fix this issue?

I’ve confirmed that Golioth 0.12.0 does not fix this issue. If Golioth won’t stop then I can’t turn off the modem so this SDK isn’t usable in a battery operated configuration.

Any ETA on a fix??

Hey @john.stuewe, the golioth_client_stop issue is handled in this PR. After approval and merging, we’ll create a release that will include this bugfix.

1 Like

I’ve confirmed that this code does eventually stop but it has an odd delay. Every time golioth_client_stop(client) is called it takes this routine 10.026 seconds to return.
Why is there an artificial 10 second delay to stop the Golioth client?

Hey @john.stuewe, tested with the hello example and the Golioth client is stopped in ~1 second. There is no artificial delay to stop the Golioth client.
The Golioth client is stopped in the golioth_coap_client_thread() using the client->run_sem semaphore.

What would cause it to stop for 10 seconds then?
If I add a LOG_DBG in coap_client.c it always sends out 10 logs before stopping.

diff --git a/deps/modules/lib/golioth-firmware-sdk/src/coap_client.c b/../golioth11/deps/modules/lib/golioth-firmware-sdk/src/coap_client.c
index bbd8616..71fe623 100644
--- a/deps/modules/lib/golioth-firmware-sdk/src/coap_client.c
+++ b/../golioth11/deps/modules/lib/golioth-firmware-sdk/src/coap_client.c
@@ -56,7 +56,6 @@ enum golioth_status golioth_client_stop(struct golioth_client *client)
     while (golioth_client_is_running(client))
     {
         golioth_sys_msleep(1000);
+       LOG_DBG("Waiting for Golioth to stop\n");
     }

     return GOLIOTH_OK;

Hey @john.stuewe, the golioth_client_stop PR was just merged to the Firmware SDK.
I am unsure at the moment why it would print out exactly 10 logs before stopping, the while loop is “waiting” for the client to gracefully stop by checking the client->is_running variable.

One thing that stands out from the code snippet above is that golioth_sys_msleep is set to 1 second, while in the Firmware SDK is set to 100ms.

That appears to be my debugging to slow down the logging so it didn’t generate 100 logs waiting to shut down. Any ideas why it would take 10 seconds?

Even with a 100 ms sleep in the while loop, it takes 10 seconds?

I put it back to the default 100ms and see ~88-93 loops before it shuts down.

         dtimelog( "start golioth_client_stop");
         golioth_client_stop(client);
         dtimelog("golioth client stopped.");

I have the timing routine above called every 5 minutes and it returned 9062ms, 0565ms, 9058ms, and 9059ms on four loops. It’s just over 9 seconds which would have rounded to 10 seconds with a 1 second delay.

Hey @john.stuewe, thank you for the detailed response. We are looking into it and will let you know if we need more information.

Still having odd issues every time I change something and without examples it’s hard to know what the sequence is. I’m trying to reproduce it in the reference template and it will work for several cycles and then reboot shortly after calling golioth_client_start with SDK 0.12.1. It typically has a ~2 second shutdown delay and the code ignores k_sleep statements without delaying which is very odd.

Let me know if there are issues in this sequence.

/*
 * Copyright (c) 2022-2023 Golioth, Inc.
 *
 * SPDX-License-Identifier: Apache-2.0
 */
//Note: reference-design-template/boards/aludel_mini_v1_sparkfun9160_ns.conf edited to remove CONFIG_LIB_OSTENTUS

#include <zephyr/logging/log.h>
LOG_MODULE_REGISTER(golioth_rd_template, LOG_LEVEL_DBG);

#include "app_rpc.h"
#include "app_settings.h"
#include "app_state.h"
#include "app_sensors.h"
#include <golioth/client.h>
#include <golioth/fw_update.h>
#include <samples/common/net_connect.h>
#include <samples/common/sample_credentials.h>
#include <zephyr/kernel.h>

#ifdef CONFIG_SOC_NRF9160
#include <modem/lte_lc.h>
#endif
#ifdef CONFIG_LIB_OSTENTUS
#include <libostentus.h>
#endif
#ifdef CONFIG_ALUDEL_BATTERY_MONITOR
#include "battery_monitor/battery.h"
#endif

#include <zephyr/drivers/gpio.h>

#ifdef CONFIG_MODEM_INFO
#include <modem/modem_info.h>
#endif

/* Current firmware version; update in prj.conf or via build argument */
static const char *_current_version = CONFIG_MCUBOOT_IMGTOOL_SIGN_VERSION;

static struct golioth_client *client;
K_SEM_DEFINE(connected, 0, 1);

static k_tid_t _system_thread = 0;

#if DT_NODE_EXISTS(DT_ALIAS(golioth_led))
static const struct gpio_dt_spec golioth_led = GPIO_DT_SPEC_GET(DT_ALIAS(golioth_led), gpios);
#endif /* DT_NODE_EXISTS(DT_ALIAS(golioth_led)) */
static const struct gpio_dt_spec user_btn = GPIO_DT_SPEC_GET(DT_ALIAS(sw1), gpios);
static struct gpio_callback button_cb_data;

/* forward declarations */
void golioth_connection_led_set(uint8_t state);

void wake_system_thread(void)
{
	k_wakeup(_system_thread);
}

//debug routine for timing functions.
void dtimelog(int lnum, char *myString)
{
   static int64_t uptime_last[10];
   static int deltatime;
   if(uptime_last[lnum] == 0) uptime_last[lnum] = k_uptime_get();
   deltatime = k_uptime_get() - uptime_last[lnum]; // initial readtime is init time
   if(strlen(myString) > 0) {
      LOG_WRN("%5ims %s", deltatime, myString);
   }
   uptime_last[lnum] = k_uptime_get();
}

static void on_client_event(struct golioth_client *client,
			    enum golioth_client_event event,
			    void *arg)
{
   switch(event) {
   case GOLIOTH_CLIENT_EVENT_DISCONNECTED:
      k_sem_take(&connected, K_NO_WAIT); // Do not wait if sem take fails
      break;
   case GOLIOTH_CLIENT_EVENT_CONNECTED:
      k_sem_give(&connected);
      golioth_connection_led_set(1);
//      golioth_fw_update_init(client, _current_version);
//      app_state_observe(client);
//      app_sensors_set_client(client);
//      app_settings_register(client);
//      app_rpc_register(client);
   }
   LOG_INF("Golioth client: %s", k_sem_count_get(&connected) ? "connected" : "disconnected");
}

static void start_golioth_client(void)
{
	/* Get the client configuration from auto-loaded settings */
	const struct golioth_client_config *client_config = golioth_sample_credentials_get();

	/* Create and start a Golioth Client */
	client = golioth_client_create(client_config);

	/* Register Golioth on_connect callback */
	golioth_client_register_event_callback(client, on_client_event, NULL);

	/* Initialize DFU components */
	golioth_fw_update_init(client, _current_version);


	/*** Call Golioth APIs for other services in dedicated app files ***/

	/* Observe State service data */
	app_state_observe(client);

	/* Set Golioth Client for streaming sensor data */
	app_sensors_set_client(client);

	/* Register Settings service */
	app_settings_register(client);

	/* Register RPC service */
	app_rpc_register(client);
}

#ifdef CONFIG_SOC_NRF9160

static void lte_handler(const struct lte_lc_evt *const evt)
{
	if (evt->type == LTE_LC_EVT_NW_REG_STATUS) {

		if ((evt->nw_reg_status == LTE_LC_NW_REG_REGISTERED_HOME) ||
		    (evt->nw_reg_status == LTE_LC_NW_REG_REGISTERED_ROAMING)) {

			/* Change the state of the Internet LED on Ostentus */
			IF_ENABLED(CONFIG_LIB_OSTENTUS, (led_internet_set(1);));

			if (!client) {
				/* Create and start a Golioth Client */
				start_golioth_client();
			}
		}
	}
}
#endif /* CONFIG_SOC_NRF9160 */

#ifdef CONFIG_MODEM_INFO
static void log_modem_firmware_version(void)
{
	char sbuf[128];

	/* Initialize modem info */
	int err = modem_info_init();

	if (err) {
		LOG_ERR("Failed to initialize modem info: %d", err);
	}

	/* Log modem firmware version */
	modem_info_string_get(MODEM_INFO_FW_VERSION, sbuf, sizeof(sbuf));
	LOG_INF("Modem firmware version: %s", sbuf);
}
#endif

void button_pressed(const struct device *dev, struct gpio_callback *cb, uint32_t pins)
{
	LOG_DBG("Button pressed at %d", k_cycle_get_32());
	/* This function is an Interrupt Service Routine. Do not call functions that
	 * use other threads, or perform long-running operations here
	 */
	k_wakeup(_system_thread);
}

/* Set (unset) LED indicators for active Golioth connection */
void golioth_connection_led_set(uint8_t state)
{
	uint8_t pin_state = state ? 1 : 0;
#if DT_NODE_EXISTS(DT_ALIAS(golioth_led))
	/* Turn on Golioth logo LED once connected */
	gpio_pin_set_dt(&golioth_led, pin_state);
#endif /* #if DT_NODE_EXISTS(DT_ALIAS(golioth_led)) */
	/* Change the state of the Golioth LED on Ostentus */
	IF_ENABLED(CONFIG_LIB_OSTENTUS, (led_golioth_set(pin_state);));
}

int main(void)
{
	int err;

	LOG_DBG("Start Reference Design Template sample");

	LOG_INF("Firmware version: %s", CONFIG_MCUBOOT_IMGTOOL_SIGN_VERSION);
	IF_ENABLED(CONFIG_MODEM_INFO, (log_modem_firmware_version();));

	/* Get system thread id so loop delay change event can wake main */
	_system_thread = k_current_get();

#if DT_NODE_EXISTS(DT_ALIAS(golioth_led))
	/* Initialize Golioth logo LED */
	err = gpio_pin_configure_dt(&golioth_led, GPIO_OUTPUT_INACTIVE);
	if (err) {
		LOG_ERR("Unable to configure LED for Golioth Logo");
	}
#endif /* #if DT_NODE_EXISTS(DT_ALIAS(golioth_led)) */

	/* Start LTE asynchronously if the nRF9160 is used.
	 * Golioth Client will start automatically when LTE connects
	 */

	LOG_INF("Connecting to LTE, this may take some time...");
	lte_lc_init_and_connect_async(lte_handler);

	/* Set up user button */
	err = gpio_pin_configure_dt(&user_btn, GPIO_INPUT);
	if (err) {
		LOG_ERR("Error %d: failed to configure %s pin %d", err, user_btn.port->name,
			user_btn.pin);
		return err;
	}

	err = gpio_pin_interrupt_configure_dt(&user_btn, GPIO_INT_EDGE_TO_ACTIVE);
	if (err) {
		LOG_ERR("Error %d: failed to configure interrupt on %s pin %d", err,
			user_btn.port->name, user_btn.pin);
		return err;
	}

	gpio_init_callback(&button_cb_data, button_pressed, BIT(user_btn.pin));
	gpio_add_callback(user_btn.port, &button_cb_data);

   while (!client) { // don't continue of Golioth client hasn't been created due to missing cellular link
      LOG_DBG("250ms cellular connect loop");
      k_msleep(250);
   }

   int i = 0;
   k_sleep(K_SECONDS(10));  // Let client stabilize.
   LOG_DBG("Starting main loop");
   while(true) {
      i++;
      LOG_DBG("New Loop");
      
      if (!golioth_client_is_connected(client)) {
         golioth_client_start(client);
         LOG_DBG("Starting client");
      }

      k_sleep(K_SECONDS(5)); // let client stabilize
      
      dtimelog(0, "");
      app_sensors_read_and_stream();
      dtimelog(0, "Sensors read");

      k_sleep(K_SECONDS(5)); // let client stabilize

      dtimelog(0, "");
      golioth_client_stop(client);
      dtimelog(0, "golioth client stopped.");
      LOG_DBG("End of loop %i\n", i);
      k_sleep(K_SECONDS(get_loop_delay_s()));
   }
}