When does a device check for updates from the server?

Description

I’m working on a “sleepy” cellular device where the device goes to sleep when it’s not sending data to the Stream service. I’m also using the OTA and settings services.

What causes a device to check for updates to the device settings, LightDB State, and receive a new manifest from the OTA service?

For example, let’s say the following config is used:

CONFIG_GOLIOTH_COAP_KEEPALIVE_INTERVAL_S=0
CONFIG_GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC=3600
CONFIG_GOLIOTH_USE_CONNECTION_ID=y

If the device sends data to the Stream service every 15 minutes, should I expect that the device will also check for updated settings/state/new firmware versions at the same time (i.e. every 15min)? Or are these only checked on the interval defined by CONFIG_GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC=3600 (1hr)?

Actual Behavior

It looks like the device only checks for updates based on CONFIG_GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC=3600
(see comment below)

Environment

SDK v0.19.1

Apologies… I’m testing this again and it looks like the device is updating the settings and checking for new firmware whenever the device sends new data to the Stream service. The first time I tested out the longer RX timeout settings it appeared the device wasn’t getting updates to settings/OTA, but it was probably something I misconfigured.

Just to confirm, the device should get updates from the server whenever it connects to stream new data, right? And the CONFIG_GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC defines how long before the device tries to connect again in the absence of any other connection to the server?

I’m now seeing the same behavior I originally posted about (settings not updating) after upgrading a second device.

Here’s the sequence of events:

  1. Device was running v2.0.0 firmware
  2. I published a new v2.0.1 firmware and created a deployment in the cohort
  3. Device upgrades to v2.0.1 firmware successfully
  4. Device publishes new data to Stream service every 15 min
  5. Device settings still have not been synchronized since the OTA completed



Here’s the prj.conf from the v2.0.1 firmware.

Hey @cdwilson,

The settings are not re-fetched on a cadence. There is the possibility of removing the settings observation and then re-establishing it, but I don’t think we continuously fetch them in the background.

If you performed an OTA update, the device should automatically re-establish the settings observation once it reconnects. It’s highly peculiar that you’re seeing otherwise, and I wasn’t able to reproduce this issue on my end.

As for GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC, it defines how long (in seconds) the CoAP client will wait for a response after sending a request to the Golioth Cloud.

Hey @marko,

Thanks for the reply.

I was able to reproduce this behavior a couple different times on a modified version of the Thingy:91 Example repo where I removed the LightDB State & RPC functionality and changed the prj.conf for low power (here are the changes I made)

Here’s the specific behavior I’m seeing:

  1. After the device finishes an OTA update, settings are initially synchronized successfully with the device.
  2. The device transmits sensor data to LightDB Stream successfully on an interval defined by LOOP_DELAY_S (I was using 60s).
  3. If I change the LOOP_DELAY_S to something else (e.g. 65), the new setting is never synchronized and the console shows “Out of sync” indefinitely.



Here’s the exact sequence of steps I followed to reproduce this:

  1. I programmed the device via JTAG with the thingy91-golioth_v1.6.0_thingy91x_nrf9151_ns_full.hex pre-built release firmware.
  2. I created a sample project with the v1.6.0 firmware deployed in a cohort and added the device to the cohort.
  3. I built a new OTA firmware v99.99.99 from this modified branch of the Thingy:91 Example app: GitHub - cgnd/thingy91-golioth at cdwilson/golioth-sync-test (west build -p -b thingy91x/nrf9151/ns --sysbuild)
  4. I uploaded the zephyr.signed.bin as a new package v99.99.99 and created a new deployment in the cohort
  5. The device OTA upgraded successfully to v99.99.99 firmware and settings sync’d correctly
  6. I waited for a couple minutes to verify the device was sending Stream data correctly
  7. I changed the LOOP_DELAY_S from 60 to 65
  8. The settings remain “Out of sync” indefinitely even though the device is sending Stream data every 60s.

It’s as if the Settings service gets “stuck” and the device can no longer receive new settings.

However, while I’ve seen this same behavior multiple times now, I can’t figure out how to reproduce it 100% of the time on demand… sometimes I repeat the whole process above (or even just reboot the device) and the settings just sync correctly without any changes to the firmware image.

Can you confirm something for me?

If I have an app that is only using the Stream, Settings, and OTA services, and has the following config:

CONFIG_GOLIOTH_COAP_KEEPALIVE_INTERVAL_S=0
CONFIG_GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC=3600
CONFIG_GOLIOTH_USE_CONNECTION_ID=y

Assuming the app is sending stream data more frequently than CONFIG_GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC, e.g. every 60s, then the device should only check for new settings and OTA updates each time the stream data is sent to Golioth? Is that a true statement?

@cdwilson I can reproduce the described behavior with GOLIOTH_COAP_KEEPALIVE_INTERVAL_S = 0 (as the blog post suggests). Stream uploads continue to work in that state, but Settings/OTA pushes go missing. The core issue is that pushes require a live server→device path and an active Observation. With keepalive off, the NAT/CGN downlink mapping times out between publishes. When the server sends a confirmable notification, the device often never sees it, doesn’t ACK, and the server eventually cancels the Observation. Outbound Stream still works (client traffic reopens the path), but you won’t get Settings/OTA again until the client reconnects or re-subscribes.

Connection ID helps the DTLS association survive IP/port changes, but it’s not a keepalive—it doesn’t keep the downlink open or maintain the Observation.
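To make the timing argument concrete, here is a minimal host-side sketch of when a server push can still reach the device. It is a simplified model, not SDK code: the function names are hypothetical, the NAT/CGN timeout is an assumed carrier value that the SDK does not report, and uplinks are treated as evenly spaced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Seconds between device-originated packets. A keepalive interval of 0
 * means keepalive is disabled, so only Stream publishes refresh the
 * NAT/CGN mapping. */
static uint32_t uplink_gap_s(uint32_t publish_interval_s, uint32_t keepalive_s)
{
    if (keepalive_s == 0 || keepalive_s > publish_interval_s) {
        return publish_interval_s;
    }
    return keepalive_s;
}

/* True if a server push sent t_s seconds after the first uplink arrives
 * while the downlink mapping is still open. nat_timeout_s is an assumed
 * carrier value. */
static bool push_reaches_device(uint32_t t_s, uint32_t publish_interval_s,
                                uint32_t keepalive_s, uint32_t nat_timeout_s)
{
    uint32_t gap = uplink_gap_s(publish_interval_s, keepalive_s);

    assert(gap > 0); /* at least one kind of uplink traffic is required */

    uint32_t since_last_uplink = t_s % gap;

    return since_last_uplink < nat_timeout_s;
}
```

With a 15-minute publish interval, keepalive disabled, and an assumed 120 s mapping timeout, a push sent 500 s after the last publish is lost in this model; with a 30 s keepalive the same push arrives.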

We suggest you:

  • Enable a small keepalive (9–60 s). This keeps NAT/CGN downlink mapping so notifications arrive in time to be ACKed, and the Observation stays alive.
  • Keep RX timeout sane (30–120 s). If the downlink path dies, the client notices sooner, reconnects, and re-establishes Settings/OTA observations.

We’re planning to introduce a polling option for observations, which I believe would be helpful in this case. This should be available relatively soon for OTA, and in the medium term we’ll extend it to all observation types.
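As a starting point, the keepalive and RX timeout suggestions could look like the prj.conf fragment below. The specific values are arbitrary picks from within the suggested ranges, not tested recommendations; the right keepalive depends on how quickly your carrier expires idle UDP mappings.

```
# Illustrative values only; tune for your carrier and power budget
CONFIG_GOLIOTH_COAP_KEEPALIVE_INTERVAL_S=30
CONFIG_GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC=60
CONFIG_GOLIOTH_USE_CONNECTION_ID=y
```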

@marko thanks for confirming the behavior.

:+1: I’ll experiment with this.

I originally had a small RX timeout set, but in my app the delay between outbound Stream sends is set at runtime via the Settings service. This delay could be 60s or it could be 24hr. If the stream delay is 24hr, but the client’s RX timeout is 30s, the device will wake up every 30s to check in with the server and battery life suffers. To avoid this, I set the RX timeout to be greater than the max allowed stream delay value (24hr) so the client doesn’t wake up earlier due to the RX timeout.

That would be helpful in my case.

Alternatively, I looked to see if there was a way to dynamically set the CoAP client RX timeout at runtime, but it appears to only be configurable at compile time via the CONFIG_GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC Kconfig setting.

Would it be possible to make this RX timeout configurable at runtime? (similar to the way that I can set a default LTE RPTAU period via CONFIG_LTE_PSM_REQ_RPTAU_SECONDS but also change it at runtime via lte_lc_psm_param_set() LC API function)

If the RX timeout was configurable at runtime, I could automatically change it to a reasonable value whenever the user updates the stream delay value.

BTW, I think I followed most of this explanation, but there is some terminology in here that was initially not familiar to me (e.g. “Observation”, “confirmable notification”, etc). Is there any documentation or blog post you can point me to that describes how the Golioth client/server interaction is intended to work at a high-level for each of these services? If not, I would find that documentation super helpful to have. When I first saw this behavior, I wasn’t sure what was supposed to happen, so it was hard to determine if what I was seeing was intended/correct behavior, a bug, or me using something in the wrong way.

When I enable the keepalive (GOLIOTH_COAP_KEEPALIVE_INTERVAL_S), the client gets updates from the server (as expected, since it’s keeping the session active). However, I’d like to avoid waking up frequently to send keepalive requests, to conserve battery life.

Let’s say that I change a setting in the Golioth web console while my device is currently sleeping. If the device wakes up to send data to the stream service, doesn’t the client need to reconnect in order to send the stream request? I’m still a bit confused as to why the settings/OTA aren’t received when this connection is opened by the device.

You mentioned that the “server eventually cancels the Observation”. Does this mean that even though the device eventually reconnects to send the stream data, the server won’t push updates to the settings/OTA because the observations have been cancelled from the server’s perspective? Is there a way to extend the timeout on the server such that it won’t cancel the observations before the device has a chance to reconnect? Or is there something I can call on the device SDK side to recover/reinitialize the observations if the server has cancelled them?

I guess I’m trying to figure out if there is a hard limit (on the server or elsewhere) to how long a device can sleep while still keeping the ability to receive settings/OTA updates when the device eventually wakes up and sends data to the stream service.

@cdwilson, thanks for your patience as we’ve been discussing this. I know the back-and-forth has created some confusion, so I’d like to take a step back and give you a clear picture of how Observations work today, what their limitations are, and how we’re planning to improve the experience moving forward.

Keepalive keeps the NAT mapping and DTLS session alive. That way, existing Observations for Settings/OTA can continue to deliver notifications to the device without being dropped. Observations are tied to the CoAP/DTLS session, not to the individual UDP socket mapping. As long as the DTLS session is still valid, the server will keep the Observes active in memory. That means:

  • If the device is online and reachable, it will receive notifications (e.g. Settings/OTA) as they happen.
  • If the device is offline or asleep, it obviously won’t see those pushes. CoAP doesn’t buffer them for later — if you miss them, they’re gone.

So, Observations remain active for as long as the DTLS session is valid, but reachability is the limiting factor. Changing a Setting does not guarantee delivery: from the server’s perspective, the notification was sent, but if the device was offline or unreachable at that moment, it may never actually receive it (regardless of its ability to send Stream payloads). In practice, this means a Settings change can be missed by the device, and the device must reconcile state (e.g., by re-subscribing or fetching).

We’re aware of the limitations of CoAP Observations and the fact that they don’t guarantee delivery in all scenarios; that is why our goal is to define an upper bound for how long a device might go without receiving an update from each Service. To achieve that, the device could combine Observations with a polling strategy at a higher cadence (e.g., periodically re-fetching or re-subscribing), which guarantees that even if a push was missed, the device will eventually reconcile its state within a known timeframe.
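One way to picture the upper-bound idea is a staleness check that the application runs on each wake. This is only a sketch: reconcile_due and its parameters are hypothetical names, and what "reconcile" means (re-subscribing, or restarting the client) is left to the application.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the device has gone max_staleness_s or longer without a
 * confirmed sync and should re-fetch or re-subscribe. Unsigned
 * subtraction keeps the result correct if the uptime counter wraps. */
static bool reconcile_due(uint32_t now_s, uint32_t last_sync_s,
                          uint32_t max_staleness_s)
{
    assert(max_staleness_s > 0); /* a zero bound would make every check due */

    return (uint32_t)(now_s - last_sync_s) >= max_staleness_s;
}
```

If the check runs on every wake, a missed push is picked up no later than max_staleness_s plus one sleep period after it was sent.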

For sleepy devices, one short-term approach is to call golioth_client_stop before sleep and golioth_client_start when the device wakes. This ensures the client restarts cleanly and automatically fetches the latest Settings and OTA manifest file.
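The ordering of that stop/start approach can be sketched as below. To keep it runnable on a host, the SDK calls sit behind hypothetical hooks (struct wake_ops and the fake_* functions are illustrative and not part of the Golioth SDK); on a device, the hooks would wrap golioth_client_start, a wait for the connection, the Stream publish, and golioth_client_stop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hooks standing in for the SDK calls named above. */
struct wake_ops {
    bool (*start_and_wait)(void *ctx); /* true once the client is connected */
    bool (*publish)(void *ctx);        /* true if Stream data was sent */
    void (*stop)(void *ctx);           /* close the session before sleeping */
};

/* One wake cycle: fresh session, publish, then tear down before sleep. */
static bool wake_cycle(const struct wake_ops *ops, void *ctx)
{
    bool sent = false;

    assert(ops != NULL);

    if (ops->start_and_wait(ctx)) {
        sent = ops->publish(ctx);
    }
    ops->stop(ctx); /* always stop, so the next wake starts a new session */

    return sent;
}

/* Host-side fake that records the call order: S=start, P=publish, X=stop. */
struct trace {
    char log[8];
    bool connect_ok;
};

static bool fake_start(void *ctx)
{
    struct trace *t = ctx;

    strcat(t->log, "S");
    return t->connect_ok;
}

static bool fake_publish(void *ctx)
{
    struct trace *t = ctx;

    strcat(t->log, "P");
    return true;
}

static void fake_stop(void *ctx)
{
    struct trace *t = ctx;

    strcat(t->log, "X");
}

static const struct wake_ops fake_ops = { fake_start, fake_publish, fake_stop };
```

Stopping unconditionally, even when the connection attempt fails, means the device never goes to sleep holding a half-open session.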

Thanks @marko, this was really helpful! When I first posted this question, I didn’t have an understanding of how CoAP observations or DTLS sessions worked…thanks for your patience with me as I’m wrapping my head around how this all works!

In the context of the Golioth SDK, I’m not totally clear on what events cause the CoAP/DTLS session to terminate.

Inactivity/timeout

I would assume that after some period of inactivity for an existing session, the client/server assume the session is dead and drop the session.

On the device side, we can control CONFIG_GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC. Can you confirm: when this timer expires, does the client establish a new DTLS session and resubscribe all observations?

On the server side, is there a similar timeout before the server drops the DTLS session? Seems like you’d want to ensure the device timeout is set smaller than the server’s timeout.

Explicit closure

What events will cause the Golioth client to explicitly terminate the session? I believe golioth_client_stop explicitly closes the session. Are there other events which cause the client to terminate and reestablish the session? (other than the rx timeout)

Can you explain how Stream data is sent in this case? Is Connection ID used when sending Stream data? For a device that sleeps for long periods of time, is there some timeout on the server side after which the server will require a new CID to be established when the client tries to send Stream data?

I experimented with this approach yesterday and it works as you describe. This is probably the approach I’m going to use for now.

In this case, it appears a new session is created each time the device wakes up, which uses more battery/data to establish the session, but avoids the need for the keepalive.

I’m not yet sure if Connection ID is beneficial in this case–from Dan’s blog post, it sounds like it increases the data for each message sent. Since the connection is being reestablished each time in this case, it seems like the overhead of CID may not be worth it?

In the context of the Golioth SDK, what events cause the CoAP/DTLS session to terminate?

A CoAP/DTLS session only terminates when the underlying DTLS socket is closed. This can happen due to an explicit teardown by the application (e.g. golioth_client_stop(), reboot, modem sleep), a network loss, DTLS errors, or when a carrier NAT/CGN closes the UDP mapping after a period of inactivity.

I would assume that after some period of inactivity for an existing session, the client/server assume the session is dead and drop it.

On the server side, there is an inactivity timer: if no traffic is seen for a while, the server will proactively close the DTLS session. In practice, though, what most often forces a reconnect on devices is the network path: for example, carrier NAT/CGN expiring the UDP mapping after a few minutes of inactivity. From the device’s point of view, both of these (server inactivity timeout or NAT expiry) look the same: the socket stops working and a new handshake is required.

On the device side, what does CONFIG_GOLIOTH_COAP_CLIENT_RX_TIMEOUT_SEC do? When this timer expires, does the client establish a new DTLS session and resubscribe all observations?

This Kconfig symbol controls how long the client waits for a response after sending a CoAP request. If the timeout expires, that request fails (you’ll get a timeout error), but the DTLS session is not torn down automatically. The client does not re-handshake or resubscribe just because this timer fired. A new DTLS session (and re-subscription) only happens if the socket itself fails (e.g. network drop, NAT timeout, DTLS error) or if you explicitly stop/restart the client.

What events will cause the Golioth client to explicitly terminate the session? Are there any besides golioth_client_stop()?

From the Golioth SDK perspective, the only explicit teardown is triggered by the application (e.g. golioth_client_stop(), reboot, or modem sleep that drops the IP context). Other causes such as network loss, DTLS errors, or carrier NAT/CGN closing the UDP mapping are external to the SDK but will also lead to the session being lost and a re-handshake being required.

How is Stream data sent in this case? Is Connection ID (CID) used? For a sleepy device, does the server require a new CID after some timeout?

Stream data is sent as a normal CoAP request over DTLS. If DTLS Connection ID (CID) is enabled, the CID is included so the server can associate packets with the existing DTLS session, even if the device’s IP or port has changed. This avoids unnecessary handshakes when the IP changes. There is no separate “CID timeout” enforced on the server side; the CID remains valid as long as the DTLS session is valid.

For devices that sleep for long periods, the best practice is to stop the client before sleep and start it again on wake. This ensures a clean handshake and re-establishes Settings/OTA observations.
