Unable to reconnect, following initial successful connection on BG95: Fail to get address (coap.golioth.io 5684) -11

Description

My nRF52840 + BG95 device successfully connects to Golioth on initial power-up. If I then put the device in a Faraday cage, it loses signal and gracefully ends the session, as expected. Once the device is removed from the Faraday cage, it automatically attempts to reconnect, but every reconnection attempt fails (Fail to get address (coap.golioth.io 5684) -11).

Interestingly, I always get an error (net_dns_dispatcher: DNS recv error (-4)) during the initial connection, but the initial connection always completes ~1 second later.

I'm not sure, but this might suggest a DNS timing issue in my setup. The reconnection problem could be the same issue, although the same patience/retry that eventually works during power-up is also present during reconnections.

Steps to Reproduce

Run the reference-design-template with very minimal changes. The initial connection succeeds, but if the device loses cellular signal for any reason, subsequent reconnects fail.

Expected Behavior

The device should automatically reconnect to Golioth without any problem.

Actual Behavior

Reconnection attempts following the initial connection will fail.

Environment

Golioth Firmware SDK v0.18.1, Quectel BG95 + nRF52840 (very similar to the RAK5010).

Logs and Console Output

Power-up and initial connection:

$ *** Booting My Application v2.7.8-e42b387a1667 ***
$ *** Using nRF Connect SDK v3.0.1-9eb5615da66b ***
$ *** Using Zephyr OS v4.0.99-77f865b8f8d0 ***
$ *** Golioth Firmware SDK v0.18.1 ***
$ [00:00:00.005,371] golioth_settings_autoload: Initializing settings subsystem
$ [00:00:00.011,474] golioth_settings_autoload: Loading settings
$ [00:00:00.012,115] golioth_rd_template: Firmware version: 2.7.8
$ [00:00:00.012,207] golioth_mbox: Mbox created, bufsize: 1232, num_items: 10, item_size: 112
$ [00:00:48.748,168] golioth_fw_update: Current firmware version: main - 2.7.8
$ [00:00:48.749,694] golioth_fw_update: State = Idle
$ [00:00:49.971,343] net_dns_dispatcher: DNS recv error (-4)
$ [00:00:51.482,910] golioth_coap_client_zephyr: Golioth CoAP client connected
$ [00:00:51.483,612] golioth_rd_template: Golioth client connected
$ [00:00:51.483,642] golioth_coap_client_zephyr: Entering CoAP I/O loop

Device placed in Faraday cage:

$ [00:05:15.197,387] golioth_coap_client: 1 resends in last 10 seconds
$ [00:05:23.873,046] golioth_coap_client_zephyr: Receive timeout
$ [00:05:23.873,046] golioth_coap_client_zephyr: Ending session
$ [00:05:23.873,077] golioth_rd_template: Golioth client disconnected

Then:

$ [00:06:03.880,432] golioth_coap_client_zephyr: Fail to get address (coap.golioth.io 5684) -11
$ [00:06:03.880,462] golioth_coap_client_zephyr: Failed to connect: -11
$ [00:06:03.880,462] golioth_coap_client_zephyr: Failed to connect: -11
$ [00:06:08.881,042] golioth_coap_client_zephyr: Fail to get address (coap.golioth.io 5684) -11
$ [00:06:08.881,072] golioth_coap_client_zephyr: Failed to connect: -11
$ [00:06:08.881,072] golioth_coap_client_zephyr: Failed to connect: -11

Hey @mathew,

The behavior you’re seeing, where the very first connection always succeeds but every subsequent reconnect immediately errors out with Fail to get address … (-11), points at how Zephyr’s DNS resolver state is handled across your cellular link going down and back up.

On the first connection after boot, the BG95 brings up its PDN, and Zephyr starts an asynchronous DNS lookup. You see a transient DNS recv error (-4) while the modem finishes attaching, but about a second later it resolves successfully and opens the CoAP socket.

When you put the device into a Faraday cage, the modem tears down its PDN, and Zephyr’s CoAP client times out and closes the session. Once you remove the cage and the radio re-attaches, Zephyr’s DNS context remains in that “finished” or error state from the initial lookup. It never automatically re-initializes or retries the query. So on the next Golioth connection attempt, the lookup immediately returns -11 (EAGAIN), which is logged as Fail to get address, because Zephyr believes there’s no new resolver work to do, and the DNS query never actually goes out.
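
If you want to experiment while we look into this, here is a rough, host-runnable sketch of one possible workaround: count consecutive lookup failures and force a resolver reset before trying again. This is an assumption on my part, not a confirmed fix and not SDK code. The names resolve_with_reset, resolve_fn, reset_fn, and the fake_dns simulation are all hypothetical; the simulation only imitates the stale-resolver behavior described above so the control flow can be exercised off-target.

```c
#include <stddef.h>

/* Hypothetical hooks: on Zephyr, `resolve` would wrap the address lookup and
 * `reset` would re-initialize the resolver (an assumption, not SDK API). */
typedef int (*resolve_fn)(void *ctx);
typedef void (*reset_fn)(void *ctx);

/* Try up to max_attempts lookups. After every reset_after consecutive
 * failures, call reset() before the next attempt. Returns 0 on success,
 * otherwise the last error code returned by resolve(). */
int resolve_with_reset(resolve_fn resolve, reset_fn reset, void *ctx,
                       int max_attempts, int reset_after)
{
    int failures = 0;
    int rc = -1;

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        rc = resolve(ctx);
        if (rc == 0) {
            return 0;
        }
        failures++;
        if (reset != NULL && reset_after > 0 && failures % reset_after == 0) {
            reset(ctx);
        }
    }
    return rc;
}

/* Simulated resolver standing in for the suspected behavior: while `stale`
 * is set, every lookup fails with -11 and no query is sent. */
struct fake_dns {
    int stale;
    int lookups;
    int resets;
};

int fake_resolve(void *ctx)
{
    struct fake_dns *dns = ctx;

    dns->lookups++;
    return dns->stale ? -11 : 0;
}

void fake_reset(void *ctx)
{
    struct fake_dns *dns = ctx;

    dns->stale = 0;
    dns->resets++;
}
```

With the simulated resolver starting out stale, resolve_with_reset(fake_resolve, fake_reset, &dns, 5, 2) fails twice, resets once, and succeeds on the third lookup. On the device, the reset hook is where you could try re-initializing the default DNS context (for example via Zephyr's dns_resolve_reconfigure(), whose exact signature depends on the Zephyr version), and you would want a delay between attempts. Whether that actually clears the -11 on the BG95 is exactly what we'd like to test.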

Zephyr should manage resetting the DNS context, and we haven’t seen this issue on Ethernet or Wi-Fi.

Would you be open to creating a branch in the Reference Design Template repo with your changes so we can test it on the nRF52840 DK with a BG95 module?