Unable to reconnect, following initial successful connection on BG95: Fail to get address (coap.golioth.io 5684) -11

Description

My nRF52840 + BG95 device successfully connects to Golioth upon initial power-up. If I then put the device in a Faraday cage, it loses signal and gracefully ends the session, as expected. Once the device is removed from the Faraday cage, it automatically attempts to reconnect, but every reconnection attempt fails (Fail to get address (coap.golioth.io 5684) -11).

Interestingly, I always get an error (net_dns_dispatcher: DNS recv error (-4)) during the initial connection, but the initial connection always completes ~1 second later.

I’m not certain, but this might suggest that DNS timing issues exist in my setup. The reconnection problem could be the same issue, although the same patience/retry that eventually works during power-up is also present during reconnections.

Steps to Reproduce

Run the reference-design-template with very minimal changes. The initial connection is successful, but if the device loses cellular signal for any reason, subsequent reconnects fail.

Expected Behavior

The device would automatically reconnect to Golioth and have no problem doing so.

Actual Behavior

Reconnection attempts following the initial connection will fail.

Environment

Golioth Firmware SDK v0.18.1, Quectel BG95 + nRF52840 (very similar to the RAK5010).

Logs and Console Output

Power-up and initial connection:

$ *** Booting My Application v2.7.8-e42b387a1667 ***
$ *** Using nRF Connect SDK v3.0.1-9eb5615da66b ***
$ *** Using Zephyr OS v4.0.99-77f865b8f8d0 ***
$ *** Golioth Firmware SDK v0.18.1 ***
$ [00:00:00.005,371] golioth_settings_autoload: Initializing settings subsystem
$ [00:00:00.011,474] golioth_settings_autoload: Loading settings
$ [00:00:00.012,115] golioth_rd_template: Firmware version: 2.7.8
$ [00:00:00.012,207] golioth_mbox: Mbox created, bufsize: 1232, num_items: 10, item_size: 112
$ [00:00:48.748,168] golioth_fw_update: Current firmware version: main - 2.7.8
$ [00:00:48.749,694] golioth_fw_update: State = Idle
$ [00:00:49.971,343] net_dns_dispatcher: DNS recv error (-4)
$ [00:00:51.482,910] golioth_coap_client_zephyr: Golioth CoAP client connected
$ [00:00:51.483,612] golioth_rd_template: Golioth client connected
$ [00:00:51.483,642] golioth_coap_client_zephyr: Entering CoAP I/O loop

Device placed in Faraday cage:

$ [00:05:15.197,387] golioth_coap_client: 1 resends in last 10 seconds
$ [00:05:23.873,046] golioth_coap_client_zephyr: Receive timeout
$ [00:05:23.873,046] golioth_coap_client_zephyr: Ending session
$ [00:05:23.873,077] golioth_rd_template: Golioth client disconnected

Then:

$ [00:06:03.880,432] golioth_coap_client_zephyr: Fail to get address (coap.golioth.io 5684) -11
$ [00:06:03.880,462] golioth_coap_client_zephyr: Failed to connect: -11
$ [00:06:03.880,462] golioth_coap_client_zephyr: Failed to connect: -11
$ [00:06:08.881,042] golioth_coap_client_zephyr: Fail to get address (coap.golioth.io 5684) -11
$ [00:06:08.881,072] golioth_coap_client_zephyr: Failed to connect: -11
$ [00:06:08.881,072] golioth_coap_client_zephyr: Failed to connect: -11

Hey @mathew,

The behavior you’re seeing, where the very first connection always succeeds but every subsequent reconnect immediately errors out with Fail to get address … (-11), points at how Zephyr’s DNS resolver state is handled across your cellular link going down and back up.

On the first connection after boot, the BG95 brings up its PDN, and Zephyr starts an asynchronous DNS lookup. You see a transient DNS recv error (-4) while the modem finishes attaching, but about a second later it resolves successfully and opens the CoAP socket.

When you put the device into a Faraday cage, the modem tears down its PDN, and Zephyr’s CoAP client times out and closes the session. Once you remove the cage and the radio re-attaches, Zephyr’s DNS context remains in that “finished” or error state from the initial lookup. It never automatically re-initializes or retries the query. So on the next Golioth connection attempt, the lookup immediately fails with Fail to get address and -11, because the resolver believes there’s no new work to do, and the DNS query never actually goes out. (Read as an errno value, -11 is EAGAIN; if the log prints the raw getaddrinfo() return code, -11 would instead be DNS_EAI_SYSTEM in Zephyr’s resolver status codes.)

Zephyr should manage resetting the DNS context, and we haven’t seen this issue on Ethernet or Wi-Fi.

Would you be open to creating a branch in the Reference Design Template repo with your changes so we can test it on the nRF52840 DK with a BG95 module?

Hey Marko,

Thanks for your reply.

As requested, here’s a link to a repo with my changes: GitHub - one-giant-leap/reference-design-template: Template for making new Golioth Reference Design repositories

west build -b mkgw4/nrf52840 --sysbuild app

Thanks, @mathew. We’ll try to replicate the behaviour on our end and get back to you.

Hey @marko,

Following up on this one - have you been able to replicate the behaviour?

Hi @mathew,

Apologies for the late reply.

I reproduced your scenario on an nRF52840 DK + BG95 with your code. The behavior matches what you described: the very first connection succeeds, but after a radio blackout, the reconnect fails at Fail to get address (coap.golioth.io 5684) -11.

This points to the PPP/DNS path in Zephyr as suggested in the comment above, and not to Golioth’s cloud or CoAP client.

Your app currently triggers a reconnect on NET_EVENT_IPV4_ADDR_ADD. With PPP modems, the IP address and DNS servers arrive in close succession; on a reconnect, the resolver can run before DNS servers are installed. In the net_connect() function, you should subscribe to net mgmt events and wait for NET_EVENT_DNS_SERVER_ADD (in addition to NET_EVENT_IPV4_ADDR_ADD) before calling golioth_client_start().
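As a rough illustration of the "wait for both events" idea, here is a minimal, host-testable sketch of the gating logic only. The `conn_gate` type, its functions, and the `GATE_*` flags are hypothetical names invented for this example and are not part of Zephyr or the Golioth SDK. In a real application, the two flags would presumably be set from net_mgmt event handlers for NET_EVENT_IPV4_ADDR_ADD and NET_EVENT_DNS_SERVER_ADD, and a `true` return would be the point at which golioth_client_start() is called; that wiring is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags for this sketch; these are NOT Zephyr's NET_EVENT_* values. */
#define GATE_HAS_IPV4_ADDR  (1u << 0)
#define GATE_HAS_DNS_SERVER (1u << 1)
#define GATE_ALL            (GATE_HAS_IPV4_ADDR | GATE_HAS_DNS_SERVER)

struct conn_gate {
    uint32_t seen;       /* prerequisites observed since the last link-down */
    bool client_started; /* true once a start was signalled for this link */
};

/* Record one prerequisite. Returns true exactly once per link-up cycle, at
 * the moment both prerequisites are present. That is where the application
 * would start the client. */
static bool conn_gate_on_event(struct conn_gate *gate, uint32_t flag)
{
    assert(gate != NULL);
    gate->seen |= (flag & GATE_ALL);
    if (gate->seen == GATE_ALL && !gate->client_started) {
        gate->client_started = true;
        return true;
    }
    return false;
}

/* Forget everything when the link drops, so the next reconnect waits for
 * a fresh address AND fresh DNS servers. */
static void conn_gate_on_link_down(struct conn_gate *gate)
{
    assert(gate != NULL);
    gate->seen = 0;
    gate->client_started = false;
}
```

The reset on link-down is what matters for the reconnect case: without it, flags left over from the first connection would let the client start before the DNS servers are installed again.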

Additionally, your application is using Zephyr’s generic modem_cellular drivers. In our experience, dedicated, vendor-specific drivers handle reconnects more gracefully, so using BG95 dedicated drivers is recommended. For instance, on an nRF52840 + HL7800 (using the hl7800 driver), I didn’t see reconnect issues even after ~15 minutes in a radio blackout scenario.

Thanks @marko for taking the time to reproduce the issue.

Were you able to get reconnections working on your setup after making changes to the net_connect() function?

For me, the NET_EVENT_DNS_SERVER_ADD event never seems to occur.

Additionally, net_connect() is only called upon start-up, so I’m not sure I understand how modifying it can resolve the reconnection issue?

The RAK5010 (nRF52840 + BG95) is a Golioth Continuously Verified Board. Am I correct to assume this issue also affects all the Golioth Firmware SDK examples when using the RAK5010 (see “How to use the RAK5010 cellular dev board with Zephyr” on The Golioth Developer Blog)?

If you’re still using the generic modem_cellular driver, what you’re seeing isn’t unusual. On PPP, NET_EVENT_DNS_SERVER_ADD often doesn’t fire: DNS is negotiated over IPCP, but the generic driver doesn’t always surface that as an event.
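If the DNS event may never arrive, one possible way for an application to cope is to treat it as optional after a grace period. The sketch below is an assumption-laden illustration, not something proposed in this thread: `dns_wait`, its functions, and the 5-second `DNS_EVENT_GRACE_MS` value are all hypothetical. It does not fix a resolver that has no servers configured; it only avoids waiting forever for an event that is never raised.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical grace period; tune for the modem and network in use. */
#define DNS_EVENT_GRACE_MS 5000u

struct dns_wait {
    bool addr_seen;        /* an IPv4 address has been assigned */
    bool dns_seen;         /* a DNS-server-added event was observed */
    uint32_t addr_time_ms; /* uptime when the address arrived */
};

/* Call when the IPv4 address event is received. */
static void dns_wait_addr_added(struct dns_wait *w, uint32_t now_ms)
{
    w->addr_seen = true;
    w->addr_time_ms = now_ms;
}

/* Call when (if) the DNS server event is received. */
static void dns_wait_dns_added(struct dns_wait *w)
{
    w->dns_seen = true;
}

/* True when a connection attempt is reasonable: the address is present and
 * either the DNS event arrived or the grace period has elapsed. */
static bool dns_wait_ready(const struct dns_wait *w, uint32_t now_ms)
{
    if (!w->addr_seen) {
        return false;
    }
    if (w->dns_seen) {
        return true;
    }
    /* Unsigned subtraction stays correct across uptime counter wraparound. */
    return (uint32_t)(now_ms - w->addr_time_ms) >= DNS_EVENT_GRACE_MS;
}
```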

Just to clarify, RAK5010 is a Golioth Continuously Verified Board, which means we build and test it against every release of the Firmware SDK. However, the SDK itself doesn’t manage the reconnection policy, which always lives in the application layer and depends heavily on the modem/driver behavior, so you need to implement reconnection logic yourself. It’s also worth pointing out that net_connect() is example code provided in the SDK. It’s not designed to handle every product scenario as-is, and it should be treated as a reference implementation. In most cases, it needs to be modified and adjusted to match the requirements of your product.
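As one small piece of what application-level reconnection logic might contain, here is a sketch of a capped exponential backoff for spacing out recovery attempts. The function name and both constants are hypothetical; the 5-second base simply mirrors the spacing of the failed attempts in the logs above, and the 5-minute cap is an arbitrary choice for illustration.

```c
#include <stdint.h>

#define RECONNECT_BASE_MS 5000u   /* mirrors the ~5 s retry spacing in the logs */
#define RECONNECT_MAX_MS  300000u /* hypothetical 5-minute cap */

/* Delay before retry number `attempt` (0-based): base * 2^attempt, capped
 * at RECONNECT_MAX_MS. The loop stops doubling once the cap is reached, so
 * the value cannot overflow even for very large attempt counts. */
static uint32_t reconnect_delay_ms(uint32_t attempt)
{
    uint32_t delay = RECONNECT_BASE_MS;

    while (attempt-- > 0 && delay < RECONNECT_MAX_MS) {
        delay *= 2u;
    }
    return delay > RECONNECT_MAX_MS ? RECONNECT_MAX_MS : delay;
}
```

What to do on each attempt (restart the client, re-run the network bring-up, or power-cycle the modem) depends on the driver and is left out of this sketch.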

For production use, we recommend the BG95-specific driver and not the generic modem_cellular driver. I’m currently working on replacing the modem_cellular driver with BG95-specific support to make connection and reconnection handling more robust and easier for you and others to adapt in your own projects.