fix(net/netmon): skip RTM_MISS route messages on Darwin #113

Merged
ethanndickson merged 1 commit into main from ethan/skip-rtm-miss
Apr 9, 2026

@ethanndickson ethanndickson commented Apr 9, 2026

Problem

On Darwin, the AF_ROUTE socket delivers RTM_MISS on every failed route lookup. When netcheck sends a STUN probe to an IPv6 address that has no route (common on machines without global IPv6 connectivity), the kernel emits an RTM_MISS for each probe. The Darwin network monitor (darwinRouteMon) does not filter these, so each one is treated as a LinkChange. The LinkChange triggers ReSTUN → netcheck → another v6 STUN probe → another RTM_MISS, creating a self-sustaining loop (~1.3 events/s versus the normal cadence of roughly one event every 30 s).

At scale this causes:

  • Excessive DERP home-region flapping (PreferredDERP returning 0 when all probes time out, defeating hysteresis)
  • Downstream load on coordination infrastructure from the resulting peer-update storm
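
The loop described above can be sketched as a toy model. This is only an illustration under simplifying assumptions: `msgType`, `simulate`, and the queue are hypothetical names invented for the sketch, not netmon's real types, and every link change is assumed to produce exactly one failed IPv6 probe.

```go
package main

import "fmt"

// msgType is a hypothetical stand-in for a route-socket message type.
type msgType int

const (
	msgMiss   msgType = iota // stands in for RTM_MISS
	msgDelete                // stands in for RTM_DELETE
)

// simulate processes queued route messages for at most maxSteps steps and
// returns how many of them were treated as link changes. Every link change
// is assumed to trigger a re-STUN whose IPv6 probe fails its route lookup,
// which queues another miss.
func simulate(initial []msgType, skipMiss bool, maxSteps int) int {
	queue := append([]msgType(nil), initial...)
	linkChanges := 0
	for step := 0; step < maxSteps && len(queue) > 0; step++ {
		m := queue[0]
		queue = queue[1:]
		if skipMiss && m == msgMiss {
			continue // filtered: no link change, so no new probe
		}
		linkChanges++
		// Link change -> ReSTUN -> v6 probe with no route -> another miss.
		queue = append(queue, msgMiss)
	}
	return linkChanges
}

func main() {
	fmt.Println(simulate([]msgType{msgMiss}, false, 100))  // 100: the loop never drains
	fmt.Println(simulate([]msgType{msgMiss}, true, 100))   // 0: the miss is dropped
	fmt.Println(simulate([]msgType{msgDelete}, true, 100)) // 1: a real change is still seen once
}
```

Without the filter, a single miss keeps the queue non-empty indefinitely. With it, a genuine route deletion still produces one link change and the follow-up miss is discarded.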

Why this is safe

RTM_MISS is emitted exclusively from XNU's route lookup path (rtalloc1_common_locked at the unreachable: label). Route table mutations go through rtrequest_locked, which emits RTM_ADD / RTM_DELETE / RTM_CHANGE. The critical ordering guarantee: RTM_DELETE is emitted synchronously during route removal, before any subsequent lookup can miss. So RTM_MISS is never the leading signal for a real topology change — if it's the first thing seen for a destination, the route never existed.

Every mature BSD route-socket consumer confirms this treatment:

  • bird — dispatch switch handles ADD/DELETE/CHANGE/IFINFO/NEWADDR/DELADDR; RTM_MISS falls through to default: break
  • dhcpcd — explicitly filters: if (rtm_type != RTM_MISS && if_realroute(rtm))
  • frr — name-table entry only, no dispatch handler
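
For illustration, the dispatch pattern these consumers share might look like the following in Go. This is a sketch, not code from bird, dhcpcd, or frr: the `dispatch` function and its return labels are hypothetical, and the numeric values are the message types from Darwin's `<net/route.h>`, copied locally so the example builds on any platform.

```go
package main

import "fmt"

// Route-socket message types, values as in Darwin's <net/route.h>.
const (
	rtmAdd     = 0x1
	rtmDelete  = 0x2
	rtmChange  = 0x3
	rtmMiss    = 0x7
	rtmNewaddr = 0xc
	rtmDeladdr = 0xd
	rtmIfinfo  = 0xe
)

// dispatch is a hypothetical handler that names which subsystem would react
// to a message type. Only table, address, and interface mutations are
// handled; everything else, including rtmMiss, falls through to the default.
func dispatch(msgType int) string {
	switch msgType {
	case rtmAdd, rtmDelete, rtmChange:
		return "route"
	case rtmNewaddr, rtmDeladdr:
		return "address"
	case rtmIfinfo:
		return "interface"
	default:
		return "ignored"
	}
}

func main() {
	for _, t := range []int{rtmAdd, rtmDelete, rtmIfinfo, rtmMiss} {
		fmt.Printf("type %#x -> %s\n", t, dispatch(t))
	}
}
```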

Scope and follow-up

This is a targeted workaround to unblock a customer experiencing the loop in production. The root cause — why certain macOS versions generate elevated RTM_MISS traffic in the first place (likely a change in IPv6 route-table initialization removing a catch-all entry that previously satisfied lookups) — requires further investigation and is not addressed here.

This fix is also worth upstreaming to tailscale/tailscale, which has the same gap (zero RTM_MISS references in the repo).

@ethanndickson ethanndickson requested a review from a team as a code owner April 9, 2026 06:41
On Darwin, the AF_ROUTE socket delivers RTM_MISS on every failed
route lookup. When a STUN probe targets an IPv6 address with no
route (common on machines without global v6 connectivity), each
probe generates an RTM_MISS that netmon treats as a LinkChange.
The LinkChange triggers ReSTUN, which fires another probe, creating
a self-sustaining loop (~1.3 events/s vs the normal ~1/30s cadence).

RTM_MISS is emitted from the route lookup path, not the table
mutation path. Route withdrawals always emit RTM_DELETE before any
subsequent lookup can miss, so RTM_MISS is never the leading signal
for a real topology change. Every mature BSD route-socket consumer
(bird, dhcpcd, frr) ignores it for table-change purposes.
@ethanndickson ethanndickson changed the title from "fix(net/netmon): skip non-mutating RTM_MISS/RESOLVE/LOSING on Darwin" to "fix(net/netmon): skip RTM_MISS route messages on Darwin" Apr 9, 2026
@ethanndickson ethanndickson merged commit e956a95 into main Apr 9, 2026
21 of 24 checks passed
ethanndickson added a commit to coder/coder that referenced this pull request Apr 9, 2026

## What

Bumps `coder/tailscale` to
[`e956a95`](coder/tailscale@e956a95)
([PR #113](coder/tailscale#113)) to pick up the
`RTM_MISS` fix for the Darwin network monitor.

## Why

On Darwin, `RTM_MISS` route-socket messages (fired on every failed route
lookup) were not filtered by `netmon`, causing each one to be treated as
a `LinkChange`. When netcheck sends STUN probes to an IPv6 address with
no route, this creates a self-sustaining feedback loop: `RTM_MISS` →
`LinkChange` → `ReSTUN` → netcheck → v6 STUN probe → `RTM_MISS` → …

The loop drives DERP home-region flapping at ~70× baseline, which at
fleet scale saturates PostgreSQL's `NOTIFY` lock and causes coordinator
health-check timeouts.

The upstream fix adds a single `if msg.Type == unix.RTM_MISS { return
true }` check to `skipRouteMessage`. This is safe because `RTM_MISS` is
a lookup-path signal, not a table-mutation signal — route withdrawals
always emit `RTM_DELETE` before any subsequent lookup can miss.
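
A minimal sketch of the shape of that check, under stated assumptions: `routeMessage` is a hypothetical stand-in for the parsed route-socket message, the constants are the values from Darwin's `<net/route.h>` copied locally (the real code uses `unix.RTM_MISS`), and any other conditions the real `skipRouteMessage` evaluates are not shown.

```go
package main

import "fmt"

// Route-socket message types, values as in Darwin's <net/route.h>.
const (
	rtmAdd    = 0x1
	rtmDelete = 0x2
	rtmChange = 0x3
	rtmMiss   = 0x7
)

// routeMessage is a hypothetical stand-in for a parsed route message.
type routeMessage struct {
	Type int
}

// skipRouteMessage reports whether msg should be dropped before it can be
// turned into a link-change event. Only the RTM_MISS check is sketched here.
func skipRouteMessage(msg routeMessage) bool {
	if msg.Type == rtmMiss {
		// A failed lookup does not mutate the routing table.
		return true
	}
	return false
}

func main() {
	for _, t := range []int{rtmAdd, rtmDelete, rtmChange, rtmMiss} {
		fmt.Printf("type %#x skipped=%v\n", t, skipRouteMessage(routeMessage{Type: t}))
	}
}
```

Mutation messages such as `RTM_ADD`, `RTM_DELETE`, and `RTM_CHANGE` still pass through, so real topology changes keep reaching the monitor.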

## Scope

This is a targeted workaround to unblock a customer. The root cause —
why certain macOS versions generate elevated `RTM_MISS` traffic (likely
a change in IPv6 route-table initialization) — requires further
investigation.
spikecurtis pushed a commit to coder/coder that referenced this pull request Apr 9, 2026
## What

Bumps `coder/tailscale` to
[`e956a95`](coder/tailscale@e956a95)
([PR #113](coder/tailscale#113)) to pick up the
`RTM_MISS` fix for the Darwin network monitor.

Already released on `release/2.31` as v2.31.8 (#24185) to unblock a
customer. This PR updates `main`.

## Why

On Darwin, `RTM_MISS` route-socket messages (fired on every failed route
lookup) were not filtered by `netmon`, causing each one to be treated as
a `LinkChange`. When netcheck sends STUN probes to an IPv6 address with
no route, this creates a self-sustaining feedback loop: `RTM_MISS` →
`LinkChange` → `ReSTUN` → netcheck → v6 STUN probe → `RTM_MISS` → …

The loop drives DERP home-region flapping at ~70× baseline, which at
fleet scale saturates PostgreSQL's `NOTIFY` lock and causes coordinator
health-check timeouts.

The upstream fix adds a single `if msg.Type == unix.RTM_MISS { return
true }` check to `skipRouteMessage`. This is safe because `RTM_MISS` is
a lookup-path signal, not a table-mutation signal — route withdrawals
always emit `RTM_DELETE` before any subsequent lookup can miss.

Note that this issue has only been reported since users updated to
macOS 26.4.

Relates to ENG-2394
spikecurtis added a commit to coder/coder that referenced this pull request Apr 10, 2026
## Cherry-pick of #24187 onto `release/2.29`

Bumps `coder/tailscale` to
[`e956a95`](coder/tailscale@e956a95)
([PR #113](coder/tailscale#113)) to pick up the
`RTM_MISS` fix for the Darwin network monitor.

### Why

On Darwin, `RTM_MISS` route-socket messages (fired on every failed route
lookup) were not filtered by `netmon`, causing each one to be treated as
a `LinkChange`. When netcheck sends STUN probes to an IPv6 address with
no route, this creates a self-sustaining feedback loop: `RTM_MISS` →
`LinkChange` → `ReSTUN` → netcheck → v6 STUN probe → `RTM_MISS` → …

The loop drives DERP home-region flapping at ~70× baseline, which at
fleet scale saturates PostgreSQL's `NOTIFY` lock and causes coordinator
health-check timeouts.

The upstream fix adds a single `if msg.Type == unix.RTM_MISS { return
true }` check to `skipRouteMessage`. This is safe because `RTM_MISS` is
a lookup-path signal, not a table-mutation signal — route withdrawals
always emit `RTM_DELETE` before any subsequent lookup can miss.

This issue has only been reported recently, since users updated to macOS
26.4.

### Notes

The cherry-pick resolved as empty due to divergent tailscale base
versions between `main` (`33e050f`) and `release/2.29` (`6eafe0f`). The
dependency was manually updated with `go mod tidy`, which also bumped
the `go` directive to 1.25.7 (required by the new tailscale module) and
pulled in required transitive dependency updates (golang.org/x/*, etc.).

Relates to ENG-2394

> 🤖 Generated by Coder Agents
spikecurtis pushed a commit to coder/coder that referenced this pull request Apr 10, 2026
spikecurtis added a commit to coder/coder that referenced this pull request Apr 10, 2026
#24214)

## Cherry-pick of #24187 onto `release/2.32`

This cherry-picks commit ad2415e to
bring the `coder/tailscale` bump (`e956a95`, [PR
#113](coder/tailscale#113)) onto the
`release/2.32` branch.

### Context

On Darwin, `RTM_MISS` route-socket messages (fired on every failed route
lookup) were not filtered by `netmon`, causing each one to be treated as
a `LinkChange`. When netcheck sends STUN probes to an IPv6 address with
no route, this creates a self-sustaining feedback loop: `RTM_MISS` →
`LinkChange` → `ReSTUN` → netcheck → v6 STUN probe → `RTM_MISS` → …

The loop drives DERP home-region flapping at ~70× baseline, which at
fleet scale saturates PostgreSQL's `NOTIFY` lock and causes coordinator
health-check timeouts.

The upstream fix adds a single `if msg.Type == unix.RTM_MISS { return
true }` check to `skipRouteMessage`, which is safe because `RTM_MISS` is
a lookup-path signal, not a table-mutation signal.

This issue has been reported since users updated to macOS 26.4.

Relates to ENG-2394

> 🤖 Generated by Coder Agents

Co-authored-by: Ethan <39577870+ethanndickson@users.noreply.github.com>
sreya pushed a commit that referenced this pull request Apr 18, 2026