fix(net/netmon): skip RTM_MISS route messages on Darwin #113
Merged
ethanndickson merged 1 commit into main on Apr 9, 2026
ibetitsmike approved these changes on Apr 9, 2026
On Darwin, the AF_ROUTE socket delivers RTM_MISS on every failed route lookup. When a STUN probe targets an IPv6 address with no route (common on machines without global v6 connectivity), each probe generates an RTM_MISS that netmon treats as a LinkChange. The LinkChange triggers ReSTUN, which fires another probe, creating a self-sustaining loop (~1.3 events/s vs the normal ~1/30s cadence).

RTM_MISS is emitted from the route lookup path, not the table mutation path. Route withdrawals always emit RTM_DELETE before any subsequent lookup can miss, so RTM_MISS is never the leading signal for a real topology change.

Every mature BSD route-socket consumer (bird, dhcpcd, frr) ignores it for table-change purposes.
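The filter described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual netmon code: `routeMessage` is a hypothetical stand-in for the parsed route-socket message, and the `rtm*` constants are redeclared locally (with the values from Darwin's `<net/route.h>`) so the sketch builds on any platform. The real `skipRouteMessage` contains other checks that are omitted here.

```go
package main

import "fmt"

// Route-socket message types, mirroring the values in Darwin's <net/route.h>.
// Redeclared locally (an assumption for illustration) so this builds anywhere.
const (
	rtmAdd    = 0x1
	rtmDelete = 0x2
	rtmChange = 0x3
	rtmMiss   = 0x7
)

// routeMessage is a hypothetical stand-in for the parsed route-socket
// message the monitor receives; only the type field matters here.
type routeMessage struct {
	Type int
}

// skipRouteMessage reports whether a message should be ignored rather than
// surfaced as a link change. Only the RTM_MISS check is shown.
func skipRouteMessage(msg routeMessage) bool {
	if msg.Type == rtmMiss {
		// A failed lookup does not mutate the route table.
		return true
	}
	return false
}

func main() {
	msgs := []routeMessage{{rtmAdd}, {rtmDelete}, {rtmChange}, {rtmMiss}}
	for _, m := range msgs {
		fmt.Printf("type=%#x skip=%v\n", m.Type, skipRouteMessage(m))
	}
}
```

Table mutations (`RTM_ADD`, `RTM_DELETE`, `RTM_CHANGE`) still pass through in this sketch; only the lookup-path message is dropped.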
Force-pushed from 5598adf to ad599a8
ethanndickson added a commit to coder/coder that referenced this pull request on Apr 9, 2026

## What

Bumps `coder/tailscale` to [`e956a95`](coder/tailscale@e956a95) ([PR #113](coder/tailscale#113)) to pick up the `RTM_MISS` fix for the Darwin network monitor.

## Why

On Darwin, `RTM_MISS` route-socket messages (fired on every failed route lookup) were not filtered by `netmon`, causing each one to be treated as a `LinkChange`. When netcheck sends STUN probes to an IPv6 address with no route, this creates a self-sustaining feedback loop:

`RTM_MISS` → `LinkChange` → `ReSTUN` → netcheck → v6 STUN probe → `RTM_MISS` → …

The loop drives DERP home-region flapping at ~70× baseline, which at fleet scale saturates PostgreSQL's `NOTIFY` lock and causes coordinator health-check timeouts.

The upstream fix adds a single `if msg.Type == unix.RTM_MISS { return true }` check to `skipRouteMessage`. This is safe because `RTM_MISS` is a lookup-path signal, not a table-mutation signal: route withdrawals always emit `RTM_DELETE` before any subsequent lookup can miss.

## Scope

This is a targeted workaround to unblock a customer. The root cause (why certain macOS versions generate elevated `RTM_MISS` traffic, likely a change in IPv6 route-table initialization) requires further investigation.
spikecurtis pushed a commit to coder/coder that referenced this pull request on Apr 9, 2026

## What

Bumps `coder/tailscale` to [`e956a95`](coder/tailscale@e956a95) ([PR #113](coder/tailscale#113)) to pick up the `RTM_MISS` fix for the Darwin network monitor.

Already released on `release/2.31` as v2.31.8 (#24185) to unblock a customer. This PR is to update `main`.

## Why

On Darwin, `RTM_MISS` route-socket messages (fired on every failed route lookup) were not filtered by `netmon`, causing each one to be treated as a `LinkChange`. When netcheck sends STUN probes to an IPv6 address with no route, this creates a self-sustaining feedback loop:

`RTM_MISS` → `LinkChange` → `ReSTUN` → netcheck → v6 STUN probe → `RTM_MISS` → …

The loop drives DERP home-region flapping at ~70× baseline, which at fleet scale saturates PostgreSQL's `NOTIFY` lock and causes coordinator health-check timeouts.

The upstream fix adds a single `if msg.Type == unix.RTM_MISS { return true }` check to `skipRouteMessage`. This is safe because `RTM_MISS` is a lookup-path signal, not a table-mutation signal: route withdrawals always emit `RTM_DELETE` before any subsequent lookup can miss.

Note that this issue has only been reported since users updated to macOS 26.4.

Relates to ENG-2394
spikecurtis added a commit to coder/coder that referenced this pull request on Apr 10, 2026

## Cherry-pick of #24187 onto `release/2.29`

Bumps `coder/tailscale` to [`e956a95`](coder/tailscale@e956a95) ([PR #113](coder/tailscale#113)) to pick up the `RTM_MISS` fix for the Darwin network monitor.

### Why

On Darwin, `RTM_MISS` route-socket messages (fired on every failed route lookup) were not filtered by `netmon`, causing each one to be treated as a `LinkChange`. When netcheck sends STUN probes to an IPv6 address with no route, this creates a self-sustaining feedback loop:

`RTM_MISS` → `LinkChange` → `ReSTUN` → netcheck → v6 STUN probe → `RTM_MISS` → …

The loop drives DERP home-region flapping at ~70× baseline, which at fleet scale saturates PostgreSQL's `NOTIFY` lock and causes coordinator health-check timeouts.

The upstream fix adds a single `if msg.Type == unix.RTM_MISS { return true }` check to `skipRouteMessage`. This is safe because `RTM_MISS` is a lookup-path signal, not a table-mutation signal: route withdrawals always emit `RTM_DELETE` before any subsequent lookup can miss.

This issue has only been reported recently, since users updated to macOS 26.4.

### Notes

The cherry-pick resolved as empty due to divergent tailscale base versions between `main` (`33e050f`) and `release/2.29` (`6eafe0f`). The dependency was manually updated with `go mod tidy`, which also bumped the `go` directive to 1.25.7 (required by the new tailscale module) and pulled in required transitive dependency updates (golang.org/x/*, etc.).

Relates to ENG-2394

> 🤖 Generated by Coder Agents
spikecurtis added a commit to coder/coder that referenced this pull request on Apr 10, 2026 (#24214)

## Cherry-pick of #24187 onto `release/2.32`

This cherry-picks commit ad2415e to bring the `coder/tailscale` bump (`e956a95`, [PR #113](coder/tailscale#113)) onto the `release/2.32` branch.

### Context

On Darwin, `RTM_MISS` route-socket messages (fired on every failed route lookup) were not filtered by `netmon`, causing each one to be treated as a `LinkChange`. When netcheck sends STUN probes to an IPv6 address with no route, this creates a self-sustaining feedback loop:

`RTM_MISS` → `LinkChange` → `ReSTUN` → netcheck → v6 STUN probe → `RTM_MISS` → …

The loop drives DERP home-region flapping at ~70× baseline, which at fleet scale saturates PostgreSQL's `NOTIFY` lock and causes coordinator health-check timeouts.

The upstream fix adds a single `if msg.Type == unix.RTM_MISS { return true }` check to `skipRouteMessage`, which is safe because `RTM_MISS` is a lookup-path signal, not a table-mutation signal.

This issue has been reported since users updated to macOS 26.4.

Relates to ENG-2394

> 🤖 Generated by Coder Agents

Co-authored-by: Ethan <39577870+ethanndickson@users.noreply.github.com>
sreya pushed a commit that referenced this pull request on Apr 18, 2026
## Problem

On Darwin, the `AF_ROUTE` socket delivers `RTM_MISS` on every failed route lookup. When netcheck sends a STUN probe to an IPv6 address that has no route (common on machines without global IPv6 connectivity), the kernel emits an `RTM_MISS` for each probe. The Darwin network monitor (`darwinRouteMon`) does not filter these, so each one is treated as a `LinkChange`. The `LinkChange` triggers `ReSTUN` → netcheck → another v6 STUN probe → another `RTM_MISS`, creating a self-sustaining loop (~1.3 events/s vs the normal ~1/30s cadence).

At scale this causes:

- … `PreferredDERP` returning 0 when all probes time out, defeating hysteresis)

## Why this is safe

`RTM_MISS` is emitted exclusively from XNU's route lookup path (`rtalloc1_common_locked` at the `unreachable:` label). Route table mutations go through `rtrequest_locked`, which emits `RTM_ADD`/`RTM_DELETE`/`RTM_CHANGE`. The critical ordering guarantee: `RTM_DELETE` is emitted synchronously during route removal, before any subsequent lookup can miss. So `RTM_MISS` is never the leading signal for a real topology change; if it is the first thing seen for a destination, the route never existed.

Every mature BSD route-socket consumer confirms this treatment:

- … `RTM_MISS` falls through to `default: break`
- … `if (rtm_type != RTM_MISS && if_realroute(rtm))`

## Scope and follow-up

This is a targeted workaround to unblock a customer experiencing the loop in production. The root cause (why certain macOS versions generate elevated `RTM_MISS` traffic in the first place, likely a change in IPv6 route-table initialization removing a catch-all entry that previously satisfied lookups) requires further investigation and is not addressed here.

This fix is also upstream-worthy to `tailscale/tailscale`, which has the same gap (zero `RTM_MISS` references in the repo).