Skip to content

Cache /.well-known/databricks-config lookups in the CLI#5011

Open
simonfaltum wants to merge 20 commits intomainfrom
simonfaltum/hostmetadata-cache
Open

Cache /.well-known/databricks-config lookups in the CLI#5011
simonfaltum wants to merge 20 commits intomainfrom
simonfaltum/hostmetadata-cache

Conversation

@simonfaltum
Copy link
Copy Markdown
Member

@simonfaltum simonfaltum commented Apr 17, 2026

Why

Every CLI command (databricks auth profiles, bundle validate, every workspace or account call) goes through Config.EnsureResolved, which triggers an unauthenticated GET to {host}/.well-known/databricks-config to populate host metadata. That round trip is ~700ms against production and gets paid on every invocation, doubling the latency of otherwise single-request commands.

Changes

Before: every CLI invocation hits the well-known endpoint once (or more when multiple configs get constructed).
Now: the first invocation populates a local disk cache under ~/.cache/databricks/<version>/host-metadata/; subsequent invocations read from it. Failures are negatively cached for 60s (except for context.Canceled / context.DeadlineExceeded, which are transient and never cached).

  • New libs/hostmetadata package. NewResolver(fetch) is the primary API — takes an injected fetch function and returns a config.HostMetadataResolver. Attach(cfg) is a one-line convenience that wires cfg.DefaultHostMetadataResolver() as the fetch. SDK API from Add HostMetadataResolver for customizable host metadata fetching databricks-sdk-go#1572, shipped in v0.127.0 which is already bumped on main.
  • Positive cache wraps the miss path — the hit path is a single disk read. Negative cache is only consulted when positive misses.
  • Attach wired at every non-allowlisted *config.Config construction site: cmd/root/auth.go (4 sites), bundle/config/workspace.go, cmd/api/api.go, cmd/auth/{env,login,profiles}.go (3 sites across 2 files), cmd/labs/project/entrypoint.go, libs/auth/arguments.go.
  • Test-cleanup env now pins DATABRICKS_CACHE_DIR to a temp dir so tests don't leak cache files into HOME.
  • Injection guardrail test at libs/hostmetadata/injection_guardrail_test.go walks the tree and flags any new config.Config{ construction site that lacks a nearby hostmetadata.Attach call (allowlist for the handful of legitimately cfg-less-resolve sites).

Collateral cleanups

  • libs/cache/file_cache.go: drop the failed to stat cache file debug log when the file is simply missing (fs.ErrNotExist). It was pure noise (the next line, cache miss, computing, conveys the same info) and its OS-specific error text diverged between Unix (no such file or directory) and Windows (The system cannot find the file specified.), breaking cross-platform acceptance goldens. Genuine stat failures (permission, corruption) still log.
  • libs/testdiff/replacement.go: devVersionRegex now accepts either +SHA or -SHA after 0.0.0-dev. build.GetSanitizedVersion() swaps + to - for filesystem safety when the version is used in cache paths, and the old regex only covered the + form.

Test plan

  • make checks clean
  • make lint clean (0 issues)
  • go test ./libs/hostmetadata/... -race — 7 tests (smoke + cache hit + fetch error + cancellation-not-cached + host isolation + end-to-end integration + injection guardrail), all unit tests use an injected mock fetch so no httptest.Server required
  • go test ./libs/cache/... -race clean
  • go test ./cmd/root/... -race clean
  • go test ./bundle/config/... -race clean
  • End-to-end acceptance test acceptance/auth/host-metadata-cache/ asserts exactly ONE /.well-known/databricks-config GET across two auth profiles invocations sharing a DATABRICKS_CACHE_DIR
  • Existing acceptance tests regenerated: fewer well-known GETs in out.requests.txt (caching works), new [Local Cache] debug lines in cache/telemetry tests, two Warn: Failed to resolve host metadata lines removed (intentional: the resolver returns (nil, nil) on fetch errors, which is how the SDK interprets "no metadata available"), stat-not-found lines removed (see Collateral cleanups)

Live validation against dogfood

Built locally (go build -o /tmp/databricks-cache-test .) and ran databricks -p e2-dogfood current-user me with and without a warm cache:

Scenario Elapsed well-known time Cache log output
Cold cache (fresh DATABRICKS_CACHE_DIR) ~713ms fetch cache miss, computingGET /.well-known/databricks-configcomputed and stored result
Warm cache (second invocation) ~1ms single [Local Cache] cache hit line

Net per-command savings: ~700ms, matching the Why. Cache dir after one auth profiles run contained five JSON files (one per host in .databrickscfg). Inspecting one:

{"oidc_endpoint":"https://db-deco-test.databricks.com/oidc/accounts/{account_id}","account_id":"...","workspace_id":"","cloud":"","host_type":"UNIFIED_HOST","token_federation_default_oidc_audiences":["..."]}

Verifies that two CLI invocations sharing DATABRICKS_CACHE_DIR produce only
one /.well-known/databricks-config GET: the first populates the on-disk
cache, the second reads from it.

Co-authored-by: Isaac
Cached /.well-known/databricks-config lookups persist across CLI
invocations now, so recorded request logs drop duplicate GETs and debug
output shows the new host-metadata cache keys. Silenced SDK warnings on
failed well-known fetches (the resolver returns nil,nil) also remove a
couple of Warn lines from auth test outputs.

Co-authored-by: Isaac
Inverts the internal newResolver(cfg, ...) into an exported
NewResolver(fetch) that takes an injected fetch function. Attach stays
as a one-liner convenience. Unit tests for the caching logic no longer
need httptest servers or PAT-authed configs; one integration test
retains the end-to-end SDK wiring.

Co-authored-by: Isaac
Flips the resolver so the happy path is one disk read: positive cache
wraps the miss flow, which now probes negative and falls through to
fetch only on true miss. Context cancellation and deadline errors are
no longer written to the negative cache because they say nothing about
the host's long-term availability.

Regenerates cache/telemetry acceptance outputs — the synthetic
negative-cache probe no longer runs on cache hits.

Co-authored-by: Isaac
@simonfaltum simonfaltum marked this pull request as ready for review April 17, 2026 12:06
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 17, 2026

Approval status: pending

/acceptance/auth/ - needs approval

6 files changed
Suggested: @tanmay-db
Also eligible: @mihaimitrea-db, @tejaskochar-db, @hectorcast-db, @renaudhartert-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

/acceptance/bundle/ - needs approval

4 files changed
Suggested: @pietern
Also eligible: @denik, @anton-107, @andrewnester, @lennartkats-db, @shreyas-goenka, @janniklasrose

/bundle/ - needs approval

Files: bundle/config/workspace.go
Suggested: @pietern
Also eligible: @denik, @anton-107, @andrewnester, @lennartkats-db, @shreyas-goenka, @janniklasrose

/cmd/api/ - needs approval

Files: cmd/api/api.go
Suggested: @tanmay-db
Also eligible: @mihaimitrea-db, @tejaskochar-db, @hectorcast-db, @renaudhartert-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

/cmd/auth/ - needs approval

Files: cmd/auth/env.go, cmd/auth/login.go, cmd/auth/profiles.go
Suggested: @tanmay-db
Also eligible: @mihaimitrea-db, @tejaskochar-db, @hectorcast-db, @renaudhartert-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

/cmd/labs/ - needs approval

Files: cmd/labs/project/entrypoint.go
Eligible: @alexott, @nfx

/cmd/root/ - needs approval

Files: cmd/root/auth.go
Suggested: @tanmay-db
Also eligible: @mihaimitrea-db, @tejaskochar-db, @hectorcast-db, @renaudhartert-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

/internal/ - needs approval

Files: internal/testutil/env.go
Suggested: @tanmay-db
Also eligible: @mihaimitrea-db, @tejaskochar-db, @hectorcast-db, @renaudhartert-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

/libs/auth/ - needs approval

Files: libs/auth/arguments.go
Suggested: @tanmay-db
Also eligible: @mihaimitrea-db, @tejaskochar-db, @hectorcast-db, @renaudhartert-db, @parthban-db, @Divyansh-db, @chrisst, @rauchy

General files (require maintainer)

20 files changed
Based on git history:

  • @pietern -- recent work in cmd/auth/, internal/testutil/, cmd/labs/project/

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @shreyas-goenka, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

simonfaltum and others added 3 commits April 17, 2026 14:31
GetSanitizedVersion replaces + with - in build version metadata for
filesystem safety, but the [DEV_VERSION] replacement regex only
covered the + form. Cache paths use the sanitized form, so telemetry
tests failed across machines with different git HEAD SHAs. Regex now
accepts either + or - before the SHA suffix.

Co-authored-by: Isaac
os.Stat on a missing cache file returns an OS-specific error message
(Unix: "no such file or directory"; Windows: "The system cannot find
the file specified."), causing acceptance-test goldens to diverge
between platforms. The error is also pure noise — the follow-up
"cache miss, computing" line conveys the same information. Drop the
log for fs.ErrNotExist; keep it for genuine stat failures (permissions,
corruption).

Co-authored-by: Isaac
@simonfaltum simonfaltum marked this pull request as draft April 17, 2026 14:54
@simonfaltum simonfaltum marked this pull request as ready for review April 20, 2026 06:00
@simonfaltum simonfaltum removed the request for review from andrewnester April 20, 2026 06:01
github-merge-queue bot pushed a commit to databricks/databricks-sdk-go that referenced this pull request Apr 20, 2026
## Changes

[PR #1572](#1572)
added `Config.HostMetadataResolver` so callers could override the SDK's
`/.well-known/databricks-config` fetch on a per-Config basis. That
covers "I have one Config and I want to wrap it."

The gap: programs that construct many Configs across their command
surface (e.g. the Databricks CLI) end up copying the same
`cfg.HostMetadataResolver = ...` assignment at every construction site,
in the CLI roughly 10 sites across 7 files plus a guardrail test to
catch drift.

This PR adds a package-level default consulted when a Config has no
explicit resolver set. Callers set a factory once during startup; every
subsequent Config gets the same resolver without per-site wiring. The
Config-level field still takes precedence, so PR #1572's contract is
unchanged.

### API

```go
// config/host_metadata.go
var DefaultHostMetadataResolverFactory func(*Config) HostMetadataResolver
```

Plain public variable, set once at init. Matches the stdlib pattern for
single-default hooks: `http.DefaultClient`, `http.DefaultTransport`,
`log.Default`. Callers needing per-Config or dynamic behaviour should
use `Config.HostMetadataResolver` instead.

### Resolution order inside `Config.EnsureResolved`

1. If `Config.HostMetadataResolver` is set, use it.
2. Else, if `DefaultHostMetadataResolverFactory` is non-nil, invoke it
with the resolving Config and use its return value. If it returns nil,
fall through.
3. Else, SDK's default HTTP fetch (unchanged behavior for all existing
callers).

## How the Databricks CLI will use this

The canonical Go idiom for "library A registers itself with library B"
is a blank import that triggers an `init()` in A. This is how
`database/sql` drivers (`_ "github.com/lib/pq"`), image codecs (`_
"image/png"`), and encoding formats register themselves.

After this PR lands and is bumped into the CLI, [CLI PR
#5011](databricks/cli#5011) will collapse from
~10 wired-in `hostmetadata.Attach(cfg)` calls + a guardrail test down to
two small pieces:

**`repos/cli/libs/hostmetadata/resolver.go`** — set the caching factory
at package init:

```go
func init() {
    config.DefaultHostMetadataResolverFactory = func(cfg *config.Config) config.HostMetadataResolver {
        return NewResolver(cfg.DefaultHostMetadataResolver())
    }
}
```

**`repos/cli/cmd/databricks/main.go`** — one blank import to pull the
package in at startup:

```go
import (
    // Registers a disk-cached HostMetadataResolver with the SDK so every
    // Config the CLI constructs reuses the cached /.well-known lookup.
    _ "github.com/databricks/cli/libs/hostmetadata"
)
```

That's the full integration. Every Config the CLI creates, now and in
the future from any new command a developer adds, automatically gets
caching. No per-site `Attach` call to remember, no guardrail test to
maintain, no new developer ever has to learn this mechanism exists to
benefit from it.

### Experimental

Marked experimental to match the existing `HostMetadataResolver` field.
No default behavior change for callers that never set
`DefaultHostMetadataResolverFactory`.

## Tests

Three new tests in `config/config_test.go`, each using a small
`withDefaultHostMetadataResolverFactory(t, factory)` helper that
captures and restores the prior value, so tests never clobber each other
via the package-level default:

- Factory is invoked when Config has no resolver; back-fill works
end-to-end.
- Config-level resolver takes precedence (factory not consulted).
- Factory returning nil falls through to the SDK's HTTP fetch.

- `make fmt test lint` clean
- `go test ./config/... -count=1 -race` clean

Signed-off-by: simon <simon.faltum@databricks.com>

---------

Signed-off-by: simon <simon.faltum@databricks.com>
Co-authored-by: Renaud Hartert <renaud.hartert@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant