RFC: Bazel Support for Datadog Test Optimization via Module Extension and Uploader

Author: Tony Redondo
Date: Feb 18, 2026
Status: Implemented (v1)

Last reviewed: 2026-05-07

This document outlines a proposed method for integrating Datadog Test Optimization with Bazel. The approach involves using a module extension and repository rule to gather metadata during module/repository resolution, and a runtime uploader to transmit test and coverage data back to the backend. The document details the rationale behind this design, current behaviors, limitations, and a strategy for widespread adoption across various services and programming languages.

Note: this RFC is primarily design rationale and historical context. For current implementation details and operational guidance, use README.md, docs/Initial_documentation.md, and docs/Language_Onboarding.md.

Bazel 101

This section provides a short overview of Bazel concepts that are relevant to the proposal. It is not intended as a full introduction to Bazel, but rather a quick reference for terms used throughout this document.

Hermetic Sandboxes

Bazel executes builds and tests inside isolated sandboxes. Inputs are declared explicitly, network access is usually disabled, and outputs are cached deterministically. This ensures reproducibility but restricts ad-hoc network activity during test execution.

Cache

Bazel caches outputs of build and test actions based on their declared inputs. If the inputs do not change, the cached outputs can be reused without re-execution. This is central to Bazel’s performance model, and why undeclared inputs such as live network calls break reproducibility and invalidate cache guarantees.
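The idea can be illustrated with a deliberately simplified model. This is not Bazel's actual digest computation; `action_key` and its parameters are hypothetical names used only to show why a value that is not a declared input cannot influence cache reuse.

```python
import hashlib


def action_key(command: list, inputs: dict, env: dict) -> str:
    """Digest over everything the action declares: argv, env, and input contents.

    Simplified illustration only; Bazel's real action keys cover more state.
    """
    h = hashlib.sha256()
    for arg in command:
        h.update(arg.encode() + b"\0")
    for name in sorted(env):
        h.update(f"{name}={env[name]}".encode() + b"\0")
    for path in sorted(inputs):
        # Hash the file contents, so a content change produces a new key.
        h.update(path.encode() + b"\0" + hashlib.sha256(inputs[path]).digest())
    return h.hexdigest()


before = action_key(["go", "test"], {"settings.json": b"{}"}, {"DD_SERVICE": "api"})
after = action_key(["go", "test"], {"settings.json": b'{"x":1}'}, {"DD_SERVICE": "api"})
print(before != after)  # a changed declared input yields a different key
```

A response fetched over the network at test time never appears in `inputs`, so in this model it can change the test outcome without changing the key.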

Repository Rules

Repository rules run during repository resolution, before builds and tests. They can fetch external data or generate files, and their outputs are tracked by the cache. They are commonly used for dependency setup.

Rules vs Macros (what is a "real rule"?)

In Bazel, macros are Starlark functions that expand to other rule calls at load time; they cannot inspect providers or traverse dependency edges. A "real rule" refers to a Starlark rule (declared with rule(...)) that runs during the analysis phase. Real rules can:

  • Access providers from their dependencies.
  • Participate in analysis-time traversals (e.g., via aspects).
  • Declare actions and outputs with proper dependency tracking and caching.

When this RFC mentions "a real rule + aspect," it means we intentionally use an analysis-time rule (not just a macro) so we can read providers like rules_go’s GoArchive.importpath and make decisions that are visible to Bazel’s incremental analysis and cache.

Aspects

Aspects are analysis-time traversals that attach to specific attributes (e.g., deps, embed) and collect providers across a portion of the graph without changing the targets themselves. They are ideal for computing metadata derived from dependencies (such as a Go package’s effective importpath) and feeding that information into a rule that selects inputs or produces outputs accordingly. Aspects preserve hermeticity and let us mirror how native/rules-go logic derives values used by build/test rules.

Environment Variables

Bazel allows passing environment variables into repository rules (--repo_env) and test actions (--test_env). These are the primary mechanism for injecting configuration or secrets.

Sandbox Writable Paths

By default, Bazel sandboxes are read-only. Specific directories can be marked writable via --sandbox_writable_path, allowing tests to produce artifacts such as logs. The implementation in this repository does not require this for payload writing because it uses TEST_UNDECLARED_OUTPUTS_DIR.

BEP, BES, and BSP

  • BEP (Build Event Protocol): a structured stream of build/test events (targets, actions, test results, artifacts). Bazel can write BEP locally (e.g., --build_event_json_file/--build_event_binary_file) or send it to a remote backend.
  • BES (Build Event Service): a remote endpoint that receives BEP over gRPC (--bes_backend=<addr>). Consumers can process builds/tests out-of-band (e.g., post-processing uploads) without modifying test actions or breaking hermeticity.
  • BSP (Build Server Protocol): a language-agnostic protocol used by IDEs/tools to interact with build systems. Bazel BSP integrations translate Bazel’s graph and events (often via BEP) into BSP notifications and queries. While out of scope for the core rules here, BEP/BES/BSP awareness informs future integrations (e.g., uploader driven by BEP instead of running inside tests).

Problem Statement

Modern CI/CD pipelines benefit from Bazel's hermetic, reproducible builds and tests, which are enforced through sandboxing, deterministic inputs, and caching. Datadog Test Optimization enhances CI velocity and quality through service-level test settings, features like early flake detection, "known tests," and test management, as well as the collection of test and coverage results for analysis. However, the typical Test Optimization approach, which relies on runtime network access during tests, directly conflicts with Bazel's hermetic test execution, where network access is frequently blocked.

We need an integration that:

  • Works with Bazel’s hermetic sandbox model (preferably with network blocked during test actions).
  • Fetches Test Optimization metadata (settings, known tests, test management tests) at a time compatible with Bazel’s caching and repository resolution phases.
  • Scales across languages and services, including multi‑service monorepos.
  • Minimizes cache invalidation scope to avoid unnecessary test re‑execution.
  • Uploads test and coverage payloads reliably from the same bazel test invocation without compromising hermeticity or leaking secrets to disk.
  • Provides a stable, simple API (labels, macros) for consumers and avoids intrusive changes in test targets.

Current State

This section describes what exists today prior to this proposal and the work in this repository: running Datadog Test Optimization by fetching data at test runtime.

How tests run today (pre‑proposal)

  • Language tracers initialize within each test process (depending on the implementation this may be done from a parent process) and perform live network calls to Datadog to retrieve:
    • Service settings and feature flags.
    • Known Tests and Test Management tests if the feature is enabled.
  • Tracers also collect CI/Git metadata, inferred from environment variables or, when data is missing, gathered by running git commands directly.
  • Tests execute and the tracer records results. At test process completion, the tracer uploads test and coverage payloads directly to Datadog using either agentless (API key + site) or an EVP proxy URL.

Where this collides with Bazel

  • Bazel commonly enforces hermetic sandboxes with blocked network for test actions. Fetching metadata and uploading results from inside tests violates hermeticity and is not allowed in hermetic configurations.
  • Teams work around this by enabling network for tests (undermining reproducibility) or by scripting ad‑hoc prefetch steps that are not integrated into Bazel’s dependency and caching model.

Operational characteristics

  • Repeated work: every shard redundantly fetches the same settings and known tests.
  • Cache opacity: network responses at runtime are invisible to Bazel’s action cache; outcomes depend on timing rather than declared inputs.
  • Secrets everywhere: DD_API_KEY must be present in many test environments to enable uploads.
  • Multi‑service friction: in monorepos, service delineation has to be recreated per language and shard; ownership and separation are error‑prone.
  • Debuggability: failures and logs are scattered across shards and languages, making diagnosis difficult.

Proposal

We propose standardizing this approach across Bazel‑based services as the supported integration for Datadog Test Optimization. The core tenets are:

  1. Fetch metadata during module/repo resolution via a repository rule.
  2. Consume metadata via public filegroups and per‑module labels to minimize cache invalidations.
  3. Keep tests hermetic; write payloads to Bazel's built-in TEST_UNDECLARED_OUTPUTS_DIR (automatically collected to bazel-testlogs/<target>/test.outputs/).
  4. Upload payloads from a dedicated workspace-level uploader (normal rule, not test) via bazel run after tests complete, enriching with non‑secret context, supporting both agentless and EVP proxy modes.
  5. Provide thin language macros that compose these pieces for a smooth developer experience.

The implemented rules, companions, examples, and consumer-style validation fixtures now live in this repository and the public sibling fixture repository used by CI.

Design Overview

At a high level, the proposal moves all network‑dependent metadata fetching out of test actions and into Bazel's module/repository resolution phase, then ships results from a dedicated workspace-level uploader (via bazel run). This preserves hermeticity for user tests while maintaining complete Test Optimization functionality.

  • Phase 1 — Sync at module/repo resolution:

    • A module extension instantiates a repository rule that performs the Datadog API calls for Settings (always), Known Tests (when enabled), and Test Management tests (when enabled).
    • The rule writes deterministic JSON outputs under a configurable directory (default: .testoptimization/) and produces a non‑secret context.json with CI/Git/OS/runtime tags. It also writes a manifest.txt with a version marker (currently version=1) to track payload format changes.

    • It generates a BUILD file exposing stable public filegroups:
      • @<repo>//:test_optimization_files (core bundle with settings.json),
      • @<repo>//:test_optimization_context (context.json only),
      • @<repo>//:module_<sanitized> (per‑module bundles with settings.json + that module’s known/test‑management files).
    • It emits export.bzl with a structured topt_data object describing the available per‑module labels, the resolved manifest_path, and language hints (e.g., Go module path inclusion).
  • Phase 2 — Hermetic test execution:

    • Tests run without network and can optionally read the synced JSONs from runfiles (e.g., via DD_TEST_OPTIMIZATION_MANIFEST_FILE).
    • Instrumented tests write payloads to Bazel's built-in TEST_UNDECLARED_OUTPUTS_DIR, which is automatically collected to bazel-testlogs/<target>/test.outputs/. No --sandbox_writable_path or custom environment variables needed.
  • Phase 3 — Upload outside user tests:

    • A single workspace-level uploader target (normal rule, not test) runs via bazel run after tests complete.
    • The uploader discovers all test.outputs/ directories in bazel-testlogs/, waits for filesystem quiescence, enriches test payloads with context.json (if present), uploads via agentless (DD_API_KEY, DD_SITE) or an EVP proxy (DD_TEST_OPTIMIZATION_AGENT_URL), and deletes successfully uploaded files.
    • Usage: run bazel test, then //:dd_test_optimization_doctor, then //:dd_upload_payloads -- --dry-run --validate-enrichment, then the real //:dd_upload_payloads target. Preserve the test exit code, but do not run the real upload if doctor or dry-run enrichment validation fails.
  • Multi‑service monorepos:

    • A higher‑level “multi‑sync” extension materializes one repository per service and an aggregator repository that re‑exports per‑service filegroups and a service mapping (topt_data_by_service). Macros can select services by key without hardcoding repo aliases.

Why this solves the problem

  • Hermeticity: User tests run offline; only the repository rule (during resolution) and the uploader (via bazel run at the end) require network.
  • Caching: Metadata is captured as declared repository outputs; tests depend on stable filegroups, and per‑module bundles narrow cache invalidations to relevant targets.
  • Security: Secrets are not written to disk and are scoped to the uploader; test actions themselves do not require DD_API_KEY.
  • Simplicity: Consumers depend on stable labels and, where available, language macros that hide wiring details (env/runfiles/uploader).

Required Library Changes

Under Bazel, tracing libraries must avoid all network calls for Test Optimization and operate entirely on files. Libraries detect Bazel mode through two environment variables: a non‑empty DD_TEST_OPTIMIZATION_MANIFEST_FILE switches inputs to the pre‑fetched JSONs in runfiles, and DD_TEST_OPTIMIZATION_PAYLOADS_IN_FILES=true switches outputs to files, so test and coverage payloads are written to $TEST_UNDECLARED_OUTPUTS_DIR for the uploader to send. Outside Bazel, libraries keep their current behavior and talk to the network directly.
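A minimal sketch of that detection, assuming the two environment variables named above are the only signals. `BazelMode` and `detect_bazel_mode` are hypothetical names, and the exact-match comparison against "true" is an assumption, since case handling is not specified here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BazelMode:
    inputs_from_files: bool   # read settings/known tests from runfiles
    payloads_to_files: bool   # write payloads to files instead of the network


def detect_bazel_mode(env: Optional[Mapping[str, str]] = None) -> BazelMode:
    """Derive both mode switches from the environment (defaults to os.environ)."""
    env = os.environ if env is None else env
    manifest = env.get("DD_TEST_OPTIMIZATION_MANIFEST_FILE", "")
    payloads = env.get("DD_TEST_OPTIMIZATION_PAYLOADS_IN_FILES", "")
    return BazelMode(
        inputs_from_files=manifest != "",
        payloads_to_files=payloads == "true",  # assumption: exact, case-sensitive
    )
```

Keeping the two switches separate mirrors the text: inputs and outputs are detected independently even though they are usually set together.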

Common behavior:

  • Detection:
    • If DD_TEST_OPTIMIZATION_MANIFEST_FILE is set and non‑empty, read optimization data from files (Bazel mode for inputs).
    • If DD_TEST_OPTIMIZATION_PAYLOADS_IN_FILES is set to "true", write payloads to TEST_UNDECLARED_OUTPUTS_DIR instead of network (Bazel mode for outputs).
    • Both are typically set together by the dd_topt_go_test macro.
  • Inputs (read‑only):
    • Resolve DD_TEST_OPTIMIZATION_MANIFEST_FILE to a real file path via standard Bazel runfiles resolution:
      • If RUNFILES_MANIFEST_FILE exists, scan it to map the logical path to a real path.
      • Else, if TEST_SRCDIR or RUNFILES_DIR exists, join with the manifest path and test for existence.
      • Else, if the value is already an absolute path and exists, use it as‑is.
    • Call filepath.Dir() (or equivalent) on the resolved manifest path to get the manifest directory.
    • From that directory, load:
      • cache/http/settings.json
      • Any number of cache/http/known_tests*.json files (combined known tests)
      • Any number of cache/http/test_management*.json files (combined test management tests)
    • Accept both "combined" shapes (e.g., known_tests.json with data.attributes.tests) and per‑module shapes, exposed under canonical file names via per‑module targets. Merge by unioning entries; empty stubs are valid and should be treated as "no data".
  • Outputs (write‑only, when DD_TEST_OPTIMIZATION_PAYLOADS_IN_FILES=true):
    • Use TEST_UNDECLARED_OUTPUTS_DIR (Bazel's built-in writable directory for undeclared test outputs).
    • Ensure $TEST_UNDECLARED_OUTPUTS_DIR/payloads/tests and $TEST_UNDECLARED_OUTPUTS_DIR/payloads/coverage exist (create if needed, handling concurrent processes safely).
    • Serialize test payloads to $TEST_UNDECLARED_OUTPUTS_DIR/payloads/tests/*.json (JSON only; do not use msgpack in Bazel mode).
    • Serialize coverage payloads to $TEST_UNDECLARED_OUTPUTS_DIR/payloads/coverage/*.json (one file per logical coverage unit; uploader will wrap as multipart with a generated event.json).
    • Use unique, deterministic file names to avoid clashes across shards (e.g., include PID/TID/timestamp/random suffix). Flush and fsync where appropriate for durability.
    • Bazel automatically collects these outputs to bazel-testlogs/<package>/<target>/test.outputs/ after the test completes.
  • Network: In Bazel mode, do not perform any HTTP calls (for metadata fetch or uploads). All remote interactions are delegated to the repository rule (metadata) and the uploader (shipping via bazel run).
  • Config precedence: Honor settings.json feature flags (e.g., known tests enabled, test management enabled). If missing, default to conservative behavior (features disabled) rather than reaching the network.
  • Logging: Emit a clear startup line noting "Bazel mode enabled via DD_TEST_OPTIMIZATION_MANIFEST_FILE" and list resolved directory for troubleshooting.
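The input resolution and payload writing steps above can be sketched as follows. This is an illustration under stated assumptions, not the tracers' implementation: `resolve_manifest` and `write_payload` are hypothetical names, the file-name scheme and the write-then-rename step are choices made for this sketch, and the runfiles manifest parser handles only the simple `<logical path> <real path>` line format.

```python
import json
import os
import uuid
from pathlib import Path
from typing import Mapping, Optional


def resolve_manifest(env: Mapping[str, str]) -> Optional[Path]:
    """Map DD_TEST_OPTIMIZATION_MANIFEST_FILE to a real path, or return None."""
    logical = env.get("DD_TEST_OPTIMIZATION_MANIFEST_FILE", "")
    if not logical:
        return None
    # 1. Runfiles manifest: scan "<logical path> <real path>" lines.
    runfiles_manifest = env.get("RUNFILES_MANIFEST_FILE", "")
    if runfiles_manifest and os.path.isfile(runfiles_manifest):
        with open(runfiles_manifest, encoding="utf-8") as f:
            for line in f:
                key, _, real = line.rstrip("\n").partition(" ")
                if key == logical and real:
                    return Path(real)
    # 2. Runfiles directory. Assumption: fall through to this step when the
    #    manifest has no matching entry.
    for var in ("TEST_SRCDIR", "RUNFILES_DIR"):
        root = env.get(var, "")
        if root and (Path(root) / logical).exists():
            return Path(root) / logical
    # 3. Already an absolute path that exists.
    if os.path.isabs(logical) and os.path.exists(logical):
        return Path(logical)
    return None


def write_payload(kind: str, payload: dict, env: Mapping[str, str]) -> Path:
    """Write one JSON payload under $TEST_UNDECLARED_OUTPUTS_DIR/payloads/<kind>/."""
    if kind not in ("tests", "coverage"):
        raise ValueError(f"unknown payload kind: {kind}")
    out_dir = Path(env["TEST_UNDECLARED_OUTPUTS_DIR"]) / "payloads" / kind
    out_dir.mkdir(parents=True, exist_ok=True)  # tolerates concurrent creation
    final = out_dir / f"{kind}-{os.getpid()}-{uuid.uuid4().hex}.json"
    tmp = out_dir / (final.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, final)  # readers never see a half-written *.json file
    return final
```

The settings and known-tests files would then be loaded relative to `resolve_manifest(env).parent`, for example `parent / "cache" / "http" / "settings.json"`.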

Test data contracts (minimum viable)

  • cache/http/settings.json: full server response preferred; if absent, treat features as disabled and do not attempt network requests.
  • known tests: accept combined (data.attributes.tests) or per‑module canonical files (known_tests.json scoped per target) → module key → test identifiers. Merge by union.
  • test management tests: accept combined (data.attributes.modules) or per‑module canonical files (test_management.json scoped per target) → module key → test states. Merge by union.
  • Forward compatibility: ignore unknown keys; fail closed (no network) on parse errors in Bazel mode.

Backwards compatibility

  • Outside Bazel (no DD_TEST_OPTIMIZATION_MANIFEST_FILE), preserve current behavior: live metadata fetch (settings/known tests) and direct uploads according to existing environment variables.

Detailed Design

Repository Rule and Module Extension

  • The test_optimization_sync_extension tag is declared in MODULE.bazel. It instantiates test_optimization_sync with optional attributes:
    • service: explicit override for service name (else derived from DD_SERVICE).
    • runtime_name, runtime_version, runtime_arch: enrich configurations and context.json.
    • known_tests, test_management: local kill‑switches to skip specific feature requests and emit minimal stubs while adjusting settings.json accordingly.
    • debug: increases logging verbosity and writes additional artifacts (e.g., request JSONs) for troubleshooting.
  • The repository rule performs:
    1. Settings request: always issued; response persisted to cache/http/settings.json.
    2. Known Tests request: gated by settings and known_tests attribute; persisted to cache/http/known_tests.json and split by module (canonical per‑module files exposed by targets).
    3. Test Management Tests request: gated by settings and test_management attribute; persisted to cache/http/test_management.json and split by module (canonical per‑module files exposed by targets).
    4. context.json: built locally from CI/git/OS/runtime information — non‑secret and safe to ship as runfiles.
    5. A generated BUILD file that exposes:
      • :test_optimization_files → includes cache/http/settings.json and manifest.txt (stable bundle for most uses).
      • :test_optimization_context → context.json (opt‑in for enrichment).
      • :module_<sanitized> → per‑module bundle of settings + module‑specific JSONs.
      • Per‑module targets expose canonical runfile names rooted at the manifest directory (<out_dir>/..., default .testoptimization/...) regardless of the physical split-file locations.
    6. An export.bzl with topt_data describing labels, the resolved manifest_path, and language‑specific hints (e.g., Go module path inclusion).
  • HTTP behavior uses curl with fail‑fast and retries; DD_SITE is normalized; Windows and non‑Windows paths are handled. The rule declares all relevant env vars in environ so changes lead to re‑execution and fresh outputs.

Per‑Module Labels and Sanitization

  • Module names are sanitized for file and target names; collisions are resolved deterministically with numeric suffixes.
  • Consumers can depend only on the module(s) they need, limiting rebuilds and cache invalidations when unrelated modules change.
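One way such sanitization could work is sketched below. The character set, the `module_` prefix handling, and the suffix numbering are assumptions made for illustration; the repository rule's actual scheme may differ, and `module_labels` is a hypothetical name.

```python
import re


def module_labels(module_names: list) -> dict:
    """Map each module name to a unique target name, independent of input order."""
    labels = {}
    used = set()
    for name in sorted(set(module_names)):  # sorting makes the result deterministic
        base = "module_" + re.sub(r"[^A-Za-z0-9_]", "_", name)
        label, n = base, 1
        while label in used:  # collision: append _2, _3, ...
            n += 1
            label = f"{base}_{n}"
        used.add(label)
        labels[name] = label
    return labels


print(module_labels(["github.com/a/b", "github.com/a-b"]))
```

Both names above sanitize to the same base, so the one that sorts second receives the `_2` suffix regardless of the order in which the modules were listed.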

Multi‑Service Aggregation

  • For monorepos with multiple services, the multi‑service extension instantiates one repo per service plus an aggregator repo that exposes per‑service labels and a topt_data_by_service mapping. This allows macros to select a service by logical key without leaking the concrete repo alias.

Runtime Uploader

  • dd_payload_uploader is a normal Bazel rule (not a test) that runs via bazel run after tests complete. It discovers all test.outputs/ directories in bazel-testlogs/, waits for quiescence, then uploads and deletes payloads. It supports:
    • Agentless mode (DD_API_KEY, DD_SITE) posting to https://citestcycle-intake.<site>/api/v2/citestcycle and https://citestcov-intake.<site>/api/v2/citestcov.
    • EVP proxy mode (DD_TEST_OPTIMIZATION_AGENT_URL) posting to /evp_proxy/v2/... with subdomain routing headers.
  • A single uploader target per workspace is required (enforced via a runtime uploader lock file to prevent concurrent uploads; unrelated to MODULE.bazel.lock).
  • When context.json is present in runfiles (supplied via a data dependency on @<repo>//:test_optimization_context), test payloads are enriched by merging context keys.
  • Since bazel run executes locally with full host access, no sandbox workarounds are needed. The uploader runs after all tests complete, discovering payloads that Bazel collected from TEST_UNDECLARED_OUTPUTS_DIR.
  • Recommended invocation preserves test exit code:
    # Run the tests and remember their exit code instead of aborting.
    test_status=0
    bazel test --config=test-optimization //... || test_status=$?
    # Run the doctor target; do not upload if it fails.
    doctor_status=0
    bazel run --config=test-optimization //:dd_test_optimization_doctor || doctor_status=$?
    if [ "$doctor_status" -ne 0 ]; then
      if [ "$test_status" -ne 0 ]; then exit "$test_status"; fi
      exit "$doctor_status"
    fi
    # Dry run: validate enrichment before sending anything.
    dry_run_status=0
    bazel run --config=test-optimization //:dd_upload_payloads -- --dry-run --validate-enrichment || dry_run_status=$?
    if [ "$dry_run_status" -ne 0 ]; then
      if [ "$test_status" -ne 0 ]; then exit "$test_status"; fi
      exit "$dry_run_status"
    fi
    # Real upload; a failing test exit code still takes precedence.
    upload_status=0
    DD_API_KEY="$DD_API_KEY" DD_SITE="$DD_SITE" bazel run --config=test-optimization //:dd_upload_payloads || upload_status=$?
    if [ "$test_status" -ne 0 ]; then exit "$test_status"; fi
    exit "$upload_status"
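The discovery and agentless routing steps can be sketched as follows. This is not the uploader's code: `discover_payloads` and `agentless_url` are hypothetical names, the sketch assumes payloads are present as plain files under each test.outputs/ directory, and it leaves out quiescence waiting, context.json enrichment, the lock file, and the HTTP requests themselves.

```python
from pathlib import Path


def discover_payloads(testlogs: Path) -> dict:
    """Collect payload files from every test.outputs/ directory under testlogs."""
    found = {"tests": [], "coverage": []}
    for outputs_dir in sorted(testlogs.rglob("test.outputs")):
        if not outputs_dir.is_dir():
            continue
        for kind in found:
            payload_dir = outputs_dir / "payloads" / kind
            if payload_dir.is_dir():
                # Only *.json files are picked up; other files are ignored.
                found[kind].extend(sorted(payload_dir.glob("*.json")))
    return found


def agentless_url(kind: str, site: str) -> str:
    """Build the agentless intake URL for a payload kind and a Datadog site."""
    intakes = {"tests": "citestcycle", "coverage": "citestcov"}
    name = intakes[kind]
    return f"https://{name}-intake.{site}/api/v2/{name}"
```

For example, `agentless_url("tests", "datadoghq.com")` yields `https://citestcycle-intake.datadoghq.com/api/v2/citestcycle`, matching the endpoint pattern listed above.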

Language Macros

  • Provide macros per language to:

    • Attach runfiles and env (DD_TEST_OPTIMIZATION_MANIFEST_FILE, DD_TEST_OPTIMIZATION_PAYLOADS_IN_FILES).
    • Configure payloads to write to TEST_UNDECLARED_OUTPUTS_DIR automatically.
    • Surface reasonable defaults and allow overrides.
  • Note: Macros no longer create per-test uploaders. Users create ONE uploader target per workspace.

  • Go importpath inference:

    • A Starlark aspect walks embed on the go_test target and reads GoArchive.importpath from rules_go providers, mirroring how go_test computes it.
    • A small rule uses the inferred importpath to pick the matching :module_<sanitized> filegroup from the synced repo and exposes it in runfiles; the macro sets DD_TEST_OPTIMIZATION_MANIFEST_FILE to $(rlocationpath <manifest_path>) using topt_data["manifest_path"], so custom out_dir values are supported.
    • Precedence: (1) explicit importpath kwarg on the go_test; (2) provider‑based inference via embed; (3) fallback to <go module path>/<bazel package>.
    • The exported topt_data["runtimes"]["go"]["module_included"] flag is consulted only in fallback mode; when inferring via (1) or (2), the macro always attempts per‑module selection and falls back to the full bundle if no match exists.
  • Module dependency: the Go companion module (datadog-rules-test-optimization-go) declares bazel_dep("rules_go", <version>) to make provider loads visible under Bzlmod; it does not configure toolchains. Consumers must still configure rules_go and the Go SDK in their own MODULE.bazel.

  • The existing dd_topt_go_test demonstrates this pattern and should be mirrored for other languages incrementally.
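The importpath precedence and bundle selection described above can be modeled in a few lines. The real logic lives in Starlark (an aspect plus a rule); this Python sketch only mirrors the decision order. `resolve_importpath` and `select_bundle` are hypothetical names, the exact-match lookup is an assumption, and the `module_included` flag is intentionally not modeled.

```python
from typing import Mapping, Optional, Tuple


def resolve_importpath(
    explicit: Optional[str],
    inferred: Optional[str],
    go_module_path: str,
    bazel_package: str,
) -> Tuple[str, str]:
    """Return (importpath, source) following the documented precedence."""
    if explicit:
        return explicit, "explicit"   # (1) importpath kwarg on the go_test
    if inferred:
        return inferred, "provider"   # (2) GoArchive.importpath via embed
    # (3) fallback: <go module path>/<bazel package>; the root-package case
    #     (empty package) is handled here as an assumption.
    if bazel_package:
        return f"{go_module_path}/{bazel_package}", "fallback"
    return go_module_path, "fallback"


def select_bundle(importpath: str, labels_by_module: Mapping[str, str], full_bundle: str) -> str:
    """Pick the per-module label when one matches, else the full bundle."""
    return labels_by_module.get(importpath, full_bundle)
```

A call such as `select_bundle(path, labels, ":test_optimization_files")` returns the per-module label when `path` is a key of `labels` and the supplied bundle label otherwise.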

Security Considerations

  • Secrets are not written to disk. The repo rule’s HTTP calls rely on DD_API_KEY forwarded through --repo_env at fetch time; the uploader uses either DD_API_KEY or DD_TEST_OPTIMIZATION_AGENT_URL at upload time (bazel run step). Git metadata overrides also belong in --repo_env, not --test_env, so they do not become test action cache keys.
  • context.json contains only non‑secret metadata and is safe to include as runfiles.
  • Consumers should configure sandboxing and network blocking for tests (--config=hermetic, --sandbox_default_allow_network=false). No additional writable sandbox paths are needed, because payloads are written to TEST_UNDECLARED_OUTPUTS_DIR.

Performance and Caching

  • Repository fetches are kept minimal and retried on transient failures. Splitting per module reduces invalidation impact.
  • Caching keys for the repository rule include attributes and declared env vars; downstream test targets cache on the content of the JSONs they depend on.

Observability and Debugging

  • Informational and debug logs are printed by the rule and the uploader; additional artifacts (request bodies) can be enabled via debug.
  • The bazelw wrapper in this repo can materialize git metadata into --repo_env to ensure consistent fetch keys across CI runs.

Backwards Compatibility and Adoption

  • Public labels are stable: :test_optimization_files, :test_optimization_context, :module_<sanitized>.
  • Consumers on WORKSPACE can instantiate the repository rule directly; Bzlmod users rely on use_extension + use_repo.
  • A gradual rollout plan can start by adopting the uploader rule and filegroup dependencies, then moving to language macros for deeper integration.

Limitations of the Proposal

Hermetic and Network Boundaries

  • Tests remain hermetic with network disabled, but the repository rule and the uploader (via bazel run) still require network access during repository resolution and the final upload step. Environments that mandate zero network everywhere must seed artifacts from trusted mirrors and route uploads through internal proxies. Implementing the uploader as a BEP (Build Event Protocol) watcher could potentially overcome this, but more research is needed to understand how it would be implemented.

Test Impact Analysis (TIA)

  • Omitted initially due to interference with Bazel caching semantics; any future TIA integration should be explicit opt‑in and carefully evaluated.

Cache Invalidation and Granularity

  • Settings changes and Test Management updates appropriately invalidate caches. Per‑module filegroups minimize blast radius, but monolithic tests can still see broad invalidations.

Language Coverage

  • Convenience macros exist for Go; other languages require similar wrappers or manual wiring until macros are provided.

Platform and Tooling

  • Cross‑platform handling exists (macOS/Linux/Windows), but less common shells/environments may need tweaks. curl is required; jq is optional for enrichment.

Alternatives Considered

  • Runtime fetching in each test: violates hermeticity, increases flakiness, and explodes network calls.
  • Bazel spawn strategy wrappers or custom test runners: intrusive and language‑specific; increases maintenance burden.
  • BES (Build Event Service) post‑processing: could be implemented for the uploader but requires more research time.
  • Repository rule without per‑module splitting: simpler, but causes wider cache invalidations across large monorepos.
  • Macro‑only Go inference: macros cannot read providers in Bazel; a real rule + aspect is required to access GoArchive.importpath and follow embed dependencies during analysis while keeping BUILD usage simple.

Future Work

  • Evaluate an opt‑in TIA approach that does not fight Bazel caching semantics (e.g., coarse‑grained TIA hints that do not land in the action cache key).
  • Evaluate including tracer telemetry data by modifying the tracer's telemetry transport and adding support for these payloads to the uploader.
  • Provide first‑class macros for other languages and unify env wiring patterns.
  • Evaluate implementing the uploader on top of the Bazel Build Event Service (BES).
  • Add optional verification tests that assert context.json enrichment in CI.