OpenTelemetry tracing support #786

Open
tewbo wants to merge 35 commits into ydb-platform:main from tewbo:otel-tracing-support

@tewbo tewbo commented Mar 21, 2026

Pull request type

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • Documentation content changes
  • Other (please describe):

What is the current behavior?

The YDB Python SDK does not provide built-in OpenTelemetry tracing support. There is a legacy integration with the OpenTracing API, which is a deprecated standard.

Issue Number: N/A

What is the new behavior?

Adds OpenTelemetry tracing support to the YDB Python SDK. When enabled via enable_tracing(), the SDK automatically creates spans for key operations:

  • ydb.CreateSession — session creation
  • ydb.ExecuteQuery — query execution (session and transaction level, both sync and async)
  • ydb.Commit / ydb.Rollback — transaction commit and rollback
  • ydb.Driver.Initialize — driver initialization

Each span includes standard attributes: db.system.name, db.namespace, server.address, server.port, ydb.session.id, ydb.node.id, ydb.tx.id.

W3C Trace Context (traceparent) is automatically propagated in gRPC metadata, enabling end-to-end distributed tracing between client and YDB server. Execute spans cover the full operation lifecycle, including streaming result iteration — not just the initial gRPC call. Errors are recorded on spans with error.type, db.response.status_code, and exception events.
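
For readers unfamiliar with the header, here is a minimal stdlib-only sketch of what carrying `traceparent` in gRPC-style metadata amounts to. It is not the code from this PR: the helper names `make_traceparent`, `inject` and `extract` are hypothetical, and the `x-example-header` entry is purely illustrative. The only thing taken as given is the W3C format `version-traceid-parentid-flags` in lowercase hex.

```python
import re

# W3C Trace Context "traceparent": 2-hex version, 32-hex trace id,
# 16-hex parent (span) id, 2-hex flags, separated by dashes.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def make_traceparent(trace_id, span_id, sampled=True):
    """Format integer ids as a version-00 traceparent value (hypothetical helper)."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def inject(metadata, trace_id, span_id):
    """Return a copy of gRPC-style metadata with a traceparent pair appended."""
    return list(metadata) + [("traceparent", make_traceparent(trace_id, span_id))]


def extract(metadata):
    """Parse the first traceparent entry, or return None if absent or invalid."""
    for key, value in metadata:
        if key.lower() != "traceparent":
            continue
        match = _TRACEPARENT_RE.match(value)
        if match is None:
            return None
        _version, trace_hex, span_hex, flags_hex = match.groups()
        trace_id, span_id = int(trace_hex, 16), int(span_hex, 16)
        if trace_id == 0 or span_id == 0:
            return None  # all-zero ids are invalid per the spec
        return {
            "trace_id": trace_id,
            "span_id": span_id,
            "sampled": bool(int(flags_hex, 16) & 0x01),
        }
    return None


metadata = inject(
    [("x-example-header", "demo")],  # illustrative entry, not a real SDK header
    trace_id=0x4BF92F3577B34DA6A3CE929D0E0E4736,
    span_id=0x00F067AA0BA902B7,
)
```

In the SDK itself this work is presumably delegated to the OpenTelemetry propagator API; the sketch only shows the shape of the value that ends up on the wire.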

Tracing is opt-in (pip install ydb[tracing] + enable_tracing()). Without it, the SDK behavior is unchanged — all tracing code paths are no-op.
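
The opt-in, no-op-by-default behavior can be pictured with a small stdlib-only sketch. Everything here is an assumption for illustration: `NoopPlugin`, `RecordingPlugin`, `toy_enable_tracing` and `execute_query` are hypothetical names, and the real `enable_tracing()` in this PR may be structured quite differently.

```python
from contextlib import contextmanager


class NoopPlugin:
    """Default plugin: opening a span does nothing."""

    @contextmanager
    def span(self, name, attributes=None):
        yield None


class RecordingPlugin:
    """Stand-in for an OpenTelemetry-backed plugin; keeps finished spans in a list."""

    def __init__(self):
        self.finished = []

    @contextmanager
    def span(self, name, attributes=None):
        record = {"name": name, "attributes": dict(attributes or {}), "error": None}
        try:
            yield record
        except Exception as exc:
            record["error"] = type(exc).__name__  # analogous to error.type
            raise
        finally:
            self.finished.append(record)


# Mutable holder so that enabling tracing swaps the plugin for every caller.
_state = {"plugin": NoopPlugin()}


def toy_enable_tracing(plugin=None):
    """Hypothetical switch: replace the no-op plugin with a recording one."""
    _state["plugin"] = plugin if plugin is not None else RecordingPlugin()
    return _state["plugin"]


def execute_query(text):
    """Instrumented call site; it looks the same whether tracing is on or off."""
    with _state["plugin"].span("ydb.ExecuteQuery", {"db.system.name": "ydb"}):
        return f"result of {text}"
```

The design point is that call sites never branch on "is tracing enabled"; they always go through the plugin, and the default plugin does nothing.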

Other information

  • Includes unit tests for sync, async, error handling, parent-child relationships, context propagation, noop mode, and concurrent span isolation

@tewbo tewbo marked this pull request as draft March 23, 2026 10:36
@KirillKurdyukov KirillKurdyukov self-requested a review March 23, 2026 13:47
@tewbo tewbo marked this pull request as ready for review March 24, 2026 07:02
Comment thread ydb/opentelemetry/_plugin.py Outdated
Comment thread setup.py Outdated
Comment thread examples/opentelemetry/example.py Outdated
Comment thread ydb/opentelemetry/_plugin.py Outdated
@tewbo tewbo marked this pull request as draft April 9, 2026 17:56
@tewbo tewbo marked this pull request as ready for review April 9, 2026 18:15
KirillKurdyukov and others added 3 commits April 20, 2026 14:38
Query-service retries now emit an umbrella INTERNAL ydb.RunWithRetry span
and a ydb.Try INTERNAL span per attempt. Each ydb.Try carries the
ydb.retry.backoff_ms attribute (the sleep preceding the attempt — 0 for
the first one, i.e. the next-attempt timeline includes the backoff).
Retriable exceptions are recorded on the owning ydb.Try span, and an
exception that escapes the whole retry loop (including an
asyncio.CancelledError hitting a backoff sleep) is recorded on the outer
ydb.RunWithRetry span.

CLIENT spans (ydb.CreateSession, ydb.ExecuteQuery, ydb.Commit,
ydb.Rollback) now also emit network.peer.address / network.peer.port
for the concrete node serving the session, while server.address /
server.port keep meaning the host from the connection string.

Also fixes a "Пр" typo in docs/opentelemetry.rst and corrects span names
(ydb.CommitTransaction -> ydb.Commit, ydb.RollbackTransaction -> ydb.Rollback).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move ydb.RunWithRetry / ydb.Try span emission directly into
retry_operation_sync / retry_operation_async in ydb/retries.py, and drop
the short-lived ydb.query._retries shim. Tracing is still no-op by
default, so there is no cost for the table-service callers that share
the same retry loop; we just stop duplicating the retry logic to add
spans.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
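
A rough sketch of the span layout the two commit messages above describe: one `ydb.RunWithRetry` span wrapping one `ydb.Try` span per attempt. This is not the code in `ydb/retries.py`. The `span` recorder, the `Retriable` exception and the exponential backoff schedule are hypothetical, the sketch follows the commit message in setting `ydb.retry.backoff_ms` to 0 on the first attempt, and it assumes the backoff sleep runs between `ydb.Try` spans rather than inside them.

```python
import time
from contextlib import contextmanager

SPANS = []  # toy span sink, in creation order


@contextmanager
def span(name, attributes=None):
    """Toy span: records its name, attributes and any exception passing through."""
    record = {"name": name, "attributes": dict(attributes or {}), "exception": None}
    SPANS.append(record)
    try:
        yield record
    except BaseException as exc:
        record["exception"] = type(exc).__name__
        raise


class Retriable(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def run_with_retry(fn, max_attempts=3, base_backoff_ms=10, sleep=time.sleep):
    with span("ydb.RunWithRetry"):
        backoff_ms = 0  # no sleep precedes the first attempt
        for attempt in range(1, max_attempts + 1):
            if backoff_ms:
                # Assumed placement: an error raised here lands on the outer span.
                sleep(backoff_ms / 1000.0)
            try:
                with span("ydb.Try", {"ydb.retry.backoff_ms": backoff_ms}):
                    return fn()
            except Retriable:
                if attempt == max_attempts:
                    raise  # escapes the loop; recorded on ydb.RunWithRetry
                backoff_ms = base_backoff_ms * 2 ** (attempt - 1)
```

With a function that fails twice and then succeeds, this produces one outer span and three `ydb.Try` spans, the first two carrying the retriable exception.
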
…p session.id/tx.id

RPC (CLIENT-kind) spans now carry the peer metadata from the discovery
endpoint map, not from the grpc-target string of the request:

  * network.peer.address = EndpointInfo.address (the node host)
  * network.peer.port    = EndpointInfo.port
  * ydb.node.dc          = EndpointInfo.location

To do that, EndpointOptions and Connection now also carry address/port/
location populated by resolver.endpoints_with_options(); sessions
resolve their peer tuple via driver._store.connections_by_node_id after
CreateSession returns, which is the right place to ask which node owns
this session.

Dropped the noisy ydb.session.id and ydb.tx.id attributes - they pollute
every span and are recoverable from trace context if really needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
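
To make the split between `server.*` and `network.peer.*` concrete, here is a small sketch of how the attributes might be assembled once a session knows its node id. The class `EndpointInfoSketch` and both functions are hypothetical stand-ins; only the attribute names and the idea of a lookup keyed by node id come from the commit message above.

```python
class EndpointInfoSketch:
    """Hypothetical stand-in for a discovery entry: node host, port and location."""

    def __init__(self, address, port, location):
        self.address = address
        self.port = port
        self.location = location


def peer_attributes(endpoints_by_node_id, node_id):
    """Attributes for the concrete node serving the session, or {} if unknown."""
    info = endpoints_by_node_id.get(node_id)
    if info is None:
        return {}
    return {
        "network.peer.address": info.address,
        "network.peer.port": info.port,
        "ydb.node.dc": info.location,
    }


def client_span_attributes(server_host, server_port, database, peer):
    """server.* stays the connection-string host; peer attributes are layered on top."""
    attributes = {
        "db.system.name": "ydb",
        "db.namespace": database,
        "server.address": server_host,
        "server.port": server_port,
    }
    attributes.update(peer)
    return attributes


# Illustrative data only: host names, ports and node ids are made up.
endpoints = {7: EndpointInfoSketch("node-7.example.internal", 2136, "dc-a")}
attrs = client_span_attributes(
    "ydb.example.internal", 2135, "/local", peer_attributes(endpoints, 7)
)
```
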
Member

@vgvoleg vgvoleg left a comment

Please address the review comments and rebase onto the latest main.

Comment thread docs/index.rst Outdated
Comment thread ydb/opentelemetry/_plugin.py Outdated
Comment thread ydb/query/base.py Outdated
Comment thread ydb/retries.py Outdated
Comment thread ydb/opentelemetry/tracing.py
Comment thread ydb/opentelemetry/tracing.py Outdated
Comment thread ydb/opentelemetry/_plugin.py
Comment thread docs/opentelemetry.rst Outdated
Comment thread docs/opentelemetry.rst Outdated
Comment thread ydb/query/base.py Outdated
Comment thread ydb/query/session.py

robot-vibe-db Bot commented May 1, 2026

AI Review Summary

Verdict: ✅ No critical issues found

Critical issues

No critical issues found.

Other findings

  • Minor | High: Docs say ydb.retry.backoff_ms is "0 for the first one" but the code omits the attribute entirely on the first ydb.Try span — docs/opentelemetry.rst:108
  • Minor | Medium: __del__ in response iterators calls ContextVar.reset(token) which can raise ValueError if GC runs the finaliser in a different execution context, producing noisy stderr warnings — ydb/query/base.py:124, ydb/aio/query/base.py:49
  • Minor | Low: Outer except Exception in execute() methods ends the span but does not pop the gRPC propagation token, leaking a stale ContextVar value if the iterator constructor raises — ydb/query/session.py:536 (same pattern in transaction.py and aio/query/session.py)
  • Nit | High: Copy-paste artifact in "Typical local run" docs — cd ../.. followed by duplicated commands that would fail outside the repo root — docs/opentelemetry.rst:248-250

This review was generated automatically. Critical issues require attention; other findings are advisory.
If this comment was useful, please give it a 👍 — it helps us improve the review bot.

@KirillKurdyukov

image

@robot-vibe-db robot-vibe-db Bot left a comment

AI Review Summary

Verdict: ✅ No critical issues found

Critical issues

No critical issues found.

Other findings

  • Major | High: Async Connection class is missing peer_address, peer_port, peer_location slots/init — network.peer.* and ydb.node.dc span attributes are silently never populated in the async path — ydb/aio/connection.py:143-165
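
One way a gap like this can stay silent is sketched below, under the assumption (not confirmed by the PR) that the tracing code reads the peer fields defensively with `getattr(..., None)`. The two connection classes and `peer_attrs` are hypothetical; only the attribute names `peer_address`, `peer_port`, `peer_location` come from the finding above.

```python
class SyncConnectionSketch:
    """Has the peer slots, so peer attributes can be populated."""

    __slots__ = ("endpoint", "peer_address", "peer_port", "peer_location")

    def __init__(self, endpoint, peer_address=None, peer_port=None, peer_location=None):
        self.endpoint = endpoint
        self.peer_address = peer_address
        self.peer_port = peer_port
        self.peer_location = peer_location


class AsyncConnectionSketch:
    """Lacks the peer slots, mirroring the situation the finding describes."""

    __slots__ = ("endpoint",)

    def __init__(self, endpoint):
        self.endpoint = endpoint


def peer_attrs(connection):
    """Defensive read: a missing slot yields None, so nothing fails loudly."""
    address = getattr(connection, "peer_address", None)
    if address is None:
        return {}
    return {
        "network.peer.address": address,
        "network.peer.port": getattr(connection, "peer_port", None),
        "ydb.node.dc": getattr(connection, "peer_location", None),
    }
```

A test that builds the async connection and asserts on `network.peer.address` would catch this class of omission regardless of how the attributes are read.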

Comment thread ydb/aio/connection.py

3 participants