bench: MCP-transport sibling of the code-graph track by DvirDukhan · Pull Request #684 · FalkorDB/code-graph

DvirDukhan · 2026-05-27T11:49:30Z

bench: MCP-transport sibling of the code-graph track

Adds a second bench track that exercises code-graph through the
exact transport real-world agents use (Claude Code, Cursor, Cline…) —
JSON-RPC over stdio to a spawned cgraph-mcp server — instead of HTTP
to the FastAPI service.

Same backend graph, different transport. This lets us answer
"does the MCP integration regress accuracy or token usage vs the
direct HTTP path?" on the same SWE-bench tasks.

What's in

Adapter & shim

bench/agents/code_graph_mcp_adapter.py — sync Python adapter that
spawns cgraph-mcp per call via the official MCP Python SDK.
Covers all 8 MCP tools.
bench/cli/cg-mcp + cg_mcp.py — bash-callable CLI shim mirroring
the existing cg shim (mini-swe-agent only does bash, so each
"tool" is one CLI invocation).
bench/tools/code_graph_mcp/{tools.yaml,system_preamble.md} —
agent config; excludes ask per Q2 (no nested LLM in the
benchmarked tool set).

Runner integration (second commit)

VALID_CONFIGS gains code_graph_mcp → --config code_graph_mcp
is now a first-class track.
New INSTANCE_TEMPLATE_CODE_GRAPH_MCP: mirrors the HTTP template
but tells the agent to call cg-mcp with $PROJECT_NAME +
$BRANCH, and to use impact_analysis before non-trivial edits.
config_env("code_graph_mcp", ...) prepends the venv bin to
PATH (so cgraph-mcp is callable from the agent's bash),
passes FALKORDB_* through, and exports PROJECT_NAME /
BRANCH.
_ensure_indexed_mcp() mirrors _ensure_indexed but goes through
the MCP adapter. Skip-if-present probe hits FalkorDB's GRAPH.LIST
directly (one trip, no MCP spawn).
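
The environment wiring for the new config can be pictured with a small sketch. This is not the code in mini_runner.py: the helper name `build_mcp_env` and its parameters are hypothetical, and dropping every variable other than `FALKORDB_*` is an assumption made for illustration. Only the three behaviors (PATH prepend, `FALKORDB_*` passthrough, `PROJECT_NAME` / `BRANCH` export) come from the description above.

```python
import os


def build_mcp_env(base_env, venv_bin, project_name, branch):
    """Build a child-process environment for the MCP track (hypothetical helper)."""
    env = {}
    # Pass FalkorDB connection settings through to the spawned MCP server.
    for key, value in base_env.items():
        if key.startswith("FALKORDB_"):
            env[key] = value
    # Prepend the venv bin directory so `cgraph-mcp` resolves from the agent's bash.
    old_path = base_env.get("PATH", "")
    env["PATH"] = venv_bin + (os.pathsep + old_path if old_path else "")
    # Exported so the instance template can reference $PROJECT_NAME and $BRANCH.
    env["PROJECT_NAME"] = project_name
    env["BRANCH"] = branch
    return env
```

Putting the venv first on PATH means the agent's shell picks up the benchmarked `cgraph-mcp` build rather than any system-wide install.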

Tests

tests/bench/test_cg_mcp_adapter.py — 5 unit + 1 e2e. The e2e is
double-gated on FalkorDB being reachable AND api.mcp.server being
importable, so it skips cleanly on branches that pre-date the MCP
stack.
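
One way to express the double gate is sketched below. The helper names are hypothetical and the default host and port (`localhost:6379`) are assumptions; the actual test file may gate differently.

```python
import importlib.util
import socket


def port_reachable(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def module_importable(name):
    """Return True if `name` can be located; parent packages are imported to check."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def e2e_skip_reason(host="localhost", port=6379):
    """Return a skip reason, or None when both gates pass."""
    if not port_reachable(host, port):
        return f"FalkorDB not reachable at {host}:{port}"
    if not module_importable("api.mcp.server"):
        return "api.mcp.server not importable (MCP stack not on this branch)"
    return None
```

In a pytest module the result would typically feed `pytest.mark.skipif(e2e_skip_reason() is not None, reason=...)`, so the e2e test reports as skipped rather than failed.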

Validation

Heavy e2e validated locally against the api/ subgraph
(~6.3k nodes / ~6.2k edges) over real stdio:

search_code(prefix=index_repo)  -> id=96, api/cli.py:223
get_callers(96)                  -> [] (CLI entry, correct)
impact_analysis(96, depth=2)     -> [] (consistent)

Runner wiring smoke-verified:

VALID_CONFIGS = ('baseline','lsp','code_graph','code_graph_mcp')
load_instance_template('code_graph_mcp')  contains 'cg-mcp'
config_env populates PROJECT_NAME / BRANCH / FALKORDB_HOST

Stacking

Sits on dvirdukhan/fix-find-symbol-nested-name (the bench harness
PR). The end-to-end test needs the MCP server itself, which lands in
PRs #675 → #683. Until both stacks merge to staging, the e2e
self-skips and the 5 unit tests cover the adapter logic.

Design notes

Per-call spawn. ~0.5–1s startup per cg-mcp … invocation.
Keeps the bash shim trivially safe (one process per call, no
shared state) and matches how real agents use MCP servers today.
A future optimization could persist the server via a side-channel
daemon if benchmarks show overhead matters.
Output shape. Collection-returning tools land their array in
structuredContent.result per the MCP spec; _extract handles
both that and the text-chunk JSON shape so the bash agent always
sees a single clean JSON document.
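
Both notes can be illustrated together. The sketch below is an assumption-laden approximation, not the adapter's code: `extract_payload`, `call_tool_once`, and `call_tool_sync` are hypothetical names, and the real `_extract` may treat edge cases differently. The SDK calls used (`StdioServerParameters`, `stdio_client`, `ClientSession.initialize`, `ClientSession.call_tool`) belong to the official MCP Python SDK.

```python
import asyncio
import json


def extract_payload(result):
    """Normalize a tool-call result dict into one JSON-serializable value."""
    structured = result.get("structuredContent")
    if isinstance(structured, dict):
        # Collection-returning tools wrap their array under "result".
        return structured.get("result", structured)
    # Fallback: join the text chunks and try to parse them as JSON.
    text = "".join(
        chunk.get("text", "")
        for chunk in result.get("content") or []
        if chunk.get("type") == "text"
    )
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


async def call_tool_once(tool, arguments, command="cgraph-mcp"):
    """Spawn the server, make one tool call, then let the process exit."""
    # Imported lazily so extract_payload stays usable without the SDK installed.
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    params = StdioServerParameters(command=command, args=[])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(tool, arguments=arguments)
            return extract_payload(result.model_dump())


def call_tool_sync(tool, arguments):
    """Synchronous wrapper: one event loop and one server process per call."""
    return asyncio.run(call_tool_once(tool, arguments))
```

A call such as `call_tool_sync("search_code", {"prefix": "index_repo"})` would then pay the full spawn cost each time, which is the trade-off the per-call design accepts in exchange for having no shared state.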

Adds a second bench track that exercises code-graph through the exact transport real-world agents use (Claude Code, Cursor, …): JSON-RPC over stdio to a spawned `cgraph-mcp` server, instead of HTTP to the FastAPI service.

Files:

- bench/agents/code_graph_mcp_adapter.py: sync Python adapter that spawns cgraph-mcp per call via the official MCP Python SDK. Knows the 8-tool MCP surface (search_code, get_callers, get_callees, get_dependencies, impact_analysis, find_path, index_repo, ask).
- bench/cli/cg-mcp + cg_mcp.py: bash-callable CLI shim mirroring the existing `cg` shim. mini-swe-agent only does bash, so each "tool" is one CLI invocation.
- bench/tools/code_graph_mcp/{tools.yaml,system_preamble.md}: agent config for the MCP track. Mirrors code_graph; same Q2 decision to exclude `ask` (no nested LLM in the benchmarked tool set).
- tests/bench/test_cg_mcp_adapter.py: 5 unit + 1 e2e test (FalkorDB-gated AND MCP-server-gated so it skips cleanly until the MCP stack lands on staging).

Heavy e2e validated against the api/ subgraph (~6.3k nodes) over real stdio: search_code -> get_callers -> impact_analysis returned expected payloads.

Depends on the MCP stack (PRs #675–#683) for cgraph-mcp itself. Lands cleanly once that stack merges.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai · 2026-05-27T11:49:37Z

Important

Review skipped

Draft detected.


The adapter and shim from the previous commit were inert from the runner's perspective: VALID_CONFIGS only knew baseline/lsp/code_graph. This commit makes `--config code_graph_mcp` a first-class track.

Changes in bench/runners/mini_runner.py:

- VALID_CONFIGS gains "code_graph_mcp" (passes argparse + help string).
- New INSTANCE_TEMPLATE_CODE_GRAPH_MCP: mirrors the HTTP code_graph template but tells the agent to call `cg-mcp` with $PROJECT_NAME + $BRANCH, and to use impact_analysis before non-trivial edits.
- load_instance_template dispatches the new template.
- config_env("code_graph_mcp", ...) prepends venv bin to PATH (so cgraph-mcp is callable from the agent's bash), passes FALKORDB_* through to the spawned MCP server, and exports PROJECT_NAME + BRANCH, which the preamble references.
- New _ensure_indexed_mcp() mirrors _ensure_indexed but goes through the bench MCP adapter instead of HTTP. Skip-if-present probe hits FalkorDB's GRAPH.LIST directly (one trip, no MCP spawn).
- Per-instance loop now dispatches to _ensure_indexed_mcp for the new config.

Smoke-verified that:

- VALID_CONFIGS == ('baseline','lsp','code_graph','code_graph_mcp')
- load_instance_template('code_graph_mcp') contains 'cg-mcp'
- config_env populates PROJECT_NAME/BRANCH/FALKORDB_HOST

Unit tests for the adapter still pass (5 passed, 1 skipped; the heavy e2e is double-gated on FalkorDB + api.mcp.server availability).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The mini_runner previously printed a [index] WARN line on analyze_folder errors and continued. This meant SWE-bench instances whose path falls outside ALLOWED_ANALYSIS_DIR (e.g. when the API server is started from a sibling worktree) would silently run the agent against a missing code-graph project. The agent's first cg call returns 400 'Missing project ...', the agent falls back to grep/sed, and we get a token count that looks bad for the code_graph track but actually reflects 'tool unavailable'.

Two changes:

* analyze_folder errors and httpx exceptions now raise RuntimeError with the offending path. This stops the run and surfaces the ALLOWED_ANALYSIS_DIR misconfiguration immediately.
* analyze_folder timeout bumped 600s -> 1800s. The 600s default was tight for sympy (~5 MB of Python, ~5000 functions) and caused a timeout during indexing.

This was discovered while running the first real 3-way SWE-bench smoke. With the fix and a corrected ALLOWED_ANALYSIS_DIR, the code_graph track produces sensible numbers (-11% input vs baseline across the smoke sample vs the prior bogus +4.7%).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
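
The warn-and-continue versus fail-loud distinction in this commit can be sketched as follows. Everything here is illustrative: `ensure_indexed` takes an injected `analyze` callable standing in for the HTTP call, and the response shape (a dict with an `"error"` key on failure) is an assumption, not the API's documented contract.

```python
def ensure_indexed(path, analyze, timeout_s=1800):
    """Index `path` via the injected `analyze` callable, raising on any failure."""
    try:
        # `analyze` stands in for the HTTP request to analyze_folder.
        response = analyze(path, timeout_s)
    except Exception as exc:
        # Transport errors and timeouts stop the run instead of being logged.
        raise RuntimeError(f"analyze_folder raised for {path!r}: {exc}") from exc
    # Assumed response shape: a dict carrying an "error" key on failure.
    if response.get("error"):
        raise RuntimeError(f"analyze_folder failed for {path!r}: {response['error']}")
    return response
```

Including the offending path in the message is what makes an ALLOWED_ANALYSIS_DIR mismatch visible at the first instance rather than after a full run of misleading token counts.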

The Python analyzer hardcoded `environment_path={path}/venv` when starting jedi-language-server via multilspy. When the repo had no venv (the common case for cloned codebases like sphinx, sympy, anything from SWE-bench), jedi raised `InvalidPythonEnvironment` on every `request_definition()` call. analyzer.resolve() then swallowed the exception silently and the indexer produced a graph with DEFINES edges only: zero CALLS, zero EXTENDS. Benchmark validation showed sphinx (5K functions) and sympy (41K functions) had no resolved cross-references at all.

Fix:

- source_analyzer.py: prefer {repo}/venv, then {repo}/.venv, then fall back to the host interpreter's environment (sys.executable's prefix) so jedi always has a valid Python to introspect.
- analyzer.py: log resolve() failures at WARN with file/line context instead of swallowing them silently, so the next regression is loud.

Verified: re-indexed sphinx-doc/sphinx-9230 with the fix: DEFINES: 5640, CALLS: 4931, EXTENDS: 484 (was DEFINES-only).

Fixes #685.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
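
The fallback order in the fix can be sketched as below. The function name `pick_environment_path` is hypothetical, and using `sys.prefix` for "the host interpreter's environment" is an interpretation of the commit message rather than a copy of source_analyzer.py.

```python
import sys
from pathlib import Path


def pick_environment_path(repo):
    """Choose the Python environment jedi should introspect for `repo`."""
    repo = Path(repo)
    # Prefer an in-repo virtualenv, checking `venv` before `.venv`.
    for candidate in (repo / "venv", repo / ".venv"):
        if candidate.is_dir():
            return str(candidate)
    # No in-repo venv: fall back to the interpreter running the indexer.
    return sys.prefix
```

With this ordering a freshly cloned SWE-bench repo, which has neither directory, still hands jedi a valid environment instead of a nonexistent path.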

Two production-quality fixes from the calibration run that crashed at 14/30 trajectories:

1. Resume support: skip (instance, cfg) pairs whose trajectory file already exists. Lets us recover from crashes/kills without re-running completed work (avoids ~$3 of wasted compute on this run).
2. Ignore pathological files at index time: sympy/integrals/rubi/rules contains auto-generated 3000-line files with hundreds of unresolvable symbols per line. jedi spends hours and never makes progress. Adding it to the default ignore list unblocks sympy-19040 (and other sympy instances) without affecting graph quality.

Also expanded default ignore set: __pycache__, build, dist, .tox, .eggs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
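
Resume support of this kind usually reduces to a filter over the work list. The sketch below assumes a trajectory layout of `<out_dir>/<cfg>/<instance>.traj.json`; that layout and the name `pending_pairs` are hypothetical, since the runner's actual file naming is not shown here.

```python
from pathlib import Path


def pending_pairs(instances, configs, out_dir):
    """Yield (instance, cfg) pairs that have no trajectory file yet."""
    out_dir = Path(out_dir)
    for instance in instances:
        for cfg in configs:
            # Hypothetical layout: <out_dir>/<cfg>/<instance>.traj.json
            traj = out_dir / cfg / f"{instance}.traj.json"
            if traj.exists():
                continue  # finished in an earlier run, so skip it on resume
            yield instance, cfg
```

One caveat of keying on file existence: a trajectory written partially before a crash would also be skipped, so writing to a temporary name and renaming on completion is a common companion to this pattern.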

DvirDukhan · 2026-05-28T04:28:51Z

Calibration complete (n=10 SWE-bench Verified, Sonnet 4.5)

After landing the analyzer fix (PR #686, KeyError defensive patch) and bench resume support:

| config | sum tokens | Δ vs baseline | mean tool calls | mean wall (s) |
|---|---|---|---|---|
| baseline | 10,846,303 | n/a | 56.7 | 324.6 |
| lsp | 8,346,479 | −23.0% | 50.4 | 293.8 |
| code_graph | 9,006,634 | −17.0% | 47.9 | 267.6 |

Median per-task: code_graph −22.5%, lsp −22.2%.

Resolve rate: 0/10 across all configs — expected for Sonnet at step-limit 75 on SWE-bench Verified. We're measuring token cost; accuracy needs Opus + higher step limit (next phase).

Per-instance code_graph vs baseline

| instance | baseline | code_graph | Δ |
|---|---|---|---|
| django-14771 | 1,178,608 | 669,425 | −43% |
| django-16661 | 1,071,494 | 1,093,370 | +2% |
| pydata-xarray-4629 | 1,097,868 | 223,976 | −80% |
| pytest-6202 | 1,080,968 | 362,228 | −67% |
| scikit-learn-10844 | 300,805 | 248,138 | −18% |
| sphinx-8035 | 1,339,158 | 1,635,271 | +22% |
| sphinx-9230 | 1,177,778 | 1,273,885 | +8% |
| sympy-12481 | 1,434,559 | 1,665,642 | +16% |
| sympy-19040 | 1,495,339 | 1,592,179 | +7% |
| sympy-20154 | 669,726 | 242,520 | −64% |
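
The summed-token row for code_graph can be reproduced from the per-instance figures. The snippet below copies the token counts from the table and recomputes the totals and the aggregate delta; `pct_delta` is just a helper name chosen for this sketch.

```python
# Token totals copied from the per-instance table: (baseline, code_graph).
TOKENS = {
    "django-14771": (1_178_608, 669_425),
    "django-16661": (1_071_494, 1_093_370),
    "pydata-xarray-4629": (1_097_868, 223_976),
    "pytest-6202": (1_080_968, 362_228),
    "scikit-learn-10844": (300_805, 248_138),
    "sphinx-8035": (1_339_158, 1_635_271),
    "sphinx-9230": (1_177_778, 1_273_885),
    "sympy-12481": (1_434_559, 1_665_642),
    "sympy-19040": (1_495_339, 1_592_179),
    "sympy-20154": (669_726, 242_520),
}


def pct_delta(baseline, other):
    """Relative change of `other` versus `baseline`, in percent."""
    return (other - baseline) / baseline * 100.0


baseline_sum = sum(b for b, _ in TOKENS.values())
code_graph_sum = sum(c for _, c in TOKENS.values())
aggregate = pct_delta(baseline_sum, code_graph_sum)

print(f"baseline={baseline_sum:,} code_graph={code_graph_sum:,} delta={aggregate:+.1f}%")
# Per-instance deltas, largest saving first.
for name, (b, c) in sorted(TOKENS.items(), key=lambda kv: pct_delta(*kv[1])):
    print(f"{name:20s} {pct_delta(b, c):+6.1f}%")
```

The two sums come out to 10,846,303 and 9,006,634, matching the summary row, and the aggregate delta rounds to −17.0%.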

Conclusions:

Token savings are real and reproducible (~22% median across both tool tracks).
Variance is enormous (−80% to +22%) — n=10 is the minimum for a signal but not enough for tight CIs. Next: n=30.
code_graph wins big on focused-API tasks (xarray, pytest, sympy-20154) and regresses when the graph surfaces many irrelevant nodes (sphinx, sympy-12481/19040).
The new bottleneck is code_graph indexing wall-time: sympy-12481 took ~50 min. Filed #687 ("Index second_pass: parallelize per-file LSP symbol resolution to cut wall-time on large repos") to parallelize second_pass.

Operational fixes that landed this run

source_analyzer.py venv-path fallback + WARN logging (PR #686, "fix(analyzer): resolve LSP CALLS edges on repos without a venv (#685)")
source_analyzer.py defensive self.files.get() in second_pass (KeyError on sympy/polys/distributedmodules.py while indexing sympy-12481)
mini_runner.py resume support (trajectory file = skip)
mini_runner.py default ignore list incl. rubi/rules
mini_runner.py httpx timeout 1800 → 7200 for slow indexes

In source_analyzer.second_pass, the list of files we iterate can include paths that first_pass did not add to self.files (e.g. parse errors, LSP-induced timeouts, or rare edge cases where a candidate file is present in the input list but never makes it into the files map). Previously this raised KeyError and aborted the entire index. Hit on sympy/polys/distributedmodules.py during bench calibration of sympy-12481. Skip with a WARN log instead so a single bad file no longer takes down the whole index.

Also bump mini_runner httpx timeout 1800s -> 7200s; observed the sympy-12481 index taking >30 min in the field, so the API server would finish indexing successfully after the runner had already given up.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
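
The defensive lookup described in this commit follows a familiar pattern, sketched below. The function signature is hypothetical: the real second_pass is a method on the analyzer, and the injected `resolve` callable merely stands in for its per-file LSP resolution.

```python
import logging

logger = logging.getLogger("source_analyzer_sketch")


def second_pass(paths, files, resolve):
    """Resolve symbols per file, skipping paths that first_pass never registered."""
    skipped = []
    for path in paths:
        # .get() instead of files[path]: a missing entry must not abort the index.
        entry = files.get(path)
        if entry is None:
            logger.warning("second_pass: %s missing from files map, skipping", path)
            skipped.append(path)
            continue
        resolve(path, entry)
    return skipped
```

Returning the skipped paths is an addition made for this sketch; it gives callers a cheap way to report how much of the repo was left unresolved.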

DvirDukhan and others added 2 commits May 27, 2026 14:51

DvirDukhan mentioned this pull request May 27, 2026

Python analyzer silently swallows multilspy errors → empty CALLS edges on large codebases #685

Open

dvirdukhan and others added 2 commits May 27, 2026 17:24

DvirDukhan mentioned this pull request May 27, 2026

Index second_pass: parallelize per-file LSP symbol resolution to cut wall-time on large repos #687

Open

DvirDukhan wants to merge 6 commits into dvirdukhan/fix-find-symbol-nested-name from dvirdukhan/bench-mcp-track
