bench: MCP-transport sibling of the code-graph track #684
Adds a second bench track that exercises code-graph through the
exact transport real-world agents use (Claude Code, Cursor, …) —
JSON-RPC over stdio to a spawned `cgraph-mcp` server — instead of
HTTP to the FastAPI service.
Files:
- bench/agents/code_graph_mcp_adapter.py — sync Python adapter that
spawns cgraph-mcp per call via the official MCP Python SDK. Knows
the 8-tool MCP surface (search_code, get_callers, get_callees,
get_dependencies, impact_analysis, find_path, index_repo, ask).
- bench/cli/cg-mcp + cg_mcp.py — bash-callable CLI shim mirroring
the existing `cg` shim. mini-swe-agent only does bash, so each
"tool" is one CLI invocation.
- bench/tools/code_graph_mcp/{tools.yaml,system_preamble.md} —
agent config for the MCP track. Mirrors code_graph; same Q2
decision to exclude `ask` (no nested LLM in the benchmarked tool
set).
- tests/bench/test_cg_mcp_adapter.py — 5 unit + 1 e2e test
(FalkorDB-gated AND MCP-server-gated so it skips cleanly until
the MCP stack lands on staging).
Heavy e2e validated against the api/ subgraph (~6.3k nodes) over
real stdio: search_code -> get_callers -> impact_analysis returned
expected payloads.
Depends on the MCP stack (PRs #675–#683) for cgraph-mcp itself.
Lands cleanly once that stack merges.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The adapter and shim from the previous commit were inert from the
runner's perspective — VALID_CONFIGS only knew baseline/lsp/code_graph.
This commit makes `--config code_graph_mcp` a first-class track.
Changes in bench/runners/mini_runner.py:
- VALID_CONFIGS gains "code_graph_mcp" (passes argparse + help string).
- New INSTANCE_TEMPLATE_CODE_GRAPH_MCP: mirrors the HTTP code_graph
template but tells the agent to call `cg-mcp` with $PROJECT_NAME +
$BRANCH, and to use impact_analysis before non-trivial edits.
- load_instance_template dispatches the new template.
- config_env("code_graph_mcp", ...) prepends venv bin to PATH (so
cgraph-mcp is callable from the agent's bash), passes FALKORDB_*
through to the spawned MCP server, and exports PROJECT_NAME +
BRANCH which the preamble references.
- New _ensure_indexed_mcp() mirrors _ensure_indexed but goes through
the bench MCP adapter instead of HTTP. Skip-if-present probe hits
FalkorDB's GRAPH.LIST directly (one trip, no MCP spawn).
- Per-instance loop now dispatches to _ensure_indexed_mcp for the
new config.
Smoke-verified that:
- VALID_CONFIGS == ('baseline','lsp','code_graph','code_graph_mcp')
- load_instance_template('code_graph_mcp') contains 'cg-mcp'
- config_env populates PROJECT_NAME/BRANCH/FALKORDB_HOST
Unit tests for the adapter still pass (5 passed, 1 skipped — heavy
e2e double-gated on FalkorDB + api.mcp.server availability).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
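The environment wiring described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in bench/runners/mini_runner.py: the function name `sketch_config_env` and its parameters are hypothetical, and the real `config_env` signature is not shown in this PR.

```python
import os


def sketch_config_env(host_env, venv_bin, project_name, branch):
    """Build the env for the agent's bash in the code_graph_mcp track (sketch)."""
    env = {}
    # Prepend the venv bin directory so `cgraph-mcp` resolves first on PATH.
    old_path = host_env.get("PATH", "")
    env["PATH"] = venv_bin + (os.pathsep + old_path if old_path else "")
    # Pass every FALKORDB_* variable through to the spawned MCP server.
    for key, value in host_env.items():
        if key.startswith("FALKORDB_"):
            env[key] = value
    # Export the names the system preamble references.
    env["PROJECT_NAME"] = project_name
    env["BRANCH"] = branch
    return env
```

Building the dict from scratch rather than copying the whole host environment is a choice made here for clarity; the runner may well start from `os.environ` instead.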
The mini_runner previously printed a [index] WARN line on analyze_folder
errors and continued. This meant SWE-bench instances whose path falls
outside ALLOWED_ANALYSIS_DIR (e.g. when the API server is started from a
sibling worktree) would silently run the agent against a missing
code-graph project. The agent's first cg call returns 400 'Missing
project ...', the agent falls back to grep/sed, and we get a token count
that looks bad for the code_graph track but actually reflects 'tool
unavailable'.
Two changes:
* analyze_folder errors and httpx exceptions now raise RuntimeError with
the offending path. This stops the run and surfaces the
ALLOWED_ANALYSIS_DIR misconfiguration immediately.
* analyze_folder timeout bumped 600s -> 1800s. The 600s default was
tight for sympy (~5 MB of Python, ~5000 functions) and caused a timeout
during indexing.
This was discovered while running the first real 3-way SWE-bench smoke.
With the fix and a corrected ALLOWED_ANALYSIS_DIR, the code_graph track
produces sensible numbers (-11% input vs baseline across the smoke
sample vs the prior bogus +4.7%).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
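The fail-fast behaviour can be illustrated with a small sketch. It assumes a hypothetical `analyze` callable that returns a `(status_code, body)` pair, standing in for the runner's real httpx call to analyze_folder, whose exact shape is not shown here.

```python
def index_or_fail(path, analyze):
    """Index `path`, raising instead of warning when indexing fails (sketch)."""
    try:
        # `analyze` is a stand-in for the HTTP call to analyze_folder.
        status, body = analyze(path)
    except Exception as exc:
        # Transport-level failure: stop the run and name the offending path.
        raise RuntimeError(f"analyze_folder failed for {path}: {exc}") from exc
    if status >= 400:
        # Server-side rejection, e.g. a path outside ALLOWED_ANALYSIS_DIR.
        raise RuntimeError(f"analyze_folder returned {status} for {path}: {body}")
```

Raising here means a misconfigured run stops before any agent tokens are spent, rather than producing a trajectory that only looks like a code_graph result.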
The Python analyzer hardcoded `environment_path={path}/venv` when starting
jedi-language-server via multilspy. When the repo had no venv (the common
case for cloned codebases like sphinx, sympy, anything from SWE-bench),
jedi raised `InvalidPythonEnvironment` on every `request_definition()`
call. analyzer.resolve() then swallowed the exception silently and the
indexer produced a graph with DEFINES edges only — zero CALLS, zero
EXTENDS. Benchmark validation showed sphinx (5K functions) and sympy
(41K functions) had no resolved cross-references at all.
Fix:
- source_analyzer.py: prefer {repo}/venv, then {repo}/.venv, then fall
back to the host interpreter's environment (sys.executable's prefix)
so jedi always has a valid Python to introspect.
- analyzer.py: log resolve() failures at WARN with file/line context
instead of swallowing them silently, so the next regression is loud.
Verified: re-indexed sphinx-doc/sphinx-9230 with the fix:
DEFINES: 5640, CALLS: 4931, EXTENDS: 484 (was DEFINES-only).
Fixes #685.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
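The lookup order in the fix can be sketched as below. The function name `pick_environment_path` is hypothetical and the actual code in source_analyzer.py may be structured differently; only the order (venv, then .venv, then the host interpreter's prefix) comes from the commit message.

```python
import sys
from pathlib import Path


def pick_environment_path(repo_root):
    """Return a Python environment directory for jedi to introspect (sketch)."""
    repo_root = Path(repo_root)
    # Prefer a virtualenv inside the repo: venv first, then .venv.
    for name in ("venv", ".venv"):
        candidate = repo_root / name
        if candidate.is_dir():
            return str(candidate)
    # Fall back to the prefix of the interpreter running the analyzer,
    # so there is always a valid environment to hand over.
    return sys.prefix
```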
Two production-quality fixes from the calibration run that crashed at
14/30 trajectories:
1. Resume support: skip (instance, cfg) pairs whose trajectory file
already exists. Lets us recover from crashes/kills without re-running
completed work (avoids ~$3 of wasted compute on this run).
2. Ignore pathological files at index time:
sympy/integrals/rubi/rules contains auto-generated 3000-line files with
hundreds of unresolvable symbols per line. jedi spends hours and never
makes progress. Adding it to the default ignore list unblocks
sympy-19040 (and other sympy instances) without affecting graph quality.
Also expanded default ignore set: __pycache__, build, dist, .tox, .eggs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
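A minimal sketch of the resume check follows. The on-disk layout `<traj_dir>/<cfg>/<instance>.traj.json` and the name `pending_pairs` are assumptions made for illustration; the runner's real trajectory paths are not shown in this PR.

```python
from pathlib import Path


def pending_pairs(instances, configs, traj_dir):
    """List the (instance, cfg) pairs that still need to run (sketch)."""
    traj_dir = Path(traj_dir)
    todo = []
    for instance in instances:
        for cfg in configs:
            # Hypothetical layout: one trajectory file per (cfg, instance).
            if (traj_dir / cfg / f"{instance}.traj.json").exists():
                continue  # already completed in an earlier run
            todo.append((instance, cfg))
    return todo
```

Checking only for file existence treats a partially written trajectory as complete; a stricter variant could also validate the file contents before skipping.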
Calibration complete (n=10 SWE-bench Verified, Sonnet 4.5)
After landing the analyzer fix (PR #686, KeyError defensive patch) and bench resume support:
Median per-task: code_graph −22.5%, lsp −22.2%.
Resolve rate: 0/10 across all configs, which is expected for Sonnet at step-limit 75 on SWE-bench Verified. We're measuring token cost; accuracy needs Opus + higher step limit (next phase).
Per-instance code_graph vs baseline
Conclusions:
Operational fixes that landed this run
Next
In source_analyzer.second_pass, the list of files we iterate can include
paths that first_pass did not add to self.files (e.g. parse errors,
LSP-induced timeouts, or rare edge cases where a candidate file is
present in the input list but never makes it into the files map).
Previously this raised KeyError and aborted the entire index. Hit on
sympy/polys/distributedmodules.py during bench calibration of
sympy-12481.
Skip with a WARN log instead so a single bad file no longer takes down
the whole index.
Also bump mini_runner httpx timeout 1800s -> 7200s; observed sympy-12481
index taking >30 min in the field, so the runner previously gave up
early even though the API server went on to index successfully.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
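The defensive skip can be sketched as below. The names `second_pass_sketch`, `candidates`, `files`, and `resolve` are illustrative stand-ins; the real second_pass is a method that reads self.files and does considerably more work per file.

```python
import logging

logger = logging.getLogger("source_analyzer_sketch")


def second_pass_sketch(candidates, files, resolve):
    """Resolve each candidate that first_pass registered; skip the rest (sketch)."""
    processed = []
    for path in candidates:
        entry = files.get(path)
        if entry is None:
            # Indexing files[path] directly would raise KeyError and abort
            # the whole index, so log a warning and move on instead.
            logger.warning("second_pass: %s not registered by first_pass; skipping", path)
            continue
        resolve(path, entry)
        processed.append(path)
    return processed
```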
bench: MCP-transport sibling of the code-graph track
Adds a second bench track that exercises code-graph through the
exact transport real-world agents use (Claude Code, Cursor, Cline…):
JSON-RPC over stdio to a spawned `cgraph-mcp` server, instead of HTTP
to the FastAPI service.
Same backend graph, different transport. This lets us answer
"does the MCP integration regress accuracy or token usage vs the
direct HTTP path?" on the same SWE-bench tasks.
What's in
Adapter & shim
- `bench/agents/code_graph_mcp_adapter.py`: sync Python adapter that spawns `cgraph-mcp` per call via the official MCP Python SDK. Covers all 8 MCP tools.
- `bench/cli/cg-mcp` + `cg_mcp.py`: bash-callable CLI shim mirroring the existing `cg` shim (mini-swe-agent only does bash, so each "tool" is one CLI invocation).
- `bench/tools/code_graph_mcp/{tools.yaml,system_preamble.md}`: agent config; excludes `ask` per Q2 (no nested LLM in the benchmarked tool set).
Runner integration (second commit)
- `VALID_CONFIGS` gains `code_graph_mcp` → `--config code_graph_mcp` is now a first-class track.
- `INSTANCE_TEMPLATE_CODE_GRAPH_MCP`: mirrors the HTTP template but tells the agent to call `cg-mcp` with `$PROJECT_NAME` + `$BRANCH`, and to use `impact_analysis` before non-trivial edits.
- `config_env("code_graph_mcp", ...)` prepends the venv `bin` to `PATH` (so `cgraph-mcp` is callable from the agent's bash), passes `FALKORDB_*` through, and exports `PROJECT_NAME` / `BRANCH`.
- `_ensure_indexed_mcp()` mirrors `_ensure_indexed` but goes through the MCP adapter. Skip-if-present probe hits FalkorDB's `GRAPH.LIST` directly (one trip, no MCP spawn).
Tests
- `tests/bench/test_cg_mcp_adapter.py`: 5 unit + 1 e2e. The e2e is double-gated on FalkorDB being reachable AND `api.mcp.server` being importable, so it skips cleanly on branches that pre-date the MCP stack.
Validation
Heavy e2e validated locally against the `api/` subgraph
(~6.3k nodes / ~6.2k edges) over real stdio:
Runner wiring smoke-verified:
Stacking
Sits on `dvirdukhan/fix-find-symbol-nested-name` (the bench harness
PR). The end-to-end test needs the MCP server itself, which lands in
PRs #675 → #683. Until both stacks merge to `staging`, the e2e
self-skips and the 5 unit tests cover the adapter logic.
Design notes
- `cg-mcp …` invocation. Keeps the bash shim trivially safe (one process per call, no shared state) and matches how real agents use MCP servers today. A future optimization could persist the server via a side-channel daemon if benchmarks show overhead matters.
- `structuredContent.result` per the MCP spec; `_extract` handles both that and the text-chunk JSON shape so the bash agent always sees a single clean JSON document.
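The two result shapes mentioned in the last note can be sketched as follows. This is an assumption-laden illustration rather than the adapter's `_extract`: it works on a plain dict in the shape of an MCP tool-call result, and the name `extract_sketch` and the choice to concatenate text chunks are hypothetical.

```python
import json


def extract_sketch(tool_result):
    """Return one JSON-compatible value from an MCP tool-call result (sketch)."""
    # Shape 1: structured output carried under structuredContent.result.
    structured = tool_result.get("structuredContent")
    if isinstance(structured, dict) and "result" in structured:
        return structured["result"]
    # Shape 2: JSON serialized into one or more text content chunks.
    text = "".join(
        chunk.get("text", "")
        for chunk in tool_result.get("content", [])
        if chunk.get("type") == "text"
    )
    # Raises json.JSONDecodeError if the text is empty or not valid JSON.
    return json.loads(text)
```

The CLI shim could then print `json.dumps(extract_sketch(result))` so the bash agent receives the same document regardless of which shape the server produced.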