Add durable background jobs for memory and scanner work #180

Open

massy-o wants to merge 2 commits into XortexAI:main from massy-o:durable-memory-jobs

Conversation


@massy-o massy-o commented May 14, 2026

Refs #162

Summary

  • add a MongoDB-backed durable job queue with idempotency keys, retries, timeouts, stale lease recovery, and dead-letter records
  • enqueue /v1/memory/ingest and /v1/memory/batch-ingest work and expose /v1/jobs/{job_id} for job status/results
  • route scanner start/resume work through the durable queue while preserving the existing scanner job/status records and falling back to in-process tasks if the job store is unavailable
  • add job worker settings for polling, timeout, retry, backoff, and lease duration

This is a focused Phase 1 implementation for the task-queue/status foundation described in the issue discussion.

Validation

  • python3 -m py_compile src/jobs.py src/api/routes/jobs.py src/api/routes/memory.py src/api/routes/scanner.py src/api/app.py src/api/schemas.py
  • git diff --check
  • uv run --with pytest --with pytest-asyncio --with fastapi --with pydantic --with pydantic-settings --with python-jose --with pymongo --with httpx --with beautifulsoup4 pytest tests/api/test_dependencies_and_routes.py -q -> 4 passed


@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a durable background job system using MongoDB to handle long-running tasks in the memory and scanner modules. It adds a new job status endpoint, a worker for asynchronous execution, and relevant configuration settings. The review feedback primarily highlights the need to offload synchronous database operations to worker threads using asyncio.to_thread to avoid blocking the FastAPI event loop. Additionally, improvements were suggested for the stability of idempotency keys, the completeness of dead-letter logs, and the refactoring of duplicated user identification logic.

Comment thread src/api/routes/jobs.py
Comment thread src/api/routes/jobs.py Outdated
Comment thread src/api/routes/memory.py Outdated
Comment thread src/api/routes/memory.py Outdated
Comment thread src/api/routes/scanner.py Outdated
Comment thread src/jobs.py
Comment thread src/jobs.py Outdated
Comment thread src/jobs.py Outdated
Comment thread src/api/routes/jobs.py Outdated
@ishaanxgupta
Member

Hi @massy-o, thank you for the contribution. The PR looks good to me; my main concern is whether we should make these changes in the /v1/memory routes, or instead bump the version and make them in /v2/memory routes, leaving /v1/memory as it is. What do you think? Also, let me know your thoughts on Celery & Redis. Did you try out the ingest endpoint after the job-tracking change and check the latency? Has it increased?

@massy-o
Author

massy-o commented May 16, 2026

Thanks @ishaanxgupta, that is a fair concern.

On the API versioning question: I agree that changing the response contract of /v1/memory/ingest is the riskiest part of this PR. Since the durable job path returns an enqueue/status response instead of the previous synchronous ingest result, my preference would be to keep /v1/memory backward-compatible and expose the async/job-tracked behavior under /v2/memory (or behind an explicit opt-in flag/header if you prefer a smaller surface). I am happy to adjust the PR in that direction so existing /v1 clients do not see a surprise contract change.

On Celery + Redis: I think that is a good production direction, especially once we want multiple worker processes, clearer operational controls, scheduling, and mature retry/dead-letter behavior. I kept this PR on the existing MongoDB dependency to make the first step smaller and avoid introducing Redis/Celery as new required infrastructure. The job store/worker boundary should also make it possible to swap the backend later without changing the route-level API much.
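The "swap the backend later" point could be expressed as a small backend-agnostic interface that both the current Mongo store and a future Celery/Redis adapter implement. This is a hypothetical sketch, not the PR's actual boundary — the `JobQueue` protocol and method names are assumptions:

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class JobQueue(Protocol):
    """Hypothetical backend-agnostic boundary for the job store/worker layer."""

    def enqueue(self, kind: str, payload: dict[str, Any]) -> str: ...
    def status(self, job_id: str) -> dict[str, Any]: ...

class InMemoryJobQueue:
    """Trivial implementation; a Mongo- or Celery-backed class would
    satisfy the same protocol without route-level changes."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict[str, Any]] = {}

    def enqueue(self, kind: str, payload: dict[str, Any]) -> str:
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = {"kind": kind, "payload": payload, "status": "queued"}
        return job_id

    def status(self, job_id: str) -> dict[str, Any]:
        return self._jobs[job_id]
```

Routes would depend only on `JobQueue`, so replacing the Mongo-backed queue with a Celery/Redis adapter in a later PR becomes a wiring change rather than an API change.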

On ingest latency: I have not run a production-like benchmark yet, so I do not want to overstate the numbers. The intended effect is that the request path only persists the job record and returns the status URL, while the expensive ingest pipeline runs out of band. So the interactive request latency should generally decrease versus synchronous ingest, with the tradeoff that completion is now observed via polling. There is a small extra cost for the Mongo job insert/status tracking, but that should be much smaller than the embedding/judge/weaver work. If useful, I can add a lightweight timing note/test or run a before/after local measurement as part of the PR update.
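A before/after measurement could be sketched roughly like this. Everything here is simulated — the 50 ms sleep is an arbitrary stand-in for the embedding/judge/weaver work, not a measured number, and `sync_ingest`/`enqueue_ingest` are hypothetical names, not the PR's functions:

```python
import time

def sync_ingest(doc: str) -> dict:
    # Stand-in for the old synchronous pipeline (embedding/judge/weaver work).
    time.sleep(0.05)   # arbitrary simulated processing cost
    return {"status": "ingested", "doc": doc}

def enqueue_ingest(doc: str, queue: list) -> dict:
    # Durable-job path: only persist the job record and return a status URL.
    queue.append(doc)  # stand-in for the Mongo job insert
    return {"status": "queued", "status_url": f"/v1/jobs/{len(queue)}"}

def measure(fn, *args) -> float:
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

if __name__ == "__main__":
    queue: list = []
    sync_ms = measure(sync_ingest, "doc-1") * 1000
    enqueue_ms = measure(enqueue_ingest, "doc-2", queue) * 1000
    print(f"sync: {sync_ms:.1f} ms, enqueue-only: {enqueue_ms:.1f} ms")
```

The interesting real-world numbers would be the request-path latency of the enqueue route (dominated by the Mongo insert) versus the old synchronous ingest, plus the end-to-end time to job completion via polling.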

So my proposed next step is to move the async, job-tracked ingest behavior to /v2/memory, keep /v1/memory synchronous and backward-compatible, and leave the current Mongo-backed queue as the minimal backend unless you want this PR to switch directly to Celery/Redis.

@ishaanxgupta
Member

@massy-o yes, that would be great. Let's make these changes in /v2/memory and keep /v1/memory as it was. We can do the Celery or Redis integration in the next PR.
Could you please do a before/after local measurement and share the results here in the comments?
