
resolved production monitor and stabilization problem #29

Open
henry0816191 wants to merge 1 commit into develop from feat/production-monitoring

Conversation


@henry0816191 henry0816191 commented May 14, 2026

Summary

New / updated files

  • src/paperscout/errors.py – FailureCategory enum (RATE_LIMIT, NETWORK, TIMEOUT, UNKNOWN).
  • src/paperscout/config.py – ops_alert_channel, mq_backpressure_threshold (default 100, ge=1).
  • src/paperscout/monitor.py – httpx exception handling in run_forever with failure_category= (using .value so logs show TIMEOUT, not FailureCategory.TIMEOUT); _last_successful_poll, _last_probe_stats, _last_ops_alert; optional ops_alert_fn; ops alert when stale > 2 × poll_interval with debounce; success bookkeeping in poll_once (including seed).
  • src/paperscout/sources.py – WG21Index._download: HTTPStatusError (429 vs other), TimeoutException, and generic (HTTPError, ValueError) with structured failure_category; ISOProber._probe_one: PROBE-ERR with failure_category after retries.
  • src/paperscout/scout.py – MessageQueue.depth(), backpressure warning in enqueue.
  • src/paperscout/health.py – extra_fields_fn on start_health_server, merged into JSON, errors from callback caught so /health still returns 200.
  • src/paperscout/main.py – _ops_alert, _pool_status (guarded AttributeError), _extra_health_fields, Scheduler(..., ops_alert_fn=_ops_alert), start_health_server(..., extra_fields_fn=...).
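Based on the description above, the failure taxonomy and the `.value` logging detail might look roughly like this. This is a minimal sketch, not the PR's actual code; `log_fetch_failure` is a hypothetical helper name introduced only for illustration:

```python
import enum
import logging

log = logging.getLogger("paperscout")


class FailureCategory(enum.Enum):
    RATE_LIMIT = "RATE_LIMIT"
    NETWORK = "NETWORK"
    TIMEOUT = "TIMEOUT"
    UNKNOWN = "UNKNOWN"


def log_fetch_failure(category: FailureCategory) -> None:
    # Log the enum's .value so the structured field reads
    # "failure_category=TIMEOUT" rather than
    # "failure_category=FailureCategory.TIMEOUT".
    log.warning("INDEX-FETCH failed failure_category=%s", category.value)
```

Passing `category.value` (a plain string) instead of the enum member is what keeps the log lines grep-friendly.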

Tests

  • tests/test_health.py – health_url_with_extras fixture + test_health_extra_fields_merged.
  • tests/test_monitor.py – prober._stats = {} in _make_scheduler; test_run_forever_emits_timeout_failure_category, test_run_forever_emits_network_failure_category.
  • tests/test_sources.py – test_download_http_error asserts NETWORK; test_download_http_status_429_emits_rate_limit, test_download_http_status_500_emits_network.
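The tests above assert on structured `failure_category=` fields in log output. As an illustration of that pattern using only the standard library (the real tests presumably use pytest fixtures; `ListHandler` here is a hypothetical stand-in for a capture fixture like caplog):

```python
import logging


class ListHandler(logging.Handler):
    """Minimal capture handler: records formatted log messages in a list."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


log = logging.getLogger("paperscout.demo")
log.setLevel(logging.WARNING)
handler = ListHandler()
log.addHandler(handler)

# Simulate the kind of structured line the tests assert on.
log.warning("PROBE-ERR failure_category=%s", "TIMEOUT")
```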

Ops / env

Set OPS_ALERT_CHANNEL to a Slack channel ID so stale-poll alerts are sent there (empty = disabled).

Related issues

Closes #28

Summary by CodeRabbit

  • New Features

    • Operations alert callback system for proactive monitoring notifications
    • Enhanced health status reporting with probe hit rate, queue depth, and database pool metrics
    • Message queue depth monitoring with backpressure warnings
  • Tests

    • Added coverage for health metrics expansion
    • Added error categorization tests for network, timeout, and rate-limit scenarios
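The queue-depth monitoring summarized above can be sketched as follows. This is a minimal stand-in under stated assumptions: the real MessageQueue in src/paperscout/scout.py reads its threshold from Settings (mq_backpressure_threshold, default 100) and presumably wraps more logic:

```python
import collections
import logging

log = logging.getLogger("paperscout.queue")


class MessageQueue:
    """Sketch of depth() plus the backpressure warning on enqueue."""

    def __init__(self, backpressure_threshold: int = 100) -> None:
        self._items: collections.deque = collections.deque()
        self._threshold = backpressure_threshold

    def depth(self) -> int:
        return len(self._items)

    def enqueue(self, item) -> None:
        self._items.append(item)
        # Warn (but never drop) when the queue backs up past the threshold.
        if self.depth() > self._threshold:
            log.warning(
                "MQ-BACKPRESSURE depth=%d threshold=%d",
                self.depth(), self._threshold,
            )
```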


@henry0816191 henry0816191 self-assigned this May 14, 2026
@henry0816191 henry0816191 requested a review from wpak-ai as a code owner May 14, 2026 16:16
@henry0816191 henry0816191 removed the request for review from wpak-ai May 14, 2026 16:17

coderabbitai Bot commented May 14, 2026

📝 Walkthrough


The PR implements production monitoring infrastructure for the PaperScout scheduler. It introduces a failure category taxonomy, extends the health endpoint with computed metrics, adds message queue backpressure tracking, implements scheduler staleness detection with ops alerts, and applies structured error categorization across polling and probing code paths.

Changes

Production Monitoring Setup

  • Failure Category Taxonomy – src/paperscout/errors.py: FailureCategory enum defines RATE_LIMIT, NETWORK, TIMEOUT, and UNKNOWN classifications for structured error logging.
  • Observability Configuration – src/paperscout/config.py: Settings adds ops_alert_channel (Slack channel for ops alerts) and mq_backpressure_threshold (queue depth warning threshold).
  • Health Endpoint Extensibility – src/paperscout/health.py, tests/test_health.py: _HealthHandler and start_health_server now accept an extra_fields_fn callback to inject computed metrics (last successful poll, probe hit rate, queue depth, DB pool status). A new fixture and test verify that merged extra fields are present in the /health response.
  • Message Queue Backpressure – src/paperscout/scout.py: MessageQueue adds a depth() method and emits warning logs when queue depth exceeds the configured threshold during enqueue.
  • Source Error Categorization – src/paperscout/sources.py, tests/test_sources.py: Index download and probe error handlers now emit structured INDEX-FETCH and PROBE-ERR logs with failure categories mapped from httpx exceptions (timeout, rate-limit 429, network, unknown). Tests verify correct category emission for network failures, HTTP 429, and HTTP 500 errors.
  • Scheduler Monitoring and Staleness Detection – src/paperscout/monitor.py, tests/test_monitor.py: Scheduler gains an ops_alert_fn callback parameter, tracks _last_successful_poll and _last_probe_stats, and categorizes httpx exceptions in run_forever. Staleness detection triggers ops alerts when no successful poll has occurred within 2× the poll interval, with interval-based rate limiting. Tests verify that failure_category=TIMEOUT and failure_category=NETWORK are logged on matching exceptions.
  • Application Integration – src/paperscout/__main__.py: Defines an _ops_alert helper to enqueue formatted Slack alerts, _pool_status to gather DB pool metrics, and _extra_health_fields to compute health payload fields. Wires ops_alert_fn=_ops_alert into the scheduler and extra_fields_fn=_extra_health_fields into health server startup.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • cppalliance/paperscout#19: Modifies src/paperscout/health.py's start_health_server signature and health endpoint initialization, affecting the same startup and handler code paths.

Suggested labels

bug

Suggested reviewers

  • wpak-ai

Poem

🐰 Whisker-twitching ode to observability

From chaos we build structure bright,
With failure types now categorized right,
The health endpoint glows with extra pride,
Queue depths and DB pools no longer hide,
Production runs smooth—or alerts ring loud! 🔔

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage – ⚠️ Warning: Docstring coverage is 25.71%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check – ❓ Inconclusive: The title is vague and generic; the phrase "production monitor and stabilization problem" doesn't clearly convey the specific changes made. Resolution: use a more specific title that describes the main change, such as "Add production monitoring with failure categorization and health endpoint enhancements".

✅ Passed checks (3 passed)

  • Description check – ✅ Passed: The pull request description is comprehensive, detailing all files changed, new functionality, test additions, and configuration requirements with clear structure.
  • Linked Issues check – ✅ Passed: All acceptance criteria from issue #28 are met: structured health endpoint fields, failure category logging, Slack ops alerts on stale polls, queue backpressure warnings, and test coverage.
  • Out of Scope Changes check – ✅ Passed: All code changes directly support the production monitoring and stabilization objectives from issue #28; no unrelated modifications detected.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_health.py (1)

106-114: ⚡ Quick win

Add a test for failing extra_fields_fn fallback behavior.

Please add a case where extra_fields_fn raises, then assert /health still returns 200 with base fields. This is a key resilience branch introduced in this PR.
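The resilience branch under discussion (callback errors caught so /health still returns 200) can be sketched as follows. `build_health_payload` is a hypothetical free function used for illustration; the real behavior is implemented inside _HealthHandler in src/paperscout/health.py:

```python
import json
import logging

log = logging.getLogger("paperscout.health")


def build_health_payload(base: dict, extra_fields_fn=None) -> bytes:
    # Merge callback-provided fields into the base payload; swallow
    # callback errors so the endpoint can still serve 200 with the
    # base fields alone.
    payload = dict(base)
    if extra_fields_fn is not None:
        try:
            payload.update(extra_fields_fn())
        except Exception:
            log.exception("HEALTH-EXTRA-FIELDS-ERROR")
    return json.dumps(payload).encode()
```

A test along the lines the reviewer requests would pass a raising callback and assert the base fields survive.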

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_health.py` around lines 106 - 114, Add a new test (or extend
test_health_extra_fields_merged) that registers an extra_fields_fn which raises
an exception and then calls the /health endpoint to verify resilience: assert
the HTTP status is 200 and only the base health fields (e.g. "version" and
"last_successful_poll") are present and correctly valued; reference the existing
test helper/fixture used by tests (health_url_with_extras or create a similar
fixture that wires a failing extra_fields_fn) and ensure the failing
extra_fields_fn does not propagate, so /health returns base fields and a 200
response.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a063c352-50de-4861-81b6-e0bee7bc4368

📥 Commits

Reviewing files that changed from the base of the PR and between 87a5475 and fc92c2b.

📒 Files selected for processing (10)
  • src/paperscout/__main__.py
  • src/paperscout/config.py
  • src/paperscout/errors.py
  • src/paperscout/health.py
  • src/paperscout/monitor.py
  • src/paperscout/scout.py
  • src/paperscout/sources.py
  • tests/test_health.py
  • tests/test_monitor.py
  • tests/test_sources.py

Comment thread src/paperscout/monitor.py
Comment on lines +115 to +117
self._last_successful_poll: float | None = None
self._last_probe_stats: dict[str, int] = {}
self._last_ops_alert: float | None = None

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix stale-alert gating and guard ops_alert_fn failures.

On Line 323, requiring self._last_successful_poll is not None suppresses alerts when the scheduler has never had a successful poll. Also, on Line 330, an exception from ops_alert_fn can crash run_forever().

💡 Proposed fix
@@
         self._last_successful_poll: float | None = None
         self._last_probe_stats: dict[str, int] = {}
         self._last_ops_alert: float | None = None
+        self._started_at: float = time.time()
@@
-            if self.ops_alert_fn and self._last_successful_poll is not None:
-                stale = time.time() - self._last_successful_poll
+            if self.ops_alert_fn:
+                last_ok = self._last_successful_poll or self._started_at
+                stale = time.time() - last_ok
                 alert_threshold = 2 * interval
                 now_m = time.monotonic()
                 if stale > alert_threshold and (
                     self._last_ops_alert is None or (now_m - self._last_ops_alert) > interval
                 ):
-                    self.ops_alert_fn(
-                        f"No successful poll in {stale / 60:.0f}min "
-                        f"(threshold={2 * self.cfg.poll_interval_minutes}min)"
-                    )
-                    self._last_ops_alert = now_m
+                    try:
+                        self.ops_alert_fn(
+                            f"No successful poll in {stale / 60:.0f}min "
+                            f"(threshold={2 * self.cfg.poll_interval_minutes}min)"
+                        )
+                        self._last_ops_alert = now_m
+                    except Exception:
+                        log.exception("OPS-ALERT-ERROR")

Also applies to: 323-334

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/paperscout/monitor.py` around lines 115 - 117, The stale-alert gating
currently blocks alerts when self._last_successful_poll is None and also calls
ops_alert_fn without protection; update the gating so alerts are allowed if
there has never been a successful poll OR the last_successful_poll is older than
the stale threshold (i.e., treat None as "stale"), and wrap the ops_alert_fn
invocation in a try/except inside run_forever() to catch/log any exception and
prevent crashing while still updating/using self._last_ops_alert appropriately;
reference self._last_successful_poll, self._last_ops_alert, ops_alert_fn, and
run_forever() when making the changes.



Development

Successfully merging this pull request may close these issues.

Production Monitoring Setup + Stabilization

1 participant