
resolved production monitor and stabilization problem #29

Open
henry0816191 wants to merge 1 commit into develop from feat/production-monitoring

Conversation


@henry0816191 henry0816191 commented May 14, 2026

Summary

New / updated files

  • src/paperscout/errors.py – FailureCategory enum (RATE_LIMIT, NETWORK, TIMEOUT, UNKNOWN).
  • src/paperscout/config.py – ops_alert_channel, mq_backpressure_threshold (default 100, ge=1).
  • src/paperscout/monitor.py – httpx exception handling in run_forever with failure_category= (using .value so logs show TIMEOUT, not FailureCategory.TIMEOUT); _last_successful_poll, _last_probe_stats, _last_ops_alert; optional ops_alert_fn; ops alert when stale > 2 × poll_interval with debounce; success bookkeeping in poll_once (including seed).
  • src/paperscout/sources.py – WG21Index._download: HTTPStatusError (429 vs other), TimeoutException, and generic (HTTPError, ValueError) with structured failure_category; ISOProber._probe_one: PROBE-ERR with failure_category after retries.
  • src/paperscout/scout.py – MessageQueue.depth(), backpressure warning in enqueue.
  • src/paperscout/health.py – extra_fields_fn on start_health_server, merged into JSON, errors from callback caught so /health still returns 200.
  • src/paperscout/main.py – _ops_alert, _pool_status (guarded AttributeError), _extra_health_fields, Scheduler(..., ops_alert_fn=_ops_alert), start_health_server(..., extra_fields_fn=...).
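Based on the description above, the failure taxonomy and the `.value` logging detail might look roughly like this. This is a minimal sketch, not the PR's actual code; `log_fetch_failure` is a hypothetical helper name introduced only for illustration:

```python
import enum
import logging

log = logging.getLogger("paperscout")


class FailureCategory(enum.Enum):
    RATE_LIMIT = "RATE_LIMIT"
    NETWORK = "NETWORK"
    TIMEOUT = "TIMEOUT"
    UNKNOWN = "UNKNOWN"


def log_fetch_failure(category: FailureCategory) -> None:
    # Log the enum's .value so the structured field reads
    # "failure_category=TIMEOUT" rather than
    # "failure_category=FailureCategory.TIMEOUT".
    log.warning("INDEX-FETCH failed failure_category=%s", category.value)
```

Passing `category.value` (a plain string) instead of the enum member is what keeps the log lines grep-friendly.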

Tests

  • tests/test_health.py – health_url_with_extras fixture + test_health_extra_fields_merged.
  • tests/test_monitor.py – prober._stats = {} in _make_scheduler; test_run_forever_emits_timeout_failure_category, test_run_forever_emits_network_failure_category.
  • tests/test_sources.py – test_download_http_error asserts NETWORK; test_download_http_status_429_emits_rate_limit, test_download_http_status_500_emits_network.
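The tests above assert on structured `failure_category=` fields in log output. As an illustration of that pattern using only the standard library (the real tests presumably use pytest fixtures; `ListHandler` here is a hypothetical stand-in for a capture fixture like caplog):

```python
import logging


class ListHandler(logging.Handler):
    """Minimal capture handler: records formatted log messages in a list."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


log = logging.getLogger("paperscout.demo")
log.setLevel(logging.WARNING)
handler = ListHandler()
log.addHandler(handler)

# Simulate the kind of structured line the tests assert on.
log.warning("PROBE-ERR failure_category=%s", "TIMEOUT")
```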

Ops / env

Set OPS_ALERT_CHANNEL to a Slack channel ID so stale-poll alerts are sent there (empty = disabled).

Related issues

Closes #28

Summary by CodeRabbit

  • New Features

    • Operations alert callback system for proactive monitoring notifications
    • Enhanced health status reporting with probe hit rate, queue depth, and database pool metrics
    • Message queue depth monitoring with backpressure warnings
  • Tests

    • Added coverage for health metrics expansion
    • Added error categorization tests for network, timeout, and rate-limit scenarios
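The queue-depth monitoring summarized above can be sketched as follows. This is a minimal stand-in under stated assumptions: the real MessageQueue in src/paperscout/scout.py reads its threshold from Settings (mq_backpressure_threshold, default 100) and presumably wraps more logic:

```python
import collections
import logging

log = logging.getLogger("paperscout.queue")


class MessageQueue:
    """Sketch of depth() plus the backpressure warning on enqueue."""

    def __init__(self, backpressure_threshold: int = 100) -> None:
        self._items: collections.deque = collections.deque()
        self._threshold = backpressure_threshold

    def depth(self) -> int:
        return len(self._items)

    def enqueue(self, item) -> None:
        self._items.append(item)
        # Warn (but never drop) when the queue backs up past the threshold.
        if self.depth() > self._threshold:
            log.warning(
                "MQ-BACKPRESSURE depth=%d threshold=%d",
                self.depth(), self._threshold,
            )
```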


@henry0816191 henry0816191 self-assigned this May 14, 2026
@henry0816191 henry0816191 requested a review from wpak-ai as a code owner May 14, 2026 16:16
@henry0816191 henry0816191 removed the request for review from wpak-ai May 14, 2026 16:17

coderabbitai Bot commented May 14, 2026

📝 Walkthrough


The PR implements production monitoring infrastructure for the PaperScout scheduler. It introduces a failure category taxonomy, extends the health endpoint with computed metrics, adds message queue backpressure tracking, implements scheduler staleness detection with ops alerts, and applies structured error categorization across polling and probing code paths.

Changes

Production Monitoring Setup

  • Failure Category Taxonomy – src/paperscout/errors.py: FailureCategory enum defines RATE_LIMIT, NETWORK, TIMEOUT, and UNKNOWN classifications for structured error logging.
  • Observability Configuration – src/paperscout/config.py: Settings adds ops_alert_channel (Slack channel for ops alerts) and mq_backpressure_threshold (queue depth warning threshold).
  • Health Endpoint Extensibility – src/paperscout/health.py, tests/test_health.py: _HealthHandler and start_health_server now accept an extra_fields_fn callback to inject computed metrics (last successful poll, probe hit rate, queue depth, DB pool status). A new fixture and test verify that merged extra fields are present in the /health response.
  • Message Queue Backpressure – src/paperscout/scout.py: MessageQueue adds a depth() method and emits warning logs when queue depth exceeds the configured threshold during enqueue.
  • Source Error Categorization – src/paperscout/sources.py, tests/test_sources.py: Index download and probe error handlers now emit structured INDEX-FETCH and PROBE-ERR logs with failure categories mapped from httpx exceptions (timeout, rate-limit 429, network, unknown). Tests verify correct category emission for network failures, HTTP 429, and HTTP 500 errors.
  • Scheduler Monitoring and Staleness Detection – src/paperscout/monitor.py, tests/test_monitor.py: Scheduler gains an ops_alert_fn callback parameter, tracks _last_successful_poll and _last_probe_stats, and categorizes httpx exceptions in run_forever. Staleness detection triggers ops alerts when no successful poll has occurred within 2× the poll interval, with interval-based rate limiting. Tests verify that failure_category=TIMEOUT and failure_category=NETWORK are logged on matching exceptions.
  • Application Integration – src/paperscout/__main__.py: Defines an _ops_alert helper to enqueue formatted Slack alerts, _pool_status to gather DB pool metrics, and _extra_health_fields to compute health payload fields. Wires ops_alert_fn=_ops_alert into the scheduler and extra_fields_fn=_extra_health_fields into health server startup.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • cppalliance/paperscout#19: Modifies src/paperscout/health.py's start_health_server signature and health endpoint initialization, affecting the same startup and handler code paths.

Suggested labels

bug

Suggested reviewers

  • wpak-ai

Poem

🐰 Whisker-twitching ode to observability

From chaos we build structure bright,
With failure types now categorized right,
The health endpoint glows with extra pride,
Queue depths and DB pools no longer hide,
Production runs smooth—or alerts ring loud! 🔔

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage – ⚠️ Warning: Docstring coverage is 25.71%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check – ❓ Inconclusive: The title is vague and generic; the phrase "production monitor and stabilization problem" doesn't clearly convey the specific changes made. Resolution: use a more specific title that describes the main change, such as "Add production monitoring with failure categorization and health endpoint enhancements".

✅ Passed checks (3 passed)

  • Description check – ✅ Passed: The pull request description is comprehensive, detailing all files changed, new functionality, test additions, and configuration requirements with clear structure.
  • Linked Issues check – ✅ Passed: All acceptance criteria from issue #28 are met: structured health endpoint fields, failure category logging, Slack ops alerts on stale polls, queue backpressure warnings, and test coverage.
  • Out of Scope Changes check – ✅ Passed: All code changes directly support the production monitoring and stabilization objectives from issue #28; no unrelated modifications detected.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_health.py (1)

106-114: ⚡ Quick win

Add a test for failing extra_fields_fn fallback behavior.

Please add a case where extra_fields_fn raises, then assert /health still returns 200 with base fields. This is a key resilience branch introduced in this PR.
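The resilience branch under discussion (callback errors caught so /health still returns 200) can be sketched as follows. `build_health_payload` is a hypothetical free function used for illustration; the real behavior is implemented inside _HealthHandler in src/paperscout/health.py:

```python
import json
import logging

log = logging.getLogger("paperscout.health")


def build_health_payload(base: dict, extra_fields_fn=None) -> bytes:
    # Merge callback-provided fields into the base payload; swallow
    # callback errors so the endpoint can still serve 200 with the
    # base fields alone.
    payload = dict(base)
    if extra_fields_fn is not None:
        try:
            payload.update(extra_fields_fn())
        except Exception:
            log.exception("HEALTH-EXTRA-FIELDS-ERROR")
    return json.dumps(payload).encode()
```

A test along the lines the reviewer requests would pass a raising callback and assert the base fields survive.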

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_health.py` around lines 106 - 114, Add a new test (or extend
test_health_extra_fields_merged) that registers an extra_fields_fn which raises
an exception and then calls the /health endpoint to verify resilience: assert
the HTTP status is 200 and only the base health fields (e.g. "version" and
"last_successful_poll") are present and correctly valued; reference the existing
test helper/fixture used by tests (health_url_with_extras or create a similar
fixture that wires a failing extra_fields_fn) and ensure the failing
extra_fields_fn does not propagate, so /health returns base fields and a 200
response.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a063c352-50de-4861-81b6-e0bee7bc4368

📥 Commits

Reviewing files that changed from the base of the PR and between 87a5475 and fc92c2b.

📒 Files selected for processing (10)
  • src/paperscout/__main__.py
  • src/paperscout/config.py
  • src/paperscout/errors.py
  • src/paperscout/health.py
  • src/paperscout/monitor.py
  • src/paperscout/scout.py
  • src/paperscout/sources.py
  • tests/test_health.py
  • tests/test_monitor.py
  • tests/test_sources.py

Comment thread src/paperscout/monitor.py
Comment on lines +115 to +117
self._last_successful_poll: float | None = None
self._last_probe_stats: dict[str, int] = {}
self._last_ops_alert: float | None = None

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix stale-alert gating and guard ops_alert_fn failures.

On Line 323, requiring self._last_successful_poll is not None suppresses alerts when the scheduler has never had a successful poll. Also, on Line 330, an exception from ops_alert_fn can crash run_forever().

💡 Proposed fix
@@
         self._last_successful_poll: float | None = None
         self._last_probe_stats: dict[str, int] = {}
         self._last_ops_alert: float | None = None
+        self._started_at: float = time.time()
@@
-            if self.ops_alert_fn and self._last_successful_poll is not None:
-                stale = time.time() - self._last_successful_poll
+            if self.ops_alert_fn:
+                last_ok = self._last_successful_poll or self._started_at
+                stale = time.time() - last_ok
                 alert_threshold = 2 * interval
                 now_m = time.monotonic()
                 if stale > alert_threshold and (
                     self._last_ops_alert is None or (now_m - self._last_ops_alert) > interval
                 ):
-                    self.ops_alert_fn(
-                        f"No successful poll in {stale / 60:.0f}min "
-                        f"(threshold={2 * self.cfg.poll_interval_minutes}min)"
-                    )
-                    self._last_ops_alert = now_m
+                    try:
+                        self.ops_alert_fn(
+                            f"No successful poll in {stale / 60:.0f}min "
+                            f"(threshold={2 * self.cfg.poll_interval_minutes}min)"
+                        )
+                        self._last_ops_alert = now_m
+                    except Exception:
+                        log.exception("OPS-ALERT-ERROR")

Also applies to: 323-334

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/paperscout/monitor.py` around lines 115 - 117, The stale-alert gating
currently blocks alerts when self._last_successful_poll is None and also calls
ops_alert_fn without protection; update the gating so alerts are allowed if
there has never been a successful poll OR the last_successful_poll is older than
the stale threshold (i.e., treat None as "stale"), and wrap the ops_alert_fn
invocation in a try/except inside run_forever() to catch/log any exception and
prevent crashing while still updating/using self._last_ops_alert appropriately;
reference self._last_successful_poll, self._last_ops_alert, ops_alert_fn, and
run_forever() when making the changes.



Development

Successfully merging this pull request may close these issues.

Production Monitoring Setup + Stabilization

1 participant