BUG: Webhook system reliability issues causing missed updates in v1.7.31 (replay protection, instance matching, SSE) #62

Closed
opened 2026-05-28 10:39:42 +01:00 by Gandalf · 0 comments
Owner

Related to: Season pack queue handling (#61) – webhooks are the primary trigger for queue refreshes that feed the broken season pack data path.

Description:

Webhooks from Sonarr, Radarr, and Ombi frequently fail to trigger real-time dashboard/SSE updates ("not working again" in v1.7.31). This regressed or became more noticeable after the poller/SSE test expansions and frontend changes in this release, even though no direct webhook logic was modified.

Detailed Investigation Findings (release/1.7.31)

1. Overly Strict Replay Protection (server/routes/webhook.js)

  • isReplay(eventType, instanceName, eventDate) uses a simple key: `${eventType}:${instanceName || ''}:${eventDate}` (or `${requestId}-${eventDate}` for Ombi).
  • Sonarr/Radarr often send date with sub-second variations, timezone differences, or identical values for rapid events (e.g., Grab → Download).
  • Valid events misclassified as replays return 200 { received: true, duplicate: true } and completely skip processWebhookEvent() + pollAllServices(), so no SSE broadcast is sent.
  • No normalization of eventDate (e.g., to minute precision) and no content-hash fallback.
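
A minimal sketch of what a normalised replay key might look like, assuming the payload carries a parseable date and a downloadId field. Both helper names are hypothetical and not part of the current webhook.js. Note that truncating to the minute on its own widens the duplicate window, so the sketch adds downloadId as a discriminator.

```javascript
// Hypothetical helpers: neither function exists in the current webhook.js.

// Collapse an event timestamp to UTC minute precision so that sub-second
// jitter and differing timezone offsets map to the same value.
function normaliseEventDate(eventDate) {
  const ms = Date.parse(eventDate);
  // Unparseable or missing dates are kept verbatim rather than dropped.
  if (Number.isNaN(ms)) return String(eventDate ?? '');
  return new Date(Math.floor(ms / 60000) * 60000).toISOString();
}

// Build the replay key. downloadId is an assumed payload field, included so
// that distinct events within the same minute do not collapse into one key.
function buildReplayKey({ eventType, instanceName, eventDate, downloadId }) {
  return [
    eventType,
    instanceName || '',
    downloadId || '',
    normaliseEventDate(eventDate),
  ].join('|');
}
```

With this shape, the same event reported as 10:39:42.123Z and as 11:39:42.900+01:00 yields one key, while two grabs with different downloadId values in the same minute stay distinct.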

2. Brittle Instance Resolution

  • const inst = sonarrInstances.find(i => i.name === instanceName || i.id === instanceName) || sonarrInstances[0];
  • Payload instanceName from *arr often doesn't exactly match configured name or id → falls back to first instance.
  • Causes wrong cache.updateWebhookMetrics() and incorrect cache keys for multi-instance users.
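
One possible shape for a more tolerant lookup, sketched under the assumption that instances are plain objects with name and id fields as in the snippet above. resolveInstance and its matchedBy field are hypothetical. The sketch uses case- and whitespace-insensitive comparison as a conservative stand-in for fuzzy matching, and logs whenever it falls back to the first instance.

```javascript
// Hypothetical helper: resolves a webhook's instanceName to a configured
// instance and reports how the match was made.
function resolveInstance(instances, instanceName, log = console.warn) {
  const norm = (v) => String(v ?? '').trim().toLowerCase();

  // 1. Exact match, same as the existing behaviour.
  const exact = instances.find(
    (i) => i.name === instanceName || i.id === instanceName
  );
  if (exact) return { instance: exact, matchedBy: 'exact' };

  // 2. Case- and whitespace-insensitive match.
  const wanted = norm(instanceName);
  const loose = wanted
    ? instances.find((i) => norm(i.name) === wanted || norm(i.id) === wanted)
    : undefined;
  if (loose) return { instance: loose, matchedBy: 'normalised' };

  // 3. Fall back to the first instance, but make the decision visible.
  log(`webhook: no instance matched "${instanceName}", using first instance`);
  return { instance: instances[0] ?? null, matchedBy: 'fallback' };
}
```

Returning matchedBy alongside the instance lets the caller decide whether a fallback match should still update per-instance metrics and cache keys.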

3. Ombi-Specific Fragility (despite query-param secret fix in commit 7b9c895)

  • Ombi route bypasses validatePayload().
  • Complex 3-retry + delay + extractRequestedUser() logic in processWebhookEvent('ombi', ...) can fail silently on certain payloads/Ombi versions, leaving poll:ombi-requests stale.
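
To illustrate how the retry loop could stop failing silently, here is a generic sketch of a retry wrapper that logs every failed attempt and rethrows the last error. retryWithLogging, its options, and the default delay are assumptions for illustration. The actual Ombi retry logic is not shown in this issue.

```javascript
// Hypothetical helper: retries an async task and logs each failure.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithLogging(
  task,
  { attempts = 3, delayMs = 500, label = 'task', log = console.warn } = {}
) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      // The attempt number is passed through in case the task wants it.
      return await task(attempt);
    } catch (err) {
      lastError = err;
      log(`${label}: attempt ${attempt}/${attempts} failed: ${err.message}`);
      // Wait between attempts, but not after the final one.
      if (attempt < attempts) await sleep(delayMs);
    }
  }
  // Surface the failure to the caller instead of swallowing it.
  throw lastError;
}
```

Because the wrapper rethrows after the final attempt, the caller can record a failure metric or mark poll:ombi-requests as stale rather than leaving the state unknown.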

4. Fire-and-Forget + SSE Dependency

  • processWebhookEvent(...).catch(err => log...) only logs.
  • Final await pollAllServices() (which triggers SSE to all connected clients) can fail without user-visible feedback.
  • No per-webhook success metric or UI confirmation.
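
A rough sketch of how both stages could report their outcome instead of only logging. runWebhookPipeline and the outcome object are hypothetical. processWebhookEvent and pollAllServices are passed in as stand-ins, with the call shapes assumed from the snippets in this issue.

```javascript
// Hypothetical wrapper: runs processing and polling, recording per-stage
// success or failure so the caller can expose it as a metric.
async function runWebhookPipeline(source, payload, deps) {
  const { processWebhookEvent, pollAllServices, log = console.error } = deps;
  const outcome = { source, processed: false, polled: false, errors: [] };

  const stages = [
    ['process', () => processWebhookEvent(source, payload), 'processed'],
    ['poll', () => pollAllServices(), 'polled'],
  ];

  for (const [stage, run, flag] of stages) {
    try {
      await run();
      outcome[flag] = true;
    } catch (err) {
      // Log with context and keep going: polling still runs if
      // processing failed, as in the current flow.
      outcome.errors.push({ stage, message: err.message });
      log(`webhook ${source}: ${stage} stage failed: ${err.message}`);
    }
  }
  return outcome;
}
```

The returned outcome can feed a per-webhook success counter without changing the HTTP response the sender sees.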

5. Interaction with Poller & Queue Refresh

  • When webhooks succeed, they call arrRetrieverRegistry.getQueuesByType() → PollingSonarrRetriever.getQueue() (which has the season pack bug tracked in #61).
  • Missed webhooks → stale queue data → dashboard never updates for active downloads.

Impact:

  • Real-time "<1s updates" promise is broken for many users.
  • Dashboard stays on old data until next poll fallback (default 10 min).
  • Compounds the season pack issue (#61) because webhooks are the main refresh path.

Proposed Solution / Fix Plan:

  1. Replay Protection Hardening (High Priority)

    • Normalise eventDate to minute precision or use a stable content hash of key fields (eventType + title + downloadId).
    • Add optional ?force=true query param for testing/admin bypass.
    • Log "duplicate" decisions with the actual key for debugging.
  2. Instance Matching Improvements

    • Add fuzzy matching or fallback to URL-based matching.
    • Log when fallback to first instance occurs.
  3. Ombi Hardening

    • Make Ombi route also call a relaxed validatePayloadOmbi() version.
    • Add better logging around the retry loop and user extraction failures.
  4. Observability

    • Add structured webhook metrics (success/failure/duplicate counts per instance).
    • Expose via /api/status or new debug endpoint.
    • Ensure pollAllServices() errors are caught and logged with context.
  5. Cross-Dependency Note

    • This fix will improve reliability of queue refreshes that currently feed the broken season pack logic in #61. After both are fixed, real-time multi-episode tracking will finally work.
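
The structured metrics proposed under Observability could be sketched roughly as follows. WebhookMetrics is a hypothetical class, not existing code. It keeps success/failure/duplicate counts per instance in memory and returns a plain object that a status endpoint could serialise.

```javascript
// Hypothetical in-memory counter for webhook outcomes per instance.
class WebhookMetrics {
  constructor() {
    this.counts = new Map();
  }

  // outcome must be one of 'success', 'failure' or 'duplicate'.
  record(instanceName, outcome) {
    if (!WebhookMetrics.OUTCOMES.includes(outcome)) {
      throw new RangeError(`unknown webhook outcome: ${outcome}`);
    }
    const key = instanceName || 'unknown';
    const entry =
      this.counts.get(key) ?? { success: 0, failure: 0, duplicate: 0 };
    entry[outcome] += 1;
    this.counts.set(key, entry);
  }

  // Return a copy so callers cannot mutate the internal counters.
  snapshot() {
    return Object.fromEntries(
      [...this.counts].map(([name, entry]) => [name, { ...entry }])
    );
  }
}
WebhookMetrics.OUTCOMES = ['success', 'failure', 'duplicate'];
```

A sudden rise in the duplicate count for a single instance would point directly at the replay-key problem described in finding 1.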

Suggested Labels:
Kind/Bug, Priority: High, Area/Webhooks, Area/SSE, Compat/Non-Breaking

Affected Versions: v1.7.27 – v1.7.31 (replay protection introduced earlier; became painful with increased webhook usage and poller changes).

Gandalf added the Kind/Bug and Priority: High labels 2026-05-28 10:39:42 +01:00
Gandalf added the Area/SSE, Area/Webhooks, and Compat/Non-Breaking labels 2026-05-28 11:57:25 +01:00
Reference: Gandalf/sofarr#62