Webhook Best Practices: Retry Logic, Idempotency, and Error Handling
Most webhook integrations fail silently. A handler returns 500, the provider retries a few times, then stops. Your system never processed the event and no one knows.
Webhooks are not guaranteed delivery by default. How reliably your integration works depends almost entirely on how you write the receiver. This guide covers the patterns that make webhook handlers production-grade: proper retry handling, idempotency, error response codes, and queue-based processing.
Understand the Delivery Model
Before building handlers, understand what you are dealing with:
- Providers send webhook events as HTTP POST requests
- They expect a 2xx response within a timeout (typically 5-30 seconds)
- If they do not receive 2xx, they retry on a schedule (often exponential backoff over hours or days)
- Most providers have a maximum retry count after which the event is dropped
- Some providers allow you to manually retry from their dashboard
Stripe retry schedule:

    Attempt 1: immediate
    Attempt 2: 5 minutes
    Attempt 3: 30 minutes
    Attempt 4: 2 hours
    Attempt 5: 5 hours
    Attempt 6: 10 hours
    Attempt 7: 24 hours
    ... continues for ~72 hours total
This retry behavior is your safety net -- but only if your handler is idempotent.
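To get a feel for how quickly an exponential schedule stretches out, here is a minimal sketch that computes retry delays. The name `retrySchedule` and its parameters (base delay, growth factor, cap) are assumptions for illustration; the numbers do not reproduce any particular provider's schedule.

```javascript
// Hypothetical helper: compute exponential backoff delays in seconds.
// baseSeconds, factor, and capSeconds are illustrative parameters.
function retrySchedule(attempts, baseSeconds = 60, factor = 2, capSeconds = 24 * 3600) {
  const delays = [];
  for (let i = 0; i < attempts; i++) {
    // The delay grows geometrically but never exceeds the cap
    delays.push(Math.min(baseSeconds * factor ** i, capSeconds));
  }
  return delays;
}

const delays = retrySchedule(5);
const totalSeconds = delays.reduce((sum, d) => sum + d, 0);
console.log(delays, totalSeconds);
```

With a cap in place, late retries settle at a fixed interval, which is why a provider's retry window is bounded in days rather than growing without limit.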
Rule 1: Respond Fast, Process Async
Your webhook handler should acknowledge receipt immediately and do the actual work in the background. If you do database writes, call external APIs, or send emails synchronously inside the handler, you risk timing out.
    // BAD: synchronous processing risks timeout
    app.post('/webhook/stripe', async (req, res) => {
      const event = JSON.parse(req.body);

      if (event.type === 'payment_intent.succeeded') {
        // This could take several seconds
        await fulfillOrder(event.data.object);
        await sendConfirmationEmail(event.data.object.metadata.email);
        await updateInventory(event.data.object.metadata.items);
      }

      res.json({ received: true }); // might never get here if above throws
    });

    // GOOD: acknowledge immediately, process async
    app.post('/webhook/stripe', async (req, res) => {
      const event = JSON.parse(req.body);

      // Queue the work and respond in milliseconds
      await queue.add('stripe-webhook', { event });

      res.json({ received: true }); // returns 200 as soon as the event is queued
    });

    // Worker processes the queue
    queue.process('stripe-webhook', async (job) => {
      const { event } = job.data;
      if (event.type === 'payment_intent.succeeded') {
        await fulfillOrder(event.data.object);
        await sendConfirmationEmail(event.data.object.metadata.email);
        await updateInventory(event.data.object.metadata.items);
      }
    });
The queue gives you retry logic, failure visibility, and async processing without blocking the HTTP response.
Rule 2: Make Handlers Idempotent
Since providers retry webhooks, your handler may receive the same event multiple times. You must make your handler safe to run more than once with the same event ID.
Without idempotency, a network blip that causes Stripe to retry a payment_intent.succeeded event could fulfill the same order twice, create duplicate records, or send duplicate emails.
Track Processed Event IDs
The simplest approach: store event IDs and skip events you have already processed.
    async function handleStripeEvent(event) {
      // Check if we already processed this event
      const existing = await db.query(
        'SELECT id FROM processed_webhooks WHERE event_id = $1',
        [event.id]
      );

      if (existing.rows.length > 0) {
        console.log(`Skipping duplicate event: ${event.id}`);
        return; // idempotent: no-op on duplicate
      }

      // Process the event
      await processEvent(event);

      // Record that we processed it
      await db.query(
        'INSERT INTO processed_webhooks (event_id, processed_at) VALUES ($1, NOW())',
        [event.id]
      );
    }
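A check-then-record flow leaves a small window in which two concurrent deliveries of the same event can both pass the check. One way to close it is to claim the event ID before doing any work and release the claim if the work fails. The sketch below uses an in-memory `Set` as a stand-in for a table with a unique constraint on the event ID; `createDeduplicator` and `handleOnce` are hypothetical names, and the processing callback is synchronous to keep the example short.

```javascript
// In-memory stand-in for a processed_webhooks table with a unique event_id.
// createDeduplicator and handleOnce are hypothetical names for illustration.
function createDeduplicator() {
  const seen = new Set();
  return {
    // Returns true only for the first caller to claim a given event ID
    claim(eventId) {
      if (seen.has(eventId)) return false;
      seen.add(eventId);
      return true;
    },
    // Release the claim so a later retry can process the event again
    release(eventId) {
      seen.delete(eventId);
    },
  };
}

function handleOnce(dedup, event, processEvent) {
  if (!dedup.claim(event.id)) return 'skipped';
  try {
    processEvent(event);
    return 'processed';
  } catch (err) {
    dedup.release(event.id); // failed work must stay retryable
    throw err;
  }
}

const dedup = createDeduplicator();
const orders = [];
const event = { id: 'evt_123', type: 'payment_intent.succeeded' };
const first = handleOnce(dedup, event, (e) => orders.push(e.id));
const second = handleOnce(dedup, event, (e) => orders.push(e.id));
console.log(first, second, orders.length); // processed skipped 1
```

In a real database the claim would be an insert that relies on the unique constraint, so that two workers racing on the same event cannot both succeed.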
Upsert Instead of Insert
When creating records from webhook data, use upsert (insert-or-update) instead of plain insert:
    -- BAD: fails or creates duplicate on retry
    INSERT INTO subscriptions (stripe_id, user_id, status, plan)
    VALUES ($1, $2, $3, $4);

    -- GOOD: idempotent, safe to run multiple times
    INSERT INTO subscriptions (stripe_id, user_id, status, plan)
    VALUES ($1, $2, $3, $4)
    ON CONFLICT (stripe_id) DO UPDATE SET
      status = EXCLUDED.status,
      plan = EXCLUDED.plan;
Use Database Transactions with Idempotency Key
For more complex operations, wrap the idempotency check and business logic in a transaction:
    async function handleWebhookIdempotent(eventId, operation) {
      return await db.transaction(async (trx) => {
        // Atomic check-and-insert prevents race conditions on concurrent retries.
        // Assumes a unique constraint on processed_webhooks.event_id.
        const result = await trx.raw(
          `INSERT INTO processed_webhooks (event_id, processed_at)
           VALUES (?, NOW())
           ON CONFLICT (event_id) DO NOTHING
           RETURNING event_id`,
          [eventId]
        );

        if (result.rows.length === 0) {
          // Already processed: skip
          return null;
        }

        // Run business logic inside the same transaction
        return await operation(trx);
      });
    }
Rule 3: Return the Right HTTP Status Codes
Your response code tells the provider whether to retry. Use it correctly:
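| Status  | Meaning                                   | Provider behavior |
|---------|-------------------------------------------|-------------------|
| 200-299 | Success                                   | No retry |
| 400     | Bad request (you chose not to process it) | Providers usually stop retrying, though some retry on any non-2xx response |
| 401/403 | Unauthorized                              | Providers usually stop retrying, though some retry on any non-2xx response |
| 500-503 | Your server error                         | Provider retries |
| Timeout | No response in time                       | Provider retries |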
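The key distinction: use 5xx when the error is transient (database temporarily down, external API timeout) and 4xx when the error is permanent (invalid payload format, failed signature check). For event types you simply do not handle, return 200: nothing is wrong with the delivery, and a non-2xx response may be counted as a failure of your endpoint.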
    app.post('/webhook', async (req, res) => {
      let event;

      // Signature verification failure: return 400, don't want retry
      try {
        event = verifyAndParseWebhook(req.body, req.headers);
      } catch (err) {
        return res.status(400).json({ error: 'Invalid signature' });
      }

      // Unknown event type: return 200, don't retry
      if (!supportedEvents.includes(event.type)) {
        return res.status(200).json({ received: true, skipped: true });
      }

      // Queue for async processing, return 200 fast
      try {
        await queue.add(event);
        return res.status(200).json({ received: true });
      } catch (err) {
        // Queue is down: return 503 so provider retries later
        return res.status(503).json({ error: 'Service unavailable' });
      }
    });
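The handler relies on a `verifyAndParseWebhook` helper that the article does not define. As a rough idea of what such a helper does, here is a sketch of HMAC-SHA256 verification using Node's built-in `crypto` module. The scheme (a hex digest of the raw body sent in a header), the three-argument signature, and the names `signPayload` and `verifySignedPayload` are assumptions for illustration. Real providers each define their own header format, and the official SDK is usually the safer choice.

```javascript
const crypto = require('node:crypto');

// Hypothetical scheme: the provider sends hex(HMAC-SHA256(secret, rawBody))
// in a header. Real providers differ (prefixes, timestamps, encodings).
function signPayload(rawBody, secret) {
  return crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
}

function verifySignedPayload(rawBody, signatureHeader, secret) {
  const expected = Buffer.from(signPayload(rawBody, secret), 'hex');
  const received = Buffer.from(String(signatureHeader ?? ''), 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first
  if (received.length !== expected.length ||
      !crypto.timingSafeEqual(received, expected)) {
    throw new Error('Invalid signature');
  }
  // Parse only after the signature has been verified
  return JSON.parse(rawBody);
}

const secret = 'whsec_test';
const body = JSON.stringify({ id: 'evt_1', type: 'ping' });
const verified = verifySignedPayload(body, signPayload(body, secret), secret);
console.log(verified.type); // ping
```

The signature must be computed over the raw request body exactly as received. Verifying against a re-serialized JSON object tends to fail, because key order and whitespace may differ from what the provider signed.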
Rule 4: Handle Out-of-Order Delivery
Providers do not guarantee that webhooks arrive in the order events occurred. A customer.subscription.updated event might arrive before the customer.subscription.created event for the same subscription.
Design your handlers to work regardless of order:
    async function handleSubscriptionEvent(event) {
      const sub = event.data.object;

      if (event.type === 'customer.subscription.updated') {
        // Don't assume the subscription already exists in your DB.
        // updated_at comes from the event timestamp (event.created, Unix seconds),
        // not NOW(), so the staleness check below compares event times.
        await db.query(
          `INSERT INTO subscriptions (stripe_id, status, plan, updated_at)
           VALUES ($1, $2, $3, to_timestamp($4))
           ON CONFLICT (stripe_id) DO UPDATE SET
             status = EXCLUDED.status,
             plan = EXCLUDED.plan,
             updated_at = EXCLUDED.updated_at
           WHERE subscriptions.updated_at < EXCLUDED.updated_at`,
          // Assumed mapping: the plan is taken from the first subscription item
          [sub.id, sub.status, sub.items.data[0].price.id, event.created]
        );
      }
    }
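The WHERE subscriptions.updated_at < EXCLUDED.updated_at clause handles the case where an older event arrives after a newer one: stale data will not overwrite newer data. The comparison is only meaningful when updated_at is derived from the provider's event timestamp (event.created for Stripe) rather than from the time your server happened to receive the event.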
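The same guard can be seen in isolation with an in-memory sketch. Here a `Map` stands in for the subscriptions table, and `applyIfNewer` is a hypothetical helper that writes only when the incoming event timestamp is strictly newer than the stored one.

```javascript
// In-memory stand-in for the subscriptions table, keyed by subscription ID.
// applyIfNewer is a hypothetical helper; eventTime is the provider's event
// timestamp (for Stripe, the event's `created` field, in Unix seconds).
function applyIfNewer(store, id, fields, eventTime) {
  const current = store.get(id);
  if (current && current.updatedAt >= eventTime) {
    return false; // stale or duplicate event: keep the newer row
  }
  store.set(id, { ...fields, updatedAt: eventTime });
  return true;
}

const subscriptions = new Map();
// The newer event (t=200) arrives before the older one (t=100)
const appliedNewer = applyIfNewer(subscriptions, 'sub_1', { status: 'active' }, 200);
const appliedOlder = applyIfNewer(subscriptions, 'sub_1', { status: 'incomplete' }, 100);
console.log(subscriptions.get('sub_1').status); // active
```

Because an equal timestamp also counts as "not newer", replaying the same event is a no-op, so the guard covers duplicates as well as reordering.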
Rule 5: Log Everything
Log enough to reconstruct what happened to any webhook event without going back to the provider's dashboard:
    const logger = require('pino')();

    app.post('/webhook', async (req, res) => {
      // Stripe carries the event ID in the JSON body, not in a request header
      const eventId = req.body?.id ?? 'unknown';
      const eventType = req.body?.type ?? 'unknown';

      logger.info({ eventId, eventType }, 'Webhook received');

      try {
        await queue.add({ event: req.body });
        logger.info({ eventId, eventType }, 'Webhook queued');
        res.json({ received: true });
      } catch (err) {
        logger.error({ eventId, eventType, err }, 'Failed to queue webhook');
        res.status(503).json({ error: 'Unavailable' });
      }
    });

    // In your queue worker
    queue.process(async (job) => {
      const { event } = job.data;
      logger.info({ eventId: event.id, type: event.type, attempt: job.attemptsMade }, 'Processing webhook');

      try {
        await processEvent(event);
        logger.info({ eventId: event.id }, 'Webhook processed successfully');
      } catch (err) {
        logger.error({ eventId: event.id, err }, 'Webhook processing failed');
        throw err; // let the queue retry
      }
    });
Rule 6: Monitor Webhook Health
Failed webhooks are silent by default. Set up monitoring:
- Check provider dashboards: Stripe, GitHub, and Shopify all show webhook delivery history. Check them regularly or set up alerts.
- Alert on queue depth: If your webhook queue grows, something is wrong upstream.
- Track error rates: Log a counter whenever a webhook handler fails. Alert if the error rate spikes.
- Set up dead letter queues: Events that fail after all retries should go to a dead letter queue for manual inspection, not disappear silently.
    // BullMQ dead letter queue example
    const { Queue, Worker } = require('bullmq');

    // Assumed Redis connection settings; adjust for your environment
    const connection = { host: 'localhost', port: 6379 };

    // In BullMQ, attempts and backoff are job options, set here as queue defaults
    const queue = new Queue('webhooks', {
      connection,
      defaultJobOptions: {
        attempts: 5,
        backoff: { type: 'exponential', delay: 1000 },
      },
    });
    const deadLetterQueue = new Queue('webhooks-dead-letter', { connection });

    const worker = new Worker('webhooks', processWebhook, { connection });

    worker.on('failed', async (job, err) => {
      if (job && job.attemptsMade >= job.opts.attempts) {
        // Move to dead letter queue
        await deadLetterQueue.add('failed-webhook', {
          event: job.data.event,
          error: err.message,
          failedAt: new Date().toISOString(),
        });
      }
    });
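For the error-rate item in the list above, a sliding-window counter is often enough to start with. The sketch below is one possible shape. `createFailureTracker` and its thresholds (`windowMs`, `minEvents`, `maxErrorRate`) are hypothetical, and a production setup would more likely export these counts to an existing metrics system.

```javascript
// Hypothetical sliding-window failure tracker; thresholds are illustrative.
function createFailureTracker({ windowMs = 60_000, minEvents = 10, maxErrorRate = 0.2 } = {}) {
  let samples = []; // entries of the form { time, ok }
  return {
    // Record one handler outcome; `now` is injectable to make testing easy
    record(ok, now = Date.now()) {
      samples.push({ time: now, ok });
      // Drop samples that have fallen out of the window
      samples = samples.filter((s) => now - s.time < windowMs);
    },
    // True when enough events were seen and too many of them failed
    shouldAlert() {
      if (samples.length < minEvents) return false; // too little data
      const failures = samples.filter((s) => !s.ok).length;
      return failures / samples.length > maxErrorRate;
    },
  };
}

const tracker = createFailureTracker();
for (let i = 0; i < 8; i++) tracker.record(true, 1000 + i);
tracker.record(false, 1010);
tracker.record(false, 1011);
const alertAtTwentyPercent = tracker.shouldAlert(); // 2 of 10 failed: at the threshold, no alert
tracker.record(false, 1012);
const alertAboveThreshold = tracker.shouldAlert(); // 3 of 11 failed: above the threshold
console.log(alertAtTwentyPercent, alertAboveThreshold); // false true
```

The `minEvents` floor keeps a single failure on a quiet endpoint from paging anyone.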
Testing Webhook Handling with HookCap
HookCap makes it easy to test these patterns before production:
- Capture real webhook payloads: Point your provider to a HookCap endpoint to collect real events. Inspect headers, body structure, and signature format.
- Test retry handling: Use HookCap's replay feature to send the same event to your handler multiple times. Verify that your idempotency logic prevents duplicate processing.
- Test error recovery: Replay a captured event to a handler you deliberately break (return 500). Watch how your queue retries it. Fix the handler and replay again.
- Simulate out-of-order delivery: Capture a sequence of related events and replay them in reverse order to verify your handler processes them correctly.
The replay feature is especially useful for idempotency testing: you can replay the same event ID dozens of times and confirm your database shows exactly one processed record each time.
Summary
Production webhook handlers need:
- Fast acknowledgment: Return 200 immediately, process async
- Idempotency: Track event IDs, use upserts, handle duplicate deliveries
- Correct status codes: 5xx for transient errors (retry-worthy), 4xx for permanent errors
- Order independence: Design DB writes to handle out-of-order events
- Comprehensive logging: Log receipt, queuing, processing, and failures
- Dead letter queues: Capture events that exhaust all retries
Most webhook failures come down to missing one of these. Add them to your integration checklist before going to production.
DEV Community
https://dev.to/henry_hang/webhook-best-practices-retry-logic-idempotency-and-error-handling-27i3
