5 Patterns for Building Resilient Event-Driven Integrations
Point-to-point integrations are easy to build and easy to break. You wire up an API call from one system to another, it works in testing, and then a 30-second downstream outage in production causes a cascade of failures, lost state, and a manual cleanup effort that takes longer than the outage itself.
Event-driven integration patterns address this directly. They decouple the systems involved so that no single failure propagates through the entire integration chain. The tradeoff is more upfront design work, but the resulting operational stability far exceeds what brittle point-to-point wiring can offer.
Here are five patterns that appear in most well-built event-driven integrations, with examples of when and why each one matters.
1. Queue-Based Event Processing
What it is: Instead of processing webhook events or API callbacks synchronously in the request handler, your endpoint stores each incoming event in a message queue or database table and returns an acknowledgment immediately. A separate worker process reads from the queue and handles the business logic.
Why it matters: Webhook providers set short timeout windows - typically 5 to 30 seconds. If your handler does any significant processing before responding, you risk timing out even when nothing is wrong with your application. The provider then marks the delivery as failed and retries, creating duplicates.
Separating acknowledgment from processing eliminates this window entirely. The endpoint does the minimum work (validate, store, acknowledge), and the worker handles everything else.
Example:
```javascript
// Endpoint: validates, stores, acknowledges
app.post('/webhooks', async (req, res) => {
  if (!validateSignature(req)) return res.status(401).end();
  await eventStore.push({ id: req.body.id, payload: req.body });
  res.status(202).end();
});

// Worker: processes independently
async function processQueue() {
  const event = await eventStore.dequeue();
  await handleBusinessLogic(event);
  await eventStore.markProcessed(event.id);
}
```
2. Idempotent Consumers
What it is: Every event handler checks whether the event has already been processed before running any business logic. The event ID from the provider payload is used as the idempotency key, stored in a processed_events table. Processing the same event twice produces the same outcome as processing it once.
Why it matters: No event delivery system guarantees exactly-once delivery. Retries, network partitions, and processing failures all create scenarios where the same event arrives multiple times. Without idempotency at the consumer level, duplicates produce duplicate side effects - fulfilled orders, sent emails, deducted inventory.
Idempotent consumers are the primary defense against duplicate processing at the application layer, and they are necessary regardless of what queue or broker infrastructure you use.
When to apply: Every event consumer that writes state or triggers side effects. If the handler is purely read-only and produces no observable changes, idempotency is unnecessary, though still harmless.
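As a minimal sketch, an idempotent consumer can look like the following. The in-memory `processed_ids` set and the `send_confirmation_email` side effect are illustrative stand-ins, not part of any particular framework; a real implementation would back the check with a `processed_events` table and a unique constraint on the event ID.

```python
# In-memory stand-ins for a processed_events table and a side effect.
processed_ids = set()
emails_sent = []

def send_confirmation_email(order_id):
    emails_sent.append(order_id)  # the side effect we must not duplicate

def handle_event(event):
    if event["id"] in processed_ids:
        return "duplicate"                # already handled: no side effects
    send_confirmation_email(event["order_id"])
    processed_ids.add(event["id"])        # record only after success
    return "processed"
```

Delivering the same event twice now produces exactly one email: the second delivery is detected and skipped before any business logic runs.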
3. Dead Letter Queues
What it is: Events that fail to process after a defined number of retry attempts are moved to a separate "dead letter" storage location rather than dropped. A dead letter queue (DLQ) holds failed events for manual inspection and eventual reprocessing.
Why it matters: Some events fail not because of transient infrastructure issues but because of application-level problems: a referenced record does not exist, the payload is malformed, or an edge case in the business logic throws an unhandled exception. These events will fail on every retry until the underlying issue is fixed.
Without a DLQ, these events silently disappear. You may not know what data was missed until a customer reports a problem. With a DLQ, failed events are available for inspection, and once the code issue is fixed, they can be reprocessed without requiring the provider to resend them.
Basic implementation:
```python
MAX_RETRIES = 3

def process_with_retry(event):
    for attempt in range(MAX_RETRIES):
        try:
            handle_event(event)
            return  # Success
        except Exception as e:
            log_attempt_failure(event.id, attempt, str(e))
            if attempt == MAX_RETRIES - 1:
                dead_letter_queue.push(event)  # Move to DLQ
                return
```
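Once the underlying bug is fixed, draining the DLQ can be as simple as the sketch below. The `reprocess_dlq` function and its arguments are illustrative names, not part of any queue library: it retries every dead-lettered event and keeps only the ones that still fail.

```python
def reprocess_dlq(dlq, handle_event):
    """Replay dead-lettered events after a fix ships.

    Returns the events that still fail, so they stay available
    for further inspection instead of being dropped.
    """
    still_failing = []
    for event in dlq:
        try:
            handle_event(event)
        except Exception:
            still_failing.append(event)
    return still_failing
```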
4. Circuit Breakers for Downstream Failures
What it is: A circuit breaker wraps calls to downstream services and tracks failure rates. When failures exceed a threshold, the circuit "opens" and subsequent calls fail immediately without attempting the downstream request. After a cooldown period, the circuit enters a "half-open" state and tests whether the downstream service has recovered.
Why it matters: When a downstream service (a payment gateway, a shipping API, a CRM) is experiencing an outage, your event handlers will fail on every attempt. Without a circuit breaker, your workers keep attempting calls to a known-bad service, consuming resources and creating a backlog of failed events.
Martin Fowler's Circuit Breaker pattern is the widely referenced description of this design. In practice, most teams implement it with a library (Opossum for Node.js; Resilience4j or the now-in-maintenance Hystrix for Java) rather than from scratch.
The circuit breaker is particularly valuable in event-driven integrations because it prevents a temporary downstream outage from turning into a permanent data backlog. When the downstream recovers, the circuit closes and events that were queued during the outage process normally.
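For illustration, a bare-bones version of the closed/open/half-open state machine might look like this. The class name, thresholds, and timing are assumptions for the sketch, not any library's API:

```python
import time

class CircuitBreaker:
    """Minimal illustrative circuit breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip (or re-trip) open
            raise
        else:
            self.failures = 0      # any success closes the circuit
            self.opened_at = None
            return result
```

While the circuit is open, workers fail fast instead of burning time on requests to a service that is known to be down; queued events simply wait for the half-open probe to succeed.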
"The pattern we see most often in integration work is teams building point-to-point connections that are brittle by design. Event-driven patterns are more work upfront, but the operational stability over time is not even close." - Dennis Traina, 137Foundry
5. Event Sourcing for Audit Trails
What it is: Rather than updating application state in-place, every state change is recorded as an immutable event in an event log. The current state of any entity is derived by replaying its event history. This is the core idea behind event sourcing.
Why it matters: For integration systems that handle high-value business events (payments, order state changes, inventory updates), the ability to audit what happened and replay events to rebuild state is genuinely valuable. When something goes wrong - a processing bug, a deployment that corrupted state - you can replay events from the log to restore correct state.
This is a heavier architectural commitment than the other four patterns. It is worth the investment for domains with complex state transitions, audit requirements, or frequent debugging needs. For simpler integrations, a combination of the first four patterns (queue, idempotency, DLQ, circuit breaker) provides most of the reliability benefits without the full event sourcing model.
When to apply: Financial transaction systems, inventory management with external integrations, any domain where "what happened and when" needs to be auditable over time.
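A minimal sketch of the replay idea, using an illustrative order entity with invented event names (the reducer and schema here are examples, not a prescribed design):

```python
# Append-only event log; current state is always derived by replay.
event_log = []

def append_event(entity_id, event_type, data):
    event_log.append({"entity_id": entity_id, "type": event_type, "data": data})

def current_state(entity_id):
    """Rebuild an order's state by replaying its events in order."""
    state = {"status": None, "items": []}
    for evt in event_log:
        if evt["entity_id"] != entity_id:
            continue
        if evt["type"] == "OrderCreated":
            state["status"] = "created"
        elif evt["type"] == "ItemAdded":
            state["items"].append(evt["data"]["sku"])
        elif evt["type"] == "OrderShipped":
            state["status"] = "shipped"
    return state
```

Because the log is immutable, a bug in the reducer can be fixed and the state rebuilt from the same events, which is exactly the recovery property the pattern buys you.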
Combining the Patterns
These five patterns compose naturally. A typical production integration setup looks like:
- Events arrive at an endpoint that stores them in a queue (Pattern 1)
- Workers dequeue events, run an idempotency check (Pattern 2), and attempt processing
- Failed attempts are retried up to a limit, then moved to a DLQ (Pattern 3)
- Calls to downstream services go through a circuit breaker (Pattern 4)
- All state changes are written as immutable event records (Pattern 5, for applicable domains)
None of these patterns require specific infrastructure choices. They can be implemented with a PostgreSQL table as a queue, a Redis set for idempotency keys, a separate database table as a DLQ, and a simple failure counter in memory as a circuit breaker.
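To make the composition concrete, here is an in-memory sketch of patterns 1 through 3 working together. The names and the single-step worker are illustrative; a production version would swap the `deque`, `set`, and list for the durable stores described above.

```python
from collections import deque

MAX_RETRIES = 3
queue, dlq, processed = deque(), [], set()  # Pattern 1 / 3 / 2 stand-ins

def worker_step(handle):
    """Process one queued event: idempotency check, bounded retries, DLQ."""
    if not queue:
        return None
    event = queue.popleft()
    if event["id"] in processed:           # Pattern 2: skip duplicates
        return "duplicate"
    for attempt in range(MAX_RETRIES):     # Pattern 3: bounded retries
        try:
            handle(event)
            processed.add(event["id"])
            return "processed"
        except Exception:
            if attempt == MAX_RETRIES - 1:
                dlq.append(event)          # dead-letter after final retry
                return "dead-lettered"
```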
For teams building integrations that handle high-value business events and where reliability matters, API integration firm 137Foundry designs and implements these architectures as part of their data integration work. For a detailed look at the webhook-specific reliability patterns these designs are built on, the guide to building webhook integrations that handle failures gracefully covers the core decisions.
The foundational reading for event-driven reliability is well-distributed across the industry: the message queue pattern and idempotence articles on Wikipedia provide solid conceptual grounding, and Fowler's circuit breaker article is the canonical implementation reference.
DEV Community
https://dev.to/137foundry/5-patterns-for-building-resilient-event-driven-integrations-3i8h