

Robust retry mechanism for microservices: Exponential backoff, batch jobs, and 24-hour recovery system

Ensuring visibility, controlled load, and minimal infrastructure overhead while avoiding Kafka and Event-Driven Architecture pitfalls

By Harish Potharaj, Naveen Murthy, and Shubham Karodiya

In a microservices ecosystem, failure is not just possible; it's expected. Networks fail, services go down temporarily, and latency spikes happen. Handling these failures gracefully is the hallmark of a resilient system.

In this article, we describe the retry mechanism we implemented across the layers of our microservices ecosystem, from REST-level network glitches to inter-microservice lag and complete downtime, especially given our dependency on a third-party system over which we had little control.

In our microservices architecture, we deal with tenants that must be processed reliably, despite unpredictable network and service behaviors. These tenants often trigger a series of configuration and onboarding steps via REST APIs and backend processing workflows.

The key failure scenarios that we needed to handle include:

  • REST layer network issues
  • Microservice lag or delays in dependency responses
  • Service downtime or failure beyond retries

We needed a robust retry strategy that addressed all these failure types while ensuring:

  • Visibility into each retry attempt
  • Controlled load on services
  • Recovery from long outages
  • Observability for success or failure
  • Minimal infrastructure overhead

Our retry mechanism

Our retry mechanism consists of these three elements:

  1. Exponential backoff for REST layer network failures
  2. A batch job that runs on a 10-minute interval for microservice lag
  3. A 24-hour batch job for extended downtime recovery

Exponential backoff for REST layer network failures

When a REST call fails due to a transient issue like a timeout or connection error, we use exponential backoff. The retries are performed with a delay pattern such as 1s → 2s → 4s → 8s → 16s, capped at 32 seconds of total retry time.
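To make this concrete, here is a minimal sketch of the pattern in Python. The endpoint URL, payload, and helper name are illustrative assumptions, not our production code; only timeouts and connection errors are retried:

    import time
    import requests

    # Hypothetical endpoint, used only for illustration.
    ONBOARDING_URL = "https://example.com/api/tenants/onboard"

    def call_with_backoff(tenant_id, max_total_delay=32):
        """Retry transient REST failures with exponential backoff (1s, 2s, 4s, ...),
        giving up once the cumulative delay would exceed max_total_delay seconds."""
        delay = 1
        elapsed = 0
        while True:
            try:
                response = requests.post(
                    ONBOARDING_URL, json={"tenantId": tenant_id}, timeout=10
                )
                response.raise_for_status()
                return response
            except (requests.ConnectionError, requests.Timeout):
                if elapsed + delay > max_total_delay:
                    # Local retry budget exhausted; hand off to the queue-based retries.
                    raise
                time.sleep(delay)
                elapsed += delay
                delay *= 2  # 1s -> 2s -> 4s -> 8s -> 16s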

Exponential backoff is a localized retry mechanism that is suitable for:

  • Short-lived, temporary issues (for example, DNS hiccups, momentary load spikes).
  • Ensuring that retries don’t hammer the downstream service.
  • Keeping retries isolated to the component that is making the call.

Why not Kafka for this? These failures occur during real-time interactions where a retry decision must be made instantly. Kafka introduces latency and complexity for such short-lived, transient failures.

Batch job that runs on a 10-minute interval for microservice lag

In certain scenarios, downstream microservices are slow to respond or might be experiencing temporary issues such as high CPU usage, database bottlenecks, or contention on shared resources. These issues result in failed operations that don’t recover immediately with exponential backoff.

To handle such issues gracefully, we implemented:

  • A dedicated queue table that records failed tenant processing attempts.
  • A batch job that runs every 10 minutes to scan this queue and reattempt processing for each tenant.
  • Janitor cleanup logic that deletes old or successfully retried records from the queue table after a configured expiry period (such as 24 to 48 hours), which ensures that the table remains lean and efficient.

This architecture offers:

  • Controlled, periodic retries without hammering services.
  • Full visibility into retry attempts and status per tenant.
  • Database-backed durability and flexibility in modifying retry behavior.

Why not Dead Letter Queue (DLQ)? DLQs are passive. They collect failures but don’t actively retry them or track successful recoveries. Our batch job does both: it retries and cleans up, which ensures that no state is left dangling and offers better observability.
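The following Python sketch illustrates the shape of this queue-and-janitor design. The table name, columns, and use of SQLite are illustrative assumptions; in practice, the job runs against the service database and is triggered every 10 minutes by a scheduler such as cron or a Kubernetes CronJob:

    import sqlite3
    import time

    # Illustrative schema for the retry queue table described above.
    conn = sqlite3.connect("retry_queue.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tenant_retry_queue (
            tenant_id    TEXT PRIMARY KEY,
            status       TEXT DEFAULT 'PENDING',   -- PENDING or SUCCEEDED
            attempts     INTEGER DEFAULT 0,
            last_attempt REAL,
            created_at   REAL DEFAULT (strftime('%s', 'now'))
        )
    """)

    def process_tenant(tenant_id):
        """Placeholder for the real onboarding/configuration call (hypothetical)."""
        ...

    def run_retry_batch():
        """Invoked every 10 minutes by an external scheduler."""
        rows = conn.execute(
            "SELECT tenant_id FROM tenant_retry_queue "
            "WHERE status = 'PENDING' AND attempts < 5"  # daily quota, see the 24-hour job
        ).fetchall()
        for (tenant_id,) in rows:
            try:
                process_tenant(tenant_id)
                status = "SUCCEEDED"
            except Exception:
                status = "PENDING"  # leave in the queue for the next run
            conn.execute(
                "UPDATE tenant_retry_queue "
                "SET status = ?, attempts = attempts + 1, last_attempt = ? "
                "WHERE tenant_id = ?",
                (status, time.time(), tenant_id),
            )
        conn.commit()

    def run_janitor(expiry_hours=48):
        """Delete succeeded or expired rows so the queue table stays lean."""
        cutoff = time.time() - expiry_hours * 3600
        conn.execute(
            "DELETE FROM tenant_retry_queue "
            "WHERE status = 'SUCCEEDED' OR created_at < ?",
            (cutoff,),
        )
        conn.commit()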

24-hour batch job for extended downtime recovery

This layer is the fail-safe for tenants that still haven't been processed despite exponential backoff and the 10-minute queue retries.

Here’s how it works:

  • A state table tracks failed tenants.
  • A daily batch job scans this table and then performs the following tasks:

    • Identifies tenants that are eligible for retry (up to 5 times per day).
    • Adds them back to the queue table for the 10-minute batch to retry.
    • Tracks how many days each tenant has been retried (max of 5 days).
    • Marks the tenant as permanently failed after 5 days or once the daily retry quota is exhausted, triggering alerts or manual intervention.

The daily retry window is 5 attempts per day for 5 days for a maximum of 25 retries:

  • 5 retry attempts per tenant per day.
  • Retries are evenly spaced across the day (for example, 10 tenants per 10-minute batch run).
  • Retry attempts are managed and logged via the 24-hour job.
  • If processing fails for 5 consecutive days, the tenant is flagged for escalation.

This final mechanism ensures long-term resilience while also maintaining guardrails to prevent infinite retries.
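A simplified sketch of the daily job follows, building on the queue-table sketch earlier. The state-table name and its columns (tenant_id, status, days_retried) are assumed for illustration; resetting the attempts counter on re-queue is how this sketch models the 5-attempts-per-day quota:

    MAX_RETRY_DAYS = 5  # 5 attempts/day x 5 days = 25 retries at most

    def run_daily_recovery(conn):
        """Runs once every 24 hours: re-queues eligible tenants and enforces guardrails."""
        rows = conn.execute(
            "SELECT tenant_id, days_retried FROM tenant_retry_state WHERE status = 'FAILED'"
        ).fetchall()
        for tenant_id, days_retried in rows:
            if days_retried >= MAX_RETRY_DAYS:
                # Retry budget exhausted: mark permanently failed and raise an alert.
                conn.execute(
                    "UPDATE tenant_retry_state SET status = 'PERMANENTLY_FAILED' "
                    "WHERE tenant_id = ?",
                    (tenant_id,),
                )
                continue
            # Hand the tenant back to the 10-minute queue with a fresh daily quota
            # (the 10-minute job stops once attempts reaches 5 for the day).
            conn.execute(
                "INSERT OR REPLACE INTO tenant_retry_queue (tenant_id, status, attempts) "
                "VALUES (?, 'PENDING', 0)",
                (tenant_id,),
            )
            conn.execute(
                "UPDATE tenant_retry_state SET days_retried = days_retried + 1 "
                "WHERE tenant_id = ?",
                (tenant_id,),
            )
        conn.commit()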

Why not use Kafka or event-driven architecture?

Apache Kafka is a distributed event-streaming platform that enables services to communicate through real-time data streams rather than direct synchronous calls. This approach forms the foundation of an event-driven architecture (EDA), where systems react to events as they occur, improving scalability and decoupling services. Learn more about fault tolerance issues in microservices in this article, “Architectural considerations for event-driven microservices-based systems.”

While Kafka is powerful for decoupling systems and handling high-throughput event flows, it wasn’t the right fit for our retry mechanism. Here's why:

  • Per-tenant observability. Each tenant's retry needs to be tracked individually. Kafka doesn’t offer native response or outcome tracking for such long-running, stateful workflows.
  • Processing time. Our processing takes up to 1 minute per tenant. Kafka consumers are not ideal for holding messages that long.
  • Retry visibility. Kafka-based retries often rely on requeuing or topic-based redirection, which lacks transparency about retry success unless explicitly engineered.
  • Overhead and complexity. Running Kafka clusters (especially for a modest retry volume) adds unnecessary operational burden.
  • DLQ’s passive nature. A DLQ alone can’t perform retries. It stores failures and requires additional logic for reprocessing.

While event-driven architecture is a natural choice for real-time systems, it can complicate retry logic. Here are some practical limitations that informed our choice:

  • No native state tracking. Event systems "fire and forget." There's no inherent visibility into whether a specific tenant succeeded or failed unless you build extensive side-tracking systems.
  • Long-running workflow support. Our tenant operations might take up to a minute — problematic for streaming systems like Kafka that expect short-lived, stateless consumers.
  • Complex recovery mechanisms. Retrying from Kafka requires chaining topics, maintaining idempotency, and building DLQ reprocessing logic.
  • Observability challenges. If a retry fails in Kafka, you’d need to trace logs, metrics, and correlation IDs across systems. In our batch model, we directly observe tenant status.
  • Tight infrastructure dependency. Not every environment is equipped for Kafka. We prioritized portability and simplicity.

Why this approach works for us

We realized the following benefits of our robust retry mechanism:

  • Simplicity with control. The retry process is easy to tune, allowing adjustments to retry frequency, old data cleanup, and tenant-specific rules.
  • Per-tenant observability. The system provides clear visibility into each tenant’s retry status through logs and database snapshots.
  • Operational efficiency. This approach removes the need to scale Kafka or manage topic partitions.
  • Resilience through layers. The three layers (local backoff, 10-minute retries, 24-hour catch-up) ensure fault recovery with different scopes and time windows.
  • Easy maintenance. All retry logic resides within jobs and tables, eliminating the need for distributed debugging or reprocessing infrastructure.

Summary

Retry strategies need to fit the problem, not the hype. While event-driven systems are useful in many contexts, batch jobs with stateful retry logic offer more determinism, control, and observability, especially when working with long-running operations and critical tenant workflows.

Since implementing this hybrid retry mechanism, we’ve successfully recovered 500+ tenant failures without a single case of a tenant getting stuck indefinitely in the retry loop. The combination of exponential retries, queue-driven recovery, and batch-based fallbacks has provided both resilience and observability across the stack. We’ve built a robust, maintainable, and transparent retry mechanism that prioritizes correctness and clarity over flash.