May 1, 2026 · 14 min read
Kafka vs RabbitMQ vs SQS — Stop Picking the Wrong Queue
Most outages in queue-based systems aren't tool problems — they're concept problems. Pub/sub vs point-to-point, delivery guarantees, dead letter queues, and back pressure decide whether your system survives a bad Monday. This deep dive covers all four, then walks through when Kafka, RabbitMQ, or SQS is the right call — with real throughput numbers and the 'exactly-once' myth busted.

TL;DR
- Four concepts decide whether your queue setup survives Monday morning: pub/sub vs point-to-point, delivery guarantees, dead letter queues with a replay path, and back pressure.
- Exactly-once delivery across a network is impossible — every "exactly-once" feature is really at-least-once delivery plus idempotent processing. Build idempotent consumers regardless of vendor.
- Kafka, RabbitMQ, and SQS are different categories, not different flavors. Kafka is a log. RabbitMQ is a broker. SQS is a managed queue.
- Pick the simplest tool that meets your real requirements, not the requirements you imagine you'll have in three years. Running a Kafka cluster for 40 messages an hour is a resume decision, not an architecture decision.
Why message queues exist in the first place
Picture a checkout service that calls inventory directly over HTTP. Inventory slows down — doesn't crash, just slows. Checkout threads pile up waiting on inventory responses. At 300 or 400 stuck threads, checkout itself runs out of capacity. Anything depending on checkout — recommendations, the cart UI, shipping — is now also down.
One slow dependency, three broken services. That's a cascading failure, and synchronous coupling is what lets it spread.
A queue between checkout and inventory decouples the producer from the consumer in three dimensions: time (they don't run at the same moment), availability (one can be down without taking the other down), and speed (the fast one isn't held hostage by the slow one). Inventory slows, messages accumulate in the queue, checkout doesn't care.
That's the whole point. Now let's break down the four concepts that decide whether your queue actually does its job.
Concept 1 — Pub/Sub vs Point-to-Point
A team I know shipped a feature on a Friday. By Monday morning every new signup was getting fifty welcome emails. Root cause: they put a work job onto a topic that fanned out to every consumer. Fifty workers, fifty copies of the job, fifty emails per user.
One question asked in code review would have saved the weekend: should this message be handled once, or should multiple services react to it?
- Handled once means work. Resize this image. Charge this card. Send this one welcome email. Any worker can grab it, but only one does. That's a point-to-point queue, also called a work queue.
- Reacted to by many means event. An order was placed. Inventory cares. Email cares. Analytics cares. Three independent reactions to the same fact, each one its own copy. That's pub/sub.
Sending a welcome email is work. The team put it on a topic. The topic did its job — it fanned the message out to every subscriber. The tool wasn't broken; the semantics were wrong.
The vocabulary makes this worse. Different tools use "queue" and "topic" to mean different things. Don't argue about the word. Ask what physically happens to a message when it arrives — does one consumer take it and it's gone, or does every subscriber get a copy?
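To make the distinction concrete, here's a minimal sketch using pika, the Python RabbitMQ client, with illustrative queue and exchange names. The same shape applies to any broker: publish to a shared queue when it's work, publish to a fanout when it's an event.

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# Point-to-point (work queue): one shared queue, each message handled once.
ch.queue_declare(queue="welcome_emails")
ch.basic_publish(exchange="", routing_key="welcome_emails",
                 body=b'{"user_id": 42}')

# Pub/sub: a fanout exchange copies the message into every bound queue.
ch.exchange_declare(exchange="order_events", exchange_type="fanout")
for subscriber in ("inventory", "email", "analytics"):
    ch.queue_declare(queue=f"order_events.{subscriber}")
    ch.queue_bind(queue=f"order_events.{subscriber}", exchange="order_events")
ch.basic_publish(exchange="order_events", routing_key="",
                 body=b'{"order_id": 7}')

conn.close()
```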
Concept 2 — Delivery Guarantees and the Exactly-Once Myth
Three guarantees show up in vendor docs:
| Guarantee | What actually happens | When to use |
|---|---|---|
| At most once | Send it, don't check. May be lost. Never duplicated. | Metrics, logs, anywhere a dropped data point doesn't matter |
| At least once | Producer retries until ack received. Never lost, may duplicate. | Default for almost everything |
| Exactly once | Marketing term. See below. | — |
Exactly-once delivery across a network is not achievable. Not in Kafka, not in RabbitMQ, not in anything. This isn't a missing vendor feature; it's a property of distributed systems, provably impossible over unreliable networks, as formalized by the Two Generals Problem and the FLP impossibility result.
The Two Generals Problem is the elegant proof. A producer sends a message. The broker writes it and sends back an acknowledgement. The acknowledgement vanishes on the wire. Now the producer has to choose: retry (maybe duplicate) or give up (maybe lose). The network never tells it which situation it was in. Add another acknowledgement-of-the-acknowledgement and you've just shifted the uncertainty by one hop — the regress is infinite.
So when a system advertises "exactly-once semantics," what's under the hood is at-least-once delivery plus idempotent processing or transactional writes. The message might physically arrive twice. The effect of processing it happens once because the consumer checks whether it has seen that event ID before and skips if it has.
Worth knowing about Kafka specifically: its exactly-once semantics cover what happens inside the Kafka cluster. The moment your consumer writes to an external database or calls an external API, the guarantee stops at the cluster boundary and you're back to making that side idempotent yourself.
💡 Rule of thumb: Assume at-least-once delivery. Make every consumer idempotent. Every handler checks if it has seen that event ID before and skips if it has.
This single pattern prevents an entire category of nightmare bugs — double charges, duplicate emails, inventory counts drifting in ways nobody can reproduce. If you do one thing after reading this, go check your message handlers. If they aren't idempotent, make them idempotent. You'll thank yourself the first time a network blip causes a retry storm.
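A minimal sketch of that check, assuming a SQLite table as the dedup store and a JSON payload carrying an event_id; any database with a unique constraint works the same way, and send_welcome_email stands in for whatever side effect your handler performs:

```python
import json
import sqlite3

db = sqlite3.connect("consumer.db")
db.execute("CREATE TABLE IF NOT EXISTS processed (event_id TEXT PRIMARY KEY)")

def handle(message: bytes) -> None:
    event = json.loads(message)
    try:
        # The unique constraint makes "have I seen this event?" atomic.
        db.execute("INSERT INTO processed (event_id) VALUES (?)",
                   (event["event_id"],))
    except sqlite3.IntegrityError:
        return  # duplicate delivery: the effect already happened, skip it
    send_welcome_email(event["user_id"])  # hypothetical side effect
    db.commit()  # persist the dedup marker only after the effect succeeds
```

In production you'd want the side effect and the dedup marker committed in the same transaction where possible; the sketch shows the shape, not the edge cases.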
Concept 3 — Dead Letter Queues (and the Replay Path)
One malformed message arrives. The consumer throws. The message goes back to the queue, gets retried, throws again, retried again. Meanwhile every message behind it waits. In an ordered partition, the entire partition is now frozen. The industry term is a poison message, and one of them can take down a pipeline handling millions of good messages a minute.
The fix is a dead letter queue (DLQ). After a configured number of failed attempts, the bad message is moved to a separate holding area. The main pipeline resumes. The poison message is preserved for inspection, not retried into oblivion, not silently dropped.
Most teams get this far, tick the DLQ box on the architecture review, and go home. A DLQ without a replay path is a graveyard. Messages land in it. You fix the bug. Then what?
If there's no tooling to push those messages back into the main queue, you have three choices:
- Write a one-off script under time pressure
- Replay them by hand
- Admit they're lost
Most teams quietly pick option three. The replay path is the whole point of having a DLQ. Everything else is just setup. Cap your retries, alert on DLQ depth, attach failure metadata so future-you can debug — but build the replay path. Without it, your DLQ is just where problems go to be forgotten.
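What a replay path can look like, sketched with boto3 against SQS; the queue URLs are placeholders, and if you're on SQS, AWS's built-in DLQ redrive can do the same move:

```python
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq"
MAIN_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"

def replay_batch(max_messages: int = 10) -> int:
    resp = sqs.receive_message(QueueUrl=DLQ_URL,
                               MaxNumberOfMessages=max_messages,
                               WaitTimeSeconds=2)
    moved = 0
    for msg in resp.get("Messages", []):
        # Re-enqueue first, then delete from the DLQ. Worst case is a
        # duplicate delivery, which an idempotent consumer absorbs.
        sqs.send_message(QueueUrl=MAIN_URL, MessageBody=msg["Body"])
        sqs.delete_message(QueueUrl=DLQ_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
        moved += 1
    return moved
```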
Concept 4 — Back Pressure
It's 3:47 AM. The pager fires. The broker is out of memory. Not because of a bug — because the producer was writing at 10,000 messages per second and the consumer was draining at 2,000 messages per second, and had been for hours. Millions of messages piled up. The only question was whether memory, disk, or patience ran out first.
There's a quieter version that's even worse. Nothing crashes. The queue just grows. Messages get processed two hours late. For a fraud check on a credit card transaction, an answer that arrives two hours late is the same as no answer at all.
Back pressure is the umbrella term for how a slow consumer pushes back on a fast producer. Three techniques cover most cases:
| Technique | How it works | When to reach for it |
|---|---|---|
| Bounded queues | Cap the queue size. When full, producer blocks or fails fast. | Reach for this first — it fails loudly, alerts fire, you find out while there's still time |
| Autoscale consumers | Queue depth crosses a threshold → add workers | Stateless consumers, spiky workloads |
| Credit-based flow control | Consumer tells producer "I'm ready for N more." Producer sends N then stops. | Fine-grained streaming pipelines (Reactive Streams: Project Reactor, RxJava) |
The takeaway from that 3 AM page: every queue has a limit. Either you pick it and plan what happens when you hit it, or the OS picks it for you by killing the process. The second version always costs more.
⚠️ Warning: "Unlimited" queues don't exist. They just have a limit configured by someone other than you, usually the kernel.
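The first technique from the table, sketched with Python's standard-library queue; the maxsize is illustrative, and the two producer functions show the choice you get to make:

```python
import queue

jobs: queue.Queue = queue.Queue(maxsize=10_000)  # a limit we picked, not the kernel

def produce_blocking(job) -> None:
    jobs.put(job)  # blocks when full: back pressure propagates to the caller

def produce_fail_fast(job) -> bool:
    try:
        jobs.put_nowait(job)  # raises queue.Full immediately at capacity
        return True
    except queue.Full:
        return False  # fail loudly: shed load and alert instead of OOMing later
```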
Kafka vs RabbitMQ vs SQS — How They Actually Differ
These three aren't different flavors of the same tool. They're three different categories.
| Aspect | Kafka | RabbitMQ | SQS |
|---|---|---|---|
| What it is | Append-only distributed log | Smart message broker | Fully managed queue (AWS) |
| Message lifetime | Days, weeks, forever (configurable) | Until acked, then gone | Until consumed + deleted, max 14 days |
| Replay history | ✅ Native — any consumer rewinds anytime | ❌ Once acked, gone | ❌ Once deleted, gone |
| Routing | By partition key | ✅ Rich — exchanges, bindings, headers, patterns | Basic — pair with SNS for fanout |
| Throughput | Millions msg/sec at high end | ~50K msg/sec per node typical | Standard: nearly unlimited; FIFO: 300–70K msg/sec |
| Ordering | Per partition | Per queue | FIFO queues only |
| Ops burden | High — cluster, brokers, KRaft, monitoring | Medium — cluster, plugins | ✅ Zero — three API calls |
| Sweet spot | Event sourcing, stream processing, data pipelines | Complex routing, per-message control | Workloads on AWS that just need a queue |
Let's unpack each one.
Kafka — when the log is the value
Kafka writes messages to an append-only log and keeps them. Seven days, thirty days, forever — whatever you configure. Consumers track their own position, which means any consumer at any time can rewind.
Ship a new fraud detection service on Tuesday. On Wednesday, point it at the log, reset the offset to 30 days ago, and let it catch up on a month of history by lunchtime. You don't have to re-emit anything. The log is the history.
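That rewind, sketched with the kafka-python client; the topic, group ID, and handler are illustrative:

```python
import time
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="fraud-detector-v2",
                         enable_auto_commit=False)

tp = TopicPartition("orders", 0)
consumer.assign([tp])

# Find the earliest offset at or after a timestamp 30 days back.
thirty_days_ago_ms = int((time.time() - 30 * 24 * 3600) * 1000)
offsets = consumer.offsets_for_times({tp: thirty_days_ago_ms})
if offsets[tp] is not None:
    consumer.seek(tp, offsets[tp].offset)  # replay a month of history

for record in consumer:
    process(record.value)  # hypothetical handler
```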
That's why Kafka shows up in event sourcing, stream processing, and data pipelines between teams. The log becomes a source of truth and new consumers can join any time and see everything that happened.
Kafka 4.0 update worth knowing: share groups (KIP-932) shipped as early access in 4.0 and went GA in 4.2. They let multiple consumers cooperatively process records from the same partition with per-record acknowledgement, giving Kafka native queue-style consumption on top of the log. The production-ready version added the RENEW acknowledgement type for extended processing times, adaptive batching for share coordinators, soft and strict enforcement of how many records a fetch returns, and comprehensive lag metrics. The old "Kafka can't do queues" line is out of date.
The catch: Kafka has real operational weight. Running a full cluster for 3,500 messages a day isn't an architecture decision, it's a resume decision. And you pay for it every time someone has to learn it, tune it, or debug it at 4 AM.
RabbitMQ — when routing is the interesting part
RabbitMQ's superpower is in the arrows. A message doesn't go to a queue directly — it goes to an exchange, and the exchange decides which queues it belongs in based on rules you configure: exact match, pattern, headers, broadcast, all declared in configuration.
No consumer filters anything. The broker does the routing for you, and once a consumer acknowledges a message from its queue, it's gone.
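A sketch of that routing with pika, using a topic exchange and illustrative names. One publish, two matching bindings, zero consumer-side filtering:

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# A topic exchange routes on the message's routing key, not on consumer code.
ch.exchange_declare(exchange="orders", exchange_type="topic")

ch.queue_declare(queue="orders.eu")
ch.queue_bind(queue="orders.eu", exchange="orders", routing_key="order.eu.*")

ch.queue_declare(queue="orders.priority")
ch.queue_bind(queue="orders.priority", exchange="orders",
              routing_key="order.*.high")

# Lands in orders.eu AND orders.priority: the broker decides, not the consumers.
ch.basic_publish(exchange="orders", routing_key="order.eu.high", body=b"{}")
conn.close()
```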
Reach for RabbitMQ when:
- Background jobs hit different worker pools
- Message shape decides destination (orders by region, by tier, by priority)
- Per-message delivery control matters more than raw throughput
SQS — when zero ops is the value
SQS is the entire product in three API calls — SendMessage, ReceiveMessage, DeleteMessage. Nothing to install. Nothing to tune. Nothing to patch. It runs inside AWS and you stop thinking about it.
Two flavors:
- Standard — at-least-once delivery, best-effort ordering, effectively unlimited throughput
- FIFO — strict ordering, what AWS calls "exactly-once processing"
The throughput numbers matter for capacity planning. A default FIFO queue supports 300 operations per second (send, receive, or delete), or up to 3,000 messages per second with batching. Enable high-throughput mode in the SQS console and the ceiling rises to 70,000 messages per second without batching, higher still with it.
Same trap as before with the "exactly-once" label. SQS FIFO does deduplication on the send side using a 5-minute window — any message with the same deduplication ID within five minutes is silently dropped. On the consume side, if your consumer doesn't delete the message before its visibility timeout expires, SQS hands it back to another consumer. The network limits we covered earlier didn't go away. Your consumer still needs to be idempotent.
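Both halves, sketched with boto3; the queue URL, IDs, and process handler are placeholders:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/payments.fifo"

# Send side: same MessageDeduplicationId within 5 minutes => silently dropped.
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody='{"charge_id": "ch_123"}',
    MessageGroupId="customer-42",            # FIFO ordering unit
    MessageDeduplicationId="charge-ch_123",  # the 5-minute dedup key
)

# Consume side: delete before the visibility timeout expires, or the message
# is handed to another consumer, which is why the handler must be idempotent.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                           WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    process(msg["Body"])  # hypothetical idempotent handler
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```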
One thing SQS doesn't do natively: fanout. For pub/sub on AWS, pair it with SNS, or reach for EventBridge or Kinesis depending on the use case.
When to use what — the actual decision
The decision is simple 90% of the time:
- Want replay history → Kafka
- Want rich routing → RabbitMQ
- On AWS, want zero ops → SQS
But teams sometimes pick the biggest tool because it's on the architecture diagram of their favorite tech company. I've seen systems running full Kafka clusters for forty messages an hour. Forty an hour. Every on-call rotation became a Kafka tutorial. Every new engineer spent their first week learning a distributed log they didn't need. Every incident took longer because the tool was more complicated than the problem it was solving.
Important: Pick the simplest tool that meets your real requirements, not the requirements you imagine you'll have in three years. SQS or a managed RabbitMQ will get most systems where they need to go. You can migrate to Kafka the day you have a real reason — and the day you have a real reason, you'll know.
Production checklist
Before you ship a queue-based system, walk through this list:
- Idempotent consumers — every handler checks if it has already processed this event ID and skips if it has. Use the message ID, a deduplication key, or a database unique constraint.
- DLQ with a replay path — not just a DLQ. Build the tooling to push fixed messages back into the main queue, and test it before you need it.
- Bounded queue size — pick the limit yourself. Decide whether the producer blocks or fails fast when full. Don't let the OS pick by killing the process.
- Alerting on queue depth and DLQ depth — both should fire long before the broker runs out of memory.
- Retry policy with exponential backoff — cap retries at N (typically 3–5); anything beyond that goes to the DLQ. A sketch follows this list.
- Visibility timeout sized to your processing time (SQS) — too short causes redelivery storms, too long delays poison-message detection.
- Pub/sub vs point-to-point on every new message type — ask explicitly: is this work, or is this an event? Document the answer.
- Failure-mode runbook — when the broker is down, can you still accept new requests and queue locally? When the consumer is down, what alerts fire?
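A sketch of the retry item above; process and publish_to_dlq are hypothetical stand-ins for your handler and your broker's DLQ publish:

```python
import random
import time

MAX_ATTEMPTS = 5

def handle_with_retries(message) -> None:
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process(message)  # hypothetical idempotent handler
            return
        except Exception:
            if attempt == MAX_ATTEMPTS:
                publish_to_dlq(message)  # preserve it with failure metadata
                return
            # 1s, 2s, 4s, 8s... plus jitter so retries don't synchronize
            time.sleep(2 ** (attempt - 1) + random.random())
```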
Three checks before you close this tab
Open your codebase right now and answer these three questions. They take ten minutes combined and will surface real bugs:
- Grep your consumers for idempotency checks. If they just process every message they see, you have a bug waiting for the next retry storm.
- Find your DLQ and try to replay a message from it. If there's no clear way to do that, you have a graveyard, not a recovery tool.
- Check whether your queues have a max size configured. If they don't, the OS has configured one for you — and you'll find out what it is the hard way.
Conclusion
I treat message queues less as infrastructure and more as a contract between services. The four concepts — pub/sub vs point-to-point, delivery guarantees, dead letter queues, and back pressure — are the contract clauses that decide whether your system degrades gracefully or fails loudly at 3 AM. Pick the wrong delivery semantic and you get duplicate charges. Skip the DLQ replay path and you lose data. Forget back pressure and the broker takes itself out.
The tool choice matters far less than people think. Kafka, RabbitMQ, and SQS are all excellent for the problems they were built for, and all painful for the ones they weren't. Start with the simplest option that fits your real requirements today. SQS for most AWS workloads. RabbitMQ when routing is the hard part. Kafka when the log itself is the value. Then make every consumer idempotent, build the DLQ replay path, bound your queues, and you'll sleep through more nights than the team that picked Kafka because they read a Confluent blog post.
FAQ
What's the real difference between Kafka, RabbitMQ, and SQS?
They're three different categories, not three flavors of the same thing. Kafka is a distributed log — messages persist for days or forever and any consumer can rewind. RabbitMQ is a smart broker — messages route through exchanges based on rules and disappear once acknowledged. SQS is a fully managed queue inside AWS — three API calls, zero ops, and you don't own a single VM.
Is exactly-once message delivery actually possible?
Not across a network. The Two Generals Problem proves that no finite protocol can guarantee both sides agree a message was delivered. What vendors call 'exactly-once' is at-least-once delivery plus idempotent processing or transactions — duplicates can still arrive on the wire, but the effect happens once because the consumer detects and skips them.
When should I use a queue instead of direct service-to-service calls?
Whenever you don't want one slow service to take down the others that depend on it. Direct calls couple producer and consumer in time, availability, and speed. A queue decouples all three — the producer writes and moves on, the consumer drains at its own pace, and a slowdown becomes a queue depth metric instead of a cascading failure.
What is a poison message and how do I handle it?
A poison message is a malformed or unprocessable message that causes the consumer to throw on every retry. Without intervention it freezes the partition behind it. The fix is a dead letter queue (DLQ): after N failed attempts the message is moved aside so the main pipeline keeps flowing. But a DLQ without a replay path is a graveyard — always build the tooling to push fixed messages back into the main queue.
Can Kafka be used as a queue now?
Yes — Kafka 4.0 introduced share groups (KIP-932) as an early-access feature, and they became production-ready in Kafka 4.2. Share groups allow multiple consumers to cooperatively process messages from the same partition with per-record acknowledgement, giving you queue-like consumption on top of Kafka's log. The 'Kafka can't do queues' line is officially out of date.
What is back pressure and why does it matter?
Back pressure is how a slow consumer pushes back on a fast producer. Without it, the queue grows until the broker runs out of memory or the OS kills the process. Three common techniques: bounded queues that block or fail-fast when full, autoscaling consumers based on queue depth, and credit-based flow control where the consumer tells the producer how many messages it's ready for. Reactive Streams libraries like Project Reactor and RxJava implement the third pattern.
What is the 5-minute deduplication window in SQS FIFO?
SQS FIFO advertises 'exactly-once processing' via a 5-minute deduplication window: any message sent with the same deduplication ID within five minutes of the original is silently dropped. It's deduplication on the send side, not a network-level guarantee. Your consumer still needs to be idempotent because visibility timeouts can cause a message to be redelivered to a different consumer if the first one doesn't delete it in time.