System Design, AWS, Distributed Systems, Architecture

Message Queues Explained — SQS vs Kafka vs SNS

Sumit Yadav
April 04, 2026 · 7 min read

Why Message Queues Exist

Imagine your e-commerce platform processes an order. Synchronously, it needs to:

  1. Validate the order
  2. Charge the payment
  3. Update inventory
  4. Send a confirmation email
  5. Notify the warehouse
  6. Update analytics

Do all this in a single request and the user waits 5+ seconds. One slow service — say, the email provider — blocks everything. One service crashes and the entire order fails.

Message queues solve this by decoupling producers from consumers. The order service publishes one event and immediately returns a response. Everything else happens asynchronously, independently, and reliably.

Now the user gets an instant response. Each downstream service processes the event at its own pace. If the email service is slow, it doesn't affect payment processing.
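
To make the decoupling concrete, here is a minimal sketch that uses an in-memory deque as a stand-in for a real broker. The names `place_order`, `run_worker`, and the event shape are made up for illustration; the point is only that the request path publishes one event and returns, while the downstream work runs later.

```python
from collections import deque

# Hypothetical in-memory stand-in for a real broker (SQS, SNS, Kafka).
order_events = deque()

def place_order(order_id: str) -> dict:
    """Validate the order, publish one event, and return right away."""
    if not order_id:
        raise ValueError("order_id is required")
    order_events.append({"type": "OrderPlaced", "order_id": order_id})
    return {"status": "accepted", "order_id": order_id}

def run_worker(handlers: dict) -> list:
    """Drain the queue later, independently of the request path."""
    processed = []
    while order_events:
        event = order_events.popleft()
        for name, handler in handlers.items():
            processed.append((name, handler(event)))
    return processed

# The user gets a response before any email or warehouse work happens.
response = place_order("order-42")

results = run_worker({
    "email": lambda e: f"emailed confirmation for {e['order_id']}",
    "warehouse": lambda e: f"notified warehouse about {e['order_id']}",
})
```

In a real system the worker would be a separate process polling the broker, so a slow email provider only delays the email step.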


The Three Players on AWS

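Three messaging services come up again and again in AWS architectures, and they are often confused. SQS and SNS are native AWS services; Kafka is an open-source Apache project that you can run on AWS yourself or through Amazon MSK: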

Service   Type           Best For
SQS       Queue          Async job processing, task queues
SNS       Pub/Sub        Fan-out, notifications, broadcasting
Kafka     Event Stream   High throughput, event sourcing, replay

They solve different problems. Let's go deep on each.


SQS — Simple Queue Service

SQS is a managed message queue. Producers push messages, consumers pull and process them.

Key Concepts

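Visibility Timeout: When a consumer receives a message, the message becomes invisible to other consumers for a set period (30 seconds by default). If the consumer does not delete the message before the timeout expires, because processing failed or took too long, the message reappears for another consumer to try.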

Message picked up → invisible for 30 seconds
Processing succeeds → delete message
Processing fails → message reappears after 30s
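
The flow above can be sketched with a toy queue. This is not the SQS API: `VisibilityQueue` is a hypothetical class, and time is passed in as a plain number so the behaviour is easy to follow.

```python
import itertools

class VisibilityQueue:
    """Toy queue with an SQS-style visibility timeout (not the real SQS API)."""

    def __init__(self, visibility_timeout: float = 30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}              # message id -> [body, visible_at]
        self._ids = itertools.count(1)

    def send(self, body: str) -> int:
        msg_id = next(self._ids)
        self._messages[msg_id] = [body, 0.0]   # visible immediately
        return msg_id

    def receive(self, now: float):
        """Return (id, body) of the first visible message, or None."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout   # hide it
                return msg_id, entry[0]
        return None

    def delete(self, msg_id: int) -> None:
        self._messages.pop(msg_id, None)

q = VisibilityQueue(visibility_timeout=30.0)
q.send("resize image 7")

first = q.receive(now=0.0)     # consumer A picks it up
hidden = q.receive(now=10.0)   # still invisible to consumer B
retry = q.receive(now=30.0)    # A never deleted it, so it reappears
q.delete(retry[0])             # B succeeds and deletes it
gone = q.receive(now=100.0)    # nothing left to receive
```

The practical consequence is that the timeout should be longer than your worst-case processing time, otherwise a slow but healthy consumer will see its message handed to someone else.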

Dead Letter Queue (DLQ): After a message has been received N times without being deleted (N is the queue's maxReceiveCount), SQS moves it to a DLQ for inspection. This keeps poison pill messages from being retried forever and clogging your queue.
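
Here is a rough simulation of that redrive behaviour. The function `drain_with_dlq` and the sample messages are hypothetical; in SQS itself the threshold is the `maxReceiveCount` field of the source queue's redrive policy, and the move happens on the service side.

```python
from collections import deque

def drain_with_dlq(messages, handler, max_receive_count=3):
    """Process messages; move any that fail max_receive_count times to a DLQ."""
    queue = deque((body, 0) for body in messages)   # (body, receive count)
    done, dead_letters = [], []
    while queue:
        body, receives = queue.popleft()
        receives += 1
        try:
            handler(body)
            done.append(body)
        except Exception:
            if receives >= max_receive_count:
                dead_letters.append(body)        # park it for inspection
            else:
                queue.append((body, receives))   # becomes visible again
    return done, dead_letters

def handler(body):
    # Illustrative poison pill: one payload that can never be parsed.
    if body == "corrupt-payload":
        raise ValueError("cannot parse message")

done, dead_letters = drain_with_dlq(
    ["order-1", "corrupt-payload", "order-2"], handler
)
```

The healthy messages finish normally, and the bad one ends up in `dead_letters` after three attempts instead of cycling indefinitely.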

Standard vs FIFO

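  • Standard: at-least-once delivery, best-effort ordering, nearly unlimited throughput
  • FIFO: exactly-once processing (duplicates are dropped within a 5-minute deduplication window), strict ordering within each message group, and a limit of 300 API calls/sec per action by default, which gives 3,000 msg/sec with batches of 10 (high-throughput mode raises this)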
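
The "exactly-once" part of FIFO comes from deduplication over a time window. A minimal sketch of that idea, assuming a hypothetical `FifoDeduplicator` class and caller-supplied timestamps (the real service does this for you, keyed on the message deduplication ID):

```python
class FifoDeduplicator:
    """Drops messages whose deduplication id was seen within the window."""

    def __init__(self, window_seconds: float = 300.0):   # 5 minutes
        self.window_seconds = window_seconds
        self._seen = {}    # dedup id -> time it was last accepted

    def accept(self, dedup_id: str, now: float) -> bool:
        first_seen = self._seen.get(dedup_id)
        if first_seen is not None and now - first_seen < self.window_seconds:
            return False                 # duplicate inside the window
        self._seen[dedup_id] = now
        return True

dedup = FifoDeduplicator()
accepted_first = dedup.accept("charge-42", now=0.0)     # new message
accepted_retry = dedup.accept("charge-42", now=60.0)    # producer retry, dropped
accepted_later = dedup.accept("charge-42", now=400.0)   # window has passed
```

Note the last call: a resend after the window is treated as a new message, so consumers that must never double-charge should still be idempotent.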

When to Use SQS

  • Background job processing (image resizing, PDF generation)
  • Decoupling microservices
  • Rate limiting downstream services
  • Retry logic with exponential backoff
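
For the retry bullet, this is one common way to compute exponential backoff delays with full jitter. The function name and the `base` and `cap` defaults are my own choices for illustration; with SQS, one option is to apply such a delay by extending a failed message's visibility timeout.

```python
import random

def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Full-jitter backoff: delay n is uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

# Seeded generator so the run is repeatable.
delays = backoff_delays(8, rng=random.Random(0))

# Upper bound for each attempt: 1, 2, 4, ... capped at 60 seconds.
ceilings = [min(60.0, 1.0 * 2 ** n) for n in range(8)]
```

The jitter matters when many consumers fail at once: without it they all retry at the same instants and hit the downstream service in waves.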

When NOT to Use SQS

  • You need multiple consumers to receive the same message → use SNS
  • You need to replay historical events → use Kafka
  • You need real-time streaming → use Kafka

SNS — Simple Notification Service

SNS is a pub/sub service. Publishers send to a topic, all subscribers receive it simultaneously — fan-out.

SNS + SQS = The Power Combo

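SNS alone doesn't store messages for later pickup. It retries failed deliveries for a while, but if a subscriber stays down past those retries, the message is lost (unless the subscription has its own dead-letter queue). The usual pattern is SNS fan-out to SQS queues: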

SNS Topic → SQS Queue per subscriber

Each SQS queue buffers messages for its subscriber. If the email service goes down, messages accumulate in its SQS queue and are processed when it recovers. Nothing is lost.
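
A toy version of the pattern, with a hypothetical `Topic` class standing in for SNS and one deque per subscriber standing in for each SQS queue:

```python
from collections import deque

class Topic:
    """Toy pub/sub topic that copies each message into every subscribed queue."""

    def __init__(self):
        self.queues = {}     # subscriber name -> its own buffer

    def subscribe(self, name: str) -> deque:
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, message: dict) -> None:
        for queue in self.queues.values():
            queue.append(dict(message))   # each subscriber gets its own copy

orders = Topic()
email_q = orders.subscribe("email")
inventory_q = orders.subscribe("inventory")

orders.publish({"order_id": "A1"})
orders.publish({"order_id": "A2"})

# Inventory keeps up; email is "down" and its queue simply grows.
inventory_done = [inventory_q.popleft()["order_id"] for _ in range(len(inventory_q))]
email_backlog = len(email_q)

# Email recovers later and drains its backlog in order.
email_done = [email_q.popleft()["order_id"] for _ in range(len(email_q))]
```

Because each subscriber has its own buffer, one slow service never holds back the others, and it catches up from its own backlog when it comes back.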

When to Use SNS

  • Sending the same event to multiple services
  • Push notifications (mobile, email, SMS)
  • Broadcasting events across microservices
  • Triggering Lambda functions from events

Kafka — The Event Streaming Platform

Kafka is fundamentally different from SQS and SNS. It's not just a queue — it's a distributed event log.

Key Concepts

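Topics and Partitions: Messages are stored in topics, and each topic is split into partitions. Partitions enable parallel processing: within a consumer group, each partition is assigned to exactly one consumer, so consumers work through different partitions at the same time. Ordering is guaranteed only within a partition.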
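
A small sketch of the two mappings involved: key to partition, and partition to consumer. Both functions are illustrative. CRC32 is only a stand-in for a stable hash (Kafka's own clients use a different one), and real group assignment is negotiated by the brokers and clients rather than computed like this.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign(partitions: int, consumers: list) -> dict:
    """Give each partition to exactly one consumer in the group, round-robin."""
    owners = {c: [] for c in consumers}
    for p in range(partitions):
        owners[consumers[p % len(consumers)]].append(p)
    return owners

# The same key always lands on the same partition, which preserves per-key order.
same = partition_for("customer-17", 6) == partition_for("customer-17", 6)

# Six partitions shared by four consumers in one group.
owners = assign(6, ["worker-a", "worker-b", "worker-c", "worker-d"])
```

One design consequence: the partition count caps the useful parallelism of a group, since a consumer beyond that count would have nothing assigned to it.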

Consumer Groups: Multiple applications can consume the same topic independently by each using its own consumer group. An analytics service and a notification service can both consume the orders topic, each at its own pace and each maintaining its own offsets.
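
To show what "each maintaining its own offsets" means, here is a toy single-partition log with one read position per group. `TopicLog` and the group names are hypothetical, not a Kafka client API.

```python
class TopicLog:
    """Toy append-only log with one read position per consumer group."""

    def __init__(self):
        self.records = []
        self.offsets = {}    # group name -> next offset to read

    def append(self, record: str) -> int:
        self.records.append(record)
        return len(self.records) - 1

    def poll(self, group: str, max_records: int = 10) -> list:
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

orders = TopicLog()
for i in range(5):
    orders.append(f"order-{i}")

analytics_batch = orders.poll("analytics")                    # reads all five
notify_batch = orders.poll("notifications", max_records=2)    # reads two, at its own pace
```

Reading never removes anything from the log, which is the key difference from a queue: one group's progress has no effect on what another group can still read.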

Message Retention: Unlike SQS, where a message is deleted once a consumer has processed it, Kafka retains messages for a configurable period (7 days by default) whether or not they have been read. This enables event replay: reprocessing historical events, backfilling new services, and debugging production issues.
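
A simplified sketch of retention plus replay, using an inventory count as the state being rebuilt. The event shapes and function names are invented for the example, and real Kafka expires data by log segment rather than record by record.

```python
RETENTION_SECONDS = 7 * 24 * 60 * 60   # 7 days

def prune(log: list, now: float, retention: float = RETENTION_SECONDS) -> list:
    """Keep only (timestamp, event) records younger than the retention period."""
    return [(ts, event) for ts, event in log if now - ts < retention]

def replay(log: list, apply, state):
    """Rebuild state by re-applying every retained event from the start."""
    for _, event in log:
        state = apply(state, event)
    return state

day = 24 * 60 * 60
log = [(0 * day, ("add", 5)), (3 * day, ("add", 10)), (8 * day, ("remove", 4))]

# On day 9 the day-0 record has aged out; the other two are still replayable.
retained = prune(log, now=9 * day)
stock = replay(retained, lambda s, e: s + e[1] if e[0] == "add" else s - e[1], 0)
```

This also shows the catch with event sourcing on a time-limited log: once early events expire, a replay no longer reproduces the full history, so the retention setting has to match how far back you need to rebuild.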

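Offsets: Each consumer group tracks its position in each partition as a committed offset. If a consumer crashes, it (or another consumer in the group) resumes from the last committed offset, so no messages are skipped, though anything processed after the last commit may be delivered again.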
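
The crash-and-resume behaviour can be sketched like this. The `consume` function and its `crash_at` parameter are hypothetical; the crash is simulated at the worst moment, after a record is handled but before its offset is committed.

```python
def consume(records, committed, handler, crash_at=None):
    """Process records from the committed offset; commit after each success.

    Returns the new committed offset. crash_at simulates a crash that happens
    after handling that offset but before committing it.
    """
    for offset in range(committed, len(records)):
        handler(records[offset])
        if offset == crash_at:
            return committed          # crashed before the commit
        committed = offset + 1        # commit: everything before this is done
    return committed

records = ["e0", "e1", "e2", "e3"]
handled = []

committed = consume(records, 0, handled.append, crash_at=2)      # crash mid-stream
committed_after_restart = consume(records, committed, handled.append)
```

After the restart, `e2` has been handled twice. Nothing was skipped, but the handler needs to tolerate that repeat, which is why idempotent consumers are the usual advice.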

When to Use Kafka

  • High throughput event streaming (millions of events/second)
  • Event sourcing — rebuild state from event history
  • Multiple independent consumers of the same events
  • Audit logs and compliance requirements
  • Real-time analytics pipelines

When NOT to Use Kafka

  • Simple async job processing → SQS is simpler and cheaper
  • Fan-out notifications → SNS is purpose-built
  • Small scale → Kafka has operational overhead

Decision Framework

Here's how I decide at system design time:
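
One way to condense the "when to use" lists from the sections above into code. Treat it as a rough sketch under my own simplifications: the function name and the four boolean flags are hypothetical, and real decisions also weigh cost, team experience, and ops capacity.

```python
def choose_messaging(needs_replay: bool = False,
                     needs_streaming: bool = False,
                     multiple_consumers: bool = False,
                     consumers_need_buffering: bool = True) -> str:
    """Pick a starting point from the requirements discussed above."""
    # Replay and real-time streaming are the cases SQS cannot cover.
    if needs_replay or needs_streaming:
        return "Kafka"
    # Same message to several services: fan out, and buffer per subscriber.
    if multiple_consumers:
        return "SNS + SQS" if consumers_need_buffering else "SNS"
    # Plain async job processing.
    return "SQS"

job_queue = choose_messaging()
fan_out = choose_messaging(multiple_consumers=True)
audit_stream = choose_messaging(needs_replay=True, multiple_consumers=True)
```

The order of the checks encodes the priority: a replay or streaming requirement wins even when fan-out is also needed, because Kafka's consumer groups already cover multiple independent readers.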


Real World Example — Order Processing

Here's how a production e-commerce system uses all three:

  • SNS fans out the order event to downstream services
  • SQS queues buffer work for payment, inventory, and notification workers
  • Kafka captures the event stream for analytics, auditing, and ML training

Each tool doing what it does best.


Key Takeaways

1. SQS for jobs, SNS for fan-out, Kafka for streams. These aren't interchangeable — they solve different problems.

2. SNS + SQS is the most common production pattern. Fan-out via SNS, buffer and retry via SQS. Use this as your default for microservices communication.

3. Kafka's superpower is replay. If you need to reprocess historical events or have multiple independent consumers, nothing else comes close.

4. Always add a Dead Letter Queue to SQS. Poison pill messages will happen. DLQs prevent them from blocking your entire queue.

5. Choose operational complexity carefully. Kafka requires significant ops investment. SQS and SNS are fully managed — start there and graduate to Kafka when you genuinely need it.

At Staff Engineer level, the question isn't "do you know what Kafka is?" It's "can you articulate when Kafka is the right tool and when it's overkill?" That nuance is what interviewers are testing.
