Kafka, NATS, and streaming & messaging in distributed systems in general

Messaging in Distributed Systems (Foundations)

Why messaging is used

  • It decouples producers from consumers in time and space.
  • Producers do not block waiting for consumers.
  • Systems tolerate partial failures better.
  • Traffic spikes are absorbed by the broker instead of crashing services.

Asynchronous vs synchronous

  • Messaging is asynchronous by default.
  • This improves latency isolation but complicates error handling.
  • Failures move from call-time to processing-time.

Common messaging patterns

  • Pub/Sub: one message, many consumers.
  • Queue (work distribution): one message, one consumer.
  • Event streaming: ordered log of immutable events.
  • Event sourcing: state rebuilt from events.
  • Request/Reply: async alternative to HTTP/gRPC.
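
The difference between Pub/Sub and a work queue can be sketched with a tiny in-memory broker. This is an illustration only: `TinyBroker` and its method names are hypothetical and do not mirror any real client API.

```python
from collections import defaultdict
from itertools import cycle


class TinyBroker:
    """Hypothetical in-memory broker, for illustration only."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> callbacks (pub/sub)
        self.workers = {}                     # queue name -> round-robin iterator

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Pub/Sub: every subscriber receives its own copy.
        for callback in self.subscribers[topic]:
            callback(message)

    def register_workers(self, queue, callbacks):
        self.workers[queue] = cycle(callbacks)

    def enqueue(self, queue, message):
        # Queue: exactly one worker receives each message.
        next(self.workers[queue])(message)


broker = TinyBroker()

seen_a, seen_b = [], []
broker.subscribe("orders", seen_a.append)
broker.subscribe("orders", seen_b.append)
broker.publish("orders", "order-1")      # both subscribers see it

work_a, work_b = [], []
broker.register_workers("jobs", [work_a.append, work_b.append])
for job in ["job-1", "job-2", "job-3"]:
    broker.enqueue("jobs", job)          # jobs are split between the workers
```

Here both `seen_a` and `seen_b` end up with `order-1`, while the three jobs are divided between `work_a` and `work_b`.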

Delivery guarantees

  • At-most-once: fast, no retries, message loss possible.
  • At-least-once: retries enabled, duplicates possible.
  • Exactly-once: each message takes effect exactly once; in practice this is at-least-once delivery plus idempotent or transactional processing, and it is complex to get right.
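
A small simulation can make the first two guarantees concrete. It assumes a hypothetical lossy link where a first send, or the ack for it, can be lost; `send_all` and its parameters are invented for this sketch.

```python
def send_all(messages, lost_in_transit, lost_acks, retry):
    """Return what the receiver sees.

    lost_in_transit: messages whose first send never arrives.
    lost_acks: messages whose first ack never reaches the sender.
    retry: False models at-most-once, True models at-least-once.
    """
    received = []
    for msg in messages:
        attempt = 0
        while True:
            attempt += 1
            first = attempt == 1
            delivered = not (first and msg in lost_in_transit)
            if delivered:
                received.append(msg)
            acked = delivered and not (first and msg in lost_acks)
            if acked or not retry:
                break  # at-most-once never resends
    return received


msgs = ["a", "b", "c"]
at_most_once = send_all(msgs, lost_in_transit={"b"}, lost_acks={"c"}, retry=False)
at_least_once = send_all(msgs, lost_in_transit={"b"}, lost_acks={"c"}, retry=True)
```

Without retries `b` is lost; with retries nothing is lost, but `c` arrives twice because its first ack went missing.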

Ordering

  • Global ordering is expensive and rarely scales.
  • Most systems guarantee ordering only within a subset (partition or subject).
  • Ordering usually trades off with throughput and parallelism.

Backpressure

  • Prevents fast producers from overwhelming slow consumers.
  • Can be broker-enforced or consumer-controlled.
  • Missing backpressure often leads to cascading failures.
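
One simple form of backpressure is a bounded buffer that refuses new work when full. The sketch below uses Python's standard `queue.Queue`; the capacity of 2 is an arbitrary choice for the example.

```python
import queue

buffer = queue.Queue(maxsize=2)  # bounded: the buffer cannot grow without limit
accepted, rejected = [], []

for event in ["e1", "e2", "e3"]:
    try:
        buffer.put_nowait(event)  # raises queue.Full when the buffer is at capacity
        accepted.append(event)
    except queue.Full:
        # The producer gets an explicit signal: slow down, retry later, or shed load.
        rejected.append(event)

drained = buffer.get_nowait()     # the consumer catches up and frees one slot
buffer.put_nowait(rejected[0])    # the retry now succeeds
```

Using a blocking `put()` instead of `put_nowait()` would make the producer wait rather than fail, which is the other common choice.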

Acknowledgements (acks)

  • Ack confirms delivery or processing.
  • Ack timing affects reliability and latency.
  • Early ack = faster, less safe. Late ack = safer, slower.
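
The early versus late ack trade-off shows up when a consumer crashes mid-processing. A minimal sketch, assuming a hypothetical consumer that crashes once on a chosen message and then restarts:

```python
def consume(messages, crash_on, ack_early):
    """Return the messages that were fully processed."""
    pending = list(messages)  # messages the broker still considers unacked
    processed = []
    crashed = False
    while pending:
        msg = pending[0]
        if ack_early:
            pending.pop(0)    # ack before processing: the broker forgets the message
        if msg == crash_on and not crashed:
            crashed = True    # crash mid-processing, then restart
            continue
        processed.append(msg)
        if not ack_early:
            pending.pop(0)    # ack after processing: now it is safe to forget
    return processed


early = consume(["m1", "m2", "m3"], crash_on="m2", ack_early=True)
late = consume(["m1", "m2", "m3"], crash_on="m2", ack_early=False)
```

With early acks `m2` is lost for good; with late acks it is redelivered after the restart.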

Apache Kafka (Event Streaming Platform)

What Kafka really is

  • A distributed, persistent commit log.
  • Designed for high throughput and durability.
  • Optimized for streaming large volumes of events.

Core components

  • Broker: Kafka server storing data.
  • Topic: named stream of events.
  • Partition: ordered, append-only log; each topic is split into one or more partitions.
  • Producer: writes events.
  • Consumer: reads events.
  • Consumer group: parallel consumption model.

Data model

  • Events are immutable and appended to a log.
  • Each event has an offset within its partition.
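  • Consumers track their own read position; committed offsets are stored in the internal __consumer_offsets topic rather than as per-message delivery state.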

Ordering guarantees

  • Strict ordering only inside a single partition.
  • Multiple partitions break global ordering.
  • Partition key choice is critical for correctness.
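
Why the key matters can be sketched with a hash partitioner. Assumptions: `zlib.crc32` is a stand-in hash (Kafka's default partitioner uses murmur2), and the partition count and event names are invented for the example.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # arbitrary for this sketch


def partition_for(key: str) -> int:
    # Same key -> same hash -> same partition, so per-key order is preserved.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


partitions = defaultdict(list)
events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "add_to_cart"),
    ("user-1", "checkout"),
    ("user-2", "logout"),
]
for key, action in events:
    partitions[partition_for(key)].append((key, action))


def actions_for(key):
    """Read one key's events back from its partition, in log order."""
    return [a for k, a in partitions[partition_for(key)] if k == key]
```

Each user's events come back in the order they were produced, even though no order is defined across partitions. Choosing a key that scatters related events (for example a random ID) would lose that property.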

Durability and replication

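  • Data is appended to log files on disk (through the OS page cache, flushed asynchronously by default), so durability relies on replication as well as on disk.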
  • Partitions are replicated across brokers.
  • Leader handles writes, followers replicate.
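  • acks=all makes the leader wait for all in-sync replicas (ISR) before confirming a write; pair it with min.insync.replicas to set a real durability floor.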

Retention and replay

  • Kafka keeps data even after consumption.
  • Retention is time-based or size-based; compacted topics instead keep the latest record per key.
  • Consumers can rewind offsets and reprocess data.
  • This makes Kafka both a message bus and a durable, replayable log (not a general-purpose database).
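
Replay follows directly from the log-plus-offset model. A minimal in-memory sketch; the `PartitionLog` and `Consumer` classes are hypothetical and only borrow the idea of `poll` and `seek`:

```python
class PartitionLog:
    """Append-only list standing in for one partition."""

    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # the new record's offset

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]


class Consumer:
    def __init__(self, log):
        self.log = log
        self.position = 0  # next offset to read

    def poll(self):
        batch = self.log.read(self.position)
        self.position += len(batch)  # reading never deletes anything
        return batch

    def seek(self, offset):
        self.position = offset  # rewind (or skip ahead) at will


log = PartitionLog()
for value in ["created", "paid", "shipped"]:
    log.append(value)

consumer = Consumer(log)
first_pass = consumer.poll()
consumer.seek(1)
replayed = consumer.poll()
```

After the first pass the records are still in the log, so seeking back to offset 1 re-reads `paid` and `shipped`.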

Scaling model

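  • Scale by adding partitions (and brokers to host them); partition counts can be increased but not decreased, and adding partitions changes which partition a key maps to.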
  • Consumers in a group split partitions.
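  • Rebalancing pauses consumption temporarily while partitions are reassigned; cooperative (incremental) rebalancing limits the pause to the partitions that move.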
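
How a group splits partitions can be sketched with a round-robin assignment. This is a simplification: real Kafka assignors (range, round-robin, sticky, cooperative) differ in detail, and `assign` is a hypothetical helper.

```python
def assign(partitions, consumers):
    """Round-robin partitions across the members of one consumer group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


before = assign(range(6), ["c1", "c2"])
after = assign(range(6), ["c1", "c2", "c3"])   # a member joins: partitions move
idle = assign(range(2), ["c1", "c2", "c3"])    # more consumers than partitions
```

When `c3` joins, most partitions change owner, which is why a rebalance interrupts consumption. In the last case `c3` receives nothing: parallelism within a group is capped by the partition count.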

Delivery semantics

  • At-least-once is the default and most common.
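  • Exactly-once requires idempotent producers and transactions, and covers consume-transform-produce flows inside Kafka; side effects in external systems still need idempotency.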
  • Exactly-once increases latency and operational complexity.
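
The idea behind an idempotent producer is that the broker can recognise a retried write. A simplified sketch, assuming one partition and a hypothetical `PartitionLeader` class; a real broker also tracks this per partition and rejects out-of-order sequence gaps.

```python
class PartitionLeader:
    """Deduplicates producer retries using (producer id, sequence number)."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence number appended

    def append(self, producer_id, seq, value):
        if seq <= self.last_seq.get(producer_id, -1):
            return "duplicate"  # a retry of something already written
        self.log.append(value)
        self.last_seq[producer_id] = seq
        return "appended"


leader = PartitionLeader()
results = [
    leader.append("p1", 0, "a"),
    leader.append("p1", 0, "a"),  # retry after a lost ack
    leader.append("p1", 1, "b"),
]
```

The retried write is acknowledged but not appended again, so the log holds `a` and `b` exactly once each.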

Latency profile

  • Optimized for batching and disk I/O.
  • Higher latency than in-memory systems.
  • Poor fit for request/response workflows.

Typical Kafka use cases

  • Event sourcing and audit logs.
  • CDC pipelines (e.g. Debezium).
  • Data lakes and analytics ingestion.
  • Large-scale log aggregation.

NATS.io (Low-Latency Messaging System)

What NATS is

  • Lightweight, high-performance messaging system.
  • Focused on speed and simplicity.
  • Designed for real-time communication.

Core concepts

  • Server: handles message routing.
  • Subject: hierarchical message channel.
  • Publisher / Subscriber: simple API.

Data model

  • Messages are transient by default.
  • No partitions or offsets.
  • Routing is subject-based, not log-based.
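
Subject-based routing relies on dot-separated tokens and two wildcards: `*` matches exactly one token and `>` matches one or more trailing tokens. The matcher below is a sketch of those rules, not the server's implementation, and the subject names are invented.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Check a subscription pattern against a concrete subject."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last token and must cover at least one token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


checks = {
    ("orders.*.created", "orders.eu.created"): subject_matches("orders.*.created", "orders.eu.created"),
    ("orders.*", "orders.eu.created"): subject_matches("orders.*", "orders.eu.created"),
    ("orders.>", "orders.eu.created"): subject_matches("orders.>", "orders.eu.created"),
    ("orders.>", "orders"): subject_matches("orders.>", "orders"),
}
```

Only the first and third pairs match: `*` cannot span two tokens, and `>` needs at least one token after `orders`.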

Ordering

  • Messages are ordered per publisher on a subject.
  • No strong global ordering guarantees.
  • Parallel publishers can interleave messages.

Latency

  • Extremely low (typically sub-millisecond for core NATS).
  • Ideal for control planes and service-to-service traffic.
  • Much faster than disk-based systems.

Delivery guarantees

  • Core NATS provides at-most-once delivery.
  • No persistence unless JetStream is enabled.
  • Simplicity over guarantees by default.

NATS JetStream (Persistence Layer)

What JetStream adds

  • Message persistence to disk.
  • Acknowledgements and retries.
  • Replay and consumer state.

Retention modes

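  • Limits-based (default): keep messages until a limit on count, size, or age is reached.
  • Work-queue: each message is removed once a consumer acknowledges it.
  • Interest-based: a message is kept until every consumer with interest has acknowledged it; with no consumers, nothing is retained.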
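
The limits-based and work-queue modes can be pictured with two tiny in-memory structures. This is only an analogy for the retention behaviour; it does not use the JetStream API, and the cap of 3 and the `ack` helper are invented for the sketch.

```python
from collections import deque

# Limits-based: a cap on message count; the oldest entries are evicted first.
limits_stream = deque(maxlen=3)
for n in range(1, 6):
    limits_stream.append(f"msg-{n}")

# Work-queue: a message stays stored only until it is acknowledged.
work_queue = {1: "job-a", 2: "job-b"}  # sequence number -> payload


def ack(sequence):
    work_queue.pop(sequence, None)  # acknowledged work is removed from the stream


ack(1)
```

The limits-based stream keeps only the three newest messages, while the work queue shrinks as jobs are acknowledged.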

Durability

  • Disk-backed storage.
  • Optional replication (Raft-based, with a replica count set per stream).
  • Still simpler to operate than Kafka’s replication model.

Consumption model

  • Pull-based or push-based consumers.
  • Fine-grained control over acks and retries.
  • No partition rebalancing storms.

Strengths

  • Low operational overhead.
  • Low latency even with persistence.
  • Excellent for microservices messaging.

Limitations

  • Not designed for long-term historical data.
  • Smaller ecosystem than Kafka.
  • Limited stream processing capabilities.

Kafka vs NATS (How to Explain in an Interview)

Aspect          Kafka            NATS
Primary goal    Event streaming  Real-time messaging
Latency         Medium           Ultra-low
Throughput      Extremely high   High
Persistence     Always on        Optional
Replay          Native           Via JetStream
Scaling         Partitions       Subjects
Ops complexity  High             Low

Simple rule

  • Kafka = data backbone.
  • NATS = nervous system.

Common Interview Deep-Dive Questions

Why does Kafka need partitions? Partitions allow parallel writes and reads, but split ordering guarantees.

Why is consumer lag important? It shows how far consumers are behind and reveals backpressure or failures.
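
Lag is simple arithmetic: per partition, the log end offset minus the group's committed offset. A sketch with made-up offsets; `consumer_lag` is a hypothetical helper, not a client API.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag = newest offset in the log - last committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


lag = consumer_lag(
    log_end_offsets={0: 1500, 1: 1480, 2: 900},
    committed_offsets={0: 1500, 1: 1200, 2: 450},
)
total_lag = sum(lag.values())
```

Partition 0 is caught up, while partitions 1 and 2 are 280 and 450 records behind. Lag that keeps growing, rather than lag that is merely non-zero, is the signal worth alerting on.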

Why is exactly-once rarely used? It increases latency, complexity, and operational risk.

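Can Kafka be used as a database? Not as a general-purpose one. It is a durable log, but it lacks secondary indexes and ad hoc queries, and its transactions cover atomic writes across partitions rather than general read/write transactions.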

Why is NATS popular in microservices? Low latency, simple mental model, and native request/reply.
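
Request/reply over pub/sub works by attaching a reply subject: the requester listens on a unique inbox and the responder publishes its answer there. A sketch with a hypothetical in-memory `TinyBus`; the `_INBOX.` prefix mirrors the NATS naming convention, everything else is invented.

```python
import itertools
from collections import defaultdict


class TinyBus:
    """Hypothetical synchronous in-memory bus, for illustration only."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self._inbox_ids = itertools.count(1)

    def subscribe(self, subject, handler):
        self.handlers[subject].append(handler)

    def publish(self, subject, data, reply_to=None):
        for handler in list(self.handlers[subject]):
            handler(data, reply_to)

    def request(self, subject, data):
        inbox = f"_INBOX.{next(self._inbox_ids)}"  # unique one-shot reply subject
        replies = []
        self.subscribe(inbox, lambda reply, _reply_to: replies.append(reply))
        self.publish(subject, data, reply_to=inbox)
        del self.handlers[inbox]                   # clean up the inbox
        return replies[0] if replies else None     # a real client would time out


bus = TinyBus()
bus.subscribe("math.double", lambda data, reply_to: bus.publish(reply_to, data * 2))

answer = bus.request("math.double", 21)
missing = bus.request("math.triple", 21)  # nobody is listening on this subject
```

The responder never needs to know who asked; it only publishes to the reply subject it was given.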

Does NATS support streaming? Yes, via JetStream, but with different trade-offs than Kafka.


Real-World Design Principles

  • Prefer at-least-once + idempotent consumers.
  • Avoid designs that rely on global ordering.
  • Monitor lag, queue depth, and processing time.
  • Separate control traffic from data traffic.
  • Use messaging for decoupling, not state storage.
  • Simpler systems fail less often.
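
The first principle, at-least-once delivery with idempotent consumers, can be sketched as deduplication by message ID. Assumptions: each message carries a unique `id`, and the set of processed IDs lives in memory here, whereas a real service would persist it together with the state change.

```python
class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.processed_ids = set()
        self.balance = 0

    def handle(self, message):
        if message["id"] in self.processed_ids:
            return False  # duplicate delivery: skip the side effect
        self.balance += message["amount"]
        self.processed_ids.add(message["id"])
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"id": "tx-1", "amount": 100},
    {"id": "tx-2", "amount": 50},
    {"id": "tx-1", "amount": 100},  # redelivered after a lost ack
]
applied = [consumer.handle(message) for message in deliveries]
```

The redelivered `tx-1` is ignored, so the balance reflects each transaction once.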