Concepts

In Kafka, a topic is the formal name for a queue to which messages are published and from which they are consumed. Topics in Kafka offer the virtual topic queuing model described previously: where there are multiple logical subscribers, each gets a copy of every message, but a logical subscriber can have multiple instances, and each instance of the subscriber gets a different subset of the messages.

A topic is modeled as a partitioned log, as shown here:

Source: http://kafka.apache.org/documentation.html#introduction

New messages are appended to a partition of a log. The log partition is an ordered, immutable list of messages. Each message in a topic partition is identified by an offset within the list. The partitions serve several purposes:

  • A log (topic) can scale beyond the size of a single machine (node). Individual partitions need to fit on a single machine, but the overall topic can be spread across several machines.
  • Topic partitions allow parallelism and scalability in consumers.

Kafka only guarantees the order of messages within a topic partition, and not across different partitions for the same topic. This is a key point to remember when designing applications.
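The partitioned-log model above can be illustrated with a small toy sketch (pure Python, not Kafka code; the `PartitionedLog` class and its methods are invented here for illustration):

```python
class PartitionedLog:
    """Toy model of a topic: a fixed set of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append a message to one partition and return its offset there."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1  # offset is the position within this partition only

    def read(self, partition, offset):
        return self.partitions[partition][offset]


topic = PartitionedLog(num_partitions=2)
o0 = topic.append(0, "a")  # offset 0 in partition 0
o1 = topic.append(1, "b")  # offset 0 in partition 1
o2 = topic.append(0, "c")  # offset 1 in partition 0
```

Note that offsets are meaningful only within a partition: both "a" and "b" get offset 0, in different partitions, which is why ordering holds per partition but not across the topic as a whole.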

Whenever a new message is produced, it is durably persisted on a set of broker instances designated for that topic partition—called In-Sync Replicas (ISRs). Each ISR has one node that acts as the leader and zero or more nodes that act as followers. The leader handles all read and write requests and replicates the state on the follower. There is periodic heart-beating between the leader and the followers, and if a leader is deemed to be failed, an election is held to elect a new leader. It should be noted that one node can be a leader for one topic partition while being a follower for other topic partitions. This allows for the load to be distributed evenly among the nodes of the cluster.
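The leader/follower arrangement can be sketched as follows. This is an illustrative toy model only: in real Kafka, leaders are elected by the cluster controller and ISR membership is tracked in cluster metadata, none of which is modeled here.

```python
class TopicPartitionReplicas:
    """Toy model: one leader plus followers, all holding copies of a partition."""

    def __init__(self, nodes):
        self.isr = list(nodes)            # in-sync replicas; first entry acts as leader
        self.logs = {n: [] for n in nodes}

    @property
    def leader(self):
        return self.isr[0]

    def write(self, message):
        # The leader accepts the write and the state is replicated to followers.
        for node in self.isr:
            self.logs[node].append(message)

    def fail_leader(self):
        """Simulate leader failure: a remaining ISR member takes over."""
        failed = self.isr.pop(0)
        del self.logs[failed]
        return self.leader


replicas = TopicPartitionReplicas(["node1", "node2", "node3"])
replicas.write("m1")
new_leader = replicas.fail_leader()   # "node2" takes over, with the full log
```

Because every ISR member already holds the replicated log, the newly elected leader can serve reads and writes without losing the acknowledged message.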

The messages remain on the brokers for a configurable retention time, which has no bearing on whether they have been consumed. For example, if the retention policy for a topic is set to one week, then for a week after a message is published, it is available for consumption. After this, the message is deleted (and the space reclaimed).
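Time-based retention can be sketched like this (a simplified model; real Kafka deletes whole log segments in the background rather than individual messages, and the `purge` function here is invented for illustration):

```python
RETENTION_SECONDS = 7 * 24 * 3600   # one-week retention policy


def purge(log, now):
    """Drop messages older than the retention window, consumed or not."""
    return [(ts, msg) for ts, msg in log if now - ts <= RETENTION_SECONDS]


now = 1_000_000_000
log = [
    (now - 8 * 24 * 3600, "expired"),          # published 8 days ago
    (now - 3600, "still available"),           # published an hour ago
]
log = purge(log, now)   # only the recent message survives
```

Whether a consumer has read "expired" is irrelevant; only its age matters.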

Unlike most other messaging systems, Kafka retains minimal information about consumer consumption. Clients keep track of their own offset in each topic partition and can seek to an arbitrary position in the log.

Kafka also provides a facility for the brokers to remember offsets on behalf of consumers, when the consumers explicitly commit them. This design removes a lot of complexity from the broker and allows for efficient support of multiple consumers consuming at different speeds:

(Source: http://kafka.apache.org/documentation.html#introduction)

Consumer A can consume messages at its own speed, irrespective of how fast Consumer B is going.
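This independence can be sketched as follows (a toy model: `poll` and the offset map are invented names, and real consumers fetch from brokers over the network rather than reading a shared list):

```python
# The broker's log is immutable; each consumer only advances its own offset.
log = ["m0", "m1", "m2", "m3"]
offsets = {"consumer_a": 0, "consumer_b": 0}


def poll(consumer, max_messages):
    """Return the next batch for this consumer and advance its offset."""
    start = offsets[consumer]
    batch = log[start:start + max_messages]
    offsets[consumer] = start + len(batch)
    return batch


fast = poll("consumer_a", 3)  # Consumer A races ahead: ["m0", "m1", "m2"]
slow = poll("consumer_b", 1)  # Consumer B lags behind: ["m0"]
```

Nothing Consumer A does affects what Consumer B sees; the broker stores no per-message "consumed" flag, only the log itself.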

Producers publish messages on specific topics. They can provide a partitioner of their choice to pick a partition for each message, or they can use the default (random/round-robin) partitioner. Generally, most Kafka clients batch messages on the producer side, so a write just stores the message in a buffer. These messages are then periodically flushed to the brokers.
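A partitioner's job can be sketched like this. Note the hedge: Kafka's actual default partitioner hashes keyed messages with murmur2; the CRC32 used below is merely an illustrative stand-in, and `pick_partition` is an invented name.

```python
import itertools
import zlib

NUM_PARTITIONS = 4
_round_robin = itertools.count()


def pick_partition(key=None):
    """Keyed messages hash to a stable partition; keyless ones rotate."""
    if key is not None:
        return zlib.crc32(key.encode()) % NUM_PARTITIONS
    return next(_round_robin) % NUM_PARTITIONS
```

Hashing on the key guarantees that all messages with the same key (say, the same customer ID) land in the same partition, preserving their relative order, which matters given that ordering is only guaranteed per partition.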

As described earlier in The Pub/Sub model section, the Kafka topic offers a Pub/Sub queuing model. So, there can be multiple logical consumers, and each will get all of the messages. However, a logical consumer (say, a microservice) will often have more than one instance, and ideally we would like to load-balance consumption and processing of messages across these consumer instances. Kafka allows this using a construct called consumer groups. Each time a consumer registers with a topic for messages, it sends a label (string) identifying the logical consumer (say, the service). The Kafka brokers treat all instances carrying the same group name as belonging to the same logical consumer, and each instance gets only a subset of the messages. Hence, messages are effectively load-balanced over the consumer instances.

As described earlier, the topic partitions serve as a unit of parallelism for consumption of the messages. Let's look at how this happens. When a consumer instance registers for messages from a topic, it has two options:

  • Manually register for specific partitions of that topic
  • Have Kafka automatically distribute topic partitions among the consumer instances

The first option is simple and gives the application full control over who processes what. The second option, however, called the group coordinator feature, is a powerful tool for enabling scalability and resilience in distributed systems. Here, the consumers don't specify explicit partitions to consume from; rather, the broker automatically assigns partitions to consumer instances:

(Source: http://kafka.apache.org/documentation.html#introduction)

In the preceding diagram, the topic has four partitions spread over two servers. Consumer Group A has two instances, and each has two topic partitions assigned to it. On the other hand, Consumer Group B has four instances, and each is assigned a topic partition. The broker (group coordinator feature) is responsible for maintaining membership in the consumer groups through a heart-beating mechanism with the consumer group instances. As new consumer group instances come up, the topic partitions are reassigned automatically. If an instance dies, its partitions will be distributed among the remaining instances.
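The assignment in the diagram can be sketched with a simple round-robin distribution. Kafka actually ships several pluggable assignors (range, round-robin, sticky); the `assign` function below is an invented, simplified version that only illustrates the idea, including what happens when an instance dies and a rebalance occurs.

```python
def assign(partitions, instances):
    """Distribute topic partitions round-robin over consumer-group instances."""
    assignment = {inst: [] for inst in instances}
    for i, partition in enumerate(partitions):
        assignment[instances[i % len(instances)]].append(partition)
    return assignment


parts = ["P0", "P1", "P2", "P3"]
group_a = assign(parts, ["A1", "A2"])              # two partitions each
group_b = assign(parts, ["B1", "B2", "B3", "B4"])  # one partition each
rebalanced = assign(parts, ["B1", "B2", "B3"])     # after B4 dies: redistributed
```

Since partitions are the unit of parallelism, a group with more instances than partitions would leave some instances idle; the diagram's Group B, with four instances and four partitions, is the maximum useful parallelism for this topic.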
