Streaming using Kafka

Kafka is a distributed, partitioned, and replicated commit log service. Put simply, it is a distributed messaging server. Kafka maintains message feeds in categories called topics. A topic could be, for example, the ticker symbol of a company you would like to get news about, such as CSCO for Cisco.

Processes that produce messages are called producers and those that consume messages are called consumers. In traditional messaging, the messaging service has one central messaging server, also called the broker. Since Kafka is a distributed messaging service, it has a cluster of brokers, which functionally acts as one Kafka broker, as shown here:

For each topic, Kafka maintains a partitioned log, which consists of one or more partitions spread across the cluster, as shown in the following figure:

Kafka borrows a lot of concepts from Hadoop and other big data frameworks. The concept of a partition is very similar to the concept of an InputSplit in Hadoop. In the simplest case, when using TextInputFormat, an InputSplit is the same as a block. TextInputFormat reads each line of the block as a key-value pair, where the key is the byte offset of the line and the value is the content of the line itself. Similarly, in a Kafka partition, records are stored and retrieved as key-value pairs, where the key is a sequential ID number called the offset and the value is the actual message.
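To make the offset-and-message pairing concrete, here is a minimal sketch of a partition as an append-only log. It is a toy in-memory model, not Kafka's implementation; the `Partition` class, its method names, and the sample messages are made up for illustration.

```python
class Partition:
    """A toy append-only log: each record gets the next sequential offset."""

    def __init__(self):
        self._records = []  # the list index doubles as the offset

    def append(self, message):
        """Store a message and return the offset assigned to it."""
        self._records.append(message)
        return len(self._records) - 1

    def read(self, offset):
        """Return the (offset, message) pair stored at the given offset."""
        return offset, self._records[offset]


partition = Partition()
# Hypothetical messages; offsets are assigned in arrival order.
offsets = [partition.append(m) for m in ["first message", "second message", "third message"]]
pair = partition.read(1)
```

Here `offsets` holds the sequential IDs assigned on append, and `pair` is the key-value pair for offset 1, mirroring how TextInputFormat pairs a byte offset with a line.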

In Kafka, message retention does not depend on whether a consumer has consumed a message. Messages are retained for a configurable period of time. Each consumer is free to read messages in any order it likes. All it needs to do is keep track of an offset. An analogy is reading a book: the page number is analogous to the offset, and the page content is analogous to the message. The reader is free to read in whichever order they want as long as they remember the bookmark (the current offset).
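The bookmark analogy can be sketched in a few lines. This is a simplified in-memory model under stated assumptions: the `Reader` class and its `poll` and `seek` methods are hypothetical names for this example (not a Kafka client API), and time-based retention is left out.

```python
class Reader:
    """A toy consumer that only remembers its own offset (the bookmark)."""

    def __init__(self, log, offset=0):
        self.log = log        # shared log; reading never removes messages
        self.offset = offset  # position of the next message to read

    def poll(self):
        """Return the next message and advance the bookmark, or None at the end."""
        if self.offset >= len(self.log):
            return None
        message = self.log[self.offset]
        self.offset += 1
        return message

    def seek(self, offset):
        """Move the bookmark to any retained offset."""
        self.offset = offset


log = ["page 0", "page 1", "page 2"]  # hypothetical retained messages

reader_a = Reader(log)
read_by_a = [reader_a.poll(), reader_a.poll(), reader_a.poll()]

# A second reader starts wherever it likes; reader_a's progress does not matter.
reader_b = Reader(log, offset=2)
read_by_b = reader_b.poll()

# reader_a can jump back and re-read, because consumption deleted nothing.
reader_a.seek(0)
reread_by_a = reader_a.poll()
```

Each reader carries nothing but its own offset, so the log itself stays unchanged no matter how many readers pass over it.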

To provide functionality similar to pub/sub and PTP (queues) in traditional messaging systems, Kafka has the concept of consumer groups. A consumer group is a group of consumers that the Kafka cluster treats as a single unit. Within a consumer group, each message is delivered to only one consumer. Each partition of a topic is assigned to exactly one consumer in the group, so if the C1 consumer, in the following diagram, receives the first message from a partition of the topic T1, all the following messages from that partition will also be delivered to this consumer. Using this strategy, Kafka guarantees the order of message delivery within a partition; for a topic with a single partition, this amounts to ordering for the whole topic.

In one extreme case, when all consumers are in one consumer group, the Kafka cluster acts like a PTP/queue system. In the other extreme case, when every consumer belongs to a different group, it acts like pub/sub. In practice, each consumer group has a limited number of consumers.
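The two extremes can be illustrated with a small simulation. This is a sketch, not Kafka's actual assignment logic: the `deliver` function is hypothetical, it assumes each partition is owned by exactly one consumer per group, and it assigns partitions round-robin purely for simplicity. Consumer names are assumed to be unique across groups.

```python
def deliver(partitions, groups):
    """Simulate delivery of every partition's messages to each consumer group.

    partitions: list of partitions, each a list of messages in offset order
    groups: dict mapping a group name to its list of consumer names
    Returns a dict mapping each consumer name to the messages it received.
    """
    received = {name: [] for members in groups.values() for name in members}
    for members in groups.values():
        # Every group sees every partition; within a group, a partition
        # goes to one consumer only (round-robin is an assumption here).
        for index, partition in enumerate(partitions):
            owner = members[index % len(members)]
            received[owner].extend(partition)
    return received


partitions = [["m0", "m2"], ["m1", "m3"]]  # hypothetical topic with two partitions

# One group holding both consumers: the messages are split, as in a queue.
queue_like = deliver(partitions, {"g1": ["C1", "C2"]})

# One group per consumer: every consumer gets every message, as in pub/sub.
pubsub_like = deliver(partitions, {"g1": ["C1"], "g2": ["C2"]})
```

In `queue_like`, C1 and C2 each receive the messages of one partition, in order. In `pubsub_like`, both consumers receive all four messages.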

This recipe will show you how to perform a word count using the data coming from Kafka.
