Visualizing Kafka

Timothy Stepro
7 min read · Feb 24, 2021

Kafka is open source event streaming software that lets you build event-driven systems. While there are other guides on it, I’d like to focus on visualizing the main concepts behind Kafka. That way, when you read through the other guides, you’ll feel much more confident.

With that, let’s begin!

Basics

Before we dive in, let’s make sure we’re on the same page about Kafka. It’s event streaming software. It allows backend services (usually in a microservices architecture) to communicate with each other.

Two services communicating via Kafka

Producers and Consumers

Producers are services that send messages to Kafka, and consumers are services that listen for those messages. These services are your backend services.

Consumer and Producer

A service can be both a consumer and producer.

A service listening to messages and consuming them

Topics

Topics are addresses that producers can send messages to. Other services can listen to these topics.

A service emitting a message and a service receiving a message from a Kafka topic

A service can listen to and send messages to as many topics as it wants.

There’s also a notion of a consumer-group. This is a group of services that act as a single consumer.

A consumer group listening to topic B

For any message going to a consumer group, Kafka routes that message to a single service. This helps you load balance messages. And scale consumers!

A message going into a single service with a consumer group
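
If it helps to see this as code, here is a minimal toy sketch (plain Python, not the Kafka client API). The class `ToyTopic` and the group and service names are made up for illustration. It also simplifies things: it hands out messages to group members one by one, whereas real Kafka balances a group by assigning partitions to its members.

```python
from collections import defaultdict
from itertools import cycle


class ToyTopic:
    """Toy model of a topic: every consumer group gets each message once."""

    def __init__(self):
        # group name -> endless round-robin iterator over that group's members
        self._groups = {}
        # member name -> messages that member received
        self.received = defaultdict(list)

    def add_group(self, group, members):
        self._groups[group] = cycle(members)

    def send(self, message):
        # Every group sees the message, but only one member per group handles it.
        for members in self._groups.values():
            member = next(members)
            self.received[member].append(message)


topic_b = ToyTopic()
topic_b.add_group("billing", ["billing-1", "billing-2"])  # hypothetical group of two services
topic_b.add_group("audit", ["audit-1"])                   # hypothetical group of one service

for i in range(4):
    topic_b.send(f"msg-{i}")

print(dict(topic_b.received))
```

In this run, the two billing services split the four messages between them (two each), while the lone audit service receives all four. Adding a service to a group spreads the same load over more workers.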

A topic acts as a queue for messages. Let’s walk through this. First, a message is sent.

Producer sending message to Kafka topic.

Then, the message is recorded and stored on this queue. This message cannot be changed.

Message getting stored on the queue

The message is also sent to any consumer of the topic. However, the message remains on the queue permanently and cannot be edited.

A copy of the message being stored on the queue and sent to consumer.

Let’s send another message. Just to hammer home the point.

Sending a second message to Topic A

Like before, this message will be sent out to the consumer and stored in the queue. You cannot change messages, and consuming them does not remove them.

(P.S. Messages are not truly kept forever. Kafka removes them once a topic grows past a size limit or after a retention period, which is seven days by default. You can configure both per topic.)

Second message being stored.

This happens for every topic in our Kafka cluster.

Messages being queued up in topics

These immutable queues let us store messages asynchronously, regardless of whether a producer or consumer goes down. They also guarantee the correctness of messages (they can’t be tampered with).
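
Here is a rough sketch of that idea in plain Python. It is a toy model, not how Kafka stores data on disk. The class `ToyLog` and its `max_messages` option are made up for illustration (the option loosely mimics size-based retention).

```python
class ToyLog:
    """Toy append-only log: messages get an offset and are never modified."""

    def __init__(self, max_messages=None):
        self._messages = []      # stored messages, oldest first
        self._base_offset = 0    # offset of the oldest message still stored
        self._max_messages = max_messages

    def append(self, message):
        """Store a message at the end of the log and return its offset."""
        self._messages.append(message)
        offset = self._base_offset + len(self._messages) - 1
        # Optional size-based retention: drop the oldest messages.
        if self._max_messages is not None:
            while len(self._messages) > self._max_messages:
                self._messages.pop(0)
                self._base_offset += 1
        return offset

    def read_from(self, offset):
        """Return stored messages at or after `offset`; reading removes nothing."""
        start = max(offset - self._base_offset, 0)
        return list(self._messages[start:])


log = ToyLog()
log.append("first")
log.append("second")
print(log.read_from(0))  # ['first', 'second']
print(log.read_from(0))  # ['first', 'second'] again: reading did not consume anything

capped = ToyLog(max_messages=2)
for message in ["a", "b", "c"]:
    capped.append(message)
print(capped.read_from(0))  # ['b', 'c']: the oldest message was dropped
```

Because reading never removes anything, a consumer that went down can come back, remember the last offset it handled, and pick up from there.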

Let’s open up these Kafka topics and take a look inside.

Partitions

I lied. A Kafka topic is not really a single queue, but actually composed of many queues called partitions! They help a topic scale.

A topic with two partitions

When a producer posts to a topic, that message gets routed to a single partition.

A message entering a topic, going to a partition

A consumer listens to all partitions and consumes events from all.

A message being consumed from a partition

A producer by default simply sends messages to the topic, and the producer’s client library picks the partition for each message. For messages without a key, the classic default is a round robin strategy that spreads messages evenly across partitions (newer clients fill up a batch for one partition before moving on to the next).

A producer writing to a topic, which is writing to multiple partitions

You can control how messages are split across partitions by giving each message a key. This is done in the producer, not in the topic’s configuration. For instance, if you’re handling user messages (and have a user id), you can use the user id as the key to make sure that messages for that user stay within the same partition. Under the hood, the key is hashed and the result is modded by the number of partitions. You get the point. I hope.

A producer sending messages (with possibly different entity/user ids) to different partitions
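
As a rough sketch, here are both strategies in plain Python. The function names are made up for illustration, and `zlib.crc32` is only a stand-in for a stable hash; Kafka’s own default partitioner uses a different hash function (murmur2).

```python
import zlib
from itertools import count


def partition_for_key(key, num_partitions):
    """Map a key to a partition: stable hash of the key, mod partition count."""
    digest = zlib.crc32(key.encode("utf-8"))  # stand-in hash, NOT Kafka's murmur2
    return digest % num_partitions


def round_robin_partitioner(num_partitions):
    """Return a function that cycles through partitions for keyless messages."""
    counter = count()
    return lambda: next(counter) % num_partitions


# The same user id always lands on the same partition.
for user_id in ["user-1", "user-2", "user-1"]:
    print(user_id, "-> partition", partition_for_key(user_id, 2))

# Keyless messages just take turns.
next_partition = round_robin_partitioner(2)
print([next_partition() for _ in range(4)])  # [0, 1, 0, 1]
```

One thing this makes visible: the result depends on the number of partitions, so changing the partition count of a topic changes where keys land.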

Why would you want this? Because messages within a partition are guaranteed to stay in the order they were written, so they are consumed in that order too.

Messages being consumed in order from partitions.

Every message that goes into a partition is ordered within that partition. Even when messages for multiple users (or other entities) are mapped to the same partition (red/green), you still get ordered messages for each user.

Regardless of why different message types are mapped into single partitions, they maintain order

Messages coming from a partition will be ordered. But partitions may send out their messages at any time, independently of each other. Therefore, topics don’t guarantee order across partitions. This is a little weird. I know. Below, notice how both partitions send out their own messages, each without regard to the other partition. They still maintain their own message order.

Two perfectly valid scenarios. Each partition maintains order for its own messages.
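
To make this concrete, here is a small simulation in plain Python (a toy model with made-up function names, not Kafka code). It merges two partitions in a random order, the way a consumer might happen to see them, and lets you check that each partition’s own order survives every time.

```python
import random


def interleave(partitions, seed):
    """Merge partitions in a random order while keeping each partition's own order."""
    rng = random.Random(seed)
    cursors = [0] * len(partitions)  # next unread position per partition
    delivered = []
    while True:
        # Partitions that still have unread messages.
        ready = [p for p, log in enumerate(partitions) if cursors[p] < len(log)]
        if not ready:
            return delivered
        p = rng.choice(ready)  # any partition may deliver next
        delivered.append((p, partitions[p][cursors[p]]))
        cursors[p] += 1


def per_partition(delivered, p):
    """Messages from partition p, in the order the consumer saw them."""
    return [message for partition, message in delivered if partition == p]


partitions = [["a1", "a2", "a3"], ["b1", "b2"]]
for seed in range(3):
    seen = interleave(partitions, seed)
    print([message for _, message in seen])
```

The overall sequence can differ from run to run, but filtering the result down to one partition always gives back that partition’s messages in their original order.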

If your consumers depend on message order (for example, tracking user clicks within your site), you’ll want to look more into these partitioning strategies (which are out of scope for this article). If not, the default strategy will work fine for you.

Let’s now zoom out a bit and understand how Kafka does this.

Infrastructure

Let’s take a step back and look at our first chart. What is the Kafka cloud?

Two services communicating via Kafka

It’s actually a cluster of servers. The first one we’ll look at is the head of the Kafka cluster, Zookeeper.

Zookeeper routing traffic in and out of the Kafka cluster.

Zookeeper keeps track of all of your topics and partitions. It basically maintains a set of Kafka cluster nodes (called brokers) where topics and partitions are stored. These nodes are individual machines (for example, EC2 instances) which make up your Kafka cluster.

Zookeeper maintaining a set of nodes

If we have two topics, with two partitions each, this is how we may have visualized them before. Note that the partitions are colored the same as the topic now.

Two topics with two partitions

We’ll number the partitions to help identify them later.

Numbered partitions

Now, let’s see how these topics would fit onto our Kafka cluster. Let’s start with one topic. Topic A. For this example, its Partition #1 will be placed on each node.

Topic A, Partition #1

You don’t have to put partitions on every node. And you probably don’t want to. It’ll get kind of expensive. On the other hand, you’ll have a resilient system. Let’s see why.

If a message comes in, it’ll get routed to one copy of the partition on one of the nodes, known as the leader. The cluster assigns the leader, using Zookeeper to coordinate.

A message sent to the leader.

The leader will send off the message to the consumer just like before. The message is also duplicated to the other copies of the partition, known as followers.

Sending the message to the consumer and duplicating it on all of the followers

Now, each copy of the partition contains our message! If one node goes down or explodes, Zookeeper will reassign the leader to a different node.

Message in each partition copy
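
Here is a minimal sketch of this leader and follower idea in plain Python. It is a toy model under simplifying assumptions: the class `ToyReplicatedPartition` and the node names are made up, copies happen instantly, and the next surviving node is simply promoted. Real Kafka is more careful (followers copy from the leader over the network, and only replicas that are caught up are eligible to become leader).

```python
class ToyReplicatedPartition:
    """Toy model of one partition replicated on several nodes."""

    def __init__(self, nodes):
        self.replicas = {node: [] for node in nodes}  # node -> its copy of the log
        self.leader = nodes[0]                        # first node starts as leader

    def write(self, message):
        # The write lands on the leader first, then is copied to every follower.
        self.replicas[self.leader].append(message)
        for node, log in self.replicas.items():
            if node != self.leader:
                log.append(message)

    def fail(self, node):
        # Remove the dead node; if it was the leader, promote a surviving copy.
        del self.replicas[node]
        if node == self.leader:
            if not self.replicas:
                raise RuntimeError("no replicas left")
            self.leader = next(iter(self.replicas))

    def read(self):
        # Consumers read from whichever node is currently the leader.
        return list(self.replicas[self.leader])


partition = ToyReplicatedPartition(["node-1", "node-2", "node-3"])
partition.write("hello")
partition.fail("node-1")   # the leader goes down (or explodes)
print(partition.leader)    # node-2
print(partition.read())    # ['hello']
```

The message survives the failure because every follower already had its own copy before the leader went away.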

This is the same process that will happen for each other partition we add in. We’ll keep two partition copies in our cluster for now though.

Two partition copies

Now let’s add in the other partition #2 for topic A. Just two copies of it too. Now, topic A is fully in our cluster! Both partitions are copied and maintained.

Partition #1 and #2 in our cluster.

Now, let’s add in the partitions for topic B. We’ll assume two copies for now. This is our Kafka cluster with both topics! We’re done!

Both topics in our cluster

It may help to compare what we had before. Notice how the topics are spread across the cluster.

What we had before.
What we have now!
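
If you’d like to play with this layout yourself, here is a small sketch in plain Python. The function `place_replicas` and the node and partition names are made up, and the simple round robin placement is an assumption for illustration. It is not necessarily the layout in the picture above or the exact algorithm Kafka uses.

```python
def place_replicas(partitions, nodes, copies):
    """Spread `copies` replicas of each partition over the nodes, round robin."""
    if copies > len(nodes):
        raise ValueError("cannot place more copies than there are nodes")
    placement = {}
    for i, partition in enumerate(partitions):
        # Start each partition on a different node, then take the next nodes in order.
        placement[partition] = [nodes[(i + r) % len(nodes)] for r in range(copies)]
    return placement


layout = place_replicas(
    partitions=["A-1", "A-2", "B-1", "B-2"],  # two topics, two partitions each
    nodes=["node-1", "node-2", "node-3"],
    copies=2,
)
for partition, placed_on in layout.items():
    print(partition, "->", placed_on)
```

With three nodes and two copies each, every partition ends up on two different nodes, so losing any single node still leaves one copy of every partition.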

Conclusion

I hope you have a better understanding of Kafka now. I hope these visualizations helped you figure out which questions to ask and what to google. There are incredible guides on each of the principles in this article.

Thank you for reading! And a huge thank you to the Kafka creators for making an incredible platform.

Timothy Stepro

Software engineer by day, software engineer by night.