Kafka is open source event streaming software that lets you build event-driven systems. While there are other guides on it, I’d like to focus on visualizing the main concepts behind Kafka. That way, when you read through the other guides, you’ll feel much more confident.
With that, let’s begin!
Before we begin, let’s make sure we’re on the same page about Kafka. It’s event streaming software that allows backend services (usually in a microservices architecture) to communicate with each other.
Producers and Consumers
Producers and consumers are services that send messages to, or listen for messages in, Kafka. These services are your backend services.
A service can be both a consumer and producer.
Topics are addresses that producers can send messages to. Other services can listen to these topics.
A service can listen and send messages to as many topics as it wants.
There’s also a notion of a consumer group. This is a group of services that act as a single consumer.
For any message going to a consumer group, Kafka routes that message to a single service. This helps you load balance messages. And scale consumers!
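To make the load-balancing idea concrete, here is a minimal sketch in plain Python. It is not how Kafka is implemented (Kafka balances by assigning partitions to group members), and the service names are made up; it just shows the key behavior: each message reaches exactly one member of the group.

```python
from itertools import cycle

# Toy model of a consumer group: every message is handed to exactly
# one member, rotating round-robin through the group.
class ConsumerGroup:
    def __init__(self, members):
        self.members = members                      # hypothetical service names
        self._next = cycle(members)                 # rotate through members
        self.delivered = {m: [] for m in members}

    def deliver(self, message):
        member = next(self._next)                   # exactly one member gets it
        self.delivered[member].append(message)
        return member

group = ConsumerGroup(["billing-1", "billing-2", "billing-3"])
for i in range(6):
    group.deliver(f"order-{i}")

print({m: len(msgs) for m, msgs in group.delivered.items()})
# → {'billing-1': 2, 'billing-2': 2, 'billing-3': 2}
```

Six messages, three services, two each: the group as a whole consumed everything, but no message was processed twice.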
A topic acts as a queue for messages. Let’s walk through this. First, a message is sent.
Then, the message is recorded and stored on this queue. This message cannot be changed.
The message is also sent to any consumer of the topic. However, the message remains on the queue permanently and cannot be edited.
Let’s send another message. Just to hammer home the point.
Like before, this message will be sent out to the consumer and stored in the queue. You cannot change messages and they are stored permanently.
(P.S. you can configure Kafka topics to remove these messages if there are too many or after a period of time)
This happens for every topic in our Kafka cluster.
These immutable queues let producers and consumers work asynchronously: messages stay stored even if a producer or consumer goes down. Immutability also guarantees the correctness of messages (they can’t be tampered with).
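Here’s a toy append-only log to illustrate why this works, a sketch rather than Kafka’s real storage engine. Messages are only ever appended, never edited, so a consumer that went down can come back and replay from the offset where it left off.

```python
# Toy append-only log: messages get an offset when appended and are
# never modified afterwards.
class PartitionLog:
    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1      # the message's offset

    def read_from(self, offset):
        # A consumer that was down can resume from its last offset.
        return list(self._messages[offset:])

log = PartitionLog()
log.append("user signed up")
log.append("user upgraded plan")

# A consumer that crashed after reading offset 0 replays what it missed:
print(log.read_from(1))   # ['user upgraded plan']
```

Because nothing is ever overwritten, the producer didn’t need to know whether the consumer was alive when the message arrived.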
Let’s open up these Kafka topics and take a look inside.
I lied. A Kafka topic is not really a single queue, but actually composed of many queues called partitions! They help a topic scale.
When a producer posts to a topic, that message gets routed to a single partition.
A consumer listens to all partitions and consumes events from all.
By default, a producer simply sends messages to the topic, and the topic determines which partition each message goes to. Out of the box, messages are assigned to partitions via a round-robin strategy.
You can configure how the topic (not the service) splits messages into partitions. For instance, if you’re handling user messages (and have a user id), you can make sure that messages for that user stay within the same partition. You can do this by hashing the user id and then modding it by the number of partitions. You get the point. I hope.
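That hash-and-mod idea fits in a few lines. This is just a sketch (Kafka’s real default partitioner hashes the message key with murmur2; here I use md5 for a stable illustration), but the property is the same: the same user id always maps to the same partition.

```python
import hashlib

NUM_PARTITIONS = 4   # assumed partition count for this example

def partition_for(user_id: str) -> int:
    # hashlib gives a hash that's stable across runs, unlike Python's
    # built-in hash(); take 4 bytes and mod by the partition count.
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Same user, same partition, every time:
assert partition_for("user-42") == partition_for("user-42")
print(partition_for("user-42"), partition_for("user-99"))
```

Different users may or may not share a partition; what matters is that one user’s messages never get scattered across partitions.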
Why would you want this? Because every message within a partition is guaranteed to be chronologically ordered, and therefore consumed in order.
Every message that goes into a partition is ordered within that partition. Even with multiple users’ (or other entities’) messages mapped to the same partition (red/green), you still get ordered messages for each user.
Messages coming from a partition will be ordered, but partitions may send out their messages at any time. Therefore, topics don’t guarantee order. This is a little weird, I know. Below, notice how both partitions send out their own messages regardless of the other partition, while still maintaining their own message order.
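You can simulate this in a few lines. In this sketch (hypothetical red/green messages, not real Kafka code), each partition emits whenever it likes, so the merged stream interleaves unpredictably, yet each partition’s internal order always survives.

```python
import random

# Two partitions, each holding its own ordered messages.
partitions = {
    0: ["red-1", "red-2", "red-3"],
    1: ["green-1", "green-2", "green-3"],
}

def consume(partitions):
    # At each step, a random non-empty partition emits its next message,
    # simulating partitions sending out messages at any time.
    queues = {p: list(msgs) for p, msgs in partitions.items()}
    out = []
    while any(queues.values()):
        p = random.choice([p for p, q in queues.items() if q])
        out.append(queues[p].pop(0))
    return out

stream = consume(partitions)
print(stream)  # the interleaving varies run to run
```

Run it a few times: `red-2` may land before or after `green-1`, but it will always come after `red-1`. That’s per-partition ordering without topic-level ordering.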
If your consumers depend on message order (say, tracking user clicks within your site), you’ll want to look more into these topic partition strategies (which are out of scope for this article). If not, the default strategy will work fine for you.
Let’s now zoom out a bit and understand how Kafka does this.
If we take a step back, let’s look at our first chart. What is the Kafka cloud?
It’s actually a cluster of servers. The first one we’ll look at is the head of the Kafka cluster, ZooKeeper.
ZooKeeper keeps track of all of your topics and partitions. It maintains the set of Kafka cluster nodes (brokers) where topics and partitions are stored. These nodes are individual machines (for example, EC2 instances) that make up your Kafka cluster.
If we have two topics, with two partitions each, this is how we may have visualized them before. Note that the partitions are colored the same as the topic now.
We’ll number the partitions to help identify them later.
Now, let’s see how these topics would fit onto our Kafka cluster. Let’s start with one topic. Topic A. For this example, its Partition #1 will be placed on each node.
You don’t have to put partitions on every node. And you probably don’t want to; it’ll get kind of expensive. On the other hand, you’ll have a more resilient system. Let’s see why.
If a message comes in, it’ll get routed to one copy of the partition on one of the nodes, known as the leader. The leader is assigned through ZooKeeper.
The leader will send off the message to the consumer, just like before. It will also replicate the message to the other copies of the partition: the followers.
Now, each copy of the partition contains our message! If the leader’s node goes down or explodes, one of the followers is promoted to leader (with ZooKeeper coordinating the handoff).
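Here’s a toy model of that failover, heavily simplified (real Kafka tracks in-sync replicas, acks, and more; the node names are made up). Writes land on every replica, and when the leader’s node dies, a surviving follower that already has the data takes over.

```python
# Toy replicated partition: one leader, the rest followers.
class ReplicatedPartition:
    def __init__(self, nodes):
        self.replicas = {node: [] for node in nodes}
        self.leader = nodes[0]

    def write(self, message):
        # The leader records the message, then followers replicate it.
        for log in self.replicas.values():
            log.append(message)

    def fail(self, node):
        del self.replicas[node]
        if node == self.leader:
            # Promote a surviving follower to leader.
            self.leader = next(iter(self.replicas))

partition = ReplicatedPartition(["node-a", "node-b", "node-c"])
partition.write("payment received")
partition.fail("node-a")                      # the leader's node explodes
print(partition.leader, partition.replicas[partition.leader])
```

No message was lost: the new leader already held a full copy before the old one died. That’s the resilience the extra copies buy you.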
This is the same process that happens for every other partition we add. We’ll keep two copies of each partition in our cluster for now though.
Now let’s add in the other partition #2 for topic A. Just two copies of it too. Now, topic A is fully in our cluster! Both partitions are copied and maintained.
Now, let’s add in the partitions for topic B. We’ll assume two copies for now. This is our Kafka cluster with both topics! We’re done!
It may help to compare what we had before. Notice how the topics are spread across the cluster.
I hope you have a better understanding of Kafka now. I hope these visualizations helped you figure out which questions to ask and what to google for. There are incredible guides on each of the principles in this article.
Thank you for reading! And a huge thank you to the Kafka creators for making an incredible platform.