Apache Kafka is widely used for building distributed, real-time data pipelines. However, running Kafka in a production environment requires careful setup to ensure performance, reliability, and scalability. Strimzi simplifies deploying and managing Kafka clusters on Kubernetes or OpenShift, but several factors still need to be considered to make a cluster truly production-ready. In this article, we cover essential best practices and optimizations for getting your Kafka cluster ready for production, based on insights from the Strimzi Production Ready series and broader operational experience.
1. Cluster Balancing
As Kafka clusters grow, partitions might become unevenly distributed across brokers. This can lead to resource bottlenecks, where some brokers handle too many partitions while others remain underutilized.
How to Balance Partitions
Strimzi integrates Cruise Control and exposes it through the KafkaRebalance custom resource to rebalance partitions across brokers. This feature ensures better resource distribution and prevents individual brokers from being overloaded.
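As a rough sketch (assuming Cruise Control is enabled in the Kafka custom resource via spec.cruiseControl, and a cluster named my-cluster, which is a placeholder), a KafkaRebalance resource with explicit goals might look like this:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    # Must match the name of the Kafka custom resource (placeholder here)
    strimzi.io/cluster: my-cluster
spec:
  goals:
    # Only these goals are considered when generating the rebalance proposal
    - NetworkInboundCapacityGoal
    - DiskCapacityGoal
```

Applying the resource asks Cruise Control to generate an optimization proposal, which you then approve by annotating the resource with strimzi.io/rebalance=approve.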
By specifying goals such as NetworkInboundCapacityGoal and DiskCapacityGoal, the rebalancer ensures that broker resources are better utilized, reducing bottlenecks.
Additional Tip: Rebalance Frequency
Rebalance Kafka clusters periodically, especially after adding or removing brokers. However, avoid over-rebalancing, as it can lead to performance degradation during the rebalance process.
2. Optimizing Local Storage
Storage configuration plays a major role in Kafka’s performance. While using local storage may improve I/O performance, there are trade-offs to consider.
Local Storage Pros and Cons
- Advantages: Local storage (e.g., SSDs attached directly to the broker node) offers low-latency read/write operations, reducing the overhead associated with network storage.
- Disadvantages: When a node goes offline for maintenance or crashes, its local storage becomes unavailable, requiring manual cleanup of Persistent Volumes (PV) and Persistent Volume Claims (PVC) in Kubernetes. Moreover, resyncing data to the newly provisioned node can take time.
Recommendation: If you choose local storage, have a robust monitoring and disaster recovery plan. For mission-critical workloads, consider distributed storage that isn’t tied to a single node.
3. Handling Multiple Disks for Kafka
While multiple disks offer more storage, they don’t necessarily improve bandwidth or throughput. Kafka partitions cannot span multiple disks, so you must carefully plan disk usage.
Best Practice: JBOD (Just a Bunch Of Disks)
Using JBOD can help combine multiple disks, but the limitation remains that a single partition will still reside on a single disk. Rebalancing partitions is necessary to ensure even distribution of disk usage.
Additional Insights:
- Using a single large disk per broker is often better from a management and performance perspective, as it avoids idle disks and uneven partition distribution.
- Strimzi makes it easy to configure JBOD in the Kafka CRD (see the example below), and it’s often the best choice for scaling storage without adding complexity.
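As a minimal sketch of JBOD storage in the Kafka custom resource (the volume sizes are illustrative assumptions):

```yaml
spec:
  kafka:
    storage:
      type: jbod
      volumes:
        # Each volume becomes a separate log directory; a single partition lives on one volume only
        - id: 0
          type: persistent-claim
          size: 500Gi
          deleteClaim: false
        - id: 1
          type: persistent-claim
          size: 500Gi
          deleteClaim: false
```

After adding or removing JBOD volumes, running a rebalance helps spread partitions evenly across the disks.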
4. Improving Reliability
Reliability is key in any production environment. Kafka is designed to replicate data across brokers, but the default configuration might not be suitable for high-reliability requirements.
Producer Acknowledgments and Replication Factor
With acks=1 (the default in older producer versions), a producer only waits for the partition leader to confirm receipt of a message. In production, it’s advisable to set acks=all (equivalent to -1) so the producer waits for the leader and all in-sync replicas to acknowledge the write.
You should also configure the replication factor for your topics. For most production use cases, a replication factor of 3 ensures high availability and fault tolerance.
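For example, topic-level replication can be declared through a KafkaTopic resource (the topic name, partition count, and cluster label below are placeholder assumptions):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders               # placeholder topic name
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 12             # illustrative partition count
  replicas: 3                # replication factor of 3 for fault tolerance
  config:
    min.insync.replicas: 2   # at least 2 replicas must acknowledge writes made with acks=all
```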
Rolling Updates and Fault Tolerance
Strimzi’s rolling update feature ensures that brokers are updated one by one. It waits for each broker to sync all messages before proceeding to the next broker, ensuring minimal disruption during updates. Configuring a proper min.insync.replicas guarantees that Kafka can continue processing messages even if one broker is down for maintenance.
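These defaults can also be set cluster-wide in the Kafka custom resource; a minimal sketch (the values follow the common three-broker recommendation and are not a requirement):

```yaml
spec:
  kafka:
    replicas: 3
    config:
      default.replication.factor: 3   # applied to automatically created topics
      min.insync.replicas: 2          # tolerates one broker being down during a rolling update
```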
5. CPU and Memory Resource Management
In Kubernetes, managing CPU and memory resources is critical to ensuring Kafka’s stability and performance. Strimzi allows you to specify requests and limits for CPU and memory.
CPU Resources:
Kafka brokers rely heavily on CPU for handling client connections, managing partitions, and replication. Set requests to match the expected baseline usage, and configure limits based on peak performance.
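A sketch of CPU and memory requests and limits in the Kafka custom resource (the values are illustrative assumptions and should be sized from your own baseline and peak measurements):

```yaml
spec:
  kafka:
    resources:
      requests:
        cpu: "2"        # expected baseline usage
        memory: 16Gi
      limits:
        cpu: "4"        # headroom for peak load
        memory: 16Gi
```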
Memory Management:
Kafka uses JVM heap memory, but it also depends heavily on memory outside the heap, most notably the operating system page cache used for log reads and writes. The recommended approach is to configure the heap size for Kafka brokers (traditionally via KAFKA_HEAP_OPTS) while leaving enough memory free for the page cache and other off-heap usage.
In Strimzi, these JVM settings can be specified directly in the Kafka CRD.
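A minimal sketch of heap settings in the Kafka custom resource (4 GiB is an illustrative value, assuming the container memory limit is noticeably larger so the page cache has room):

```yaml
spec:
  kafka:
    jvmOptions:
      "-Xms": "4096m"   # initial heap size
      "-Xmx": "4096m"   # maximum heap size; keep -Xms equal to -Xmx to avoid resizing pauses
```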
6. Dedicated Nodes and Node Affinity
To ensure that Kafka doesn’t compete for resources with other applications, it’s a good idea to dedicate worker nodes to Kafka by using node taints and affinities. Taints prevent non-Kafka workloads from being scheduled on these nodes.
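As a sketch, assuming the dedicated nodes are tainted and labeled with dedicated=kafka (a naming convention chosen here, not something Strimzi mandates), the Kafka pods can be given a matching toleration and node affinity through the pod template:

```yaml
spec:
  kafka:
    template:
      pod:
        tolerations:
          # Allows Kafka pods onto nodes tainted with: dedicated=kafka:NoSchedule
          - key: "dedicated"
            operator: "Equal"
            value: "kafka"
            effect: "NoSchedule"
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    # Only schedule Kafka pods on nodes labeled dedicated=kafka
                    - key: dedicated
                      operator: In
                      values:
                        - kafka
```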
Node affinity ensures that Kafka pods are placed on specific nodes dedicated to Kafka, allowing for better resource isolation and performance.
7. Storage Options for Kafka in Kubernetes
Kafka’s performance depends on the type of storage used. Strimzi supports three types of storage: Ephemeral, Persistent, and JBOD.
Persistent Storage:
For production, always prefer Persistent Volumes (PVs) backed by SSDs. Strimzi makes it easy to configure PVs and Persistent Volume Claims (PVCs) for durable, high-performance storage.
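A minimal sketch of persistent storage in the Kafka custom resource (the size and the fast-ssd StorageClass name are assumptions; use a class that actually exists in your cluster):

```yaml
spec:
  kafka:
    storage:
      type: persistent-claim
      size: 1Ti
      class: fast-ssd      # SSD-backed StorageClass (placeholder name)
      deleteClaim: false   # keep the PV when the cluster is deleted
```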
JBOD Storage:
For scaling storage without the complexity of RAID configurations, JBOD is a recommended approach. JBOD combines multiple disks into a single storage resource, which works well for Kafka brokers.
Avoid NFS:
Network File System (NFS) storage is not ideal for Kafka due to performance bottlenecks. Stick with block storage formatted with a local filesystem such as XFS or ext4.
8. Rack-Awareness and Multi-Zone Deployment
Kafka is designed to be resilient across failure domains such as availability zones. Strimzi supports rack-awareness, which allows you to spread Kafka brokers and replicas across different zones for high availability.
Rack-Awareness Configuration:
Enable rack-awareness in Strimzi to ensure that replicas are not placed on the same failure domain. Strimzi will automatically handle affinities and distribute brokers evenly across zones.
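Enabling rack-awareness is a small addition to the Kafka custom resource; a sketch using the standard zone topology label:

```yaml
spec:
  kafka:
    rack:
      # Each broker derives its rack from this node label, and replicas are spread across zones
      topologyKey: topology.kubernetes.io/zone
```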
9. Disaster Recovery and Backups
Kafka itself does not have built-in disaster recovery mechanisms, so backups are critical for production deployments.
Best Practices:
- Back up the Strimzi custom resources (Kafka, KafkaTopic, KafkaUser, and so on) that define your cluster configuration and metadata.
- Automate regular backups of topic data and Kafka configurations (topic definitions, ACLs, etc.); for topic data, cross-cluster mirroring is one option, sketched below.
- Ensure that your backup strategy covers Zookeeper data (if you’re not using KRaft mode).
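Strimzi does not ship a dedicated backup tool, so one common pattern (an approach chosen here for illustration, not prescribed by the series) is to continuously mirror topic data to a standby cluster using the KafkaMirrorMaker2 resource. The cluster aliases and bootstrap addresses below are placeholders:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: dr-mirror
spec:
  replicas: 1
  connectCluster: "backup"              # the cluster that hosts MirrorMaker 2's own state
  clusters:
    - alias: "primary"
      bootstrapServers: primary-kafka-bootstrap:9092
    - alias: "backup"
      bootstrapServers: backup-kafka-bootstrap:9092
  mirrors:
    - sourceCluster: "primary"
      targetCluster: "backup"
      sourceConnector:
        config:
          replication.factor: 3         # replication factor for mirrored topics on the backup cluster
      checkpointConnector:
        config:
          checkpoints.topic.replication.factor: 3
      topicsPattern: ".*"               # mirror all topics (narrow this to what you need)
      groupsPattern: ".*"               # checkpoint offsets for all consumer groups
```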
Conclusion
Setting up Kafka for production using Strimzi requires attention to detail in several areas—storage configuration, resource management, partition rebalancing, and fault tolerance. By following these best practices, you can ensure that your Kafka cluster is scalable, reliable, and optimized for high throughput in production environments.
Strimzi simplifies much of the complexity, but it’s important to understand the underlying components to make the most of your Kafka deployment. Whether you’re scaling horizontally or managing high availability across multiple zones, these guidelines will help you maintain a robust and resilient Kafka cluster.
For further learning, you can explore the Strimzi Production Ready YouTube playlist, which provided many of the key insights covered in this article.
Credits to the Strimzi Community for their valuable insights in the playlist.