Apache Kafka is widely used for building distributed, real-time data pipelines. However, running Kafka in a production environment requires careful setup to ensure performance, reliability, and scalability. Strimzi simplifies deploying and managing Kafka clusters on Kubernetes or OpenShift, but several factors still need to be considered to make a cluster truly production-ready. In this article, we cover essential best practices and optimizations for getting your Kafka cluster ready for production, based on insights from the Strimzi Production Ready series and broader operational experience.
1. Cluster Balancing
As Kafka clusters grow, partitions might become unevenly distributed across brokers. This can lead to resource bottlenecks, where some brokers handle too many partitions while others remain underutilized.
How to Balance Partitions
Strimzi integrates Cruise Control and exposes it through the KafkaRebalance custom resource to rebalance partitions across brokers. This feature ensures better resource distribution and prevents individual brokers from being overloaded.
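As a rough sketch (assuming Cruise Control is enabled in the Kafka custom resource via spec.cruiseControl, and a cluster named my-cluster, which is a placeholder), a KafkaRebalance resource with explicit goals might look like this:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    # Must match the name of the Kafka custom resource (placeholder here)
    strimzi.io/cluster: my-cluster
spec:
  goals:
    # Only these goals are considered when generating the rebalance proposal
    - NetworkInboundCapacityGoal
    - DiskCapacityGoal
```

Applying the resource asks Cruise Control to generate an optimization proposal, which you then approve by annotating the resource with strimzi.io/rebalance=approve.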
By specifying goals such as NetworkInboundCapacityGoal and DiskCapacityGoal, the rebalancer ensures that broker resources are better utilized, reducing bottlenecks.
Additional Tip: Rebalance Frequency
Rebalance Kafka clusters periodically, especially after adding or removing brokers. However, avoid over-rebalancing, as it can lead to performance degradation during the rebalance process.
2. Optimizing Local Storage
Storage configuration plays a major role in Kafka’s performance. While using local storage may improve I/O performance, there are trade-offs to consider.
Local Storage Pros and Cons
- Advantages: Local storage (e.g., SSDs attached directly to the broker node) offers low-latency read/write operations, reducing the overhead associated with network storage.
- Disadvantages: When a node goes offline for maintenance or crashes, its local storage becomes unavailable, requiring manual cleanup of Persistent Volumes (PV) and Persistent Volume Claims (PVC) in Kubernetes. Moreover, resyncing data to the newly provisioned node can take time.
Recommendation: If you choose local storage, have a robust monitoring and disaster recovery plan. For mission-critical workloads, consider distributed storage that isn’t tied to a single node.
3. Handling Multiple Disks for Kafka
While multiple disks offer more storage, they don’t necessarily improve bandwidth or throughput. Kafka partitions cannot span multiple disks, so you must carefully plan disk usage.
Best Practice: JBOD (Just a Bunch Of Disks)
Using JBOD can help combine multiple disks, but the limitation remains that a single partition will still reside on a single disk. Rebalancing partitions is necessary to ensure even distribution of disk usage.
Additional Insights:
- Using a single large disk per broker is often better from a management and performance perspective, as it avoids idle disks and uneven partition distribution.
- Strimzi makes it easy to configure JBOD in the Kafka CRD (see the example below), and it’s often the best choice for scaling storage without adding complexity.
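As a minimal sketch of JBOD storage in the Kafka custom resource (the volume sizes are illustrative assumptions):

```yaml
spec:
  kafka:
    storage:
      type: jbod
      volumes:
        # Each volume becomes a separate log directory; a single partition lives on one volume only
        - id: 0
          type: persistent-claim
          size: 500Gi
          deleteClaim: false
        - id: 1
          type: persistent-claim
          size: 500Gi
          deleteClaim: false
```

After adding or removing JBOD volumes, running a rebalance helps spread partitions evenly across the disks.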
4. Improving Reliability
Reliability is key in any production environment. Kafka is designed to replicate data across brokers, but the default configuration might not be suitable for high-reliability requirements.
Producer Acknowledgments and Replication Factor
With acks=1 (the default in older producer versions), a producer only waits for the partition leader to confirm receipt of a message. In production, it’s advisable to set acks=all (equivalent to -1) so the producer waits for the leader and all in-sync replicas to acknowledge the write.
You should also configure the replication factor for your topics. For most production use cases, a replication factor of 3 ensures high availability and fault tolerance.
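For example, topic-level replication can be declared through a KafkaTopic resource (the topic name, partition count, and cluster label below are placeholder assumptions):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders               # placeholder topic name
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 12             # illustrative partition count
  replicas: 3                # replication factor of 3 for fault tolerance
  config:
    min.insync.replicas: 2   # at least 2 replicas must acknowledge writes made with acks=all
```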
Rolling Updates and Fault Tolerance
Strimzi’s rolling update feature ensures that brokers are updated one by one. It waits for each broker to sync all messages before proceeding to the next broker, ensuring minimal disruption during updates. Configuring a proper min.insync.replicas guarantees that Kafka can continue processing messages even if one broker is down for maintenance.
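These defaults can also be set cluster-wide in the Kafka custom resource; a minimal sketch (the values follow the common three-broker recommendation and are not a requirement):

```yaml
spec:
  kafka:
    replicas: 3
    config:
      default.replication.factor: 3   # applied to automatically created topics
      min.insync.replicas: 2          # tolerates one broker being down during a rolling update
```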
5. CPU and Memory Resource Management
In Kubernetes, managing CPU and memory resources is critical to ensuring Kafka’s stability and performance. Strimzi allows you to specify requests and limits for CPU and memory.
CPU Resources:
Kafka brokers rely heavily on CPU for handling client connections, managing partitions, and replication. Set requests to match the expected baseline usage, and configure limits based on peak performance.
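A sketch of CPU and memory requests and limits in the Kafka custom resource (the values are illustrative assumptions and should be sized from your own baseline and peak measurements):

```yaml
spec:
  kafka:
    resources:
      requests:
        cpu: "2"        # expected baseline usage
        memory: 16Gi
      limits:
        cpu: "4"        # headroom for peak load
        memory: 16Gi
```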
Memory Management:
Kafka uses JVM heap memory, but it also depends heavily on memory outside the heap, most notably the operating system page cache used for log reads and writes. The recommended approach is to configure the heap size for Kafka brokers (traditionally via KAFKA_HEAP_OPTS) while leaving enough memory free for the page cache and other off-heap usage.
In Strimzi, these JVM settings can be specified directly in the Kafka CRD.
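A minimal sketch of heap settings in the Kafka custom resource (4 GiB is an illustrative value, assuming the container memory limit is noticeably larger so the page cache has room):

```yaml
spec:
  kafka:
    jvmOptions:
      "-Xms": "4096m"   # initial heap size
      "-Xmx": "4096m"   # maximum heap size; keep -Xms equal to -Xmx to avoid resizing pauses
```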
6. Dedicated Nodes and Node Affinity
To ensure that Kafka doesn’t compete for resources with other applications, it’s a good idea to dedicate worker nodes to Kafka by using node taints and affinities. Taints prevent non-Kafka workloads from being scheduled on these nodes.
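As a sketch, assuming the dedicated nodes are tainted and labeled with dedicated=kafka (a naming convention chosen here, not something Strimzi mandates), the Kafka pods can be given a matching toleration and node affinity through the pod template:

```yaml
spec:
  kafka:
    template:
      pod:
        tolerations:
          # Allows Kafka pods onto nodes tainted with: dedicated=kafka:NoSchedule
          - key: "dedicated"
            operator: "Equal"
            value: "kafka"
            effect: "NoSchedule"
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    # Only schedule Kafka pods on nodes labeled dedicated=kafka
                    - key: dedicated
                      operator: In
                      values:
                        - kafka
```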
Node affinity ensures that Kafka pods are placed on specific nodes dedicated to Kafka, allowing for better resource isolation and performance.
7. Storage Options for Kafka in Kubernetes
Kafka’s performance depends on the type of storage used. Strimzi supports three types of storage: Ephemeral, Persistent, and JBOD.
Persistent Storage:
For production, always prefer Persistent Volumes (PVs) backed by SSDs. Strimzi makes it easy to configure PVs and Persistent Volume Claims (PVCs) for durable, high-performance storage.
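A minimal sketch of persistent storage in the Kafka custom resource (the size and the fast-ssd StorageClass name are assumptions; use a class that actually exists in your cluster):

```yaml
spec:
  kafka:
    storage:
      type: persistent-claim
      size: 1Ti
      class: fast-ssd      # SSD-backed StorageClass (placeholder name)
      deleteClaim: false   # keep the PV when the cluster is deleted
```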
JBOD Storage:
For scaling storage without the complexity of RAID configurations, JBOD is a recommended approach. JBOD combines multiple disks into a single storage resource, which works well for Kafka brokers.
Avoid NFS:
Network File System (NFS) storage is not ideal for Kafka due to performance bottlenecks. Stick with block storage formatted with a local filesystem such as XFS or ext4.
8. Rack-Awareness and Multi-Zone Deployment
Kafka is designed to be resilient across failure domains such as availability zones. Strimzi supports rack-awareness, which allows you to spread Kafka brokers and replicas across different zones for high availability.
Rack-Awareness Configuration:
Enable rack-awareness in Strimzi to ensure that replicas are not placed on the same failure domain. Strimzi will automatically handle affinities and distribute brokers evenly across zones.
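Enabling rack-awareness is a small addition to the Kafka custom resource; a sketch using the standard zone topology label:

```yaml
spec:
  kafka:
    rack:
      # Each broker derives its rack from this node label, and replicas are spread across zones
      topologyKey: topology.kubernetes.io/zone
```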
9. Disaster Recovery and Backups
Kafka itself does not have built-in disaster recovery mechanisms, so backups are critical for production deployments.
Best Practices:
- Back up the Strimzi custom resources (Kafka, KafkaTopic, KafkaUser, and so on) that define your cluster configuration and metadata.
- Automate regular backups of topic data and Kafka configurations (topic definitions, ACLs, etc.); for topic data, cross-cluster mirroring is one option, sketched below.
- Ensure that your backup strategy covers Zookeeper data (if you’re not using KRaft mode).
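Strimzi does not ship a dedicated backup tool, so one common pattern (an approach chosen here for illustration, not prescribed by the series) is to continuously mirror topic data to a standby cluster using the KafkaMirrorMaker2 resource. The cluster aliases and bootstrap addresses below are placeholders:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: dr-mirror
spec:
  replicas: 1
  connectCluster: "backup"              # the cluster that hosts MirrorMaker 2's own state
  clusters:
    - alias: "primary"
      bootstrapServers: primary-kafka-bootstrap:9092
    - alias: "backup"
      bootstrapServers: backup-kafka-bootstrap:9092
  mirrors:
    - sourceCluster: "primary"
      targetCluster: "backup"
      sourceConnector:
        config:
          replication.factor: 3         # replication factor for mirrored topics on the backup cluster
      checkpointConnector:
        config:
          checkpoints.topic.replication.factor: 3
      topicsPattern: ".*"               # mirror all topics (narrow this to what you need)
      groupsPattern: ".*"               # checkpoint offsets for all consumer groups
```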
Conclusion
Setting up Kafka for production using Strimzi requires attention to detail in several areas—storage configuration, resource management, partition rebalancing, and fault tolerance. By following these best practices, you can ensure that your Kafka cluster is scalable, reliable, and optimized for high throughput in production environments.
Strimzi simplifies much of the complexity, but it’s important to understand the underlying components to make the most of your Kafka deployment. Whether you’re scaling horizontally or managing high availability across multiple zones, these guidelines will help you maintain a robust and resilient Kafka cluster.
For further learning, you can explore the Strimzi Production Ready YouTube playlist, which provided many of the key insights covered in this article.
Credits to the Strimzi Community for their valuable insights in the playlist.