[Enhancement]: Support for RackAwareDistributionGoal in Cruise Control with Strimzi #10297
Comments
Why doesn't Strimzi support the Also as a side note, you will not have an HA cluster over 2 zones. With RF=4 and min.insync.replicas=2, you can have both in-sync replicas in a single zone. With RF=4 and min.insync.replicas=3, you will not be available after losing one AZ. So you really should use the 3 AZs if you have them. |
Thank you for your response. I thought that if the racks are configured on the broker, then in case of minISR=2 Kafka would also try to distribute the data across the racks/availability zones? Or am I mistaken? My problem is that the third zone is minimally equipped, without a storage system and only with local storage. There are some restrictions in our infrastructure at the moment. Thanks in advance |
The min.insync.replicas are not something Kafka distributes. They happen for various other reasons (broker restart, client configuration, slow networking between AZs etc.). The number is just the minimum number of replicas that have to be in sync to allow producers to produce new messages. It does not take the racks into account. So it can easily happen that you will have both of the in-sync replicas in the same rack/zone and lose not only the availability, but possibly also the data. The only real protection is to have 3 as a minimum. That gives you certainty that at least one will be in each zone and you don't lose any data. But if you lose a whole zone, you will need to, for example, reconfigure the topic to allow producers to work again. So the availability will suffer from that. |
Back to the original question, the @kyguy @ppatierno Is there any reason why |
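The RF=4 arithmetic above can be checked with a small enumeration. This is only an illustrative sketch: the replica names, zone names, and the 2+2 layout are assumptions matching the setup described in this thread, and it counts replicas only (it does not model why replicas fall out of sync).

```python
from itertools import combinations

# Hypothetical layout: RF=4, two replicas per zone (names are made up).
REPLICA_ZONE = {"r0": "zone-a", "r1": "zone-a", "r2": "zone-b", "r3": "zone-b"}


def isr_sets(min_isr):
    """All replica subsets large enough to satisfy min.insync.replicas."""
    replicas = sorted(REPLICA_ZONE)
    return [
        set(combo)
        for size in range(min_isr, len(replicas) + 1)
        for combo in combinations(replicas, size)
    ]


def single_zone_isr_possible(min_isr):
    """True if some allowed in-sync set sits entirely inside one zone."""
    return any(
        len({REPLICA_ZONE[r] for r in isr}) == 1 for isr in isr_sets(min_isr)
    )


def can_meet_min_isr_after_zone_loss(min_isr, lost_zone):
    """True if the surviving replica count can still reach min.insync.replicas."""
    survivors = [r for r, zone in REPLICA_ZONE.items() if zone != lost_zone]
    return len(survivors) >= min_isr


print(single_zone_isr_possible(2))                    # both in-sync replicas may share a zone
print(single_zone_isr_possible(3))                    # three replicas always span both zones
print(can_meet_min_isr_after_zone_loss(3, "zone-a"))  # only two replicas survive
```

With min.insync.replicas=2 a single-zone in-sync set is possible, so a zone loss can take all in-sync copies with it; with 3 the data is in both zones, but producers stop after a zone loss until the topic is reconfigured.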
FYI, from my archives, I see that we encountered this problem in mid-2023: at the time, the
|
@Pinimo And did you add it to the |
@scholzj Thank you very much for the detailed explanation about the Kafka replication mechanism.
cruiseControl:
  config:
    goals: >
      com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareDistributionGoal
    hard.goals: >
      com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareDistributionGoal
But I get the following error within the Cruise Control pod:
I'm using Strimzi Kafka Operator 0.40.0 at the moment. |
As the error suggests -> you need to adjust the |
I'm not sure if I understand the configuration options of Cruise Control correctly. If I configure the following, then everything works:
cruiseControl:
  config:
    default.goals: >
      com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.MinTopicLeadersPerBrokerGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.PreferredLeaderElectionGoal
But the following doesn't work:
cruiseControl:
  config:
    default.goals: >
      com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.MinTopicLeadersPerBrokerGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.PreferredLeaderElectionGoal,
      com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareDistributionGoal
|
You basically need to have in-sync |
Thank you very much for your help. I have now found the correct configuration:
cruiseControl:
  config:
    default.goals: >
      com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareDistributionGoal
    goals: >
      com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareDistributionGoal
    hard.goals: >
      com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareDistributionGoal
|
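For anyone hitting the same error, here is a rough pre-flight check of the three lists. It assumes, based only on what worked and failed in this thread, that every entry in `default.goals` and `hard.goals` must also appear in `goals`; the helper names are made up and this is not Cruise Control's or Strimzi's own validation code. When `goals` is not set, the operator applies its own value, so that list has to be passed in explicitly here (the short `goals` list in the example is a hypothetical stand-in).

```python
PREFIX = "com.linkedin.kafka.cruisecontrol.analyzer.goals."


def parse_goals(value):
    """Split a comma-separated goal list as written in the YAML above."""
    return [goal.strip() for goal in value.split(",") if goal.strip()]


def goals_missing_from_main_list(config):
    """Map 'default.goals'/'hard.goals' to entries that are absent from 'goals'.

    Assumption from this thread: both lists must be contained in 'goals'.
    """
    supported = set(parse_goals(config.get("goals", "")))
    problems = {}
    for key in ("default.goals", "hard.goals"):
        missing = [g for g in parse_goals(config.get(key, "")) if g not in supported]
        if missing:
            problems[key] = missing
    return problems


# Hypothetical stand-in for a main goals list without RackAwareDistributionGoal.
failing = {
    "goals": PREFIX + "RackAwareGoal," + PREFIX + "ReplicaCapacityGoal",
    "default.goals": PREFIX + "RackAwareGoal," + PREFIX + "RackAwareDistributionGoal",
}
# The configuration reported as working above: all three lists agree.
working = {
    "goals": PREFIX + "RackAwareDistributionGoal",
    "default.goals": PREFIX + "RackAwareDistributionGoal",
    "hard.goals": PREFIX + "RackAwareDistributionGoal",
}

print(goals_missing_from_main_list(failing))
print(goals_missing_from_main_list(working))
```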
Let's keep this open to figure out why the RackAwareDistributionGoal goal isn't part of |
@scholzj The Strimzi documentation is a bit misleading
It can be read as "the value is taken from main goals," but from the test, I think it means "uses the predefined value for main goals" (https://strimzi.io/docs/operators/latest/deploying#main-goals) The other misunderstanding comes from https://strimzi.io/docs/operators/latest/deploying#goals_order_of_priority
To save some time, the list of supported goals in Cruise Control is here https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/config/constants/AnalyzerConfig.java#L260 |
@pkleindl I'm not sure I follow your comments. Possibly because I do not know much about Cruise Control and did not write the docs. If you think you can better describe how it is used by Strimzi, feel free to open a PR. |
Discussed on the community call on 10.7.2024: this should wait for the next call where we will hopefully have more Cruise Control SMEs. But it seems like this goal should be enabled by default. |
We can safely add That being said, even though both the
Since they are both
|
Related problem
Our organization operates a Kafka cluster within Kubernetes using the Strimzi Operator. To ensure high availability and fault tolerance, we aim to distribute our Kafka brokers and data across multiple Availability Zones. Currently, we are planning to split the brokers across 2 Availability Zones, with 2 brokers running in each zone. A third Zookeeper will additionally run in the third Availability Zone.
Every availability zone is defined as a rack in Strimzi by using the following topology key:
Cruise Control is a crucial tool for balancing and managing Kafka cluster workloads, but its current implementation in Strimzi lacks support for the RackAwareDistributionGoal. Unfortunately, we cannot use the supported RackAwareGoal because it requires having as many racks as the replication factor of the topics. In our setup, we will set the replication factor to 4 to tolerate the failure of an entire Availability Zone.
The RackAwareDistributionGoal is essential for:
Suggested solution
Integrate the RackAwareDistributionGoal into the Cruise Control configuration within Strimzi. This could involve:
RackAwareDistributionGoal.
Alternatives
No response
Additional context
Impact
This feature will significantly benefit organizations that deploy Kafka clusters across multiple Availability Zones, providing a more robust and fault-tolerant setup. It aligns with the best practices for high availability and disaster recovery in cloud environments.
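The difference between the two goals that the "Related problem" section relies on can be sketched as two placement checks. This is a simplified reading (strict: at most one replica of a partition per rack; relaxed: replicas spread as evenly as possible over the racks), not Cruise Control's actual implementation, and the function names and rack labels are hypothetical.

```python
from collections import Counter


def satisfies_rack_aware(replica_racks):
    """Strict rule: no rack holds more than one replica of the partition."""
    return all(count <= 1 for count in Counter(replica_racks).values())


def satisfies_rack_aware_distribution(replica_racks, all_racks):
    """Relaxed rule: per-rack replica counts differ by at most one."""
    counts = Counter(replica_racks)
    per_rack = [counts.get(rack, 0) for rack in all_racks]
    return max(per_rack) - min(per_rack) <= 1


racks = ["zone-a", "zone-b"]                         # two AZs, as in this issue
even = ["zone-a", "zone-a", "zone-b", "zone-b"]      # RF=4, two replicas per zone
skewed = ["zone-a", "zone-a", "zone-a", "zone-b"]    # RF=4, three replicas in one zone

print(satisfies_rack_aware(even))                       # strict rule cannot hold with 2 racks
print(satisfies_rack_aware_distribution(even, racks))   # relaxed rule accepts 2+2
print(satisfies_rack_aware_distribution(skewed, racks)) # relaxed rule rejects 3+1
```

With two racks and a replication factor of 4, the strict check can never pass, while the relaxed check accepts the 2+2 placement this setup is aiming for.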