-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set default replication factor for CC sample topics #9471
Conversation
076f038
to
5f0d6d7
Compare
1b311e3
to
0dbb79b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder a bit whether this is the right fix. Do we set the replication factor to 2 in Strimzi? Or does Cruise Control do it? If Cruise Control does it, what are the reasons? Isn't then the right fix for Cruise Control to also set the min.insync.replicas
to 1 and thus create a proper topic configuration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cruise Control does not set minISR equal to cluster default, we do.
Cruise Control sets sample.store.topic.replication.factor=2
for these internal topics, and looks at the cluster.configs.file
for cluster configurations. The sample config/clusterConfigs.json
contains min.insync.replicas=1
, which is correct.
IMO, we should simply remove the code I linked, because our default minISR for CC is also 1.
I guess we need to decide what is the right way to handle it. WDYT @ppatierno? |
Going through the Cruise Control docs it looks like that the replication factor is set to 2 for internal topics like |
Cruise Control sets it by default
From what I understand Cruise Control provides a
Yes. We have an issue upstream [2] for this which I am hoping to fix/address. That being said, until that upstream issue is addressed, we could use this fix as a work around or leave the code as it is for the time being.
It is the responsibility of the CC user (or in our case, Strimzi) to maintain the
A partition that is under replicated, 2 replicas behind, will not get flagged by Cruise Control. So (1) The user will not know their partition replication is behind/lagging from CC data (2) Cruise Control will not factor in this partition replication lag when it is creating a propose which could lead to a rebalance which could cause a cluster to become more unbalanced or worse, lose data.
Yes, this is key and why we do it. Cruise Control depends on this information to maintain correct state in its cluster model [1] https://kafka.apache.org/documentation/#brokerconfigs_min.insync.replicas |
I guess I'm a bit confused on how to interpret this.
One has to wonder if we should override it with Kafka's defaults if CC intentionally chose 2. It seems more like it just needs to set the
Does it really make sense to pursue both this fix and the CC fix? Once CC has some default for it, would you need to change this again? Or how would the user choose between using the Kafka's default and the CC default? It does not seem to me like one is workaround of the other. But more like two different solutions from which we will need to pick one.
|
Ok, but with your patch it can still happen that the user sets RF 3 and minISR 1 at the cluster level, causing the same issue with flagging under under replicated partitions. I think we should have the same defaults that CC has (RF 2, minISR 1) when .kafka.replicas < 3, and set stronger defaults (RF 3, minISR 2) when kafka.replicas >=3. Wdyt? |
I don't believe there is any strong reason behind CC's chosen defaults, regardless, I have asked CC maintainers if there is any specific reason why
Yes, this would be the most direct/correct approach if we could set the
Cruise Control logs errors like the following:
but is still able to generate optimization proposals and execute partition rebalances.
Depends on how critical the raised issue is, this PR is a temporary fix for it until:
won't be executed or needed and could be removed.
CC's default takes top precedence, a user could set it themselves, if not, Strimzi would set it with a reasonable default based on the Kafka settings.
This PR is a temporary fix for the problem until we add a proper
Yes, we generate it on the Strimzi side
The CC code does attempt to read the
With or without the patch, a user can configure their cluster in a way to induce under replicated partition issues. The patch is meant to reduce the chance of that happening for CC sample topics by following the cluster level configurations if CC level configurations are not set. With or without the current patch, under replicated partitions are flagged by CC, which is good. However, if we change the solution by hardcoding the minISR for CC at cluster level (via the Strimzi generated
Sounds reasonable to me. When the |
@kyguy thanks for the investigation and the RFE, much appreciated.
Why can't we do that right now and then update to make use of the new configuration? TBH, I don't like the idea that the setup of some internal topics that we create indirectly depends on the user's cluster configuration. |
Because we should not mess with the minISR for CC at a cluster level. We should only alter the minISR for CC at a topic level.
Any topic created without a topic level replication factor config defaults to Kafka cluster level |
This is not what I was suggesting. We can just set a default value at the CC class level, as we already do for minISR.
I think setting RF=2 and minISR=1 as defaults for CC internal topics works for every cluster. If we want to do better, we can set RF=3 and minISR=2 when .kafka.replicas >= 3. |
I am not sure I understand, how would we set the default value at the CC class level? Which minISR setting?
How would we set default minISR as default for CC internal topics? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I see that min.insync.replicas
in /tmp/clusterConfig.json
is never used, because KafkaTopicConfigProvider
is deprecated, and Cruise Control uses KafkaAdminTopicConfigProvider
by default. If you agree, I can open a different PR to remove Cruise Control's cluster.configs.file
configuration and related code.
This means that min.insync.replicas
cluster default always wins, and the fix I had in mind can't work because of that. This also means that your temporary fix makes sense, until we have the new sample.store.topic.min.insync.replicas
configuration.
Makes sense to me, I hadn't realized the
I think this is the best route to go for Strimzi for now, in the meantime, I'll work with upstream Cruise Control maintainers to add a |
This PR is ready to go, it will take care of the problem safely in the short term while I work with CC maintainers on getting a more permanent solution merged upstream [1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking into account the Cruise Control community has't back to us after our comment on the issue, I think we should not wait for them and moving forward with this. We can always revert back if they will get our help to have the feature supported in CC itself.
Can ou please rebase this @kyguy? |
Signed-off-by: Kyle Liberti <[email protected]>
/azp run regression |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thakns @kyguy
Type of change
Description
Addresses #9469
At this time, Cruise Control does not provide a
min.insync.replicas
configuration for sample store topics in its properties file. Therefore, to set themin.insync.replicas
configuration for sample store topics, a user must either:min.insync.replicas
configuration valuemin.insync.replica
configurationThe issue here is that Cruise Control sets the replication factor of sample store topics to
2
by default and the Strimzi example files set themin.insync.replica
of the Kafka clusters to2
. This causes issues when Strimzi preforms rolling updates on the cluster as the sample store topics easily become under replicated.This PR sets the default replication factor of sample store topics,
sample.store.topic.replication.factor
, to match that of the Kafka cluster'sdefault.replication.factor
.Checklist
Please go through this checklist and make sure all applicable tasks have been done