-
Notifications
You must be signed in to change notification settings - Fork 698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-16116: Use apache curator to manage the Solr Zookeeper interactions #760
Conversation
solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerRoleCmd.java
Outdated
Show resolved
Hide resolved
leader election code is not broken, it was that the sessionTimeout was not being populated in SolrZkClient. |
@HoustonPutman I haven't looked super closely at this, but I think that Hadoop shades curator when needed. Its possible that the Solr Hadoop classes just need to have the imports updated to use the Hadoop specific shaded curator paths. This is new with the Hadoop auth and HDFS modules in Solr when we upgraded Hadoop. I tried to switch to the Hadoop self contained shaded dependencies to avoid issues like this. |
@risdenk I believe the issue currently is that for hadoop-auth we are not able to rely solely on |
I propose we make the loss of the reconnect boolean its own PR, plus the methods that shielded use of ZkCmdExecutor. That will make this effort here more independently reviewable and would be nice in its own right. |
var clientBuilder = CuratorFrameworkFactory.builder() | ||
.ensembleProvider(new FixedEnsembleProvider(zkHost)) | ||
.namespace(chroot) | ||
.sessionTimeoutMs(zkClientTimeout) | ||
.connectionTimeoutMs(clientConnectTimeout) | ||
.aclProvider(this.aclProvider) | ||
.authorization(zkCredentialsProvider.getCredentials()) | ||
.retryPolicy(retryPolicy); | ||
|
||
client = clientBuilder.build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recently worked on an experimental hack for some days on 8x to have the Overseer use Curator for elections. One thing that I got stuck on was that Solr's tests were finding some threads lingering. Eventually I discovered that Curator has an internal executor and a way to set it on the builder runSafeService
. Once I did that and tended to ensuring this executor got shot down, I didn't have that problem anymore. Just sharing this with you now in hopes it helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW #1743 used this runSafeService stuff for the hadoop-auth changes. I think it might only come into play when there is an external zkClient provided but good to know either way.
hadoop 3.3.6 upgraded curator to 5.2.0 - #1743. I'm working on getting that in. chatted w/ @HoustonPutman about it as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great. I know it needs to be brought up to main again but 🥳
solr/test-framework/src/java/org/apache/solr/cloud/MiniSolrCloudCluster.java
Outdated
Show resolved
Hide resolved
#1743 is in main and so this PR should be ready to update :D |
I merged main (after hadoop 3.3.6/curator 5.x) and then fixed up different things after that to get all the tests passing including nightly. Would appreciate another set of eyes @HoustonPutman @dsmiley I'll look through this again to see if things can be cleaned up more, but hopefully its progress. |
solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/BeforeReconnect.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note a number of feedback comments aren't addressed.
@@ -23,6 +23,8 @@ Improvements | |||
|
|||
* SOLR-17077: When a shard rejoins leader election, leave previous election only once to save unneeded calls to Zookeeper. (Pierre Salagnac) | |||
|
|||
* SOLR-16116: Apache Curator is now used to manage all Solr Zookeeper interactions. (Houston Putman, Kevin Risden, Mike Drob, David Smiley) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my role was minor.
BTW this PR adds a dependency to Curator from solrj-zookeeper. We should highlight solrj dependencies with more publicity than Solr server.
44199a9
to
31ebd14
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note a number of feedback comments aren't addressed.
Think I got all of them. Let me know if I missed any
solr/solrj-zookeeper/build.gradle
Outdated
@@ -32,6 +32,13 @@ dependencies { | |||
|
|||
implementation project(':solr:solrj') | |||
|
|||
api('org.apache.curator:curator-client', { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually implementation should be ok for the curator-client
. For curator-framework
, that wouldn't be great because of the SolrZkClient.multi()
function parameters.
solr/test-framework/build.gradle
Outdated
@@ -43,6 +43,17 @@ dependencies { | |||
var zkExcludes = { | |||
exclude group: "org.apache.yetus", module: "audience-annotations" | |||
} | |||
api('org.apache.curator:curator-client', { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed all curator deps here to "implementation"
new CloudHttp2SolrClient.Builder(Collections.singletonList(zkHost), Optional.empty()) | ||
.withZkConnectTimeout(1, TimeUnit.SECONDS) | ||
.build()) { | ||
cloudSolrClient.getById(collection, "1"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the point was to use stream API I suppose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I guess so, but it is a slow test because of you can't override the ZkConnectTimeout of the Streaming client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll revert. The test will just be slow until https://issues.apache.org/jira/browse/CURATOR-720 is implemented.
implementation('org.apache.curator:curator-client', { | ||
exclude group: 'org.apache.zookeeper', module: 'zookeeper' | ||
}) | ||
api('org.apache.curator:curator-framework', { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned before, it's in the API contract. Solr-core actually uses it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly a matter of taste; nothing is breaking based on this decision (I think). Solr-core already has an explicit dependency so that point is moot. I believe the impact would be some SolrJ user who adds solrj-zookeeper because they want to use ZK coordinates for CloudSolrClient. Their runtime class path will have Curator either way. But do we want such a user to have their compile-time class-path with Curator as well? Like; are they likely to use Curator explicitly? If we don't think it's likely, I think "implementation" best communicates an internal scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compilation vs runtime dependency doesn't have a huge effect on users, but I'll change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, tried it, and it won't work. Our ZkACLProvider
is a sub-class of Curator's ACLProvider
. Therefore, people using solrj-zookeeper and ACLs will be unable to compile without the api dependency. Changing back.
solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/DigestZkACLProvider.java
Outdated
Show resolved
Hide resolved
solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/DigestZkCredentialsProvider.java
Outdated
Show resolved
Hide resolved
solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/SolrZkClient.java
Outdated
Show resolved
Hide resolved
solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/SolrZkClient.java
Show resolved
Hide resolved
solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/SolrZkClient.java
Outdated
Show resolved
Hide resolved
solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/SolrZkClient.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @HoustonPutman. I found an old PR for removing connection loss stuff that conflicts but just a reminder we might want to do this cleanup after this gets merged as well
For some reason this PR breaks |
Well apparently actually using PRS breaks a good amount of things.... lovely. I'm going to undo that, and just add a |
This reverts commit 72d9623.
Possibility of relating to this one.
|
@iamsanjay , this was a small race condition waiting for the connection to come back. I added a wait, since I don't think this is what the test was intending to test. Commit: 0f8decd |
Thanks @HoustonPutman, |
Yes, that is absolutely related, but I thoroughly tested that before pushing.... 🙄 I'll take a look. |
https://issues.apache.org/jira/browse/SOLR-16116
This is still a WIP there are two outstanding issues:
Might look into using curator recipes for persistent watchers and leader election, but this is a fairly complete migration to start with.