Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-16116: Use apache curator to manage the Solr Zookeeper interactions #760

Merged
merged 47 commits into from
Oct 30, 2024

Conversation

HoustonPutman
Copy link
Contributor

@HoustonPutman HoustonPutman commented Mar 24, 2022

https://issues.apache.org/jira/browse/SOLR-16116

This is still a WIP there are two outstanding issues:

  • The hadoop-auth framework requires a different version of curator
  • The leader election code is slightly broken, because the OnReconnect logic in ZKController expects to be called at different times, with different guarantees of watchers and ephemeral nodes than Zookeeper gives.

Might look into using curator recipes for persistent watchers and leader election, but this is a fairly complete migration to start with.

@HoustonPutman
Copy link
Contributor Author

leader election code is not broken, it was that the sessionTimeout was not being populated in SolrZkClient.

@risdenk
Copy link
Contributor

risdenk commented Apr 12, 2022

@HoustonPutman I haven't looked super closely at this, but I think that Hadoop shades curator when needed. Its possible that the Solr Hadoop classes just need to have the imports updated to use the Hadoop specific shaded curator paths. This is new with the Hadoop auth and HDFS modules in Solr when we upgraded Hadoop. I tried to switch to the Hadoop self contained shaded dependencies to avoid issues like this.

@madrob
Copy link
Contributor

madrob commented Apr 12, 2022

@risdenk I believe the issue currently is that for hadoop-auth we are not able to rely solely on hadoop-client-* dependencies, which are the ones that would use a relocated curator. All of the auth filter and kerberos logic and whatever else we borrow from hadoop is annotated with audience LimitedPrivate I think, so they didn't bother making a consumable client module for people to use.

@HoustonPutman
Copy link
Contributor Author

Thanks for the comment @risdenk . I was talking to @madrob about it, but he hinted that there were other problems with using the shaded dependencies. However that might no longer be applicable, so I will definitely give it a shot in a week or month (things getting a bit busy haha).

@dsmiley
Copy link
Contributor

dsmiley commented Feb 19, 2023

I propose we make the loss of the reconnect boolean its own PR, plus the methods that shielded use of ZkCmdExecutor. That will make this effort here more independently reviewable and would be nice in its own right.

Comment on lines 210 to 219
var clientBuilder = CuratorFrameworkFactory.builder()
.ensembleProvider(new FixedEnsembleProvider(zkHost))
.namespace(chroot)
.sessionTimeoutMs(zkClientTimeout)
.connectionTimeoutMs(clientConnectTimeout)
.aclProvider(this.aclProvider)
.authorization(zkCredentialsProvider.getCredentials())
.retryPolicy(retryPolicy);

client = clientBuilder.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recently worked on an experimental hack for some days on 8x to have the Overseer use Curator for elections. One thing that I got stuck on was that Solr's tests were finding some threads lingering. Eventually I discovered that Curator has an internal executor and a way to set it on the builder runSafeService. Once I did that and tended to ensuring this executor got shot down, I didn't have that problem anymore. Just sharing this with you now in hopes it helps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW #1743 used this runSafeService stuff for the hadoop-auth changes. I think it might only come into play when there is an external zkClient provided but good to know either way.

@risdenk
Copy link
Contributor

risdenk commented Oct 3, 2023

hadoop 3.3.6 upgraded curator to 5.2.0 - #1743. I'm working on getting that in. chatted w/ @HoustonPutman about it as well

Copy link
Contributor

@risdenk risdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great. I know it needs to be brought up to main again but 🥳

@risdenk
Copy link
Contributor

risdenk commented Oct 4, 2023

#1743 is in main and so this PR should be ready to update :D

@risdenk
Copy link
Contributor

risdenk commented Oct 10, 2023

I merged main (after hadoop 3.3.6/curator 5.x) and then fixed up different things after that to get all the tests passing including nightly. Would appreciate another set of eyes @HoustonPutman @dsmiley

I'll look through this again to see if things can be cleaned up more, but hopefully its progress.

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note a number of feedback comments aren't addressed.

@@ -23,6 +23,8 @@ Improvements

* SOLR-17077: When a shard rejoins leader election, leave previous election only once to save unneeded calls to Zookeeper. (Pierre Salagnac)

* SOLR-16116: Apache Curator is now used to manage all Solr Zookeeper interactions. (Houston Putman, Kevin Risden, Mike Drob, David Smiley)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my role was minor.

BTW this PR adds a dependency to Curator from solrj-zookeeper. We should highlight solrj dependencies with more publicity than Solr server.

Copy link
Contributor Author

@HoustonPutman HoustonPutman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note a number of feedback comments aren't addressed.

Think I got all of them. Let me know if I missed any

@@ -32,6 +32,13 @@ dependencies {

implementation project(':solr:solrj')

api('org.apache.curator:curator-client', {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually implementation should be ok for the curator-client. For curator-framework, that wouldn't be great because of the SolrZkClient.multi() function parameters.

@@ -43,6 +43,17 @@ dependencies {
var zkExcludes = {
exclude group: "org.apache.yetus", module: "audience-annotations"
}
api('org.apache.curator:curator-client', {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed all curator deps here to "implementation"

new CloudHttp2SolrClient.Builder(Collections.singletonList(zkHost), Optional.empty())
.withZkConnectTimeout(1, TimeUnit.SECONDS)
.build()) {
cloudSolrClient.getById(collection, "1");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the point was to use stream API I suppose

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I guess so, but it is a slow test because of you can't override the ZkConnectTimeout of the Streaming client.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revert. The test will just be slow until https://issues.apache.org/jira/browse/CURATOR-720 is implemented.

implementation('org.apache.curator:curator-client', {
exclude group: 'org.apache.zookeeper', module: 'zookeeper'
})
api('org.apache.curator:curator-framework', {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implementation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned before, it's in the API contract. Solr-core actually uses it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly a matter of taste; nothing is breaking based on this decision (I think). Solr-core already has an explicit dependency so that point is moot. I believe the impact would be some SolrJ user who adds solrj-zookeeper because they want to use ZK coordinates for CloudSolrClient. Their runtime class path will have Curator either way. But do we want such a user to have their compile-time class-path with Curator as well? Like; are they likely to use Curator explicitly? If we don't think it's likely, I think "implementation" best communicates an internal scope.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compilation vs runtime dependency doesn't have a huge effect on users, but I'll change it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, tried it, and it won't work. Our ZkACLProvider is a sub-class of Curator's ACLProvider. Therefore, people using solrj-zookeeper and ACLs will be unable to compile without the api dependency. Changing back.

Copy link
Contributor

@risdenk risdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @HoustonPutman. I found an old PR for removing connection loss stuff that conflicts but just a reminder we might want to do this cleanup after this gets merged as well

#2004

@HoustonPutman
Copy link
Contributor Author

For some reason this PR breaks TestPullReplicaErrorHandling.testCloseHooksDeletedOnReconnect() on a seed that makes PRS enabled by default. From what I can tell there is an error in CreateCollectionCmd, such that PRS being turned on via the default doesn't actually work all the way. I'll try to fix that so that merging this will pass.

@HoustonPutman
Copy link
Contributor Author

Well apparently actually using PRS breaks a good amount of things.... lovely. I'm going to undo that, and just add a @NoPRS tag to the test.

@github-actions github-actions bot removed the cat:api label Oct 30, 2024
@HoustonPutman HoustonPutman merged commit e5d15cc into apache:main Oct 30, 2024
4 checks passed
@HoustonPutman HoustonPutman deleted the curator branch October 30, 2024 21:55
@iamsanjay
Copy link
Contributor

Possibility of relating to this one.
Build scan data https://ge.apache.org/s/63ya5yavjnh5o

Error from server at http://127.0.0.1:43991/solr: Cannot talk to ZooKeeper - Updates are disabled.

org.apache.solr.cloud.BasicDistributedZk2Test.test (:solr:core)
    Test history: https://ge.apache.org/scans/tests?search.rootProjectNames=solr-root&tests.container=org.apache.solr.cloud.BasicDistributedZk2Test&tests.test=test http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.cloud.BasicDistributedZk2Test.test
    Test output: /tmp/src/solr/solr/core/build/test-results/test/outputs/OUTPUT-org.apache.solr.cloud.BasicDistributedZk2Test.txt
    Reproduce with: ./gradlew :solr:core:test --tests "org.apache.solr.cloud.BasicDistributedZk2Test.test" -Ptests.jvms=96 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=5426200A04BA10C -Ptests.timeoutSuite=600000! -Ptests.file.encoding=UTF-8

@HoustonPutman
Copy link
Contributor Author

@iamsanjay , this was a small race condition waiting for the connection to come back. I added a wait, since I don't think this is what the test was intending to test. Commit: 0f8decd

@iamsanjay
Copy link
Contributor

Thanks @HoustonPutman,
Do you think this one also related?
TestPullReplicaErrorHandling#testCloseHooksDeletedOnReconnect

@HoustonPutman
Copy link
Contributor Author

Yes, that is absolutely related, but I thoroughly tested that before pushing.... 🙄 I'll take a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants