admin client - failure to create topics or delete topics (error code 41) #865
Conversation
@ods I feel this PR should definitely be part of 0.8.0 |
@tvoinarovskyi Please review |
Codecov Report
```diff
@@           Coverage Diff           @@
##           master     #865   +/-   ##
=======================================
  Coverage   97.55%   97.55%
=======================================
  Files          28       28
  Lines        5389     5391    +2
=======================================
+ Hits         5257     5259    +2
  Misses        132      132
```
@kbhatiya999 Could you please describe what problem you are solving? Is it possible to reproduce it? |
I fixed the failure to create or delete topics using the admin client. To reproduce it, you just need a Kafka cluster with more than one broker: use the admin client to create or delete a topic and it will fail most of the time. Reason: the create/delete request must go to the controller broker, but since we are sending it to a random broker it fails. I changed the code so that create/delete always hits the controller broker. |
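For anyone trying to reproduce this, a minimal sketch along the following lines should do it, using aiokafka's AIOKafkaAdminClient and NewTopic. The bootstrap address and topic name are placeholder examples; it assumes a cluster with more than one broker is reachable at localhost:9092.

```python
# Hypothetical reproduction sketch: with more than one broker, the create/delete
# request may land on a broker that is not the controller and come back with
# error code 41 (NOT_CONTROLLER).
import asyncio

from aiokafka.admin import AIOKafkaAdminClient, NewTopic


async def main():
    admin = AIOKafkaAdminClient(bootstrap_servers="localhost:9092")
    await admin.start()
    try:
        await admin.create_topics(
            [NewTopic(name="repro-topic", num_partitions=1, replication_factor=1)]
        )
        await admin.delete_topics(["repro-topic"])
    finally:
        await admin.close()


asyncio.run(main())
```

Running this a few times against a two- or three-broker cluster should fail intermittently with the current random-broker behaviour.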
Hello! We also hit this bug with error code 41 while trying to create a topic on a distributed Kafka cluster. The PR seems to be a solution for that. Are there any plans to merge this fix in the near future? |
@DMantis you might have to fork the library while you wait. I am really surprised others are not facing this issue (probably they are not using the admin client, just the producer and consumer). |
@DMantis If possible, could you please add sample code and its output demonstrating the problem? I haven't worked with Kafka for some time now, so I will have to set it up from scratch. |
@ods What will you need for this PR to be approved? |
@kbhatiya999, thank you for the reminder about this problem. I've done some research, and it looks like the main point of the PR is correct, i.e. we have to send the request to the controller node. But we certainly have to apply the same approach in other methods too (delete_topics, create_partitions, etc.). Also, we probably want to handle the NOT_CONTROLLER (41) error (refresh metadata and retry?) in this case. Also, it would be nice to have some simple way to check that the problem exists with the current code and is solved after the fix. I understand that it's very difficult to provide CI tests for this case, but I hope something like a docker-compose.yml plus a script to check it manually is possible. This is the most difficult part. I could probably finish the rest if we have tests. Does it make sense to you? Please describe your vision of how to proceed. |
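To make the "refresh metadata and retry" idea concrete, here is a rough sketch of what such handling could look like inside the admin client. The helper names (_get_cluster_metadata, _client, get_error_codes) and the retry count are assumptions for illustration, not the final implementation, and the per-topic error field differs between request types and API versions.

```python
# Sketch only: send a controller-bound admin request and retry if any per-topic
# error code comes back as NOT_CONTROLLER (41), refreshing cluster metadata
# first so the new controller is picked up.
NOT_CONTROLLER = 41  # Kafka protocol error code


async def _send_to_controller(self, request, get_error_codes, retries=2):
    response = None
    for _ in range(retries):
        metadata = await self._get_cluster_metadata()  # assumed helper
        response = await self._client.send(metadata.controller_id, request)
        # get_error_codes extracts the per-topic error codes; the response
        # field name varies (e.g. topic_errors vs. topic_error_codes).
        if NOT_CONTROLLER not in get_error_codes(response):
            break
    return response
```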
@ods It might take me some time (I have left that project, as it was completed). I will come back with a docker-compose.yml and script. |
Has there been any progress on this? I can also confirm that this change fixes the issue. librdkafka opts to always use the controller ID node as the target for admin requests, so we can fix all of the call sites by modifying one function:

```python
async def _send_request(
        self,
        request: Request,
        node_id: Optional[int] = None) -> Response:
    if node_id is None:
        metadata = await self._get_cluster_metadata()
        node_id = metadata.controller_id
    return await self._client.send(node_id, request)
```

Handling error code 41 in the call may be challenging since the user can pass multiple topics and each one technically has its own associated error code. As an end-user, I would be fine handling the retry logic. |
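For the end-user retry mentioned above, something along these lines could work. It assumes create_topics returns the CreateTopics response and that per-topic results are exposed as (topic, error_code, ...) tuples under topic_errors, which may differ across aiokafka and protocol versions; the function name and the attempts/delay parameters are made up for the example.

```python
# Hypothetical user-side retry: re-submit only the topics that failed with
# NOT_CONTROLLER (41); other per-topic errors stay in the returned response
# for the caller to inspect.
import asyncio

NOT_CONTROLLER = 41


async def create_topics_with_retry(admin, new_topics, attempts=3, delay=1.0):
    pending = {t.name: t for t in new_topics}
    response = None
    for _ in range(attempts):
        response = await admin.create_topics(list(pending.values()))
        # Assumed response shape: per-topic tuples of (topic, error_code, ...).
        failed = [err[0] for err in response.topic_errors if err[1] == NOT_CONTROLLER]
        if not failed:
            break
        pending = {name: pending[name] for name in failed}
        await asyncio.sleep(delay)
    return response
```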
@y4n9squared I will update the PR with details that will help the maintainer determine whether the PR is ready for merge. |
Any update on this? |
@ods I have taken a look at the test setup, and an approach would be:
It is quite some work. I can give it a shot, but I would like to know if you agree with the approach. |
I completely agree with moving to docker compose. Actually, we have problems with the current approach that could be solved by this move.
It doesn't look like the right way to me. Nowadays it's common to run tests in one of the services declared in docker-compose.yml. This approach has many advantages compared to the current one: better reproducibility of the environment, fully automated setup for newcomers, better observability and simpler debugging. Docker compose takes responsibility for waiting for readiness, so we don't need code in the tests for that. Actually, moving to docker compose is on the roadmap for aiokafka. But as you said, it's quite a big task, as all of the CI would have to be adjusted accordingly, either by using a docker-in-docker approach or by duplicating the services setup in the CI config. |
I think a temporary docker-compose.yml to reproduce the problem and check that the PR fixes it (without automating it and incorporating it in CI) would be enough for now. |
@ods About running the tests inside a service that is part of the docker-compose.yml: it is the way we work at my company, so we have quite a bit of experience with it. It would be easier, but it has some cons too. Some details/feedback on it:
Most of the "issues" are around mounting the project folder. A copy would play nicer, but it slows down the dev cycle a bit (you need to run the docker build every time you change a file). |
@kbhatiya999 @y4n9squared Any tips to reproduce the issue? What version of brokers do you use in your clusters? |
I am seeing that in the Scala code of the broker: https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/KafkaApis.scala |
Here is the change on the broker side: https://issues.apache.org/jira/browse/KAFKA-10181 |
I am still failing to reproduce the issue |
I see the Jira issue (KAFKA-9705 - Zookeeper mutation protocols should be redirected to Controller only) is not resolved yet. Still, @ods, I have been playing with the compose here (#973), from version 1.1.1 to 2.8.1, without being able to reproduce the problem. Maybe something is wrong with my test, but so far I am not able to reproduce it. |
In my situation, we are using the client library against RedPanda 23.2. RedPanda is a completely separate implementation but complies with the Kafka API. This issue being particular to KRaft-based clusters seems to track. |
@y4n9squared I tested with a KRaft cluster and it still works, see this branch (https://github.com/aio-libs/aiokafka/pull/973/files). Maybe you should try a similar docker compose file but with RedPanda? I am also not sure how far the goal of the aiokafka client goes in being compatible with different servers. |
As a general principle, I agree that the client SDK shouldn't be accommodating implementation idiosyncrasies. My impression was that the API dictated that Create/DeleteTopic requests must go to the controller, but I cannot find the KIP. However, implementations in other libraries seem to do this already: |
I don't mind adding a recent version of RedPanda to CI, if somebody is ready to volunteer implementing and supporting it. |
@vmaurin Is this a change that you would still like to merge? All that I can do is attest that the current behavior is not usable for multi-node RedPanda clusters and that the other Python client SDKs do not have this issue as they only forward RPC commands to the controller node. |
I opened this issue in 2022 - it has been open for two years. @ods Meanwhile, I will update the PR. |
Raised a new issue corresponding to this: #995. All the details, as well as sample code, are mentioned there. |
Use the controller node for create and delete topic instead of a random broker
Fixes error code 41 on create and delete topic