[RIP-72] : Add a message gray solution , compatible with RocketMQ 4.x and RocketMQ 5.x #8469

Open
wants to merge 4 commits into
base: develop

Contributor

@syhleo syhleo commented Jul 31, 2024

Which Issue(s) This PR Fixes

[Enhancement] Add message gray strategy solution #8468

Brief Description

A solution for message grayscale (gray) release.
This PR provides an extensible message grayscale solution for grayscale message publishing on RocketMQ 4.x and RocketMQ 5.x. It is compatible with both POP and Push consumption modes. Whether rebalancing happens on the client or on the server, the solution implements message grayscale in a lightweight way.
The solution has been widely applied and verified in our projects, which we take as evidence of its reliability, security, and stability.

How Did You Test This Change?

[7 images]

Easy access. Business teams only need to set a few client configurations, such as enableGraySwitch and grayTag, to use RocketMQ's grayscale publishing capability and enable full-link grayscale publishing. For details, see org.apache.rocketmq.example.gray.
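As a rough illustration of the two settings named above, the sketch below models them as a plain value object loaded from properties. It does not use the PR's actual client API: the class `GrayClientSettings`, the property-key spelling, the default values, and the rule that `grayTag` must be non-empty when the switch is on are all assumptions made for this example.

```java
import java.util.Properties;

/** Hypothetical holder for the two client settings named in the PR description. */
final class GrayClientSettings {
    final boolean enableGraySwitch; // whether this client takes part in grayscale
    final String grayTag;           // label for the grayscale lane (format is an assumption)

    GrayClientSettings(boolean enableGraySwitch, String grayTag) {
        this.enableGraySwitch = enableGraySwitch;
        this.grayTag = grayTag;
    }

    /** Reads both settings from properties; key names mirror the config names above. */
    static GrayClientSettings fromProperties(Properties props) {
        boolean enabled = Boolean.parseBoolean(props.getProperty("enableGraySwitch", "false"));
        String tag = props.getProperty("grayTag", "");
        // Assumed validation rule: a gray client must say which lane it belongs to.
        if (enabled && tag.isEmpty()) {
            throw new IllegalArgumentException("grayTag must be set when enableGraySwitch is true");
        }
        return new GrayClientSettings(enabled, tag);
    }

    /** In this sketch a client counts as a grayscale client only when the switch is on. */
    boolean isGrayClient() {
        return enableGraySwitch;
    }
}
```

Leaving both settings unset yields a normal (non-gray) client in this sketch, which matches the idea that grayscale is opt-in per client.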

@HelloGitbin

perfect

@syhleo syhleo changed the title [ISSUE #8468]: Add a message gray solution [ISSUE #8468]: Add a message gray solution , compatible with POP consumption mode and Push consumption mode Jul 31, 2024
@yx9o
Contributor

yx9o commented Aug 1, 2024

@syhleo Hello, I have a question. If the producer is configured with grayscale, but the consumer is not configured with grayscale, then the data in the grayscale queue will also be consumed by normal consumers. The same is true in reverse. Is this in line with expectations?

@Qoozm
Contributor

Qoozm commented Aug 1, 2024

Hi, I found a note about this in the issue's "Describe the Solution" section; it seems to be within expectations.
[image]

@syhleo
Contributor Author

syhleo commented Aug 1, 2024

Of course.
RocketMQ triggers rebalancing when no grayscale consumer clients remain (for example, the consumer goes fully online after grayscale verification passes, or the grayscale consumer goes offline abnormally). During rebalancing, it detects that none of the consumer clientIds contains the @gray identifier, so the normal consumer clients immediately take over consumption, including any messages left in the grayscale queue, to ensure that no messages are lost.

See the related documentation: Yuque (语雀)
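The takeover rule described above can be sketched as follows. This is a simplified model, not the PR's rebalance code: the class `GrayTakeoverSketch`, its method names, and the idea that gray and normal queues are passed in as separate lists of queue ids are assumptions for illustration. Only the `@gray` marker in the clientId comes from this thread.

```java
import java.util.ArrayList;
import java.util.List;

final class GrayTakeoverSketch {
    static final String GRAY_MARKER = "@gray"; // identifier mentioned in this thread

    static boolean isGray(String clientId) {
        return clientId.contains(GRAY_MARKER);
    }

    /** True when no client in the consumer group carries the gray marker. */
    static boolean normalConsumersTakeOver(List<String> clientIds) {
        return clientIds.stream().noneMatch(GrayTakeoverSketch::isGray);
    }

    /**
     * Queues the normal consumers of a group should consume after a rebalance.
     * Which queue ids are "gray" is supplied by the caller (an assumption of this sketch).
     */
    static List<Integer> queuesForNormalConsumers(List<String> clientIds,
                                                  List<Integer> normalQueueIds,
                                                  List<Integer> grayQueueIds) {
        List<Integer> result = new ArrayList<>(normalQueueIds);
        if (normalConsumersTakeOver(clientIds)) {
            // No gray consumer is left, so normal consumers also drain the gray queues.
            result.addAll(grayQueueIds);
        }
        return result;
    }
}
```

While at least one gray client is present, normal consumers get only the normal queues; once the last gray client disappears, the next rebalance hands them the gray queues as well.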

@yx9o
Contributor

yx9o commented Aug 1, 2024

If grayscale messages and normal messages have inconsistent data formats, and grayscale is enabled for production but not for consumption, will the normal consumers fail to parse or process the grayscale data they consume?

@syhleo
Contributor Author

syhleo commented Aug 1, 2024

Yes. If only the producer is in grayscale and the consumer is not, no grayscale consumer exists at all, so all messages are consumed by the normal consumers. However, a change to the MQ message data format requires the producer side and the consumer side to be adjusted together. If, in your business scenario, the producer's grayscale messages and ordinary messages have different data formats, the consumer side needs to go through grayscale together with the producer, so that the grayscale consumer consumes grayscale messages (data format A) and the normal consumer consumes normal messages (data format B). This is the full-link grayscale release that we have been advocating.

@HelloGitbin

I tried it. It works.

@yx9o
Contributor

yx9o commented Aug 6, 2024

What I want is for the grayscale queue to be consumed only by grayscale consumers. When there is no grayscale consumer, I don't want normal consumers to consume both the grayscale queue and the ordinary queue, because that mixes the two kinds of data together.

@syhleo
Contributor Author

syhleo commented Aug 6, 2024

Implementing it that way causes problems. Imagine that your producer service is in a grayscale environment and sends a steady stream of grayscale messages, while your consumer service is not in grayscale (it has no release requirement at all and has always been non-grayscale). If the grayscale queue could only be consumed by grayscale consumers, a large number of messages would accumulate in it. During grayscale, the grayscale queue is consumed only by grayscale consumers, provided that grayscale consumers exist. If they exist, it is absolutely guaranteed that the grayscale queue is consumed only by them. What we need is that, during a full-link grayscale release, messages sent by grayscale producers are consumed precisely by grayscale consumers, and messages sent by non-grayscale producers are consumed precisely by non-grayscale consumers, so that grayscale verification of MQ message changes is preserved in the business. To ensure normal switching between grayscale and non-grayscale, consumers in the same group detect whether any consumer in the group carries the grayscale identifier, and from that determine whether non-grayscale consumers need to take over the grayscale queue's messages.
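For the producer half of this routing, here is a minimal sketch of how a sender might pick a queue so that grayscale messages land only in grayscale queues. The thread does not say how the PR marks queues as gray, so this example simply assumes the last `grayQueues` queue ids of a topic are reserved for grayscale; the class `GrayQueueSelectorSketch` and its methods are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class GrayQueueSelectorSketch {
    private final int totalQueues;
    private final int grayQueues; // assumption: the last grayQueues queue ids are gray
    private final AtomicInteger counter = new AtomicInteger();

    GrayQueueSelectorSketch(int totalQueues, int grayQueues) {
        if (grayQueues <= 0 || grayQueues >= totalQueues) {
            throw new IllegalArgumentException("need 0 < grayQueues < totalQueues");
        }
        this.totalQueues = totalQueues;
        this.grayQueues = grayQueues;
    }

    boolean isGrayQueue(int queueId) {
        return queueId >= totalQueues - grayQueues;
    }

    /** Picks a queue id inside the partition that matches the producer's gray flag. */
    int select(boolean grayProducer) {
        // Mask the sign bit so the value stays non-negative after counter overflow.
        int n = counter.getAndIncrement() & Integer.MAX_VALUE;
        int normalQueues = totalQueues - grayQueues;
        return grayProducer ? normalQueues + (n % grayQueues) : n % normalQueues;
    }
}
```

With 8 queues and 2 reserved for grayscale, a gray producer only ever selects queue 6 or 7 and a normal producer only queues 0 to 5, which is the separation the consumer-side rule relies on.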

@syhleo
Contributor Author

syhleo commented Aug 6, 2024

This solution is already running in our production environment, which supports its stability and security, and it is low-cost to adopt.

@yx9o
Contributor

yx9o commented Aug 6, 2024

Our usage scenarios are different. In our production implementation, grayscale messages can only be consumed by grayscale consumers, and grayscale and ordinary messages are kept separate. Otherwise they get mixed and grayscale loses its meaning.

@syhleo
Contributor Author

syhleo commented Aug 6, 2024

Well, each scheme has advantages and disadvantages. Schemes such as shadow topics and groups have two problems. First, there is a critical problem at the switchover point: when grayscale verification ends and traffic switches to prod, some messages may go unconsumed, which is unacceptable. Second, there is cost: at a relatively large business scale, the cost of doubling every topic and group cannot be ignored.
This solution lets RocketMQ support grayscale natively through gray partitioning, so that users get low-cost, easy access.

@syhleo syhleo changed the title [ISSUE #8468]: Add a message gray solution , compatible with POP consumption mode and Push consumption mode [RIP-69] : Add a message gray solution , compatible with POP consumption mode and Push consumption mode Aug 7, 2024
@imzs
Contributor

imzs commented Aug 12, 2024

Good idea. I have a question: how does this solution work when we need more than one gray environment?

As with Git branches, we may have several active branches, each running in its own independent gray environment.

@syhleo
Contributor Author

syhleo commented Aug 12, 2024

First of all, thank you very much for your response and for the question.

This solution also handles the case where multiple grayscale environments are required.
In project development there are indeed multiple active branches or environments. For these, the proposed approach is:

  1. Different environments: across environments (e.g. dev/test/stage/prod), an environment suffix can be added to the same topic or group. This practice is widely used. For example, the topic name can be topic-{environment}, such as pt-order-completed-stage, and the group name can be GID-open-service-stage.
  2. Multiple branches in the same environment: within one environment (e.g. the test environment), there may also be multiple branch environments, such as test1/test2/test3.
    For this scenario, grayscale partitioning can be used during a grayscale release in a given environment. It ensures that messages sent by grayscale producers are consumed by grayscale consumers, while messages from non-grayscale producers are processed by non-grayscale consumers. This provides grayscale validation of MQ message changes as well as a clean switchover from grayscale to the normal environment once grayscale validation passes.
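The environment-suffix convention in point 1 can be sketched as a small naming helper. The class `EnvNamingSketch` and the method `withEnv` are hypothetical names for this example; the `-` separator and the sample names are taken from the examples above.

```java
final class EnvNamingSketch {
    /** Appends an environment suffix to a topic or group name, e.g. "-stage". */
    static String withEnv(String baseName, String environment) {
        if (baseName.isEmpty() || environment.isEmpty()) {
            throw new IllegalArgumentException("baseName and environment must be non-empty");
        }
        return baseName + "-" + environment;
    }

    public static void main(String[] args) {
        // Topic and group names for the "stage" environment, as in the examples above.
        System.out.println(withEnv("pt-order-completed", "stage"));
        System.out.println(withEnv("GID-open-service", "stage"));
    }
}
```

This covers isolation between environments only; grayscale within one environment (point 2) reuses the same topic and group and relies on gray partitioning instead.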

In fact, most enterprises want a low-cost, convenient way to perform grayscale validation of MQ messages in a full-link grayscale publishing scenario. When MQ does not support grayscale messages and a change touches consumption logic, developers often have to add a lot of compatibility logic to the code. That logic can only ensure that the new business does not affect production; it cannot ensure that grayscale traffic reaches the grayscale consumer clients, which prevents strict grayscale validation. This solution addresses these problems.

@syhleo syhleo changed the title [RIP-69] : Add a message gray solution , compatible with POP consumption mode and Push consumption mode [RIP-71] : Add a message gray solution , compatible with POP consumption mode and Push consumption mode Aug 13, 2024
@syhleo syhleo changed the title [RIP-71] : Add a message gray solution , compatible with POP consumption mode and Push consumption mode [RIP-72] : Add a message gray solution , compatible with POP consumption mode and Push consumption mode Aug 13, 2024
@makabakaboom
Contributor

#3265 I submitted the same solution in 2021.

@syhleo syhleo changed the title [RIP-72] : Add a message gray solution , compatible with POP consumption mode and Push consumption mode [RIP-72] : Add a message gray solution , compatible with RocketMQ 5.x and RocketMQ 4.x Aug 14, 2024
@syhleo syhleo changed the title [RIP-72] : Add a message gray solution , compatible with RocketMQ 5.x and RocketMQ 4.x [RIP-72] : Add a message gray solution , compatible with RocketMQ 4.x and RocketMQ 5.x Aug 14, 2024
@syhleo
Contributor Author

syhleo commented Aug 14, 2024

Thank you for your response. Comparing the two solutions, I noticed the following differences between this one and your earlier one:

1. MQ version support: This solution is compatible with RocketMQ 4.x and RocketMQ 5.x.
2. Consumption mode: This solution supports both POP and Push consumption modes.
3. Implementation details: This solution builds on the clientId generation mechanism. It adds public client configurations that let MQ users decide whether to enable grayscale messaging and make it easy to determine whether a client is a grayscale client.
4. Low cost and easy access.
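To make point 3 concrete, here is a minimal sketch of tagging a clientId with a gray marker and detecting it later. It assumes the common RocketMQ clientId shape of client IP plus `@` plus instance name, and it assumes the marker is appended as a trailing `@gray` segment; where exactly the PR places the marker and how `grayTag` is encoded are not stated in this thread. The class `GrayClientIdSketch` and its methods are hypothetical.

```java
final class GrayClientIdSketch {
    static final String GRAY_MARKER = "@gray"; // identifier mentioned in this thread

    /** Builds "ip@instanceName", with the gray marker appended for gray clients. */
    static String buildClientId(String clientIp, String instanceName, boolean grayEnabled) {
        String base = clientIp + "@" + instanceName;
        return grayEnabled ? base + GRAY_MARKER : base;
    }

    /** A client is considered gray when its clientId carries the marker. */
    static boolean isGrayClient(String clientId) {
        return clientId.contains(GRAY_MARKER);
    }

    public static void main(String[] args) {
        System.out.println(buildClientId("10.0.0.1", "orderService", true));
        System.out.println(buildClientId("10.0.0.1", "orderService", false));
    }
}
```

One caveat of a substring check like this: an instance name that itself starts with "gray" would produce a false match, so a real implementation would need a less ambiguous marker or a stricter parse.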

@francisoliverlee
Member

Very nice to see a gray solution. I sorted out some cases that MQ users may run into. If they do, how can these cases be solved or extended?
case 1: producers are all gray, consumers are all gray
case 5: producers are both gray and normal, consumers are all gray
The other cases follow the same pattern.

[image]

@qianye1001
Contributor

In the following description, the current environment is referred to as the "base environment," to distinguish it from the "gray environment."

  1. The PR requires the "gray consumer" to be started before the "gray producer." If the gray consumer fails to start, "gray messages" can be sent to the base environment, which may cause issues. In other words, the PR assumes that consumers in the base environment are able to consume gray messages, but this cannot be guaranteed.

  2. The plan does not account for scenarios where subscriptions differ. When gray consumers and base consumers with different subscriptions run concurrently, how can we ensure that the consumer offset is not skipped if a client crashes or another exception occurs?

  3. With the remoting protocol client, users can already override the load-balancing rules and the message queue selection function; it is not a difficult task. This raises the question of whether merging this PR is necessary.

  4. This PR handles multiple environments poorly: each environment must be bound to corresponding queues, which limits the number of environments that can be supported.

@syhleo syhleo closed this Aug 15, 2024
@syhleo syhleo reopened this Aug 15, 2024
@syhleo
Contributor Author

syhleo commented Aug 15, 2024

Thank you very much for your feedback.

1. Regarding the startup sequence of gray consumers: Yes, the PR recommends starting gray consumers before gray producers to ensure that gray messages are processed correctly.

2. Regarding subscription inconsistency: This PR does not address scenarios with inconsistent subscriptions, because that issue is not caused by the gray partitioning scheme. Even without the gray scheme, the same problem occurs when gray consumers (gray pods) and base consumers (base pods) have different subscription relationships. The issue of inconsistent subscriptions can be resolved through other means.

3. Regarding overriding the load-balancing rules and the message queue selection function: The remoting protocol client can indeed have its logic rewritten. Apologies, I haven't yet found the entry point for this. However, to ensure compatibility with RocketMQ 5.x server-side load balancing and to support both Pop and Push consumption modes, the PR introduces additional client configurations, providing a simple and flexible way to integrate the gray feature.

4. Multi-environment and gray release: The gray partitioning scheme and the binding of different environments to corresponding queues do not conflict; they complement each other. This design ensures that each environment can carry out gray releases in an orderly and independent manner.

Additional Notes:
About multiple environments and grayscale releases
In day-to-day development we usually have multiple environments, such as dev, test, stage, and prod. The test environment may be further subdivided by project branch into sub-environments such as testA, testB, and testC. Depending on actual needs, we can create a separate topic/group carrying an "environment" identifier for a specific environment, which isolates MQ messages across environments and is common in practice.
However, when we do a grayscale release within one environment (e.g., prod), creating additional shadow topics/groups just for grayscale may cause problems. First, the switchover issue: when grayscale validation passes and the release switches to the prod environment (e.g., the grayscale service's pod is taken offline normally), some grayscale messages may go unconsumed, which is unacceptable. Second, the cost issue: doubling each topic and group to implement grayscale brings significant overhead.
Complete isolation of multiple environments through new topics/groups and gray partitioning during a grayscale release within a specific environment (e.g., the online environment) do not conflict; they are complementary. Combining the two approaches ensures that each environment can carry out grayscale releases in an orderly and independent manner.

@syhleo
Contributor Author

syhleo commented Aug 15, 2024

@francisoliverlee First of all, thank you very much for your response.
[image: MQ-Rray-Case]

[image: 灰度_正常生命周期管理 (grayscale: normal lifecycle management)]

Shared links:
https://www.processon.com/view/link/66bdc5020ee74e516c89def5?cid=66bd5df5fb35b76d02da1e4c
https://www.processon.com/view/link/66bdc5252515964e0bbc2f63?cid=66bda70c0ee74e516c898985

Case Scenario Description:
In practice, in a full-link grayscale release scenario, almost every service release strictly goes through a grayscale release first and goes online only after verification passes.
1. In day-to-day development we usually have multiple environments, such as dev, test, stage, and prod. The test environment may be further subdivided by project branch into sub-environments such as testA, testB, and testC. Depending on actual needs, we can create a separate topic/group carrying an "environment" identifier for a specific environment, which isolates MQ messages across environments and is common in practice.
However, when we do a grayscale release within one environment (e.g., prod), creating additional shadow topics/groups just for grayscale may cause problems. First, the switchover issue: when grayscale validation passes and the release switches to the prod environment (e.g., the grayscale service's pod is taken offline normally), some grayscale messages may go unconsumed, which is unacceptable. Second, the cost issue: doubling each topic and group to implement grayscale brings significant overhead.
Complete isolation of multiple environments through new topics/groups and gray partitioning during a grayscale release within a specific environment (e.g., the online environment) do not conflict; they are complementary.
2. The grayscale partitioning used in this solution mostly targets grayscale releases within a specific environment (e.g., prod) without adding new topics/groups, which gives MQ grayscale capability in a low-cost, convenient way. (P.S. In a full-link grayscale scenario, HTTP, gRPC and other interface calls can forward grayscale traffic; MQ needs this capability too ...)
This design ensures that each environment can perform grayscale publishing in an orderly and independent manner.

@francisoliverlee
Member

francisoliverlee commented Aug 16, 2024

@qianye1001 nice questions 1 and 2

@syhleo

question 1

No user can guarantee it, so this can be a BIG problem. Can we decouple the producer from the consumer here?

question 2

With different subscriptions, gray and normal consumers could get mixed up. How do we make sure the consume offset is right?

my question

How does a gray consumer change into a normal consumer? From the source code, it looks like the consumer config has to be updated and the consumer restarted. Can this be done in admin tools?

@syhleo
Contributor Author

syhleo commented Aug 16, 2024

Thank you for your feedback. @qianye1001 @francisoliverlee
Subscription relationship inconsistency issue:
The subscription inconsistency issue is not caused by the grayscale partitioning scheme of this PR. Even without the grayscale partitioning scheme, the same problem arises whenever a grayscale consumer (grayscale pods) has a different subscription relationship from a normal consumer (normal pods). Solving the MQ subscription inconsistency problem requires a separate PR; a scheme for it can be explored further if there is time, so I won't expand on it here.

Gray-to-normal bridging issue:
When the grayscale validation passes, the grayscale service is shut down normally, which triggers the rebalancing logic. I have not tried doing this through admin tools. Either way, it essentially comes down to modifying the client's grayTag, clientId, and other related configuration so that all consumer instances under the same consumer group can sense each other's grayscale status.
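
To illustrate how consumers in the same group might sense each other's grayscale status during rebalancing, here is a minimal sketch. Everything in it is an assumption made for illustration: the class `GrayAwareAllocator`, the `@gray` clientId suffix, and the rule that gray queues fall back to normal consumers once no gray consumer is online. The PR's actual allocation logic may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of a gray-aware queue allocation inside one consumer group.
 * Assumption: a gray consumer is recognizable by a marker suffix in its clientId.
 */
final class GrayAwareAllocator {

    /** Hypothetical marker; the real clientId format is not specified in this thread. */
    static final String GRAY_MARKER = "@gray";

    static boolean isGray(String clientId) {
        return clientId.endsWith(GRAY_MARKER);
    }

    /** Assigns queue ids to every online client; the last grayQueueCount queues are the gray queues. */
    static Map<String, List<Integer>> allocate(List<String> clientIds, int totalQueues, int grayQueueCount) {
        if (grayQueueCount < 0 || grayQueueCount >= totalQueues) {
            throw new IllegalArgumentException("grayQueueCount must be in [0, totalQueues)");
        }
        List<String> gray = new ArrayList<>();
        List<String> base = new ArrayList<>();
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String cid : clientIds) {
            (isGray(cid) ? gray : base).add(cid);
            result.put(cid, new ArrayList<>());
        }

        int firstGray = totalQueues - grayQueueCount;
        for (int q = 0; q < totalQueues; q++) {
            boolean grayQueue = q >= firstGray;
            // Gray queues go to gray consumers while any are online; otherwise
            // normal consumers take them over so leftover gray messages still get consumed.
            List<String> owners = (grayQueue && !gray.isEmpty()) ? gray : base;
            if (owners.isEmpty()) {
                continue; // base queue with no normal consumer online: left unassigned in this sketch
            }
            result.get(owners.get(q % owners.size())).add(q);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(allocate(List.of("c1", "c2", "c3@gray"), 8, 2));
        // After the gray pod goes offline, the next rebalance hands queues 6 and 7 to c1 and c2.
        System.out.println(allocate(List.of("c1", "c2"), 8, 2));
    }
}
```

In this sketch, taking the gray pod offline is enough to move the gray queues back to the normal consumers on the next rebalance, without any topic or group changes.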

@syhleo
Contributor Author

syhleo commented Aug 16, 2024

Thank you very much for all the great replies. I have summarized the above discussion in a Yuque (语雀) document, so feel free to share and discuss the grayscale scenarios or MQ grayscale solutions that your respective companies are working with. 🌹
