Required - Mediator Service deployed in an HA fashion, running multiple replicas (3+)
This includes the Agent (auto-scaled), Wallet (HA instance of Postgres), Proxy (auto-scaled), External Queue (HA instance of Redis or Kafka), and Message Workers (auto-scaled).
A note regarding the Proxy: it would be nice to eliminate this layer, though it is useful for traffic control and rate limiting. However, relying on it to route messages to the separate http and ws ports of the aca-py agents is undesirable.
With web sockets, auto-scaling is best driven by the number of web socket connections themselves. This helps ensure that pods with active web socket connections are not terminated prematurely when the system scales down. There are several ways to accomplish this.
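As a rough illustration of connection-based scaling, the sketch below computes a replica count from per-pod web socket connection counts, using the same ceiling rule the Kubernetes Horizontal Pod Autoscaler applies to an average-value metric. The function name, the target of 500 connections per pod, and the replica bounds are assumptions for illustration only, and the sketch ignores the HPA's tolerance and stabilization behaviour.

```python
import math


def desired_replicas(connections_per_pod, target_per_pod,
                     min_replicas=3, max_replicas=10):
    """Return the replica count needed to keep the average number of
    web socket connections per pod at or below target_per_pod.

    All parameter names and defaults are hypothetical.
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    # Total open connections across every running pod.
    total = sum(connections_per_pod)
    # ceil(current_replicas * average / target) reduces to ceil(total / target).
    wanted = math.ceil(total / target_per_pod)
    # Clamp to the configured bounds (3+ replicas is the HA requirement above).
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    print(desired_replicas([800, 750, 900], target_per_pod=500))
```

Scaling on this metric only decides how many pods to run; it does not by itself choose which pod is removed on scale-down, so open connections on a terminated pod would still need to be drained or re-established.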
Highly Desired - No session/cookie affinity: An agent may seamlessly connect to, and be served by, any of the replicas
The use of web sockets defines the current need for session affinity. The socket is opened between a client and an agent/worker instance. Once established, all traffic must be routed between the same client and agent/worker instance. This is a challenge in K8S/OCP, especially when it comes to HPAs.
Are there any alternatives to using web sockets?
Required - Uptime/performance monitoring at Aries protocol level (not just http/s)
For example, the k8s compatible status endpoints on the agents do not provide sufficient information regarding the state and health of the websocket connections, nor do they provide any metrics on the durability and longevity of the websocket connections.
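To make the missing metrics concrete, here is a minimal sketch of the kind of per-connection bookkeeping a websocket-aware status endpoint could report. It is not part of aca-py; the class, method, and field names are all hypothetical, and timestamps are passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class WebSocketStats:
    """Hypothetical tracker for websocket connection longevity."""
    opened_at: dict = field(default_factory=dict)   # conn_id -> open timestamp
    lifetimes: list = field(default_factory=list)   # durations of closed connections

    def on_open(self, conn_id, now):
        self.opened_at[conn_id] = now

    def on_close(self, conn_id, now):
        started = self.opened_at.pop(conn_id, None)
        if started is not None:
            self.lifetimes.append(now - started)

    def snapshot(self, now):
        """Return metrics a status endpoint could expose alongside liveness."""
        ages = [now - t for t in self.opened_at.values()]
        closed = len(self.lifetimes)
        return {
            "active_connections": len(ages),
            "oldest_active_seconds": max(ages, default=0.0),
            "closed_connections": closed,
            "mean_closed_lifetime_seconds":
                sum(self.lifetimes) / closed if closed else 0.0,
        }
```

A short mean lifetime combined with a high count of closed connections would suggest sockets are being dropped, which the existing http-level status checks cannot show.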
This PR addresses vertical scalability on a single-instance mediator; it does not address any of the issues encountered in an HA or horizontally scalable environment.
I did a bit of catching-up and while there are still some items that will require review and planning, I think we have a couple of options to focus on for the short/medium term.
The PR linked in the issue description will allow the mediator agent to scale vertically and manage throughput of 2400+ connections: this should be enough to handle the user volumes we expect in the immediate future. This does NOT help with scenarios involving pod rollout, as the in-progress queue would be lost.
This PR (https://github.com/bcgov/openshift-aries-mediator-service/pull/18/files) includes changes that, in theory, should help prevent websockets from being dropped due to scaling, by using sticky sessions and affinity. The changes are already deployed in our dev environment; however, testing appears to have been interrupted by a shift in priorities before it could confirm whether the change was helpful. It would be a good idea to wrap up the testing and confirm whether this approach resolves, or at least mitigates, the horizontal scaling issues.
Both the above approaches should be accompanied by running a persistent queue to handle messages, so that they will not be lost in case of rollouts/re-deployments/failures.
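The property that matters for a persistent queue is that a message is only removed after it has been acknowledged, so work that was in progress when a pod was rolled out can be picked up again. The sketch below illustrates that claim/ack/recover pattern using SQLite from the Python standard library purely as a stand-in for the Redis or Kafka queue mentioned in the issue; the class, table, and method names are hypothetical.

```python
import sqlite3


class DurableQueue:
    """Hypothetical ack-based queue; SQLite stands in for Redis/Kafka."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "body TEXT NOT NULL, "
            "state TEXT NOT NULL DEFAULT 'pending')"
        )
        self.db.commit()

    def enqueue(self, body):
        cur = self.db.execute("INSERT INTO messages (body) VALUES (?)", (body,))
        self.db.commit()
        return cur.lastrowid

    def claim(self):
        """Mark the oldest pending message as in progress and return it."""
        row = self.db.execute(
            "SELECT id, body FROM messages "
            "WHERE state = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.db.execute(
            "UPDATE messages SET state = 'in_progress' WHERE id = ?", (row[0],)
        )
        self.db.commit()
        return row

    def ack(self, msg_id):
        """Delete a message only once it has been delivered."""
        self.db.execute("DELETE FROM messages WHERE id = ?", (msg_id,))
        self.db.commit()

    def recover(self):
        """After a restart, return unacknowledged messages to the queue."""
        cur = self.db.execute(
            "UPDATE messages SET state = 'pending' WHERE state = 'in_progress'"
        )
        self.db.commit()
        return cur.rowcount
```

This single-process sketch does not handle concurrent workers claiming the same message; a real Redis or Kafka deployment would rely on that system's own delivery and acknowledgement mechanisms.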
I would suggest we focus on these three items in the short term, and in the meantime complete the investigation of potential next steps/long term strategies to manage mediation.
Tasks
Acceptance Criteria
Blocked By:
Additional Resources: