Dynamic Selenium 4 grid on kubernetes #9845
We are happy to discuss approaches. What do you have in mind, @gazal-k?
Sorry, I'm not really familiar with the selenium grid codebase. I imagine this: https://github.com/SeleniumHQ/selenium/blob/trunk/java/src/org/openqa/selenium/grid/node/docker/DockerSessionFactory.java has some of the logic to dynamically create browser nodes that join the grid. It would be nice to have something similar that creates k8s Pods, so that the Kubernetes Selenium 4 grid scales based on test demand as opposed to creating a static number of browser nodes. Again, sorry that I don't have something more solid to contribute.
I have attempted to build something similar for Kubernetes with Selenium Grid 3.
I have some thoughts about how the Kubernetes support could be implemented. I remember having a look at the Grid 4 codebase in December 2018, and I wrote up my thoughts in this ticket over in Zalenium when someone asked if we planned to support Grid 4: zalando/zalenium#1028 (comment). So assuming the grid architecture is still the same as it was in 2018, i.e. router, sessionMap and distributor, I think my original ideas are still valid. The crux of it was to implement the sessionMap as annotations (metadata) on a Kubernetes pod, so that Selenium Grid didn't need to maintain the session state, which means you could scale it and make it highly available much more easily. It also means you could run multiple copies of the router, while you probably want just one distributor, as you'd get into race conditions when creating new selenium pods. The sessionMap would end up just being a shared module/library that the router and distributor use to talk to the Kubernetes API server.
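To make that idea concrete, a sessionMap entry could simply be a pair of annotations on the node pod. This is only a minimal sketch; the annotation keys and values are hypothetical, not anything Grid defines today:

```yaml
# Hypothetical sketch: the pod itself carries the session state, so no separate
# sessionMap store is needed. The selenium.dev/* keys are invented for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: selenium-node-chrome-abc12          # illustrative name
  labels:
    app: selenium-node
  annotations:
    selenium.dev/session-id: "5f3c2b9e-0000-0000-0000-000000000000"
    selenium.dev/session-capabilities: '{"browserName": "chrome"}'
spec:
  containers:
    - name: chrome-node
      image: selenium/node-chrome:4.1.2     # any Grid 4 node image
```

The router would then look up the pod by the session-id annotation to decide where to proxy a request, and the distributor would set the annotations when it starts a session.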
If we wanted a purer k8s solution: if there were metrics exposed around how many selenium sessions are in the queue, how long they've been waiting, or even the rate of queue processing, it would be possible to configure a horizontal pod autoscaler (HPA) on the node deployment itself to target a given rate of message processing.
There is https://keda.sh/docs/2.4/scalers/selenium-grid-scaler/ which can autoscale nodes, and it's working fine. The problem is with tearing down a node: since it doesn't keep track of which node is working, it could kill a test in progress, and it seems the Chrome node doesn't handle that gracefully.
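For reference, wiring that scaler up is a small ScaledObject. This is only a sketch: the resource names and replica counts are assumptions, while the trigger type and its url/browserName metadata come from the KEDA docs linked above:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-chrome-scaler              # assumed name
spec:
  scaleTargetRef:
    name: selenium-node-chrome              # assumed Deployment of Chrome nodes
  minReplicaCount: 0
  maxReplicaCount: 8
  triggers:
    - type: selenium-grid
      metadata:
        url: http://selenium-hub:4444/graphql   # Grid GraphQL endpoint (assumed service name)
        browserName: chrome
```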
I tried another approach by implementing an application which intercepts the docker-engine calls from the selenium node-docker component, translates them into Kubernetes API calls, and then calls the Kubernetes API. It works properly, creating and stopping browser nodes depending on the calls from node-docker. But this has a major problem because node-docker doesn't support concurrency. It can only create a single browser node, run a test, destroy it, and then move on to the next (I will be creating a separate issue for the concurrency problem involving node-docker). From what I noticed, node-docker binds those browser nodes to itself and exposes them to the distributor as its own sessions, so all the distributor sees is the node-docker and not the browser nodes. I think this approach is not appropriate for concurrent execution, as node-docker becomes a single point of failure whose loss would end all the sessions routed through it. Therefore I think the KEDA Selenium-Grid-AutoScaler is a much better approach.
There is a slight issue with this, as it would make Grid 4 dependent on Kubernetes. It would result in two different implementations of Grid: one specific to K8s and one that is not dependent on Kubernetes. I think a much better approach is to make the Grid HA by other means, like sharing the current state across all the instances of a particular grid component type.
It's already dependent on Docker. Perhaps there should be some middleware for different environments.
@MissakaI I have tested the KEDA Selenium-Grid-AutoScaler and it scales up as many nodes as you need based on the session queue, which is OK. The problem is with the video part, because it doesn't work in Kubernetes. I have managed to deploy the video container in the same pod, but the video file is not saved until the video container is stopped gracefully, and you also cannot set the name of the video for every test; it keeps recording all the time until it is closed.
The selenium repository is currently dependent on Ruby, Python, dotnet, and quite a few other things that it probably shouldn't be. There's certainly an argument for a lot of stuff to be split out into separate modules, but that's probably a conversation for another issue.
We had a note in the KEDA standup meeting to see if we can help with Selenium & video.
Will do; the issue in question is … These two are pretty intertwined.
@tomkerkhove I am the one who added the note to your standup meeting. Please also see this issue: #10018
Tracking this in kedacore/keda#2494
As was mentioned in kedacore/keda#2494 you can use KEDA to scale either a deployment or jobs. I'm thinking that scaling jobs might be more fitting, but you then need to make sure that the container exits when it's done with a session. On the other hand, you don't have the problem of Kubernetes trying to delete a pod that is still executing a test. To make a node exit after a session is done you need to add a property to the node section of config.toml:
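Presumably the option in question is drain-after-session-count, the same setting exposed as DRAIN_AFTER_SESSION_COUNT further down in this thread:

```toml
[node]
# Drain and shut the node down after it has served this many sessions
drain-after-session-count = 1
```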
With the official docker images this isn't enough, since supervisord would still be running. So for that case you would need to add a supervisord event listener that shuts down supervisord along with its subprocesses. One good thing with this approach is that, combined with the video feature, you get one video per session.

Regarding graceful shutdown: in the dynamic grid code any video container is stopped before the node/browser container, so I guess the video file gets corrupted if Xvfb exits before ffmpeg is done saving the file. The event listener described above should therefore shut down the supervisord in the video container before shutting down the one in its own container. For shutting down supervisord you can use the unix_http_server and supervisorctl features of supervisord; that works between containers in the pod as well.

I've also been thinking about how to have the video file uploaded to S3 (or similar) automatically. The tricky part is supplying the pod with the URL to upload the file to. I have some ideas, but that has to wait until the basic solution is implemented.
I think this case should be followed up in the thread dedicated to it, which is mentioned by @LukeIGS.
Also, we need a way to implement liveness and readiness probes, because I ran into a few instances where the selenium process was killed but the pod continued to run, which means Kubernetes never terminated the crashed pod and reinstated a new one.
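As a sketch of what those probes could look like on a node container (the path is the Grid node's status endpoint; the port and timings are assumptions to tune per environment):

```yaml
# Sketch: restart the pod if the selenium node process dies while the container
# keeps running, and only treat the node as ready once /status responds.
livenessProbe:
  httpGet:
    path: /status       # Grid node status endpoint
    port: 5555          # assumed default node port
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /status
    port: 5555
  initialDelaySeconds: 10
  periodSeconds: 5
```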
Thank you for this. I have been using deployments and thought of raising an issue with KEDA to add the controller.kubernetes.io/pod-deletion-cost annotation.
Also, can you point me to where this was included in the Selenium documentation, if it was documented?
I don't see how that would help. You could put that cost in the manifest to begin with. But in any case you end up having to remove/update that annotation when the test is done, and KEDA doesn't know when that is. There is a recent proposal for Kubernetes to let the pod inform Kubernetes about which pods to delete through a probe: kubernetes/kubernetes#107598. Until something like that is implemented, either the node itself or maybe the distributor would need to update the annotation.
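For context, the annotation being discussed sits on the pod and influences which replicas are removed first on scale-down; a minimal sketch (names and values are illustrative):

```yaml
# Sketch: a pod with a lower deletion cost is preferred for removal when the
# Deployment scales down. As noted above, something still has to update this
# value when a session starts or ends.
apiVersion: v1
kind: Pod
metadata:
  name: selenium-node-chrome-idle-1       # illustrative name
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "-100"   # idle node: cheap to delete
```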
I haven't found anything about it in the documentation. I stumbled upon org.openqa.selenium.grid.node.k8s.OneShotNode when I was looking in the selenium code. It then took a while for me to find out how to make use of the class. That's implemented here:
On the other hand, I haven't tested it, so who knows if OneShotNode still works. This is where it should be documented: https://www.selenium.dev/documentation/grid/configuration/toml_options/
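If I read the node options code correctly, using the class would mean pointing the node at it in config.toml, something like the following; the implementation key is my assumption from reading the code and I haven't verified it:

```toml
[node]
# Assumed option: replace the default LocalNode factory with OneShotNode.
# The class also has to be on the server's classpath, which is where the
# ClassNotFoundException mentioned below comes in.
implementation = "org.openqa.selenium.grid.node.k8s.OneShotNode"
```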
I was intending to either write an application that will monitor the test sessions along with the respective pod, or write a custom KEDA scaler that will do what I mentioned previously.
There is an issue about shutting down the node container when the node server has exited: SeleniumHQ/docker-selenium#1435
It seems like even though the code is available in the repo, it causes a ClassNotFoundException after adding it to config.toml. Extracting the …
The docker image that I used was …
Well, the selenium project is a bit confusing. Apparently the selenium build system excludes the package org.openqa.selenium.grid.node.k8s from selenium-server.jar. Here I found bazel build configurations for building docker images: the firefox_node and chrome_node images are declared there to include a layer (called one-shot) that contains a library with that class. But these images and the library don't seem to be published publicly anywhere. In https://github.com/SeleniumHQ/selenium/tree/trunk/deploys/k8s you can see how that library is utilized (see selenium/deploys/k8s/firefox-node.yaml, lines 19 to 44, at commit 451fc38).
It seems like the idea is that you check out the code to build and deploy these images and k8s manifests to your local infrastructure.
Thank you all for sharing your thoughts and offering paths to move forward. I will reply to the comments below.
This will help to enable a Dynamic Grid in Kubernetes, as one can create a container with a single session which will shut down on its own after the session is completed. Helps with SeleniumHQ#9845 and SeleniumHQ/docker-selenium#1514
Hi all, is anyone still facing issues related to the KEDA implementation? I was the one who originally added the scaler to KEDA. I haven't been following it much due to my other assignments. We have now started setting up the grid in EKS on Fargate and it seems to work fine for us. I am yet to work on retrieving the browser console logs and network logs. Any help on that would be greatly appreciated.
With regards to video recording, the problem we face is that it records a single video for the whole lifetime of the pod, so if multiple sessions are handled by the same pod there is just one video for all of them. Also, even if the pod handles just a single session, the video keeps recording until the pod is killed, which takes around 300 seconds by default with the HPA. So even for a test that runs for just a few seconds we get a video that's 5 minutes or longer. Is there a way to control this behaviour?
My idea for solving that (which I haven't tested yet) is to use the scaling-jobs feature in KEDA to run selenium nodes. These nodes would then be configured with DRAIN_AFTER_SESSION_COUNT=1, so after the session has finished the selenium container finishes as well. The remaining problem is making the video container exit. This could be solved by harnessing features of supervisord: if the supervisord of the video container has unix_http_server enabled, then the supervisord of the selenium container could use supervisorctl to stop the video container, in a similar way as here: SeleniumHQ/docker-selenium@281e5c4. A somewhat tricky part would be how to only make that supervisorctl call when there actually is a video container to stop.
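A rough sketch of the ScaledJob half of that idea (untested; the resource names, image tag and env var spelling are assumptions carried over from the description above):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: selenium-node-chrome-job            # assumed name
spec:
  maxReplicaCount: 8
  jobTargetRef:
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: chrome-node
            image: selenium/node-chrome:4.1.2     # any Grid 4 node image
            env:
              - name: SE_EVENT_BUS_HOST
                value: selenium-hub               # assumed hub service name
              - name: DRAIN_AFTER_SESSION_COUNT   # node drains and exits after one session
                value: "1"
  triggers:
    - type: selenium-grid
      metadata:
        url: http://selenium-hub:4444/graphql     # Grid GraphQL endpoint
        browserName: chrome
```

Once the node process exits, the job completes, so Kubernetes never has to kill a pod mid-test; the open question above is only how to take the video container down with it.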
An alternative would be to have the selenium node container kill the pod that it belongs to via the kube API server in a pre-stop hook (stackoverflow delete pod). I have not tested this. The pre-stop hook of the video container can then upload the video to remote storage. The problem then is that the video container does not know the session identifier of the last test run by the selenium node container, which would be a practical filename in the remote storage.
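As a sketch of that alternative (untested; the upload script, image tags and downward-API env vars are assumptions), the relevant part of the pod spec could look like this:

```yaml
# Sketch: the node container deletes its own pod through the API server, and the
# video container's preStop hook uploads the recording before it exits.
# Assumes POD_NAME/POD_NAMESPACE are injected via the downward API and the pod's
# ServiceAccount is allowed to delete pods.
containers:
  - name: chrome-node
    image: selenium/node-chrome:4.1.2
    lifecycle:
      preStop:
        exec:
          command:
            - sh
            - -c
            - >
              curl -sS -X DELETE
              --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
              https://kubernetes.default.svc/api/v1/namespaces/${POD_NAMESPACE}/pods/${POD_NAME}
  - name: video
    image: selenium/video:ffmpeg-4.3.1            # assumed tag
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "/opt/bin/upload-video.sh"]   # hypothetical upload script
```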
I solved it by adding ffmpeg directly into the browser node docker image and recording a video directly for every session. It works great. I will soon share the whole setup.
@prashanth-volvocars Hi! Great to hear this. Please don't forget to share the setup with us.
@prashanth-volvocars we would love to receive your feedback :)
My setup is more oriented towards AWS, but it has worked great for us so far. I need some help with sharing it. I have made some changes to the NodeBase and added a new script to upload the videos and logs directly to S3. Would it be OK to have this as part of this repo, or should I just share it in a separate repo, since it's more oriented towards AWS?
I think using S3 as opposed to block storage was an excellent choice. Perhaps parts of that logic can be made generic using something like https://github.com/google/go-cloud in the future. But for a lot of us who would want to set up a Selenium 4 grid on AWS, I think your contribution would be excellent. Perhaps it can be turned on based on some env params?
Hey all, apologies for the delay in sharing it. I was unsure of how to do it, but I'm taking a first step now: https://github.com/prashanth-volvocars/docker-selenium/tree/auto-scaling/charts/selenium-grid Remember you need to install KEDA before installing the chart. The chart is configured to work with the default namespace; if you are installing it in another namespace, make sure to update hpa.url. Any questions, please direct them to me here or on Slack.
You can grab all the information about the setup here. In a nutshell, it can …
@diemol - Do you think that this was one of the use-cases for building the reference implementation of org.openqa.selenium.grid.node.k8s.OneShotNode?
@krmahadevan there is a solution in a PR in the docker-selenium project, have you checked it? @prashanth-volvocars was kind enough to submit it.
@diemol - No, I wasn't aware of the PR. I went back and checked SeleniumHQ/docker-selenium#1714. Even though I don't understand a lot of the k8s lingo yet, I kind of got the idea of what it is doing, and it looks like that should suffice for the k8s requirement of an autoscaling grid.
Yes, when we merge that, we can close this issue.
I have made a new PR, SeleniumHQ/docker-selenium#1854 (based on SeleniumHQ/docker-selenium#1714). It has a few more features, including automatic installation of KEDA and autoscaling with jobs. I have also supplied a helm repo where you can get the chart to test it before the PR is merged.
Hello team, I would like to know if this implementation will include a solution for the video recording feature on a distributed k8s setup, and whether there is any ETA for when we would be able to use these new components.
What if I want to upload the videos to the S3 bucket with a specific name instead of <session_id>.mp4? Or how can I identify which video file corresponds to a given test?
In your test you know the session id, and therefore you also know the file name. Specifying what file name to use is not possible with this solution.
@msvticket
Hi @sahajamit, I have seen your post on Medium about configuring a selenium grid inside an EKS cluster. As per your instructions I have created the selenium grid hub and I am able to access it via the ingress controller, but when I try to configure the chrome node, it is not able to register with the selenium hub. In your post you mentioned k8s_host; what value are you referring to, the EKS cluster endpoint URL or something else?
We now have KEDA integrated in the chart, and there is also video support there. Closing this.
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
🚀 Feature Proposal
Just like the dynamic Selenium 4 grid using Docker, having a similar k8s "pod factory" (or something along those lines) would be nice.
https://github.com/zalando/zalenium does that. Perhaps some of it can be ported to Grid 4.