CloudSim - limit 1 running simulation per team #644
Comments
We agree that it makes sense to limit the allowed parallel runs to 1 if this helps avoid the Pending status. We have a submitted solution run that has been in Pending status for 16 hours now, and we have no other simulation running.
I am in a similar situation. One run has been pending for 36 hours, one for 12 hours, and none is currently running. I don't know that the issue is specifically the simultaneous runs; it seems more that the system gets many of its resources into a state that nobody can use. It recovers after it is reset, but with more people using the system it gets back into the bad state much faster. Limiting it to one simultaneous run might help keep the resources out of the bad state for longer periods.
Agree, I am also seeing runs pending for a very long time. Whatever the issue is, the ability to get results ASAP is more important to me than the number of concurrent simulations. On a slightly related note, as the maximum number of concurrent simulations is decreased, it would be really helpful to be able to cancel a running job :).
Or, an even better solution if the budget allows: buy more compute on AWS. That would allow teams to have 3 simultaneous simulations and reasonable finish times. The fact that the whole setup is cloud-based should make it simple to add more compute...
Good point! I’ve been running my own simulator in AWS for a while, based on the cloudsim containers (I wish I could share it with other teams, but I don’t know how I could provide enough privacy at the moment, and it still has its own set of issues :)), and it costs somewhere around $1/hr/robot. I expect that is about the same for the SubT simulator. Not crazy expensive, but it does add up depending on the available budget.
I was thinking about reviving #354... Some simulation seems to be running but we cannot connect to it to see what is going on. The message I get is "Connection failed. Please contact an administrator.". The exact error is
Started at 2020-10-04T17:59:05.599Z. If I may summarize what I have learned in this thread: all teams see Pending times in the tens of hours and none of them have anything running - except us, where something seems to be running but we cannot connect to it, and the simulations that have finished recently ended abruptly, probably due to a crash (see #631 (comment)). Why does it break each time before a circuit deadline? It was kind of expected for Tunnel, somewhat expected for Urban, but definitely not expected for Cave, since that is the last try before the finals. Is the load so much higher at these times? Is nobody running anything during the year except us? I am confused.
When Cloudsim is unable to procure enough instances for a submission (1 per robot + 1 for the simulator), submissions will display the "Pending" status. AWS has many users, and external spikes in usage can result in fewer total available instances for Cloudsim (likely what occurred over the weekend). We are in contact with AWS about instance availability and will only reduce the limit of simultaneous runs per team further if it becomes necessary based on the usual expected machine availability. We do not wish to further restrict around-the-clock practice runs because of infrequent dips in availability.
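For concreteness, the procurement rule quoted above (one instance per robot plus one for the simulator) can be written as a small sketch. The function names below are hypothetical illustrations, not the actual Cloudsim code:

```go
// requiredInstances follows the rule described above: one EC2 instance
// per robot plus one for the simulator itself.
func requiredInstances(numRobots int) int {
	return numRobots + 1
}

// canStart is a hypothetical availability check: a submission stays
// "Pending" until all of its instances can be procured at once.
func canStart(numRobots, availableInstances int) bool {
	return requiredInstances(numRobots) <= availableInstances
}
```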
FWIW, while I don’t know specifically what limits AWS has per user, I was able to spin up a simulation with 6 instances (5 robots + simulator) in my private simulator, shortly after hitting the |
That is not correct. The instance 84ffbecd-eaf5-44e1-8769-d2cd7a77c2f2-r-1 still shows as Running, even though it crashed a long time ago.
This only ever happens when a circuit deadline is approaching (it has not happened since Urban), so our guess is that it is related to SubT activities and not to external AWS usage - hence the suggestion to further limit the number of concurrent simulations per team. Can you publish stats about SubT usage of AWS? It would be interesting to see a comparison of now vs. a month ago. It would also be interesting to know the size of the AWS pool - how many machines are there?
Agree, this does seem to correlate consistently with higher simulator usage before a deadline. Also, as mentioned in my previous post, I saw no evidence of an AWS g3 instance shortage over the weekend. Taking a look at the cloudsim web code, there is a pool of threads responsible for starting up new simulations: https://gitlab.com/ignitionrobotics/web/cloudsim/-/blob/master/simulations/sim_service.go
Is it possible that the threads in the launch pool, or one of the other pools, all got into a bad state (crashed or hung), preventing new simulations from starting successfully? Or maybe another internal limit got hit due to the crashed simulations from #631? [Edit] There is an internal limit on running EC2 instances defined here:
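Speculating about how a launch pool could end up in such a state, here is a minimal Go sketch (hypothetical names like launchSimulation and jobs; not the actual cloudsim code) showing how a fixed pool of workers stops making progress once every worker blocks on a launch that never returns, leaving later submissions queued indefinitely:

```go
package main

import (
	"fmt"
	"time"
)

// launchSimulation stands in for the real launch logic; the name is
// hypothetical. If a launch blocks forever (e.g. waiting on an instance
// that never becomes ready), the worker that called it is lost for good.
func launchSimulation(id int) {
	if id%3 == 0 {
		select {} // simulate a hung launch: blocks forever
	}
	time.Sleep(100 * time.Millisecond) // simulate a normal launch
	fmt.Printf("simulation %d started\n", id)
}

func main() {
	const poolSize = 2         // fixed number of launcher goroutines
	jobs := make(chan int, 10) // queued submissions

	// Start the worker pool. Without timeouts or health checks, every
	// hung launch permanently removes one worker from the pool.
	for w := 0; w < poolSize; w++ {
		go func() {
			for id := range jobs {
				launchSimulation(id)
			}
		}()
	}

	// Submit ten runs. Once every worker is stuck, the remaining jobs
	// sit in the channel forever and look "Pending" to users.
	for id := 0; id < 10; id++ {
		jobs <- id
	}
	time.Sleep(2 * time.Second)
	fmt.Println("exiting; several jobs were never launched")
}
```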
p.s. I probably misunderstood the limit. Up to now I expected that you could put several simulations into the queue and only the given limit would be processed in parallel, but at the moment you cannot add a new simulation to the queue at all (error 5506 - Simultaneous simulations limit reached).
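To make that distinction explicit, here is a minimal Go sketch (hypothetical names and limit value, not the actual Cloudsim behavior) contrasting the two semantics: rejecting a submission outright once the limit is reached, which matches error 5506, versus accepting it into a queue and processing only a fixed number in parallel:

```go
package main

import (
	"errors"
	"fmt"
)

const maxSimultaneous = 1 // hypothetical per-team limit

var errLimitReached = errors.New("5506 - Simultaneous simulations limit reached")

// submitReject models the observed behavior: once the per-team limit of
// running simulations is hit, a new submission is rejected immediately.
func submitReject(running int) error {
	if running >= maxSimultaneous {
		return errLimitReached
	}
	return nil // accepted and started right away
}

// submitQueue models the expected behavior: submissions beyond the limit
// are still accepted, but wait in a queue until a slot frees up.
func submitQueue(running int, queue []string, id string) (started bool, newQueue []string) {
	if running >= maxSimultaneous {
		return false, append(queue, id) // accepted, but queued
	}
	return true, queue // accepted and started right away
}

func main() {
	fmt.Println(submitReject(1)) // 5506 - Simultaneous simulations limit reached
	started, q := submitQueue(1, nil, "run-2")
	fmt.Println(started, q) // false [run-2]
}
```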
Hello,
a few days ago @angelacmaio announced on https://community.subtchallenge.com/t/practice-for-cave-circuit-virtual-competition/1449 that the number of simultaneous submissions will be reduced from 10 to 3:
Teams can continue submitting solutions against the practice scenarios. To ensure cloud machine availability for all teams, each team can submit a maximum of 3 simultaneous practice runs on Cloudsim. Once the limit is reached, teams will not be able to submit additional runs until at least one of the 3 runs has finished. During peak usage, submissions may also display a “queued” status until machines become available.
At the moment we have two simple simulations (with a small set of robots) pending for more than 24 hours, so I would suggest limiting the number of concurrent runs per team to only one, so that it still provides some feedback in a "reasonable time". You can vote 👍 or 👎
thanks
Martin/Robotika Team