CloudSim - limit 1 running simulation per team #644
Comments
We agree that it makes sense to limit the allowed parallel runs to 1 if this helps avoid the Pending status. We have a submitted solution run that has been in Pending status for 16 hours now, and we have no other simulation running.
I am in a similar situation. One run has been pending for 36 hours, one for 12 hours, and none is currently running. I don't know that the issue is specifically the simultaneous runs; it seems more that the system gets many of its resources into a state that nobody can use. It recovers after it is reset, but with more people using the system it gets back into the bad state much faster. Limiting it to one simultaneous run might help keep the resources out of the bad state for longer periods.
Agree, I am also seeing runs pending for a very long time. Whatever the issue is, the ability to get results ASAP is more important to me than the number of concurrent simulations. On a slightly related note, as the maximum number of concurrent simulations is decreased, it would be really helpful to be able to cancel a running job :).
Or, an even better solution if the budget allows: buy more compute on AWS. That would allow teams to have 3 simultaneous simulations and reasonable finish times. The fact that the whole setup is cloud-based should make it simple to add more compute...
Good point! I’ve been running my own simulator in AWS for a while, based on the cloudsim containers (I wish I could share it with other teams, but I don’t know how I could provide enough privacy at the moment, and it still has its own set of issues :)), and it costs somewhere around $1/hr/robot. I expect that is about the same for the SubT simulator. Not crazy expensive, but it does add up depending on the available budget.
I was thinking about reviving #354... Some simulation seems to be running but we cannot connect to it to see what is going on. The message I get is "Connection failed. Please contact an administrator.". The exact error is
Started at 2020-10-04T17:59:05.599Z. If I may summarize what I have learned in this thread: all teams see Pending times in the tens of hours and none of them have anything running - except us, where something seems to be running but we cannot connect to it, and the simulations that have finished recently ended abruptly, probably due to a crash (see #631 (comment)). Why does it break each time before a circuit deadline? It was kind of expected for Tunnel, somewhat expected for Urban, but definitely not expected for Cave, since that is the last try before the finals. Is the load so much higher at these times? Is nobody running anything during the year except us? I am confused.
When Cloudsim is unable to procure enough instances for a submission (1 per robot + 1 for the simulator), submissions will display the "Pending" status. AWS has many users, and external spikes in usage can result in fewer total available instances for Cloudsim (likely what occurred over the weekend). We are in contact with AWS about instance availability and will only reduce the limit of simultaneous runs per team further if it becomes necessary based on the usual expected machine availability. We do not wish to further restrict around-the-clock practice runs because of infrequent dips in availability.
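For concreteness, the procurement rule quoted above (one instance per robot plus one for the simulator) can be written as a small sketch. The function names below are hypothetical illustrations, not the actual Cloudsim code:

```go
// requiredInstances follows the rule described above: one EC2 instance
// per robot plus one for the simulator itself.
func requiredInstances(numRobots int) int {
	return numRobots + 1
}

// canStart is a hypothetical availability check: a submission stays
// "Pending" until all of its instances can be procured at once.
func canStart(numRobots, availableInstances int) bool {
	return requiredInstances(numRobots) <= availableInstances
}
```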
FWIW, while I don’t know specifically what limits AWS has per user, I was able to spin up a simulation with 6 instances (5 robots + simulator) in my private simulator, shortly after hitting the |
That is not correct. The instance 84ffbecd-eaf5-44e1-8769-d2cd7a77c2f2-r-1 still shows as Running, even though it crashed a long time ago.
This only ever happens when a circuit deadline is approaching (it has not happened since Urban), so our guess is that it is related to SubT activities and not to external AWS usage - hence the suggestion to further limit the number of concurrent simulations per team. Can you publish stats about SubT usage of AWS? It would be interesting to see a comparison of now vs. a month ago. It would also be interesting to know the size of the AWS pool - how many machines are there?
Agree, this does seem to correlate consistently with higher simulator usage before a deadline. Also, as mentioned in my previous post, I saw no evidence of an AWS g3 instance shortage over the weekend. Taking a look at the cloudsim web code, there is a pool of threads responsible for starting up new simulations: https://gitlab.com/ignitionrobotics/web/cloudsim/-/blob/master/simulations/sim_service.go
Is it possible that the threads in the launch pool, or one of the other pools, all got into a bad state (crashed or hung), preventing new simulations from starting successfully? Or maybe another internal limit got hit due to the crashed simulations from #631? [Edit] There is an internal limit on running EC2 instances defined here:
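Speculating about how a launch pool could end up in such a state, here is a minimal Go sketch (hypothetical names like launchSimulation and jobs; not the actual cloudsim code) showing how a fixed pool of workers stops making progress once every worker blocks on a launch that never returns, leaving later submissions queued indefinitely:

```go
package main

import (
	"fmt"
	"time"
)

// launchSimulation stands in for the real launch logic; the name is
// hypothetical. If a launch blocks forever (e.g. waiting on an instance
// that never becomes ready), the worker that called it is lost for good.
func launchSimulation(id int) {
	if id%3 == 0 {
		select {} // simulate a hung launch: blocks forever
	}
	time.Sleep(100 * time.Millisecond) // simulate a normal launch
	fmt.Printf("simulation %d started\n", id)
}

func main() {
	const poolSize = 2         // fixed number of launcher goroutines
	jobs := make(chan int, 10) // queued submissions

	// Start the worker pool. Without timeouts or health checks, every
	// hung launch permanently removes one worker from the pool.
	for w := 0; w < poolSize; w++ {
		go func() {
			for id := range jobs {
				launchSimulation(id)
			}
		}()
	}

	// Submit ten runs. Once every worker is stuck, the remaining jobs
	// sit in the channel forever and look "Pending" to users.
	for id := 0; id < 10; id++ {
		jobs <- id
	}
	time.Sleep(2 * time.Second)
	fmt.Println("exiting; several jobs were never launched")
}
```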
p.s. I probably misunderstood the limit. Up to now I expected that you could put several simulations into the queue and only the given limit would be processed in parallel, but at the moment you cannot add a new simulation to the queue at all (error 5506 - Simultaneous simulations limit reached).
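To make that distinction explicit, here is a minimal Go sketch (hypothetical names and limit value, not the actual Cloudsim behavior) contrasting the two semantics: rejecting a submission outright once the limit is reached, which matches error 5506, versus accepting it into a queue and processing only a fixed number in parallel:

```go
package main

import (
	"errors"
	"fmt"
)

const maxSimultaneous = 1 // hypothetical per-team limit

var errLimitReached = errors.New("5506 - Simultaneous simulations limit reached")

// submitReject models the observed behavior: once the per-team limit of
// running simulations is hit, a new submission is rejected immediately.
func submitReject(running int) error {
	if running >= maxSimultaneous {
		return errLimitReached
	}
	return nil // accepted and started right away
}

// submitQueue models the expected behavior: submissions beyond the limit
// are still accepted, but wait in a queue until a slot frees up.
func submitQueue(running int, queue []string, id string) (started bool, newQueue []string) {
	if running >= maxSimultaneous {
		return false, append(queue, id) // accepted, but queued
	}
	return true, queue // accepted and started right away
}

func main() {
	fmt.Println(submitReject(1)) // 5506 - Simultaneous simulations limit reached
	started, q := submitQueue(1, nil, "run-2")
	fmt.Println(started, q) // false [run-2]
}
```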
Hello,
a few days ago @angelacmaio announced on https://community.subtchallenge.com/t/practice-for-cave-circuit-virtual-competition/1449 that the number of simultaneous submissions will be reduced from 10 to 3:
Teams can continue submitting solutions against the practice scenarios. To ensure cloud machine availability for all teams, each team can submit a maximum of 3 simultaneous practice runs on Cloudsim. Once the limit is reached, teams will not be able to submit additional runs until at least one of the 3 runs has finished. During peak usage, submissions may also display a “queued” status until machines become available.
At the moment we have two simple simulations (with a small set of robots) pending for more than 24 hours, so I would suggest limiting the number of concurrent runs per team to only one, so that it still provides some feedback in a "reasonable time". You can vote 👍 or 👎
thanks
Martin/Robotika Team