Duplicate entity name prevention #480
Conversation
Gave a quick skim: just a couple of small requests and some threads to pull on while you work out the CI.
smartsim/_core/control/controller.py
Outdated
@@ -512,6 +512,14 @@ def _launch_step(
        :raises SmartSimError: if launch fails
        """
        try:
            if (
                entity.name in self._jobs.completed
Are we sure about this check? Won't this prevent relaunching a completed model?
exp = Experiment(...)
rs = exp.create_run_settings(...)
model = exp.create_model(run_settings=rs, ...)
exp.start(model, block=True) # Very explicitly wait for the model to complete
exp.start(model, block=True) # <-- I do not think we want this to error
@MattToast I think I pushed this idea because running the same model would overwrite outputs from the first run. Do we care about the outputs?
It's not super common for anything I write, but I know there have been use cases where users run a model, ignore the outputs or programmatically move them somewhere else, re-generate the model dir, and then re-run the same model.
Another more common case (at least for me) is when a user working in a Jupyter notebook starts up an orchestrator, stops it for some reason, but then decides to relaunch the same orc object without restarting the kernel or creating a new experiment.
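For concreteness, that notebook flow might look roughly like the sketch below (the launcher, port, and database setup here are placeholders I made up, not anything from this PR):
from smartsim import Experiment

exp = Experiment("notebook-session", launcher="local")
orc = exp.create_database(port=6780)  # placeholder port

exp.start(orc)   # bring the orchestrator up
# ... interact with the database ...
exp.stop(orc)    # tear it down for whatever reason

# Later, in the same kernel, relaunch the very same orc object.
# A check against previously seen names would block this second start.
exp.start(orc)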
I'd be hesitant to make this change just because this is behavior that we have billed as a feature in the past (see tests/on_wlm/test_restart.py) and it would constitute a pretty significant API break.
Which isn't to say that I think the API itself isn't somewhat convoluted, especially now that we are specifically trying to enforce a name uniqueness requirement that was, in the past, at best a suggestion, despite our assumption that names were unique. Unfortunately, that does mean that a user could start another model of the same name if they let the first model finish, e.g.:
exp = Experiment("multi-start")
rs = exp.create_run_settings("echo", ["this", "just", "works"])
model_1 = exp.create_model("my-model", rs)
model_2 = exp.create_model("my-model", rs) # <-- note the same name
exp.start(model_1, block=True)
exp.start(model_2, block=True) # <-- This, unfortunately, now just works
# which, I would argue, is partially what
# we are trying to discourage
I suppose we could check to see if the user is trying to start the same object with a
if (entity.name in self._jobs.completed
        and entity is not self._jobs.completed[entity.name]):
    ...  # TODO: block the launch
condition, but I think the same problem exists because a model's run_settings is a writable attribute. So a user could effectively just launch the same model, with the same name, with a completely different set of run settings, e.g.:
exp = Experiment("multi-start")
rs_1 = exp.create_run_settings("echo", ["my", "model"])
model = exp.create_model("my-model")
exp.start(model, block=True)
rs_2 = exp.create_run_settings("echo", ["my", "completely", "different", "model"])
model.run_settings = rs_2
exp.start(model) # <-- Effectively just ran two different models
# with the same name
somewhat defeating the point of the check.
For that reason, I think we should scope this PR to "preventing duplicate simultaneously running entity names launched through the same experiment", and decide how (or even if) we want to enforce a true unique names constraint in a later spike. Does that seem like a good scope @amandarichardsonn @ankona?
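A minimal sketch of what that narrower check could look like, written as a standalone helper rather than as the actual controller code; the shape of the running-jobs mapping is an assumption on my part, not the final API:
from smartsim.error import SmartSimError


def assert_name_not_running(entity_name: str, running_jobs: dict) -> None:
    """Reject a launch if an entity with this name is already running.

    `running_jobs` stands in for whatever mapping of active jobs the
    controller/JobManager keeps; completed jobs are deliberately ignored
    so that restarting a finished entity remains allowed.
    """
    if entity_name in running_jobs:
        raise SmartSimError(
            f"An entity named '{entity_name}' is already running in this experiment"
        )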
I have the current implementation below, which will allow users to restart a completed entity but will not allow users to run an entity with the same name as one that is completed or running. I think this is a band-aid patch and should be revisited.
completed_job = self._jobs.completed.get(entity_passed_in.name, None)
if completed_job is None and <entity_passed_in is not a running job>:
    launch
elif completed_job is not None and completed_job.entity is entity_passed_in:
    launch
else:
    raise
However, with the change above, users are now required to specify a db_identifier if launching multiple Orchestrators. The implementation caught a SmartSim test that launched two orchestrators with no db_identifiers and therefore threw on the duplicate name. Do we want this?
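For reference, the two-orchestrator case would then have to look something like the sketch below; the create_database call and its db_identifier argument are recalled from memory and the ports are placeholders:
from smartsim import Experiment

exp = Experiment("multi-db", launcher="local")

# Distinct identifiers keep the generated names unique; omitting them
# would give both orchestrators the default name and trip the new check.
db_1 = exp.create_database(port=6780, db_identifier="db_1")
db_2 = exp.create_database(port=6790, db_identifier="db_2")

exp.start(db_1)
exp.start(db_2)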
I say if this is more complex than the implementation that I currently have, we scope down to what you mentioned, @MattToast. Is this something we should discuss with the team before merging, i.e. whether we allow users to overwrite entity info by relaunching? Or is this just supporting SmartSim's currently intended functionality?
Codecov Report
Attention: Patch coverage is ...

@@            Coverage Diff             @@
##           develop     #480      +/-   ##
===========================================
- Coverage    90.83%   90.75%    -0.09%
===========================================
  Files           60       60
  Lines         3818     3838       +20
===========================================
+ Hits          3468     3483       +15
- Misses         350      355        +5
One last minor nit, but nothing worth holding approval up over! LGTM!!
Removes behavior deprecated in #480 from test suite. [ committed by @MattToast ] [ reviewed by @mellis13 ]
This PR prevents the launch of entities with names that already exist in the Experiment.
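A rough illustration of the behavior this adds (the names and arguments are placeholders, and the exact exception raised is not shown in this thread):
from smartsim import Experiment

exp = Experiment("dup-names", launcher="local")
rs = exp.create_run_settings("echo", ["hello"])

model_1 = exp.create_model("my-model", rs)
model_2 = exp.create_model("my-model", rs)  # same name as model_1

exp.start(model_1, block=False)
exp.start(model_2)  # now rejected: an entity named "my-model" already exists
                    # in the experiment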