Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate entity name prevention #480

Merged

Conversation

amandarichardsonn
Copy link
Contributor

This PR prevents the launch of entities with names that already exist in the Experiment.

@MattToast MattToast self-requested a review February 7, 2024 23:14
Copy link
Member

@MattToast MattToast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave a quick skim: just a couple of small requests and some threads to pull on while you work out the CI

@@ -512,6 +512,14 @@ def _launch_step(
:raises SmartSimError: if launch fails
"""
try:
if (
entity.name in self._jobs.completed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure about this check? Won't this prevent relaunching a completed model?

exp = Experiment(...)
rs = exp.create_run_settings(...)
model = exp.create_model(run_settings=rs, ...)

exp.start(model, block=True)  # Very explicitly wait for the model to complete
exp.start(model, block=True)  # <-- I do not think we want this to error

Copy link
Contributor

@ankona ankona Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MattToast I think I pushed this idea because running the same model would overwrite outputs from the first run. Do we care about the outputs?

Copy link
Member

@MattToast MattToast Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not super common for anything I write, but I know there have been use cases were users run a model, ignore the outputs or programmatically move them somewhere else, re-generate the model dir, and then re-run the same model.

Another more common case (at least for me) is, when a user is working in a jupyter notebook, they might start up an orchestrator, stop it for some reason, but then decide to relaunch the same orc object without restarting the kernel or creating a new experiment.

I'd be hesitant to make this change just because this is behavior that we have billed as a feature in the past (see tests/on_wlm/test_restart.py) and would constitute an pretty significant API break.

Which isn't to say that I think the API itself isn't somewhat incredibly convoluted, especially now that we are specifically trying to enforce a name uniqueness requirement that was in the past, at best, a suggestion, despite our assumption that they were. Unfortunately, that does mean that a user could start another model of the same name if they let the first model finsish. e.g.

exp = Experiment("multi-start")
rs = exp.create_run_settings("echo", ["this", "just", "works"])
model_1 = exp.create_model("my-model", rs)
model_2 = exp.create_model("my-model", rs)  # <-- note the same name

exp.start(model_1, block=True)
exp.start(model_2, block=True)  # <-- This, unfortunately, now just works
                                #     which, I would argue, is partially what
                                #     we are trying to discourage

Copy link
Member

@MattToast MattToast Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we could check to see if the user is trying to start the same object with a

if (entity.name in self._jobs.completed
    and entity is not self._jobs.completed[entity.name]):
   ...  # TODO: block the launch

condition, but I think the same problem exists as a model's run settings is a writable attribute. So a user could effectively just launch the same model, with the same name, with a completely different set of run settings, e.g.

exp = Experiment("multi-start")

rs_1 = exp.create_run_settings("echo", ["my", "model"])
model = exp.create_model("my-model")
exp.start(model, block=True)

rs_2 = exp.create_run_settings("echo", ["my", "completely", "different", "model"])
model.run_settings = rs_2
exp.start(model)  # <-- Effectively just ran two different models
                  #     with the same name

somewhat defeating the point of the check.

For that reason, I think we should scope this PR to "preventing duplicate simultaneously running entity names launched through the same experiment", and decide how (or even if) we want to enforce a true unique names constraint in a later spike. Does that seem like a good scope @amandarichardsonn @ankona?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the current implementation below that will allow users to restart a completed entity but will not allow users to run an entity of the same name that is completed or running. I think this is a bandaid patch and should be revisited.

completed_job = self._jobs.completed.get(entity_passed_in.name, None)

if completed_job is None and is not a running job:
    launch
elif completed_job is not None and completed_job.entity is entity_passed_in:
    launch
else:
    raise

However, with this change above, users are now required to specify a db_identifier var if launching multiple Orchestrators. The implementation caught a SS test that launched two orchs with no db_identifiers, and therefore threw on same name. Do we want this?

Copy link
Contributor Author

@amandarichardsonn amandarichardsonn Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I say if this is more complex than the implementation that I currently have - we scope down to what you mentioned @MattToast. Is this something we should discuss with the team as far as allowing users to overwrite entity info by relaunching before merging in? or is this more supporting SS currently functionality intentions

Copy link

codecov bot commented Feb 9, 2024

Codecov Report

Attention: Patch coverage is 40.00000% with 12 lines in your changes are missing coverage. Please review.

Project coverage is 90.75%. Comparing base (8408368) to head (81d9140).
Report is 12 commits behind head on develop.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #480      +/-   ##
===========================================
- Coverage    90.83%   90.75%   -0.09%     
===========================================
  Files           60       60              
  Lines         3818     3838      +20     
===========================================
+ Hits          3468     3483      +15     
- Misses         350      355       +5     
Files Coverage Δ
smartsim/_core/control/controller.py 85.95% <40.00%> (-1.26%) ⬇️

... and 2 files with indirect coverage changes

Copy link
Member

@MattToast MattToast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last minor nit, but nothing worth holding approval up over! LGTM!!

@amandarichardsonn amandarichardsonn merged commit 39354db into CrayLabs:develop Feb 22, 2024
34 checks passed
@amandarichardsonn amandarichardsonn deleted the double-entity-name branch February 22, 2024 23:01
MattToast added a commit that referenced this pull request Mar 18, 2024
Removes behavior deprecated in #480 from test suite.

[ committed by @MattToast ]
[ reviewed by @mellis13 ]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants