Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix updated timestamp #307

Merged
merged 20 commits into from
Oct 5, 2023
Merged

Conversation

giffels
Copy link
Member

@giffels giffels commented Aug 3, 2023

The original idea was that created and updated timestamps indicate a change of the DroneState. However, in the meantime it was also updated in some SiteAdapters, when the resource status changed, e.g. through a resource_status call on certain SiteAdapters. Through the drone_minimum_lifetime setting seems to be ignored, because resource_status is called every minute, while drone_minimum_lifetime is usually in the order of hours.

  • Remove all updates of the created and updated timestamp from adapters
  • Update all adapters: Do not update the resource_attributes dictionary, this is only done in the DroneStates
  • Update all adapters: deploy_resource and resource_status should return only translated dictionaries of values to eventually modify the resource_attributes dictionary
  • Update all adapters: stop_resource and terminate_resource should return None
  • Use AsyncBulkCall for the HTCondor, Moab and Slurm site adpter as discussed below.

Fixes #296

@giffels giffels force-pushed the fix-updated-timestamp branch from d9a4831 to 759c932 Compare August 3, 2023 10:37
@codecov-commenter
Copy link

codecov-commenter commented Aug 3, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (886255e) 98.90% compared to head (58fb461) 98.86%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #307      +/-   ##
==========================================
- Coverage   98.90%   98.86%   -0.04%     
==========================================
  Files          55       55              
  Lines        2277     2210      -67     
==========================================
- Hits         2252     2185      -67     
  Misses         25       25              
Files Coverage Δ
tardis/adapters/sites/cloudstack.py 100.00% <ø> (ø)
tardis/adapters/sites/fakesite.py 100.00% <100.00%> (ø)
tardis/adapters/sites/htcondor.py 100.00% <100.00%> (ø)
tardis/adapters/sites/kubernetes.py 100.00% <100.00%> (ø)
tardis/adapters/sites/moab.py 100.00% <100.00%> (ø)
tardis/adapters/sites/openstack.py 100.00% <100.00%> (ø)
tardis/adapters/sites/slurm.py 100.00% <100.00%> (ø)
tardis/interfaces/siteadapter.py 100.00% <100.00%> (ø)
tardis/resources/drone.py 100.00% <ø> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@giffels
Copy link
Member Author

giffels commented Aug 4, 2023

I am currently not really happy about these changes in the Slurm, Moab and HTCondor site adapters f2c9f1b.

Previously the created timestamp has been updated in the corresponding site adaters and we could simply check (batch_system_last_status_update-created) < 0 to retry the resource_status call in case the asynchronously updated batch system status was last updated before the job was actually submited.

Is having a grace period enough or should we introduced a further timestamp for that use-case? Do you have any opionions on that? @MatterMiners/review

On the other hand it can possibly end up in a race condition.

BTW.: I checked that the delay introduced between created timestamp and the actually job subission is around 0.6s in most cases with some outliers at 2s.

@maxfischer2781
Copy link
Member

Is having a grace period enough or should we introduced a further timestamp for that use-case? Do you have any opionions on that? https://github.com/orgs/MatterMiners/teams/review

I think the "true" solution is to stop using asynccachemap for the entire queue and instead do an asyncbulkcall for those jobs we actually care about. This would once and for all fix the issue of outdated data and should still scale well enough.

In return, I think it is totally fine to introduce an additional timestamp as a stopgap solution to keep the asynccachemap approach running for now. Just pick a name that makes it clear what the timestamp is there for so that it doesn't get misused for something else.

@giffels
Copy link
Member Author

giffels commented Aug 7, 2023

Thanks a lot @maxfischer2781, that was the input I was looking for. 👍 Just to get the idea right, asyncbulkcall would in principle "block" all resource_status calls for a given time period, right?

@maxfischer2781
Copy link
Member

Yes, asyncbulkcall will asynchronously block until a delay or a specific amount of calls is reached. Since drones should mostly update in bulk (since they are all triggered by the Controller/FactoryPool at once or an internal timer) I think we can go for a small delay in the order of 0.1s-1s.

@giffels giffels mentioned this pull request Aug 8, 2023
10 tasks
@giffels giffels added bug Something isn't working Improvement Code Improvements labels Aug 8, 2023
@giffels giffels force-pushed the fix-updated-timestamp branch from 9c0fc20 to 023a71c Compare August 8, 2023 09:25
@giffels giffels self-assigned this Aug 8, 2023
return AttributeDict(resource_status=ResourceStatus.Deleted)
else:
return self.handle_response(resource_status)
return self.handle_response(await self._condor_q(resource_attributes))

Check failure

Code scanning / CodeQL

Wrong number of arguments in a call

Call to [method SiteAdapter.handle_response](1) with too few arguments; should be no fewer than 3.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial has been used, which creates a false postive in CodeQL.

return convert_to_attribute_dict(
{**resource_attributes, **self.handle_response(resource_status)}
)
return self.handle_response(await self._squeue(resource_attributes))

Check failure

Code scanning / CodeQL

Wrong number of arguments in a call

Call to [method SiteAdapter.handle_response](1) with too few arguments; should be no fewer than 3.
return convert_to_attribute_dict(
{**resource_attributes, **self.handle_response(resource_status)}
)
return self.handle_response(await self._showq(resource_attributes))

Check failure

Code scanning / CodeQL

Wrong number of arguments in a call

Call to [method SiteAdapter.handle_response](1) with too few arguments; should be no fewer than 3.
@giffels giffels force-pushed the fix-updated-timestamp branch from c76a913 to 04f3b0d Compare August 10, 2023 14:33
@giffels giffels marked this pull request as ready for review August 10, 2023 15:10
@giffels giffels requested review from maxfischer2781, a team and mschnepf and removed request for a team August 10, 2023 15:11
mschnepf
mschnepf previously approved these changes Aug 14, 2023
Copy link
Member

@mschnepf mschnepf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one minor comment, but no showstoppers. Besides that, it looks good to me. Make it so.

tardis/adapters/sites/htcondor.py Show resolved Hide resolved
@giffels
Copy link
Member Author

giffels commented Sep 20, 2023

@maxfischer2781 and @mschnepf, I think we can continue reviewing this request. The code works at least with HoreKa.

mschnepf
mschnepf previously approved these changes Oct 2, 2023
Copy link
Member

@mschnepf mschnepf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. 👍 Thanks for your work

Copy link
Member

@mschnepf mschnepf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍

Copy link
Member

@maxfischer2781 maxfischer2781 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go for it.

@giffels giffels added this pull request to the merge queue Oct 5, 2023
Merged via the queue into MatterMiners:master with commit c85ed76 Oct 5, 2023
@giffels giffels deleted the fix-updated-timestamp branch October 5, 2023 07:58
giffels added a commit to giffels/tardis that referenced this pull request Oct 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Improvement Code Improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

drone_minimum_lifetime ignored
4 participants