Bulk Executor and HTCondor Bulk Operations #224
Conversation
Codecov Report

    @@            Coverage Diff             @@
    ##           master     #224      +/-   ##
    ==========================================
    - Coverage   99.50%   99.34%   -0.16%
    ==========================================
      Files          53       54       +1
      Lines        2022     2139     +117
    ==========================================
    + Hits         2012     2125     +113
    - Misses         10       14       +4

Continue to review full report at Codecov.
…an be used instead
I don't really get the issue with the deployment test.

My hunch is that the additional sleep calls are unrelated to what we actually want to test. They might be some internal calls to switch tasks during initialisation. I've spuriously hit the issue when running the test suite locally, so I've changed the test to only look at the entire (mocked) elapsed time. Alternatively, testing the last call could also work.
@giffels do you think you can give this a first go soon? There'll definitely be changes, so no need to be thorough, but having your input on some of the open questions would simplify things.
Thanks a lot for your nice work on this, and sorry for the late review. I have left a few comments to work on before giving the green light.
    key_translator = StaticMapping(
    -    remote_resource_uuid="ClusterId",
    +    remote_resource_uuid="JobId",
I think the problem is in tardis/adapters/sites/htcondor.py, lines 247 to 248 in ce859f7:

    resource_uuid = resource_attributes.remote_resource_uuid
    resource_status = self._htcondor_queue[resource_uuid]

If you apply `_job_id` on `remote_resource_uuid`, it should be fine.
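A minimal sketch of that suggestion, assuming (hypothetically) that `_job_id` reduces the new `ClusterId.ProcId` identifier to the bare `ClusterId` key used by the cached queue view:

```python
def _job_id(resource_uuid: str) -> str:
    """Hypothetical helper: reduce "ClusterId.ProcId" to the bare "ClusterId"."""
    return resource_uuid.split(".")[0]

# e.g. a queue view keyed by ClusterId, as the adapter's cached condor_q is assumed to be
htcondor_queue = {"15540": {"JobStatus": "2"}}
assert _job_id("15540.0") in htcondor_queue
```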
    raise
    # successes are in stdout, failures in stderr, both in argument order
    # stdout: Job 15540.0 marked for removal
    # stderr: Job 15612.0 not found
Does a "not found" need to be treated here as well? At least I would like to have `logger.debug` output here.
Sure, I can add a debug log. Would you be okay if it's next to the raise TardisResourceStatusUpdateFailed in the adapter methods? It seems we don't get any sensible debug output from condor anyway.
That would be okay!
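For illustration, a sketch of what was agreed here, assuming (as the hunk above shows) that bulk condor_rm reports successes on stdout and failures on stderr; the helper and exception below are stand-ins, not the actual adapter code:

```python
import logging
import re

logger = logging.getLogger(__name__)

class TardisResourceStatusUpdateFailed(Exception):
    """Stand-in for the real tardis exception, for illustration only."""

def check_removal(job_id: str, stdout: str, stderr: str) -> None:
    # successes are in stdout ("Job 15540.0 marked for removal"),
    # failures in stderr ("Job 15612.0 not found"), both in argument order
    if job_id in re.findall(r"Job (\S+) not found", stderr):
        # debug log right next to the raise, as agreed above
        logger.debug("condor_rm could not find job %s", job_id)
        raise TardisResourceStatusUpdateFailed
```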
tardis/utilities/asyncbulkcall.py (outdated)
    if not isinstance(self._concurrency, int) or self._concurrency <= 0:
        raise ValueError(
            "'concurrent' must be one of True, False, None or an integer above 0"
            f", got {self._concurrency!r} instead"
        )
You could use pydantic here as well; however, I do not insist on it.
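A minimal sketch of that suggestion, simplified to a plain positive integer (the model and field names are hypothetical, not part of the PR):

```python
from pydantic import BaseModel, PositiveInt, ValidationError

class BulkCallSettings(BaseModel):
    # pydantic rejects anything that is not a positive integer
    concurrency: PositiveInt = 1

try:
    BulkCallSettings(concurrency=0)
except ValidationError as err:
    print(err)  # validation error instead of a hand-written ValueError
```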
tardis/utilities/asyncbulkcall.py (outdated)
    async def _bulk_dispatch(self):
        """Collect tasks into bulks and dispatch them for command execution"""
        while True:
            await asyncio.sleep(0)
Is there any reason for this? Should it be > 0?
It's not strictly required, but it prevents the `while True` loop from starving the event loop. Both `await self._get_bulk` and `await self._concurrent.acquire()` have fast paths that do not yield to the event loop if they can succeed immediately. So a "worst case" would be that, say, a few hundred tasks make a BulkCall and get queued, then `_bulk_dispatch` packs them all into bulk tasks at once; only when the queue is empty would `_bulk_dispatch` suspend and allow the bulk tasks to actually start running. With the `sleep(0)` the loop is guaranteed to let other tasks run no matter what.
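A self-contained illustration of the effect (not the tardis code; the queue-draining loop below stands in for `_bulk_dispatch`):

```python
import asyncio

async def drain(queue: asyncio.Queue) -> None:
    """Stand-in for _bulk_dispatch: repeatedly take work from the queue."""
    while True:
        # asyncio.Queue.get has a fast path: with items available it does not
        # yield to the event loop, so without this sleep(0) the loop could
        # drain the entire queue before any other task gets to run
        await asyncio.sleep(0)
        item = await queue.get()
        print("dispatching", item)

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(5):
        queue.put_nowait(i)
    dispatcher = asyncio.ensure_future(drain(queue))
    await asyncio.sleep(0.1)  # let the dispatcher run briefly
    dispatcher.cancel()

asyncio.run(main())
```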
I've refactored this and `_get_bulk` for better task switching:
- Moved and commented the `await asyncio.sleep(0)`. It will only trigger when the fast paths could be hit.
- The first item is fetched before starting the timeout. This allows waiting efficiently for a long time, instead of spin-waiting in `delay` intervals (see the sketch below).
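A minimal sketch of that second point, assuming a task queue, a maximum bulk size, and a collection delay (names are illustrative, not the PR's actual implementation):

```python
import asyncio

async def get_bulk(queue: asyncio.Queue, size: int, delay: float) -> list:
    # block as long as needed for the first item -- no spin-waiting
    bulk = [await queue.get()]
    # only once there is something to collect does the delay window start
    deadline = asyncio.get_running_loop().time() + delay
    while len(bulk) < size:
        timeout = deadline - asyncio.get_running_loop().time()
        if timeout <= 0:
            break
        try:
            bulk.append(await asyncio.wait_for(queue.get(), timeout))
        except asyncio.TimeoutError:
            break
    return bulk
```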
    # duration skipped via asyncio.sleep
    # use the `sum` so that `asyncio.sleep(0)` context switches do not skew the result
    mock_elapsed = sum(args[0] for args, _ in mocked_asyncio_sleep.call_args_list)
    self.assertEqual(mock_elapsed, self.mock_site_agent.drone_heartbeat_interval)
Hmm, I have never experienced this issue before. Do you remember what the error message was in your local tests?
This pull request introduces 2 alerts and fixes 1 when merging f8b9c70 into eb1e91c - view on LGTM.com

new alerts: …
fixed alerts: …
Alright, I guess this is ready for another round.
Looks great, thanks for your work 👍
Brilliant work, thanks a lot @maxfischer2781. I will deploy it on the Compute4PUNCH instance for testing and merge afterwards.
This PR adds bulk execution to the HTCondor SiteAdapter. Major changes include:
- A new `AsyncBulkCall` framework class for collecting tasks to execute in bulk
- `HTCondorAdapter` uses bulk executions for its commands

See #223.

Open design questions:
- Should `condor_q` calls be converted to a bulk call instead of a cached view of the entire queue? … `condor_q` … and "don't bother" otherwise.
- Resources are identified as `ClusterId.ProcId` instead of just `ClusterId`. The code can handle both UUID types but only produces the new one now (see the sketch below).
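As an illustration of that last point, a hypothetical normaliser that accepts both UUID styles; the padding rule is an assumption for the sketch, not taken from the PR:

```python
def normalize_uuid(resource_uuid: str) -> str:
    # assumption: a bare ClusterId corresponds to its first proc, ProcId 0
    return resource_uuid if "." in resource_uuid else f"{resource_uuid}.0"

assert normalize_uuid("15540") == "15540.0"    # legacy ClusterId
assert normalize_uuid("15540.0") == "15540.0"  # new ClusterId.ProcId
```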