
Bulk Executor and HTCondor Bulk Operations #224

Merged: 58 commits into MatterMiners:master on Feb 9, 2022

Conversation

@maxfischer2781 (Member) commented Dec 22, 2021

This PR adds bulk execution to the HTCondor SiteAdapter. Major changes include:

  • generic AsyncBulkCall framework class for collecting tasks to execute in bulk (a minimal sketch of the pattern is shown below)
  • HTCondorAdapter uses bulk executions for its commands
    • deploy resource
    • stop resource
    • terminate resource
  • Documented settings

See #223.
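For readers unfamiliar with the pattern: the idea is that many individual awaitable calls are queued and then executed as one bulk command (e.g. a single condor_rm covering many drones). The following is a minimal, self-contained sketch of that pattern only; the class, method, and parameter names (BulkCaller, size, delay, bulk_submit) are placeholders and not the actual AsyncBulkCall API added in this PR:

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


class BulkCaller:
    """Toy bulk-call helper: queue single items, run them as one bulk command."""

    def __init__(
        self,
        command: Callable[[List[str]], Awaitable[List[str]]],
        size: int = 100,
        delay: float = 1.0,
    ):
        self._command = command  # executes one bulk, returns one result per item
        self._size = size        # flush once this many items are queued ...
        self._delay = delay      # ... or after this many seconds, whichever is first
        self._queue: "asyncio.Queue[Tuple[str, asyncio.Future]]" = asyncio.Queue()
        self._dispatcher = None

    async def __call__(self, item: str) -> str:
        if self._dispatcher is None:
            self._dispatcher = asyncio.ensure_future(self._dispatch())
        result = asyncio.get_running_loop().create_future()
        await self._queue.put((item, result))
        return await result  # resolves once the bulk containing `item` has run

    async def _dispatch(self):
        while True:
            bulk = [await self._queue.get()]
            deadline = asyncio.get_running_loop().time() + self._delay
            while len(bulk) < self._size:
                remaining = deadline - asyncio.get_running_loop().time()
                try:
                    bulk.append(await asyncio.wait_for(self._queue.get(), max(remaining, 0)))
                except asyncio.TimeoutError:
                    break
            items, futures = zip(*bulk)
            results = await self._command(list(items))
            for future, value in zip(futures, results):
                future.set_result(value)


async def bulk_submit(jobs: List[str]) -> List[str]:
    # stand-in for a single condor_submit covering many drones at once
    return [f"submitted {job}" for job in jobs]


async def main():
    submit = BulkCaller(bulk_submit, size=3, delay=0.1)
    # three individual calls end up in one bulk_submit invocation
    print(await asyncio.gather(submit("drone-1"), submit("drone-2"), submit("drone-3")))


asyncio.run(main())
```

The size/delay trade-off is the general one: larger bulks mean fewer condor invocations, while a shorter delay means less added latency per call.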

Open design questions:

  • Should condor_q calls be converted to a bulk call instead of a cached view of the entire queue?
  • Should there be separate bulk limits for each type of command? My hunch is "yes" if we have condor_q and "don't bother" otherwise.

⚠️ This PR changes the Resource UUID format used by the HTCondor Site adapter: it now uses ClusterId.ProcId instead of just ClusterId. The code can handle both UUID formats but only produces the new one.
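For illustration, handling both UUID formats side by side could look roughly like this; the helper name job_id and the "missing ProcId means 0" assumption are mine, not necessarily what the adapter does:

```python
def job_id(resource_uuid: str) -> str:
    """Normalise a resource UUID to the ClusterId.ProcId form used by condor commands."""
    # old format: "15540"   -> assume ProcId 0, i.e. "15540.0"
    # new format: "15540.2" -> already fully qualified, passed through unchanged
    return resource_uuid if "." in resource_uuid else f"{resource_uuid}.0"


assert job_id("15540") == "15540.0"
assert job_id("15540.2") == "15540.2"
```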

@codecov-commenter commented Dec 22, 2021

Codecov Report

Merging #224 (cea7d68) into master (f978b19) will decrease coverage by 0.15%.
The diff coverage is 97.08%.


@@            Coverage Diff             @@
##           master     #224      +/-   ##
==========================================
- Coverage   99.50%   99.34%   -0.16%     
==========================================
  Files          53       54       +1     
  Lines        2022     2139     +117     
==========================================
+ Hits         2012     2125     +113     
- Misses         10       14       +4     
Impacted Files                        Coverage Δ
tardis/interfaces/siteadapter.py      100.00% <ø> (ø)
tardis/utilities/asyncbulkcall.py     96.00% <96.00%> (ø)
tardis/adapters/sites/htcondor.py     99.20% <98.18%> (-0.80%) ⬇️
tardis/interfaces/executor.py         90.90% <100.00%> (+10.90%) ⬆️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@maxfischer2781 (Member, Author) commented Jan 3, 2022

I don't really get the issue with the deployment test.

======================================================================
FAIL: test_run (tests.resources_t.test_drone.TestDrone)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.8/unittest/mock.py", line 1325, in patched
    return func(*newargs, **newkeywargs)
  File "/__w/tardis/tardis/tests/resources_t/test_drone.py", line 103, in test_run
    mocked_asyncio_sleep.assert_called_once_with(
  File "/usr/lib/python3.8/unittest/mock.py", line 924, in assert_called_once_with
    raise AssertionError(msg)
AssertionError: Expected 'sleep' to be called once. Called 4 times.
Calls: [call(0), call(0), call(0), call(10)].

My hunch is that the additional sleep calls are unrelated to what we actually want to test. They might be some internal calls to switch tasks during initialisation.

It might be necessary to reset_mock, or to sum up the call_args_list and check that instead.


I've spuriously hit the issue when running the test suite locally.


I've changed the test to only look at the entire (mocked) elapsed time. Alternatively, testing the last call could also work.
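A sketch of the "test the last call" alternative: unittest.mock's assert_called_with only checks the most recent call, so the intermediate sleep(0) calls would not matter (the names follow the test snippet quoted further down in this thread):

```python
# the final sleep should cover the full heartbeat interval
mocked_asyncio_sleep.assert_called_with(self.mock_site_agent.drone_heartbeat_interval)
```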


@maxfischer2781 (Member, Author)

@giffels do you think you can give this a first go soon? There'll definitely be changes, so no need to be thorough, but having your input on some of the open questions would simplify things.

@giffels (Member) left a comment

Thanks a lot for your nice work on this and sorry for the late review. I have left a few comments to work on before getting green lights.


key_translator = StaticMapping(
-    remote_resource_uuid="ClusterId",
+    remote_resource_uuid="JobId",
Member

I think the problem is in

resource_uuid = resource_attributes.remote_resource_uuid
resource_status = self._htcondor_queue[resource_uuid]
If you apply _job_id to remote_resource_uuid, it should be fine.
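In code, the suggestion amounts to something like the following sketch; here _job_id stands for whatever helper the PR uses to normalise a UUID to the ClusterId.ProcId form:

```python
# before: the raw remote_resource_uuid is used directly as the queue key
resource_uuid = resource_attributes.remote_resource_uuid
# suggested: normalise it first, so that old ClusterId-only UUIDs still match
resource_status = self._htcondor_queue[_job_id(resource_uuid)]
```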

Resolved review threads: setup.py, tardis/adapters/sites/htcondor.py (×2)
raise
# successes are in stdout, failures in stderr, both in argument order
# stdout: Job 15540.0 marked for removal
# stderr: Job 15612.0 not found
Member

Does a "not found" need to be treated here as well? At least I would like to have a logger.debug output here.

Member Author

Sure, I can add a debug log. Would you be okay if it's next to the raise TardisResourceStatusUpdateFailed in the adapter methods? It seems we don't get any sensible debug output from condor anyway.

Member

That would be okay!
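For illustration, the agreed-upon debug log next to the raise might end up looking roughly like the sketch below; the helper function, logger name, and message are made up for the example, only the pairing of logger.debug with the raise reflects the discussion:

```python
import logging

logger = logging.getLogger(__name__)  # placeholder; the adapter defines its own logger


class TardisResourceStatusUpdateFailed(Exception):
    """Stand-in for tardis' exception of the same name."""


def raise_on_failure(resource_uuid: str, failures: dict) -> None:
    """Log at debug level, then raise, if the bulk condor command reported a failure."""
    if resource_uuid in failures:
        logger.debug(
            "condor command failed for %s: %s", resource_uuid, failures[resource_uuid]
        )
        raise TardisResourceStatusUpdateFailed
```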

Comment on lines 101 to 105
if not isinstance(self._concurrency, int) or self._concurrency <= 0:
raise ValueError(
"'concurrent' must be one of True, False, None or an integer above 0"
f", got {self._concurrency!r} instead"
)
Member

You could also use pydantic here as well; however, I do not insist on it.
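For reference, a pydantic (v1-style) variant of that check could look roughly like the sketch below; the model and field names are invented for the example and are not part of the PR:

```python
from typing import Optional, Union

from pydantic import BaseModel, StrictBool, validator


class BulkCallSettings(BaseModel):
    # mirrors the hand-written check: True, False, None, or a positive integer
    concurrent: Optional[Union[StrictBool, int]] = None

    @validator("concurrent")
    def _flag_or_positive_int(cls, value):
        if value is None or isinstance(value, bool):
            return value
        if value <= 0:
            raise ValueError(
                "'concurrent' must be one of True, False, None or an integer above 0"
            )
        return value


print(BulkCallSettings(concurrent=4).concurrent)     # 4
print(BulkCallSettings(concurrent=True).concurrent)  # True
```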

async def _bulk_dispatch(self):
"""Collect tasks into bulks and dispatch them for command execution"""
while True:
await asyncio.sleep(0)
Member

Is there any reason for this? Should it be > 0?

Member Author

It's not strictly required, but it prevents the while True loop from starving the event loop.

From the asyncio.sleep documentation: "Setting the delay to 0 provides an optimized path to allow other tasks to run. This can be used by long-running functions to avoid blocking the event loop for the full duration of the function call."

Both await self._get_bulk and await self._concurrent.acquire() have fast paths that do not yield to the event loop if they can succeed immediately. So a "worst case" would be that, say, a few hundred tasks make a BulkCall and get queued, then _bulk_dispatch packs them all into bulk tasks at once; only when the queue is empty would _bulk_dispatch suspend and allow the bulk tasks to actually start running.
With the sleep(0) the loop is guaranteed to let other tasks run no matter what.
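The effect of the bare sleep(0) can be seen in isolation with a toy producer/consumer, unrelated to the adapter code; without the forced suspension the consumer only runs once the producer is completely done:

```python
import asyncio


async def producer(queue: asyncio.Queue, yield_each_time: bool) -> None:
    """Queue five items; put() on an unbounded queue never suspends by itself."""
    for i in range(5):
        await queue.put(i)
        if yield_each_time:
            await asyncio.sleep(0)  # force a suspension so other tasks get a turn
    print("producer done")


async def consumer(queue: asyncio.Queue) -> None:
    while True:
        print("consumed", await queue.get())


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.ensure_future(consumer(queue))
    # yield_each_time=False: "producer done" is printed before any "consumed ..."
    # yield_each_time=True: production and consumption interleave
    await producer(queue, yield_each_time=True)
    await asyncio.sleep(0.1)  # give the consumer time to drain the rest
    task.cancel()


asyncio.run(main())
```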

Member Author

I've refactored this and _get_bulk for better task switching.

  • Moved and commented the await asyncio.sleep(0). It will only trigger when the fast paths could be hit.
  • The first item is fetched before starting the timeout. This allows efficiently waiting for a long time instead of spin-waiting in delay intervals.

Comment on lines +103 to +106
# duration skipped via asyncio.sleep
# use the `sum` to keep `asyncio.sleep(0)` context switches from skewing the result
mock_elapsed = sum(args[0] for args, _ in mocked_asyncio_sleep.call_args_list)
self.assertEqual(mock_elapsed, self.mock_site_agent.drone_heartbeat_interval)
Member

Hmm, I have never experienced this issue before. Do you remember what the error message was in your local tests?

@giffels added the labels "enhancement" (New feature or request) and "Improvement" (Code Improvements) on Jan 31, 2022
@giffels added this to the 0.7.0 - Release milestone on Jan 31, 2022


lgtm-com bot commented Feb 3, 2022

This pull request introduces 2 alerts and fixes 1 when merging f8b9c70 into eb1e91c - view on LGTM.com

new alerts:

  • 2 for Clear-text logging of sensitive information

fixed alerts:

  • 1 for Wrong number of arguments in a call

@maxfischer2781 (Member, Author)

Alright, I guess this is ready for another round.

@eileen-kuehn (Member) left a comment

Looks great, thanks for your work 👍

@giffels (Member) left a comment

Brilliant work, thanks a lot @maxfischer2781. I will deploy it on the Compute4PUNCH instance for testing and merge afterwards.

@giffels merged commit 1c30c48 into MatterMiners:master on Feb 9, 2022
giffels added a commit to giffels/tardis that referenced this pull request Apr 19, 2022
giffels added a commit to giffels/tardis that referenced this pull request Jan 20, 2023
giffels added a commit to giffels/tardis that referenced this pull request Feb 24, 2023