
Bulk Executor and HTCondor Bulk Operations #224

Merged: 58 commits into MatterMiners:master on Feb 9, 2022

Conversation

@maxfischer2781 (Member) commented Dec 22, 2021

This PR adds bulk execution to the HTCondor SiteAdapter. Major changes include:

  • generic AsyncBulkCall framework class for collecting tasks to execute in bulk (a minimal sketch of the pattern is shown below)
  • HTCondorAdapter uses bulk executions for its commands
    • deploy resource
    • stop resource
    • terminate resource
  • Documented settings

See #223.
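For readers unfamiliar with the pattern: the idea is that many individual awaitable calls are queued and then executed as one bulk command (e.g. a single condor_rm covering many drones). The following is a minimal, self-contained sketch of that pattern only; the class, method, and parameter names (BulkCaller, size, delay, bulk_submit) are placeholders and not the actual AsyncBulkCall API added in this PR:

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


class BulkCaller:
    """Toy bulk-call helper: queue single items, run them as one bulk command."""

    def __init__(
        self,
        command: Callable[[List[str]], Awaitable[List[str]]],
        size: int = 100,
        delay: float = 1.0,
    ):
        self._command = command  # executes one bulk, returns one result per item
        self._size = size        # flush once this many items are queued ...
        self._delay = delay      # ... or after this many seconds, whichever is first
        self._queue: "asyncio.Queue[Tuple[str, asyncio.Future]]" = asyncio.Queue()
        self._dispatcher = None

    async def __call__(self, item: str) -> str:
        if self._dispatcher is None:
            self._dispatcher = asyncio.ensure_future(self._dispatch())
        result = asyncio.get_running_loop().create_future()
        await self._queue.put((item, result))
        return await result  # resolves once the bulk containing `item` has run

    async def _dispatch(self):
        while True:
            bulk = [await self._queue.get()]
            deadline = asyncio.get_running_loop().time() + self._delay
            while len(bulk) < self._size:
                remaining = deadline - asyncio.get_running_loop().time()
                try:
                    bulk.append(await asyncio.wait_for(self._queue.get(), max(remaining, 0)))
                except asyncio.TimeoutError:
                    break
            items, futures = zip(*bulk)
            results = await self._command(list(items))
            for future, value in zip(futures, results):
                future.set_result(value)


async def bulk_submit(jobs: List[str]) -> List[str]:
    # stand-in for a single condor_submit covering many drones at once
    return [f"submitted {job}" for job in jobs]


async def main():
    submit = BulkCaller(bulk_submit, size=3, delay=0.1)
    # three individual calls end up in one bulk_submit invocation
    print(await asyncio.gather(submit("drone-1"), submit("drone-2"), submit("drone-3")))


asyncio.run(main())
```

The size/delay trade-off is the general one: larger bulks mean fewer condor invocations, while a shorter delay means less added latency per call.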

Open design questions:

  • Should condor_q calls be converted to a bulk call instead of a cached view of the entire queue?
  • Should there be separate bulk limits for each type of command? My hunch is "yes" if we have condor_q and "don't bother" otherwise.

⚠️ This PR changes the Resource UUID format used by the HTCondor Site adapter: it now uses ClusterId.ProcId instead of just ClusterId. The code can handle both UUID formats but only produces the new one.
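For illustration, handling both UUID formats side by side could look roughly like this; the helper name job_id and the "missing ProcId means 0" assumption are mine, not necessarily what the adapter does:

```python
def job_id(resource_uuid: str) -> str:
    """Normalise a resource UUID to the ClusterId.ProcId form used by condor commands."""
    # old format: "15540"   -> assume ProcId 0, i.e. "15540.0"
    # new format: "15540.2" -> already fully qualified, passed through unchanged
    return resource_uuid if "." in resource_uuid else f"{resource_uuid}.0"


assert job_id("15540") == "15540.0"
assert job_id("15540.2") == "15540.2"
```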

@codecov-commenter commented Dec 22, 2021

Codecov Report

Merging #224 (cea7d68) into master (f978b19) will decrease coverage by 0.15%.
The diff coverage is 97.08%.


@@            Coverage Diff             @@
##           master     #224      +/-   ##
==========================================
- Coverage   99.50%   99.34%   -0.16%     
==========================================
  Files          53       54       +1     
  Lines        2022     2139     +117     
==========================================
+ Hits         2012     2125     +113     
- Misses         10       14       +4     
Impacted Files                        Coverage Δ
tardis/interfaces/siteadapter.py      100.00% <ø> (ø)
tardis/utilities/asyncbulkcall.py     96.00% <96.00%> (ø)
tardis/adapters/sites/htcondor.py     99.20% <98.18%> (-0.80%) ⬇️
tardis/interfaces/executor.py         90.90% <100.00%> (+10.90%) ⬆️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@maxfischer2781 (Member, Author) commented Jan 3, 2022

I don't really get the issue with the deployment test.

======================================================================
FAIL: test_run (tests.resources_t.test_drone.TestDrone)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.8/unittest/mock.py", line 1325, in patched
    return func(*newargs, **newkeywargs)
  File "/__w/tardis/tardis/tests/resources_t/test_drone.py", line 103, in test_run
    mocked_asyncio_sleep.assert_called_once_with(
  File "/usr/lib/python3.8/unittest/mock.py", line 924, in assert_called_once_with
    raise AssertionError(msg)
AssertionError: Expected 'sleep' to be called once. Called 4 times.
Calls: [call(0), call(0), call(0), call(10)].

My hunch is that the additional sleep calls are unrelated to what we actually want to test. They might be some internal calls to switch tasks during initialisation.

It might be necessary to reset_mock, or to sum up the call_args_list and check that instead.


I've spuriously hit the issue when running the test suite locally.


I've changed the test to only look at the entire (mocked) elapsed time. Alternatively, testing the last call could also work.
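A sketch of the "test the last call" alternative: unittest.mock's assert_called_with only checks the most recent call, so the intermediate sleep(0) calls would not matter (the names follow the test snippet quoted further down in this thread):

```python
# the final sleep should cover the full heartbeat interval
mocked_asyncio_sleep.assert_called_with(self.mock_site_agent.drone_heartbeat_interval)
```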


@maxfischer2781 (Member, Author)

@giffels do you think you can give this a first go soon? There'll definitely be changes, so no need to be thorough, but having your input on some of the open questions would simplify things.

@giffels (Member) left a comment

Thanks a lot for your nice work on this and sorry for the late review. I have left a few comments to work on before getting green lights.


key_translator = StaticMapping(
-    remote_resource_uuid="ClusterId",
+    remote_resource_uuid="JobId",
Member

I think the problem is in

resource_uuid = resource_attributes.remote_resource_uuid
resource_status = self._htcondor_queue[resource_uuid]
If you apply _job_id to remote_resource_uuid, it should be fine.
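In code, the suggestion amounts to something like the following sketch; here _job_id stands for whatever helper the PR uses to normalise a UUID to the ClusterId.ProcId form:

```python
# before: the raw remote_resource_uuid is used directly as the queue key
resource_uuid = resource_attributes.remote_resource_uuid
# suggested: normalise it first, so that old ClusterId-only UUIDs still match
resource_status = self._htcondor_queue[_job_id(resource_uuid)]
```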

Resolved review threads: setup.py, tardis/adapters/sites/htcondor.py (×2)
raise
# successes are in stdout, failures in stderr, both in argument order
# stdout: Job 15540.0 marked for removal
# stderr: Job 15612.0 not found
Member

Does a "not found" need to be treated here as well? At least I would like to have a logger.debug output here.

Member Author

Sure, I can add a debug log. Would you be okay if it's next to the raise TardisResourceStatusUpdateFailed in the adapter methods? It seems we don't get any sensible debug output from condor anyway.

Member

That would be okay!
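For illustration, the agreed-upon debug log next to the raise might end up looking roughly like the sketch below; the helper function, logger name, and message are made up for the example, only the pairing of logger.debug with the raise reflects the discussion:

```python
import logging

logger = logging.getLogger(__name__)  # placeholder; the adapter defines its own logger


class TardisResourceStatusUpdateFailed(Exception):
    """Stand-in for tardis' exception of the same name."""


def raise_on_failure(resource_uuid: str, failures: dict) -> None:
    """Log at debug level, then raise, if the bulk condor command reported a failure."""
    if resource_uuid in failures:
        logger.debug(
            "condor command failed for %s: %s", resource_uuid, failures[resource_uuid]
        )
        raise TardisResourceStatusUpdateFailed
```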

Comment on lines 101 to 105
if not isinstance(self._concurrency, int) or self._concurrency <= 0:
raise ValueError(
"'concurrent' must be one of True, False, None or an integer above 0"
f", got {self._concurrency!r} instead"
)
Member

You could also use pydantic here as well; however, I do not insist on it.
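For reference, a pydantic (v1-style) variant of that check could look roughly like the sketch below; the model and field names are invented for the example and are not part of the PR:

```python
from typing import Optional, Union

from pydantic import BaseModel, StrictBool, validator


class BulkCallSettings(BaseModel):
    # mirrors the hand-written check: True, False, None, or a positive integer
    concurrent: Optional[Union[StrictBool, int]] = None

    @validator("concurrent")
    def _flag_or_positive_int(cls, value):
        if value is None or isinstance(value, bool):
            return value
        if value <= 0:
            raise ValueError(
                "'concurrent' must be one of True, False, None or an integer above 0"
            )
        return value


print(BulkCallSettings(concurrent=4).concurrent)     # 4
print(BulkCallSettings(concurrent=True).concurrent)  # True
```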

async def _bulk_dispatch(self):
"""Collect tasks into bulks and dispatch them for command execution"""
while True:
await asyncio.sleep(0)
Member

Is there any reason for this? Should it be > 0?

Member Author

It's not strictly required, but it prevents the while True loop from starving the event loop.

From the asyncio.sleep documentation: "Setting the delay to 0 provides an optimized path to allow other tasks to run. This can be used by long-running functions to avoid blocking the event loop for the full duration of the function call."

Both await self._get_bulk and await self._concurrent.acquire() have fast paths that do not yield to the event loop if they can succeed immediately. So a "worst case" would be that, say, a few hundred tasks make a BulkCall and get queued, then _bulk_dispatch packs them all into bulk tasks at once; only when the queue is empty would _bulk_dispatch suspend and allow the bulk tasks to actually start running.
With the sleep(0) the loop is guaranteed to let other tasks run no matter what.
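The effect of the bare sleep(0) can be seen in isolation with a toy producer/consumer, unrelated to the adapter code; without the forced suspension the consumer only runs once the producer is completely done:

```python
import asyncio


async def producer(queue: asyncio.Queue, yield_each_time: bool) -> None:
    """Queue five items; put() on an unbounded queue never suspends by itself."""
    for i in range(5):
        await queue.put(i)
        if yield_each_time:
            await asyncio.sleep(0)  # force a suspension so other tasks get a turn
    print("producer done")


async def consumer(queue: asyncio.Queue) -> None:
    while True:
        print("consumed", await queue.get())


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.ensure_future(consumer(queue))
    # yield_each_time=False: "producer done" is printed before any "consumed ..."
    # yield_each_time=True: production and consumption interleave
    await producer(queue, yield_each_time=True)
    await asyncio.sleep(0.1)  # give the consumer time to drain the rest
    task.cancel()


asyncio.run(main())
```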

Member Author

I've refactored this and _get_bulk for better task switching.

  • Moved and commented the await asyncio.sleep(0). It will only trigger when the fast paths could be hit.
  • The first item is fetched before starting the timeout. This allows efficiently waiting for a long time instead of spin-waiting in delay intervals.

Comment on lines +103 to +106
# duration skipped via asyncio.sleep
# use the `sum` to keep `asyncio.sleep(0)` context switches from skewing the result
mock_elapsed = sum(args[0] for args, _ in mocked_asyncio_sleep.call_args_list)
self.assertEqual(mock_elapsed, self.mock_site_agent.drone_heartbeat_interval)
Member

Hmm, I have never experienced this issue before. Do you remember what the error message was in your local tests?

@giffels added the labels "enhancement" (New feature or request) and "Improvement" (Code Improvements) on Jan 31, 2022
@giffels added this to the 0.7.0 - Release milestone on Jan 31, 2022


lgtm-com bot commented Feb 3, 2022

This pull request introduces 2 alerts and fixes 1 when merging f8b9c70 into eb1e91c - view on LGTM.com

new alerts:

  • 2 for Clear-text logging of sensitive information

fixed alerts:

  • 1 for Wrong number of arguments in a call

@maxfischer2781 (Member, Author)

Alright, I guess this is ready for another round.

@eileen-kuehn (Member) left a comment

Looks great, thanks for your work 👍

@giffels (Member) left a comment

Brilliant work, thanks a lot @maxfischer2781. I will deploy it on the Compute4PUNCH instance for testing and merge afterwards.

@giffels merged commit 1c30c48 into MatterMiners:master on Feb 9, 2022
giffels added a commit to giffels/tardis that referenced this pull request Apr 19, 2022
giffels added a commit to giffels/tardis that referenced this pull request Jan 20, 2023
giffels added a commit to giffels/tardis that referenced this pull request Feb 24, 2023