Improve control over dynamic requirements #3179
Conversation
The failing test reports ModuleNotFoundError: No module named 'openapi_spec_validator' and, connected to that, _pytest.nodes.Collector.CollectError: ImportError while importing test module '/home/runner/work/luigi/luigi/test/contrib/ecs_test.py'. Is it possible there's something wrong with
Seems like a temporary glitch, all tests pass now 🎉
This could likely improve many of our jobs. Thanks!

I have considered implementing something similar, but in the storage interface layer, e.g. a GCSClient that batches lookups, potentially backed by a cache shared by workers. We already have a GCSClient with an in-memory cache, which improves things, but doesn't go all the way. Our cache implementation is very simple and makes some assumptions, so we haven't shared it.

What would the pros and cons be of DynamicRequirements vs similar functionality in the storage layer? DynamicRequirements can be used for multiple types of storage. A solution in the storage layer would be opaque to the worker core, and might arguably better separate concerns by keeping storage interface complexity within the storage layer. I am not sure how valuable a shared or persistent cache would be.

I do not doubt that this is a good solution. Just taking the opportunity to reason. Open source architecture is often accidental. :-)
Thanks for the feedback @lallea!

Interesting, we also work with in-memory / local caching on SSDs (where we can) plus batched lookups. However, I think having these DynamicRequirements on top can be beneficial to reduce the load even further.

Some more background on our case (physics research): we sometimes have dynamic tasks with O(k) yielded tasks, and on some of our file systems, saving O(k) cache lookups can speed things up a lot. Depending on where we process things (we can't always choose that), we cannot control the worker infrastructure, so we often end up on machines that only have shared, slow NFS mounts where caching via local disks isn't an option. Low-hanging fruit such as defining DynamicRequirements could be really helpful there.
Yes, I agree it looks useful even in the presence of other caching. Could it make the bulk complete code in the Range classes
@dlstadther Kind ping :)
Thanks for the ping @riga!
Changes LGTM; only left 2 small documentation questions and 1 optional code comment.
LGTM! Thank you for your hard work and contribution!
This PR is meant as a continuation of #3178 and further improves the control over dynamic task requirements handled by the worker.
Description
I added a shallow class DynamicRequirements which is intended to wrap a batch of tasks being yielded as dynamic requirements and, optionally, to define a custom, probably optimized completeness check for that batch. The new class is understood by the TaskProcess as a valid yield value of run() methods, which required changes to only a few lines.
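A minimal sketch of what this enables (DataTask and Collector are hypothetical example tasks, and the exact DynamicRequirements signature may differ slightly from what is finally merged):

```python
import luigi


class DataTask(luigi.Task):
    branch = luigi.IntParameter()

    def output(self):
        return luigi.LocalTarget("data/part_{}.txt".format(self.branch))

    def run(self):
        with self.output().open("w") as f:
            f.write("payload {}\n".format(self.branch))


class Collector(luigi.Task):

    def output(self):
        return luigi.LocalTarget("data/collected.txt")

    def run(self):
        # the batch of requirements is only known at run time
        tasks = [DataTask(branch=b) for b in range(1000)]

        # wrap the batch instead of yielding a plain list; without a custom
        # completeness check the worker behaves exactly as before
        yield luigi.DynamicRequirements(tasks)

        # at this point all wrapped tasks are complete
        with self.output().open("w") as f:
            f.write("done\n")
```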
Motivation and Context
As already mentioned in #3178, we are sometimes dealing with thousands of tasks yielded as dynamic requirements in some wrapper task. These tasks store their outputs in a remote location, and we can assume that they are located in the same directory (presumably not unusual). For that reason we don't want to "stat" all output files separately, but rather do a single "listdir" on the common base directory, followed by a local comparison of basenames, saving us thousands of interactions with remote resources. Although we make use of the caching implemented in the referenced PR, we would like to further reduce remote API calls, but the workers (or rather the TaskProcess) internally flatten all requirements and perform separate completeness checks:
https://github.com/spotify/luigi/blob/master/luigi/worker.py#L154
Which requirements exist and which don't is not of interest at this point, so batched completeness checks such as the one suggested above would be fully compatible with the current logic :)
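To sketch what such a batched check could look like (DataTask is the hypothetical task from the sketch above, all outputs are assumed to live in the same directory, and the exact custom-complete hook may differ from what this PR finally implements):

```python
import os

import luigi


class Collector(luigi.Task):

    def output(self):
        return luigi.LocalTarget("data/collected.txt")

    def run(self):
        # hypothetical upstream tasks whose outputs all live in the same directory
        tasks = [DataTask(branch=b) for b in range(1000)]

        def custom_complete(complete_fn):
            # complete_fn is the per-task check the worker would normally use;
            # instead of calling it once per output ("stat"), list the common
            # base directory once and compare basenames locally
            first_target = tasks[0].output()
            base_dir = os.path.dirname(first_target.path)
            if not first_target.fs.exists(base_dir):
                return False
            basenames = {
                os.path.basename(path)
                for path in first_target.fs.listdir(base_dir)
            }
            return all(
                os.path.basename(task.output().path) in basenames
                for task in tasks
            )

        # the worker invokes custom_complete once for the whole batch
        yield luigi.DynamicRequirements(tasks, custom_complete)

        with self.output().open("w") as f:
            f.write("done\n")
```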
Have you tested this? If so, how?
Yep, I added a new test case, updated the docs, and amended the dynamic_requirements.py example.