Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplicate outputs in luigi.task.flatten_output (#3106) #3174

Merged
merged 1 commit into from
Feb 9, 2023

Conversation

starhel
Copy link
Contributor

@starhel starhel commented Jun 10, 2022

Description

I've changed luigi.task.flatten_output implementation to deduplicate outputs from tasks which are required by more than one wrapper task. Implementation is slightly changed (from DFS to BFS), but there's no guaranteed order in docs.

Motivation and Context

Fixes #3106

Have you tested this? If so, how?

I've added unit tests for standard use cases and diamond problem.

@starhel starhel requested review from dlstadther and a team as code owners June 10, 2022 15:29
@starhel starhel force-pushed the 3106_deduplicate_outputs branch from d30d97a to c0f2ee9 Compare January 30, 2023 18:22
@starhel starhel requested a review from dlstadther January 30, 2023 19:22
@lallea
Copy link
Contributor

lallea commented Jan 30, 2023

We often use fine-grained tasks, resulting in large DAGs. The time it takes for Luigi to process the graph is sometimes a bottleneck. I'd appreciate if we did not choose a slower implementation, even if I appreciate a deterministic output order.

Could we either use an ordered set (https://pypi.org/project/ordered-set/) or an OrderedDict to emulate a set? Or perhaps use a set and sort in the output routine?

Thanks for doing the deduplication, BTW. :-)

@starhel
Copy link
Contributor Author

starhel commented Jan 31, 2023

We often use fine-grained tasks, resulting in large DAGs.

Just remember that flatten_output just gets outputs of subgraph - a list of outputs from expanding wrapper tasks, now the whole DAG.

Could we either use an ordered set (https://pypi.org/project/ordered-set/) or an OrderedDict to emulate a set? Or perhaps use a set and sort in the output routine?

I don't want to add a new dependency for a small feature. I moved the implementation to use an OrderedDict, it's equally fast as a set.

@starhel starhel force-pushed the 3106_deduplicate_outputs branch from c0f2ee9 to d9a950e Compare January 31, 2023 10:07
@lallea
Copy link
Contributor

lallea commented Jan 31, 2023

Ah, ok, thanks for the clarification. And for the OrderedDict. It seems like a good tradeoff point.

Copy link
Collaborator

@dlstadther dlstadther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for going with the OrderedDict solution! LGTM

@dlstadther dlstadther merged commit d1d8e92 into spotify:master Feb 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Duplicated outputs of luigi.task.flatten_output
3 participants