Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apache Beam Executor #225

Merged
merged 19 commits into from
Nov 16, 2021
Merged

Apache Beam Executor #225

merged 19 commits into from
Nov 16, 2021

Conversation

alxmrs
Copy link
Contributor

@alxmrs alxmrs commented Nov 7, 2021

The pangeo-forge pipeline model is adapted to run on Apache Beam. Here, we use dummy stage numbers ("steps") as the input and output PCollections needed to work with Beam's programming model. Where substantive inputs need to be mapped to stages with arguments, "mappables" are combined with these steps. The result is that each stage executes serially, and mappable stages execute in parallel.

Fixes #169.

@alxmrs alxmrs marked this pull request as draft November 7, 2021 05:39
@alxmrs
Copy link
Contributor Author

alxmrs commented Nov 7, 2021

I was going to take some time to update the docs to reflect Apache Beam support in this CL. However, it seems like the docs need a larger overhaul anyway. So, I'll mark this ready for review.

Edit: When I wrote the above, I didn't notice the BaseRecipe class. The docs have been updated.

@alxmrs alxmrs marked this pull request as ready for review November 7, 2021 05:52
@rabernat
Copy link
Contributor

rabernat commented Nov 7, 2021

This is fantastic Alex! I just activated the workflows, but I predict they will fail because beam is not in the dev environments. You'll need to add beam here and here.

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think conda forge wants a dash in the name apache-beam: https://github.com/conda-forge/apache-beam-feedstock

ci/py3.8.yml Outdated Show resolved Hide resolved
ci/py3.9.yml Outdated Show resolved Hide resolved
@rabernat
Copy link
Contributor

rabernat commented Nov 8, 2021

I merged my suggested change to the py3.8.yml environment. This turned up an error in the beam executor: https://github.com/pangeo-forge/pangeo-forge-recipes/runs/4138427592?check_suite_focus=true#step:8:2443

@alxmrs
Copy link
Contributor Author

alxmrs commented Nov 8, 2021

Thanks, I should have caught that locally. Apparently, a variable name I chose to meet the linter created a type check conflict with Beam. func may have been used as an internal argument, and had a conflict? Anyways, I changed the parameter name to fun, and it meets Beam's type check and the precommit checks, locally.

@rabernat
Copy link
Contributor

rabernat commented Nov 8, 2021

I think this has now turned up an upstream issue with Xarray. (0.20.0 was just released) I'll make a PR for that fix.

@rabernat
Copy link
Contributor

rabernat commented Nov 8, 2021

For some reason, the 3.9 conda environment was taking over an hour to build. I'm restarting the build.

@rabernat
Copy link
Contributor

rabernat commented Nov 8, 2021

For some reason, the python 3.9 environment seems to be unsolvable. https://github.com/pangeo-forge/pangeo-forge-recipes/runs/4143328072?check_suite_focus=true#step:5:302

There must be a version conflict somewhere, but I don't know how to debug it. Any ideas?

@TomAugspurger
Copy link
Contributor

mamba tends to give better error messages when there are conflicts.

That said, mamba solves fine for me locally (linux). Maybe try using mamba if you aren't already? https://github.com/conda-incubator/setup-miniconda#example-6-mamba

@rabernat
Copy link
Contributor

rabernat commented Nov 8, 2021

Maybe try using mamba if you aren't already?

A PR to switch the CI to Mamba would be most welcome!

@alxmrs
Copy link
Contributor Author

alxmrs commented Nov 8, 2021

For some reason, the python 3.9 environment seems to be unsolvable.

It just occurred to me: Apache Beam only officially supports Python 3.8 -- it doesn't yet have support for 3.9, though it's on the roadmap.

@rabernat
Copy link
Contributor

rabernat commented Nov 11, 2021

If you merge master (now that #226 is in), the tests should pass here.

Just remove beam from the 3.9 env if it is not supported.

tests/test_pipelines.py Outdated Show resolved Hide resolved
@alxmrs alxmrs requested a review from rabernat November 11, 2021 21:26
Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great Alex! Just one place I'd like to tighten up.

pangeo_forge_recipes/executors/beam.py Show resolved Hide resolved
pangeo_forge_recipes/executors/beam.py Show resolved Hide resolved
tests/test_pipelines.py Outdated Show resolved Hide resolved
@alxmrs alxmrs requested a review from rabernat November 11, 2021 22:50
Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, great! We are almost there. Just a few tiny comments then I promise we can stop iterating and merge this.

tests/test_pipelines.py Outdated Show resolved Hide resolved
pangeo_forge_recipes/executors/beam.py Show resolved Hide resolved
@alxmrs alxmrs requested a review from rabernat November 16, 2021 00:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Beam as an executor for recipe pipelines
4 participants