Split out providers into "standalone" python packages #33909

Open
1 task done
potiuk opened this issue Aug 30, 2023 · 6 comments
Labels
kind:meta High-level information important to the community

Comments

@potiuk
Member

potiuk commented Aug 30, 2023

Body

As part of making it possible to move providers out of the Airflow core repository (we have not decided on that yet, we just want to make it possible), we should turn Airflow providers into "real" packages.

Currently those packages are built "dynamically": https://github.com/apache/airflow/blob/main/dev/provider_packages/prepare_provider_packages.py is used as part of breeze release-management prepare-provider-packages to extract parts of the "airflow/providers/" sources, generate setup.* and pyproject.toml files, and build the providers from those dynamically generated temporary folders.

This has some disadvantages. For example, it does not make reproducible builds possible, and it requires a complex breeze command, a CI image, and the Python script in the image to build the packages. (The CI image is used to make sure all dependencies are installed and to provide isolation and cleanup between builds. It also isolates the host, for security, from the container building the packages when providers are built from contributor forks.)

But this got us through the last 3 years of releasing airflow and providers separately :).

With recent changes (#32604, the upcoming #32048, the upcoming #33907, and a number of other changes already implemented in the past), we are quite close to making it possible to split out providers into "standalone" packages - where each provider is a separate, standards-compliant package with a completely independent directory in which you can build the package using standard Python tooling, without moving the sources around. This would require everything that relates to the providers to move to those directories (notably docs and tests, which are currently shared between airflow and providers in "docs" and "tests" and would have to be moved).

We should also eventually add automation of checking if there is anything left in core that refers to providers #11435.
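Such a check could start as a simple static scan. The following is only a minimal sketch, not what #11435 proposes: it walks the syntax tree of each core file and reports absolute imports of anything under airflow.providers. The function names and the rule for skipping the providers tree are assumptions made for illustration.

```python
import ast
from pathlib import Path

PREFIX = "airflow.providers"


def find_provider_imports(source):
    """Return sorted (line number, dotted name) pairs for provider imports."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # "from a.b import c" is recorded as "a.b.c"
            names = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue  # not an import, or a relative import
        for name in names:
            if name == PREFIX or name.startswith(PREFIX + "."):
                hits.append((node.lineno, name))
    return sorted(hits)


def scan_core(core_root):
    """Scan .py files under core_root, skipping any 'providers' directory."""
    core_root = Path(core_root)
    report = {}
    for path in sorted(core_root.rglob("*.py")):
        relative = path.relative_to(core_root)
        if "providers" in relative.parts:
            continue
        hits = find_provider_imports(path.read_text(encoding="utf-8"))
        if hits:
            report[str(relative)] = hits
    return report
```

A scan like this only catches static imports. References built from strings (for example, class paths resolved at runtime) would need a separate check.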

The early draft POC attempts to add scripts to automate such migration and result of it can be seen here:

However, the challenge to solve here is making it easy for contributors to contribute to airflow and providers together. We want to keep it possible - similarly to today - to have an easy environment where you can edit both airflow and provider code, run tests, run airflow, and run integration tests without the extra hassle of installing and reinstalling the packages in editable mode.

Ideal workflow of the developer is where they can:

  • run breeze and be able to edit any sources locally on the host - and airflow in breeze should automatically pick up the changes in both airflow and providers when the interpreter is started
  • install a local venv (possibly in editable mode) for airflow and selected (all?) providers, and the locally installed airflow should also pick up changes for both airflow and providers when the interpreter is started
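To make the second workflow concrete, here is a rough sketch of how a helper might assemble one pip command that installs airflow and a chosen set of providers in editable mode. The providers/<name> directory layout and the function name are hypothetical - the actual layout is exactly what is still open for discussion.

```python
import sys
from pathlib import Path


def editable_install_command(repo_root, providers):
    """Build a pip command installing core and selected providers with -e."""
    repo_root = Path(repo_root)
    command = [sys.executable, "-m", "pip", "install", "-e", str(repo_root)]
    for provider in providers:
        # Hypothetical layout: each provider lives in providers/<name>/
        # with its own pyproject.toml.
        command += ["-e", str(repo_root / "providers" / provider)]
    return command
```

The resulting list can be passed to subprocess.run(command, check=True) inside an activated venv. Because every package is installed in editable mode, source changes are picked up the next time the interpreter starts.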

How to do it, and the choice of (ideally) standard Python tooling to make such a move, is not yet determined and is open for discussion, POCs and proposals.

Note: Some of the modern tools from the PyPA world (Hatch, flit) have recently been evolving and adding features, which might make it possible to combine multiple packages from a monorepo into a single development installation. We would likely want to use one of the standard tools for that rather than develop our own, and we might consider contributing to some of those tools to make them more suitable for us. Possibly we could combine several tools: for example, use flit for providers and hatch for airflow to combine repos, as hatch seems better suited and has a roadmap for monorepo/multi-project setups, while flit is slick and very focused - or so it seems.

Committer

  • I acknowledge that I am a maintainer/committer of the Apache Airflow project.
@potiuk potiuk added the kind:meta High-level information important to the community label Aug 30, 2023
@potiuk
Member Author

potiuk commented Aug 30, 2023

cc: @eladkal @uranusjr @Taragolis -> this captures some of the recent discussion from Slack: https://apache-airflow.slack.com/archives/CCPRP7943/p1692612215901809

At some point, when (IF?) we have a good idea of how it can be done, we should bring it to the devlist.
I had positive feedback on the "split" proposal quite some time ago https://lists.apache.org/thread/3s5tn1wnvo0cw9vofwmbjl0rkyvhrtbx but I do not want to start the discussion about it before we have a firm idea of how we can achieve it, especially the point about the "easy development" environment.

@potiuk
Member Author

potiuk commented Aug 30, 2023

Another thought....

Maybe we do NOT have to solve the challenge of an all-editable build for airflow and providers. I think we have a viable alternative....

  • Breeze could remain an all-editable environment even after we move the providers to subfolders. It could read all the dependency information from pyproject.toml files instead of provider.yaml files (we can very easily change the way generated/provider_dependencies.json is generated) and bring the sources of providers (even mount them) to "airflow/providers". That would be an easy change for the current environment and CI.

  • When/if we get to contributing @kaxil's airflowctl to eventually become our "community" tool, we could use it to create an easy local venv, install "airflow" in it, and add an easy way to "inject" providers' editable sources (selected or all) and dependencies into such an environment. While airflowctl is mainly targeted at supporting DAG development, we could make it easy to "inject" editable sources of providers too. That would be even better than what we have now, because contributors could also test new providers with an old airflow version very easily (it's not easy currently).
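As an illustration of the first point, reading dependency information straight from each provider's pyproject.toml could look roughly like the sketch below. It relies only on the standard [project] table (name, dependencies). The function names, the <root>/<provider>/pyproject.toml layout and the {"deps": [...]} output shape are assumptions for this example, not the actual format of generated/provider_dependencies.json.

```python
from pathlib import Path

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # backport package, must be installed separately


def dependencies_from_pyproject(text):
    """Return (package name, sorted dependencies) from pyproject.toml text."""
    project = tomllib.loads(text).get("project", {})
    return project.get("name"), sorted(project.get("dependencies", []))


def collect_provider_dependencies(providers_root):
    """Collect dependencies for every <root>/<provider>/pyproject.toml."""
    result = {}
    for pyproject in sorted(Path(providers_root).glob("*/pyproject.toml")):
        name, deps = dependencies_from_pyproject(
            pyproject.read_text(encoding="utf-8")
        )
        # Fall back to the directory name if [project].name is missing.
        result[name or pyproject.parent.name] = {"deps": deps}
    return result
```

The dictionary returned by collect_provider_dependencies could then be serialized with json.dumps, so tooling that consumes a generated JSON file would not need to parse TOML itself.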

Breeze could remain the "CI" driver and the "all-editable" environment where the latest airflow sources + latest providers could be developed and tested (which is absolutely necessary BTW).

Also cc: @mobuchowski and @hussein-awala as I know they are interested in DevEX, and I think it would be great to get some discussion/brainstorming going about that and potential future changes to it in a bit more of an "interest group".

Of course others are invited too.

We could design the future of the Airflow dev-env together, and once we have the providers as "standalone" packages, we could then start a discussion: "Does it make sense / do we want to split providers out to a separate repo?". I have no ready answer for that (lots of thoughts though). But I think this discussion should happen after we have the providers as "standalone" packages in our main "monorepo".

@uranusjr
Member

Honestly I’m not particularly keen on moving the packages to separate repositories since that’d just make contributing more difficult instead of easier. But making providers separate packages from the core Airflow could be useful.

@potiuk
Member Author

potiuk commented Aug 30, 2023

Honestly I’m not particularly keen on moving the packages to separate repositories since that’d just make contributing more difficult instead of easier. But making providers separate packages from the core Airflow could be useful.

I tend to agree with that statement. But that discussion is yet to happen :). It's a bit of a theoretical possibility, and I only wanted to start it after we see how separate packages inside the airflow repo work for us.

@mobuchowski
Contributor

I like providers being standalone packages by default. In addition to the benefits described here, it would allow installing providers straight from GitHub via pip.
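For instance, pip can install a package from a subdirectory of a Git repository through a direct-URL requirement with a #subdirectory= fragment. A small sketch of building such a requirement string follows; the helper name and the providers/amazon path are hypothetical, since the directory layout is not decided.

```python
def github_requirement(package, repo, subdirectory, ref="main"):
    """Build a direct-URL requirement pointing at a repo subdirectory."""
    url = f"git+https://github.com/{repo}.git@{ref}"
    return f"{package} @ {url}#subdirectory={subdirectory}"


# Hypothetical subdirectory; prints a string usable as: pip install "<string>"
print(
    github_requirement(
        "apache-airflow-providers-amazon", "apache/airflow", "providers/amazon"
    )
)
```

This only works once the subdirectory is a self-contained package with its own build configuration, which is the point of the proposal.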

I agree with @uranusjr that being a monorepo instead of multiple separate repos is a net benefit, especially when also thinking about local development process, not only CI.

Ideal workflow of the developer is where they can:

  • run breeze and be able to edit any sources locally on the host - and airflow in breeze should automatically pick up the changes in both airflow and providers when the interpreter is started
  • install a local venv (possibly in editable mode) for airflow and selected (all?) providers, and the locally installed airflow should also pick up changes for both airflow and providers when the interpreter is started

Ideally, there would also be no functional differences between Airflow running this way, and production image of Airflow - besides selection of installed/running providers.

@potiuk
Member Author

potiuk commented Aug 30, 2023

Ideally, there would also be no functional differences between Airflow running this way, and production image of Airflow - besides selection of installed/running providers.

Agree. But this one is tricky, as we rely on entrypoints (and this is what ProvidersManager and INSTALL_PROVIDERS_FROM_SOURCES attempt to work around). But if we find another solution, I would gladly remove those hacks. That's why I was eyeing Hatch pypa/hatch#233 (hence my original comment about it).
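To show why entrypoints are the sticking point: entry points live in the metadata of installed distributions, so provider sources that are only mounted or placed on the path (not installed) are invisible to this kind of discovery. Below is a minimal sketch using the standard importlib.metadata API. The group name apache_airflow_provider is my understanding of what provider packages register under, so treat it as an assumption. This is not how ProvidersManager is actually implemented.

```python
from importlib.metadata import entry_points


def discover_provider_entry_points(group="apache_airflow_provider"):
    """Return sorted (name, value) pairs of entry points in the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable collection.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


# Lists only providers that are installed as distributions (regular or
# editable installs); providers present purely as source files do not appear.
for name, value in discover_provider_entry_points():
    print(name, "->", value)
```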


3 participants