Bootstrapping problem (how to bootstrap all-source environments) #342

Open
jaraco opened this issue Apr 19, 2020 · 55 comments

Comments

@jaraco
Member

jaraco commented Apr 19, 2020

In this comment, @pfmoore suggests that we may need to re-think --no-binary, specifically around wheel but more generally around any or all build dependencies. In pypa/pip#7831, we've essentially concluded that --no-binary :all: is unsupported because of bootstrapping issues. As reported in pypa/wheel#344 and pypa/wheel#332, wheel is essentially blocked from adopting any form of PEP 518, or any other tool that requires pyproject.toml (because presence of a pyproject.toml implies PEP 518) unless wheel can elect to break --no-binary wheel installs.

Furthermore, the backend-path solution is likely not to be viable. On its face, backend-path adds a path to the PYTHONPATH when building, but unless a source checkout of wheel has had setup.py egg-info run on it, it won't have the requisite metadata to advertise its distutils functionality.

@pfmoore has suggested that it's not the responsibility of the PyPA or its maintainers to solve the issue, but that the downstream packagers should propose a solution. To that end, should wheel restore support for pyproject.toml and break --no-binary workflows, thus breaking the world for downstream packagers and incentivizing them to come up with a solution (that seems inadvisable)? Should wheel (and other use-cases) remain encumbered with this (undocumented) constraint in order to force legacy behavior when building wheel, thus alleviating the incentive to fix the issue (also sub-optimal)?

Perhaps solving the "bootstrapping from source" problem would be a good one for the PSF to sponsor.

@jaraco jaraco changed the title from "Need to re-think no-binary support for build dependencies" to "Re-think no-binary support for build dependencies (bootstrapping problem)" Apr 19, 2020
@pradyunsg
Member

Perhaps solving the "bootstrapping from source" problem would be a good one for the PSF to sponsor.

@brainwane @ewdurbin @di @dstufft ^ for the Packaging-WG "Fundable Packaging Improvements" page.

@brainwane
Contributor

I'm interested in adding this to the list of fundable packaging improvements, but I need help with wording. Try answering each of these questions in about 20-200 words:

  • What is the current situation/context? Example: "Scientists need to install some Python packages via pip and others with conda."

  • What ought to be fixed, made, or implemented?

  • What problems would this solve, and what new capabilities would it cause?

@brainwane
Contributor

Anyone who wants to add this to the list of fundable packaging improvements can now submit a pull request on https://github.com/psf/fundable-packaging-improvements .

@jaraco
Member Author

jaraco commented Jun 25, 2022

This issue is more general than just wheel and affects a number of packages beyond setuptools and wheel.

In pypa/setuptools#980, I learned of the issue in build tools. If build tools have dependencies, and those dependencies use the build tool, it's not possible to build the dependencies from source. For example,

  • Setuptools depends on appdirs.
  • Appdirs is built by setuptools.

An attempt to build/install Setuptools or Appdirs from source rather than wheels (as many platforms prefer to do) will fail.

Setuptools has worked around this issue by vendoring all of its dependencies, although it has a stated goal to stop vendoring (pypa/setuptools#2825).

However, in python/importlib_metadata#392 today, I've learned of even more places where these circular dependencies can manifest. In this case:

  • setuptools_scm depends on importlib_metadata.
  • importlib_metadata relies on setuptools_scm to build.

So unless setuptools_scm is pulled pre-built, when it attempts to build from source, it also pulls importlib_metadata which requires setuptools_scm to build.
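
To make the cycle concrete, here is a minimal sketch (not part of pip or any PyPA tool) that walks a dependency mapping and reports a path leading back to the starting project. The REQUIRES data is hand-written from the two examples above, and the find_cycle name is hypothetical; real data would have to come from each project's metadata.

```python
from typing import Dict, List, Optional

# Hypothetical, hand-written data based on the examples in this comment.
# Each key depends (at build time or run time) on the listed projects.
REQUIRES: Dict[str, List[str]] = {
    "setuptools": ["appdirs"],
    "appdirs": ["setuptools"],
    "setuptools_scm": ["importlib_metadata"],
    "importlib_metadata": ["setuptools_scm"],
}


def find_cycle(graph: Dict[str, List[str]], start: str) -> Optional[List[str]]:
    """Return a dependency path that starts and ends at ``start``, or None."""
    stack = [(start, [start])]
    seen = set()
    while stack:
        node, path = stack.pop()
        for dep in graph.get(node, []):
            if dep == start:
                # Found a way back to the project we started from.
                return path + [dep]
            if dep not in seen:
                seen.add(dep)
                stack.append((dep, path + [dep]))
    return None
```

Calling find_cycle(REQUIRES, "setuptools_scm") on this data returns the path through importlib_metadata and back, which is the situation a source-only build cannot resolve.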

I've probably stated before, I wish for libraries like setuptools_scm to be able to adopt dependencies at a whim (and not have users encountering errors).

These new cases lead me to believe the problem is not solvable by downstream packagers except by forcing all build tools (and their plugins) to vendor their dependencies.

I'd like to avoid each of these projects needing to come up with a bespoke hack to work around this issue and eliminate this footgun that will reduce adoption due to these bootstrapping issues.

@jaraco
Member Author

jaraco commented Jun 25, 2022

What is the current situation/context? Example: "Scientists need to install some Python packages via pip and others with conda."

The situation arises whenever any of a project's PEP 518 build-system.requires has a dependency whose own build-system.requires leads back to the original project. In the general case, this is not a problem, as the pip resolver can resolve dependencies and pull in their pre-built versions, but when users wish to build from source, the circular dependency prevents building.

What ought to be fixed, made, or implemented?

First, it's not obvious what needs to be implemented. This project still needs work to explore the problem space and propose options to be considered by the PyPA. It's going to require coordination across many projects and may require a PEP.

A rough sketch of some options to consider:

  • Implement a bootstrap handler that will identify and intercept these circular dependencies and provide an alternate way to make the functionality available without building (such as by adding the project's sources root and src/ directory to sys.path).
  • Require the builders to keep blessed, vendored copies of dependencies of build tools.
  • Explicitly disallow building these projects from source and require downstream packagers that wish to bootstrap a system from source to own the problem and devise their own workaround.
  • Require build tools not to have circular dependencies. Publish this formal requirement.
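
As a rough illustration of the first option, the sketch below computes the directories a bootstrap handler might add to sys.path for an unpacked source tree. The names import_roots and stage_for_import are hypothetical, and the sketch assumes only flat and src layouts; it does not describe how pip or any existing frontend behaves.

```python
import sys
from pathlib import Path
from typing import List


def import_roots(source_tree: Path) -> List[Path]:
    """Return the directories to try importing from for an unpacked source tree.

    Assumption: the project uses either a flat layout (packages at the root)
    or a src layout (packages under ``src/``). Other layouts are not handled.
    """
    roots = [source_tree]
    src = source_tree / "src"
    if src.is_dir():
        roots.insert(0, src)  # prefer src/ when it exists
    return roots


def stage_for_import(source_tree: Path) -> None:
    """Prepend the candidate roots to sys.path (hypothetical bootstrap step)."""
    # Insert in reverse so the first candidate ends up first on sys.path.
    for root in reversed(import_roots(source_tree)):
        entry = str(root)
        if entry not in sys.path:
            sys.path.insert(0, entry)
```

This only makes pure-Python sources importable. It provides no installed metadata (entry points, version information), so anything relying on that would still need another mechanism.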

What problems would this solve, and what new capabilities would it cause?

It would solve the class of problems where errors occur when users attempt to build from source but cycles exist among the build dependencies. It would mean that build tools like setuptools and setuptools_scm could naturally depend on other libraries. It would mean that other projects could rely on these build tools without getting reports of failed builds.

@jaraco
Member Author

jaraco commented Jun 25, 2022

Anyone who wants to add this to the list of fundable packaging improvements can now submit a pull request on https://github.com/psf/fundable-packaging-improvements .

Based on the criteria in the readme there, I don't believe this project is yet ready to be a fundable project. There is not yet consensus or an accepted PEP. To the contrary, the initial feedback has been negative, meaning there's probably a need to develop that consensus first. Maybe the additional use case described above will help nudge us in that direction.

In the meantime, I'm slightly inclined not to work around the issues for users, thus masking the root cause.

@pfmoore
Member

pfmoore commented Jun 25, 2022

I think there's a good fundable project in here, but it's a research project. I think the problem is that we are getting feedback that people want to be able to "build everything from source", but that causes an issue because bootstrapping involves circular dependencies. This sounds pretty much identical to the issues of building a C compiler, so we could probably learn a lot from how people solved that problem. Also, presumably the people wanting to build Python packages from scratch stop at some point - do they need to build the Python interpreter itself from sources, for example?

I'd suggest proposing a research project to find out why people want to build with --no-binary :all:, and in particular what are the precise constraints they are working to. This could involve user surveys, requirements collection, etc. That's one of the reasons I think this would make a good fundable project - the skills needed to do a good job here are not common in the Python packaging community, so hiring specialists would be very cost effective.

The deliverables from such a project could be a requirement spec for "build everything from source" activities and in particular a listing of the constraints that need to be satisfied for any solution.

Once we have that documentation, follow-up projects to actually design and/or implement a solution will be much easier to specify. Or it may be that the negative reception that's happened so far can be crystallised into specific issues that we (the Python packaging community) aren't willing to support, which will give users much more actionable feedback at least.

@jaraco
Member Author

jaraco commented Jul 3, 2024

In pypa/setuptools#4457 (comment), I describe what appears to be a distilled version of the implicit constraint:

  • Build backends can only have (build or runtime) dependencies on projects that don't rely on that build backend.

I wonder if instead the constraint should be:

  • Environments that rely exclusively on building from source (e.g. --no-binary :all:) must use a frontend that implements caching of built artifacts in order to break cycles in build dependencies.

@jaraco
Member Author

jaraco commented Jul 3, 2024

From that conversation:

I think I've got that right. A build backend can have dependencies if all of the dependencies rely on a different build backend (hatchling, for now). Or a build backend can have no dependencies (flit_core). I wonder if we should document this constraint somewhere. Is there a place where we could document this constraint?

I don't think it's an explicit constraint, rather it's an implied consequence of the defined behaviour when combined with a requirement to build everything from source. In particular, it's not a constraint just on build backends, it affects projects that are dependencies of build backends as well - if setuptools depends on wheel, and wheel is built using setuptools, then wheel has a problem if you want to build it under a constraint of never using pre-built binaries. Conversely, if you are willing to use a multi-stage process where you build a setuptools wheel "by hand" and then use it to build everything else, there's no issue.

I'm not the best person to comment here, though, as I don't understand the whole "everything must be built from source" mindset - you have to start somewhere, and why can't that be "pre-build a setuptools wheel using some manual recipe, then use that to bootstrap everything else"? But from what I recall, there were some purists for whom that wasn't sufficient - and I never got a clear understanding of why.

I've had some conversations with these integrators (at Spack and Fedora and Debian and others). Their primary motivation is based largely on a philosophy of "build everything from source" because that's the source of truth. Even at large enterprises like Facebook or Google, everything is built from source. They want to build from source to maintain independence (from other systems and abstractions) and provenance (no question about what was built from source and not). They (and often their users) want control and assurance that what's running is built from source and could be built again (repeated) without external resources. Think of it this way - if you woke up in a world with nothing but source tarballs, could you rebuild the system (and how)?

I think some insight can be found in your framing of the issue ("if you are willing to use a multi-stage process" and "pre-build a setuptools wheel"). I can see why integrators would shy away from (a) more complex processes when simple ones can do and (b) having special casing for select packages. They'd like to be able to apply a technique uniformly to any and all packages. If the process is multi-stage and has to manage interactions between packages, it's much more complex than one that can act on one package at a time. Such a multi-stage system could become impractical if enough of these interdependencies exist. It's also a risky proposition to be caught between two projects managed by independent parties.

Moreover, I don't even think having a pre-built setuptools wheel works to bootstrap the process if setuptools has dependencies. You need to have a special process to bootstrap the setuptools build, then you need another special process to bootstrap all of setuptools' dependencies (because setuptools still doesn't have its dependencies), then you likely need to rebuild setuptools without the special process. And of course, this problem isn't unique to setuptools. If any of hatchling's dependencies decides to adopt hatchling or if hatchling wanted to adopt a dependency built in hatchling, it too would be broken.

I wonder if instead the constraint should be:

  • Environments that rely exclusively on building from source (e.g. --no-binary :all:) must use a frontend that implements caching of built artifacts in order to break cycles in build dependencies.

I just realized even this constraint isn't enough unless it also includes a weaker version of the other constraint:

  • Build backends must be capable of being built without any transitive dependencies to the backend.

In other words, setuptools cannot declare a build dependency on any setuptools-built project without also vendoring it.

@jaraco
Member Author

jaraco commented Jul 3, 2024

Quoting myself from above:

Think of it this way - if you woke up in a world with nothing but source tarballs, could you rebuild the system (and how)?

I'm finding this line of thought very useful in reasoning about the problem.

If someone gave me a directory of sdists for setuptools and all of its dependencies and nothing vendored, could I use a Python interpreter to build a functional setuptools?

In general, the answer is no. If any of those dependencies have C or Rust extensions and need setuptools to build them, there's no way to build those.

Even without extension modules, there are problems. I imagine it could be possible, since Python source files are importable, to analyze the metadata of the sdists (pyproject.toml:build-system) and determine which sdists are needed (all of them) and then somehow stage them to be importable. However, there's no way in general to do that. Each project might use a src layout or a flat layout or maybe something far less scrutable. What's needed is a working, built, installed build system (setuptools with all its dependencies) to transform those sources into something importable.

Can you think of a way to address the situation?

In one approach, I imagine an integrator keeping a bootstrap build of each and every backend (and its dependencies) and falling back to that whenever a build dependency cycle is encountered, but given that the Python ecosystem allows for an unlimited number of backends, that could grow unwieldy... and would we want to put the onus on each and every integrator to implement their own such system? I argue no.

@pfmoore
Member

pfmoore commented Jul 3, 2024

If someone gave me a directory of sdists for setuptools and all of its dependencies and nothing vendored, could I use a Python interpreter to build a functional setuptools?

This is indeed a good way of thinking about it. However:

  1. The way you build a functional setuptools does not need to be the "standard" way of building a wheel (such as pip wheel or build). It can be, and I'm sure it would be easier for people in that situation if it were, but bootstrapping from nothing is an unusual situation, so non-standard solutions are (IMO) acceptable, if needed.
  2. This isn't really a general packaging question, and it doesn't need new rules or constraints to cover it. It's simply something that build backend developers (and indeed, any package developer) should consider as a question of what use cases they want to support.

One solution, of course, is for setuptools to use flit_core as its backend. We know flit_core can be built from scratch, and indeed that's a deliberate design choice, so it's not something that's likely to change in the future. And flit_core can build pure Python packages like setuptools. Obviously there's a certain irony in not using setuptools to build itself, but maybe practicality beats purity here...? 🙂

@pradyunsg
Member

FWIW, I do like the idea of having a single solution here (i.e. a single build-backend that can be bootstrapped from nothing) being reused across backends. If it's flit-core, that's fine by me as would be using something other than flit-core as a bootstrap-from-nothing backend and migrating flit to using that.

Subject of course, to other flit maintainers being onboard with the idea.

@pfmoore
Member

pfmoore commented Jul 4, 2024

To re-emphasise, though, this wouldn’t be a requirement for build backends to use that build tool, it would simply be a community-supported solution for the (non-trivial) problem of build backends needing a means to bootstrap themselves.

If setuptools prefers to self-bootstrap, that’s fine, but they need to solve any “build from all source” issues themselves in that case.

@jaraco
Member Author

jaraco commented Jul 4, 2024

I've been pondering using flit-core for setuptools. At least one integrator has already bristled at the idea, but it does seem worth exploring.

I don't think it solves the problem, however, simply for a build backend to adopt flit-core (or some other backend without dependencies), because there still exists a circular dependency with any dependencies of the backend that use the backend (e.g. you can't build importlib_resources without setuptools being installed and you can't install setuptools without importlib_resources; similarly, if a hatchling dependency were to adopt hatchling (or a setuptools with dependencies), it also would no longer be compatible). If the problem were as simple as special-casing the build backends, that wouldn't be too onerous, but since the requirement extends to any dependency, which may be independently managed, it creates an almost unmanageable situation. At the very least, before a backend accepts a dependency, it'll have to establish an expectation about what build backend that dependency uses.

My other reluctance about adopting flit-core is that its philosophy runs against a basic objective of projects I manage, which is to avoid the footgun of double accounting (requiring versions and file manifests and other metadata to be manually replicated in static files). It's a compromise I'd be willing to make for a build backend but reluctant to impose for any dependency of a build backend.

So we're stuck unless we can get the integrators to accept a methodology of somehow bootstrapping build backends by providing pre-built artifacts of all dependencies of all supported backends (similar to how pip is able to bootstrap by using pre-built wheels).

The more I think about it, I think that should be the guidance. When bootstrapping a system, an integrator should either rely on pre-built artifacts from PyPI or supply their own pre-built artifacts for each backend (including its dependencies). These pre-built artifacts can be a part of the build system and can be discarded after bootstrapping. Such an approach, although somewhat heavy, should work universally and allow any build backend to simply declare dependencies without compromising the build-all-from-source model.
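
A minimal sketch of that guidance, using graphlib from the standard library: edges that point at a pre-built bootstrap artifact are treated as already satisfied, and everything else is built from source in dependency order. The build_order function and the sample data are hypothetical; the setuptools/importlib_resources pair is the cycle mentioned earlier in this comment.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Iterable, List, Set

# Hypothetical data: each project maps to the projects it needs in order to build.
REQUIRES: Dict[str, Set[str]] = {
    "setuptools": {"importlib_resources"},
    "importlib_resources": {"setuptools"},
}


def build_order(requires: Dict[str, Set[str]], prebuilt: Iterable[str]) -> List[str]:
    """Order source builds so that every dependency is available first.

    Projects in ``prebuilt`` are assumed to be supplied as bootstrap
    artifacts, so edges pointing at them count as already satisfied.
    Raises graphlib.CycleError if a cycle remains.
    """
    available = set(prebuilt)
    graph = {
        project: {dep for dep in deps if dep not in available}
        for project, deps in requires.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

With no bootstrap artifacts, build_order(REQUIRES, []) raises CycleError. With a pre-built setuptools supplied, importlib_resources is built first and setuptools is then rebuilt from source, after which the bootstrap artifact can be discarded.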

It feels like a choice between that or all build backends must bundle all of their dependencies (and not declare them) to be safe.

Does that sound right? If it sounds right, I'll put together a design proposal and review it with the system integrators.

@abravalheri
Contributor

abravalheri commented Jul 4, 2024

I agree with Jason here, if the problem was only the "build time" dependencies when building setuptools itself, I think we have ways of dealing with that. Right now, setuptools can "bootstrap" itself in the sense that it can create its own wheel without dependencies. So we would not need flit-core.

The way I see it, the problem is not about the "build dependencies" (i.e. [build-system] requires) for the backend itself, but rather the "runtime dependencies" (i.e. [project] dependencies) for the backends. While flit-core could be beneficial for addressing the former, it doesn't help with the latter, which is where our real struggle lies.

(This is related to the dilemma discussed in pypa/setuptools#4457 (comment)).

Jason's proposal, on the other hand, seems to be a good solution.

@jaraco
Member Author

jaraco commented Jul 8, 2024

I've drafted this proposal. Please feel free to have a look - you should be able to make comments or suggest edits. If not, let me know and I'll move it to a Google Doc. I plan to circulate this proposal with key stakeholders for system integrators in a week or so.

@ncoghlan
Member

Looking at Doug Hellmann's https://github.com/python-wheel-build/fromager project, that seems consistent with @jaraco's proposal (the prepare-build step in fromager accepts a --wheel-server-url, so build dependencies can be downloaded in binary format from a trusted index, rather than having to be built from source).

@jaraco
Member Author

jaraco commented Jul 12, 2024

At this stage, I'd like to loop in the systems integrators, especially those involved in source-only builds. I know a few, such as @tgamblin, @mgorny, @doko42, @kloczek. If there are better contacts, or if you know of any other integrators who you think should be aware, please refer them here. If any of you are on discord, I'd also like to invite you to join the PyPA discord and in particular the #integrators channel, where we can discuss and coordinate this effort and efforts like it. But most importantly, please review the proposal and share your feedback on it (comment in the doc, comment here, or send a note in discord). I know it’s an ask for integrators to adopt a new methodology, but I believe it’s the only way to allow backends to have dependencies.

@jaraco
Member Author

jaraco commented Jul 18, 2024

In this comment, Doug Hellmann brought to my attention that PEP 517 actually forbids cyclic (transitive) dependencies (I'm ashamed I'd not previously linked this constraint to this issue). This requirement essentially boils down to "no build backend can have dependencies", a constraint that's been a problem ever since setuptools chose to adopt six. While there are certain conditions where a build backend can depend on other projects, it can only do so by constraining those projects (not to use the build backend). It also means that a backend with dependencies cannot be in a self-contained system (such as the coherent system; their build backend must always resolve outside of the system).

@jaraco
Member Author

jaraco commented Jul 31, 2024

In the rootbeer repo, I've drafted an idea based on @RonnyPfannschmidt's proposal. Instead of a tool that collects the sources into an importable directory, I've used Git and submodules to assemble the build-time dependencies for Setuptools and its dependencies (basically setuptools' run and build-time dependencies and their transitive closure).

To use it, simply git clone --recurse-submodules --shallow-submodules --depth 1 https://github.com/pypa/rootbeer and then build any of the backend resources with env PYTHONPATH=path/to/rootbeer pyproject-build --no-isolation --skip-dependency-check.

It works by keeping clones of the submodules in special directories and then using symlinks to make flat- and src-layout packages available for import.

This repo currently supports setuptools and flit-core backends but could be extended to support all Python build backends.

Because it's a git repo, it can readily be trusted by pinning to a specific git hash (which also has pins to the submodules' hashes).

I believe this approach may generalize to any source-only integration bootstrapping process. It still would require integrators to identify build backends in order to know to employ the bootstrapping. I'm also yet unsure if this approach could support compiled artifacts. I'm assuming they're out of scope for now.

@RonnyPfannschmidt

A key reason why I proposed importing metadata from pyproject.toml is to avoid generated artifacts and to be able to pin by sdist hash (to enable downstreams in distros).

@jaraco
Member Author

jaraco commented Jul 31, 2024

A key reason why I proposed importing metadata from pyproject.toml is to avoid generated artifacts and to be able to pin by sdist hash (to enable downstreams in distros).

That's fair. My thinking was we'd first work toward something that can prove the viability of the concept. If it's viable, we can iterate on the technique and make it more dynamic and less generated. I agree it would be ideal for an integrator to simply run a tool against a set of sdists and get a bootstrapped environment with those sources and their metadata available.

@jaraco
Member Author

jaraco commented Aug 1, 2024

To use it, simply git clone --recurse-submodules --shallow-submodules --depth 1 https://github.com/pypa/rootbeer and then build any of the backend resources with env PYTHONPATH=path/to/rootbeer pyproject-build --no-isolation --skip-dependency-check.

If anyone tried this earlier, it probably didn't work for you unless you have the same git config as I do. In pypa/rootbeer@ab9fa69, that's fixed (normal https URLs are used).

@vishwin

vishwin commented Aug 6, 2024

In FreeBSD Ports, rootbeer seems akin to how we do Rust and Go software: we fetch the sources (tarballs) for all dependencies as part of the fetch phase for the target software itself (i.e. setuptools), and the dependency sources are extracted as if cargo or go's package manager did the fetching and extracting. To us, this is still vendoring dependencies, so I fail to see how this is an improvement from the existing _vendor directory practice in i.e. setuptools today, where only one tarball (with _vendor) is fetched.

Another note with FreeBSD Ports' build process in particular: there is absolutely no network access outside the fetch phase. We also require strict checksumming of all sources fetched. This precludes using pip or git amongst other tools.
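
For readers unfamiliar with this kind of check, here is a minimal sketch of verifying a fetched source archive against a pinned SHA-256 digest. The function names are hypothetical and this is not the FreeBSD Ports implementation, only an illustration of the idea using hashlib.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_sha256: str) -> None:
    """Raise ValueError unless the file matches the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"{path.name}: expected {expected_sha256}, got {actual}")
```

A check like this runs entirely offline once the archive has been fetched, which is why it fits a build process that forbids network access outside the fetch phase.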

@jaraco
Member Author

jaraco commented Aug 6, 2024

I had a really good conversation yesterday with dhellmann, and I'm once again on the fence about what is the best approach. We talked about a source-based bootstrap repo (rootbeer) versus a wheel-based one (using pure-Python wheels), and his perspective is that that might be viable under the right conditions, and the wheel-based repo would be preferable because it would take less manual labor to maintain.

He also made the case for the build backends to simply comply with the DAG requirement (status quo), so I've added a relevant section to the alternatives. Doug suggested creating a system of tiered backends to honor the DAG.

I also added a section to consider that backends compile to a simplified backend, which might provide a way to avoid build cycles. That approach would be a substantial amount of work, probably insurmountable for Setuptools in the current state, and still a substantial amount of work for a Setuptools-compatible backend that can compile extensions.

@jaraco
Member Author

jaraco commented Aug 6, 2024

I fail to see how this is an improvement from the existing _vendor directory practice in i.e. setuptools today

From the integrator's perspective, it's very similar. From a build backend developer's perspective, it addresses the challenges (which I just elaborated) laid out in the proposal. I'm aiming for a systemic solution that will unblock build backends from having proper dependencies, to have a methodology for any integrator to build the Python ecosystem from source. Moving the vendored dependencies out of the build backend allows the build backend to develop normally, to adopt dependencies as needed and not be responsible for the weird behaviors that occur when vendoring goes wrong. That way, most users can get a consistent experience (a project is installed with its declared dependencies regardless of whether it's a build backend or not). It additionally doesn't require ugly hacks to make the vendored packages available at run time.

But this gives me an idea - maybe these vendored artifacts could be supplied in each project's source repo (or a sister repo), but not distributed or managed at runtime.

@RonnyPfannschmidt

I still maintain that a meta importer that considers pyproject.toml to obtain runnable, importable backends as a first stage is feasible.

After all, it would enable runnable build backends for all pure-Python backends, as far as I'm aware.

@jaraco
Member Author

jaraco commented Aug 6, 2024

I still maintain that a meta importer that considers pyproject.toml to obtain runnable, importable backends as a first stage is feasible.

After all, it would enable runnable build backends for all pure-Python backends, as far as I'm aware.

Agreed. I consider that a variant of the Source Aggregation option. Rootbeer is a more static form of the same concept and adding dynamic metadata resolution in a meta importer would be a more sophisticated form.

@jaraco
Member Author

jaraco commented Aug 6, 2024

In FreeBSD Ports, rootbeer seems akin to how we do Rust and Go software

I think the difference from those systems is that everything is built by one tool (e.g. cargo), so the problem you're solving is building regular packages in the ecosystem and not build tools themselves. How is cargo itself built? What if cargo had a build-time dependency on another Rust-based project that's built with cargo? Does that happen, and if so, how can FreeBSD Ports break the cycle?

@RonnyPfannschmidt

I disagree with the assessment that bootstrap source aggregation has to deal with a wide set of behaviors

However, there's a need for a mechanical tool that helps with pinning the suggested dependency set and verifying it in CI, so that downstream integrators have a baseline and can decide for themselves on informed deviations.

@RonnyPfannschmidt

The metadata for that could be part of a tool.source-bootstrap section for backends and key dependencies for standard support

@pfmoore
Member

pfmoore commented Aug 6, 2024

I'm still not 100% clear on what the actual requirement is here. @RonnyPfannschmidt is it essential for builds to start from source trees (i.e., github URLs, and similar)? Or is a sdist downloaded from PyPI acceptable? What about a pure-Python wheel (which is basically nothing more than a zipped up set of source files, after all)? My understanding is that you want a reproducible build process, so what I'm asking is precisely what you're willing to accept the existence of as a baseline. Given that there's no automatic way to go from a Python package name (as stated in a list of project dependencies, for example) to a github URL, if you do want to rely only on source VCS locations, who do you expect to maintain the (project name/version -> VCS URL) mapping? Is that something you'd be willing to maintain yourself, or would you want that to be provided?

I don't care what the answer is here [1], just that we agree on a common understanding of which answer is what we're working towards.

It feels to me like we're chasing solutions while still not completely understanding the requirements.

We can almost certainly come up with some solution for any of the above requirements, but the further "down the stack" we go, the more work it will be (for backend maintainers and redistributors).

While we're looking at requirements rather than solutions, I'll repeat my point from above:

I hope no-one is suggesting that a build backend should depend on anything other than pure Python packages? That would be far more difficult to handle (and IMO goes way beyond reasonable expectations).

Footnotes

  1. Although I reserve the right to push back if I think the requirements are unreasonable, or inconsistent... 🙂

@dhellmann

I'm still not 100% clear on what the actual requirement is here. @RonnyPfannschmidt is it essential for builds to start from source trees (i.e., github URLs, and similar)? Or is a sdist downloaded from PyPI acceptable? What about a pure-Python wheel (which is basically nothing more than a zipped up set of source files, after all)?

For the project I'm working on, we start with sdists from pypi. For a few packages we're building, there are no sdists so for those we have manually identified the git repos and either download a GitHub release tarball with source or clone the repo. Of the several hundred packages we're building we have fewer than 10 where we had to do that, so the automation in fromager is doing most of the work for us based on those sdists.

We might use pure-Python wheels as part of the build toolchain, but probably wouldn't want to introduce those into the set of things we're building. We need to be able to rebuild, and in some cases change the packaging metadata or apply bug fixes to the code as part of that build. Both are typical downstream distro requirements to meet obligations for shipping fixes for issues while we work on pushing patches upstream.

I hope no-one is suggesting that a build backend should depend on anything other than pure Python packages? That would be far more difficult to handle (and IMO goes way beyond reasonable expectations).

I agree with this. We are building some packages that depend on compilers for Rust, C/C++, Java, and Go. We don't expect those to be expressed as dependencies on other Python packages, and in some cases where they are we remove them. For example, we remove dependencies on tools like cmake and use the version from the Linux distribution.

@RonnyPfannschmidt

I consider both sdist and vcs checkouts/archives as viable starting points

@pfmoore
Member

pfmoore commented Aug 6, 2024

OK. So my understanding of the problem here is that you might, for example, want to build project X from a sdist or VCS checkout that you've identified. You may have made changes to that source code as part of your role as a redistributor. Now, you want to do the build.

A simple python -m build /path/to/X will pull in the build backend, and any dependencies it may have. Normally, it would take wheels from PyPI (because that's how packaging tools set up environments by default). You don't want that, so you need to use the --no-isolation flag and set up the build environment manually. Is that right? Or do you rely on pip wheel --no-binary :all:? If the latter, that will simply fail if it sees cycles in the build backend dependency tree, as per PEP 517.

Assuming you're setting up the build environment, you need to install the build backend and all of its dependencies. Again, all from verified (and possibly patched) source. And so on, recursively. That process will only end if you have no loops, and ultimately hit only build backends that have no dependencies and build using an in-tree backend. That's precisely the scenario that in-tree backends were designed for, and the reason for the "no cycles" requirement in PEP 517.

So given this, we've established that in the presence of cycles, standard-compliant build tools won't be able to set up the build environment for you. So the question is, what's the best way to manually set up an environment with a working setuptools (or hatchling, or meson-python, or whatever) in it.

  1. You need to identify the full dependency tree for the build backend. Unfortunately, you can't get standard tools to do that for you, because you're in non-compliant territory here. So either you need to assume that all dependencies (and dependencies of dependencies, etc.) are statically defined in the sdists (or in pyproject.toml if you have a source tree rather than a sdist), and then read and resolve that tree manually (in theory you need a dependency resolver for this, but in practice you may be able to get away with a simple tree walk) or you need the build backend to publish a separate list of dependencies that are sufficient to create a working build environment. For simplicity, let's assume the backend publishes that list - essentially a lockfile for a "bootstrapping" build environment.
  2. Now, you need to make all of those packages available in the build environment. You start by downloading and unpacking all of the sources. This can either be by fetching them from PyPI, or by consulting a name/version -> VCS URL mapping.
  3. In nearly all cases, the dependencies will be structured either with a src layout or a "flat" layout. In both of these layouts, putting the root directory of the package tree onto sys.path will give a working package. Therefore, if you create a .pth file that contains one line for each package that can be made importable in this way, you have what you need.

That just leaves packages that choose to have a more complex structure, which cannot be exposed via a .pth entry. I'm assuming that such packages are rare, as they wouldn't support the "legacy" (pre-PEP 660) editable install mechanism, and we'd have probably heard about that... I think it's probably reasonable to require that build backends either don't depend (directly or recursively) on such packages, or they provide instructions on how to add an uninstalled copy of the project to sys.path.

So, in summary, I believe a build backend like setuptools could reasonably offer a set of instructions for creating a "bootstrap" build environment that went something like this:

  • Download and unpack sdists for X 1.0, Y 2.0, Z 1.0, ... into a staging directory. Also, check out the following VCS URLs into that staging directory.
  • Create a .pth file containing /path/to/staging/dir/X-1.0/src, /path/to/staging/dir/Y-2.0, /path/to/staging/dir/Z-1.0/src, ...
  • Put that .pth file into your environment's site-packages.

Your environment will now be able to run the build backend without downloading or installing any additional software.
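For illustration, the three steps above could be scripted roughly as follows. This is only a sketch: `write_bootstrap_pth` and the file name `bootstrap-backend.pth` are hypothetical, and it assumes every listed source root is a plain src or flat layout that becomes importable once it is on sys.path.

```python
import site
import tempfile
from pathlib import Path


def write_bootstrap_pth(site_packages, source_roots, name="bootstrap-backend.pth"):
    """Write a .pth file with one importable source root per line.

    Hypothetical helper: it does not unpack sdists or check out VCS URLs,
    it only records directories that already exist in the staging area.
    """
    roots = [Path(root) for root in source_roots]
    missing = [str(root) for root in roots if not root.is_dir()]
    if missing:
        raise FileNotFoundError(f"source roots not found: {missing}")
    pth = Path(site_packages) / name
    pth.write_text("".join(f"{root.resolve()}\n" for root in roots))
    return pth


if __name__ == "__main__":
    # Throwaway directories stand in for the staging dir and site-packages.
    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp) / "staging"
        fake_site = Path(tmp) / "site-packages"
        fake_site.mkdir()
        # Hypothetical src-layout project "X-1.0" providing the package "xpkg".
        pkg = staging / "X-1.0" / "src" / "xpkg"
        pkg.mkdir(parents=True)
        (pkg / "__init__.py").write_text("VALUE = 42\n")
        write_bootstrap_pth(fake_site, [staging / "X-1.0" / "src"])
        # site.addsitedir() processes .pth files the same way interpreter startup does.
        site.addsitedir(str(fake_site))
        import xpkg
        print(xpkg.VALUE)
```

In a real bootstrap, `site_packages` would be the target environment's site-packages directory, and the list of roots would come from the backend's published instructions.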

IMO, that's a reasonable thing to ask of a build backend. The backend maintainers could choose to put whatever infrastructure they want in place to make providing this set of instructions easier, but they don't have to do so under the "everything from source" constraint. The only thing I can see which might cause problems is if there are dependencies which don't support being exposed via a .pth file. But unless @jaraco (or another build backend maintainer) can point to a specific project where this is the case and avoiding the dependency (or conditionalising it so that it's not needed for bootstrapping) is impossible, then I think we should simply say that build backends shouldn't depend on such projects (just like they shouldn't depend on projects that use C extensions).

Am I missing something here? I don't see a need for anything as complex as a meta importer, or a separately maintained bootstrap repo (would we need one for each build backend, or one huge one that aggregates all build backends?)

To give a simple example, the instructions for flit-core (not needed as flit-core has no dependencies, but it's an easy one):

  1. Check out https://github.com/pypa/flit
  2. Add a .pth file containing /path/to/checkout/flit_core to your site-packages.

For hatchling, it's not much harder - 5 projects (hatchling, packaging, pathspec, pluggy, and trove-classifiers), all of which can be added via a .pth file. I found these just by doing a (normal) pip install hatchling and seeing what got installed. I can't do this at the moment for setuptools, as it vendors its dependencies, but I doubt it would be much harder (just a longer list).

@vishwin

vishwin commented Aug 6, 2024

I think the difference from those systems is that everything is built by one tool (e.g. cargo), so the problem you're solving is building regular packages in the ecosystem and not build tools themselves. How is cargo itself built? What if cargo had a build-time dependency on another Rust-based project that's built with cargo? Does that happen, and if so, how can FreeBSD ports break the cycle?

We build cargo as part of rustc:

  • it is part of the rustc source tarball
  • it is updated with rustc
  • it avoids circular dependencies

We generally generate our own bootstrapped toolchain to build the new one, especially if the Rust Project does not supply a bootstrap for the target architecture.

When applying this principle to Python packages, we cannot use bdists. More importantly, the ports system operates as a fully atomic DAG, so we cannot have a situation like setuptools depends on jaraco.text, which uses setuptools to build. This is even if only a new/updated setuptools is to be built where unchanged jaraco.text exists in the repository.

@dhellmann

OK. So my understanding of the problem here is that you might, for example, want to build project X from a sdist or VCS checkout that you've identified. You may have made changes to that source code as part of your role as a redistributor. Now, you want to do the build.

A simple python -m build /path/to/X will pull in the build backend, and any dependencies it may have. Normally, it would take wheels from PyPI (because that's how packaging tools set up environments by default). You don't want that, so you need to use the --no-isolation flag and set up the build environment manually. Is that right? Or do you rely on pip wheel --no-binary :all:? If the latter, that will simply fail if it sees cycles in the build backend dependency tree, as per PEP 517.

fromager sets up a virtualenv then uses pip wheel with --no-isolation. We could use build, but pip is available via a system package, so that's one less dependency to deal with. https://github.com/python-wheel-build/fromager/blob/main/src/fromager/wheels.py#L205

Assuming you're setting up the build environment, you need to install the build backend and all of its dependencies. Again, all from verified (and possibly patched) source. And so on, recursively. That process will only end if you have no loops, and ultimately hit only build backends that have no dependencies and build using an in-tree backend. That's precisely the scenario that in-tree backends were designed for, and the reason for the "no cycles" requirement in PEP 517.

Yep, we're definitely relying on that no-cycles requirement for fromager.

So, in summary, I believe a build backend like setuptools could reasonably offer a set of instructions for creating a "bootstrap" build environment that went something like this:

  • Download and unpack sdists for X 1.0, Y 2.0, Z 1.0, ... into a staging directory. Also, check out the following VCS URLs into that staging directory.
  • Create a .pth file containing /path/to/staging/dir/X-1.0/src, /path/to/staging/dir/Y-2.0, /path/to/staging/dir/Z-1.0/src, ...
  • Put that .pth file into your environment's site-packages.

That could even be scripted. If the metadata was exposed in a standard way, then a single tool could do the work.

IMO, that's a reasonable thing to ask of a build backend. The backend maintainers could choose to put whatever infrastructure they want in place to make providing this set of instructions easier, but they don't have to do so under the "everything from source" constraint. The only thing I can see which might cause problems is if there are dependencies which don't support being exposed via a .pth file. But unless @jaraco (or another build backend maintainer) can point to a specific project where this is the case and avoiding the dependency (or conditionalising it so that it's not needed for bootstrapping) is impossible, then I think we should simply say that build backends shouldn't depend on such projects (just like they shouldn't depend on projects that use C extensions).

+1

Am I missing something here? I don't see a need for anything as complex as a meta importer, or a separately maintained bootstrap repo (would we need one for each build backend, or one huge one that aggregates all build backends?)

The only thing I'd push back on is the manual process part. It's not hard to do it 1 time. It becomes onerous when you have to do it repeatedly, for new versions of tools. The 517 standard made it possible to build fromager at all, so I'm a big fan of standardizing something like this.

@pfmoore
Member

pfmoore commented Aug 7, 2024

That could even be scripted. If the metadata was exposed in a standard way, then a single tool could do the work.

What metadata do you need? If all you want is to know what the dependencies are for a build backend, you can do

pip install --disable-pip-version-check --quiet --ignore-installed --dry-run --report - hatchling | jq '.install | .[].metadata| {name,version}'

(replace "hatchling" with your build backend of choice, or a list of all the backends you want in your environment, if that's what you prefer).

You can get the sdist URL for a name, version pair from the simple API. The only thing you can't do with existing metadata is know what .pth entry to add - but that could be saved in a manually-maintained list:

hatchling: .
pluggy: ./src
# etc
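As a rough sketch of how those two pieces could fit together: the code below parses the JSON that the pip command above writes to stdout and joins it with the manually maintained list. The helper names, the sample versions, and the assumption that each sdist unpacks to a `<name>-<version>` directory under the staging directory are mine, not anything pip provides.

```python
import json
import subprocess
import sys
from pathlib import PurePosixPath


def pinned_from_report(report):
    """Extract sorted (name, version) pairs from a pip installation report."""
    return sorted(
        (item["metadata"]["name"], item["metadata"]["version"])
        for item in report.get("install", [])
    )


def resolve_backend(requirement):
    """Run the pip command from the comment above and parse its output."""
    cmd = [
        sys.executable, "-m", "pip", "install",
        "--disable-pip-version-check", "--quiet", "--ignore-installed",
        "--dry-run", "--report", "-", requirement,
    ]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return pinned_from_report(json.loads(out))


def pth_entries(pins, layout, staging):
    """Combine resolved pins with the manually maintained layout list.

    Assumption: each sdist was unpacked to <staging>/<name>-<version>.
    """
    entries = []
    for name, version in pins:
        if name not in layout:
            raise KeyError(f"no .pth entry recorded for {name}")
        entries.append(str(PurePosixPath(staging) / f"{name}-{version}" / layout[name]))
    return entries


if __name__ == "__main__":
    # Hand-written sample in the shape of a pip report; versions are made up.
    sample = {"install": [
        {"metadata": {"name": "pluggy", "version": "1.5.0"}},
        {"metadata": {"name": "hatchling", "version": "1.25.0"}},
    ]}
    layout = {"hatchling": ".", "pluggy": "./src"}
    for line in pth_entries(pinned_from_report(sample), layout, "/staging"):
        print(line)
```

`resolve_backend("hatchling")` needs network access and a pip recent enough to support `--report`, so the demo sticks to a canned sample.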

Hmm, you've nerd-sniped me. I may spend some time actually writing this 🙂

@jaraco
Member Author

jaraco commented Aug 7, 2024

Am I missing something here?

Yes. The reason RonnyPfannschmidt brings up metadata and entry points is because some packages are invisible without their entry points. setuptools_scm is a case-in-point. Setuptools relies on the entry point to detect that the plugin is present and invoke its behavior. So if setuptools depends on setuptools_scm (as a build dependency), but the metadata isn't present, the behavior will be silently missing and the build will likely complete but incorrectly.

It's conceivable that other metadata is accessed, such as importlib.metadata.version('wheel'), which would similarly fail if the metadata hadn't been made available (either by materializing it to .dist-info or by having a meta importer that knows how to load it from pyproject.toml).
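To make the failure mode concrete, here is a small self-contained demonstration. The package name `srconly_plugin_demo` is made up; the point is that a source directory on sys.path is importable but has no distribution metadata behind it.

```python
import importlib
import importlib.metadata
import sys
import tempfile
from pathlib import Path


def demo_source_only_import():
    """Return (importable, has_metadata) for a package exposed only via sys.path."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "srconly_plugin_demo"  # hypothetical package name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("def hook():\n    return 'ran'\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("srconly_plugin_demo")
            importable = module.hook() == "ran"
            try:
                importlib.metadata.version("srconly_plugin_demo")
                has_metadata = True
            except importlib.metadata.PackageNotFoundError:
                # No .dist-info or .egg-info directory exists for the source tree.
                has_metadata = False
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("srconly_plugin_demo", None)
    return importable, has_metadata


if __name__ == "__main__":
    print(demo_source_only_import())
```

Entry points are discovered through the same metadata, so a plugin exposed this way would be importable yet never advertised to the tool that looks it up.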

More importantly, if the package is doing anything more than presenting packages in their source layout, it won't work. For example, Setuptools adds a distutils-precedence.pth file to the install at build time, something that won't be present in the source tree. If any of the projects has dynamic or irregular procedures for supplying the code such that the source code doesn't map directly to the installed layout, it may not behave as designed.

That's what I mean when I say using the source code doesn't generalize. The only reliable way to get from a source checkout or sdist to an importable project is by invoking the build backend. Anything short of that is likely to stumble on unsupported edge cases and unexpected breakage.

Moreover, I don't want to leave it to each and every integrator to have to implement the procedure. I'd like to supply a procedure (either documented or in code) that they could ideally apply to any backend that would bootstrap the backend even if it has cycles in the build dependencies. Perhaps it could apply only to backends that have cycles.

It's also complicated by the fact that some dependencies are platform-sensitive, so getting the right set of dependency sources is a potentially complicated operation.

3. In nearly all cases, the dependencies will be structured either with a src layout or a "flat" layout.

I'd like to be able to support the essential layout as well. I'd like to be able to move jaraco.text to the coherent.build system, which uses the essential layout. I'd like jaraco.text not to be constrained to less sophisticated build systems only because Setuptools depends on it. I accept the responsibility to supply a routine that (a) detects the essential layout and (b) determines where it should be linked/installed. Incidentally, the rootbeer approach has the simplest implementation for projects on the essential layout - the sources can be checked out directly to a usable path instead of needing to be linked.

We generally generate our own bootstrapped toolchain to build the new one

This is essentially the concept I'm after with the proposal or the source-based alternative - providing a bootstrapped toolchain for build backends that have cycles.

would we need one for each build backend, or one huge one that aggregates all build backends?

It could go either way. When I started experimenting with rootbeer, I found that I couldn't build wheel without flit-core, so I added it as well, but there could conceivably be separate repos for each backend. The latter would be easier to maintain (better clarity of responsibility domain) whereas the former would be easier to use (one-stop shop for bootstrapping). I believe this is a concern we can tackle after we've established a recommended methodology.

It feels to me like we're chasing solutions while still not completely understanding the requirements.

I'll try to state the requirement succinctly:

It should be possible for system integrators (redistributors) to build a Python environment from pure sources, which may be sdists or VCS checkouts (or sometimes VCS tarballs). The current standards disallow cycles in "build requirements", but this limitation really extends to all dependencies and their build and runtime requirements. As a result, build backends are essentially blocked from having dependencies (or severely limited and forced to apply constraints on the projects they adopt). There should be a methodology that any integrator can follow to bootstrap an environment and break cycles in the build backends such that the build tooling can be built purely from source even when such cycles exist. Ideally, this methodology would allow for arbitrary dependencies, though it may only be feasible to support pure-Python dependencies.

@jaraco
Member Author

jaraco commented Aug 7, 2024

If all you want is to know what the dependencies are for a build backend, you can do

pip install --disable-pip-version-check --quiet --ignore-installed --dry-run --report - hatchling | jq '.install | .[].metadata| {name,version}'

For fun, I ran this command against coherent.build and at the time of writing, it produces a large list. The pyobjc dependency comes from jaraco.path. That's an example of a dependency that could be omitted because it's not reached by the backend. Is there a way in that pip command to state something like --ignore-dependency pyobjc?

@pfmoore
Member

pfmoore commented Aug 7, 2024

Yes. The reason RonnyPfannschmidt brings up metadata and entry points is because some packages are invisible without their entry points. setuptools_scm is a case-in-point. Setuptools relies on the entry point to detect that the plugin is present and invoke its behavior. So if setuptools depends on setuptools_scm (as a build dependency), but the metadata isn't present, the behavior will be silently missing and the build will likely complete but incorrectly.

Ah, OK. That's frustrating. Although in the case of setuptools_scm specifically, it might be possible to special-case it because when building from a sdist it basically isn't needed anyway (the version has already been frozen into the sdist).

I'll try to state the requirement succinctly
[... actual requirement omitted, for brevity ...]

Fair enough. You're asking for an extremely general solution, which I personally think is the wrong approach to take. But if that's the requirement you (and the other participants here) want to address, then I'm not going to object. I do think that a basic 80% solution, combined with some form of special casing for a small set of known "difficult" cases, is a better approach than something that tries to handle every possible weirdness that could occur. But again, it's not likely to be me designing the solution, so do whatever you think is best.

Is there a way in that pip command to state something like --ignore-dependency pyobjc?

No, apart from separating optional dependencies out into extras and picking the subset you need for this usage. You could of course write a custom program to do what you want - but it's not pip, because pip is an installer, not a generalised resolution engine, so it only handles what's needed for that task (and "installing something that according to the project's metadata is invalid" isn't a use case we want to support).

I'd like to be able to support the essential layout as well.

This is off-topic and I know we discussed this offline, but that layout has significant ecosystem-wide issues to address (specifically, pyproject.toml isn't at the top level of the source tree) so I don't think you should be complicating this issue with concerns about projects that use that layout - we already have enough to worry about with projects where the only standard they violate is the rule about not having cycles in the build requirements.

@RonnyPfannschmidt

I want to reiterate that having a meta path hook that considers pyproject.toml metadata most likely solves entry points and enough metadata to run for all backends I'm aware of.

I still maintain the conviction that enabling import of the core dependencies from a source/VCS checkout while honoring metadata from pyproject.toml will most likely resolve any issues around dependency loops.

If all packages in the dependency loop are importable and functional with just pyproject.toml and without needing an actual build, then that is a very reasonable initial stage to bootstrap editable wheels or real wheels of all involved packages as a next step.

The only rules all dependencies would need to follow are something along the lines of not using distribution metadata that is not available in the build (like version).

However, in a later iteration that could be resolved by having the build backends in question provide plugins that feed that metadata using entry points.

The tool should not be overly concerned with the filesystem layout surrounding the packages. It should be configured with a list of folders that are either packages (pyproject.toml) or a directory containing packages/symlinks to packages.

It'll be a while before I can slap together a POC, as I need to research how to correctly provide a Distribution and a MetaPath finder.

@RonnyPfannschmidt

I believe a basic heuristic can be shipped with the pyproject approach.
I consider flat, src, and coherent easily supportable, with any deviation requiring a pyproject.toml entry to disambiguate.

For example, building hatch from source via VCS checkouts would need checkouts of

  • hatch (includes hatchling)
  • packaging
  • flit (for flit-core)
  • cpburnz/python-pathspec
  • pytest-dev/pluggy
  • setuptools
  • setuptools_scm
  • tomli (if Python < 3.11)
  • trove-classifiers
  • di/calver
  • the to be built tool

additionally

# package_locations.toml
# todo: expand schema in case details are missing
# some packages need an editable wheel for full metadata, but it seems a full topology can be done, so setuptools/flit-core would be about the first to be built, and then as more editable wheels are available more packages can be built

hatchling = "hatch/backend"
flit-core = "flit/flit_core"

The need for the configuration file could potentially be alleviated by requiring the top-level pyproject.toml to include a table mapping package names to their nested locations.

I arrived at the conclusion that it seems easier to run the build backends to create editable wheels following a dependency tree than to create an import hook that can work off pyproject.toml for now.

@jaraco
Member Author

jaraco commented Aug 13, 2024

I arrived at the conclusion that it seems easier to run the build backends to create editable wheels following a dependency tree than to create an import hook that can work off pyproject.toml for now.

A lot of people struggle to understand the problem. I struggled with it for a long time too.

You're right that in the current world, it's possible to create a hierarchy of build tools (flit-core, then setuptools dependencies, then setuptools, then hatchling's dependencies, then hatchling). And that would be easier if it could be achieved in general. However, that only works because setuptools vendors its own dependencies and hatchling's dependencies don't depend on hatchling. There are no cycles, which imposes severe constraints on the build backends: they must vendor dependencies or greatly restrict the backends that their dependencies can use, and at no point can a system be self-contained (all projects using the same backend).

As soon as you let setuptools have dependencies, it introduces a cycle. It becomes impossible to "run the build backends to create editable wheels" for setuptools dependencies when those dependencies depend on setuptools. I want some way to run setuptools (or other build backend) before it (and its dependencies) are installed.
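A toy model of the difference, using the standard library's graphlib. The dependency sets below are simplified for illustration and are not a complete picture of any project's real build requirements.

```python
from graphlib import CycleError, TopologicalSorter


def build_order(build_requires):
    """Return an order in which every project is built after its requirements.

    Raises graphlib.CycleError when no such order exists.
    """
    return list(TopologicalSorter(build_requires).static_order())


# Simplified picture of today: setuptools vendors its dependencies,
# so it needs nothing else in order to be built.
today = {
    "flit-core": set(),
    "setuptools": set(),
    "jaraco.text": {"setuptools"},
    "packaging": {"flit-core"},
    "hatchling": {"packaging"},
}

# Hypothetical future: setuptools declares jaraco.text as a dependency.
unvendored = dict(today, setuptools={"jaraco.text"})

if __name__ == "__main__":
    print("today:", build_order(today))
    try:
        build_order(unvendored)
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the detected cycle.
        print("cycle:", exc.args[1])
```

Any scheme based on running the backends in dependency order works for the first graph and has no valid starting point for the second.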

It feels untenable for one project like setuptools to impose dependency and build system constraints on its dependencies merely because setuptools has adopted the dependency.

Does your vision account for these concerns? Can you elaborate on your proposed approach?

@RonnyPfannschmidt

That case needs source-based importlib metadata from pyproject.toml, and all packages that are part of a cycle would have to support operating with degraded metadata from pyproject.toml.

I'm under the impression that the difficulty of implementing a correct meta path loader for this highlights blind spots in the APIs.
