pip_import builds source packages on workspace load #96

Closed
siddharthab opened this issue Apr 21, 2018 · 12 comments

@siddharthab
Contributor

pip_import currently installs all the wheels in a single repository, the same repository as the requirements.bzl file. That implies that loading the requirements.bzl file in WORKSPACE also builds source packages for which wheels are not available (e.g. pysam). This makes the initial load slow and requires all users to have the right environment set up to build source packages into wheels, even if they don't do any Python work.

It would be nice to have a "no-deps" mode where the requirements.txt file is expected to be the transitive closure (e.g. from pip freeze) and we have a separate repository for each package (relying on the --no-deps argument to pip wheel), such that packages are downloaded and potentially built on demand rather than on first load.

This would also allow users to skip a dependency from pip altogether and provide their own substitute by appending to the requirement dict (see the sketch below).
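
To illustrate (this is a hypothetical sketch, not an existing API), such a requirements.bzl could be a plain dict mapping each frozen package to its own per-package external repository, which users could edit to swap in their own substitutes:

# Hypothetical requirements.bzl generated from a frozen requirements.txt;
# the repository names and versions below are examples only.
_requirements = {
    "numpy": "@pypi__numpy_1_14_2//:pkg",
    "pysam": "@pypi__pysam_0_14_1//:pkg",
}

def requirement(name):
    name_key = name.replace("-", "_").lower()
    return _requirements[name_key]

Overwriting an entry in _requirements (or appending a new one) would let a user point a given dependency at their own target instead of the pip-built one.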

@siddharthab
Contributor Author

Something like the repository rule below, which takes a single package name and version, downloads the wheel, and calls whltool.par with the .whl file. Users can provide their own requirements.bzl file with the requirement dict.

def _pip_repository_impl(rctx):
    if not rctx.which("python"):
        fail("python not found")
    if not rctx.which("pip"):
        fail("pip not found")

    # Download (and build, if necessary) the wheel for this single package,
    # without pulling in its dependencies.
    exec_result = rctx.execute(["pip", "wheel",
                                "%s==%s" % (rctx.attr.package, rctx.attr.version),
                                "--no-deps",
                                ] + rctx.attr.wheel_args,
                               environment = rctx.attr.wheel_env)
    if exec_result.return_code:
        fail("Failed to obtain package wheel: \n%s\n%s" % (exec_result.stdout, exec_result.stderr))
    if exec_result.stderr:
        print(exec_result.stderr)

    # Locate the .whl file that pip wheel produced in the repository directory.
    exec_result = rctx.execute(["find", ".", "-type", "f", "-name", "*.whl"])
    if exec_result.return_code:
        fail("Could not find package wheel: \n%s" % exec_result.stderr)
    whl_file = exec_result.stdout.rstrip()

    # Unpack the wheel and generate BUILD targets via whltool.par.
    args = [
        "python",
        rctx.path(rctx.attr._script),
        "--whl", whl_file,
        "--requirements", rctx.attr.requirements,
    ]

    if rctx.attr.extras:
        args += [
          "--extras=%s" % extra
          for extra in rctx.attr.extras
        ]

    result = rctx.execute(args)
    if result.return_code:
        fail("whl_library failed: %s (%s)" % (result.stdout, result.stderr))

    rctx.execute(["rm", whl_file])

    return

pip_repository = repository_rule(
    attrs = {
        "package": attr.string(mandatory = True),
        "version": attr.string(mandatory = True),
        "requirements": attr.string(mandatory=True),
        "extras": attr.string_list(),
        "wheel_args": attr.string_list(),
        "wheel_env": attr.string_dict(),
        "_script": attr.label(
            executable = True,
            default = Label("@io_bazel_rules_python//tools:whltool.par"),
            cfg = "host",
        ),
    },
    local = False,
    implementation = _pip_repository_impl,
)
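
For reference, a sketch of how this rule might be instantiated in a WORKSPACE; the load path, repository names, and versions here are illustrative only:

# WORKSPACE (illustrative)
load("//bazel:pip_repository.bzl", "pip_repository")

pip_repository(
    name = "pypi__six_1_11_0",
    package = "six",
    version = "1.11.0",
    requirements = "@py_requirements//:requirements.bzl",
)

Since each package gets its own external repository, a wheel is only downloaded or built when a target actually depends on it.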

@Globegitter

@siddharthab That is something we are quite interested in as well. Are you using this right now? If so, do you have an example of how it is being used?

@Globegitter

Globegitter commented Jun 8, 2018

Looking further into it, in my opinion loading all requirements is not so much the issue (though I agree it would be nice to have a separate repository for each requirement); the main problem seems to be the pip_install call, which installs all requirements on workspace load rather than on a Python build. If there were no pip_install function and deps were declared via @pip_import_repository_name//:pip_library, e.g. @py_deps//:freezegun or @py_deps//:thumbor, instead of the current requirement function, that would also improve the current situation.

But maybe a half-way solution does not make much sense and one might as well go all the way. Or I wonder whether there is a way to generate multiple external repositories from one repository_rule, similar to how it is done currently but without actually downloading/building the deps; one would then depend on the external repository directly, such as @pypi__argparse_1_4_0//:pkg (see the sketch below).
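
For illustration, depending on such per-package repositories directly from a BUILD file would look roughly like this (the repository names are hypothetical):

# BUILD file (illustrative)
py_library(
    name = "lib",
    srcs = ["lib.py"],
    deps = [
        "@pypi__freezegun_0_3_10//:pkg",
        "@pypi__argparse_1_4_0//:pkg",
    ],
)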

@siddharthab
Contributor Author

Yes, we have been using the scheme I mentioned above, and it has been working fine for us.

@Globegitter

Globegitter commented Jun 8, 2018

@siddharthab But then you have to specify pip dependencies as well as their sub-dependencies by hand, right?

I just made a quick-and-dirty hack based on the existing rules_python that takes a requirements.txt file and puts all those dependencies into the same external repository. So whenever one builds anything Python-related, all dependencies in this requirements.txt are fetched, but at least only when a Python target is actually being built.

Here is that code:

def _pip_requirements_repository_impl(rctx):
    if not rctx.which("python"):
        fail("python not found")
    if not rctx.which("pip"):
        fail("pip not found")

    # Empty BUILD file at the repository root.
    rctx.file("BUILD", "")

    # Generate a requirements.bzl that maps a package name to the
    # corresponding package inside this repository.
    req = '''
def requirement(name):
    name_key = name.replace("-", "_").lower()
    return "@%s//" + name_key + ":pkg"
''' % (rctx.attr.name)
    rctx.file("requirements.bzl", req)

    # Download (and build, where needed) wheels for every entry in the
    # requirements file, including transitive dependencies.
    exec_result = rctx.execute(["pip", "wheel",
                                "-r%s" % rctx.path(rctx.attr.requirements),
                                ] + rctx.attr.wheel_args,
                               environment = rctx.attr.wheel_env)

    if exec_result.return_code:
        fail("Failed to obtain package wheel: \n%s\n%s" % (exec_result.stdout, exec_result.stderr))
    if exec_result.stderr:
        print(exec_result.stderr)

    # Collect all the .whl files that pip wheel produced.
    exec_result = rctx.execute(["find", ".", "-type", "f", "-name", "*.whl"])
    if exec_result.return_code:
        fail("Could not find package wheel: \n%s" % exec_result.stderr)

    whl_files = exec_result.stdout.rstrip().split("\n")

    # Unpack each wheel into its own sub-package of this repository, named
    # after the distribution (e.g. ./mock-2.0.0-py2.py3-none-any.whl -> ./mock).
    for whl_file in whl_files:
        whl_dir = str(whl_file).lower().split("-")[0]
        result = rctx.execute(["mkdir", whl_dir])

        if result.return_code:
            fail("mkdir failed: %s (%s)" % (result.stdout, result.stderr))

        args = [
            "python",
            rctx.path(rctx.attr._script),
            "--whl", whl_file,
            "--requirements", "@%s//:requirements.bzl" % rctx.attr.name,
            "--directory", whl_dir,
        ]

        if rctx.attr.extras:
            args += [
                "--extras=%s" % extra
                for extra in rctx.attr.extras
            ]
        result = rctx.execute(args)
        if result.return_code:
            fail("whl_library failed: %s (%s)" % (result.stdout, result.stderr))

        rctx.execute(["rm", whl_file])

    return

pip_requirements_repository = repository_rule(
    attrs = {
        "requirements": attr.label(
            allow_files = True,
            mandatory = True,
            single_file = True,
        ),
        "extras": attr.string_list(),
        "wheel_args": attr.string_list(),
        "wheel_env": attr.string_dict(),
        "_script": attr.label(
            executable = True,
            default = Label("@io_bazel_rules_python//tools:whltool.par"),
            cfg = "host",
        ),
    },
    local = False,
    implementation = _pip_requirements_repository_impl,
)

Usage:

# WORKSPACE file
pip_requirements_repository(
    name = "py_deps",
    requirements = "//:requirements.txt",
)

# BUILD file
load("@py_deps//:requirements.bzl", "requirement")

py_library(
    name = "lib",
    srcs = ["lib.py"],
    deps = [
        requirement("mock"),
    ],
)

So the usage stays exactly the same, though it has not been extensively tested.

@siddharthab
Contributor Author

Yes, I think specifying every dependency manually is the Bazel way. Without it, the versions of all your dependencies are not fixed for every user, and a person running the repository rule at a later date may get different versions for some of the unstated dependencies.

To list all the dependencies, I created a small script for my users that installs the package in a virtualenv and gets a requirements.txt file out of it. It's not pretty, but it's what I could come up with.

@Globegitter

You are right, that is also what I see as the Bazel way, but interestingly rules_nodejs has gone a different route by supporting package.json / lock files rather than specifying each dependency individually.

Either way, I am interested in the pip_repository approach you created. Can you by any chance share the script you put together to get the requirements.txt out of the virtualenv? That would be very useful.

@siddharthab
Contributor Author

#!/bin/bash
# Utility script to generate a requirements file for the transitive closure of
# dependencies of a package.
# The arguments to this script are passed through to pipenv install.

set -euo pipefail

readonly tmp_dir="$(mktemp -d)"

function cleanup {
  rm -rf "${tmp_dir}"
}
trap cleanup INT HUP QUIT TERM EXIT

cd "${tmp_dir}"
if ! command -v pipenv > /dev/null; then
  echo "pipenv not found; you can install it by running:"
  echo "  pip install --user pipenv"
  exit 1
fi
pipenv install "$@" > /dev/null

echo ""
echo "Your package requirements are:"
pipenv run pip freeze
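
For example, assuming the script is saved as freeze_deps.sh (the file name is only illustrative), it would be invoked with the packages to pin and the printed list copied into requirements.txt:

# Illustrative invocation; the arguments go straight to `pipenv install`.
./freeze_deps.sh pysam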

@Globegitter

Awesome, thanks for sharing @siddharthab!

@alexeagle
Collaborator

rules_python 0.1.0 has been released, which upstreams the rules_python_external repo. Please switch from pip_import to pip_install, which doesn't have this issue.
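
For reference, pip_install is used roughly like this (names are illustrative; check the rules_python docs for the exact API):

# WORKSPACE (illustrative)
load("@rules_python//python:pip.bzl", "pip_install")

pip_install(
    name = "pip_deps",
    requirements = "//:requirements.txt",
)

# BUILD file (illustrative)
load("@pip_deps//:requirements.bzl", "requirement")

py_library(
    name = "lib",
    srcs = ["lib.py"],
    deps = [requirement("mock")],
)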

@skliarpawlo

skliarpawlo commented Mar 12, 2021

@alexeagle could you please clarify what was fixed in version 0.1.0? We're using this version and it still downloads and installs every dependency from the source requirements.txt on the initial run, even though the Bazel target we run only has a few dependencies out of the whole list.

@alexeagle
Collaborator

Sorry, you're totally right. I misunderstood the implementation when I wrote that and forgot to update here.

The correct resolution is #432.
