pip_import builds source packages on workspace load #96

Closed
siddharthab opened this issue Apr 21, 2018 · 12 comments

@siddharthab
Contributor

pip_import currently installs all the wheels in a single repository, the same repository as the requirements.bzl file. That implies that loading the requirements.bzl file in WORKSPACE also builds source packages for which wheels are not available (e.g. pysam). This makes the initial load slow and requires all users to have the right environment set up to build source packages into wheels, even if they don't do any Python work.

It would be nice to have a "no-deps" mode where the requirements.txt file is expected to be the transitive closure (e.g. from pip freeze) and we have a separate repository for each package (relying on the --no-deps argument to pip wheel), such that packages are downloaded and potentially built on demand rather than on first load.

This would also allow users to skip a dependency from pip altogether and provide their own substitute by appending to the requirement dict (see the sketch below).
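
To illustrate (this is a hypothetical sketch, not an existing API), such a requirements.bzl could be a plain dict mapping each frozen package to its own per-package external repository, which users could edit to swap in their own substitutes:

# Hypothetical requirements.bzl generated from a frozen requirements.txt;
# the repository names and versions below are examples only.
_requirements = {
    "numpy": "@pypi__numpy_1_14_2//:pkg",
    "pysam": "@pypi__pysam_0_14_1//:pkg",
}

def requirement(name):
    name_key = name.replace("-", "_").lower()
    return _requirements[name_key]

Overwriting an entry in _requirements (or appending a new one) would let a user point a given dependency at their own target instead of the pip-built one.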

@siddharthab
Contributor Author

Something like the repository rule below, which takes a single package name and version, downloads the wheel, and calls whltool.par with the .whl file. Users can provide their own requirements.bzl file with the requirement dict.

def _pip_repository_impl(rctx):
    if not rctx.which("python"):
        fail("python not found")
    if not rctx.which("pip"):
        fail("pip not found")

    # Download (and build, if necessary) the wheel for this single package,
    # without pulling in its dependencies.
    exec_result = rctx.execute(["pip", "wheel",
                                "%s==%s" % (rctx.attr.package, rctx.attr.version),
                                "--no-deps",
                                ] + rctx.attr.wheel_args,
                               environment = rctx.attr.wheel_env)
    if exec_result.return_code:
        fail("Failed to obtain package wheel: \n%s\n%s" % (exec_result.stdout, exec_result.stderr))
    if exec_result.stderr:
        print(exec_result.stderr)

    # Locate the .whl file that pip wheel produced in the repository directory.
    exec_result = rctx.execute(["find", ".", "-type", "f", "-name", "*.whl"])
    if exec_result.return_code:
        fail("Could not find package wheel: \n%s" % exec_result.stderr)
    whl_file = exec_result.stdout.rstrip()

    # Unpack the wheel and generate BUILD targets via whltool.par.
    args = [
        "python",
        rctx.path(rctx.attr._script),
        "--whl", whl_file,
        "--requirements", rctx.attr.requirements,
    ]

    if rctx.attr.extras:
        args += [
          "--extras=%s" % extra
          for extra in rctx.attr.extras
        ]

    result = rctx.execute(args)
    if result.return_code:
        fail("whl_library failed: %s (%s)" % (result.stdout, result.stderr))

    rctx.execute(["rm", whl_file])

    return

pip_repository = repository_rule(
    attrs = {
        "package": attr.string(mandatory = True),
        "version": attr.string(mandatory = True),
        "requirements": attr.string(mandatory=True),
        "extras": attr.string_list(),
        "wheel_args": attr.string_list(),
        "wheel_env": attr.string_dict(),
        "_script": attr.label(
            executable = True,
            default = Label("@io_bazel_rules_python//tools:whltool.par"),
            cfg = "host",
        ),
    },
    local = False,
    implementation = _pip_repository_impl,
)
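
For reference, a sketch of how this rule might be instantiated in a WORKSPACE; the load path, repository names, and versions here are illustrative only:

# WORKSPACE (illustrative)
load("//bazel:pip_repository.bzl", "pip_repository")

pip_repository(
    name = "pypi__six_1_11_0",
    package = "six",
    version = "1.11.0",
    requirements = "@py_requirements//:requirements.bzl",
)

Since each package gets its own external repository, a wheel is only downloaded or built when a target actually depends on it.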

@Globegitter

@siddharthab That is something we are quite interested in as well. Are you using this right now? If so, do you have an example of how it is being used?

@Globegitter

Globegitter commented Jun 8, 2018

Looking further into it, in my opinion loading all requirements is not so much the issue (though I agree it would be nice to have a separate repository for each requirement); the main problem seems to be the pip_install call, which installs all requirements on workspace load rather than on a Python build. If there were no pip_install function and deps were declared via @pip_import_repository_name//:pip_library, e.g. @py_deps//:freezegun or @py_deps//:thumbor, instead of the current requirement function, that would also improve the current situation.

But maybe a half-way solution does not make much sense and one might as well go all the way. Or I wonder whether there is a way to generate multiple external repositories from one repository_rule, similar to how it is done currently but without actually downloading/building the deps; one would then depend on the external repository directly, such as @pypi__argparse_1_4_0//:pkg (see the sketch below).
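
For illustration, depending on such per-package repositories directly from a BUILD file would look roughly like this (the repository names are hypothetical):

# BUILD file (illustrative)
py_library(
    name = "lib",
    srcs = ["lib.py"],
    deps = [
        "@pypi__freezegun_0_3_10//:pkg",
        "@pypi__argparse_1_4_0//:pkg",
    ],
)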

@siddharthab
Contributor Author

Yes, we have been using the scheme I mentioned above, and it has been working fine for us.

@Globegitter

Globegitter commented Jun 8, 2018

@siddharthab But then you have to specify pip dependencies as well as their sub-dependencies by hand, right?

I just made a quick-and-dirty hack based on the existing rules_python that takes a requirements.txt file and puts all those dependencies into the same external repository. So whenever one builds anything Python-related, all dependencies in this requirements.txt are fetched, but at least only when a Python target is actually being built.

Here is that code:

def _pip_requirements_repository_impl(rctx):
    if not rctx.which("python"):
        fail("python not found")
    if not rctx.which("pip"):
        fail("pip not found")

    # Empty BUILD file at the repository root.
    rctx.file("BUILD", "")

    # Generate a requirements.bzl that maps a package name to the
    # corresponding package inside this repository.
    req = '''
def requirement(name):
    name_key = name.replace("-", "_").lower()
    return "@%s//" + name_key + ":pkg"
''' % (rctx.attr.name)
    rctx.file("requirements.bzl", req)

    # Download (and build, where needed) wheels for every entry in the
    # requirements file, including transitive dependencies.
    exec_result = rctx.execute(["pip", "wheel",
                                "-r%s" % rctx.path(rctx.attr.requirements),
                                ] + rctx.attr.wheel_args,
                               environment = rctx.attr.wheel_env)

    if exec_result.return_code:
        fail("Failed to obtain package wheel: \n%s\n%s" % (exec_result.stdout, exec_result.stderr))
    if exec_result.stderr:
        print(exec_result.stderr)

    # Collect all the .whl files that pip wheel produced.
    exec_result = rctx.execute(["find", ".", "-type", "f", "-name", "*.whl"])
    if exec_result.return_code:
        fail("Could not find package wheel: \n%s" % exec_result.stderr)

    whl_files = exec_result.stdout.rstrip().split("\n")

    # Unpack each wheel into its own sub-package of this repository, named
    # after the distribution (e.g. ./mock-2.0.0-py2.py3-none-any.whl -> ./mock).
    for whl_file in whl_files:
        whl_dir = str(whl_file).lower().split("-")[0]
        result = rctx.execute(["mkdir", whl_dir])

        if result.return_code:
            fail("mkdir failed: %s (%s)" % (result.stdout, result.stderr))

        args = [
            "python",
            rctx.path(rctx.attr._script),
            "--whl", whl_file,
            "--requirements", "@%s//:requirements.bzl" % rctx.attr.name,
            "--directory", whl_dir,
        ]

        if rctx.attr.extras:
            args += [
                "--extras=%s" % extra
                for extra in rctx.attr.extras
            ]
        result = rctx.execute(args)
        if result.return_code:
            fail("whl_library failed: %s (%s)" % (result.stdout, result.stderr))

        rctx.execute(["rm", whl_file])

    return

pip_requirements_repository = repository_rule(
    attrs = {
        "requirements": attr.label(
            allow_files = True,
            mandatory = True,
            single_file = True,
        ),
        "extras": attr.string_list(),
        "wheel_args": attr.string_list(),
        "wheel_env": attr.string_dict(),
        "_script": attr.label(
            executable = True,
            default = Label("@io_bazel_rules_python//tools:whltool.par"),
            cfg = "host",
        ),
    },
    local = False,
    implementation = _pip_requirements_repository_impl,
)

Usage:

# WORKSPACE file
pip_requirements_repository(
    name = "py_deps",
    requirements = "//:requirements.txt",
)

# BUILD file
load("@py_deps//:requirements.bzl", "requirement")

py_library(
    name = "lib",
    srcs = ["lib.py"],
    deps = [
        requirement("mock"),
    ],
)

So the usage stays exactly the same, though it has not been extensively tested.

@siddharthab
Contributor Author

Yes, I think specifying every dependency manually is the Bazel way. Without it, the versions of all your dependencies are not fixed for every user, and a person running the repository rule at a later date may get different versions for some of the unstated dependencies.

To list all the dependencies, I created a small script for my users that installs the package in a virtualenv and gets a requirements.txt file out of it. It's not pretty, but it's what I could come up with.

@Globegitter

You are right, that is also what I see as the Bazel way, but interestingly rules_nodejs has gone a different route by supporting package.json / lock files rather than specifying each dependency individually.

Either way, I am interested in the pip_repository approach you created. Can you by any chance share the script you put together to get the requirements.txt out of the virtualenv? That would be very useful.

@siddharthab
Contributor Author

#!/bin/bash
# Utility script to generate a requirements file for the transitive closure of
# dependencies of a package.
# The arguments to this script are passed through to pipenv install.

set -euo pipefail

readonly tmp_dir="$(mktemp -d)"

function cleanup {
  rm -rf "${tmp_dir}"
}
trap cleanup INT HUP QUIT TERM EXIT

cd "${tmp_dir}"
if ! command -v pipenv > /dev/null; then
  echo "pipenv not found; you can install it by running:"
  echo "  pip install --user pipenv"
  exit 1
fi
pipenv install "$@" > /dev/null

echo ""
echo "Your package requirements are:"
pipenv run pip freeze
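
For example, assuming the script is saved as freeze_deps.sh (the file name is only illustrative), it would be invoked with the packages to pin and the printed list copied into requirements.txt:

# Illustrative invocation; the arguments go straight to `pipenv install`.
./freeze_deps.sh pysam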

@Globegitter

Awesome, thanks for sharing @siddharthab!

@alexeagle
Collaborator

rules_python 0.1.0 has been released, which upstreams the rules_python_external repo. Please switch from pip_import to pip_install, which doesn't have this issue.
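
For reference, pip_install is used roughly like this (names are illustrative; check the rules_python docs for the exact API):

# WORKSPACE (illustrative)
load("@rules_python//python:pip.bzl", "pip_install")

pip_install(
    name = "pip_deps",
    requirements = "//:requirements.txt",
)

# BUILD file (illustrative)
load("@pip_deps//:requirements.bzl", "requirement")

py_library(
    name = "lib",
    srcs = ["lib.py"],
    deps = [requirement("mock")],
)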

@skliarpawlo

skliarpawlo commented Mar 12, 2021

@alexeagle could you please clarify what was fixed in version 0.1.0? We're using this version and it still downloads and installs every dependency from the source requirements.txt on the initial run, even though the Bazel target we run only has a few dependencies out of the whole list.

@alexeagle
Collaborator

Sorry, you're totally right. I misunderstood the implementation when I wrote that and forgot to update here.

The correct resolution is #432.
