-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
py_unzip #29
py_unzip #29
Conversation
…unnable python application. The Problem: We store two copies of evey python file and c-extension in `parfile`s/the `unpar` directory. --- The Current State of Affairs: Take the following example `BUILD` file, ```py // BUILD.bazel pkg_python_app( name = "my_app", srcs = ["my_app.py"], bindir = "/usr/bin", libdir = "/lib", tar = "staging", ) ``` This created a `staging.tar` which when extracted created the following files, ``` /usr/bin/my_app # adds environment variables and calls '/lib/par/my_app.par' /lib /par/my_app.par # a parfile (zip,) containing all dependencies /unpar/my_app/ # the extracted version of 'my_app.par' __main__.py PAR_MANFIEST pypi__36__attrs_20_3_0/... com_128technology_i95/... ``` `/lib/unpar/my_app` is created the first time the application is run, or could be created eagerly by runing the executable with the environment variable `PAR_EXTRACT_ONLY=1`. The `parfile` had no compression applied to it, so that means **we keep two copies of every single python file and c-extension**. --- The History: zipfiles created by bazel (via the commandline flag `--build_python_zip`) cannot be run outside of the build/ directory. We chose `par_binary` (`google/subpar`) because it creates `parfile`s which _can_ be copied anywhere and executed (the python interpreter is not bundled in the parfile.) parfiles work best with `zip_safe = True`. That allows the python interpreter to find the python source files directly within the parfile without having to extract the archive first. c-extensions are **not** `zip_safe` so any `parfile` that includes one must be marked as `zip_safe = False`. Most of our python applications include a c-extension as a dependency (`lxml`, `pycrypto`, etc...) so _most_ are marked as `zip_safe = False`. `google/subpar` extracts to a **unique** tempdir when `zip_safe = False`. This lead to extremely large overhead for python "scripts" that were supposed to be invoked, do one thing, and exit. For example, we were running thousands of `salt` commands and each one was taking 5+ seconds to extract _before_ the beginning of `main()` was even called. We forked and created `128technology/subpar` which added a new `extract_dir` argument. This allowed the `parfile` to extract to a Known Good Place and re-use the extracted version of the application as long as the parfile didn't change. The `parfile` used a hash of all the files in the archive to create the `PAR_MANFIEST` so as to not re-extract unless that file in the extracted dir didn't match the one embeded within the `parfile`. The `parfile` still contained the `__main__.py` so the `exec_wrapper` called the `parfile` which used all the files in `unpar/` _except_ the `__main__.py`. We then continued to add more patches on top of `subpar` to byte-compile eagerly, etc... --- The Solution: `subpar` has outlived it's usefulness. We've hacked it to bits so it's time to re-write the parts we use from the ground(ish) up. Meet `py_unzip`. Take the same `BUILD` file from before and add `use_py_unzip = True` ```py // BUILD.bazel pkg_python_app( name = "my_app", srcs = ["my_app.py"], bindir = "/usr/bin", libdir = "/lib", use_py_unzip = True, # <--- new! ) ``` Now `my_app.tar` is created which extracts to, ``` /usr/bin/my_app # adds environment variables and calls '/lib/unzip/my_app/__main__.py' /lib/unzip/my_app/ __main__.py runfiles/ pypi__36__attrs_20_3_0/... com_128technology_i95/... ``` There is no duplication between the archive/unarchived version of the application. --- The Implementation: - Take the zipfile that bazel generates via `py_binary` (same as `--build_python_zip`.) This is done by `ctx.attr.src[OutputGroupInfo].python_zip_file` which is the same as ``` filegroup( name = "foo_zip", srcs= ["//bar:foo"], output_group = "python_zip_file", ) ``` as mentioned in this github comment, bazelbuild/bazel#3530 (comment). - Replace the `__main__.py` with one that works in the extracted directory. Bazel's main comes from `tools/python/python_bootstrap_template.txt` which templates in a half-dozen variables; our new `__main__.py` is based on that, but simplified to only work for our use-case (extracted outside of bazel.)
8aa1bbd
to
43a2a32
Compare
Take the following `BUILD` file as an example, ```py py_unzip(name = "my_app") ``` Three separate rules got instatiated, ```py py_binary(name = "_my_app") alias(name = "my_app", actual = "_my_app") _py_unzip(name = "my_app.tar") ``` It was done this way so that - `bazel run //:my_app` would execute the `py_binary`, - `bazel build //:my_app.tar` would build a tar containing the `py_unzip` archive. The hacky part was that `_py_unzip` was creating the `.tar` as an executable and the only output from the rule. Now `_py_unzip` creates both the tarfile and the executable proxy (previously done using the `alias`.)
exec_wrapper_name = "%s_exec_wrapper" % name | ||
exec_wrapper( | ||
name = exec_wrapper_name, | ||
env = env or get_python_env(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be if env is not None
? are empty dictionaries falsy in starlark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from the spec
A dictionary used in a Boolean context is considered true if it is non-empty.
In this case if env
is specified it's likely not a dict
but the results of get_python_env
which returns a select
statement. those are also truthy so I don't see a need for an explicit is not None
check
@@ -0,0 +1,128 @@ | |||
%shebang% | |||
|
|||
# Copyright 2019 The Bazel Authors. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is almost Python, a .gitattributes
to mark it as linguist-language=python
might be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about how best to have my editor automatically recognize this file as python... do you know if vscode uses .gitattributes
to determine the language for syntax highlighting or is it only the file extension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, guess I never responded to this. "files.associations"
is the vscode equivalent, in case this is still relevant
python/py_unzip/rezipper.py
Outdated
if info.filename == "__main__.py": | ||
addfile(info.filename, main.read_bytes(), mode=0o755) | ||
else: | ||
addfile(info.filename, z_in.read(info.filename), mode=0o644) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we not just copy the mode from the input file, or is that not available in zipfiles? I'd imagine there might be some cases where 644 isn't the only valid set of permissions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when I first attempted it I found some oddities where bazel complained about "permission denied"
but I should try again now that everything is working and I can triage one thing at a time
tar = ctx.outputs.tar | ||
ctx.actions.run( | ||
outputs = [tar], | ||
mnemonic = "ReZipper", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically I guess it's not "zipping" but naming is hard so 🤷
This allows downstream rules to extract the `PyInfo` or `OutputGroupInfo` from a `py_unzip`.
a1bf833
to
6d038e9
Compare
```sh $ git check-attr --all -- python/py_unzip/py_binary_shim.sh.tmpl shell ```
addfile( | ||
info.filename, | ||
z_in.read(info.filename), | ||
mode=info.external_attr >> 16 or 0o644, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure there's a use case but this would disallow 0o000
permissions right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm... do we have any use-cases for shipping inaccessible files as part of one of these applications? We use 0o000
for testing purposes, but I don't know of any places where we rely on that at product runtime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not, it's a bit of an oddball set of permissions, but might be worth a cursory check on a DUT with something like rpm -qla | xargs stat -c '%a %n' | grep 000
Create an archive that can be expanded into a standalone, runnable python application.
The Problem:
We store two copies of evey python file and c-extension in
parfile
s/theunpar
directory.The Current State of Affairs:
Take the following example
BUILD
file,This created a
staging.tar
which when extracted created the following files,/lib/unpar/my_app
is created the first time the application is run, or could be created eagerly by runing the executable with the environment variablePAR_EXTRACT_ONLY=1
.The
parfile
had no compression applied to it, so that means we keep two copies of every single python file and c-extension.The History:
zipfiles created by bazel (via the commandline flag
--build_python_zip
) cannot be run outside of the build/ directory.We chose
par_binary
(google/subpar
) because it createsparfile
s which can be copied anywhere and executed (the python interpreter is not bundled in the parfile.)parfiles work best with
zip_safe = True
.That allows the python interpreter to find the python source files directly within the parfile without having to extract the archive first.
c-extensions are not
zip_safe
so anyparfile
that includes one must be marked aszip_safe = False
.Most of our python applications include a c-extension as a dependency (
lxml
,pycrypto
, etc...) so most are marked aszip_safe = False
.google/subpar
extracts to a unique tempdir whenzip_safe = False
.This lead to extremely large overhead for python "scripts" that were supposed to be invoked, do one thing, and exit.
For example, we were running thousands of
salt
commands and each one was taking 5+ seconds to extract before the beginning ofmain()
was even called.We forked and created
128technology/subpar
which added a newextract_dir
argument.This allowed the
parfile
to extract to a Known Good Place and re-use the extracted version of the application as long as the parfile didn't change.The
parfile
used a hash of all the files in the archive to create thePAR_MANFIEST
so as to not re-extract unless that file in the extracted dir didn't match the one embeded within theparfile
.The
parfile
still contained the__main__.py
so theexec_wrapper
called theparfile
which used all the files inunpar/
except the__main__.py
.We then continued to add more patches on top of
subpar
to byte-compile eagerly, etc...The Solution:
subpar
has outlived it's usefulness. We've hacked it to bits so it's time to re-write the parts we use from the ground(ish) up.Meet
py_unzip
.Take the same
BUILD
file from before and adduse_py_unzip = True
Now
my_app.tar
is created which extracts to,There is no duplication between the archive/unarchived version of the application.
The Implementation:
py_binary
(same as--build_python_zip
.)This is done by
ctx.attr.src[OutputGroupInfo].python_zip_file
which is the same asas mentioned in this github comment, bazelbuild/bazel#3530 (comment).
__main__.py
with one that works in the extracted directory.Bazel's main comes from
tools/python/python_bootstrap_template.txt
which templates in a half-dozen variables; our new__main__.py
is based on that, but simplified to only work for our use-case (extracted outside of bazel.)