Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

py_unzip #29

Merged
merged 7 commits into from
Oct 12, 2023
Merged

py_unzip #29

merged 7 commits into from
Oct 12, 2023

Conversation

jacobbogdanov
Copy link
Collaborator

@jacobbogdanov jacobbogdanov commented Jul 31, 2023

Create an archive that can be expanded into a standalone, runnable python application.

The Problem:

We store two copies of evey python file and c-extension in parfiles/the unpar directory.


The Current State of Affairs:

Take the following example BUILD file,

// BUILD.bazel
pkg_python_app(
  name = "my_app",
  srcs = ["my_app.py"],
  bindir = "/usr/bin",
  libdir = "/lib",
  tar = "staging",
)

This created a staging.tar which when extracted created the following files,

/usr/bin/my_app  # adds environment variables and calls '/lib/par/my_app.par'
/lib
  /par/my_app.par  # a parfile (zip,) containing all dependencies
  /unpar/my_app/ # the extracted version of 'my_app.par'
    __main__.py
    PAR_MANFIEST
    pypi__36__attrs_20_3_0/...
    com_128technology_i95/...

/lib/unpar/my_app is created the first time the application is run, or could be created eagerly by runing the executable with the environment variable PAR_EXTRACT_ONLY=1.

The parfile had no compression applied to it, so that means we keep two copies of every single python file and c-extension.


The History:

zipfiles created by bazel (via the commandline flag --build_python_zip) cannot be run outside of the build/ directory.
We chose par_binary (google/subpar) because it creates parfiles which can be copied anywhere and executed (the python interpreter is not bundled in the parfile.)

parfiles work best with zip_safe = True.
That allows the python interpreter to find the python source files directly within the parfile without having to extract the archive first.
c-extensions are not zip_safe so any parfile that includes one must be marked as zip_safe = False.
Most of our python applications include a c-extension as a dependency (lxml, pycrypto, etc...) so most are marked as zip_safe = False.

google/subpar extracts to a unique tempdir when zip_safe = False.
This lead to extremely large overhead for python "scripts" that were supposed to be invoked, do one thing, and exit.
For example, we were running thousands of salt commands and each one was taking 5+ seconds to extract before the beginning of main() was even called.

We forked and created 128technology/subpar which added a new extract_dir argument.
This allowed the parfile to extract to a Known Good Place and re-use the extracted version of the application as long as the parfile didn't change.
The parfile used a hash of all the files in the archive to create the PAR_MANFIEST so as to not re-extract unless that file in the extracted dir didn't match the one embeded within the parfile.
The parfile still contained the __main__.py so the exec_wrapper called the parfile which used all the files in unpar/ except the __main__.py.

We then continued to add more patches on top of subpar to byte-compile eagerly, etc...


The Solution:

subpar has outlived it's usefulness. We've hacked it to bits so it's time to re-write the parts we use from the ground(ish) up.

Meet py_unzip.

Take the same BUILD file from before and add use_py_unzip = True

// BUILD.bazel
pkg_python_app(
  name = "my_app",
  srcs = ["my_app.py"],
  bindir = "/usr/bin",
  libdir = "/lib",
  use_py_unzip = True,  # <--- new!
)

Now my_app.tar is created which extracts to,

/usr/bin/my_app  # adds environment variables and calls '/lib/unzip/my_app/__main__.py'
/lib/unzip/my_app/
  __main__.py
  runfiles/
    pypi__36__attrs_20_3_0/...
    com_128technology_i95/...

There is no duplication between the archive/unarchived version of the application.


The Implementation:

  • Take the zipfile that bazel generates via py_binary (same as --build_python_zip.)
    This is done by ctx.attr.src[OutputGroupInfo].python_zip_file which is the same as
filegroup(
  name = "foo_zip",
  srcs=  ["//bar:foo"],
  output_group = "python_zip_file",
)

as mentioned in this github comment, bazelbuild/bazel#3530 (comment).

  • Replace the __main__.py with one that works in the extracted directory.
    Bazel's main comes from tools/python/python_bootstrap_template.txt which templates in a half-dozen variables; our new __main__.py is based on that, but simplified to only work for our use-case (extracted outside of bazel.)

…unnable python application.

The Problem:

We store two copies of evey python file and c-extension in `parfile`s/the `unpar` directory.

---

The Current State of Affairs:

Take the following example `BUILD` file,
```py
// BUILD.bazel
pkg_python_app(
  name = "my_app",
  srcs = ["my_app.py"],
  bindir = "/usr/bin",
  libdir = "/lib",
  tar = "staging",
)
```
This created a `staging.tar` which when extracted created the following files,

```
/usr/bin/my_app  # adds environment variables and calls '/lib/par/my_app.par'
/lib
  /par/my_app.par  # a parfile (zip,) containing all dependencies
  /unpar/my_app/ # the extracted version of 'my_app.par'
    __main__.py
    PAR_MANFIEST
    pypi__36__attrs_20_3_0/...
    com_128technology_i95/...
```

`/lib/unpar/my_app` is created the first time the application is run, or could be created eagerly by runing the executable with the environment variable `PAR_EXTRACT_ONLY=1`.

The `parfile` had no compression applied to it, so that means **we keep two copies of every single python file and c-extension**.

---

The History:

zipfiles created by bazel (via the commandline flag `--build_python_zip`) cannot be run outside of the build/ directory.
We chose `par_binary` (`google/subpar`) because it creates `parfile`s which _can_ be copied anywhere and executed (the python interpreter is not bundled in the parfile.)

parfiles work best with `zip_safe = True`.
That allows the python interpreter to find the python source files directly within the parfile without having to extract the archive first.
c-extensions are **not** `zip_safe` so any `parfile` that includes one must be marked as `zip_safe = False`.
Most of our python applications include a c-extension as a dependency (`lxml`, `pycrypto`, etc...) so _most_ are marked as `zip_safe = False`.

`google/subpar` extracts to a **unique** tempdir when `zip_safe = False`.
This lead to extremely large overhead for python "scripts" that were supposed to be invoked, do one thing, and exit.
For example, we were running thousands of `salt` commands and each one was taking 5+ seconds to extract _before_ the beginning of `main()` was even called.

We forked and created `128technology/subpar` which added a new `extract_dir` argument.
This allowed the `parfile` to extract to a Known Good Place and re-use the extracted version of the application as long as the parfile didn't change.
The `parfile` used a hash of all the files in the archive to create the `PAR_MANFIEST` so as to not re-extract unless that file in the extracted dir didn't match the one embeded within the `parfile`.
The `parfile` still contained the `__main__.py` so the `exec_wrapper` called the `parfile` which used all the files in `unpar/` _except_ the `__main__.py`.

We then continued to add more patches on top of `subpar` to byte-compile eagerly, etc...

---

The Solution:

`subpar` has outlived it's usefulness. We've hacked it to bits so it's time to re-write the parts we use from the ground(ish) up.

Meet `py_unzip`.

Take the same `BUILD` file from before and add `use_py_unzip = True`
```py
// BUILD.bazel
pkg_python_app(
  name = "my_app",
  srcs = ["my_app.py"],
  bindir = "/usr/bin",
  libdir = "/lib",
  use_py_unzip = True,  # <--- new!
)
```

Now `my_app.tar` is created which extracts to,
```
/usr/bin/my_app  # adds environment variables and calls '/lib/unzip/my_app/__main__.py'
/lib/unzip/my_app/
  __main__.py
  runfiles/
    pypi__36__attrs_20_3_0/...
    com_128technology_i95/...
```
There is no duplication between the archive/unarchived version of the application.

---

The Implementation:

- Take the zipfile that bazel generates via `py_binary` (same as `--build_python_zip`.)
  This is done by `ctx.attr.src[OutputGroupInfo].python_zip_file` which is the same as
```
filegroup(
  name = "foo_zip",
  srcs=  ["//bar:foo"],
  output_group = "python_zip_file",
)
```
  as mentioned in this github comment, bazelbuild/bazel#3530 (comment).

- Replace the `__main__.py` with one that works in the extracted directory.
  Bazel's main comes from `tools/python/python_bootstrap_template.txt` which templates in a half-dozen variables; our new `__main__.py` is based on that, but simplified to only work for our use-case (extracted outside of bazel.)
Take the following `BUILD` file as an example,

```py
py_unzip(name = "my_app")
```

Three separate rules got instatiated,

```py
py_binary(name = "_my_app")
alias(name = "my_app", actual = "_my_app")
_py_unzip(name = "my_app.tar")
```

It was done this way so that
- `bazel run //:my_app` would execute the `py_binary`,
- `bazel build //:my_app.tar` would build a tar containing the `py_unzip` archive.

The hacky part was that `_py_unzip` was creating the `.tar` as an executable and the only output from the rule.

Now `_py_unzip` creates both the tarfile and the executable proxy (previously done using the `alias`.)
exec_wrapper_name = "%s_exec_wrapper" % name
exec_wrapper(
name = exec_wrapper_name,
env = env or get_python_env(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be if env is not None ? are empty dictionaries falsy in starlark?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the spec

A dictionary used in a Boolean context is considered true if it is non-empty.

In this case if env is specified it's likely not a dict but the results of get_python_env which returns a select statement. those are also truthy so I don't see a need for an explicit is not None check

@@ -0,0 +1,128 @@
%shebang%

# Copyright 2019 The Bazel Authors. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is almost Python, a .gitattributes to mark it as linguist-language=python might be nice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about how best to have my editor automatically recognize this file as python... do you know if vscode uses .gitattributes to determine the language for syntax highlighting or is it only the file extension?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, guess I never responded to this. "files.associations" is the vscode equivalent, in case this is still relevant

if info.filename == "__main__.py":
addfile(info.filename, main.read_bytes(), mode=0o755)
else:
addfile(info.filename, z_in.read(info.filename), mode=0o644)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not just copy the mode from the input file, or is that not available in zipfiles? I'd imagine there might be some cases where 644 isn't the only valid set of permissions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I first attempted it I found some oddities where bazel complained about "permission denied" but I should try again now that everything is working and I can triage one thing at a time

tar = ctx.outputs.tar
ctx.actions.run(
outputs = [tar],
mnemonic = "ReZipper",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically I guess it's not "zipping" but naming is hard so 🤷

This allows downstream rules to extract the `PyInfo` or `OutputGroupInfo` from a `py_unzip`.
addfile(
info.filename,
z_in.read(info.filename),
mode=info.external_attr >> 16 or 0o644,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure there's a use case but this would disallow 0o000 permissions right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... do we have any use-cases for shipping inaccessible files as part of one of these applications? We use 0o000 for testing purposes, but I don't know of any places where we rely on that at product runtime

Copy link
Contributor

@ian-h-chamberlain ian-h-chamberlain Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not, it's a bit of an oddball set of permissions, but might be worth a cursory check on a DUT with something like rpm -qla | xargs stat -c '%a %n' | grep 000

@jacobbogdanov jacobbogdanov marked this pull request as ready for review October 12, 2023 16:00
@jacobbogdanov jacobbogdanov merged commit 909fe60 into master Oct 12, 2023
@jacobbogdanov jacobbogdanov deleted the jbog/py_unzip branch October 12, 2023 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants