Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ostree archive's mtime modification results in slower python execution #1469

Open
euank opened this issue Feb 26, 2018 · 44 comments · May be fixed by coreos/rpm-ostree#5019
Open

ostree archive's mtime modification results in slower python execution #1469

euank opened this issue Feb 26, 2018 · 44 comments · May be fixed by coreos/rpm-ostree#5019

Comments

@euank
Copy link

euank commented Feb 26, 2018

System information

$ cat /etc/os-release 
NAME=Fedora
VERSION="27 (Atomic Host)"
ID=fedora
VERSION_ID=27
PRETTY_NAME="Fedora 27 (Atomic Host)"
VARIANT="Atomic Host"
VARIANT_ID=atomic.host

$ curl 169.254.169.254/latest/meta-data/instance-type; echo; curl 169.254.169.254/latest/meta-data/ami-id; echo
t2.micro
ami-4ef37336

High level problem

At a high level, the problem is that atomic and other python programs are slow by default.

$ time atomic --help >/dev/null

real	0m1.268s
user	0m1.184s
sys	0m0.059s

$ sudo ostree admin unlock

$ time sudo atomic --help > /dev/null

real	0m1.402s
user	0m1.202s
sys	0m0.136s
$ time sudo atomic --help > /dev/null

real	0m0.453s
user	0m0.375s
sys	0m0.055s

As we can clearly see, the atomic cli has a performance gain of 3x to 4x simply by letting it rewrite the pyc files. This performance issue will likely be shared by most other python software.

Why does this happen?

ostree sets a file's timestamps to 0:

archive_entry_update_pathname_utf8 (entry, pathstr);
archive_entry_set_ctime (entry, ts, OSTREE_TIMESTAMP);
archive_entry_set_mtime (entry, ts, OSTREE_TIMESTAMP);
archive_entry_set_atime (entry, ts, OSTREE_TIMESTAMP);

Unfortunately, this results in some files being considered incorrect. For example, a .pyc file, per this blog, includes information about the 'mtime' of the .py file it was sourced from.

Since ostree modifies that timestamp, the .pyc file created during the rpmbuild becomes invalid and is ignored.

We can verify this is the problem by using perf to notice where time is spent, or using strace to notice which files are changed that result in the performance speedup, e.g.:

$ sudo strace -e trace=open,openat atomic --help 2>&1 | cat | grep -Eo '/usr/lib[^"]+' | grep -E "\.py$" | wc -l
455

$ # ostree unlock and sudo atomic, as above

$ sudo strace -e trace=open,openat atomic --help 2>&1 | cat | grep -Eo '/usr/lib[^"]+' | grep -E "\.py$" | wc -l
0

$ find /var/tmp/ostree-unlock-ovl.XCJGFZ/upper/ -name "*.pyc" | wc -l
455

Additional considerations

Note that python is not unique in using mtime to store information about whether a given cache file is up to date. I don't know off-hand of other software that's definitely impacted by this choice. , but I wouldn't be surprised if the go toolchain had a similar problem (edit, nope, go's pkg directory on usr seems to be handled fine)

@euank euank changed the title mtime modification results in slower python execution ostree archive's mtime modification results in slower python execution Feb 26, 2018
@cgwalters
Copy link
Member

I think the ideal fix here is to change RPM to canonicalize those timestamps to 0 as well, basically bringing the libostree model there.

Alternatively, cPython could learn to stop doing timestamp checks on read-only filesystems in general, or perhaps just /usr if it's read-only.

@cgwalters
Copy link
Member

I suspect in practice a bigger win though would be teaching cPython to generate a larger cache file that could be easily mmap()ed like fontconfig and ldconfig etc. do.

@nanonyme
Copy link
Contributor

nanonyme commented Jan 3, 2019

Seems like related to https://gitlab.com/freedesktop-sdk/freedesktop-sdk/issues/554 (although not identical root cause)
The solution is to set SOURCE_DATE_EPOCH for all Python packages so we get invalidation method CHECKED_HASH. This works just fine with ostree since it doesn't care about timestamps. It works out of the box with Python 3.7.0 and higher.

@agners
Copy link
Contributor

agners commented Jan 29, 2019

I currently looking into a related issue in meta-updater, which creates an OSTree rootfs using OpenEmbedded:
advancedtelematic/meta-updater#461

OpenEmbedded has reproducible build support. Setting the SOURCE_DATE_EPOCH to 0 seems to work. However, that caused the build to fail for a Python recipes which tried to pack files into a egg file. It seems that zip only supports timestamps newer than 1980. I think for the OpenEmbedded case it would be good if we can choose what timestamp OSTree uses, so we could choose something more recent.

@hroncok
Copy link

hroncok commented Apr 11, 2024

I've been debugging a related RHEL 9 issue and I found the following.

On Fedora Silverblue 40 Beta, some of the .pyc files have zeroed mtimes embedded in them.

For example:

>>> pyc = '/usr/lib/python3.12/site-packages/click/__pycache__/__init__.cpython-312.pyc'
>>> def py_demarshal_long(b):
...   return (b[0] + (b[1] << 8) + (b[2] << 16) + (b[3] << 24))
... 
>>> with open(pyc, 'rb') as f:
...   chunk = f.read(12)
... 
>>> py_demarshal_long(chunk[8:12])
0

(The code above returns 1689897600 on my non-Silverblue Fedora 39 system.)

This is good because the /usr/lib/python3.12/site-packages/click/__init__.py mtime is zero, so this bytecode cache is valid.

However, apparently, only some of the .pyc files installed in Fedora Silverblue have been altered this way. For example, if I run rpm --verify on the package owning the aforementioned file, I get:

test@fedora:~$ rpm --verify python3-click
.......T.    /usr/lib/python3.12/site-packages/click-8.1.7.dist-info/INSTALLER
.......T.  l /usr/lib/python3.12/site-packages/click-8.1.7.dist-info/LICENSE.rst
.......T.    /usr/lib/python3.12/site-packages/click-8.1.7.dist-info/METADATA
.......T.    /usr/lib/python3.12/site-packages/click-8.1.7.dist-info/WHEEL
.......T.    /usr/lib/python3.12/site-packages/click-8.1.7.dist-info/top_level.txt
.......T.    /usr/lib/python3.12/site-packages/click/__init__.py
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/__init__.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/__init__.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_compat.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/_compat.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_termui_impl.cpython-312.opt-1.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_termui_impl.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_textwrap.cpython-312.opt-1.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_textwrap.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_winconsole.cpython-312.opt-1.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/_winconsole.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/core.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/core.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/decorators.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/decorators.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/exceptions.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/exceptions.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/formatting.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/formatting.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/globals.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/globals.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/parser.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/parser.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/shell_completion.cpython-312.opt-1.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/shell_completion.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/termui.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/termui.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/testing.cpython-312.opt-1.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/testing.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/types.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/types.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/__pycache__/utils.cpython-312.opt-1.pyc
..5....T.    /usr/lib/python3.12/site-packages/click/__pycache__/utils.cpython-312.pyc
.......T.    /usr/lib/python3.12/site-packages/click/_compat.py
.......T.    /usr/lib/python3.12/site-packages/click/_termui_impl.py
.......T.    /usr/lib/python3.12/site-packages/click/_textwrap.py
.......T.    /usr/lib/python3.12/site-packages/click/_winconsole.py
.......T.    /usr/lib/python3.12/site-packages/click/core.py
.......T.    /usr/lib/python3.12/site-packages/click/decorators.py
.......T.    /usr/lib/python3.12/site-packages/click/exceptions.py
.......T.    /usr/lib/python3.12/site-packages/click/formatting.py
.......T.    /usr/lib/python3.12/site-packages/click/globals.py
.......T.    /usr/lib/python3.12/site-packages/click/parser.py
.......T.    /usr/lib/python3.12/site-packages/click/py.typed
.......T.    /usr/lib/python3.12/site-packages/click/shell_completion.py
.......T.    /usr/lib/python3.12/site-packages/click/termui.py
.......T.    /usr/lib/python3.12/site-packages/click/testing.py
.......T.    /usr/lib/python3.12/site-packages/click/types.py
.......T.    /usr/lib/python3.12/site-packages/click/utils.py
.......T.  d /usr/share/doc/python3-click/CHANGES.rst
.......T.  d /usr/share/doc/python3-click/README.rst
.......T.  l /usr/share/licenses/python3-click/LICENSE.rst

The T in the output means mtime does not match RPM database (well, it is zero, so I expected this). The 5 in the output means the content differs. Presumably, the .pyc files were regenerated by Python at some point when the filesystem was not read-only but the mtimes of .py files were already zero. Interestingly, only some .pyc files are "corrected", others are not.

See:

test@fedora:~$ python3 -v -c 'import click.testing' 2>&1 | grep '# bytecode is stale for' -A2
# bytecode is stale for 'click.testing'
# code object from /usr/lib/python3.12/site-packages/click/testing.py
# could not create '/usr/lib/python3.12/site-packages/click/__pycache__/testing.cpython-312.pyc': OSError(30, 'Read-only file system')

This is even broken for the standard library:

test@fedora:~$ python3 -v -c 'import mailbox' 2>&1 | grep '# bytecode is stale for' -A2
# bytecode is stale for 'mailbox'
# code object from /usr/lib64/python3.12/mailbox.py
# could not create '/usr/lib64/python3.12/__pycache__/mailbox.cpython-312.pyc': OSError(30, 'Read-only file system')
--
# bytecode is stale for 'urllib'
# code object from /usr/lib64/python3.12/urllib/__init__.py
# could not create '/usr/lib64/python3.12/urllib/__pycache__/__init__.cpython-312.pyc': OSError(30, 'Read-only file system')
--
# bytecode is stale for 'email.generator'
# code object from /usr/lib64/python3.12/email/generator.py
# could not create '/usr/lib64/python3.12/email/__pycache__/generator.cpython-312.pyc': OSError(30, 'Read-only file system')

This seems like a design flaw of the mtime zeroing thing.


When a Python app is running in a restricted SELinux context, any attempt to rewrite a .pyc file is reported, For example, rhsmcertd (from subscription-manager) on RHEL:

[12474.053718] audit: type=1400 audit(1712618630.878:6): avc:  denied  { write } for  pid=2924 comm="rhsmcertd-worke" name="__pycache__" dev="overlay" ino=17360017 scontext=system_u:system_r:rhsmcertd_t:s0 tcontext=system_u:object_r:bin_t:s0 tclass=dir permissive=0

@dbnicholson
Copy link
Member

FWIW, ostree proper doesn't do anything with python modules besides changing the modification time like it does for all files. Recompiling must be happening at a different layer in Fedora.

@cgwalters
Copy link
Member

A few things here:

At some point, we will switch to entirely relying on composefs which gives us all the advantages of an ostree-like model in terms of on-disk/page cache sharing, without the mtime canonicalization need to ensure hardlinking works.

Today with e.g. podman (without composefs) content-identical files in distinct layers are not shared on disk or memory, partially because of this issue.

Interestingly, only some .pyc files are "corrected", others are not
Presumably, the .pyc files were regenerated by Python at some point when the filesystem was not read-only but the mtimes of .py files were already zero.

Hmm. That is odd. I was going to say I wasn't sure how that would be possible, but actually...on the build system, in some cases we may materialize cached filesystem trees using zeroed mtime, and if the Python code happens to be executed at build time, I think that could result in this. Python code not executed at build wouldn't rewrite its mtimes. That would explain some of the mystery here if true.

@ericcurtin
Copy link
Collaborator

ericcurtin commented Apr 12, 2024

I've been using composefs with ostree in CentOS Automotive Stream Distribution on a couple of devices the last few weeks and it's been working great FWIW

@cgwalters
Copy link
Member

I've been using composefs with ostree in CentOS Automotive Stream Distribution on a couple of devices the last few weeks and it's been working great FWIW

But that's still running rpm-ostree which canonicalizes all mtimes to zero, right? There's a much bigger step to take in this project where we have a mode where we hard require composefs - and actually note that ostree very intentionally doesn't include mtimes in its metadata, so moving towards a model where they are included again would lead more closely to including e.g. OCI container or other metadata as source of truth.

(All of this of course heavily intersects with https://reproducible-builds.org/ - it still doesn't make sense to let mtimes just randomly "float" in images as it will cause spurious changes, even though that's what happens with most image builds (including podman/docker) by default. At least nowdays with RPM one can canonicalize by setting SOURCE_DATE_EPOCH, but that only helps installs that use it, and it's sadly all still defeated because of rpm-software-management/rpm#2219 )

@nanonyme
Copy link
Contributor

nanonyme commented Apr 12, 2024

This is by the way why freedesktop-sdk and all derivative runtimes use hash-based bytecode, not timestamp-based bytecode. (effectively that SOURC_DATE_EPOCH approach) It works correctly with ostree.

@nanonyme
Copy link
Contributor

Notably semantically correct use of SOURCE_DATE_EPOCH is not to set it to 0 but the timestamp when your sources have last been modified.

@frenzymadness
Copy link

This is by the way why freedesktop-sdk and all derivative runtimes use hash-based bytecode, not timestamp-based bytecode. (effectively that SOURC_DATE_EPOCH approach) It works correctly with ostree.

We produce the same RPM content also for standard distributions where timestamp-based invalidation of bytecode cache files is faster.

@hroncok
Copy link

hroncok commented Apr 12, 2024

Solution ideas:

A. Fedora Python RPM packages can switch from timestamp-based to hash-based invalidation

  • This would solve the issue once all the Python packages are rebuilt with the new setting.
  • This would make Python slightly slower for all Fedora users1.

B. Python timestamp-based invalidation can special case mtime==0

  • This will solve the issue once Python is patched that way.
  • Theoretically, it might break other use cases (which?).
  • Might be hard to land that change upstream, considering hash-based invalidation exists.

C. ostree can regenerate all .pyc files when building

  • When mtimes of .py files are zero but the filesystem is writeable, invoke python -m compileall with proper arguments.
  • We (Fedora Python maintainers) can help with the command.
  • We (Fedora Python maintainers) have no idea where to plug this in.
  • Ostree can choose to use any invalidation mode, all will work, but unchecked-hash will be fastest (on runtime).

D. ostree can patch all .pyc files when building

  • For each found .pyc file, the mtime can be zeroed in-place.
  • This might be faster than regenerating all of them.
  • We (Fedora Python maintainers) can still help with the command.
  • We (Fedora Python maintainers) still have no idea where to plug this in.
  • The invalidation mode is likely patchable in-place as well.
  • Some identical .pyc files are hardlinked to each other to save space, this procedure would need to accommodate for that (or lose the benefit).

Footnotes

  1. Except of course it will be faster than the current Silverblue, but only because this issue. Compared to other solutions presented later, it will still be slower.

@travier
Copy link
Member

travier commented Apr 12, 2024

Thanks for the detailed options! Could option C be implemented by running python -m compileall (https://docs.python.org/3/library/compileall.html) on all .py files in /usr?

How would we do option D?

@hroncok
Copy link

hroncok commented Apr 12, 2024

Thanks for the detailed options! Could option C be implemented by running python -m compileall (https://docs.python.org/3/library/compileall.html) on all .py files in /usr?

You don't want to run it on all .py files, only those that already have their pyc files. There are .py files in /usr without .pyc files and we want to keep them that way. But otherwise, yes, this is correct.

  1. find all .pyc files we can understand (by path)
  2. find all corresponding sources and note down all their opt-levels we found in (1)
  3. run compileall --hardlink-dupes ... on all files from (2) grouped by opt-level combinations

How would we do option D?

  1. find all .pyc files we can understand (by path)
  2. open each such file and zero some bytes (the thrid 32bit word) if it is a timebased-invalidation .pyc (if the second 32bit word is 0)

If we edit the files in place and we never write anything if the third word is already zero, the hardlinks mentioned earlier should be preserved.


The listed implementations of C a D assume we only care for Python bytecode for the "main" Python version (e.g. 3.12 in Fedora 40). If it needs to support all Pythons available in Fedora, it might be a bit more complex.

@travier
Copy link
Member

travier commented Apr 12, 2024

There are .py files in /usr without .pyc files and we want to keep them that way. But otherwise, yes, this is correct.

Could you explain why we should not do that? Won't those Python files not be optimized as well?

@hroncok
Copy link

hroncok commented Apr 12, 2024

Only some .py files are importable modules. See https://fedoraproject.org/wiki/Changes/No_more_automagic_Python_bytecompilation#Detailed_Description on why we don't blindly bytecompile them all.

@travier
Copy link
Member

travier commented Apr 12, 2024

  • We (Fedora Python maintainers) have no idea where to plug this in.

For Fedora Atomic Desktops, we can plug this at the end of https://pagure.io/workstation-ostree-config/blob/main/f/fedora-common-ostree.yaml#_104, which is basically a script with full RW access to the system during the build of the ostree commit (image of the system).

@hroncok
Copy link

hroncok commented Apr 12, 2024

For Fedora Atomic Desktops, we can plug this at the end of https://pagure.io/workstation-ostree-config/blob/main/f/fedora-common-ostree.yaml#_104, which is basically a script with full RW access to the system during the build of the ostree commit (image of the system).

At that point, are the mtimes of files in /usr already zero? If they are, we can plug in both C and D. If they are not, we would need to pre-zero them before C (or use D).

@travier
Copy link
Member

travier commented Apr 12, 2024

Let's try to find the rough list of commands to do option C & D:

Option C:

  1. find all .pyc files we can understand (by path)
  2. find all corresponding sources and note down all their opt-levels we found in (1)
  3. run compileall --hardlink-dupes ... on all files from (2) grouped by opt-level combinations
  1. find /usr -iname "*.pyc" ? How do we filter files that can "understand"? What does this means?
  2. I found https://wiki.gentoo.org/wiki/Project:Python/Byte_compiling. Not sure how to do that
  3. Do we only pass the corresponding .py files to this command?

Option D:

  1. find all .pyc files we can understand (by path)
  2. open each such file and zero some bytes (the thrid 32bit word) if it is a timebased-invalidation .pyc (if the second 32bit word is 0)
  1. Same question as above
  2. Could we make a small (python?) script that do it?

Given the above, it feels like option D would be easier to do.

At that point, are the mtimes of files in /usr already zero? If they are, we can plug in both C and D. If they are not, we would need to pre-zero them before C (or use D).

Good point, I don't think they are, I'll give it a try.

@travier
Copy link
Member

travier commented Apr 12, 2024

At that point, are the mtimes of files in /usr already zero? If they are, we can plug in both C and D. If they are not, we would need to pre-zero them before C (or use D).

I've confirmed that they don't have a 0 mtimes yet as they are not part of the ostree commit yet at this point. So option D is likely the preferred route.

@hroncok
Copy link

hroncok commented Apr 12, 2024

find all .pyc files we can understand (by path)

all paths matching the following path regexes:

  1. .*/__pycache__/[^/]+\.cpython-312\.pyc$
  2. .*/__pycache__/[^/]+\.cpython-312\.opt-1\.pyc$
  3. .*/__pycache__/[^/]+\.cpython-312\.opt-2\.pyc$

For option C, we would need to find them separately. For option D, we can find them all at once.

For option C, we need to find the source. To do that, we replace the .cpython-312.opt-1.pyc or a similar suffix with .py and walk one directory up (i.e. ..).

I found https://wiki.gentoo.org/wiki/Project:Python/Byte_compiling. Not sure how to do that

Assume we group the sources to the following buckets:

  1. sources paths with .pyc files for levels 0, 1, 2
  2. sources paths with .pyc files for levels 0 and 1
  3. sources paths with .pyc files for level 0 only

(Other combinations should not exist in Fedora, but if we are paranoid, we need to check.)

Now we call:

python3 -m compileall [-j8] -o 0 -o 1 -o 2 -q -f --hardlink-dupes --invalidation-mode={checked-hash,timestamp,unchecked-hash} <ALL FILES IN THE FIRST BUCKET>
python3 -m compileall [-j8] -o 0 -o 1      -q -f --hardlink-dupes --invalidation-mode={checked-hash,timestamp,unchecked-hash} <ALL FILES IN THE SECOND BUCKET>
python3 -m compileall [-j8] -o 0           -q -f --hardlink-dupes --invalidation-mode={checked-hash,timestamp,unchecked-hash} <ALL FILES IN THE LAST BUCKET>

The -j8 option makes it run in parallel. It's optional.


Could we make a small (python?) script that do it?

Absolutely.

@hroncok
Copy link

hroncok commented Apr 12, 2024

MIN_MAGIC = 3390  # The first magic number supporting PEP 552
ZERO = bytes((0, 0, 0, 0))


def pyc_set_zero_mtime(pyc_path):
    with open(pyc_path, "r+b") as f:
        w = f.read(4)
        if len(w) < 4:
            return

        magic = (w[0] + (w[1] << 8) + (w[2] << 16) + (w[3] << 24)) & 0xFFFF
        if magic < MIN_MAGIC:
            invalidation = ZERO
        else:
            invalidation = f.read(4)
            if len(invalidation) < 4:
                return

        if invalidation == ZERO:
            f.write(ZERO)

Here's a Python function that implements D for one given pyc path. It silently skips invalid (short) pyc files or pyc files with hash-based invalidation modes. It supports pyc files for Python 3.4+

@dbnicholson
Copy link
Member

We've been compiling all python modules at ostree build time on Endless for years. But then Debian doesn't include the compiled modules in the packages. Note the hard linking at compile time is mostly pointless here since ostree will already deduplicate the objects when committing.

@nanonyme
Copy link
Contributor

@cgwalters it's conceptually not build timestamp. It's .py file modification timestamp. Coincidentally this ends up being build timestamp in RPM but this is coincidence only and not they purpose of the timestamp.

@travier
Copy link
Member

travier commented Apr 12, 2024

I've adapted your script to include finding the files:

#!/usr/bin/env python3

import os
import re

MIN_MAGIC = 3390  # The first magic number supporting PEP 552
ZERO = bytes((0, 0, 0, 0))

def pyc_set_zero_mtime(pyc_path):
    with open(pyc_path, "r+b") as f:
        w = f.read(4)
        if len(w) < 4:
            return

        magic = (w[0] + (w[1] << 8) + (w[2] << 16) + (w[3] << 24)) & 0xFFFF
        if magic < MIN_MAGIC:
            invalidation = ZERO
        else:
            invalidation = f.read(4)
            if len(invalidation) < 4:
                return

        if invalidation == ZERO:
            f.write(ZERO)

if __name__ == "__main__":
    regex = re.compile(".*/__pycache__/[^/]+\.cpython-.*(\.opt-1|\.opt-2)?\.pyc$")

    for root, dirs, files in os.walk("/usr"):
        for file in files:
            if regex.match(file):
                print(file)
                # pyc_set_zero_mtime(file)

But it fails on:

./test.py:27: SyntaxWarning: invalid escape sequence '\.'
  regex = re.compile(".*/__pycache__/[^/]+\.cpython-.*(\.opt-1|\.opt-2)?\.pyc$")

Investigating but someone with more Python experience might find the answer faster than me :).

@hroncok
Copy link

hroncok commented Apr 12, 2024

-    regex = re.compile(".*/__pycache__/[^/]+\.cpython-.*(\.opt-1|\.opt-2)?\.pyc$")
+    regex = re.compile(r".*/__pycache__/[^/]+\.cpython-.*(\.opt-1|\.opt-2)?\.pyc$")

@hroncok
Copy link

hroncok commented Apr 12, 2024

Also, file is only the filename, not the path, so the regex won't ever match.

@hroncok
Copy link

hroncok commented Apr 12, 2024

if __name__ == "__main__":
    regex = re.compile(r".*/__pycache__/[^/]+\.cpython-.*(\.opt-1|\.opt-2)?\.pyc$")

    for root, dirs, files in os.walk("/usr"):
        for file in files:
            path = os.path.join(root, file)
            if regex.match(path):
                print(path)
                # pyc_set_zero_mtime(path)

@travier
Copy link
Member

travier commented Apr 12, 2024

Looks better now indeed. Thanks

@travier
Copy link
Member

travier commented Apr 12, 2024

PR to add the workaround for Fedora Atomic Desktops: https://pagure.io/workstation-ostree-config/pull-request/505

@nanonyme
Copy link
Contributor

nanonyme commented Apr 12, 2024

Something also to keep in mind with long-term solutions Fedora might move to checked hash as part of https://fedoraproject.org/wiki/Changes/ReproduciblePackageBuilds. The performance hit is quite negligible price for reproducible bytecode.

@hroncok
Copy link

hroncok commented Apr 12, 2024

That's unlikely. We use the timestamp invalidation mode and we clamp mtimes to SOURCE_DATE_EPOCH (or whatever is that called).

@cgwalters
Copy link
Member

OK so there's basically an impedance mismatch between the current container -> ostree flow. In the more medium term what I'm thinking is that we will move away from the existing ostree-container model implemented in https://github.com/ostreedev/ostree-rs-ext/ to one where the canonical storage is composefs backed. There's a lot going on on this topic, but I started on containers/composefs#286 related to this, which would be a potential new composers-native storage for both podman and bootc/ostree.

@cgwalters
Copy link
Member

xref https://github.com/keszybz/add-determinism/ which is a tool which forcibly canonicalizes python timestamps; it'd probably be good to glom onto that.

@keszybz
Copy link

keszybz commented Jul 12, 2024

At least nowdays with RPM one can canonicalize by setting SOURCE_DATE_EPOCH, but that only helps installs that use it

Well, "installs that use it" are … all installs? Fedora/Suse/anybody-else-who-cares-about-reproducible-rpm-builds set source_date_epoch_from_changelog 1. So mtimes don't "randomly float" — they are set to either the mtimes in the sources (when older than the packaging update) or clamped to the packaging timestamp. IMO this is very good outcome. (The last rebuild to test reproducibility indicates that that are some packages/packaging workflows which mess up mtimes, but those are already rather rare, and usually easily fixed.)


I came here from keszybz/add-determinism#26.

Generally, it looks like something that can be added to add-det without too much trouble. In particular, add-det has logic to preserve hard links, which came up in the discussion. (Even if the answer is that it doesn't matter for rpm-ostree, I think it's nicer to preserve them, not least because rpm -V doesn't become unhappy.)

So let's say that we implement add-det --clear-pyc-mtimes. I'm not clear on what add-det is supposed to do in that case: just clear bytes 8..12 if mtime==0 on the file, or clear bytes 8..12 but ignore the mtime, or clear those bytes and also run the full normalization procedure. It should be pretty trivial to implement any of those modes, but I'm not sure what is wanted…

@travier
Copy link
Member

travier commented Jul 15, 2024

@keszybz
Copy link

keszybz commented Jul 15, 2024

OK, so after looking at the script in 505, it seems that add-det should set embedded mtime to 0 unconditionally, and ignore mtimes.

@prestist
Copy link

Well, Im not sure we want to unconditionally set mtime to 0. Other non-ostree systems likely dont need or want that. I think having it set as an argument which can be triggered by a caller who knows what they want is best. The logic has to remain fairly conditional as we use the conditional logic to verify the version of python before trying to replace any bytes since the location that the mtime is placed is version dependent.

@keszybz
Copy link

keszybz commented Jul 17, 2024

Right, sorry, my comment was not clear.

What I wanted to say is: when invoked by ostree for the special ostree cleanup add-det should set embedded mtime to 0 unconditionally, and ignore mtimes.

keszybz added a commit to keszybz/add-determinism that referenced this issue Jul 17, 2024
Pyc files store the mtime, size, and optionally hash of the original
py file. Mtime is used to validate the cached bytecode: the mtime
of the .py file (if present) must be equal to the stored value in
the .pyc file. If not, the pyc file is considered stale.

On ostree systems, all mtimes are set to 0. The problem is that the
.py file is created first, then the byte compilation step creates
.pyc with some embedded timestamp, and later ostree discards mtimes
on all files. On the end system, the bytecode cache is considered
invalid.

This new handler does the two very simple things:
1. zero out the mtime in the .pyc file
2. set mtime (in the filesystem) to zero for the .py file

This second step is also done by ostree, so strictly speaking, it's
not necessary. But it's easy to do it, and this way the bytecode
still matches the file on disk, so if we called Python after this
operation, it would be able to use the bytecode and wouldn't try to
rewrite it.

Replaces #26.
See ostreedev/ostree#1469,
https://gitlab.com/fedora/bootc/tracker/-/issues/3,
https://pagure.io/workstation-ostree-config/pull-request/505.
@keszybz
Copy link

keszybz commented Jul 17, 2024

Please see if keszybz/add-determinism#27 would work for you: add-determinism --handler=pyc-zero-mtime /some/path/.

lukewarmtemp added a commit to lukewarmtemp/rpm-ostree that referenced this issue Jul 17, 2024
Fixes: ostreedev/ostree#1469

Assuming that add-determinism is installed on the system,
apply add-determinism to set the embedded mtime in all .pyc
files to zero.

Signed-off-by: Luke Yang <[email protected]>
Co-authored-by: Steven Presti <[email protected]>
evan-goode pushed a commit to evan-goode/workstation-ostree-config that referenced this issue Jul 24, 2024
Python mtime timestamp logic does not work well with ostree force 0
mtime.

Use a small script to post-process all compiled '.pyc' and force the
timestamp to 0.

See: ostreedev/ostree#1469

Co-Authored-By: Miro Hrončok <[email protected]>
jlebon added a commit to jlebon/os that referenced this issue Nov 29, 2024
A lot of history here, but essentially today the pycache files generated
at compose time are invalid because they're of course created before the
epoch mtime canonicalization ostree does and pyc files include the mtime
in their headers to know if they're out of date.

Since `/usr` is not read-only in the layering flow, Python will happily
regenerate the cache files as Python code gets executed.

This isn't necessarily a big deal and actually would make dnf faster,
but it's also awkward and confusing to have only a subset of Python code
in the OS cached correctly and sloshing that diff around in container
layers when what we really want is having it fixed correctly at base
compose time. So let's just avoid it for consistency and to reduce the
diff with the legacy variants.

We do this by moving the base ones out of the way and then moving them
back in place, possibly overwriting regenerated ones. Note that using
`ENV PYTHONDONTWRITEBYTECODE=1` is not sufficient here; that doesn't
seem to reach RPM scriptlets which use Python.

See also: ostreedev/ostree#1469
jlebon added a commit to jlebon/os that referenced this issue Nov 29, 2024
A lot of history here, but essentially today the pycache files generated
at compose time are invalid because they're of course created before the
epoch mtime canonicalization ostree does and pyc files include the mtime
in their headers to know if they're out of date.

Since `/usr` is not read-only in the layering flow, Python will happily
regenerate the cache files as Python code gets executed.

This isn't necessarily a big deal and actually would make dnf faster,
but it's also awkward and confusing to have only a subset of Python code
in the OS cached correctly and sloshing that diff around in container
layers when what we really want is having it fixed correctly at base
compose time. So let's just avoid it for consistency and to reduce the
diff with the legacy variants.

We do this by moving the base ones out of the way and then moving them
back in place, possibly overwriting regenerated ones. Note that using
`ENV PYTHONDONTWRITEBYTECODE=1` is not sufficient here; that doesn't
seem to reach RPM scriptlets which use Python.

See also: ostreedev/ostree#1469
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.