Replies: 11 comments 23 replies
-
A better approach would be to change the RPM format to provide a digest that can be verified in a streaming manner. |
Beta Was this translation helpful? Give feedback.
-
> As mentioned previously, the final RPM which is on-disk after having
> being transcoded by rpm2extents is not a bitwise copy of the original RPM,
> as such signature/digest verifications on anything that includes the
> payload will not work.
A better approach would be to change the RPM format to provide a digest
that can be verified in a streaming manner.
The problem is not about verifying in a streaming manner, this is what is
currently done. The problem is that the final transcoded rpm (after it has
been verified) will not be able to pass `rpmkeys` anymore because it is not
a bit wise copy of the original rpm anymore.
I think what you mean by changing the rpm format is so that we can verify
the rpm in chunks while streaming it vs in its integrity, this would be a
different discussion and does not change the fact that once we have the
resulting transcoded rpm, it will not pass the original signature
verification because that rpm has now changed (for the purpose of
illustration let say this is equivalent of having the original rpm with its
payload compressed and storing it with its payload uncompressed, content of
files that would eventually make it to the disk are the same, but the rpm
is not a bit wise copy of the one we downloaded from the repo).
|
Beta Was this translation helpful? Give feedback.
-
@pmatilai do you have any feedback on this? |
Beta Was this translation helpful? Give feedback.
-
@pmatilai, checking back if you have any feedback on this? Is there anything you need more details on? |
Beta Was this translation helpful? Give feedback.
-
I already said this in the original PR but maybe it got lost in the noise: this needs a different kind of API, interrupted chains is not a supportable concept. Something that allows the plugin to register as the owner of this payload format and the handler(s) called at appropriate times, with appropriate data. Such as the package file descriptor, and such as to actually install the file by reflinking it - AFAICS fsmUnpack(), or actually rpmfiArchiveReadToFilePsm() is the precise thing you'd want to replace with your own: you get a writable fd and you get to populate it with content by whatever means, while letting rpm handle all the other logic. That's about the extent of control I'm willing to consider handling to plugins. So... what it all really comes down to is ability to define a custom archive type (see rpmfiNewArchiveReader()), and register them so that rpm knows to call it from fsmIter() in place of the internal default. |
Beta Was this translation helpful? Give feedback.
-
Thanks for working on this - the new CoW extent based approach looks exciting. I have a few comments / questions on the initial proposal...
I just wanted to highlight that, although not optimal, cpio archives can still be used alongside reflinks if the On a separate note, Btrfs recently gained support for writing compressed extents directly to disk via the new |
Beta Was this translation helpful? Give feedback.
-
Hello @pmatilai, I work at Meta with @chantra and @malmond77 . I am implementing the new API for the RPM CoW plugin based on your comment. This is what I have implemented so far: I defined 2 new fields of type When a plugin wants to register as an archive reader for a package, it sets the field Then in Here is the code: Could you please let me know if this is what you had in mind? |
Beta Was this translation helpful? Give feedback.
-
@pmatilai Hello, I was wondering if you had a chance to have a look at my proposal? Please let me know if this is unclear or if you need more details. |
Beta Was this translation helpful? Give feedback.
-
Okay so, trying to give a better idea of what I'm after: The fact that RPMRC_PLUGIN_CONTENTS is still needed is a strong signal that it's not quite right. The fsm code for handling different types doesn't differ at all, it's just a function pointer that gets called. Ie there's no rpmpluginsCallFsmFileArchiveReader() or rpmpluginsCallFsmFilePrepare(), these are determined by the early probe. And so there are no rpmteSetCustomFileInstaller() or rpmteSetCustomArchiveReader() or any corresponding custom-call-functions either, it's always just a function pointer to whatever is handling the content. |
Beta Was this translation helpful? Give feedback.
-
zstd frame headers can carry both compressed and uncompressed sizes, which makes it a little easier to seek around within the compressed payload.
That's mostly what I did in my rpm-rs based poc - files aren't individually compressed, but the start and end of individual file data segments correspond to zstd frame boundaries, so individual files can be extracted or written via
Returning to this, I suppose I could still pad |
Beta Was this translation helpful? Give feedback.
-
Related PR: #1470
RPM CoW minimizes IO by taking advantages of reflinking available in file systems such as BTRFS and XFS.
In order to do this, the data on disk must be layed down as extents which can then be reflink-ed.
To do this, RPM CoW introduces
rpm2extents
, an RPM utility to take an RPM as input stream and output that same RPM where the payload has been extracted and aligned to form file extents. Some meta data is also added to the end of the file to be able to find at which offset in the transcoded RPM file a given file in the payload is. When installing the file, rpm can then reflink this file instead of copying it. Therefore, the RPM that is on-disk, while having the same content, is not bitwise copy.Under a typical case, it saves at least 1 IO (1 read when reading the compressed RPM, 1 write when copying files to disk). This becomes more valuable when servers are running container workload, or CI workload (think mock builders) as the same package are installed multiple times from the same cached file. Data is now reflinked (O(1)) instead of being decompressed and copied, saving both IOs and disk space.
This also means that RPM needs to be taught how to install such files. Fortunately, RPM supports plugins which allows augmenting RPM in an optional way and without modifying the core RPM logic. The current API has limitations for the RPM CoW use case though, but we should be able to provide new API entry points to make this work.
RPM CoW changes currently needs a few more hooks to be able to perform its job:
rpmfilesIter(files, RPMFI_ITER_FWD)
instead of uses the content of the archive withrpmfiNewArchiveReader(payload, files, RPMFI_ITER_READ_ARCHIVE)
as the list of files/directories and their metadata is now read from the RPM headers metadata vs the cpio archive one. In the current RPM CoW implementation, this is what is being done with plugin_fsm_file_archive_reader_funcAnother new concept that needs to be introduced to the RPM plugin API is the ability for a plugin hook to stop the plugin chain from being executed. Currently, every plugin that registers to a specific hook is called as part of the plugin chain, regardless of whatever action may have been performed by a previous plugin. In cases such as installing file or creating the rpmfi iterator, as soon as a plugin has performed that action, the chain must be interrupted. In the current code, this is done by the introduction of the
RPMRC_PLUGIN_CONTENTS
return code.As mentioned previously, the final RPM which is on-disk after having being transcoded by
rpm2extents
is not a bitwise copy of the original RPM, as such signature/digest verifications on anything that includes the payload will not work.The current approach is that
rpm2extents
performs the signature validation while streaming the original RPM. The result of this verification (return code + text output) is written into the transcoded RPM trailing metadata so it can be re-used later byrpmkeys
, or anything that needs to validate the package likerpm -i
. Having a plugin hook that can defer the validation of RPM to the plugin would help in separating that logic from RPM core.Beta Was this translation helpful? Give feedback.
All reactions