Switch default pool from LVM to BTRFS-Reflink #6476

Open
DemiMarie opened this issue Mar 22, 2021 · 89 comments
Labels: C: storage; P: default; T: enhancement

Comments

@DemiMarie

DemiMarie commented Mar 22, 2021

The problem you're addressing (if any)

In R4.0, the default install uses LVM thin pools. However, LVM appears to be optimized for servers, which results in several shortcomings:

  • Space exhaustion is handled poorly, requiring manual recovery. This recovery may sometimes fail.
  • It is not possible to shrink a thin pool.
  • Thin pools slow down system startup and shutdown.

Additionally, LVM thin pools do not support checksums. This can be achieved via dm-integrity, but that does not support TRIM.

Describe the solution you'd like

I propose that R4.3 use BTRFS+reflinks by default. This is a proposal ― it is by no means finalized.

Where is the value to a user, and who might that user be?

BTRFS has checksums by default, and has full support for TRIM. It is also possible to shrink a BTRFS pool without a full backup+restore. BTRFS does not slow down system startup and shutdown, and does not corrupt data if metadata space is exhausted.

When combined with LUKS, BTRFS checksumming provides authentication: it is not possible to tamper with the on-disk data (except by rolling back to a previous version) without invalidating the checksum. Therefore, this is a first step towards untrusted storage domains. Furthermore, BTRFS is the default in Fedora 33 and openSUSE.

Finally, with BTRFS, VM images are just ordinary disk files, and the storage pool the same as the dom0 filesystem. This means that issues like #6297 are impossible.
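
To make "VM images are just ordinary disk files" concrete, here is a minimal Python sketch of cloning an image with a reflink. It assumes Linux and a filesystem that supports the FICLONE ioctl (for example BTRFS, or XFS with reflink enabled). The function name `try_reflink` is hypothetical, and this is not the code of the Qubes pool driver.

```python
import fcntl

# Linux FICLONE ioctl request number, i.e. _IOW(0x94, 9, int) from <linux/fs.h>.
FICLONE = 0x40049409


def try_reflink(src_path, dst_path):
    """Try to make dst_path share all of its data blocks with src_path.

    Returns True if the filesystem performed the clone, or False if it
    refused (for example on ext4 or tmpfs, or across filesystems).
    dst_path is created or truncated either way.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
        except OSError:
            return False
    return True
```

On a filesystem with reflink support the clone takes no extra data space until one of the two files is modified; elsewhere the function simply reports failure and the caller would need to fall back to a regular copy.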

Describe alternatives you've considered

None that are currently practical. bcachefs and ZFS are long-term potential alternatives, but the latter would need to be distributed as source and the former is not production-ready yet.

Additional context

I have had to recover manually from LVM thin pool problems (failure to activate, IIRC) on more than one occasion. Additionally, the only supported interface to LVM is the CLI, which is rather clumsy. The LVM pool requires nearly twice the amount of code as the BTRFS pool, for example.

Relevant documentation you've consulted

man lvm

Related, non-duplicate issues

#5053
#6297
#6184
#3244 (really a kernel bug)
#5826
#3230 ― since reflink files are ordinary disk files we could just rename them without needing a copy
#3964
everything in https://github.com/QubesOS/qubes-issues/search?q=lvm+thin+pool&state=open&type=issues

Most recent benchmarks: #6476 (comment)

DemiMarie added the T: enhancement and P: default labels Mar 22, 2021
@iamahuman

It might be a good idea to compare performance (seq read, rand read, allocation, overwrite, discard) between the three backends. See: #3639

@GWeck

GWeck commented Apr 15, 2021

With regard to VM boot time, LVM storage pool was slightly faster than BTRFS, but this may be still within the margin of error (LVM: 7.43 s versus BTRFS: 8.15 s for starting a debian-10-minimal VM).

DemiMarie changed the title from "R4.1: switch default pool from LVM to BTRFS-Reflink" to "[RFC] R4.1: switch default pool from LVM to BTRFS-Reflink" Apr 15, 2021
@DemiMarie
Author

Marking as RFC because this is by no means finalized.

@tlaurion
Contributor

@DemiMarie following that comment, I'm posting my deconstructed thoughts here.

No problem with QubesOS searching for the best FS to switch to for the 4.1 release, and questioning the partition scheme, but I'm a bit lost on the direction of QubesOS 4.1 and the goals here (stability? performance? backups? portability? security?)

I was kind of against dom0 having a separate LVM pool because of the space constraints resulting from the change, but I agreed and accepted that pool metadata exhaustion was a real, tangible issue. It hit me a lot before, its resolution is sketchy, and it is still not correctly advertised in the widget for users who simply upgrade and get hit by it.

The fix in new installs resolved the issue, while QubesOS decided to split the dom0 pool out of the main pool, so that fixing pool issues on the system would be easier for the end user, or unnecessary.

I am just not so sure why switching filesystem is on point now, where LVM thin provisioning seems to fit the goal, but willing to hear more about the advantages.

I am interested in the reasoning for such a switch, and the probability of it happening, since I am really interested in pushing wyng-backups further, inside/outside of Heads and inside/outside of QubesOS, and in grant/self funding the work so that QubesOS metadata would be included in wyng-backups. That would permit restore/verification/fresh deployment/revert from a local (OEM recovery VM) or remote source, just applying diffs where required from an ssh remote read-only mountpoint.

This filesystem choice seems less relevant than what can make those changes consume dom0 LVM, which should be excluded from dom0 so that dm-verity can be set up under Heads/Safeboot. But this is irrelevant to this ticket.

@DemiMarie
Author

I am just not so sure why switching filesystem is on point now, where LVM thin provisioning seems to fit the goal, but willing to hear more about the advantages.

The advantages are listed above. In short, a BTRFS pool is more flexible, and it offers possibilities (such as whole-system snapshots) that I do not believe are possible with LVM thin provisioning. BTRFS also offers flexible quotas, and can always recover from out of space conditions provided that a small amount of additional storage (such as a spare partition set aside for the purpose) is available. Furthermore, BTRFS checksumming and scrubbing appear to be useful. Finally, new storage can be added to and removed from a BTRFS pool at any time, and the pool can be shrunk as well.

BTRFS also has disadvantages: its throughput is worse than LVM, and there are reports of bad performance on I/O heavy workloads such as QubesOS. Benchmarks and user feedback will be needed to determine which is better, which is why this is an RFC.

I am interested in the reasoning for such a switch, and the probability of it happening, since I am really interested in pushing wyng-backups further, inside/outside of Heads and inside/outside of QubesOS, and in grant/self funding the work so that QubesOS metadata would be included in wyng-backups. That would permit restore/verification/fresh deployment/revert from a local (OEM recovery VM) or remote source, just applying diffs where required from an ssh remote read-only mountpoint.

I believe that btrfs send and btrfs receive offer the same functionality as wyng-backups, but am not certain as I never used either. As far as the probability: this is currently only a proposal, and I am honestly not sure if switching this close to the R4.1 release date is a good idea. In any case, LVM will continue to be fully supported ― this just flips the default in the installer.

@tasket

tasket commented Apr 18, 2021

@DemiMarie There are many questions swirling around advanced storage on Linux, but I think the main ones applicable here are about reliability and performance. Btrfs and Thin LVM appear to offer trade-offs on those qualities, and I don't think it's necessarily a good move to switch the Qubes default to a slower storage scheme at this point; storage speed is critical for Qubes' usability, and large disk image files with random write patterns are Btrfs' weakest point.

Running out of space is probably Thin LVM's weakest point, although this can be pretty easily avoided. For one, dom0 root is moving to a dedicated pool in R4.1, which will keep admin working in most situations. Adding more protections to the domU pool can also be done with some pretty simple userland code. (For those who are skeptical, note that this is the general approach taken by Stratis.)

The above-mentioned Btrfs checksums are a nice-to-have feature against accidental damage, but they unfortunately do not come close to providing authentication. To my knowledge, no CRC mode can do that, even if it's encrypted. Any attacker able to induce some calculated change in an encrypted volume would probably find the malleability of encrypted CRCs to be little or no obstacle. IMHO, the authentication aspect of the proposal is a non-starter. (BTW, it looks like dm-integrity may be able to do this now along with discard support, if its journal mode supports internal tags.)
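
The malleability concern rests on CRCs being affine over GF(2): for equal-length inputs, crc(a XOR b) = crc(a) XOR crc(b) XOR crc(zeros). The Python sketch below illustrates that property on plaintext only. It uses `zlib.crc32` as a stand-in (an assumption; Btrfs's default checksum is crc32c, which uses a different polynomial but has the same structure), and the helper names are hypothetical. It does not show whether an attacker can induce a chosen plaintext change through disk encryption.

```python
import os
import zlib


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def crc_after_xor(old_crc, delta):
    """Predict the CRC-32 of (message XOR delta) from the old CRC alone.

    Uses crc(a ^ b) == crc(a) ^ crc(b) ^ crc(zeros) for equal lengths.
    """
    zeros = bytes(len(delta))
    return old_crc ^ zlib.crc32(delta) ^ zlib.crc32(zeros)


message = os.urandom(4096)   # stand-in for one plaintext block
delta = os.urandom(4096)     # chosen bit flips
tampered = xor_bytes(message, delta)

# The new checksum is computed without ever looking at `message`.
predicted = crc_after_xor(zlib.crc32(message), delta)
print(predicted == zlib.crc32(tampered))
```

A keyed MAC or a cryptographic hash stored out of the attacker's reach has no such shortcut, which is why the checksum algorithm matters for the authentication argument.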

As for backups, Wyng basically exists because tools like btrfs send are constrained to using the same back end (Btrfs with admin privileges) which severely narrows the user's options for backup destinations. Wyng can also be adapted to any storage source that can create snapshots and report their deltas (Btrfs included).

The storage field also continues to evolve in interesting ways: Red Hat is creating Stratis while hardware manufacturers implemented NVMe objects and enhanced parallelism. Stratis appears to be based on none other than Thin LVM's main components (dm-thin, etc) in addition to dm-integrity, with XFS on top; all the layers are tied together to respond cohesively from a single management interface. This is being developed to avoid Btrfs maintenance and performance pitfalls.

I think some examination of Btrfs development culture may also be in order, as it has driven Red Hat to exasperation and a decision to drop Btrfs. I'm not sure just what it is about accepting Btrfs patches that presents a problem, but it makes me concerned that too much trust has been eroded and that Btrfs may become a casualty in 'storage wars' between an IBM / Red Hat camp and what I'd call an Oracle-centric camp.


FWIW, I was one of the first users to show how Qubes could take advantage of Btrfs reflinks for cloning and to request specific reflink support. Back in 2014, it was easy to assume Btrfs shortcomings would be addressed fairly soon, since those issues were so obvious. Yet they are still unresolved today.

My advice at this point is to wait and see – and experiment. There is an unfortunate dearth of comparison tests configured in a way that makes sense; they usually compare Btrfs to bare Ext4, for example, and almost always overlook LVM thin pools. So it's mostly apples vs oranges. However, what little benchmarking I've seen of thin LVM suggests a performance advantage vs Btrfs that would be too large to ignore. There are also Btrfs modes of use we should explore, such as any performance gain from disabling CoW on disk images; if this were deemed desirable then the Qubes Btrfs driver would have to be refactored to use subvolume snapshots instead of reflinks. An XFS reflink comparison on Qubes would also be very interesting!

@DemiMarie
Author

@DemiMarie There are many questions swirling around advanced storage on Linux, but I think the main ones applicable here are about reliability and performance. Btrfs and Thin LVM appear to offer trade-offs on those qualities, and I don't think it's necessarily a good move to switch the Qubes default to a slower storage scheme at this point; storage speed is critical for Qubes' usability, and large disk image files with random write patterns are Btrfs' weakest point.

In retrospect, I agree. That said (as you yourself mention below) XFS also supports reflinks and lacks this problem.

Running out of space is probably Thin LVM's weakest point, although this can be pretty easily avoided. For one, dom0 root is moving to a dedicated pool in R4.1, which will keep admin working in most situations. Adding more protections to the domU pool can also be done with some pretty simple userland code. (For those who are skeptical, note that this is the general approach taken by Stratis.)

Will it be possible to reserve space for use by discards? A user needs to be able to free up space even if they make a mistake and let the pool fill up.

The above-mentioned Btrfs checksums are a nice-to-have feature against accidental damage, but they unfortunately do not come close to providing authentication. To my knowledge, no CRC mode can do that, even if it's encrypted. Any attacker able to induce some calculated change in an encrypted volume would probably find the malleability of encrypted CRCs to be little or no obstacle. IMHO, the authentication aspect of the proposal is a non-starter. (BTW, it looks like dm-integrity may be able to do this now along with discard support, if its journal mode supports internal tags.)

The way XTS works is that any change (by an attacker who does not have the key) will completely scramble a 128-bit block; my understanding is that a CRC32 with a scrambled block will only pass with probability 2⁻³². That said, BTRFS also supports Blake2b and SHA256, which would be better choices.

As for backups, Wyng basically exists because tools like btrfs send are constrained to using the same back end (Btrfs with admin privileges) which severely narrows the user's options for backup destinations. Wyng can also be adapted to any storage source that can create snapshots and report their deltas (Btrfs included).

Good to know, thanks!

The storage field also continues to evolve in interesting ways: Red Hat is creating Stratis while hardware manufacturers implemented NVMe objects and enhanced parallelism. Stratis appears to be based on none other than Thin LVM's main components (dm-thin, etc) in addition to dm-integrity, with XFS on top; all the layers are tied together to respond cohesively from a single management interface. This is being developed to avoid Btrfs maintenance and performance pitfalls.

I think some examination of Btrfs development culture may also be in order, as it has driven Red Hat to exasperation and a decision to drop Btrfs. I'm not sure just what it is about accepting Btrfs patches that presents a problem, but it makes me concerned that too much trust has been eroded and that Btrfs may become a casualty in 'storage wars' between an IBM / Red Hat camp and what I'd call an Oracle-centric camp.

My understanding (which admittedly comes from a comment on Y Combinator) is that BTRFS moves too fast to be used in RHEL. RHEL is stuck on one kernel for an entire release, and rebasing BTRFS every release became too difficult, especially since Red Hat has no BTRFS developers.

FWIW, I was one of the first users to show how Qubes could take advantage of Btrfs reflinks for cloning and to request specific reflink support. Back in 2014, it was easy to assume Btrfs shortcomings would be addressed fairly soon, since those issues were so obvious. Yet they are still unresolved today.


My advice at this point is to wait and see – and experiment. There is an unfortunate dearth of comparison tests configured in a way that makes sense; they usually compare Btrfs to bare Ext4, for example, and almost always overlook LVM thin pools. So it's mostly apples vs oranges. However, what little benchmarking I've seen of thin LVM suggests a performance advantage vs Btrfs that would be too large to ignore. There are also Btrfs modes of use we should explore, such as any performance gain from disabling CoW on disk images; if this were deemed desirable then the Qubes Btrfs driver would have to be refactored to use subvolume snapshots instead of reflinks. An XFS reflink comparison on Qubes would also be very interesting!

That it would be, especially when combined with Stratis. The other major problem with LVM2 (and possibly dm-thin) seems to be snapshot and discard speeds; I expect XFS reflinks to mitigate most of those problems.

@tasket

tasket commented Apr 18, 2021

Ah, new Btrfs feature... Great! I'd consider enabling one of its hashing modes as being able to support authentication.

I'd still consider the Stratis concept to be more interesting for now, as Qubes' current volume management is pretty similar but potentially even better and simpler due to having a privileged VM environment.

@DemiMarie
Author

Ah, new Btrfs feature... Great! I'd consider enabling one of its hashing modes as being able to support authentication.

Agreed. While I am not aware of any way to tamper with a LUKS partition without invalidating a CRC, Blake2b is by far the better choice.

I'd still consider the Stratis concept to be more interesting for now, as Qubes' current volume management is pretty similar but potentially even better and simpler due to having a privileged VM environment.

I agree, with one caveat: my understanding is that LUKS/AES-XTS-512 + BTRFS/Blake2b-256 is sufficient to protect against even malicious block devices, whereas dm-integrity is not. dm-integrity is vulnerable to a partial rollback attack: it is possible to rollback parts of the disk without dm-integrity detecting it. Therefore, dm-integrity is not (currently) sufficient for use with untrusted storage domains, which is a future goal of QubesOS.

@DemiMarie
Author

@tasket: what are your thoughts on using loop devices? That’s my biggest worry regarding XFS+reflinks, which seems to otherwise be a very good choice for QubesOS. Other approaches exist, of course; for instance, we could modify blkback to handle regular files as well as block devices.

@0spinboson

0spinboson commented Apr 20, 2021

I really wish the FS's name wasn't a misogynistic slur. That aside, my only experience with it, under 4.0, had my Qubes installation become unbootable, and I found it very difficult to fix, relative to a system built on LVM. And that does strike me as relevant to the question of whether Qubes switches, while imo this is only partly addressable via improving the documentation (since the other part is the software we have to use to restore).

@DemiMarie
Author

FS's name wasn't a misogynistic slur

@0spinboson would you mind clarifying which filesystem you are referring to?

@tasket

tasket commented Apr 20, 2021

Will it be possible to reserve space for use by discards? A user needs to be able to free up space even if they make a mistake and let the pool fill up.

Yes, it's simple to allocate some space in a pool using a non-zero thin lv. Just reserve the lv name in the system, make it inactive, and check that it exists on startup.

Further, it would be easy to use existing space-monitoring components to also pause any VMs associated with a nearly-full pool and then show an alert dialog to the user.
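
As a rough illustration of the monitoring idea, the Python sketch below parses the two usage percentages of a thin pool and picks an action. It assumes the text comes from a command such as `lvs --noheadings -o data_percent,metadata_percent VG/POOL` run in a locale that prints a decimal point; the function names and the 90%/95% thresholds are hypothetical and are not existing Qubes behaviour.

```python
def parse_lvs_percent(output):
    """Parse "  34.14  19.81" into a (data, metadata) tuple of floats."""
    data, metadata = output.split()
    return float(data), float(metadata)


def pool_action(data_pct, metadata_pct, warn_at=90.0, pause_at=95.0):
    """Decide what to do about a thin pool based on its fullest component."""
    fullest = max(data_pct, metadata_pct)
    if fullest >= pause_at:
        return "pause"  # pause VMs on this pool and alert the user
    if fullest >= warn_at:
        return "warn"   # notify only
    return "ok"


data, metadata = parse_lvs_percent("  34.14  19.81\n")
print(pool_action(data, metadata))
```

Checking metadata as well as data matters here, since metadata exhaustion was the failure mode raised earlier in the thread.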

it is possible to rollback parts of the disk without dm-integrity detecting it.

I thought the journal mode would prevent that? I don't know it in detail, but something like a hash of the hashes of the last changed blocks, computed with the prior journal entry, would have to be in each journal entry.

what are your thoughts on using loop devices? That’s my biggest worry regarding XFS+reflinks

I forgot they were a factor... it's been so long since I've used Qubes in a file-backed mode. But this should be the same for Btrfs, I think.

FWIW, the XFS reflink suggestion was more speculative, along the lines of "What if we benchmark it for accessing disk images and it's almost as fast as thin LVM?". The regular XFS vs Ext4 benchmarks I'm seeing suggest it might be possible. It's also not aligned with the Stratis concept, as that is closer to thin LVM with XFS just providing the top layer. (Obviously we can't use Stratis itself unless it supports a mode that accounts for the top layer being controlled by domUs.)

Also FWIW: XFS historically supported a 'subvolume' feature for accessing disk image files, instead of loopdevs. It requires that certain IO scheduler conditions be met before it can be enabled.

@0spinboson

FS's name wasn't a misogynistic slur

@0spinboson would you mind clarifying which filesystem you are referring to?

'Butterface', was intentional, afaik.

@Rudd-O

Rudd-O commented Oct 26, 2021

No, it was not. The file system is named btrfs because it means B-tree FS. That the name is often pronounced as a hilarious word may or may not be seen as a pun, but that is in the eye of the beholder.

@dmoerner

Basic question: If I install R4.1 with BTRFS by selecting custom, and then using Anaconda to automatically create the Qubes partitions with BTRFS, is that sufficient for the default pool to use BTRFS-Reflink? Or do I have to do something extra for the "Reflink" part?

@rustybird

If I install R4.1 with BTRFS by selecting custom, and then using Anaconda to automatically create the Qubes partitions with BTRFS, is that sufficient for the default pool to use BTRFS-Reflink?

Yes

@noskb

noskb commented Nov 29, 2021

@DemiMarie DemiMarie modified the milestones: Release 4.1, Release TBD Dec 23, 2021
@kocmo

kocmo commented Aug 19, 2024

Additionally, LVM thin pools do not support checksums. This can be achieved via dm-integrity, but that does not support TRIM.

Ext4 has metadata checksums enabled since e2fsprogs 1.43, so at least some filesystem integrity checking is happening inside VMs:

root@sys-firewall /h/user# dumpe2fs /dev/mapper/dmroot | grep metadata_csum
dumpe2fs 1.47.0 (5-Feb-2023)
Filesystem features:      ... metadata_csum ...

Does Qubes have mechanisms to report kernel errors from VMs and dom0 to the user, via toast notifications or so?

In Qubes 4.2.1, 4.2.2 dom0 systemd journal continuously gets repeated PAM error messages :-/

dom0 pkexec[141170]: PAM unable to dlopen(/usr/lib64/security/pam_sss.so): /usr/lib64/security/pam_sss.so: cannot open shared object file: No such file or directory
dom0 pkexec[141170]: PAM adding faulty module: /usr/lib64/security/pam_sss.so

@tlaurion
Contributor

tlaurion commented Oct 3, 2024

@DemiMarie tasket/wyng-backup#211

With proper settings, I confirm btrfs to be way better performance-wise than lvm2 with large qubes and clones+specialization (qusal used). My tests of beesd have stopped for the moment for lack of time.

@DemiMarie
Author

@tlaurion Can you provide proper benchmarks? Last benchmarks by @marmarek found that BTRFS was not faster than LVM2, which is why LVM2 is still the default.

@tlaurion
Contributor

tlaurion commented Oct 25, 2024

@tlaurion Can you provide proper benchmarks? Last benchmarks by @marmarek found that BTRFS was not faster than LVM2, which is why LVM2 is still the default.

@DemiMarie @marek @tasket unfortunately I can't. My notes are scattered under bees and wyng-backup issues for the moment; they are how I optimized my btrfs setup, and I would never go back to thin LVM again, until zfs is figured out to be added under installer.

But before a proper perf comparison can be done, the defaults for btrfs filesystem creation and fstab mount options need revisiting, including detection of the block device type (HDD vs SSD/NVMe).

Incomplete notes out of my head:

  • trimming needs to be async otherwise qubes snapshot rotation kills IO waiting on btrfs-transaction putting system into apparent frozen state.
  • btrfs: only system needs to be safeguarded with DUP, otherwise there is incredible overhead with reflinks and snapshots. Metadata and data should be created as single instead of DUP, to reserve less space and to reduce overhead duplicating what non-HDD firmware already does (block device type dependent).
  • mode optimizations should be specified under fstab (block type dependent)

Will try to revisit the scattered issues and post them here in a further edit when I have a bit more time to invest in this issue. Collaboration needed.

Or at least point this comment to those comments.

@marmarek
Member

until zfs is figured out to be added under installer.

Very unlikely (unless available in the upstream kernel, which is also very unlikely). We have a CI job that checks if the ZFS pool works, and quite often we find that it doesn't work with the latest kernel yet. So official ZFS support would hold back kernel updates, which I heard from @DemiMarie is completely unacceptable if doing any sort of GPU acceleration work (which we do want at some point). Currently the said CI job doesn't work, because there is no dkms package for Fedora 41 (which R4.3 is based on) yet.

Back to topic:

trimming needs to be async otherwise qubes snapshot rotation kills IO waiting on btrfs-transaction putting system into apparent frozen state.

Is there any impact on resilience for power failure cases?

mode optimizations should be specified under fstab (block type dependent)

What do you mean?

Can you collect specific options that need to be set (fstab and elsewhere)?

@DemiMarie
Author

until zfs is figured out to be added under installer.

Very unlikely (unless available in the upstream kernel, which is also very unlikely). We have a CI job that checks if the ZFS pool works, and quite often we find that it doesn't work with the latest kernel yet. So official ZFS support would hold back kernel updates, which I heard from @DemiMarie is completely unacceptable if doing any sort of GPU acceleration work (which we do want at some point). Currently the said CI job doesn't work, because there is no dkms package for Fedora 41 (which R4.3 is based on) yet.

For clarification: I am not sure if Qubes OS will use the latest stable kernel or the latest LTS kernel, and if Qubes OS will skip the first few releases in a stable branch. If Qubes OS will skip the first few releases (as seems likely, since these releases often have easily-found bugs), this might give ZFS sufficient time to catch up.

What must be taken regularly for GPU acceleration are weekly updates in the upstream branch that Qubes OS has chosen to follow. I would be highly surprised if those break OpenZFS, though there are obviously no guarantees.

@tlaurion
Contributor

tlaurion commented Oct 26, 2024

trimming needs to be async otherwise qubes snapshot rotation kills IO waiting on btrfs-transaction putting system into apparent frozen state.

Is there any impact on resilience for power failure cases?

Not that I'm aware of; DUP is for HDDs (and a general failsafe mechanism, given the load that QubesOS volatile/snapshot rotation + reflinks put on those), whereas SSD/NVMe does its own thing in firmware, and BTRFS is pretty atomic anyway.

Also note that for wyng-backups (my goal), I disabled volumes to keep globally as well, so the impact on IO performance is largely unobservable on my side (I rely on the single snapshot from wyng's last backup). Otherwise, the volatile volumes of root+private, plus volumes to keep and snapshot rotation, were the first observable drawback of using BTRFS with the current installer defaults and no optimizations. Those defaults are fine at first, but the performance penalty gets heavier as more rootfs clones (templates) have volumes to keep and more appvm private volumes are cloned and reflinked. With qusal it's just not fun, even on newer hardware with NVMe, so I would not advise testing this on Ivy Bridge unless you want to throw the laptop out the window, with the end user thinking QubesOS is just a crappy OS.

mode optimizations should be specified under fstab (block type dependent)

What do you mean?

Can you collect specific options that need to be set (fstab and elsewhere)?

@marmarek @tasket current fstab:
UUID=a8b2c51e-f325-4647-8fc2-bc7e93f49645 / btrfs subvol=root,x-systemd.device-timeout=0,noatime,compress=zstd,ssd,space_cache=v2,discard=async,noautodefrag 0 0

mainly for ssd:

  • ssd, space_cache=v2 (default now), discard=async

and for reflink perf improvements:

  • noautodefrag

One would advocate that system should be in DUP for resilience, but I have observed no impact. Definitely, data should be single, metadata could be DUP (doubling its size) or single, and system should stay DUP for reliability reasons. Todo: detecting the block device type and deviating from the defaults needs to happen; to what, that is the question.

[user@dom0 ~]
$ sudo btrfs fi usage /
Overall:
    Device size:		   1.79TiB
    Device allocated:		   1.59TiB
    Device unallocated:		 207.99GiB
    Device missing:		     0.00B
    Device slack:		     0.00B
    Used:			 553.73GiB
    Free (estimated):		   1.24TiB	(min: 1.24TiB)
    Free (statfs, df):		   1.24TiB
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		 512.00MiB	(used: 0.00B)
    Multiple profiles:		        no

Data,single: Size:1.57TiB, Used:549.96GiB (34.14%)
   /dev/mapper/luks-09f96e74-eb7e-4058-b3f4-e4f0ca8492fd	   1.57TiB

Metadata,single: Size:19.00GiB, Used:3.76GiB (19.81%)
   /dev/mapper/luks-09f96e74-eb7e-4058-b3f4-e4f0ca8492fd	  19.00GiB

System,single: Size:32.00MiB, Used:224.00KiB (0.68%)
   /dev/mapper/luks-09f96e74-eb7e-4058-b3f4-e4f0ca8492fd	  32.00MiB

Unallocated:
   /dev/mapper/luks-09f96e74-eb7e-4058-b3f4-e4f0ca8492fd	 207.99GiB
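
For anyone comparing setups, the profile of each block group type can be read off output like the above mechanically. The following Python sketch does that with a regular expression; the function name `block_group_profiles` is hypothetical, and the sample is an abridged copy of the output above. If a filesystem has multiple profiles for one type, this simple version keeps only the last one seen.

```python
import re

# Matches header lines such as "Metadata,single: Size:19.00GiB, ..."
PROFILE_LINE = re.compile(r"^(Data|Metadata|System),(\w+):")


def block_group_profiles(usage_text):
    """Map each block group type to its profile, e.g. {"Data": "single"}."""
    profiles = {}
    for line in usage_text.splitlines():
        match = PROFILE_LINE.match(line.strip())
        if match:
            profiles[match.group(1)] = match.group(2)
    return profiles


sample = """\
Data,single: Size:1.57TiB, Used:549.96GiB (34.14%)
Metadata,single: Size:19.00GiB, Used:3.76GiB (19.81%)
System,single: Size:32.00MiB, Used:224.00KiB (0.68%)
"""
print(block_group_profiles(sample))
```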

A reminder: those points of research were addressed in my joint grant application plan for wyng-backup under #858 (comment)

@rustybird

rustybird commented Oct 26, 2024

@tlaurion:

I'm intrigued but skeptical that mkfs.btrfs --metadata=dup (the default) would have a notable negative impact on reflinking. Is there a benchmark showing this?

--data=single is already the default.

For mount options:

ssd

Automatically applied if the drive's /sys/block/foo/queue/rotational attribute (which is passed through by dm-crypt) is 0

discard=async

Default on modern kernels, but unfortunately it's overridden to discard[=sync] in fstab by the Anaconda installer since R4.2

noautodefrag

Default
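
The two checks described above can be sketched in Python as follows: reading the `rotational` attribute from sysfs, and finding which discard mode an fstab option string sets explicitly. The function names are hypothetical, `sysfs_root` is a parameter only so the code can be exercised without real hardware, and a result of `None` means fstab does not set the option, so the kernel default applies.

```python
import os


def is_rotational(device, sysfs_root="/sys/block"):
    """Return True if the kernel reports `device` (e.g. "sda") as rotational."""
    path = os.path.join(sysfs_root, device, "queue", "rotational")
    with open(path) as f:
        return f.read().strip() == "1"


def discard_mode(mount_options):
    """Return "async", "sync", or None for a Btrfs mount option string."""
    mode = None
    for opt in mount_options.split(","):
        if opt in ("discard", "discard=sync"):
            mode = "sync"
        elif opt == "discard=async":
            mode = "async"
        elif opt == "nodiscard":
            mode = None
    return mode


options = ("subvol=root,x-systemd.device-timeout=0,noatime,compress=zstd,"
           "ssd,space_cache=v2,discard=async,noautodefrag")
print(discard_mode(options))
```

Later options override earlier ones in this sketch, mirroring how a repeated mount option is normally resolved.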

@tlaurion
Contributor

tlaurion commented Oct 27, 2024

@tlaurion:

I'm intrigued but skeptical that mkfs.btrfs --metadata=dup (the default) would have a notable negative impact on reflinking. Is there a benchmark showing this?

--data=single is already the default.

Mine was dup on default install.

For mount options:

ssd

Automatically applied if the drive's /sys/block/foo/queue/rotational attribute (which is passed through by dm-crypt) is 0

discard=async

Default on modern kernels, but unfortunately it's overridden to discard[=sync] in fstab by the Anaconda installer since R4.2

noautodefrag

Default

For lack of proper benchmarking, I guess the culprit to fix here is then noautodefrag and redo testing

@no-usernames-left

It's unfortunate to watch so many cycles being spent to deal with issues that wouldn't even merit mention with ZFS.

Inability to use bleeding-edge kernel releases seems a bit disingenuous as a reason to disqualify ZFS given how far back from the edge Qubes (rightfully!) stays.

Regarding distributing binaries (in the installer or otherwise), DKMS seems like it would be one good solution.

@rustybird

rustybird commented Oct 27, 2024

@tlaurion:

--data=single is already the default.

Mine was dup on default install.

Odd. Does your filesystem span multiple block devices? Even then I don't see why that would result in --data=dup, unless Anaconda is doing something weird.

I guess the culprit to fix here is then noautodefrag and redo testing

Just not using the autodefrag mount option is the same as explicitly using noautodefrag. autodefrag is definitely unsuitable for file-reflink pools, because defragmentation duplicates shared data (so it's something that has to be done manually and selectively, restricted to .img files that don't share any data with other .img files).

@tlaurion
Contributor

It's unfortunate to watch so many cycles being spent to deal with issues that wouldn't even merit mention with ZFS.

Inability to use bleeding-edge kernel releases seems a bit disingenuous as a reason to disqualify ZFS given how far back from the edge Qubes (rightfully!) stays.

Regarding distributing binaries (in the installer or otherwise), DKMS seems like it would be one good solution.

I tend to agree with that statement. Combining dGPU pass-through with efficient pool management (online dedup being a big win for my use case, versus not caring at all about graphics acceleration) would resolve most of my issues and the time I spend with bees, which keeps doing file dedup and consuming CPU cycles I would prefer not to spend.

ZFS > Btrfs on all levels, once again.

@tlaurion
Contributor

@tlaurion:

--data=single is already the default.

Mine was dup on default install.

Odd. Does your filesystem span multiple block devices? Even then I don't see why that would result in --data=dup, unless Anaconda is doing something weird.

No. Single disk, 2 LUKS volumes, one rootfs where /var/lib/qubes is the Btrfs reflink pool.

I guess the culprit to fix here is then noautodefrag and redo testing

Just not using the autodefrag mount option is the same as explicitly using noautodefrag. autodefrag is definitely unsuitable for file-reflink pools, because defragmentation duplicates shared data (so it's something that has to be done manually and selectively, restricted to .img files that don't share any data with other .img files).

My bad. Not autodefrag but discard=async

Crossref Zygo/bees#283 (comment)

@marmarek
Member

Inability to use bleeding-edge kernel releases seems a bit disingenuous as a reason to disqualify ZFS given how far back from the edge Qubes (rightfully!) stays.

I don't want Qubes users to be in this situation: openzfs/zfs#16590 (comment) (the 6.11 kernel has been in the Qubes stable repo for 2+ weeks already).

@DemiMarie
Author

Inability to use bleeding-edge kernel releases seems a bit disingenuous as a reason to disqualify ZFS given how far back from the edge Qubes (rightfully!) stays.

I don't want Qubes users to be in this situation: openzfs/zfs#16590 (comment) (the 6.11 kernel has been in the Qubes stable repo for 2+ weeks already).

To elaborate: Users who turn on GPU acceleration will need to update their kernel weekly to the latest release on their branch of choice. For those who are using kernel-latest, this means that they will not be able to receive security patches for GPU drivers.

@no-usernames-left

Inability to use bleeding-edge kernel releases seems a bit disingenuous as a reason to disqualify ZFS given how far back from the edge Qubes (rightfully!) stays.

I don't want Qubes users to be in this situation: openzfs/zfs#16590 (comment) (the 6.11 kernel has been in the Qubes stable repo for 2+ weeks already).

I completely agree, and I don't want them to either. That said, this was caused by a user adding a dependency outside of the project's control upon which the base system relied. If Qubes included it instead, presumably you'd hold back the kernel until there was a compatible ZFS version, just like you do for other Qubes-specific dependencies.

The fact that users are going to such lengths to use ZFS shows there is demand for it (and @Rudd-O has already gone to great effort to lay some of the groundwork).

What ZFS brings to the table meshes very well with how Qubes aims to achieve its goals. I still cannot understand why we are dismissing it out of hand. (Its pedigree is also far superior to that of Btrfs.)

@no-usernames-left

For those who are using kernel-latest, this means that they will not be able to receive security patches for GPU drivers.

Yet another reason to stay back from the bleeding edge given the security-above-all ethos of Qubes.

And if we are holding the kernel back for any other reason, we can also hold it back to ensure compatibility with ZFS.

@DemiMarie
Author

Inability to use bleeding-edge kernel releases seems a bit disingenuous as a reason to disqualify ZFS given how far back from the edge Qubes (rightfully!) stays.

I don't want Qubes users to be in this situation: openzfs/zfs#16590 (comment) (the 6.11 kernel has been in the Qubes stable repo for 2+ weeks already).

I completely agree, and I don't want them to either. That said, this was caused by a user adding a dependency outside of the project's control upon which the base system relied. If Qubes included it instead, presumably you'd hold back the kernel until there was a compatible ZFS version, just like you do for other Qubes-specific dependencies.

Holding back kernel updates is 100% incompatible with GPU acceleration, because Linux does not reliably issue security advisories for GPU driver vulnerabilities and so Qubes OS’s security team does not know which patches need to be backported. Qubes OS will be offering GPU acceleration in the future because there are many users who simply cannot use Qubes OS without it. Therefore, holding back kernel updates is not a sustainable solution.

It might be possible to provide a ZFS DKMS package that only supported LTS kernels, which is what Qubes OS ships by default. However, there are users who must use kernel-latest to get support for their hardware and such users would not be able to use ZFS.

The fact that users are going to such lengths to use ZFS shows there is demand for it (and @Rudd-O has already gone to great effort to lay some of the groundwork).

There is absolutely demand for ZFS, and for very good reason: ZFS is the most reliable filesystem available today, on any operating system. To be clear, if I had a production server using an LTS kernel I would most likely choose ZFS for data volumes (though probably not the root volume unless I was on Ubuntu). Qubes OS, however, is not a server operating system, and cannot rely on LTS kernels because it needs

What ZFS brings to the table meshes very well with how Qubes aims to achieve its goals. I still cannot understand why we are dismissing it out of hand. (Its pedigree is also far superior to that of Btrfs.)

ZFS is not being dismissed out of hand. There are, however, multiple severe problems with it:

  1. It is unclear whether ZFS binary kernel modules can legally be redistributed. Redistribution in source form is possible, but without reproducible DKMS builds it is incompatible with module signing for secure boot [1] and causes other practical problems.
  2. ZFS is an out-of-tree kernel module and therefore gets periodically broken by kernel updates. For the reasons mentioned above, holding back kernel updates will be a security no-no in the future. LTS updates are much less likely to break ZFS, but not everyone uses LTS kernels.

Footnotes

  1. Reproducible builds solve this problem because the user can produce a module that is identical to the module that Qubes OS built. This means that the signature can be extracted from the signed module, shipped alongside the source, and added back to the newly built kernel module to produce one that is validly signed.
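
The signature round trip described in the footnote can be sketched in Python as follows. This is only an illustration, assuming the usual appended-signature layout of Linux modules (module payload, signature data, a 12-byte `module_signature` info block, then the magic string). The function names are hypothetical, and nothing here is part of an existing Qubes OS or DKMS tool.

```python
import struct

# Marker the Linux kernel expects at the very end of a signed module.
MAGIC = b"~Module signature appended~\n"
# struct module_signature: algo, hash, id_type, signer_len, key_id_len,
# 3 padding bytes, then the signature length as a big-endian u32.
INFO = struct.Struct(">BBBBB3xI")


def split_signed_module(blob):
    """Split a signed module into (unsigned payload, signature trailer)."""
    if not blob.endswith(MAGIC):
        raise ValueError("module carries no appended signature")
    info_end = len(blob) - len(MAGIC)
    info_start = info_end - INFO.size
    if info_start < 0:
        raise ValueError("truncated signature info block")
    sig_len = INFO.unpack(blob[info_start:info_end])[-1]
    trailer_start = info_start - sig_len
    if trailer_start < 0:
        raise ValueError("signature length exceeds module size")
    return blob[:trailer_start], blob[trailer_start:]


def transplant_signature(local_build, signed_reference):
    """Attach the reference signature to a bit-identical local build."""
    payload, trailer = split_signed_module(signed_reference)
    if local_build != payload:
        # The signature covers the exact payload bytes, so any
        # difference would make the transplanted signature invalid.
        raise ValueError("local build differs from the signed payload")
    return local_build + trailer
```

In practice only the trailer returned by `split_signed_module` would need to be shipped alongside the source. The equality check is the step where a non-reproducible build fails.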

@no-usernames-left

However, there are users who must use kernel-latest to get support for their hardware and such users would not be able to use ZFS.

Then why throw the baby out with the bathwater? While licensing, secure boot, etc are figured out, those who need kernel-latest could stay on LVM while those who don't could benefit from ZFS.

@marmarek
Member

People can already use ZFS on Qubes OS, it isn't even that complicated to enable. But due to the above issues, it won't be part of the default installation.

@tlaurion
Contributor

Is there still a desire to make Btrfs the first candidate for Qubes OS and to fix the default fs/fstab options?

@rustybird

@tlaurion:

Is there still a desire to make Btrfs the first candidate for Qubes OS and to fix the default fs/fstab options?

The latter part can be done outside of this ticket, because it affects Btrfs users even while Btrfs isn't the default installation layout.

(If you find a way to reproduce the installer bizarrely setting up dup data on your single disk setup, please open an issue! Otherwise, the only thing I'm aware of that needs some tweaking in the installer and in qubes-dist-upgrade is not to hardcode discard[=sync] in fstab.)
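
For anyone who wants to undo the hardcoded option by hand in the meantime, here is a rough Python sketch of the fstab edit. It rewrites `discard` or `discard=sync` on Btrfs lines and leaves every other line untouched. The function name `set_discard_mode` is hypothetical, and this is not what the installer or qubes-dist-upgrade does today.

```python
def set_discard_mode(fstab_text, mode="discard=async"):
    """Replace synchronous discard on Btrfs entries of an fstab.

    `mode` is the replacement option, e.g. "discard=async" or "nodiscard".
    Lines that need no change are returned byte for byte.
    """
    result = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_entry = len(fields) >= 4 and not line.lstrip().startswith("#")
        if not is_entry or fields[2] != "btrfs":
            result.append(line)
            continue
        old_options = fields[3].split(",")
        new_options = [
            mode if opt in ("discard", "discard=sync") else opt
            for opt in old_options
        ]
        if new_options == old_options:
            result.append(line)
            continue
        fields[3] = ",".join(new_options)
        # Rewritten lines are re-joined with tabs, so their original
        # column spacing is not preserved.
        result.append("\t".join(fields))
    trailing_newline = "\n" if fstab_text.endswith("\n") else ""
    return "\n".join(result) + trailing_newline
```

The function only transforms text. Writing the result back to /etc/fstab (after a backup) and remounting or rebooting would be separate manual steps.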

@tlaurion
Contributor

tlaurion commented Nov 4, 2024

@tlaurion:

Is there still a desire to make Btrfs the first candidate for Qubes OS and to fix the default fs/fstab options?

The latter part can be done outside of this ticket, because it affects Btrfs users even while Btrfs isn't the default installation layout.

(If you find a way to reproduce the installer bizarrely setting up dup data on your single disk setup, please open an issue!

I confirm that on fresh installation:

  • Data is single
  • Metadata is DUP (this is the problematic one if clones are used extensively; it should be single)
  • System is DUP
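
For anyone wanting to report the same three values, a small Python sketch that extracts the profiles from the text printed by `btrfs filesystem df <mountpoint>`. The helper name `parse_profiles` is hypothetical, and the sample text is illustrative, not output from my machine.

```python
import re

# Matches lines such as "Metadata, DUP: total=1.00GiB, used=500.00MiB".
PROFILE_LINE = re.compile(r"^(?P<kind>\w+), (?P<profile>\w+):")


def parse_profiles(df_output):
    """Map each block group type (Data, Metadata, ...) to its profile."""
    profiles = {}
    for line in df_output.splitlines():
        match = PROFILE_LINE.match(line.strip())
        if match:
            profiles[match.group("kind")] = match.group("profile")
    return profiles


if __name__ == "__main__":
    # Illustrative text in the format printed by `btrfs filesystem df`.
    sample = (
        "Data, single: total=10.00GiB, used=5.00GiB\n"
        "System, DUP: total=8.00MiB, used=16.00KiB\n"
        "Metadata, DUP: total=1.00GiB, used=500.00MiB\n"
    )
    print(parse_profiles(sample))
```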

Otherwise, the only thing I'm aware of that needs some tweaking in the installer and in qubes-dist-upgrade is not to hardcode discard[=sync] in fstab.)

Agreed. I will apply this on a test laptop (W530, quad core without HT, Samsung 860 1TB SSD) and see whether setting discard=async alone in fstab (without touching metadata DUP or anything else) is enough to keep the pool from exploding with qusal + wyng under my usage of qubes and templates (extensive clones), which could cause issues. I will report back later.

@tlaurion
Contributor

tlaurion commented Nov 24, 2024

@tlaurion:

Is there still a desire to make Btrfs the first candidate for Qubes OS and to fix the default fs/fstab options?

The latter part can be done outside of this ticket, because it affects Btrfs users even while Btrfs isn't the default installation layout.

(If you find a way to reproduce the installer bizarrely setting up dup data on your single disk setup, please open an issue! Otherwise, the only thing I'm aware of that needs some tweaking in the installer and in qubes-dist-upgrade is not to hardcode discard[=sync] in fstab.)

Can someone run a performance comparison of VM startup time with that fstab setting changed vs. the default, on large disks and with multiple clones + snapshots? And of total boot time to a ready-to-work state?

@marmarek @DemiMarie that would help get this ticket moving, with a single-change perf diff.

See also https://forum.qubes-os.org/t/btrfs-and-qubes-os/6967/54
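
As a starting point for such a comparison, here is a minimal timing harness in Python. It is only a sketch: `time_command` is a hypothetical helper, the optional reset command is an assumption about how a measurement would return to a clean state between runs, and it measures wall-clock time of a command rather than anything inside qubesd.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=5, reset_argv=None):
    """Run `argv` several times and summarize wall-clock durations.

    If `reset_argv` is given, it is executed (untimed) before each run,
    for example to shut a qube down again before starting it.
    """
    samples = []
    for _ in range(runs):
        if reset_argv is not None:
            subprocess.run(reset_argv, check=True)
        start = time.perf_counter()
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Harmless placeholder command; swap in the operation to measure.
    print(time_command([sys.executable, "-c", "pass"], runs=3))
```

In dom0 the timed command could be something like `["qvm-start", "testvm"]` with `["qvm-shutdown", "--wait", "testvm"]` as the reset step, where `testvm` is a placeholder qube name. The same script would be run once per fstab setting so that only that one variable changes.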

@DemiMarie
Author

Can someone run a performance comparison of VM startup time with that fstab setting changed vs. the default, on large disks and with multiple clones + snapshots? And of total boot time to a ready-to-work state?

Can you do that?

@tlaurion
Contributor

tlaurion commented Nov 25, 2024

Can someone run a performance comparison of VM startup time with that fstab setting changed vs. the default, on large disks and with multiple clones + snapshots? And of total boot time to a ready-to-work state?

Can you do that?

I thought this was what we were waiting for before redoing the test bench, with a newer kernel version etc. No, I do not have a test bench that could produce test results with the desired impact.

@marmarek I suggest redoing the same tests as last time for comparison, whatever they were, but without synced discards in fstab. Maybe that could be part of the 4.3 feature freeze. My setup is stable without synced discards; the default install gets this wrong. We need numbers to prove it.

@tlaurion
Contributor

tlaurion commented Nov 25, 2024

The last test results are at #6476 (comment).

@marmarek said

We may re-test, and re-consider for R4.3.

@kocmo

kocmo commented Nov 25, 2024

without synced discards in fstab

On older hardware (Crucial MX500 SATA SSD) I even mount btrfs rootfs in dom0 with nodiscard, and run "fstrim -av" from cron daily - to decrease i/o load further.

When I was mounting with discard=async, I did observe TRIM activity contributing to i/o load.

Build times for our product have improved, but I don't have apples-to-apples benchmarks. Seat-of-the-pants feeling is that I have better responsiveness with nodiscard than with discard=async, when many VMs are doing heavy i/o. Both switching metadata to SINGLE and mounting nodiscard seemed to help.

@tlaurion
Copy link
Contributor

tlaurion commented Dec 2, 2024

I think discard=sync is an artifact of LVM settings being generalized to Btrfs.

I tried both async and no discard at all and saw no significant loss on NVMe, but I haven't tested on older hardware with an SSD, nor with an HDD. Note that on newer hardware with many cores, even without hyperthreading enabled, the host has plenty of CPU cycles to spare, and async just pushes changes down queues that do not even hit iowait.

My recommendation here would be to do one of the following:

  • remove discard from the 4.3 installer (for both LVM and Btrfs, relying on cron to trim when users leave machines up long enough)
  • apply discard=async when NVMe is detected, which implies newer hardware, if that's a good assumption
  • at least change the 4.3 ISO installer default from synced discard to async and postpone the rest, since all observations suggest that with synced discard we are shooting ourselves in the foot.

Once one of those is chosen, relaunch tests of lvm vs btrfs. Post results here.

Bees has launched a call for testing in which they finally changed their default crawling mode for the VM use case with multiple snapshots. I cannot guarantee I will put community time into this anytime soon, but since it would scratch my own itch, and since Heads space is limited and Heads cannot carry tools for every filesystem that exists, I will need to prioritize this accordingly. Personally I would not go back to LVM, but I repeat that the Btrfs installer defaults need to be revisited. The first step is to prove that one is better than the other and move this forward. I guess we are close to the 4.3 feature freeze, so it's now, or 4.3.1, or even 4.4? We also know from history that ISO installer defaults can be revisited after a major version install, but things get complicated when it comes to pool settings.

Some more input:

  • The Btrfs varlibqubes pool would benefit from being a subvolume of the rootfs (dom0) for everything related to wyng/bees, if there is still interest in that. The reasoning is simple: wyng needs to snapshot the pool/subvolume.
    • It would be better if varlibqubes were independent, at least at the subvolume level. @tasket advocates putting dom0 in a pool disk image, or separating it so that dom0 can be backed up easily, which is not the case right now.
      • Note that /boot backup is resolved on the wyng side. LVM has an advantage there, since dom0 can be snapshotted and backed up, which is not possible with Btrfs as of now.

@marmarek:

So, to have an apples-to-apples comparison, this needs to be tested in openQA on real hardware. Providing my own results here would be, in my opinion, invalid. @DemiMarie suggested providing end-user workflow stats such as starting a dvm, converting a PDF to a safe PDF, or images to safe images; but that won't cover the qubesd rotations that happen after a qube shuts down, where the iowait depends on reflink/snapshot rotation, which is what I hope we are talking about here.

When we start a VM, we are talking about revisions_to_keep, about the reflink pool, about LVM snapshots. We are talking about a lot of low-level behavior which, for an apples-to-apples comparison, needs to be tested under identical conditions and refreshed against the kernel versions used in dom0 today. Most importantly, it needs to match the use cases of those getting their feet wet with this: differences when launching VMs that have a big private disk to rotate, hitting the caches of SSD/NVMe drives, HDDs spinning and slowing us down. Launching an app in a dvm that depends on a small TemplateVM with no private disk is a totally void comparison for most of us.

And @sven already did a comparison on the Qubes OS forum from a cleanly installed OS with those stats, showing that an average user already saw great improvements just by switching to Btrfs for those basic tests. That was 2 years ago: https://forum.qubes-os.org/t/ext4-vs-btrfs-performance-on-qubes-os-installs/13585/4

@marmarek I still don't understand the results of the last QA run from last year. The conclusions seemed to be: better for some, worse for some, no magic improvement, so we stay with LVM. But it seems that those tests do not exercise templates with 2 revisions to keep, nor the big private volumes that would show a perf difference. Test results can only be as good as the tests. If the tests are not testing what they aim to test, then the results mean nothing.
