HDD performance decreasing a lot #298
On COW filesystems, it's usually only useful to measure free space fragmentation, because you cannot avoid fragmenting files - it's a feature of COW to do exactly that. I'm not sure if free space fragmentation can be measured in btrfs, but it can be optimized by external tools. See: https://github.com/knorrie/python-btrfs (maybe this toolset has a utility to measure fragmentation). Optimizing free space fragmentation should improve write performance because the HDD heads tend to move less. Remember: COW always writes data to unused space.
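A rough way to get a feel for this without special tooling - not an exact fragmentation metric, just a sketch with a placeholder mountpoint - is to compare allocated versus used space and spot-check individual files:

```bash
btrfs filesystem usage -g /mnt/backup    # a large gap between allocated and used
                                         # Data space hints at many partially
                                         # filled block groups (scattered free space)
filefrag /mnt/backup/some-large-file     # per-file extent count (placeholder path)
```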
Is this with
Wow, highly interesting. Thanks for the huge amount of development by @Zygo and testing by @kakra only 1 week after the initial comment and RC. But would already existing fragmentation not be healed by the new extent scan mode?
If I understood correctly, bees will not decrease fragmentation, but due to the way the new extent scanner works, it avoids a lot of the additional fragmentation that the old scanner introduced. It will still split extents and thus fragment some files, but the impact is greatly reduced. To use it, switch to scan mode 4 (or remove the parameter from your script; it's the default).
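A minimal sketch of what that could look like when bees is started through the beesd wrapper - the config path, the OPTIONS variable and the UUID are assumptions following the common packaging layout, so adjust to however you actually launch bees:

```
# /etc/bees/<filesystem-uuid>.conf (hypothetical example)
UUID=01234567-89ab-cdef-0123-456789abcdef   # placeholder filesystem UUID
OPTIONS="--scan-mode 4"                     # 4 = extent scan; omit to take the default
```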
As I'm building bees on Gentoo with the 9999 version, taking the latest repository files, how can I see which version I currently have?
It's the first line of output:
It also appears in
I'm maintaining that ebuild. The 9999 version builds from git master. It should have told you about the new scan mode at the start of the build; if it did, you're using the latest build. Something like
Today I'm going to start v0.10-57-ge403398-dirty on my problematic 1TB HDD. First tests on SSDs have been successful. Before that, I would like to optimize the free space fragmentation:
How would I do that? Wouldn't it be
There is a utility for that in the knorrie python-btrfs toolset. You can also use plain btrfs balance with a usage filter. Never use -musage. Generally it's better to run balance after dedupe, since the dedupe will leave behind a lot of random free space fragments; balancing before dedupe would move the data around before that free space even appears. Balancing also requires bees to scan all the data again to keep the locations in the hash table up to date. Minimizing balance using the usage filter keeps the amount of rescanned data low.
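A hedged sketch of the usage-filter approach - the mountpoint and thresholds are placeholders; on a nearly full filesystem, start with the low thresholds so each step needs little free space:

```bash
MNT=/mnt/backup                          # placeholder mountpoint
for pct in 10 25 50; do
    # relocate data block groups that are at most $pct% full
    btrfs balance start -dusage=$pct "$MNT"
done
# Note: no -musage here -- per the advice above, leave metadata alone.
```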
The knorrie version does a better job than plain btrfs balance. @Massimo-B, see above, I posted this:
Too late, I already did that. My knowledge about the internals of btrfs is insufficient. I tried learning from #btrfs on Libera so as not to pollute the tickets here with basic questions. From what I understood, please correct me if I'm wrong: btrfs organizes data in chunks and extents. Chunks are the physical organisation, extents are logical. Chunks are ~1GiB for data, extents are 4KiB up to several MiB. Chunks contain extents. Chunks are not necessarily contiguous and can be physically fragmented. Extents can be fragmented over chunks. Files can be fragmented over extents.

Now what does balance do? Reading man btrfs-balance, I'm confused, as it first describes "block groups". Balance does spread chunks over devices. What it actually does on single devices is not that clear. Section EXAMPLES -> MAKING BLOCK GROUP LAYOUT MORE COMPACT says it takes partially filled chunks and moves their extents to other chunks, then removes the empty chunks? That means the extents of a file are not changed, but extents are moved out of partially filled chunks to fill existing chunks to a higher usage? So this defragments the spread of extents over chunks?

So why do you say "never use -musage"? In the sense that with -musage=50 or -musage=85 the compacting done during the balancing could run out of free space, whereas -musage=0 only frees the already empty chunks?

Generally, I need the balancing on HDD to get rid of free space fragmentation. Would I need balancing on SSD/NVMe at all?

What does the knorrie version do better? Is it just the order of processing, because it sorts for the least used chunks first? Some -dusage=10 and -dusage=20 would also only consider less used chunks and move their extents to higher used chunks, maybe just not sorted.
No.
It makes metadata more compact, and thus more likely that btrfs will later want to allocate a new metadata block group---if that's not possible, you may be stuck, since even deleting things requires updating metadata.
I recommend you balance when you have little unallocated space left (but still a lot of unused space inside block groups), so that when a new metadata block group is needed, you have that space. My backup HDD performance also decreased quite a lot, but I think that's due to bees having "deduplicated" shared data between snapshots and thereby unshared its metadata.
It's the ordering which helps finish the process faster and with less IO pressure.
The loss of performance is concerning. Dedupe will leave behind a lot of small free space holes and btrfs isn't great at filling small free space holes. Balance won't fix that--it allocates on...64K? (power-of-2? minimum-length?) boundaries, so the new block groups created during balance are full of tiny holes too. Regular writes do a second allocation pass that eventually fills in the smaller holes so allocation gets fast again, but I don't know of any way to speed that process up.

Unsharing snapshot metadata can also cause a slowdown--or a speedup, at random--because it churns all the metadata and redistributes it across the disk surface. That will also randomly get faster or slower over time with normal writes, as they'll move metadata pages that are modified by normal day-to-day writes closer together and reduce seek time.

A dev_extent is a contiguous region of physical space on one device. A chunk is one or more dev_extents organized in a RAID profile. A block group is a chunk with a free space counter. Block groups and chunks are very similar, and the terms are used interchangeably in cases where the presence or absence of the free space counter is not important.

An extent is a contiguous region of logical space in a block group. A file is a list of references to extents, indexed by offset from the beginning of the file. Extents are always contiguous--one fragment is one extent by definition. Extents are created when writing to a file, and can be smaller than the written region if the written region exceeds the maximum size for an extent, or if the largest free space available in any block group is smaller than the written region.

An extent cannot be modified once it is written. Extent reference lists in files can be modified--overwriting part of an extent will result in the reference to the old extent being shortened or split, and a new reference to the new extent will be inserted into the file's extent reference list. nodatacow files are an exception to this rule, but only if the extent that is overwritten is not shared. prealloc files have extent references that can be overwritten exactly once.

Balance moves extents from one block group to others, possibly creating new block groups in the process, or filling in empty space in existing block groups. Each block group is deleted after its contents have been relocated. The product of balance is the free space left behind by the deleted block groups.
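A hedged way to watch that effect - the mountpoint and filter value are placeholders - is to compare the unallocated space reported before and after a balance:

```bash
MNT=/mnt/backup                        # placeholder mountpoint
btrfs filesystem usage -g "$MNT"       # note "Device unallocated" before...
btrfs balance start -dusage=25 "$MNT"
btrfs filesystem usage -g "$MNT"       # ...and after: unallocated space should grow
```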
Almost every use case for balance requires a script to balance block groups in the optimal order for that use case.

Extents cannot cross block group boundaries. Balance cannot change the length of any extent. This means balance cannot necessarily fill block groups to 100%. Balance on a filesystem that is over about 90% full (the point where problems start varies from 84% to 99.5% depending on what's on the filesystem) can sometimes run out of space because of these limitations.

Defrag copies data to a new extent, and asks the allocator to allocate the largest extents possible (or exactly the size of the file, if that's smaller). This request may or may not be honored, and in the latter case, defrag will replace a large extent with multiple smaller ones.
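To see that in practice, a hedged sketch (the path is a placeholder; note that defragmenting shared or snapshotted data unshares it and can increase space usage):

```bash
F=/mnt/backup/big-file                    # placeholder path
filefrag "$F"                             # extent count before
btrfs filesystem defragment -t 32M "$F"   # request extents up to 32 MiB
filefrag -v "$F"                          # list the extents actually produced
```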
I think btrfs really needs a tool that walks all the extents (possibly file by file), looks whether there's a chain of extents for a file which are not shared with other refs, and then rewrites that into a contiguous new extent - aka "snapshot-aware defrag". Then another pass would do something like "btrfs-balance-least-used" to fill the gaps that were created previously. Then, if needed, repeat (if the fs is quite full already, multiple passes are usually needed to bubble all the free space to the end of the storage). Though, due to chunked allocation, this works differently on btrfs.

This would be similar to the work that many modern defragmenters do on NTFS cloud images to move as little data as possible while leaving as few holes as possible: defragment, consolidate, repeat. This improves access through the backend storage because bigger holes result in more flash cell trimming, and fewer fragments result in fewer IO requests (especially if the backend storage is properly maintained). From my own experience, such a cycle on a regular basis improves IO a lot in cloud VMs.

Well, at least for SSDs. An HDD defragmenter gets much more performance back by doing proper spatial reallocation (to restore data locality of closely related data). This is hard to implement correctly, and almost impossible on btrfs (due to COW).
If I ran into this situation, what could I do? Again some basic questions for better understanding:
Data chunks are like a reservation on the filesystem and can't be used for other things like metadata? What is the usual way when initially filling a btrfs? Chunks are filled one by one, and if there is no suitable chunk left, a new chunk is allocated from free space? Without balancing, partially filled or empty chunks will never be deleted, so the chunk count only increases, never decreases in normal usage? So in order to have enough free space for new data, it does not matter whether data goes into partially filled chunks or into newly allocated chunks, as long as the allocation is not done for metadata and then left unused, or the other way round, that data chunks have too much allocated but unused space while metadata can't allocate new chunks? In that context, what does Free (estimated) mean? The man page says:

So "Free" means Unallocated plus Allocated-but-unused for Data only? But that never takes into account that Metadata could be blocked if no allocation is possible anymore?
Balancing is how you do that. You can run balance with a utility like btrfs-balance-least-used from the knorrie toolset. For "small" filesystems you can use
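A hedged sketch of the toolset approach; the option name for the usage threshold is an assumption here, so check btrfs-balance-least-used --help before relying on it:

```bash
# Balance data block groups in least-used-first order, stopping at block
# groups above roughly 50% usage (threshold flag assumed; mountpoint is a
# placeholder).
btrfs-balance-least-used -u 50 /mnt/backup
```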
Yes. If you consistently use a filesystem the same way, the ratio of data to metadata stays relatively constant, so when the filesystem is finally filled up, the data to metadata ratio will be correct for your workload; however, if you change the way you use the filesystem (e.g. using snapshots with the Due to asymmetries between data and metadata, you usually want to reduce data allocations but never metadata allocations.
The bad case is when metadata grows to fill all existing allocated metadata space while there is no unallocated space left to allocate more metadata. The kernel will try to force the filesystem read-only before that happens. In extreme cases, there is too much metadata committed to the filesystem before the kernel detects the problem, and then the filesystem cannot be mounted read-write any more.

Generally, if metadata has been allocated in the past, the space was needed in order to handle some prior peak load of metadata usage. So it's not usually a good idea to reduce metadata allocation, even if there seems to be a lot of unused allocated space. Metadata balance should not be attempted if there is less than (3 + nr_disks) * 1 GiB of free space in metadata allocations. If you have large snapshots, the limit for safe metadata balance may be significantly higher. These calculations are complex, and the data required to make them is not easily available from the btrfs tools, so we generally recommend that users never balance metadata unless there's a major event like a change in the number of drives in the filesystem (and even then, only attempt it after ensuring that data block groups have been reclaimed).

Data allocation has no such restrictions. You can deallocate as much data as you like, and btrfs will either allocate more, or give an ENOSPC error when writing data, without forcing the filesystem read-only. You can't overwrite data when data block groups are full, but you can always delete data as long as there is free metadata space.

Partially filled data chunks are slower to write to, since the largest fragment sizes cannot be used when all of the free spaces are scattered and small. Balancing data helps with that by relocating data to consolidate free space into larger fragments.
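A hedged sketch of checking that rule of thumb by hand; the mountpoint is a placeholder, and the comparison is left as a manual step because the usage output isn't trivially machine-parsable:

```bash
NR_DISKS=1
MIN_FREE_META_GIB=$(( 3 + NR_DISKS ))    # (3 + nr_disks) * 1 GiB of headroom
btrfs filesystem usage -g /mnt/backup    # compare Metadata "Size" vs "Used" by hand
echo "Don't consider -musage unless Metadata (Size - Used) >= ${MIN_FREE_META_GIB} GiB"
```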
"Free" space is the sum of:
Note that metadata allocations are never considered free space, regardless of how much metadata space is unused within the allocations. Also note that actual usable space for data may be less than the estimate if more metadata space is needed for the new data (e.g. crc32c csums are 0.1% of the data size, so if you add 1 TiB of new data, you'll also add at least 1 GiB of metadata). |
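The 0.1% figure follows from crc32c storing 4 bytes of checksum per 4 KiB data block; a quick back-of-the-envelope check:

```bash
DATA_BYTES=$(( 1024**4 ))                  # 1 TiB of new data
CSUM_BYTES=$(( DATA_BYTES / 4096 * 4 ))    # 4-byte crc32c per 4 KiB block
echo "$(( CSUM_BYTES / 1024**3 )) GiB of csum metadata"   # prints: 1 GiB
```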
But deleting data may not result in getting another free chunk that could then be allocated to metadata. That makes it even more important to run some sort of balance from time to time (or by automation). But balance is a heavy operation: it doesn't always generate a lot of IO, but it often keeps the FS busy with locks, leading to system stalls for varying periods of time. Thus, I do this manually from time to time. But you may also want to run a cronjob to do this in less busy hours. E.g., my NAS does this some time during the day when I am at work, and it doesn't do this at night because my backups run at that time.

@Massimo-B I still think all of this is a concern for you because you use btrfs as a backup vault, and try using bees to keep the vault as small as possible, because btrfs-send isn't good at preventing duplication. There are tools like borg or restic which are a lot better at this job on traditional file systems: less overhead from exploding fragmentation levels, better data density, better error recovery, better deduplication, less seeking, higher performance. But yes, you cannot directly browse the snapshots in the vault as a tree of files and directories (though you can use a fuse mount). However, I think that is what local short-lived rotating snapshots are for. Tools like
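A hedged sketch of the off-hours automation idea; the schedule, usage threshold, and mountpoint are all placeholders:

```
# /etc/cron.d/btrfs-balance (hypothetical): compact lightly used data block
# groups at 11:00 on weekdays, well away from the nightly backup window.
0 11 * * 1-5  root  /usr/bin/btrfs balance start -dusage=25 /srv/nas
```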
But as long as there is free unallocated space, metadata chunks can still be created?
I did balance metadata already, my mistake, but I did not run into the read-only situation yet. How can I repair the harm of the metadata balancing now?
I know, and I keep that in mind. I'm going to do some cold backups with either borg or restic from my central btrfs soon. As btrfs on HDDs had quite bad performance after deduplication (it might be better now after balancing and the new bees RC here...), the fuse mount onto a borg/restic repository might not be much slower.
But actually btrfs-send is as optimal as the snapshots are. The dedupe I only need because different machines are contributing their snapshots to that central btrfs. The performance of btrfs-send actually depends on the way of compression and decompression. I'm often collecting snapshots from machines over the network. I did some benchmarks in the past using different zstd levels with and without --adapt, having btrfs send use --compressed-data or not... There were big differences, as my btrfs is already zstd-compressed and leaving the compression alone without re-compression had advantages. I'm going to do it again soon and publish the results from commands like:
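(A hedged illustration only, not the original command set; the snapshot path and zstd settings are placeholders.)

```bash
SNAP=/mnt/backup/snapshots/root.20240101    # hypothetical read-only snapshot

# a) decompress on read, re-compress the stream with adaptive zstd
btrfs send "$SNAP" | zstd -T0 --adapt | pv > /dev/null

# b) send the already-compressed extents unchanged (no re-compression)
btrfs send --compressed-data "$SNAP" | pv > /dev/null
```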
Yes.
Prepare a large enough spare disk / partition, reboot to an environment that won't do anything to the filesystem automatically, and mount it and try |
Hi, I'm running several single-device btrfs filesystems from 1TB up to 4TB; most are SSD or NVMe, some are 1TB HDDs.
I use btrbk a lot to store snapshots on other btrfs filesystems. I'm running bees on all of them, especially on the bigger ones for deduping snapshots from different machines.
I plan to get one single 20TB HDD for NAS data, server data and also all the snapshots from other machines centralized on this big btrfs.
But while all non-rotating SSDs are fine, the rotating HDDs' performance is seriously decreasing, most probably due to fragmentation of extents? Usually I can measure fragmentation only per file, and I did not find any seriously high fragmentation. How could I analyze the overall fragmentation of the whole btrfs?
For instance:
My 4TB NVMe receiving snapshots currently has 441 subvolumes. btrfs filesystem usage -g says 1759GiB free, min 960GiB. It runs fine.
The 1TB HDD has 74 subvolumes, 200GiB free. But running btrbk, transferring ~100MB takes 10 to 20 minutes and does a noisy lot of IO. bees almost never finishes, even after days.
The planned 20TB HDD would be connected to a 24/7 thin client with a small CPU and only 8 GB of RAM. Due to the size, there could be around 800 subvolumes over time... I doubt that will work if I'm already struggling with the small 1TB disk. I'm quite sure that bees is responsible for the degraded performance. But without bees I would never be able to dedupe snapshots received from different machines.
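For the "overall fragmentation" question above, a hedged per-file survey sketch (the path is a placeholder; this only measures file fragmentation, not free space fragmentation):

```bash
# List the files with the most extents first, across one subvolume.
find /mnt/hdd/subvol -xdev -type f -print0 \
  | xargs -0 filefrag 2>/dev/null \
  | awk '{print $(NF-2), $0}' \
  | sort -rn | head -20
```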