Call for testers - v0.11 release candidate #296
Comments
First, I think the progress table should be separated by a new line, or have a header like the other sections.
Now I've updated bees, changed the scan mode to 4, and restarted it without touching any of the beeshome files. I've also set up various windows to watch beesstats, the bees log, and htop. I don't see how it could possibly be scanning all the subvols and snapshots again already; I can't imagine it being that fast:
There are no new lines in beescrawl.dat. I'd expect a new type of line to appear:
Maybe I just have to wait for some time? According to progress, it is probably done:
I've restarted bees, hoping that it would force writing beescrawl.dat, but there's still no new entry format. I'll let it sit for a while now. According to htop, bees is mostly writing data and not reading anything; that's probably just writes to the state files. According to journalctl, bees writes absolutely no logs (except version and option parsing), which comes as a surprise with the settings I'm using. Is it maybe still scanning, and the "progress" shows a bogus status until then? I can see the progress stats changing every now and then... It now looks like this:
Okay, I got one log now at least:
I guess it scans all the data once, rather than scanning every snapshot that references the same data repeatedly.
Yes, systemd reports only a small amount of data written instead (with
I wonder how often the stats file is updated; it doesn't seem to be very often.
There are two stat files: one in the persistent directory (beeshome), and one in the runtime directory.
Something went wrong...it's about as noisy as it ever was. Oh, wait, my statement here is wrong:
Extent scan can restart from the lowest completed subvol transid, so if everything was absolutely up to date, the extent scanner won't start over. On the other hand, if any subvol still has transid 0 (like in issue #268), then extent scan starts over from the beginning. The #268 situation is where extent scan really makes a difference.
Then I'll rename beescrawl.dat and let it start from the beginning. The progress estimates will come in handy here. :-)
That makes the decision to update way easier on my large and slow single-subvol fs.
I wrote the lines of code that create subvol crawlers two years ago. I guess I thought of this case then, and later forgot... :-P I've been mostly testing it on fresh mkfs filesystems, and huge filesystems with rotating snapshots where the subvol scans were stuck at zero. Both of those are cases where it does start over, or it's running for the first time so there's no previous progress.
Interesting... the periodic progress dump in the debug log does have the prefix, but I took the prefix out to make the progress table fit in 80 columns. How important is 80 columns? Do people still use terminal windows that small, or could it be wider, like 90 or 120 columns? Or does everyone expect JSON or HTML output these days?
At some point I want to add stats to this kind of table, or a separate table, to keep score: how many bytes were removed, average time for multiple scan cycles, reading throughput, and other performance metrics. I guess it comes down to whether it's better to make this table bigger, or make two tables, or two text rows per size (so 12 lines instead of 6). But that question can be answered after the data is available.
@Zygo I mean this in the status file.
Yay, looks good to me. You're on the x-mas path :-D
I didn't explicitly state this above: In the 268 case with new snapshots appearing all the time, the subvol scanners are always stuck near zero, so there's negligible progress to lose by starting over with extent scan. That's the basis of my "there's no real inconvenience when starting over with extent scan" statement, but that statement doesn't apply in other cases like a "one big subvol" filesystem.
Some of the improvements for processing individual extents make the subvol scanners faster too. Even starting over with a subvol scanner should take less time to reach the same point in the filesystem than in earlier versions.
It's not quite starting over, even in the worst case: if an extent already has blocks in the hash table, bees can skip the dedupe/copy analysis because blocks are only added to the hash table when no dedupe/copy operations are needed. Only reads have to be repeated (without the reads, we don't know what blocks to look for in the hash table). That can skip some metadata lookups that eat a lot of time.
A comparison of extent scan and subvol scan performance (bytes freed over time) in a filesystem with a single subvol. TL;DR extent scan seems to outperform everything. Duperemove added for scale.
This data set is designed to challenge block-level dedupers on btrfs. The "none" data set is about 160 GiB in 3 Windows VM image raw files, copied to a single subvol on a freshly formatted filesystem without compression or prior dedupe. The "zstd" data set is the same data, but written with zstd compression enabled.
"zstd" is a challenging scenario for both bees and duperemove, as the data is already compressed, the uncompressed extents are shorter, and correct alignment of duplicate extents is critical. Other btrfs dedupers can't handle compressed extents or block-level dedupe at all, so they score zero bytes freed on the "zstd" test, and it doesn't matter how long they run.
"none" is a more normal scenario with larger, uncompressed extents that don't slow down either bees or duperemove. Some other block-level btrfs dedupers can function on the "none" filesystem, but they free fewer bytes or take longer to run than bees or duperemove.
I didn't include a result for "zstd" with subvol scan because the subvol scan is still running on the "none" image, and there's no way the subvol scan on "zstd" could be faster than it is on "none".
Does this mean it can have an effect of recombining smaller extents into larger ones, aka defragmentation? Or does it look at extents only rather than chains of blocks? |
This is simple filtering: there's six scanners, and each one picks extents of a particular size from the input stream. So if the 128K scanner gets bogged down in compressed data, it doesn't stop bees from going ahead and finding some 128M extents for quick and easy free space. As a side effect, it means a dedupe will usually use a large extent as src, and make equal or smaller references to dst, so the biggest extents in the filesystem are preserved while the smallest extents get deduped out of existence.
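As a rough illustration of that filtering (the struct and numbers here are made up for the example, not taken from the bees source), each scanner can be thought of as a size-band filter over the same extent stream:

```cpp
#include <cstdint>

// Illustration of the per-scanner filter (made-up struct, not the actual
// bees code): all six scanners see the same stream of new extents, but each
// one only queues extents inside its own size band, so a scanner stuck on
// small compressed extents never holds back the scanner finding 128M extents.
struct SizeBandScanner {
    uint64_t min_bytes;  // exclusive lower bound of this scanner's band
    uint64_t max_bytes;  // inclusive upper bound of this scanner's band

    bool wants(uint64_t extent_bytes) const {
        return extent_bytes > min_bytes && extent_bytes <= max_bytes;
    }
};

// Example: a "max" scanner that takes everything above 32M.
static const SizeBandScanner max_scanner{32ull << 20, UINT64_MAX};
```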
No defragmentation yet. bees still only looks at one extent at a time, so it can't consider combining two or more extents together. bees also can't yet flip the order of src and dst if the second block it has seen is part of a larger extent. Both of these capabilities require holding locks on multiple extents at a time, and figuring out how to schedule tasks efficiently when two growing extents meet each other or want to use the same extent in different ways.
Garbage collection will probably come before defrag. GC is easier to implement because it doesn't require combining multiple extents together. btrfs doesn't have any tool that does GC properly yet, so GC fills in a feature gap for btrfs. In its simplest form, GC is rewriting all of the blocks that are still reachable when bees detects that any block of the extent is unreachable. Conveniently, there's now a place in the bees code where we have all that information.
@Zygo Is it correct behavior that the table never updates the transid until it reaches 100% in that category:
Then, what happens in the 128k category if I restart bees? Will it start again from transid 0?
The progress table is... a work in progress. It has a few quirks. The progress table shows the position that would be saved in beescrawl.dat.
Extent scans have longer queues of extents than the subvol scanners did. btrfs locks entire files for dedupe, so if bees tried to run parallel dedupe tasks on the same inodes at the same time, btrfs would force them all to run one at a time. This was easy to avoid in the subvol scanners because they're naturally organized by inode, but it's not possible to know what inodes an extent belongs to in advance. When scanning a big file, the scanner will end up creating tasks that can't execute in parallel because there's dedupe happening somewhere else in the same files. In order to avoid stalling the worker threads, the extent scanner will put these extents on a queue in memory, and try again to find an extent that can be scanned in parallel.
If you have a lot of different files, extents will usually belong to different files, and the queue will be short. If you have a few big files, extents will usually belong to the same files, and the queue will get long. If you have a mix of file sizes, the queue length will grow and shrink as needed. In all cases, the queue stops growing at a hard limit of 10,000 extents, and worker threads are allowed to become idle until the queue goes back below the limit.
Because the task queue executes out of order, there can be longer delays between an extent task going into the queue and coming out of it in extent scan mode than in subvol scan mode. That can freeze the data position in the progress table (and in beescrawl.dat) for a while. There are a couple of things that could improve this.
I observed a few desktop and application stalls which I didn't notice before. They were probably there pre-v0.11 but they were rare. Even with this RC, I didn't notice any until the progress table reached the 128k-only situation:
What does this mean? Does bees internally avoid concurrent access to the same file via locks, or does it really lock the file at the kernel level (you wrote "btrfs"), so that applications would stall when accessing such files? The stalls are quite excessive... We are talking 1-2 minutes per stall, tho usually it's in the 10-20 seconds range. But since bees has the 128k category running with an estimated 3-5 days left, this happens multiple times per hour.
I've seen that you also changed the toxic detection from 0.1 to 5 seconds, if I've read the diffs correctly. So that may be an explanation, too. To see the situation better, I purged some things. Here's my current table:
The ETA for 128k usually drops again after the other categories are done. Running on kernel 6.12.1.
If you change it back, does it detect a lot of toxic extents? Maybe we should keep those.
Yes, it has fewer competitors so it can use all the worker threads.
There might be a correlation between 128K and stalls. The stalls in btrfs are usually caused by delayed ref updates, and smaller extents = more extent refs = more delay. If that's the cause, the only solution is to dedupe those extents more slowly. When there's a mix of big extents to slow down the small extents, the system might be self-regulating, but when the big extents are done, maybe it starts flooding btrfs's delayed ref queues.
A first-time run on snapshots is going to be brutally slow too, for much the same reason: unsharing a snapshot's metadata triggers ~300 delayed ref updates for each metadata page. That gets to millions in no time.
Both! btrfs reflink and dedupe operations lock both inodes for writing while they do their thing (they also force a data flush, so if you're writing to a file during dedupe, it gets pushed out to disk as well). bees knows that btrfs does this, so when it sees multiple dedupe operations queued up on the same file, bees arranges the queue tree so one worker thread does all the dedupes touching that file. Other worker threads are then free to do dedupes on other files.
There is a fairly large gap between dedupe operations while bees does hashing and duplicate lookup of the next extent, so it shouldn't be locking out other users of the files for too long at a time, but all the locking (and flushing) will slow down overall write throughput.
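A minimal sketch of that queueing idea, with hypothetical types rather than the actual bees Task/Exclusion classes: dedupe jobs that touch the same inode get chained behind whichever worker currently owns that inode, so only one worker at a time issues dedupes against a given file.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <map>
#include <mutex>

// Hypothetical per-inode work queues: all dedupe jobs touching one inode
// are chained behind whichever worker currently owns that inode, so btrfs
// never sees two dedupe ioctls fighting over the same file lock.
class InodeSerializer {
    std::mutex m_mutex;
    std::map<uint64_t, std::deque<std::function<void()>>> m_queues;
public:
    // Returns true if the caller should run the job now; false means the
    // job was parked behind the worker that already owns this inode.
    bool submit(uint64_t inode, std::function<void()> job) {
        std::unique_lock<std::mutex> lock(m_mutex);
        auto &q = m_queues[inode];
        q.push_back(std::move(job));
        return q.size() == 1; // first job in: caller becomes the owner
    }

    // The owner calls this in a loop after each job until it returns empty.
    std::function<void()> next(uint64_t inode) {
        std::unique_lock<std::mutex> lock(m_mutex);
        auto &q = m_queues[inode];
        q.pop_front();                    // drop the job that just finished
        if (q.empty()) { m_queues.erase(inode); return {}; }
        return q.front();
    }
};
```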
Well, it looks like this, and it doesn't look much different during a short stall (I couldn't see it until the stall was over):
I'll try to capture a situation with a longer stall. This one was like 15-20 seconds: the web browser hung, and the console window opened with window decorations only; the console only rendered after the stall. I'm currently seeing micro-stalls of 1-2s multiple times while writing this text. OTOH, on another system, kernel 6.6, bees mostly idle, it looks like this (grepped for btrfs):
So what are the options? I'm seeing a less pronounced version of this behavior with 512k extents, too. Everything above that seems to be okay. bees could throttle this if only 128k and 512k are left... or bees could look at the slabs and throttle itself? Limit the number of threads working on those buckets?
I think bees already throttles itself due to these stalls... It just doesn't do so early enough: the system stalls, loadavg goes up above the loadavg limit set in bees, and bees throttles down... This works for a while until loadavg goes down enough (it currently varies between 2 and 20 every few minutes), then the cycle repeats.
It can read the 128k extents at 40-50 MB/s, tho. It's probably just when it does something with them.
Yeah, delayed refs have been keeping btrfs slow and laggy for (checks watch) 14 years now. Maybe we want a throttle feature, "don't dedupe more than 150 extent refs per second" or something. The extent scan speeds things up so much that we can consider backing away from maximum performance and still get useful amounts of dedupe done.
One experiment you can quickly try is to simply reduce the thread count to 2, or even 1. bees might now be able to keep up with filesystem updates with a single thread, and that will reduce the number of delayed refs in the transaction commit.
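For example, a cap like "no more than N extent refs per second" could be a small token bucket wrapped around the dedupe submission path. This is only a sketch of the idea; bees has no such option today, and the numbers are placeholders:

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

// Token-bucket limiter: before submitting a dedupe that will create `refs`
// new extent references, wait long enough to stay under the configured rate.
// The bucket is capped at one second's worth of refs to limit bursts.
class RefRateLimiter {
    double m_refs_per_sec;
    double m_tokens = 0;
    std::chrono::steady_clock::time_point m_last = std::chrono::steady_clock::now();
public:
    explicit RefRateLimiter(double refs_per_sec) : m_refs_per_sec(refs_per_sec) {}

    void wait_for(double refs) {
        auto now = std::chrono::steady_clock::now();
        std::chrono::duration<double> dt = now - m_last;
        m_last = now;
        m_tokens = std::min(m_tokens + dt.count() * m_refs_per_sec, m_refs_per_sec);
        m_tokens -= refs;                       // may go negative (debt)
        if (m_tokens < 0) {
            // Sleep long enough to pay off the debt at the configured rate.
            std::this_thread::sleep_for(
                std::chrono::duration<double>(-m_tokens / m_refs_per_sec));
        }
    }
};

// Usage sketch: RefRateLimiter limiter(150); limiter.wait_for(refs_in_this_extent);
```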
I went from 6 threads down to 3 now. Let's see how it changes the slabs... Currently the slab value is low. BTW: after a restart, bees properly got back to 71% progress immediately in the progress table, although it shows transid 0. Yay! \o/
In extent scan mode, bees only looks at subvols 250-255, the virtual subvols that are used for the extent scan size tiers. There's some vestigial subvol state tracking that shouldn't be there in extent mode--it's creating new subvol crawlers for the new subvols, and also deleting subvols that no longer exist. Extent scan never runs the crawlers for the new subvols, so they just sit there, idle. It's mostly harmless, and shouldn't affect testing of extent scan itself. In the progress table, you can see that the bottom end of the transid range is still recent:
The current transid is 4080962, and the min transid for this row is 4079030. It should be able to catch up soon (the "16h 2m" part is... not very accurate, in my experience). So the tracking for extent scan seems to be working. I might fix this by just moving the extent scan "virtual subvols" into a different tracking file, further decoupling it from the parts of bees that track literal btrfs subvols.
More to the point, the entire 128k category is less than 10% of the total, so even if it was 100% done, it would only save so much space.
One of my TODO list items is to make the size ranges configurable. That's another reason to put the extent scan tracking into its own file--I can add things like "min" and "max" size fields for each line. Lines removed by the user can stay removed, so you'd be able to remove the 128K category by removing the line.
Another way to manage this is to simply stop bees for a while once all the higher size rows are marked "finished". If bees is truly keeping up with data in the higher sizes, it's consuming new data faster than it's appearing in the filesystem. It follows that bees doesn't need to run all of the time. It could be banished to a maintenance window.
The main problems with "toxic" extents in a post-5.7-kernel world are:
So most of the toxic extents that show up now are false positives, and we burn a lot more sys time checking for them than we save by finding one, now that they're more rare and bees avoids producing new ones.
Higher sys% now might simply be the result of more efficient scheduling--bees keeps more threads active and doing productive work than before, resulting in higher CPU usage across the board. Existing users may need to reduce load parameters or thread counts.
This also points toward the need for explicit throttling in bees itself. If we don't want bees to use more than 20% CPU, one easy way to achieve that is for bees to sleep 80% of the time. I've been experimenting with a throttle algorithm that tries to detect btrfs transaction latency spikes, and adds sleeps between dedupe calls to try to flatten them. Weirdly, it's coming up with "sleep 7-10% of the time is good enough", and the impact on benchmarks when doing that is to make dedupe faster. I guess those latency spikes hurt performance too? I'm still collecting data on this.
Does a systemd config option work for that?
Oh, I see... Interesting. Yes, a separate file might make this more obvious for people who don't know the internals of btrfs.
I like this idea. But it should be well documented how this feature should be used, e.g. why one would change min or max...
I'll try that if I don't see progress on the higher size ranges for the next few days.
If this happens to random victim threads, and I'm guessing this applies to or affects user threads/processes that do IO, why can't the kernel just spawn its own "victim thread" which can block and wait?
I'm pretty sure I did a full restore at least once post-5.7, so my btrfs should contain no more such toxic extents? (restored from borg, not some btrfs send mirror or clone) Or asked differently: How recent are those changes to bees?
Yeah, I was suspecting something like this after looking at the performance graphs for some more time.
At least it looks like it hurts the performance of bees, because the ETA spikes, too. But that's probably the nature of the ETA calculation; I don't think it has that much of an impact overall. At least, after the bigger sizes were done a few days ago, 128k took only 2 more days instead of the projected 6 weeks to 1 year. IOW, especially with the smaller sizes, the ETA is very inaccurate and depends a lot on the progress of the higher size classes.
But I can say that I'm seeing very bad performance behavior on the desktop while bees is chewing through the data: Chrome randomly freezes when clicking a link, and gets stuck for multiple minutes or until I kill and restart the browser. Web video playback randomly freezes and eventually recovers after reconnecting to the stream. The KDE plasma-shell randomly freezes and usually doesn't recover on its own; instead it spends 100% CPU somewhere and needs to be restarted (which takes 30-60s using systemctl).
If bees is done working, all of the above is hardly an issue. I'm seeing only very rare stalls, and only for like 1 or 2 seconds. This may not even be caused by bees itself.
I was thinking exactly that when this popped up on my screen...
Pretty much. It's the cost of repeating a range of extents vs. scanning an entire subvol from scratch. Extent scan looks like this (one for each size tier):
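Roughly, in an illustrative C++-flavoured sketch (stub names, not the actual bees code):

```cpp
#include <cstdint>
#include <vector>

// Stub types for illustration only.
struct Extent { uint64_t bytenr, length; };
struct ExtentRef { uint64_t root, inode, offset; };

std::vector<Extent> new_extents_in_size_tier(uint64_t min_transid, uint64_t max_transid);
std::vector<ExtentRef> find_all_references(const Extent &e);
void scan_and_dedupe(const Extent &e, const std::vector<ExtentRef> &refs);

// One pass of the extent scan for a single size tier: walk the extents that
// appeared between min_transid and max_transid, and handle each extent's
// references all at once, or not at all.
void extent_scan_cycle(uint64_t min_transid, uint64_t max_transid) {
    for (const Extent &e : new_extents_in_size_tier(min_transid, max_transid)) {
        scan_and_dedupe(e, find_all_references(e));
    }
    // The saved checkpoint then advances to max_transid.
}
```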
For each extent, extent scan finds the list of references that are present at that time, and must act on all or none of them at that time. If we can't do all the dedupes then the extent remains in the filesystem, taking up space. Subvol scan looks like this (one for each subvol):
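And a matching sketch for the subvol scan loop (again stub names, not the actual code):

```cpp
#include <cstdint>
#include <vector>

// Stub types for illustration only.
struct Inode { uint64_t ino; };
struct FileExtent { uint64_t offset, bytenr, length; };

std::vector<Inode> inodes_changed_in_subvol(uint64_t subvol, uint64_t min_transid, uint64_t max_transid);
std::vector<FileExtent> extents_of(uint64_t subvol, const Inode &ino);
void scan_one_reference(uint64_t subvol, const Inode &ino, const FileExtent &fe);

// One pass of a subvol scan: walk each changed inode in this subvol and
// visit its extent references one at a time. The same physical extent gets
// revisited once per snapshot or reflink that points to it.
void subvol_scan_cycle(uint64_t subvol, uint64_t min_transid, uint64_t max_transid) {
    for (const Inode &ino : inodes_changed_in_subvol(subvol, min_transid, max_transid)) {
        for (const FileExtent &fe : extents_of(subvol, ino)) {
            scan_one_reference(subvol, ino, fe);
        }
    }
}
```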
When there's a subvol that can't be modified, subvol scan can easily avoid the subvol--we simply stop the loop for the one subvol, and let all the other subvols continue to run. We can resume later by starting the loop again. Note that in both cases, an immutable subvol means dedupe of some extents can't happen. At best, we can remove all of the mutable references, so that only immutable references remain. Later when those references are deleted, so is the duplicate extent. Getting back to the fallback question:
There could be a fallback, but it wouldn't be scanmode 3. There's no way to map a scanmode 4 data pointer to a scanmode 3 data pointer, so it's only possible to scan the whole subvol across the transid range. If any crawler has transid 0, any subvol scan must start from transid 0 (you may recall this pattern from the subvol->extent scan transition. It works both ways). The end result is that we'd often end up reading everything in a snapshot if send happened to be running when bees tried to dedupe it.
We would have to create a new kind of extent scan virtual subvol that scans only extents that have at least one reference from an immutable subvol (call this a "type B extent crawler") and put that crawler on pause immediately. The regular extent scan ("type A") would then ignore any extent that has a reference from an immutable subvol. If the subvol becomes mutable, then we put an upper limit on the "type B" extent scan crawler's position so it'll stop when it has caught up with "type A", pause the "type A" crawler, then start the "type B" extent scan crawler running (and switch back to "A" when "B" is done). If there are multiple subvols where send is coming and going, we'll need to dynamically create new "type B" crawlers every time a send starts.
So these extent scan crawlers would have a state like "start at 20T in transid 12345..23456 and end at 5T in transid 23456..34567, only for extents that reference subvol ID 7115,7232,7453". Definitely can't put that in beescrawl.dat as it is today.
It's possible, but it's potentially like the optimization for parallelizing reads on block groups by drive: it's a complex subsystem that 99% of users won't be able to use, because it turns out all of their data is in the same schedulable unit. If all of your data is touched by btrfs send, the fancy optimization degenerates to stopping and starting the entire process, but with more incomprehensible progress tracking and more room for bugs. It'd probably burn a lot of CPU too--those extent maps aren't cheap, even if you don't read the data.
Do you actually get any benefit from the lower amount of fragmentation on existing data, on an existing file system already scanned and deduped with an older bees version? Is there any way to get that benefit?
Currently, bees doesn't attempt to reduce existing fragmentation.
The new anti-fragmentation rules allow defragmented extents to survive dedupe.
Older versions of bees would attempt to replace all larger defragged extents with any surviving references to smaller fragmented extents, undoing any benefit from defrag.
For filesystems with snapshots, it's possible to feed small batches of files (i.e. the same files from every snapshot) through defrag.
That means bees itself introduces less fragmentation now. But the order of usage would still be first defragmentation then bees deduplication?
...sounds more like running defrag after bees? Doesn't defragmentation revert some of the achieved deduplication? I just remember never using "btrfs filesystem defragment". Can't remember why; it conflicted with either the old bees or send/receive by btrbk.
So there already is a list of skipped extents in the hash table for future dedupes. Can't that be used for the blocked btrfs-send areas in order to retry deduplication later?
Right now, before trying the new bees RC, I'm balancing my very slow HDD btrfs (#298 (comment)).
The question from KeinNiemand was about recovering after running earlier versions of bees, so the order is:
Steps 2 and 3 could be done at the same time, i.e. upgrade bees first, then run the defrag, and let bees follow behind the defrag with dedupe. bees always acts after everything else has.
Once defrag is done, bees will treat the data the same way as if you had created a new file: if it finds existing duplicate copies of the data, bees replaces the new file with reflinks to the old files; however, it will not replace all duplicates if the resulting fragmentation exceeds the thresholds in the new bees version. Thus, dedupe undoes defrag, but not all of the defrag. The trick is to find the point where the fragmentation produces productive amounts of free space proportional to the cost of having the fragments, and bees is much closer to that point now.
In theory, there will be less fragmentation and less dedupe. They are tightly coupled together.
In practice it's not quite that simple. New bees makes better decisions than old bees when multiple matches are available or when matches overlap. Subvol scans are a bit "lossy" when it comes to partial matches, and the extent scan doesn't have those kinds of losses. So it is possible, even likely, that the new bees dedupes to a smaller size than old bees, even as it uses fewer fragments to do so. Note this will only work once, and only when starting with a filesystem that has already been shaped by old bees--after that, theory takes over, and dedupe won't return all of the space that defrag takes up.
No, there is simply a list of locations for each hash. If the hash is matched by a future dedupe, that dedupe will see multiple possible locations for the data, and pick the one that results in fewer fragments. Usually that will be the longest one, but there are sometimes exceptions, e.g. a smaller existing extent of exactly the same size will be chosen if the alternative is to split a large extent into smaller pieces to make a match fit.
No. I mean, it's possible, but there are several known ways to handle blocked btrfs-send areas, and all of them are better than using the hash table to do it.
Here's something I don't understand:
The table tells me that "max" is finished, and deferred. But it still has threads running on it. Shouldn't the table tell me that it is still working on the max category?
Check the status file. There are two main sources of progress lag.
For highlighting that bees was finished, I usually had:
I just tested the master branch on a set of 311GB (pre-bees) home directories, which are now down to 255GB. Sometimes the progress table seems to completely hang, in which case only removing the hashtable helped. That happens mostly when I restart bees though, and dedupe seems to continue working.
It's "Crawl finished" for every subvol (all 6 for extent scan, or whatever number of subvols there are for subvol scan). I should put that message back, because we can hook into that message and terminate the process. It has to be written differently for the various scan modes. |
Can you clarify that? What does "the progress table seems to completely hang up" mean? Do you have a status file or debug log from such an event? Removing the hashtable would break dedupe of any old, previously scanned data. That doesn't sound like it would be particularly helpful, and I don't know how it would avoid a hang.
Sadly, no files are left from that run; I removed them as a troubleshooting step. But it's just the table not changing anymore for two hours in my case, while the lowest ETA was around 3 minutes. Restarting bees didn't get it to update either.
Bees itself doesn't hang, I think, judging from the still-updating crawler messages in /run/bees/[uuid].status. Shouldn't removing the hashtable force a rescan? Or is the state saved somewhere else too?
Scan position is saved in beescrawl.dat, not in the hash table.
The end stages of a first filesystem scan might not update progress for some time, possibly hours if there's a lot of small duplicate extents left in the queue at the end. All it takes is for one extent to get kicked to the back of the queue, then progress won't update until the entire queue is done. If new files are added to the filesystem, bees should go dedupe them and then come back to the smaller ones, so in normal daemon usage this shouldn't be a problem. If you're waiting for progress to say "finished", then it might be.
The progress will also not update if there's no new transactions on the filesystem to scan. That doesn't seem likely if there are still active crawlers though. That case should still get listed in the docs.
Ah, sorry then. I deleted |
I've been running this version for a few weeks now (starting from an incomplete scan of the old version; incomplete because I added a drive, did a balance, and had to delete the hashtable), and progress seems to go up very slowly. It also seems to have been at ~40% completion for the size with the most progress for a while now.
Yeah, I've seen this multiple times now, too. It simply sits there seemingly doing nothing, but it will eventually finish. You will probably see bees reading some data from time to time, and burning CPU calculating hashes and extending block ranges. What bothers me more, and maybe this is one instance of this problem: sometimes the table says "I'm 100% done" but the transid clearly indicates it's still thousands of transids behind for one size category. Currently, it's saying:
So essentially there's 0 bytes of 2TB done, which correctly equals 0%, but the "ETA" says "finished", and the transid is what can be called "done" (it lags 1 transid behind). To me this looks like the table pulls data from completely unrelated runs, e.g. the "done" column is from the current run, the total column is from some previous run or from some total counter, "%done" is calculated from who knows what, and "ETA" says finished even tho "%done" is not even near 100%. Additionally, the "%done" is exactly equal in multiple category rows. I don't think that is realistic.
Well, maybe this means: in the current run, there was no data in the transid range, but in the whole filesystem there are 2 TB of data in extents over 32 MB. But then: what does "%done" mean? How do these columns relate to each other? In the other categories, "%done" hardly reaches 100% and lags behind all the time.
At best, "%done" is a very bad guess. And that carries over to the ETA. Or maybe there are just race conditions in the stats maintenance of that table data.
%done is calculated by:
Step 2 sometimes returns nothing if there aren't any btrfs extents in that transid range, so we get a zero there (e.g. if transid 4156823-4156824 contained only deletions or metadata modifications, no new data blocks). What the table above is saying is "there were no new extents <= 512K in size between transid 4156823 and 4156824." More often we get a small number of extents, less than 10000, so in a few seconds we've queued everything and get to "finished".
All the sizes see the same extents, they just ignore anything outside of their own range. So if there's only one extent modified in transaction 4156822, and it's at 92.7924% of the distance between the beginning of the filesystem and the last allocated byte, then all the scanners will point there at the end of the cycle, because it's the last extent they've seen. Only one of the scanners will actually process that extent.
Step 5 is very biased toward whatever was most recently written to the filesystem. If bees is idle and you write a thousand small files, then small extents appear to grow relative to the other size bands, because most of the extents bees sees will be small. This causes wild swings in the size-specific data size estimates, especially if some of the crawlers are idle--their extent size will no longer be represented in the statistics, so their size shrinks while the others grow.
The size data will become important to a sysadmin when it becomes possible to delete unproductive crawlers from the table, but maybe the data shouldn't be collected the way it is now--maybe it would be worthwhile to make it a startup task, or even a separate tool that does a proper statistical sample. Mostly what we need to know here is the maximum possible benefit from allowing the crawler to continue to run. If you have 300TiB of data and the 128K size is only 6 TiB, you can only get 1% of the space out of that crawler if every single extent in that size range is a duplicate. That assessment doesn't change very much if the estimated size is double or half that number--2% is no better than 1% if it takes 3 years of IO time to get the space back.
The progress table is a snapshot of the previous cycle taken just before a new cycle is started. If the snapshot was taken after the new cycle started, we would never see "finished". If a crawler finishes in the middle of a scan cycle, but there are more transids available, the crawler will restart itself immediately. That can lead to the crawlers getting a little out of sync with each other (like the 32M and max rows), but they'll all eventually cover all transids.
Scan cycles start from an idle task. If all of the worker threads are running non-idle tasks (i.e. actual dedupes) then the scan cycle task never gets run. There's no need to sign up for more work if all the workers are busy processing extents already in the queue. Using an idle task does freeze the progress table, though. Maybe the table generation should be split off into its own non-idle task, so the table is always up to date even if we didn't queue any new work. Apart from anything else, if we didn't do any new work, the ETA should be updated with a future time.
I guess my question is whether this table is useful, or how it could be improved? A lot of the problem is just lack of available information at zero cost. I can get better information, but it costs more IO (it'll be reading big parts of the extent tree multiple times).
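For reference, the position part of the estimate boils down to something like this simplification (made-up names, not the actual bees code), which is why one extent written near the end of the allocated range drags every size tier's position toward 100%:

```cpp
#include <cstdint>

// Simplified view of the position-based figure: the byte address of the
// last extent a crawler has seen, divided by the highest allocated byte in
// the filesystem. It says nothing about how much work remains in the queue.
double percent_done(uint64_t last_seen_bytenr, uint64_t last_allocated_byte) {
    if (last_allocated_byte == 0) return 0.0;
    return 100.0 * static_cast<double>(last_seen_bytenr)
                 / static_cast<double>(last_allocated_byte);
}
```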
Maybe the stand-alone tool is a better idea--something that takes a proper sample of the entire filesystem for a few minutes, then reads the saved crawl state to estimate progress.
Yeah, I think it's exactly that, and that confuses people - especially if they eagerly waited for the "progress bar". So maybe "%done" should be pulled out of the table if it is more or less a global position anyways, and displayed as a global value instead.
The other issues are where certain combinations of information per row do not give a conclusive view, e.g. saying "0" here but "finished" there. I'm not sure how to solve this. Maybe it should say "finished" or "waiting" in the first column then, and say "n/a" for ETA instead. And technically, if there's nothing to do because a transid range didn't have extents of that size, the progress is probably 100% and not 0% (more correctly, it's probably "n/a"). Maybe this already fixes the most confusing part of the table?
While I like the idea of a separate tool, I don't think you should focus on such a tool too much. I think there are more useful things, like recombining extents (aka defrag) - if that's in a state where you could make progress. The progress table is nice but currently it's confusing. It looks to me like most recent reports focus on the progress table for that reason (but I may be biased because I'm active here). I'd prefer the "confusion" part to improve first for a v0.11 release, and only later focus on accuracy.
For me, bees works a lot better and faster now, maintains a better file system layout, and introduces fewer stalls. These are great changes and they are important. I'd prefer to get a "stable" release first, ignore accuracy, but at least fix the confusing parts of the progress table so we'd get fewer reports and questions about it. So what are the confusing parts?
So maybe, if a row doesn't make visible progress for some time, it's best to just clear its ETA (and say "waiting" instead, or something), and probably include some info about which "position" the crawler is currently at (if that is possible) instead of just showing a static transid range for very long periods of time.
To elaborate on that a little more: For me, flatpak updates seem to work faster now, with fewer IO stalls, after I defragged. Also, many desktop actions now have fewer IO stalls when writing data, or when copying large files from my local drives to my NAS (even if those large files - stream video recordings - are on my xfs partition). I'm not sure how KDE Plasma interacts with btrfs on my system.
It's definitely an independent value that only sometimes synchronizes between scanners. The idea is that the large extent scanners cherry-pick the largest extents, giving back cheap free space quickly, while the smaller extent scanners take longer to free a possibly larger total amount of space. In terms of progress, they'll lap each other, like hands on a clock: the big extent size crawlers are hundreds of times faster than the small ones. The big ones are expected to run a full scan cycle and go into and out of the idle state many times over, in the time that the small extent size crawlers finish the first scan.
If you look through the commit history, one of the versions of the progress table blanked out the two "done" columns when the scan was finished. We could go back to doing that. The "done" column seems fairly useless--it's not a raw variable, or even a cooked one. It's overcooked. So I guess we can just drop it outright. The position always ends up as a number that looks like a size or a percentage. I could say "0.927" instead of "92.7924%" but it's still going to go from 0.000 to 1.000 on each pass. I guess one solution would be to drop it entirely, and have only the ETA. The ETA is the time until the end of the current cycle, i.e. when the transids will be updated and new data will be considered for dedupe. It's not the time until bees becomes idle, which is much harder to predict. bees becomes idle when all of its crawlers are idle. The amount of work within each transid is unpredictable--it's the amount of new data that was written to the filesystem, whatever that is. A percentage based on the transids will end up cycling from 99.999% to 100%, with more 9's added as the filesystem gets older.
I've added some timing data on extent dedupes and I'm finding there are a few cases where it does take hours or even days to process the extents in the queue. It doesn't seem to be causing any other problem--dedupe continues on other files, even new scan cycles start--but the progress can't move forward because there's an extent in the queue at the end of a long task tree branch. It might be a problem for the "run bees for an hour a day" model since it would restart the hours of processing at the beginning. There is also some unfair scheduling, particularly for reads that use an ordinary mutex lock for synchronization.
"idle" is shorter, and already used for bees status messages when a thread is intentionally not doing anything. Something like:
This has a new column to give the time when a crawl started. This would give an idea of how old the data is that will be scanned. I also moved the heading data into a footer row, and used "gen" instead of "transid" because it's shorter.
Maybe you could create something like a deadline scheduler for the queue? Like "if an extent is in the queue for more than X minutes, move it to the front", then work from there with some sorting to prevent a feedback loop which moves extents to the front too often, or protect the queue from this deadline action for some minutes... Since there can be hours of work, X probably should be calculated dynamically or derived from some other value (like the size category).
Maybe fill the remaining empty columns, too, with the min and/or max times?
I like this change. Let's see how this works. I'll report back (when the change is pushed). Maybe I can recreate some situations which you described by this:
Thanks. :-)
That's one possibility. We could set a minimum dedupe/scan rate, and if an extent sits in the queue too long, we just drop it. If we spend an hour on a 128M extent, that works out to deduping 20 TiB in about 20 years...
It looks like the problem case is that we have thousands of extents on a single file, which wouldn't be a problem by itself--but the queue FIFO list is "extent 200-7000, then extent 12-199", because we got through extents 1-199 before discovering that extent 12 shares an inode with extent 200 that was already running in another worker thread. Extent 200 had been running for a while, and extents 201-7000 ended up in the queue waiting for extent 200's lock, while extents 1-11 were running in the other worker thread. Extents 12-199 that were already waiting for extent 11 get appended to the end of extent 200's lock queue, after extent 7000. That means that after extent 11, we're suddenly 6800 extents away from dequeuing extent 12.
Maybe we can move the Task queue branch into the Exclusion object, and make it a sorted list instead of FIFO. Then in this example we would insert tasks 12-199 after task 200, so the queue is "extent 200, 12-199, 201-7000" (extent 200 is already running so it can't be moved), i.e. the lowest address is always next on the queue for the worker.
I don't plan on fixing that. My plan is to get csum scan and extent dedupe up and running (we currently have extent scan, but it's still using extent-ref-based logic for dedupe, and the scan read optimization is a nasty hack). Extent dedupe maintains a sorted and optimized list of dedupe operations that is even more out-of-order, but it is also better equipped to avoid committing resources to extents that are not a productive use of time. csum scan doesn't get rid of the unfair scheduling on data reads, but it does make the individual Tasks faster by doing the dedupe hash search on the csums, thereby eliminating most of the data reads (only duplicate and compressed reads will remain) and also a lot of unnecessary LOGICAL_INO lookups.
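A sketch of that "sorted instead of FIFO" idea, with hypothetical structures rather than the real Task/Exclusion code: tasks parked behind a busy extent lock are kept ordered by extent address, so extent 12 runs right after extent 200 instead of after extent 7000.

```cpp
#include <cstdint>
#include <functional>
#include <map>

// Hypothetical lock queue keyed by extent address. Tasks that collide with
// a running dedupe are parked here; the worker that owns the lock always
// takes the lowest-addressed waiting extent next, instead of the oldest.
class SortedLockQueue {
    std::map<uint64_t, std::function<void()>> m_waiting; // bytenr -> task
public:
    void park(uint64_t bytenr, std::function<void()> task) {
        m_waiting.emplace(bytenr, std::move(task));
    }
    // Returns an empty function when nothing is waiting.
    std::function<void()> take_next() {
        if (m_waiting.empty()) return {};
        auto it = m_waiting.begin();          // lowest extent address first
        auto task = std::move(it->second);
        m_waiting.erase(it);
        return task;
    }
};
```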
The new table looks good. I think renaming to "this cycle" and "next cycle" is a good change. I'm not sure what "point" is, but it seems to be some sort of percentage hiding in disguise. That's probably a good change, too. "ctime" is probably the estimated remaining time. I've also run a defrag across a large part of the filesystem.
Yeah, divide it by a million and multiply by 100 and it's a percentage. I think the column might only be useful for verifying the ETA calculation.
Suggestions for that column header that are 7 characters or less are welcome! "tm_left" would fit...
The accuracy depends on the size of the data set and how uniformly distributed it is. So if you do some consistent operation (like defrag, which selects short extents) across a big enough part of the filesystem (like 1% or more), it should come out with smoothly increasing, predictable numbers.
@Zygo I tested this build on another (much slower, both cpu+io) system. I defragged a lot of files (probably almost the whole system). I don't know what you changed during the last week but bees works a lot faster now and the new progress estimation is a lot more accurate. |
Okay, here's a table where ETA actually fell behind:
[nothing useful to say, except thank-you to all the devs and testers.]
I can't answer that without knowing what's in your execution queue. If there are hundreds or thousands of tasks reported in the status, then it's likely that it will take a few minutes before a worker becomes available to update the progress tracker.
What seems to be happening here is that the progress/crawl restart task now has the same priority as the extent scanning tasks, but the extent scanning tasks won't release a worker until every reference to their extent is processed or they hit a lock conflict. This avoids thrashing the page cache with too many swaps between different extents and keeps the cache hot so we don't have to reread data. I could add a higher priority level, but priority doesn't matter if no worker thread even looks at the queue, and I don't want to start interrupting extent tasks because it's faster overall to wait them out.
At some point I'd like to have an option where users can specify the minimum distance between transids, so e.g. you can tell bees to wait until there's 100 transids between gen_max and gen_now before restarting a crawler. That would avoid deduping extents in files that are frequently being overwritten, and maybe generally help with efficiency by not doing as many unproductive tree searches. So I'm not too worried about bees stopping some of the crawls for a few minutes at a time, because that's eventually going to be a feature for filesystems where bees is idle most of the time.
If all of the crawls are showing "idle", then I'd expect a progress update every N seconds, where N is the average time between recent transactions on the filesystem, because there's no queued work to delay running the progress update task. N is usually 30-60 seconds, but on a very idle filesystem, bees could go to sleep for up to an hour (the hardcoded limit).
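The transid-distance option could then be little more than a gate in the crawl restart path (a sketch; the option and helper names are hypothetical):

```cpp
#include <cstdint>

// Hypothetical gate for restarting a crawler: only start a new cycle when
// the filesystem has moved far enough ahead of the last completed scan.
bool should_restart_crawl(uint64_t gen_max_completed, uint64_t gen_now,
                          uint64_t min_transid_gap /* e.g. 100 */) {
    return gen_now >= gen_max_completed + min_transid_gap;
}
```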
I'm getting ready to do the v0.11 release, and this one is larger than most of the previous releases, so I'd like to do more testing before putting a tag on it.
Please try it out, and comment here if it works for you, or open an issue if it doesn't.
Here's a list of the new things to try:
Extent scan mode
Extent scan mode replaces the historically problematic subvol scan modes.
Extent scan is now the default mode. If you are using a -m option, remove it or replace it with -m4 to enable extent scan.
Extent scan enables several new features:
Extent size sorting
Skip over the small extents to dedupe the large extents first.
The filesystem is divided into 6 size groups: 0 to 128K (which includes all compressed extents), 512K, 2M, 8M, 32M, and larger than 32M.
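For illustration only, the boundaries could be expressed as a table like this (an assumption about the shape, not the actual contents of bees-roots.cc):

```cpp
#include <cstddef>
#include <cstdint>

// Upper bounds of the size groups listed above, in bytes. Illustrative only.
static const uint64_t size_group_upper_bound[] = {
    128 * 1024,        // 0..128K, includes all compressed extents
    512 * 1024,        // ..512K
    2 * 1024 * 1024,   // ..2M
    8 * 1024 * 1024,   // ..8M
    32 * 1024 * 1024,  // ..32M; anything larger falls into the "max" group
};

// Returns 0..4 for the bounded groups, 5 for "larger than 32M".
size_t size_group_of(uint64_t extent_bytes) {
    size_t i = 0;
    while (i < 5 && extent_bytes > size_group_upper_bound[i]) ++i;
    return i;
}
```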
There's a table in bees-roots.cc which lists the size groups. Ambitious users who want to avoid small extent dedupes entirely can remove the table entry for the small extent sizes without breaking the dedupe.
Progress reporting
One of the most requested features!
Estimates how much data is done and how much remains at each size level, how long the remaining data will take, and when it might be finished.
The ETA column reads finished at 100%.
Unfortunately, it's about as accurate as the estimated time remaining in Windows Copy, i.e. not very accurate at all...
The progress report appears in $BEESSTATUS, the debug log, and beesstats.txt. It is only available in extent scan mode.
Duplicate read elimination
With extent scan mode, bees can read an extent once, then never read it again. That saves a lot of time when scanning a filesystem that has a lot of snapshots and not much duplicate data.
Extent scan doesn't have to start over when new snapshots are created. It doesn't have to read every snapshot or reflink multiple times as the subvol scanners do.
Duplicate data still needs to be reread over and over, but that's a limitation of the kernel API.
Minimal temporary space usage and unreachable extents
Extent scan adds copy extents to the hash table and processes them immediately, without waiting for later scans to clean them up. This avoids leaving a temporary extent sitting in the hash table for so long that it is evicted before the extent can be completely deduped.
No support for btrfs send workaround
Currently the extent scan mode can't effectively implement --workaround-btrfs-send. Read-only snapshots are simply excluded from dedupe when the workaround is enabled.
Whether the --workaround is enabled or not, if btrfs send is active on the filesystem, no dedupe can be done on the subvol being sent. Any space savings will only appear when the read-only snapshots are deleted.
One solution for active btrfs send is to terminate bees before starting btrfs send, and restart bees after btrfs send is finished.
Unlike subvol scan modes, there is no way to go back to rescan a read-only subvol that becomes read-write (other than deleting beescrawl.dat and starting over from the very beginning of the filesystem).
Extent scan vs subvol scan history
Switching to extent scan starts over from the earliest completed subvol scan. If your filesystem was completely up to date with subvol scans, it will also be up to date with extent scan; however, if subvol scans were only partially completed, then extent scan will restart from the beginning.
It is possible to switch from extent to subvol scan modes and back. Extent scan will pause while subvol scan is active, and vice versa. If you needed to switch back to subvol scan mode, please tell us why!
Older bees versions will delete extent scan checkpoints from beescrawl.dat.
The extent scan mode stores its checkpoint using virtual subvol IDs in beescrawl.dat, with root numbers 250 through 255. This may change in the future.
[edited to correct "from the beginning" and fill in some answers from the comments]
Other enhancements and bug fixes
Less fragmentation and nuisance dedupes
bees will no longer dedupe a partial extent if the total amount of data saved is less than half of the extent size, or if the extent requires more than 100 dedupe and copy operations to remove.
There will be no more "we can save 24K if we make 128M of copy extents" nuisance dedupes that happen in media files, or "we can save 80% of the space in this extent, but only if we break the extent into thousands of tiny fragments" dedupes that happen in VM image files.
bees will still split an extent when the ratio is the other way around, e.g. "create a few dozen new 4K fragments if it enables dedupe to save 100M of space" or "split a 4M chunk into a 3M chunk and a 1M chunk so we can dedupe the 3M chunk."
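As a rough model of the "no nuisance dedupe" rules above (the numbers come from the text; the function and its inputs are illustrative, not the actual bees logic):

```cpp
#include <cstdint>

// Decide whether a partial-extent dedupe is worth doing: skip it if it
// saves less than half of the extent, or if removing the extent would take
// more than 100 dedupe and copy operations.
bool partial_dedupe_worthwhile(uint64_t extent_bytes,
                               uint64_t bytes_saved,
                               unsigned dedupe_and_copy_ops) {
    if (bytes_saved * 2 < extent_bytes) return false;  // less than half saved
    if (dedupe_and_copy_ops > 100)      return false;  // too many operations
    return true;
}
```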
bees now picks the longest available extent for dedupe if multiple choices exist.
Removing expensive toxic extent workarounds
Toxic extents are fixed in kernel 5.7 and later, and the workarounds for toxic extents in bees were extremely expensive. The worst of these workarounds have been removed. Some filesystems will see improvements of 100x to 1000x on highly duplicated data.
Improved IO scheduling
Only one thread reads from disk at any time, avoiding thrashing on spinning drives (and it even improves performance on some solid-state drives).
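Conceptually this is just one shared lock around the physical read path (a simplified sketch, not the actual implementation):

```cpp
#include <cstddef>
#include <mutex>
#include <unistd.h>

// One global lock around disk reads: worker threads still hash and dedupe
// in parallel, but only one of them issues a read at a time, which keeps
// spinning drives doing sequential work instead of seeking between threads.
static std::mutex read_lock;

ssize_t locked_pread(int fd, void *buf, size_t count, off_t offset) {
    std::unique_lock<std::mutex> lock(read_lock);
    return pread(fd, buf, count, offset);
}
```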
Dedupes run on data that has been preloaded into page cache, making dedupe operations substantially faster since the sequential read buffers were removed from kernel 4.20.
Dedupes now pass the full extent length to the kernel, instead of cutting them into 16M fragments. The kernel has been able to handle any length of dedupe request since 4.18.
There is now a simple filter to prevent duplicate prereads.
Improved dedupe task scheduling
Crawlers are now much better at finding work that worker threads can do at the same time.
Lower extent reference count limit and kernel memory usage
Decreased the maximum number of extent refs from 699050 to 9999. This uses much smaller buffers for kernel ioctls, which avoids triggering some annoying kernel VM bugs that forcibly evict pages unrelated to bees from memory.
New minimum kernel requirements
The extent scan mode won't work at all on kernel 4.14 and earlier. Only subvol scans can be used there.
Toxic extent workarounds have been removed. Kernels earlier than 5.7 may experience slowdowns and lockups with large files that have many duplicate extents.