Call for testers - v0.11 release candidate #296

Open · Zygo opened this issue Dec 1, 2024 · 66 comments

@Zygo (Owner) commented Dec 1, 2024

I'm getting ready to do the v0.11 release, and this one is larger than most of the previous releases, so I'd like to do more testing before putting a tag on it.

Please try it out, and comment here if it works for you, or open an issue if it doesn't.

Here's a list of the new things to try:

Extent scan mode

Extent scan mode replaces the historically problematic subvol scan modes.

Extent scan is now the default mode. If you are using a -m option, remove it or replace it with -m4 to enable extent scan.

Extent scan enables several new features:

Extent size sorting

Skip over the small extents to dedupe the large extents first.

The filesystem is divided into 6 size groups: 0 to 128K (which includes all compressed extents), 512K, 2M, 8M, 32M, and larger than 32M.

There's a table in bees-roots.cc which lists the size groups. Ambitious users who want to avoid small extent dedupes entirely can remove the table entry for the small extent sizes without breaking the dedupe.
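
For illustration, a size-tier table along these lines might look like the following sketch (the struct and names here are hypothetical, not the actual contents of bees-roots.cc):

#include <cstdint>
#include <vector>

// Hypothetical sketch of a size-tier table: one crawler per tier, each
// handling extents up to its upper bound. Removing the smallest entry would
// skip the 0-128K extents entirely, as described above.
struct SizeTier {
        uint64_t max_bytes;   // largest extent handled by this tier
        const char *label;    // name shown in the progress table
};

static const std::vector<SizeTier> size_tiers = {
        { 128 * 1024,    "128K" },  // includes all compressed extents
        { 512 * 1024,    "512K" },
        { 2ULL << 20,    "2M"   },
        { 8ULL << 20,    "8M"   },
        { 32ULL << 20,   "32M"  },
        { UINT64_MAX,    "max"  },  // everything larger than 32M
};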

Progress reporting

One of the most requested features!

Estimates how much data is done and how much remains at each size level, how long the remaining data will take, and when it might be finished.

    done     156G    %done size transid 1334   todo              ETA
-------- -------- -------- ---- ------- ---- ------ ----------------
  6.496G 120.717G  5.3814%  max       0 1276 9h 18m 2024-12-01 11:45
  4.171G  18.833G 22.1497%  32M       0 1276 2h 15m 2024-12-01 04:43
  1.831G  11.433G 16.0150%   8M       0 1276  3h 7m 2024-12-01 05:35
462.445M    2.83G 15.9563%   2M       0 1276  3h 8m 2024-12-01 05:36
162.649M   1.061G 14.9772% 512K       0 1276 3h 20m 2024-12-01 05:48
 22.105M   1.126G  1.9178% 128K       0 1276  1d 2h 2024-12-02 04:33

The columns are:

  1. Estimated amount of data scanned in this size tier. The estimate is based on a sample of extent data from the filesystem, so it won't appear for a few minutes after bees startup.
  2. Estimated total amount of data in this size tier (the heading of this column is the total allocated data size for the filesystem)
  3. Percent done, or finished at 100%
  4. The maximum size of extents in this row
  5. The lowest transid for this scan cycle
  6. The highest transid for this cycle (the heading of this column is the current transid of the filesystem)
  7. The estimated amount of time remaining. Calculated based on when the last cycle started.
  8. The date and time when the current scan will finish.

Unfortunately, it's about as accurate as the estimated time remaining in Windows Copy, i.e. not very accurate at all...

The progress report appears in $BEESSTATUS, the debug log, and beesstats.txt. It is only available in extent scan mode.
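
The ETA column appears to be a simple extrapolation from elapsed time and the fraction of data done; a rough sketch of that kind of estimate (an assumption for illustration, not the actual bees code):

#include <chrono>

// Hypothetical sketch: extrapolate the time remaining for a size tier from
// the time elapsed since the cycle started and the fraction already scanned.
std::chrono::seconds
estimate_remaining(std::chrono::seconds elapsed, double fraction_done)
{
        if (fraction_done <= 0.0) {
                return std::chrono::seconds::max();   // no estimate yet
        }
        const double total_seconds = elapsed.count() / fraction_done;
        return std::chrono::seconds(
                static_cast<long long>(total_seconds) - elapsed.count());
}

The ETA date in the last column would then be the current time plus this estimate, which is part of why it jumps around as the scan rate changes.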

Duplicate read elimination

With extent scan mode, bees can read an extent once, then never read it again. That saves a lot of time when scanning a filesystem that has a lot of snapshots and not much duplicate data.

Extent scan doesn't have to start over when new snapshots are created. It doesn't have to read every snapshot or reflink multiple times as the subvol scanners do.

Duplicate data still needs to be reread over and over, but that's a limitation of the kernel API.

Minimal temporary space usage and unreachable extents

Extent scan adds copy extents to the hash table and processes them immediately, without waiting for later scans to clean them up. This avoids leaving a temporary extent sitting in the hash table for so long that it is evicted before the extent can be completely deduped.

No support for btrfs send workaround

Currently the extent scan mode can't effectively implement --workaround-btrfs-send. Read-only snapshots are simply excluded from dedupe when the workaround is enabled.

Whether the --workaround is enabled or not, if btrfs send is active on the filesystem, no dedupe can be done on the subvol being sent. Any space savings will only appear when the read-only snapshots are deleted.

One solution for active btrfs send is to terminate bees before starting btrfs send, and restart bees after btrfs send is finished.

Unlike subvol scan modes, there is no way to go back to rescan a read-only subvol that becomes read-write (other than deleting beescrawl.dat and starting over from the very beginning of the filesystem).

Extent scan vs subvol scan history

Switching to extent scan starts over from the earliest completed subvol scan. If your filesystem was completely up to date with subvol scans, it will also be up to date with extent scan; however, if subvol scans were only partially completed, then extent scan will restart from the beginning.

It is possible to switch from extent to subvol scan modes and back. Extent scan will pause while subvol scan is active, and vice versa. If you needed to switch back to subvol scan mode, please tell us why!

Older bees versions will delete extent scan checkpoints from beescrawl.dat.

The extent scan mode stores its checkpoint using virtual subvol IDs in beescrawl.dat, with root numbers 250 through 255. This may change in the future.

[edited to correct "from the beginning" and fill in some answers from the comments]

Other enhancements and bug fixes

Less fragmentation and nuisance dedupes

bees will no longer dedupe a partial extent if the total amount of data saved is less than half of the extent size, or if the extent requires more than 100 dedupe and copy operations to remove.

There will be no more "we can save 24K if we make 128M of copy extents" nuisance dedupes that happen in media files, or "we can save 80% of the space in this extent, but only if we break the extent into thousands of tiny fragments" dedupes that happen in VM image files.

bees will still split an extent when the ratio is the other way around, e.g. "create a few dozen new 4K fragments if it enables dedupe to save 100M of space" or "split a 4M chunk into a 3M chunk and a 1M chunk so we can dedupe the 3M chunk."

bees now picks the longest available extent for dedupe if multiple choices exist.
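
The two thresholds above boil down to a simple predicate; this is a sketch of the rule as stated, with hypothetical names rather than the actual implementation:

#include <cstdint>

// Sketch of the rule described above: skip a partial dedupe when it saves
// less than half of the extent, or needs more than 100 dedupe/copy operations.
bool worth_deduping(uint64_t extent_bytes, uint64_t bytes_saved, unsigned ops_needed)
{
        if (bytes_saved * 2 < extent_bytes) return false;  // saves less than 50%
        if (ops_needed > 100) return false;                // too much churn to remove it
        return true;
}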

Removing expensive toxic extent workarounds

Toxic extents are fixed in kernel 5.7 and later, and the workarounds for toxic extents in bees were extremely expensive. The worst of these workarounds have been removed. Some filesystems will see improvements of 100x to 1000x on highly duplicated data.

Improved IO scheduling

Only one thread reads from disk at any time, avoiding thrashing on spinning drives (and it even improves performance on some solid-state drives).

Dedupes run on data that has been preloaded into page cache, making dedupe operations substantially faster since the sequential read buffers were removed from kernel 4.20.

Dedupes now pass the full extent length to the kernel, instead of cutting them into 16M fragments. The kernel has been able to handle any length of dedupe request since 4.18.
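
For reference, a full-length dedupe request is a single FIDEDUPERANGE ioctl covering the whole extent; a minimal standalone example (the paths, offsets, and 128M length are made-up values, and both ranges must already contain identical data):

#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <cstdio>
#include <cstdlib>

int main()
{
        // Made-up illustration values: dedupe one 128M extent in a single call.
        const __u64 length = 128ULL * 1024 * 1024;
        int src = open("/mnt/pool/file-a", O_RDONLY);
        int dst = open("/mnt/pool/file-b", O_RDWR);
        if (src < 0 || dst < 0) return 1;

        // One destination range, so allocate room for exactly one range_info.
        auto *arg = static_cast<file_dedupe_range *>(
                calloc(1, sizeof(file_dedupe_range) + sizeof(file_dedupe_range_info)));
        arg->src_offset = 0;
        arg->src_length = length;        // the whole extent, not 16M pieces
        arg->dest_count = 1;
        arg->info[0].dest_fd = dst;
        arg->info[0].dest_offset = 0;

        if (ioctl(src, FIDEDUPERANGE, arg) == 0 &&
            arg->info[0].status == FILE_DEDUPE_RANGE_SAME) {
                printf("deduped %llu bytes\n",
                       (unsigned long long)arg->info[0].bytes_deduped);
        }
        free(arg);
        return 0;
}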

There is now a simple filter to prevent duplicate prereads.

Improved dedupe task scheduling

Crawlers are now much better at finding work that worker threads can do at the same time.

Lower extent reference count limit and kernel memory usage

Decreased the maximum number of extent refs from 699050 to 9999. This uses much smaller buffers for kernel ioctls, which avoids triggering some annoying kernel VM bugs that forcibly evict pages unrelated to bees from memory.

New minimum kernel requirements

The extent scan mode won't work at all on kernel 4.14 and earlier. Only subvol scans can be used there.

Toxic extent workarounds have been removed. Kernels earlier than 5.7 may experience slowdowns and lockups with large files that have many duplicate extents.

@kakra (Contributor) commented Dec 1, 2024

First, I think the progress table should be separated by a new line, or have a header like the other sections, e.g. PROGRESS: or ESTIMATES:.

Now I've updated bees, then changed scan mode to 4, and restarted it without touching any of the beeshome files. I've also set up various windows to watch beesstats, the bees log, and htop.

I do not see how it possibly starts scanning all the subvols and snapshots again, I cannot imagine it being that fast:

Switching to extent scan starts over from the beginning, even if previous subvol scans were up to date; however, extent scan mode is fast enough that this is not necessarily an inconvenience.

There are no new lines in beescrawl.dat. I'd expect a new type of line appearing:

Older bees versions will delete extent scan checkpoints from beescrawl.dat.

Maybe I just have to wait for some time? According to progress, it is probably done:

    done   4.814T    %done size transid 4059507 todo ETA
-------- -------- -------- ---- ------- ------- ---- ---
deferred   2.894T finished  max 4059506 4059507    -   -
deferred 799.222G finished  32M 4059506 4059507    -   -
deferred 337.118G finished   8M 4059506 4059507    -   -
deferred 226.404G finished   2M 4059506 4059507    -   -
deferred 221.007G finished 512K 4059506 4059507    -   -
deferred 382.818G finished 128K 4059505 4059506    -   -

I've restarted bees hoping that it forces writing the beescrawl.dat but still no new entry format. I'll let it sit for a while now.

According to htop, bees is mostly writing data and not reading anything. This is probably just writing to the state files.

According to journalctl, bees writes absolutely no logs (except version and option parsing) which comes as a surprise. I'm using --verbose 5 but I didn't expect it to be that silent now.

Is it maybe still scanning and the "progress" shows a bogus status until then? I can see the progress stats changing every now and then... It now looks like this:

    done   4.814T    %done size transid 4059523 todo ETA
-------- -------- -------- ---- ------- ------- ---- ---
deferred   2.429T finished  max 4059522 4059523    -   -
deferred 723.453G finished  32M 4059522 4059523    -   -
deferred 455.517G finished   8M 4059522 4059523    -   -
deferred 408.836G finished   2M 4059522 4059523    -   -
deferred 346.682G finished 512K 4059522 4059523    -   -
deferred 508.406G finished 128K 4059521 4059522    -   -

@kakra (Contributor) commented Dec 1, 2024

Okay, I got one log now at least:

Dez 01 13:32:32 jupiter bees[696816]: ref_9675ef6a000_4K_1[696923]: Opening /mnt/btrfs-pool/home/kakra/.local/share/fish/fish_history found wrong inode 28589377 instead of 28566132

@lilydjwg commented Dec 1, 2024

I do not see how it possibly starts scanning all the subvols and snapshots again, I cannot imagine it being that fast:

I guess it scans all the data, not all the snapshots, which contain a lot of the same data repeatedly.

According to htop, bees is mostly writing data and not reading anything. This is probably just writing to the state files.

Yes, systemd reports only a small amount of data written instead (with IOAccounting=true):

         IO: 46G read, 1.2G written

According to journalctl, bees writes absolutely no logs (except version and option parsing) which comes as a surprise. I'm using --verbose 5 but I didn't expect it to be that silent now.

I'm using -v 4 and yes, no logs. I'm quite pleased to see it working quietly instead of filling up my journal.

I wonder how often the stat is updated; it doesn't seem very often.

@kakra (Contributor) commented Dec 1, 2024

I wonder how often the stat is updated; it doesn't seem very often.

There are two stat files: one in the persistent directory (beeshome), one in the runtime directory (/run/bees/bees.status or similar). The former is only written about every 15 minutes; the latter updates once per second.

@Zygo (Owner, Author) commented Dec 1, 2024

According to journalctl, bees writes absolutely no logs (except version and option parsing) which comes as a surprise. I'm using --verbose 5 but I didn't expect it to be that silent now.

Something went wrong...it's about as noisy as it ever was.

Oh, wait, my statement here is wrong:

Switching to extent scan starts over from the beginning, even if previous subvol scans were up to date

Extent scan can restart from the lowest completed subvol transid, so if everything was absolutely up to date, the extent scanner won't start over. On the other hand, if any subvol still has transid 0 (like in issue #268), then extent scan starts over from the beginning.

The #268 situation is where extent scan really makes a difference.

@kakra (Contributor) commented Dec 1, 2024

Then I'll rename beescrawl.dat and let it start from the beginning. The progress estimates will come in handy here. :-)

@manxorist

Oh, wait, my statement here is wrong:

Switching to extent scan starts over from the beginning, even if previous subvol scans were up to date

Extent scan can restart from the lowest completed subvol transid, so if everything was absolutely up to date, the extent scanner won't start over.

That makes the decision to update way easier on my large and slow single-subvol fs.

@Zygo (Owner, Author) commented Dec 1, 2024

That makes the decision to update way easier on my large and slow single-subvol fs.

I wrote the lines of code that create subvol crawlers two years ago. I guess I thought of this case then, and later forgot... :-P

I've been mostly testing it on fresh mkfs filesystems, and huge filesystems with rotating snapshots where the subvol scans were stuck at zero. Both of those are cases where it does start over, or it's running for the first time so there's no previous progress.

First, I think the progress table should be separated by a new line, or have a header like the other sections, e.g. PROGRESS: or ESTIMATES:.

Interesting...the periodic progress dump in the debug log does have the PROGRESS prefix:

2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS:     done     156G    %done size transid 1369    todo              ETA
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS: -------- -------- -------- ---- ------- ---- ------- ----------------
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS: deferred 110.962G finished  max       0 1274       -                -
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS:  16.017G  23.959G 66.8515%  32M       0 1274  1h 14m 2024-12-01 07:16
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS:   5.884G  14.211G 41.4034%   8M       0 1274  2h 48s 2024-12-01 08:02
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS:   1.973G   3.849G 51.2648%   2M       0 1274  1h 37m 2024-12-01 07:39
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS: 415.105M   1.587G 25.5486% 512K       0 1274  3h 15m 2024-12-01 09:17
2024-12-01 06:02:09.847 14564.14616<6> crawl_new: PROGRESS:  56.074M   1.432G  3.8241% 128K       0 1274 21h 47m 2024-12-02 03:50

but I took the prefix out to make the progress table fit in 80 columns. How important is 80 columns? Do people still use terminal windows that small, or could it be wider, like 90 or 120 columns? Or does everyone expect JSON or HTML output these days?

At some point I want to add stats to this kind of table, or a separate table, to keep score: how many bytes were removed, average time for multiple scan cycles, reading throughput, and other performance metrics. I guess it comes down to whether it's better to make this table bigger, or make two tables, or two text rows per size (so 12 lines instead of 6). But that question can be answered after the data is available.

@kakra (Contributor) commented Dec 1, 2024

@Zygo I mean this in bees.status:
[screenshot of the bees.status progress table]

@Zygo (Owner, Author) commented Dec 1, 2024

@Zygo I mean this in bees.status:

Like this?

3a33a53 context: add a PROGRESS: header in $BEESSTATUS

@kakra (Contributor) commented Dec 1, 2024

Yay, looks good to me. You're on the x-mas path :-D

@Zygo (Owner, Author) commented Dec 1, 2024

The #268 situation is where extent scan really makes a difference.

I didn't explicitly state this above: In the 268 case with new snapshots appearing all the time, the subvol scanners are always stuck near zero, so there's negligible progress to lose by starting over with extent scan. That's the basis of my "there's no real inconvenience when starting over with extent scan" statement, but that statement doesn't apply in other cases like a "one big subvol" filesystem.

Some of the improvements for processing individual extents make the subvol scanners faster too. Even starting over with a subvol scanner should take less time to reach the same point in the filesystem than in earlier versions.

It's not quite starting over, even in the worst case: if an extent already has blocks in the hash table, bees can skip the dedupe/copy analysis because blocks are only added to the hash table when no dedupe/copy operations are needed. Only reads have to be repeated (without the reads, we don't know what blocks to look for in the hash table). That can skip some metadata lookups that eat a lot of time.

@Zygo (Owner, Author) commented Dec 2, 2024

A comparison of extent scan and subvol scan performance (bytes freed over time) in a filesystem with a single subvol.

TL;DR extent scan seems to outperform everything.

Duperemove added for scale.

[chart 00-all-df-summary: bytes freed over time for each deduper]

This data set is designed to challenge block-level dedupers on btrfs. The "none" data set is about 160 GiB in 3 Windows VM image raw files, copied to a single subvol on a freshly formatted filesystem without compression or prior dedupe. The "zstd" data set is the same data, but written with compress-force=zstd so it takes about 100 GiB on disk before dedupe. The same two btrfs filesystem images ("none" or "zstd") are used for all tests to ensure consistency across runs. The VM images have a lot of duplication within each image and also across images, so there is plenty of work for a deduper to do.

"zstd" is a challenging scenario for both bees and duperemove, as the data is already compressed, the uncompressed extents are shorter, and correct alignment of duplicate extents is critical. Other btrfs dedupers can't handle compressed extents or block-level dedupe at all, so they score zero bytes freed on the "zstd" test, and it doesn't matter how long they run.

"none" is a more normal scenario with larger, uncompressed extents that don't slow down either bees or duperemove. Some other block-level btrfs dedupers can function on the "none" filesystem, but they free fewer bytes or take longer to run than bees or duperemove.

I didn't include a result for "zstd" with subvol scan because the subvol scan is still running on the "none" image, and there's no way the subvol scan on "zstd" could be faster than it is on "none".

@kakra (Contributor) commented Dec 2, 2024

Skip over the small extents to dedupe the large extents first.

Does this mean it can have an effect of recombining smaller extents into larger ones, aka defragmentation? Or does it look at extents only rather than chains of blocks?

@Zygo (Owner, Author) commented Dec 2, 2024

Skip over the small extents to dedupe the large extents first.

This is simple filtering: there's six scanners, and each one picks extents of a particular size from the input stream. So if the 128K scanner gets bogged down in compressed data, it doesn't stop bees from going ahead and finding some 128M extents for quick and easy free space.

As a side effect, it means a dedupe will usually use a large extent as src, and make equal or smaller references to dst, so the biggest extents in the filesystem are preserved while the smallest extents get deduped out of existence.

Does this mean it can have an effect of recombining smaller extents into larger ones, aka defragmentation?

No defragmentation yet. bees still only looks at one extent at a time, so it can't consider combining two or more extents together. bees also can't yet flip the order of src and dst if the second block it has seen is part of a larger extent. Both of these capabilities require holding locks on multiple extents at a time, and figuring out how to schedule tasks efficiently when two growing extents meet each other or want to use the same extent in different ways.

Garbage collection will probably come before defrag. GC is easier to implement because it doesn't require combining multiple extents together. btrfs doesn't have any tool that does GC properly yet, so GC fills in a feature gap for btrfs. In its simplest form, GC is rewriting all of the blocks that are still reachable when bees detects that any block of the extent is unreachable. Conveniently, there's now a place in the bees code where we have all that information.

@kakra (Contributor) commented Dec 3, 2024

@Zygo I can confirm that scan mode 4 is able to catch up with snapshots. We can close #268.

@kakra (Contributor) commented Dec 4, 2024

@Zygo Is it correct behavior that the table never updates the transid until it reaches 100% in that category:

    done   4.728T    %done size transid 4067300   todo              ETA
-------- -------- -------- ---- ------- ------- ------ ----------------
  2.397T   2.557T finished  max 4067297 4067298      -                -
577.043G 615.621G finished  32M 4067297 4067298      -                -
427.068G  455.62G finished   8M 4067297 4067298      -                -
 461.89G  492.77G finished   2M 4067297 4067298      -                -
324.829G 346.546G finished 512K 4067297 4067298      -                -
212.468G  312.04G 68.0899% 128K       0 4060025 3d 11h 2024-12-07 13:12

Then, what happens in the 128k category if I restart bees? Will it start again from transid 0?

@Zygo (Owner, Author) commented Dec 4, 2024

The progress table is...a work in progress.

It has a few quirks, like saying finished when it has queued the last extent in the filesystem, not when it has completed that extent. It's declaring completion a little too early, but usually only by a few hundred extents.

The progress table shows the position that would be saved in beescrawl.dat. It goes from 0 to 100% of the data in each row, then swaps in a new transid range and starts a new pass at 0% data. If the max transid is the same as the filesystem transid, that row stops crawling. If all the rows reach max transid, bees goes idle. When the filesystem transid goes up, bees comes out of idle, and finished crawlers start new passes. The new passes look only at data in the new transid range, so they should exclude the data from previous passes.

Extent scans have longer queues of extents than the subvol scanners did. btrfs locks entire files for dedupe, so if bees tried to run parallel dedupe tasks on the same inodes at the same time, btrfs would force them all to run one at a time. This was easy to avoid in the subvol scanners because they're naturally organized by inode, but it's not possible to know what inodes an extent belongs to in advance. When scanning a big file, the scanner will end up creating tasks that can't execute in parallel because there's dedupe happening somewhere else in the same files. In order to avoid stalling the worker threads, the extent scanner will put these extents on a queue in memory, and try again to find an extent that can be scanned in parallel.

If you have a lot of different files, extents will usually belong to different files, and the queue will be short. If you have a few big files, extents will usually belong to the same files, and the queue will get long. If you have a mix of file sizes, the queue length will grow and shrink as needed. In all cases, the queue stops growing at a hard limit of 10,000 extents, and worker threads are allowed to become idle until the queue goes back below the limit.

Because the task queue executes out of order, there can be longer delays between an extent task going into the queue and coming out of it in extent scan mode than subvol scan mode. That can freeze the data position in the progress (and in beescrawl.dat) until the lowest-numbered extent in memory is finally dequeued. There's a noticeably longer restart time as a result of this, as all the queued work at shutdown time will be repeated at startup. "5 or 10 restarts per day" is probably still OK, but "stop and start every 5 minutes" is probably too short to guarantee forward progress now.

There's a couple of things that could improve this:

  1. Shorten the queue limit to e.g. 100 tasks per worker thread
  2. Save the entire in-memory queue to a file on shutdown and reload it at startup
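
A rough sketch of what the first option could look like: a hypothetical bounded queue whose capacity scales with the worker count instead of using the fixed 10,000-extent cap (this is an illustration, not the actual bees Task queue):

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical sketch of option 1: limit the in-memory extent queue to
// e.g. 100 tasks per worker thread. Producers block until a slot frees up.
template <typename Task>
class BoundedQueue {
        std::mutex              m_mutex;
        std::condition_variable m_not_full;
        std::deque<Task>        m_tasks;
        size_t                  m_limit;
public:
        BoundedQueue(size_t workers, size_t per_worker = 100)
                : m_limit(workers * per_worker) {}

        void push(Task t) {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_not_full.wait(lock, [this] { return m_tasks.size() < m_limit; });
                m_tasks.push_back(std::move(t));
        }

        bool try_pop(Task &out) {
                std::unique_lock<std::mutex> lock(m_mutex);
                if (m_tasks.empty()) return false;
                out = std::move(m_tasks.front());
                m_tasks.pop_front();
                m_not_full.notify_one();
                return true;
        }
};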

@kakra (Contributor) commented Dec 4, 2024

I observed a few desktop and application stalls which I didn't notice before. They were probably there pre-v0.11 but they were rare. Even with this RC, I didn't notice any until the progress table reached the 128k-only situation:

btrfs locks entire files for dedupe

What does this mean? Does bees internally avoid concurrent access to the same file via locks, or does it really lock the file on the kernel level (you wrote "btrfs") so that applications would stall when accessing such files?

The stalls are quite excessive... We are talking 1-2 minutes per stall, tho usually it's in the 10-20 seconds range. But since bees has the 128k category running for an estimated time of 3-5 days left, this happens multiple times per hour.

I've seen that you also changed the toxic detection from 0.1 to 5 seconds if I've read the diffs correctly. So that may be an explanation, too.

To see the situation better: I purged beescrawl.dat to make bees read the whole filesystem again, I kept beeshash.dat, tho.

Here's my current table:

    done   4.728T    %done size transid 4068303    todo              ETA
-------- -------- -------- ---- ------- ------- ------- ----------------
deferred   2.593T finished  max 4068301 4068302       -                -
deferred  613.84G finished  32M 4068301 4068302       -                -
deferred 439.003G finished   8M 4068301 4068302       -                -
deferred 470.305G finished   2M 4068301 4068302       -                -
136.464G 334.528G 40.7931% 512K 4068260 4068274 36m 31s 2024-12-04 11:05
223.392G 328.084G 68.0899% 128K       0 4060025  4d 17m 2024-12-08 10:46

ETA for 128k usually lowers again after the other categories are done. Running on kernel 6.12.1.

@Zygo (Owner, Author) commented Dec 4, 2024

I've seen that you also changed the toxic detection from 0.1 to 5 seconds if I've read the diffs correctly. So that may be an explanation, too.

If you change it back, does it detect a lot of toxic extents? Maybe we should keep those.

ETA for 128k usually lowers again after the other categories are done.

Yes, it has fewer competitors so it can use all the worker threads.

There might be a correlation between 128K and stalls. The stalls in btrfs are usually caused by delayed ref updates, and smaller extents = more extent refs = more delay. If that's the cause, the only solution is to dedupe those extents more slowly. When there's a mix of big extents to slow down the small extents, the system might be self-regulating, but when the big extents are done, maybe it starts flooding btrfs's delayed ref queues.

If you can run slabtop, see if the number of btrfs_delayed_ref_head objects is large (millions) during the stalls.

A first-time run on snapshots is going to be brutally slow too, for much the same reason: unsharing a snapshot's metadata triggers ~300 delayed ref updates for each metadata page. That gets to millions in no time.

btrfs locks entire files for dedupe

What does this mean? Does bees internally avoid concurrent access to the same file via locks, or does it really lock the file on the kernel level (you wrote "btrfs") so that applications would stall when accessing such files?

Both!

btrfs reflink and dedupe operations lock both inodes for writing while they do their thing (they also force a data flush, so if you're writing to a file during dedupe, it gets pushed out to disk as well). If you cp --reflink a file on btrfs, it's effectively an atomic operation, because the source file isn't allowed to be modified during a reflink clone operation.

bees knows that btrfs does this, so when it sees multiple dedupe operations queued up on the same file, bees arranges the queue tree so one worker thread does all the dedupes touching that file. Other worker threads are then free to do dedupes on other files.

There is a fairly large space between dedupe operations while bees does hashing and duplicate lookup of the next extent, so it shouldn't be locking out other users of the files for too long at a time, but all the locking (and flushing) will slow down overall write throughput.
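
A sketch of that scheduling idea, with hypothetical names (the real bees task system differs in detail): queue dedupe operations per inode, and let whichever worker opened a new chain drain it while the other workers stay on different files.

#include <cstdint>
#include <deque>
#include <functional>
#include <map>
#include <mutex>
#include <utility>

// Hypothetical sketch: serialize dedupe work per inode so that only one
// worker at a time issues ioctls against a given file, since btrfs locks
// the whole inode for the duration of each dedupe.
class PerFileScheduler {
        std::mutex m_mutex;
        std::map<uint64_t, std::deque<std::function<void()>>> m_chains;
public:
        // Returns true if the caller should start draining this inode's chain,
        // false if another worker already owns the chain and will run the task.
        bool enqueue(uint64_t inode, std::function<void()> dedupe_op) {
                std::lock_guard<std::mutex> lock(m_mutex);
                auto [it, inserted] = m_chains.try_emplace(inode);
                it->second.push_back(std::move(dedupe_op));
                return inserted;
        }

        // The owning worker drains the chain; the chain is removed when empty.
        void drain(uint64_t inode) {
                for (;;) {
                        std::function<void()> op;
                        {
                                std::lock_guard<std::mutex> lock(m_mutex);
                                auto it = m_chains.find(inode);
                                if (it == m_chains.end() || it->second.empty()) {
                                        if (it != m_chains.end()) m_chains.erase(it);
                                        return;
                                }
                                op = std::move(it->second.front());
                                it->second.pop_front();
                        }
                        op();   // run the dedupe outside the scheduler lock
                }
        }
};

A worker that gets true from enqueue() calls drain() for that inode, so btrfs only ever sees one dedupe in flight per file while other workers keep busy elsewhere.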

@kakra (Contributor) commented Dec 4, 2024

Well, it looks like this, and it doesn't look much different during a short stall (I couldn't see it until the stall was over):

 Active / Total Objects (% used)    : 10077226 / 13946968 (72.3%)
 Active / Total Slabs (% used)      : 342761 / 342761 (100.0%)
 Active / Total Caches (% used)     : 165 / 224 (73.7%)
 Active / Total Size (% used)       : 2330344.12K / 3380032.05K (68.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.24K / 14.17K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
2289636 1873215  81%    0.11K  63601       36    254404K btrfs_path
1936480 1199496  61%    0.57K  69160       28   1106560K radix_tree_node
1833280 1168558  63%    0.06K  28645       64    114580K Acpi-Namespace
884352 844308  95%    0.06K  13818       64     55272K kmalloc-64
858572 699246  81%    0.15K  33022       26    132088K btrfs_delayed_ref_head
778216 464242  59%    0.99K  24321       32    778272K btrfs_inode
762923 430665  56%    0.05K  10451       73     41804K file_lock_ctx
619956 590717  95%    0.23K  18234       34    145872K btrfs_extent_buffer
601600 377074  62%    0.01K   1175      512      4700K zs_handle-zswap1
514458 292094  56%    0.19K  12249       42     97992K dentry
434655 285452  65%    0.10K  11145       39     44580K btrfs_free_space
242688 236999  97%    0.03K   1896      128      7584K f2fs_bio_iostat_ctx
220116 158827  72%    0.04K   2158      102      8632K vma_lock
200375 157482  78%    0.16K   8015       25     32060K vm_area_struct
186400 116851  62%    0.25K   5825       32     46600K kmalloc-256
162816  64286  39%    0.03K   1272      128      5088K kmalloc-32
136600  93304  68%    0.31K   5464       25     43712K xfrm_dst
123072  92405  75%    0.06K   1923       64      7692K kmalloc-cg-64
111904  46814  41%    0.50K   3497       32     55952K kmalloc-512
103264 103264 100%    0.07K   1844       56      7376K vmap_area
 77983  69892  89%    0.68K   1661       47     53152K proc_inode_cache
 76440  55691  72%    0.10K   1960       39      7840K anon_vma
 65536  48090  73%    0.02K    256      256      1024K kmalloc-16
 58710  57874  98%    0.13K   1957       30      7828K kernfs_node_cache
 51812  50015  96%    1.00K   1620       32     51840K kmalloc-1k
 45984  45984 100%    1.00K   1570       32     50240K xfs_inode
 42924  25359  59%    0.19K   1022       42      8176K filp
 42168  37728  89%    0.19K   1004       42      8032K kmalloc-192
 32320  14698  45%    0.06K    505       64      2020K kmalloc-rcl-64
 27520  16112  58%    0.25K    860       32      6880K maple_node
 23040  21376  92%    0.01K     45      512       180K kmalloc-8
 22932  17907  78%    0.09K    546       42      2184K kmalloc-96
 21840  20330  93%    0.20K    546       40      4368K f2fs_dic_entry
 21294   8030  37%    0.09K    507       42      2028K kmalloc-rcl-96
 18070  16102  89%    0.61K    695       26     11120K inode_cache

I'll try to capture a situation with a longer stall. This one was like 15-20 seconds: the web browser hung, and the console window opened with window decorations only and rendered its contents only after the stall. I'm currently seeing micro-stalls of 1-2s multiple times while writing this text.

OTOH, on another system, kernel 6.6, bees mostly idle, it looks like this (grepped for btrfs):

2185883 1328531  60%    1,10K  75525       29   2416800K btrfs_inode
351203 227548  64%    0,23K  10337       34     82696K btrfs_extent_buffer
122343  91930  75%    0,10K   3137       39     12548K btrfs_free_space
  2760   2760 100%    0,13K     92       30       368K btrfs_delayed_ref_head
  2484   2484 100%    0,11K     69       36       276K btrfs_path
  2052   1900  92%    0,41K     54       38       864K btrfs_ordered_extent

So btrfs_delayed_ref_head looks extremely high on my desktop system.

@kakra (Contributor) commented Dec 4, 2024

When there's a mix of big extents to slow down the small extents, the system might be self-regulating, but when the big extents are done, maybe it starts flooding btrfs's delayed ref queues

So what are the options? I'm seeing a less pronounced behavior with 512k extents, too. Everything above seems to be okay.

bees could throttle this if only 128k and 512k are left... or bees could look at the slabs and throttle itself? Limit the number of threads working on those buckets?

I think bees already throttles itself due to these stalls... It just doesn't do so early enough: the system stalls, loadavg goes up above the loadavg limit set in bees, bees throttles down... This works for a while until loadavg goes down enough (it currently varies between 2 and 20 every few minutes), then the cycle repeats.

It can read the 128k extents at 40-50 MB/s, tho. It's probably just when it does something with them.

@Zygo (Owner, Author) commented Dec 4, 2024

Yeah, delayed refs have been keeping btrfs slow and laggy for (checks watch) 14 years now.

Maybe we want a throttle feature, "don't dedupe more than 150 extent refs per second" or something. The extent scan speeds things up so much that we can consider backing away from maximum performance and still get useful amounts of dedupe done.

One experiment you can quickly try is to simply reduce the thread count to 2, or even 1. bees might now be able to keep up with filesystem updates with a single thread, and that will reduce the number of delayed refs in the transaction commit.
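
A throttle like that could be as simple as a leaky-bucket sleep around the dedupe call; a hypothetical sketch (the class and the 150 refs/s figure are illustrations of the idea, not an existing bees option):

#include <algorithm>
#include <chrono>
#include <thread>

// Hypothetical sketch of a "no more than N extent refs per second" throttle.
class RefThrottle {
        using clock = std::chrono::steady_clock;
        const double      m_rate;       // allowed refs per second
        double            m_debt = 0;   // seconds of work not yet "paid for"
        clock::time_point m_last = clock::now();
public:
        explicit RefThrottle(double refs_per_second) : m_rate(refs_per_second) {}

        // Call after each dedupe, passing the number of extent refs it updated.
        void charge(unsigned refs)
        {
                const auto now = clock::now();
                const double elapsed =
                        std::chrono::duration<double>(now - m_last).count();
                m_last = now;
                m_debt = std::max(0.0, m_debt - elapsed) + refs / m_rate;
                if (m_debt > 0) {
                        std::this_thread::sleep_for(
                                std::chrono::duration<double>(m_debt));
                }
        }
}; 

Used as RefThrottle throttle(150) with throttle.charge(refs) after each dedupe, the sleeps converge on the target rate by sleeping off any excess.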

@kakra (Contributor) commented Dec 4, 2024

I went from 6 threads down to 3 now. Let's see how it changes slabs... Currently the slab value is low.

BTW: Restarting bees properly got back to 71% progress immediately in the progress table although showing transid 0. Yay! \o/

@kakra (Contributor) commented Dec 8, 2024

Something has backfired, look at the numbers:

    done   4.575T    %done size transid 4080962    todo              ETA
-------- -------- -------- ---- ------- ------- ------- ----------------
  2.703T   2.807T 96.2996%  max 4079030 4079054  16h 2m 2024-12-08 23:53
 54.867G 482.053G 11.3820%  32M 4079214 4080952 35m 52s 2024-12-08 08:26
125.462G 345.754G 36.2865%   8M 4079054 4079207  1d 15h 2024-12-09 22:57
203.537G 247.292G 82.3064%   2M 4079030 4079054 18h 46m 2024-12-09 02:37
  41.17G 382.309G 10.7687% 512K 4079030 4079054  5d 23h 2024-12-14 07:15
260.413M 353.565G  0.0719% 128K 4079027 4079030  2y 23w 2027-05-24 18:04

Previously, this system has completed all size categories. Meanwhile, a lot of new snapper snapshots have been created, older ones have been cleaned up, so I have a mostly constant number of rotating snapshots. Looking at beescrawl.dat, I can see that all of the newer snapshots - or probably even all of the snapper snapshots - start at transid 0. I'm not sure why that happens, it looks like bees never looked at them yet, maybe lost the beescrawl.dat during a crash? (it's stored on xfs which tends to zero out file contents if the crash recovery isn't sure if the contents belong to the new version of the file):

root 5 objectid 0 offset 0 min_transid 0 max_transid 4060025 started 1733068536 start_ts 2024-12-01-16-55-36
root 250 objectid 18226741014528 offset 0 min_transid 4079030 max_transid 4079054 started 1733585040 start_ts 2024-12-07-16-24-00
root 251 objectid 11819762167808 offset 0 min_transid 4079214 max_transid 4080952 started 1733640411 start_ts 2024-12-08-07-46-51
root 252 objectid 14731643215872 offset 0 min_transid 4079054 max_transid 4079207 started 1733589568 start_ts 2024-12-07-17-39-28
root 253 objectid 9468088360960 offset 0 min_transid 4079054 max_transid 4080962 started 1733640661 start_ts 2024-12-08-07-51-01
root 254 objectid 10666197102592 offset 0 min_transid 4079030 max_transid 4079054 started 1733585057 start_ts 2024-12-07-16-24-17
root 255 objectid 9472038404096 offset 0 min_transid 4079027 max_transid 4079030 started 1733584888 start_ts 2024-12-07-16-21-28
root 256 objectid 0 offset 0 min_transid 0 max_transid 4060025 started 1733068536 start_ts 2024-12-01-16-55-36
root 258 objectid 0 offset 0 min_transid 0 max_transid 4060025 started 1733068536 start_ts 2024-12-01-16-55-36
root 259 objectid 0 offset 0 min_transid 0 max_transid 4060025 started 1733068536 start_ts 2024-12-01-16-55-36
root 260 objectid 0 offset 0 min_transid 0 max_transid 4060025 started 1733068536 start_ts 2024-12-01-16-55-36
root 261 objectid 0 offset 0 min_transid 0 max_transid 4060025 started 1733068536 start_ts 2024-12-01-16-55-36
...
root 320 objectid 0 offset 0 min_transid 0 max_transid 4060025 started 1733068536 start_ts 2024-12-01-16-55-36
root 321 objectid 0 offset 0 min_transid 0 max_transid 4060025 started 1733068536 start_ts 2024-12-01-16-55-36
root 4275 objectid 0 offset 0 min_transid 0 max_transid 4060025 started 1733068536 start_ts 2024-12-01-16-55-36
root 4278 objectid 0 offset 0 min_transid 0 max_transid 4060025 started 1733068536 start_ts 2024-12-01-16-55-36
root 4279 objectid 0 offset 0 min_transid 0 max_transid 4060025 started 1733068536 start_ts 2024-12-01-16-55-36
root 13080 objectid 0 offset 0 min_transid 0 max_transid 4060025 started 1733068536 start_ts 2024-12-01-16-55-36
root 13081 objectid 0 offset 0 min_transid 0 max_transid 4060025 started 1733068536 start_ts 2024-12-01-16-55-36
...
root 78022 objectid 0 offset 0 min_transid 0 max_transid 4080902 started 1733638691 start_ts 2024-12-08-07-18-11
root 78023 objectid 0 offset 0 min_transid 0 max_transid 4080902 started 1733638691 start_ts 2024-12-08-07-18-11
root 78024 objectid 0 offset 0 min_transid 0 max_transid 4080902 started 1733638691 start_ts 2024-12-08-07-18-11
# 351 lines

So I think we still have the problem that bees cannot follow up on snapper snapshot creation.

On the first pass after I deliberately reset the crawl progress, it finished all categories about 2 days ago. And during that whole period, it probably did work on a lot of extents which were previously tagged "toxic", but because the heuristic was raised from 0.1s to 5s, it now has a lot more slow work to do, and that's why the 128k category went from 2d-8w up to 2yrs? Is that an explanation?

I'm not sure if some stats in bees could show how much of the 128k category has actually been deduplicated, but I'm guessing the effective rate is quite low and I should probably remove that from the scanner?

Maybe bees should use shorter "toxic" wait time heuristics for smaller extents? IOW, instead of disabling the smaller categories altogether, let bees handle it somewhat more dynamically, based on the capabilities of the system and storage?

On a server where I've been using the new bees version for around a week, I can see higher sys% usage, but no clear trend. The first increase the week before was caused externally to the VM (you can see that by the steal% spike) and slightly decreased over the following days. Then bees 0.11 was added on Dec 1st:
[screenshot: sys% CPU usage graph for the server]

I think toxic extents and sys% usage are tightly coupled?

Are my thoughts even correct? ;-)

@Zygo (Owner, Author) commented Dec 9, 2024

In extent scan mode, bees only looks at subvols 250-255, the virtual subvols that are used for the extent scan size tiers. There's some vestigial subvol state tracking that shouldn't be there in extent mode--it's creating new subvol crawlers for the new subvols, and also deleting crawlers for subvols that no longer exist. Extent scan never runs the crawlers for the new subvols, so they sit in beescrawl.dat, inert at transid 0, until the subvol goes away or the user switches to a subvol scan mode.

It's mostly harmless, and shouldn't affect testing of extent scan itself.

In the progress table, you can see that the bottom end of the transid range is still recent:

    done   4.575T    %done size transid 4080962    todo              ETA
-------- -------- -------- ---- ------- ------- ------- ----------------
  2.703T   2.807T 96.2996%  max 4079030 4079054  16h 2m 2024-12-08 23:53

The current transid is 4080962, and the min transid for this row is 4079030. It should be able to catch up soon (the "16h 2m" part is...not very accurate, in my experience). So the tracking for extent scan seems to be working (the transid column would revert to 0 otherwise).

I might fix this by just moving the extent scan "virtual subvols" into a different tracking file, further decoupling it from the parts of bees that track literal btrfs subvols.

I'm not sure if some stats in bees could show how much of the 128k category has actually been deduplicated, but I'm guessing the effective rate is quite low and I should probably remove that from the scanner?

More to the point, the entire 128k category is less than 10% of the total, so even if it was 100% done, it would only save so much space.

One of my TODO list items is to make the size ranges configurable. That's another reason to put the extent scan tracking into its own file--I can add things like "min" and "max" size fields for each line. Lines removed by the user can stay removed, so you'd be able to remove the 128K category by removing the line.

If you are planning to try removing the 128K size tier, first try changing the beescrawl.dat line for root 250: set both min_transid and max_transid to 4080962 (i.e. the current filesystem transid), so it will consider all the existing data already done. Then see if it keeps up after that, or falls behind again.

Another way to manage this is to simply stop bees for a while once all the higher size rows are marked "finished". If bees is truly keeping up with data in the higher sizes, it's consuming new data faster than it's appearing in the filesystem. It follows that bees doesn't need to run all of the time. It could be banished to a maintenance window.

Maybe bees should use shorter "toxic" wait time heuristics for smaller extents? IOW, instead of disabling the smaller categories altogether, let bees handle it somewhat more dynamically, based on the capabilities of the system and storage?

The main problems with "toxic" extents in a post-5.7-kernel world are:

  1. Since v5.0, btrfs occasionally dumps a lot of transaction commit work on some victim thread that happened to try to join a transaction.
  2. Since v5.7, toxic extents don't really exist any more, at least not for extents with fewer than a million reflinks.
  3. Some of the toxic extent checks were expensive, especially when searching for adjacent blocks to matching extents. Checking for toxic extents in this case required LOGICAL_INO lookups that served no other purpose but to measure extent toxicity.
  4. Most toxic extents were created by bees in the first place. The recent changes should prevent creation of any more.

So most of the toxic extents that show up now are false positives, and we burn a lot more sys time checking for them than we save by finding one, now that they're more rare and bees avoids producing new ones.

Extent scan uses the LOGICAL_INO system call more, which would increase sys usage; however, other parts of bees now use it less, so there should be either a net saving, or more productive dedupe per unit of sys time. Even if usage is higher, it's a reasonable tradeoff due to the much higher overall performance.

Higher sys% now might simply be the result of more efficient scheduling--bees keeps more threads active and doing productive work than before, resulting in higher CPU usage across the board. Existing users may need to reduce load parameters or thread counts, or use schedtool to set idle/nice 19 priority.

This also points toward the need for explicit throttling in bees itself. If we don't want bees to use more than 20% CPU, one easy way to achieve that is for bees to sleep 80% of the time.


I've been experimenting with a throttle algorithm that tries to detect btrfs transaction latency spikes, and adds sleeps between dedupe calls to try to flatten them. Weirdly, it's coming up with "sleep 7-10% of the time is good enough", and the impact on benchmarks when doing that is to make dedupe faster. I guess those latency spikes hurt performance too? I'm still collecting data on this.

@lilydjwg commented Dec 9, 2024

If we don't want bees to use more than 20% CPU, one easy way to achieve that is for bees to sleep 80% of the time.

Does a systemd config CPUQuota=xx% work? However I don't like to limit CPU usage and let it run idle. I prefer to use CPUWeight to (de-)prioritize instead.

@kakra (Contributor) commented Dec 9, 2024

In extent scan mode, bees only looks at subvols 250-255, the virtual subvols that are used for the extent scan size tiers

Oh, I see... Interesting. Yes, a separate file might make this more obvious for people not knowing the internals of btrfs.

One of my TODO list items is to make the size ranges configurable. That's another reason to put the extent scan tracking into its own file--I can add things like "min" and "max" size fields for each line. Lines removed by the user can stay removed, so you'd be able to remove the 128K category by removing the line.

I like this idea. But it should be well documented how this feature should be used, e.g. why change min or max...

If you are planning to try removing the 128K size tier, first try changing the beescrawl.dat line for root 250: set both min_transid and max_transid to 4080962 (i.e. the current filesystem transid), so it will consider all the existing data already done. Then see if it keeps up after that, or falls behind again.

I'll try that if I don't see progress on the higher size ranges for the next few days.

Since v5.0, btrfs occasionally dumps a lot of transaction commit work on some victim thread that happened to try to join a transaction.

If this happens to random victim threads, and I'm guessing this applies to or affects user threads/processes that do IO, why can't the kernel just spawn its own "victim thread" which can block and wait?

Most toxic extents were created by bees in the first place. The recent changes should prevent creation of any more.

I'm pretty sure I did a full restore at least once post-5.7, so my btrfs should contain no more such toxic extents? (restored from borg, not some btrfs send mirror or clone) Or asked differently: How recent are those changes to bees?

Higher sys% now might simply be the result of more efficient scheduling--bees keeps more threads active and doing productive work than before, resulting in higher CPU usage across the board.

Yeah, I was suspecting something like this after looking at the performance graphs for some more time.

I guess those latency spikes hurt performance too? I'm still collecting data on this.

At least it looks like it hurts the performance of bees, because the ETA spikes, too. But that's probably the nature of the ETA calculation; I don't think it has that much of an impact overall. After the bigger sizes were done a few days ago, 128k took only 2 more days instead of the predicted 6 weeks to 1 year. IOW, especially with the smaller sizes, the ETA is very inaccurate and depends a lot on the progress of the higher size classes.

But I can say that I'm seeing very bad performance behavior on the desktop while bees is chewing through the data: Chrome randomly freezes when clicking a link, and gets stuck for multiple minutes or until I kill and restart the browser. Web video playback randomly freezes and eventually recovers after reconnecting to the stream. The KDE plasma-shell randomly freezes and usually doesn't recover on its own; instead it spins at 100% CPU somewhere and needs to be restarted (which takes 30-60s using systemctl).

If bees is done working, all of the above is hardly an issue. I'm seeing only very rare stalls, and only for like 1 or 2 seconds. This may not even be caused by bees itself.

@Zygo (Owner, Author) commented Dec 11, 2024

could bees get a fallback for scanmode 4 where it queues a subvol for scanmode 3 and then later tries to work through that as a one-pass...

I was thinking exactly that when this popped up on my screen...

are both modes that incompatible that this couldn't be made work?

Pretty much. It's the cost of repeating a range of extents vs. scanning an entire subvol from scratch.

Extent scan looks like this (one for each size tier):

  1. Get the next extent from the extent tree. If no extents left, start a new scan or wait for filesystem transid to go up
  2. Compare the transid on that extent with the range of this scan, go back to step 1 if out of range
  3. Get a list of all references to this extent (all subvols and reflinks)
  4. Read one of them, make hashes
  5. Decide whether to do one of:
    5.1. Destroy the entire extent by replacing every block in every reference to the extent
    5.2. Leave the entire extent untouched, and put its hashes in the hash table
  6. Return to step 1.

For each extent, extent scan finds the list of references that are present at that time, and must act on all or none of them at that time. If we can't do all the dedupes then the extent remains in the filesystem, taking up space.

Subvol scan looks like this (one for each subvol):

  1. Get the next ref from the subvol tree. If no refs left, start a new scan or wait for filesystem transid to go up
  2. Compare the transid on the ref with the range of this scan, go back to step 1 if out of range
  3. Read the ref, make hashes.
  4. Check to see if the hashes are already in the hash table at this position. If so, go to step 1.
  5. Decide whether to do one of:
    5.1. Destroy the entire ref by replacing every block in this extent ref
    5.2. Leave the entire ref untouched, and put its hashes in the hash table
  6. Return to step 1.

When there's a subvol that can't be modified, subvol scan can easily avoid the subvol--we simply stop the loop for the one subvol, and let all the other subvols continue to run. We can resume later by starting the loop again.

Note that in both cases, an immutable subvol means dedupe of some extents can't happen. At best, we can remove all of the mutable references, so that only immutable references remain. Later when those references are deleted, so is the duplicate extent.
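
Condensed into code, the extent-scan side of that comparison has roughly this shape (all types and helper functions below are assumptions for illustration, not real bees APIs):

#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-ins for the real bees/btrfs structures and helpers.
struct Extent { uint64_t bytenr, length, transid; };
struct ExtentRef { uint64_t subvol, inode, offset; };

std::optional<Extent> next_extent_from_extent_tree();
std::vector<ExtentRef> logical_ino_lookup(uint64_t bytenr);
std::vector<uint64_t> read_and_hash(const ExtentRef &ref, uint64_t length);
bool hashes_have_duplicates(const std::vector<uint64_t> &hashes);
void replace_every_ref(const std::vector<ExtentRef> &refs);
void add_to_hash_table(const std::vector<uint64_t> &hashes, uint64_t bytenr);

// Sketch of one size tier's scan cycle as described in the list above.
void extent_scan_cycle(uint64_t min_transid, uint64_t max_transid)
{
        while (auto e = next_extent_from_extent_tree()) {                   // step 1
                if (e->transid < min_transid || e->transid > max_transid)   // step 2
                        continue;
                auto refs = logical_ino_lookup(e->bytenr);                  // step 3
                auto hashes = read_and_hash(refs.front(), e->length);       // step 4
                if (hashes_have_duplicates(hashes)) {
                        replace_every_ref(refs);              // step 5.1: remove whole extent
                } else {
                        add_to_hash_table(hashes, e->bytenr); // step 5.2: keep it, remember it
                }
        }                                                     // step 6: next extent
}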

Getting back to the fallback question:

could bees get a fallback for scanmode 4 where it queues a subvol for scanmode 3

There could be a fallback, but it wouldn't be scanmode 3. There's no way to map a scanmode 4 data pointer to a scanmode 3 data pointer, so it's only possible to scan the whole subvol across the transid range. If any crawler has transid 0, any subvol scan must start from transid 0 (you may recall this pattern from the subvol->extent scan transition. It works both ways). The end result is that we'd often end up reading everything in a snapshot if send happened to be running when bees tried to dedupe it.

We would have to create a new kind of extent scan virtual subvol that scans only extents that have at least one reference from an immutable subvol (call this a "type B extent crawler") and put that crawler on pause immediately. The regular extent scan ("type A") would then ignore any extent that has a reference from an immutable subvol. If the subvol becomes mutable, then we put an upper limit on the "type B" extent scan crawler's position so it'll stop when it has caught up with "type A", pause the "type A" crawler, then start the "type B" extent scan crawler running (and switch back to "A" when "B" is done). If there are multiple subvols where send is coming and going, we'll need to dynamically create new "type B" crawlers every time a send starts.

So these extent scan crawlers would have a state like "start at 20T in transid 12345..23456 and end at 5T in transid 23456..34567, only for extents that reference subvol ID 7115,7232,7453". Definitely can't put that in beescrawl.dat, so we'd need a new state file (maybe even more than one).

It's possible, but it's potentially like the optimization for parallelizing reads on block groups by drive: it's a complex subsystem that 99% of users won't be able to use, because it turns out all of their data is in the same schedulable unit. If all of your data is touched by btrfs send, the fancy optimization degenerates to stopping and starting the entire process, but with more incomprehensible progress tracking and more room for bugs. It'd probably burn a lot of CPU too--those extent maps aren't cheap, even if you don't read the data.

@KeinNiemand

Do you actually get any benefit from the lower amount of fragmentation on existing data on an existing file system already scanned and deduped with an older bees version? Is there any way to get that benefit?

@Zygo (Owner, Author) commented Dec 12, 2024

Do you actually get any benefit from the lower amount of fragmentation on existing data on an existing file system already scanned and deduped with an older bees version?

Currently, bees doesn't attempt to reduce existing fragmentation.

Is there any way to get that benefit?

The new anti-fragmentation rules allow the btrfs fi defrag tool to be more effective.

Older versions of bees would attempt to replace all larger defragged extents with any surviving references to smaller fragmented extents, undoing any benefit from btrfs fi defrag. The new bees version will only dedupe the new, larger extents if this can be done without excess fragmentation; otherwise, the new extent is kept as-is and recorded in the hash table for future dedupes. Future dedupes will select the larger extents instead of the smaller ones when both exist in the hash table.

For filesystems with snapshots, it's possible to feed small batches of files (i.e. the same files from every snapshot) through btrfs fi defrag, then let bees dedupe them to recover the space. That can take a long time, it might cause metadata or data expansion, and if the snapshots are older and not in active use, the benefit is negligible. I would leave old snapshots alone, and let any layout issues they might have disappear when they expire.

@Massimo-B

Older versions of bees would attempt to replace all larger defragged extents with any surviving references to smaller fragmented extents, undoing any benefit from btrfs fi defrag. The new bees version will only dedupe the new, larger extents if this can be done without excess fragmentation;

That means bees itself introduces less fragmentation now. But the order of usage would still be first defragmentation then bees deduplication?

The new anti-fragmentation rules allow the btrfs fi defrag tool to be more effective.

...sounds more like running defrag after bees? Doesn't defragmentation revert some of the achieved deduplication?

I just remember never using "btrfs filesystem defragment". Can't remember why, it conflicted with either the old bees or send/receive by btrbk.

otherwise, the new extent is kept as-is and recorded in the hash table for future dedupes.

So there already is a list of skipped extents in the hash table for future dedupes. Can't that be used for the blocked btrfs-send areas in order to retry deduplication later?

Currently, bees doesn't attempt to reduce existing fragmentation.

Right now before trying the new bees RC, I'm balancing my very slow HDD btrfs (#298 (comment)).

@Zygo (Owner, Author) commented Dec 12, 2024

That means bees itself introduces less fragmentation now. But the order of usage would still be first defragmentation then bees deduplication?

The question from KeinNiemand was about recovering after running earlier versions of bees, so the order is:

  1. Run old bees, get a lot of fragmentation and some dedupe
  2. Run defrag to remove the fragmentation and the dedupe
  3. Run new bees to get some of the dedupe back, but without most of the fragmentation

Steps 2 and 3 could be done at the same time, i.e. upgrade bees first, then run the defrag, and let bees follow behind the defrag with dedupe. bees always acts after everything else has.

Doesn't defragmentation revert some of the achieved deduplication?

If it's done with a crude tool like btrfs fi defrag, yes. defrag is essentially "copy the data to a new, unshared location, then redirect the file's data pointers to the new location." So defrag undoes dedupe.

Once defrag is done, bees will treat the data the same way as if you had created a new file: if it finds existing duplicate copies of the data, bees replaces the new file with reflinks to the old files; however, it will not replace all duplicates if the resulting fragmentation exceeds the thresholds in the new bees version. Thus, dedupe undoes defrag, but not all of the defrag. The trick is to find the point where the fragmentation produces productive amounts of free space proportional to the cost of having the fragments, and bees is much closer to that point now.

In theory, there will be less fragmentation and less dedupe. They are tightly coupled together.

In practice it's not quite that simple. New bees makes better decisions than old bees when multiple matches are available or when matches overlap. Subvol scans are a bit "lossy" when it comes to partial matches and the extent scan doesn't have those kinds of losses. So it is possible, even likely, that the new bees dedupes to a smaller size than old bees, even as it uses fewer fragments to do so. Note this will only work once, and only when starting with a filesystem that has already been shaped by old bees--after that, theory takes over, and dedupe won't return all of the space that defrag takes up.

So there already is a list of skipped extents in the hash table for future dedupes.

No, there is simply a list of locations for each hash. If the hash is matched by a future dedupe, that dedupe will see multiple possible locations for the data, and pick the one that results in fewer fragments. Usually that will be the longest one, but there are sometimes exceptions, e.g. a smaller existing extent of exactly the same size will be chosen if the alternative is to split a large extent into smaller pieces to make a match fit.
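
A minimal sketch of that selection logic, with invented names (Location, pick_source) standing in for bees' real hash-table entries:

// Hypothetical sketch: pick a dedupe source among the stored locations
// for one hash, preferring the choice that creates the fewest fragments.
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

struct Location {
        uint64_t bytenr;      // where the candidate extent lives
        uint64_t extent_size; // size of the extent holding the matching block
};

std::optional<Location> pick_source(const std::vector<Location> &candidates, uint64_t target_size)
{
        std::optional<Location> best;
        for (const auto &c : candidates) {
                if (c.extent_size == target_size)
                        return c;          // exact fit: nothing has to be split
                if (!best || c.extent_size > best->extent_size)
                        best = c;          // otherwise remember the largest extent seen
        }
        return best;
}

int main()
{
        std::vector<Location> cands{{4096, 128 * 1024}, {8192, 1024 * 1024}};
        auto chosen = pick_source(cands, 128 * 1024);
        if (chosen)
                std::cout << "use extent at bytenr " << chosen->bytenr << '\n'; // picks the exact 128K match
}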

Can't that be used for the blocked btrfs-send areas in order to retry deduplication later?

No. I mean, it's possible, but there are several known ways to handle blocked btrfs-send areas, and all of them are better than using the hash table to do it.

@kakra
Copy link
Contributor

kakra commented Dec 17, 2024

Here's something I don't understand:

THREADS (work queue 0 of 10 tasks, 3 workers, load: current 1.99427 target 3.9401 average 2.17969):
        tid 3456: bees: [90096.9s] waiting for signals
        tid 4547: progress_report: [96.853s] idle 3600
        tid 4548: status_report: writing status to file '/run/bees/bees.status'
        tid 4549: hash_writeback: flush rate limited after extent #6425 of 8192 extents
        tid 4550: hash_prefetch: [3566.61s] idle 3600s
        tid 4551: crawl_transid: [17.477s] waiting 30.7075s for next transid RateEstimator { count = 4109471, raw = 98.3342 / 3019.61, ratio = 98.3342 / 3037.09, rate = 0.0323778, duration(1) = 30.8853, seconds_for(1) = 13.4091 }
        tid 4555: crawl_writeback: [96.818s] idle, dirty
        tid 4556: ref_ecf70018000_256M_1: Reading BeesBlockData BeesBlockData { 4K 0x9b90000 fd = 8 '/mnt/btrfs-pool/gentoo/rootfs/var/lib/flatpak/repo/objects/87/409c230a2ca93715ea73aa60c6c9705ff148d0f4931c14244ff5f955478594.commitmeta', data[4096] }
PROGRESS:
    done    4.56T    %done size transid 4109471 todo ETA
-------- -------- -------- ---- ------- ------- ---- ---
deferred   2.494T finished  max 4098316 4098405    -   -
deferred 629.475G finished  32M 4109469 4109470    -   -
deferred 425.835G finished   8M 4109469 4109470    -   -
deferred 455.267G finished   2M 4109469 4109470    -   -
deferred 311.319G finished 512K 4109469 4109470    -   -
deferred 293.454G finished 128K 4109469 4109470    -   -

The table tells me that "max" is finished, and deferred. But it still has threads running like ref_ecf70018000_256M_1 and the transid clearly lags behind.

Shouldn't the table tell me that it is still working on the max category?

@Zygo
Copy link
Owner Author

Zygo commented Dec 17, 2024

Check master, I moved the "finished" to the "ETA" column so it can show the %done value.

There are two main sources of progress lag:

  1. The progress table is a snapshot taken from the beginning of a transid cycle, not computed live with the rest of the status page. It will normally be at least one transid behind unless the filesystem is completely idle.

  2. The "finished" flag indicates that all extents have entered the task queue; however, the "done" columns indicate the lowest-numbered extent that has not yet exited the task queue. The queue can hold a few thousand extents and change their execution order so low-numbered extents get done later, and that holds back the progress report (and saved state in beescrawl.dat).

@Massimo-B
Copy link

For highlighting that bees was finished I usually had:
beesd mobiledata 2>&1 |grep -e "^" -e "ran out of data after"
That does not match anymore; with the RC it seems to be "Crawl finished" that signals bees is finished, right?

@werdck
Copy link

werdck commented Dec 18, 2024

I just tested the master branch on a set of 311GB (pre-bees) home directories, which went down to 255GB.
ETAs are wildly off, but at least the lowest one gives you a feeling of progress.
I noticed that, while both the subvol and extent modes fully utilize my laptop's CPU, extent mode runs a lot cooler and way faster.

Sometimes the progress table seems to completely hang up, in which case only removing the hashtable helped. That happens mostly when I restart bees though, and dedup seems to continue working.

@Zygo
Copy link
Owner Author

Zygo commented Dec 18, 2024

For highlighting that bees was finished I usually had:
beesd mobiledata 2>&1 |grep -e "^" -e "ran out of data after"
That does not match anymore; with the RC it seems to be "Crawl finished" that signals bees is finished, right?

It's "Crawl finished" for every subvol (all 6 for extent scan, or whatever number of subvols there are for subvol scan).

I should put that message back, because we can hook into that message and terminate the process. It has to be written differently for the various scan modes.

@Zygo
Copy link
Owner Author

Zygo commented Dec 18, 2024

@werdck

Sometimes the progress table seems to completely hang up, in which case only removing the hashtable helped. That happens mostly when I restart bees though, and dedup seems to continue working.

Can you clarify that? What does "the progress table seems to completely hang up" mean? Do you have a status file or debug log from such an event?

Removing the hashtable would break dedupe of any old, previously scanned data. That doesn't sound like it would be particularly helpful, and I don't know how it would avoid a hang.

@werdck
Copy link

werdck commented Dec 18, 2024

Sometimes the progress table seems to completely hang up, in which case only removing the hashtable helped. That happens mostly when I restart bees though, and dedup seems to continue working.

Can you clarify that? What does "the progress table seems to completely hang up" mean? Do you have a status file or debug log from such an event?

Sadly no files left from that run, I removed them as a troubleshooting step, but it is just the table not changing anymore for two hours in my case, while the lowest ETA was around 3 minutes. Restarting bees didn't get it to update either.

Removing the hashtable would break dedupe of any old, previously scanned data. That doesn't sound like it would be particularly helpful, and I don't know how it would avoid a hang.

Bees itself doesn't hang, I think, at least judging from the still-updating crawler messages in /run/bees/[uuid].status

Shouldn't removing the hashtable force a rescan? Or is the state saved somewhere else too?

@Zygo
Copy link
Owner Author

Zygo commented Dec 18, 2024

Shouldn't removing the hashtable force a rescan? Or is the state saved somewhere else too?

Scan position is saved in beescrawl.dat. The hash table is only pairs of hashes and locations. If you removed everything in .beeshome then you got both of those.

The end stages of a first filesystem scan might not update progress for some time, possibly hours if there are a lot of small duplicate extents left in the queue at the end. All it takes is for one extent to get kicked to the back of the queue, then progress won't update until the entire queue is done.
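
A toy illustration of that effect, assuming a tracker that can only safely report the lowest extent still in flight (invented names, not bees' actual bookkeeping):

// Hypothetical sketch: why reported progress is the lowest extent still
// in flight, not the highest extent that has been queued or finished.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <set>

class ProgressTracker {
        std::set<uint64_t> in_flight; // extents queued but not yet finished
        uint64_t last_queued = 0;     // highest position handed to the queue so far
public:
        void enqueue(uint64_t pos) { in_flight.insert(pos); last_queued = std::max(last_queued, pos); }
        void finish(uint64_t pos)  { in_flight.erase(pos); }
        // The only position that is safe to save: everything below it is done.
        uint64_t reportable() const { return in_flight.empty() ? last_queued : *in_flight.begin(); }
};

int main()
{
        ProgressTracker t;
        t.enqueue(100);
        t.enqueue(200);
        t.enqueue(300);
        t.finish(300);
        t.finish(200);
        // Extent 100 is still being worked on, so reported progress stays at
        // 100 even though later extents already finished.
        std::cout << "reportable position: " << t.reportable() << '\n'; // prints 100
}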

If new files are added to the filesystem, bees should go dedupe them and then come back to the smaller ones, so in normal daemon usage this shouldn't be a problem. If you're waiting for progress to say "finished", then it might be.

The progress will also not update if there are no new transactions on the filesystem to scan. That doesn't seem likely if there are still active crawlers though. That case should still get listed in the docs.

@werdck
Copy link

werdck commented Dec 19, 2024

Shouldn't removing the hashtable force a rescan? Or is the state saved somewhere else too?

Scan position is saved in beescrawl.dat. The hash table is only pairs of hashes and locations. If you removed everything in .beeshome then you got both of those.

Ah, sorry then. I deleted beescrawl.dat when it happened, which made the progress table update again when I restarted bees.
Deleting the rest of .beeshome came later, when I wanted to get another independent dedupe over the then freshly balanced data.

@KeinNiemand
Copy link

I've been running this version for a few weeks now (starting from an incomplete scan of the old version; incomplete because I added a drive, did a balance, and had to delete the hashtable). Progress seems to go up very slowly, and it also seems to have been at ~40% completion for the size with the most progress for a while now.
Progress Table:

PROGRESS:
    done  10.154T    %done size transid 652877   todo   inaccurate ETA
-------- -------- -------- ---- ------- ------ ------ ----------------
915.967G   6.417T 13.9387%  max       0 618420 22w 3d 2025-06-09 09:24
 425.14G   3.021T 13.7448%  32M       0 618420 22w 5d 2025-06-11 14:46
207.464G 501.875G 41.3378%   8M       0 618420  7w 4d 2025-02-24 20:48
 36.288G 138.458G 26.2090%   2M       0 618420 11w 6d 2025-03-27 13:10
  2.252G  92.263G  2.4412% 512K       0 618420 2y 24w 2027-06-21 19:26
  8.366M  904.55M  0.9249% 128K       0 618420 6y 26w 2031-07-06 08:24

@kakra
Copy link
Contributor

kakra commented Jan 2, 2025

it also seems to have been at ~40% completion for the size with the most progress for a while now.

Yeah, I've seen this multiple times, too, now. It simply sits there seemingly doing nothing but it will eventually finish. You will probably see bees reading some data from time to time, and burning CPU calculating hashes and extending block ranges.

What bothers me more, and maybe this is one instance of this problem: Sometimes the table says "I'm 100% done" but the transid clearly indicates it's still thousands of transids behind for one size category.

Currently, it's saying:

    done    4.51T    %done size transid 4156824 todo inaccurate ETA
-------- -------- -------- ---- ------- ------- ---- --------------
       0    2.02T  0.0000%  max 4156823 4156824    -       finished
       0 859.718G  0.0000%  32M 4156823 4156824    -       finished
336.375G 362.503G 92.7924%   8M 4156822 4156823    -       finished
320.478G 345.371G 92.7924%   2M 4156822 4156823    -       finished
293.683G 316.495G 92.7924% 512K 4156822 4156823    -       finished
617.092G 665.024G 92.7924% 128K 4156822 4156823    -       finished

So essentially there's 0 bytes of 2TB done, which correctly equals 0%, but the "ETA" says "finished", and the transid is what can be called "done" (it lags 1 transid behind).

To me this looks like the table pulls data from completely unrelated runs, e.g. the "done" column (the 1st column) is from the current run, the total column (2nd column) is from some previous run or from some total counter, "%done" is calculated from who knows what, and "ETA" says finished even though "%done" is not even near 100%. Additionally, "%done" is exactly equal in multiple category rows. I don't think that is realistic.

Well, maybe this means: In the current run, there was no data in the transid range. But in the whole filesystem there are 2 TB of data of extents over 32 MB. But then: What does "%done" mean? How do these columns relate to each other?

In the other categories, "%done" hardly reaches 100% and lags behind all the time.

At best, "%done" is a very bad guess. And that reflects over to the ETA. Or maybe there are just race conditions with stats maintenance of that table data.

@Zygo
Copy link
Owner Author

Zygo commented Jan 2, 2025

%done is calculated by:

  1. Get total number of bytes in all data block groups
  2. Get the last extent pointer returned from a btrfs search in the transid range
  3. Map that extent pointer to its relative position in the total number of bytes from step 1
  4. Divide by the total in step 1 and multiply by 100% to get %done
  5. Calculate everything else from %done using statistics collected from previously processed extent sizes

Step 2 sometimes returns nothing if there aren't any btrfs extents in that transid range, so we get a zero there (e.g. if transid 4156823-4156824 contained only deletions or metadata modifications, no new data blocks). What the table above is saying is "there were no new extents <= 512K in size between transid 4156823 and 4156824."
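
A simplified sketch of steps 1-4 above, ignoring the mapping through data block groups (invented names, not bees' code):

// Hypothetical sketch: map the last extent returned by a transid-range
// search onto a percentage of the filesystem's total data size.
#include <cstdint>
#include <cstdio>
#include <optional>

// total_data_bytes - step 1: bytes in all data block groups
// last_extent_pos  - step 2: position of the last extent returned by the
//                    transid-range search, or empty if nothing matched
double percent_done(uint64_t total_data_bytes, std::optional<uint64_t> last_extent_pos)
{
        if (!last_extent_pos || total_data_bytes == 0)
                return 0.0; // no new extents in the range: position reads as zero
        // steps 3-4: relative position, scaled to a percentage
        return 100.0 * static_cast<double>(*last_extent_pos) / static_cast<double>(total_data_bytes);
}

int main()
{
        std::printf("%.4f%%\n", percent_done(2ULL << 40, 1ULL << 40));   // halfway through 2 TiB: 50.0000%
        std::printf("%.4f%%\n", percent_done(2ULL << 40, std::nullopt)); // empty transid range: 0.0000%
}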

More often we get a small number of extents, less than 10000, so in a few seconds we've queued everything and get to finished immediately (the actual scan and dedupe takes longer, but the metadata gets loaded instantly).

All the sizes see the same extents, they just ignore anything outside of their own range. So if there's only one extent modified in transaction 4156822, and it's at 92.7924% of the distance between the beginning of the filesystem and the last allocated byte, then all the scanners will point there at the end of the cycle, because it's the last extent they've seen. Only one of the scanners will actually process that extent.

Step 5 is very biased toward whatever was most recently written to the filesystem. If bees is idle and you write a thousand small files, then small extents appear to grow relative to the other size bands, because most of the extents bees sees will be small. This causes wild swings in the size-specific data size estimates, especially if some of the crawlers are idle--their extent size will no longer be represented in the statistics, so their size shrinks while the others grow.

The size data will become important to a sysadmin when it becomes possible to delete unproductive crawlers from the table, but maybe the data shouldn't be collected the way it is now--maybe it would be worthwhile to make it a startup task, or even a separate tool that does a proper statistical sample. Mostly what we need to know here is the maximum possible benefit from allowing the crawler to continue to run. If you have 300TiB of data and the 128K size is only 6 TiB, you can only get 1% of the space out of that crawler if every single extent in that size range is a duplicate. That assessment doesn't change very much if the estimated size is double or half that number--2% is no better than 1% if it takes 3 years of IO time to get the space back.

The progress table is a snapshot of the previous cycle taken just before a new cycle is started. If the snapshot was taken after the new cycle is started, we would never see finished (a new cycle clears that bit) and all the "done" columns will be zero (a new cycle sets all finished crawlers to zero).

If a crawler finishes in the middle of a scan cycle, but there are more transids available, the crawler will restart itself immediately. That can lead to the crawlers getting a little out of sync with each other (like the 32M and max rows), but they'll all eventually cover all transids.

Scan cycles start from an idle task. If all of the worker threads are running non-idle tasks (i.e. actual dedupes) then the scan cycle task never gets run. There's no need to sign up for more work if all the workers are busy processing extents already in the queue. Using an idle task does freeze the progress table, though. Maybe the table generation should be split off into its own non-idle task, so the table is always up to date even if we didn't queue any new work. Apart from anything else, if we didn't do any new work, the ETA should be updated with a future time.

I guess my question is whether this table is useful, or how it could be improved? A lot of the problem is just lack of available information at zero cost. I can get better information, but it costs more IO (it'll be reading big parts of the extent tree multiple times). Maybe the stand-alone tool is a better idea--something that takes a proper sample of the entire filesystem for a few minutes, then reads beescrawl.dat and says "yes you're 82% done, 3 more hours" and actually hits that target time.

@kakra
Copy link
Contributor

kakra commented Jan 2, 2025

I guess my question is whether this table is useful, or how it could be improved? A lot of the problem is just lack of available information at zero cost. I can get better information, but it costs more IO (it'll be reading big parts of the extent tree multiple times)

Yeah I think it's exactly that, and that confuses people - especially if they eagerly waited for the "progress bar".

So maybe "%done" should be pulled out of the table if it is more or less a global position anyways, and display it as a global value instead.

The other issues are where certain combinations of information per row do not give a conclusive view, e.g. saying "0" here but "finished" there. I'm not sure how to solve this. Maybe it should say "finished" or "waiting" in the first column then, and say "n/a" for ETA instead.

And technically, if there's nothing to do because a transid range didn't have extents of that size, the progress is probably 100% and not 0% (more correctly, it's probably "n/a"). Maybe this already fixes the most confusing part of the table?

While I like the idea of a separate tool, I don't think you should focus on such a tool too much. I think there are more useful things like recombining extents (aka defrag) - if that's in a state where you could make progress.

The progress table is nice but currently it's confusing. To me, this looks like most recent reports focus around the progress table for that reason (but I may be biased because I'm active here). I'd prefer the "confusion" part improving first for a v0.11 release, and only later focus on accuracy. For me, bees works a lot better and faster now, maintains a better file system layout, and introduces fewer stalls. These are great changes and they are important. I'd prefer to get a "stable" release first, ignore accuracy, but at least fix the confusing parts of the progress table so we'd get fewer reports and questions about it.

So what are the confusing parts?

  1. Numbers across columns of a row seemingly do not match what one would expect.
  2. It looks like there's no progress at all for very long periods of time (which is probably the lack of info part).
  3. (anything else to add by participants here?)

So, if a row doesn't make visible progress for some time, maybe it's best to just clear its ETA (and say "waiting" instead or something), and probably include some info about which "position" the crawler is currently at (if that is possible), instead of just having some static transid range for very long periods of time.

@kakra
Copy link
Contributor

kakra commented Jan 2, 2025

For me, bees works a lot better and faster now, maintains a better file system layout, and introduces fewer stalls

To elaborate on that a little more:

For me, flatpak updates seem to work faster now with fewer IO stalls after I defragged /var/lib/flatpak. There seems to be a lot of deduplication. bees invested a lot of time on these files.

Also, many desktop actions now have fewer IO stalls when writing data, or copying large files from my local drives to my NAS (even if those large files - stream video recordings - are on my xfs partition). I'm not sure how KDE Plasma interacts with btrfs on my $HOME while doing IO operations but things certainly improved.

@Zygo
Copy link
Owner Author

Zygo commented Jan 3, 2025

So maybe "%done" should be pulled out of the table if it is more or less a global position anyways, and display it as a global value instead.

It's definitely an independent value that only sometimes synchronizes between scanners.

The idea is that the large extent scanners cherry-pick the largest extents, giving back cheap free space quickly, while the smaller extent scanners take longer to free a possibly larger total amount of space. In terms of progress, they'll lap each other, like hands on a clock: the big extent size crawlers are hundreds of times faster than the small ones. The big ones are expected to run a full scan cycle and go into and out of the idle state many times over, in the time that the small extent size crawlers finish the first scan.

So maybe "%done" should be pulled out of the table if it is more or less a global position anyways, and display it as a global value instead.

Numbers across columns of a row seemingly do not match what one would expect.

If you look through the commit history, one of the versions of the progress table blanked out the two "done" columns when the scan was finished. We could go back to doing that.

The "done" column seems fairly useless--it's not a raw variable, or even a cooked one. It's overcooked. So I guess we can just drop it outright.

The position always ends up as a number that looks like a size or a percentage. I could say "0.927" instead of "92.7924%" but it's still going to go from 0.000 to 1.000 on each pass. I guess one solution would be to drop it entirely, and have only the ETA.

The ETA is the time until the end of the current cycle, i.e. when the transids will be updated and new data will be considered for dedupe. It's not the time until bees becomes idle, which is much harder to predict. bees becomes idle when all of its crawlers are idle.

The amount of work within each transid is unpredictable--it's the amount of new data that was written to the filesystem, whatever that is. A percentage based on the transids will end up cycling from 99.999% to 100%, with more 9's added as the filesystem gets older.

It looks like there's no progress at all for very long periods of time (which is probably the lack of info part).

I've added some timing data on extent dedupes and I'm finding there are a few cases where it does take hours or even days to process the extents in the queue. It doesn't seem to be causing any other problem--dedupe continues on other files, even new scan cycles start--but the progress can't move forward because there's an extent in the queue at the end of a long task tree branch. It might be a problem for the "run bees for an hour a day" model since it would restart the hours of processing at the beginning.

There is also some unfair scheduling, particularly for reads that use an ordinary mutex lock for synchronization.

say "waiting" instead or something

"idle" is shorter, and already used for bees status messages when a thread is intentionally not doing anything.

Something like:

extsz  datasz  pos'n gen_min gen_max ctime   cycle end ETA      cycle begin
----- -------- ----- ------- ------- ------ ---------------- ----------------
 128K 899.475G 0.201       0 1069083 46w 3d 2025-11-24 12:35 2024-10-29 11:17
 512K 543.013G 0.981 1228184 1228362 3d 10h 2025-01-06 11:32 2024-12-30 16:04
   2M 409.057G 0.991 1228636 1230371 2d 15h 2025-01-05 16:55 2024-12-31 09:39
   8M   1.494T 0.992 1228507 1230520 2d 14h 2025-01-05 15:31 2024-12-31 10:59
  32M   2.712T     - 1228353 1229846      -             idle 2024-12-31 04:46
  max  21.424T     - 1228319 1229971      -             idle 2024-12-31 05:54
total  27.438T       cur_gen 1237969

This has a new column to give the time when a crawl started. This would give an idea of how old the data is that will be scanned. Also moved the heading data into a footer row, and used "gen" instead of "transid" because it's shorter.

@kakra
Copy link
Contributor

kakra commented Jan 3, 2025

There is also some unfair scheduling, particularly for reads that use an ordinary mutex lock for synchronization.

Maybe you could create something like a deadline scheduler for the queue? Like "if an extent is in the queue for more than X minutes, move it to the front", then work from there with some sorting to prevent a feedback loop which moves extents to the front too often, or protect the queue from this deadline action for some minutes... Since there can be hours of work, X probably should be calculated dynamically or derived from some other value (like the size category).
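
A rough sketch of that deadline idea, with invented types and arbitrary per-tier deadlines (this is only the suggestion above, not an existing bees feature):

// Hypothetical sketch: promote queue entries that have waited past a
// per-size-category deadline, keeping FIFO order within each group.
#include <algorithm>
#include <chrono>
#include <deque>
#include <iostream>

using Clock = std::chrono::steady_clock;

struct QueuedExtent {
        unsigned size_category;   // 0 = 128K tier ... 5 = max tier (made up here)
        Clock::time_point queued; // when the extent entered the queue
};

// Arbitrary rule for the sketch: bigger tiers get longer deadlines.
Clock::duration deadline_for(unsigned size_category)
{
        return std::chrono::minutes(5) * (size_category + 1);
}

// stable_partition moves overdue entries to the front but preserves the
// relative order inside each group, so the rest of the queue isn't shuffled.
void promote_overdue(std::deque<QueuedExtent> &q, Clock::time_point now)
{
        std::stable_partition(q.begin(), q.end(), [&](const QueuedExtent &e) {
                return now - e.queued > deadline_for(e.size_category);
        });
}

int main()
{
        const auto now = Clock::now();
        std::deque<QueuedExtent> q{
                {5, now - std::chrono::minutes(1)},  // fresh entry in the largest tier
                {0, now - std::chrono::minutes(30)}, // small-tier entry stuck for half an hour
        };
        promote_overdue(q, now);
        std::cout << "next size category: " << q.front().size_category << '\n'; // 0: the overdue one
}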

Also moved the heading data into a footer row

Maybe fill the remaining empty columns, too, with the min and/or max times?

This has a new column to give the time when a crawl started. This would give an idea of how old the data is that will be scanned. Also moved the heading data into a footer row, and used "gen" instead of "transid" because it's shorter.

I like this change. Let's see how this works. I'll report back (when the change is pushed). Maybe I can recreate some situations which you described by this:

but the progress can't move forward because there's an extent in the queue at the end of a long task tree branch

Thanks. :-)

@Zygo
Copy link
Owner Author

Zygo commented Jan 3, 2025

Maybe you could create something like a deadline scheduler for the queue?

That's one possibility. We could set a minimum dedupe/scan rate, and if an extent sits in the queue too long, we just drop it. If we spend an hour on a 128M extent, that works out to deduping 20 TiB in about 20 years...

It looks like the problem case is that we have thousands of extents on a single file, which wouldn't be a problem by itself--but the queue FIFO list is "extent 200-7000, then extent 12-199", because we got through extents 1-199 before discovering that extent 12 shares an inode with extent 200 that was already running in another worker thread. Extent 200 had been running for a while, and extents 201-7000 ended up in the queue waiting for extent 200's lock, while extents 1-11 were running in the other worker thread. Extents 12-199 that were already waiting for extent 11 get appended to the end of extent 200's lock queue, after extent 7000. That means that after extent 11, we're suddenly 6800 extents away from dequeuing extent 12.

Maybe we can move the Task queue branch into the Exclusion object, and make it a sorted list instead of FIFO. Then in this example we would insert tasks 12-199 after task 200, so the queue is "extent 200, 12-199, 201-7000" (extent 200 is already running so it can't be moved), i.e. the lowest address is always next on the queue for the worker.
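
A toy sketch of that idea, using a map keyed by extent address in place of the FIFO (invented types, not bees' Task or Exclusion classes):

// Hypothetical sketch: tasks blocked on the same lock are kept ordered by
// extent address, so the lowest address is always dequeued next regardless
// of arrival order.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct SortedWaitQueue {
        std::map<uint64_t, std::string> waiting; // extent address -> task name

        void block(uint64_t extent_addr, std::string task) { waiting.emplace(extent_addr, std::move(task)); }

        void run_next()
        {
                if (waiting.empty()) return;
                auto next = waiting.begin(); // lowest extent address wins
                std::cout << "running " << next->second << '\n';
                waiting.erase(next);
        }
};

int main()
{
        SortedWaitQueue q;
        // Arrival order mimics the example above: 201..7000 queue up first,
        // then 12..199 get appended much later.
        q.block(201, "extent 201");
        q.block(7000, "extent 7000");
        q.block(12, "extent 12");
        q.run_next(); // "running extent 12": no longer thousands of entries away
}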

I don't plan on fixing that. My plan is to get csum scan and extent dedupe up and running (we currently have extent scan, but it's still using extent-ref-based logic for dedupe, and the scan read optimization is a nasty hack). Extent dedupe maintains a sorted and optimized list of dedupe operations that is even more out-of-order, but it is also better equipped to avoid committing resources to extents that are not a productive use of time. csum scan doesn't get rid of the unfair scheduling on data reads, but it does make the individual Tasks faster by doing the dedupe hash search on the csums, thereby eliminating most of the data reads (only duplicate and compressed reads will remain) and also a lot of unnecessary LOGICAL_INO lookups.

@kakra
Copy link
Contributor

kakra commented Jan 5, 2025

The new table looks good. I think renaming to "this cycle" and "next cycle" is a good change. I'm not sure what "point" is but it seems to be some sort of percentage just hiding in a disguise. That's probably a good change, too.

"ctime" is probably the estimated remaining time.

I've run btrfs fi defrag -r in my $HOME/.config to trigger a bees cycle (around 20GB of data). I don't know what else you changed but the numbers from the table look mostly accurate, or at least not completely off. But this may also be purely a side effect of the touched data set. It's been pretty fast, too, with no stalls at all.

@Zygo
Copy link
Owner Author

Zygo commented Jan 6, 2025

I'm not sure what "point" is but it seems to be some sort of percentage just hiding in a disguise

Yeah, divide it by a million and multiply by 100 and it's a percentage. I think the column might only be useful for verifying the ETA calculation.

"ctime" is probably the estimated remaining time.

Suggestions for that column header that are 7 characters or less welcome! "tm_left" would fit...

the numbers from the table look mostly accurate, or at least not completely off. But this may also be purely a side effect of the touched data set

The accuracy depends on the size of the data set and how uniformly distributed it is. So if you do some consistent operation (like defrag, which selects short extents) across a big enough part of the filesystem (like 1% or more), it should come out with smoothly increasing, predictable numbers.

@kakra
Copy link
Contributor

kakra commented Jan 6, 2025

tm_left would probably be good. Otherwise, this looks good to me. Let's see what others from this thread say...

@kakra
Copy link
Contributor

kakra commented Jan 6, 2025

@Zygo I tested this build on another (much slower, both cpu+io) system. I defragged a lot of files (probably almost the whole system). I don't know what you changed during the last week but bees works a lot faster now and the new progress estimation is a lot more accurate.

@kakra
Copy link
Contributor

kakra commented Jan 6, 2025

Okay, here's a table where ETA actually fell behind:

PROGRESS:
extsz   datasz  point gen_min gen_max this cycle start ctime   next cycle ETA
----- -------- ------ ------- ------- ---------------- ----- ----------------
  max  638.46G   idle  496157  496159 2025-01-06 16:22     -                -
  32M 109.843G   idle  496157  496159 2025-01-06 16:22     -                -
   8M  60.829G   idle  496157  496159 2025-01-06 16:22     -                -
   2M  27.104G 885935  496157  496159 2025-01-06 16:22   34s 2025-01-06 16:23
 512K  43.693G 812025  496157  496159 2025-01-06 16:22   35s 2025-01-06 16:23
 128K 715.079G 852665  496157  496159 2025-01-06 16:22   30s 2025-01-06 16:23
total   1.558T        gen_now  496162

It's 16:27 now... Should ETA say "unknown" or "4 minutes late" then?

@businessBoris
Copy link

[nothing useful to say, except thank-you to all the devs and testers.]

@Zygo
Copy link
Owner Author

Zygo commented Jan 7, 2025

Should ETA say "unknown" or "4 minutes late" then?

I can't answer that without knowing what's in your execution queue. If there are hundreds or thousands of tasks reported in the status, then it's likely that it will take a few minutes before a worker becomes available to update the progress tracker.

What seems to be happening here is that the progress/crawl restart task now has the same priority as the extent scanning tasks, but the extent scanning tasks won't release a worker until every reference to their extent is processed or they hit a lock conflict. This avoids thrashing the page cache with too many swaps between different extents and keeps the cache hot so we don't have to reread data. I could add a higher priority level, but priority doesn't matter if no worker thread even looks at the queue, and I don't want to start interrupting extent tasks because it's faster overall to wait them out.

At some point I'd like to have an option where users can specify the minimum distance between transids, so e.g. you can tell bees to wait until there's 100 transids between gen_max and gen_now before restarting a crawler. That would avoid deduping extents in files that are frequently being overwritten, and maybe generally help with efficiency by not doing as many unproductive tree searches. So I'm not too worried about bees stopping some of the crawls for a few minutes at a time, because that's eventually going to be a feature for filesystems where bees is idle most of the time.
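
A tiny sketch of what that option could look like (should_restart_crawl and min_transid_gap are hypothetical names; no such option exists yet):

// Hypothetical sketch: only restart a crawler once the filesystem has moved
// far enough ahead of the crawler's last scanned transid.
#include <cstdint>
#include <iostream>

// gen_max: last transid the crawler has scanned up to
// gen_now: current filesystem transid
bool should_restart_crawl(uint64_t gen_max, uint64_t gen_now, uint64_t min_transid_gap)
{
        return gen_now >= gen_max && gen_now - gen_max >= min_transid_gap;
}

int main()
{
        // With a gap of 100, a crawler stopped at gen 496159 waits until gen 496259.
        std::cout << should_restart_crawl(496159, 496162, 100) << '\n'; // 0: too soon
        std::cout << should_restart_crawl(496159, 496260, 100) << '\n'; // 1: restart now
}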

If all of the crawls are showing "idle", then I'd expect a progress update every N seconds, where N is the average time between recent transactions on the filesystem, because there's no queued work to delay running the progress update task. N is usually 30-60 seconds, but on a very idle filesystem, bees could go to sleep for up to an hour (the hardcoded limit).
