bees: Avoid unused result with -Werror=unused-result
Fixes: commit 20b8f8a ("bees: use helper function for readahead")
Signed-off-by: Kai Krakow <[email protected]>
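For context, a minimal sketch of the kind of change the commit title describes, assuming the helper wraps the `readahead()` syscall and the toolchain warns about its ignored return value; this is not the actual bees diff:

```cpp
// Hypothetical helper, not the bees implementation: consume readahead()'s
// return value so a -Werror=unused-result build does not fail.
#include <fcntl.h>      // readahead() (glibc, _GNU_SOURCE)
#include <iostream>

static void readahead_checked(int fd, off_t offset, size_t size)
{
	// readahead() is advisory; log failures instead of silently discarding the result.
	if (readahead(fd, offset, size) < 0) {
		std::cerr << "readahead(" << fd << ", " << offset << ", " << size
		          << ") failed" << std::endl;
	}
}
```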
@Zygo Would it be better to use `posix_fadvise()` instead, with `POSIX_FADV_SEQUENTIAL` (which should double the readahead window to 128kB for the file descriptor) and/or `POSIX_FADV_WILLNEED` (which would preload the file range into the cache using a large or optimized IO size), and maybe also discard written data from cache with `POSIX_FADV_DONTNEED` (which initiates an immediate dirty writeback of the data)? The man page on this seems to be quite inaccurate or even wrong; looking at the kernel source, this seems to do what we need.
I never got around to testing them all and figuring out which was best. At least they're all in one function now, though, which would make that testing easier...for someone else to get around to before I do, probably.
Some notes:
- `FADV_DONTNEED` drops the cached pages for every process, not just bees. Memory limits on cgroups seem to handle that much better--a page that is in use by some other cgroup is kept for that cgroup, while bees furiously thrashes through its own private page cache in its cgroup for the rest of the filesystem.
- There is an argument for `FADV_DONTNEED` anyway, because memcg is a pretty heavy dependency (though systemd uses it, so it can't be that hard for users to adopt).
- The other problem with `FADV_DONTNEED` is that bees doesn't have any call sites for it. This is mostly because bees can't really know (or I can't figure out) when it won't need to see a piece of data any more. It might never see that data again, or it might see the same data over and over and need to keep the blocks in cache for dedupe. Maybe they could be put on an async discard LRU list (see the sketch after these notes) and a bees thread could free them when they haven't been used in a while? But that sounds like reimplementing memcg...
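A purely speculative sketch of that async-discard-LRU idea; the names and eviction policy are hypothetical, not anything in bees:

```cpp
#include <fcntl.h>
#include <chrono>
#include <list>

// Track recently-read ranges; periodically FADV_DONTNEED the ones that have
// not been touched for a while, instead of dropping them immediately.
struct CachedRange {
	int fd;
	off_t offset;
	off_t length;
	std::chrono::steady_clock::time_point last_used;
};

static std::list<CachedRange> lru_ranges;  // front = most recently used

static void expire_stale_ranges(std::chrono::seconds max_age)
{
	const auto now = std::chrono::steady_clock::now();
	while (!lru_ranges.empty() && now - lru_ranges.back().last_used > max_age) {
		const CachedRange &r = lru_ranges.back();
		// Still drops the pages for every process, not just bees -- the
		// fundamental problem described above remains.
		posix_fadvise(r.fd, r.offset, r.length, POSIX_FADV_DONTNEED);
		lru_ranges.pop_back();
	}
}
```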
Mostly what the readahead function does now is prevent two stupid things:
Doing the readahead before dedupe makes it run almost an order of magnitude faster, especially on spinning drives.
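A hedged sketch of that ordering: populate the page cache for both ranges with `readahead()` before issuing the kernel's dedupe ioctl. The helper is illustrative and not how bees structures this internally:

```cpp
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   // FIDEDUPERANGE, struct file_dedupe_range
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Read both ranges ahead, then ask the kernel to dedupe them. The kernel
// compares the ranges byte by byte, so having them cached avoids small
// synchronous reads during the ioctl.
static int dedupe_with_readahead(int src_fd, off_t src_off,
                                 int dst_fd, off_t dst_off, uint64_t length)
{
	if (readahead(src_fd, src_off, length) < 0) { /* advisory, ignore */ }
	if (readahead(dst_fd, dst_off, length) < 0) { /* advisory, ignore */ }

	// FIDEDUPERANGE takes a variable-length argument: one source range plus
	// an array of destination ranges (a single destination here).
	const std::size_t arg_size = sizeof(file_dedupe_range) + sizeof(file_dedupe_range_info);
	auto *arg = static_cast<file_dedupe_range *>(std::calloc(1, arg_size));
	if (arg == nullptr)
		return -1;
	arg->src_offset = src_off;
	arg->src_length = length;
	arg->dest_count = 1;
	arg->info[0].dest_fd = dst_fd;
	arg->info[0].dest_offset = dst_off;

	const int ret = ioctl(src_fd, FIDEDUPERANGE, arg);
	std::free(arg);
	return ret;
}
```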
Ah, I just remembered that you already explained this before.
I found that using memcg limits with bees starts blowing up swap somehow. The kernel seems to prefer cache too much, and then starts thrashing other memory of bees. Maybe it's the same problem: cache pages may be shared with other processes, and the kernel can't or doesn't want to drop the cache "just from bees", and keeps it associated with it, thus forcing other memory to swap. Or it's the other way around: it triggers other processes sharing the same page cache to be swapped out or to drop their cache, resulting in thrashing of seemingly unrelated processes. OTOH, that was with kernel 5.4; I didn't try again with 5.10. I did that with systemd, btw.
Things have generally improved since KDE Plasma started running apps in their own systemd scopes. This seems to give each application a higher overall memory weight, so I completely removed any memcg settings from services. memcg min limits seem to generally work better than max limits, so I kept the min memory settings for some important systemd scopes which tend to thrash under memory pressure. But if anything starts to thrash due to memory pressure, it usually badly affects any process that wants to read from btrfs. Processes only accessing other file systems seem to stay in much better shape under such conditions. FWIW, the problems with memcg limits seem to be more related to btrfs than to other file systems.
Yeah it's often faster to type stuff again than to find a link.... :-P
Yeah, somewhere between 5.1 and 5.9 memcg became pretty much unusable with btrfs in general, and I had to run without it for a while. It seems to work again in 5.10, though tbh memcg only really works properly on about 50% of kernels since 3.0...
I have an experimental patch which calls `mlockall(MCL_FUTURE)`. That helps the kernel figure out which pages to swap (and works around some kernel bugs that were fixed in 5.8). It also totally locks the system if you underestimate how much memory is available for bees and the hash table, so I haven't pushed it.
I guess that means we should probably figure out when to `FADV_DONTNEED` some day...
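A minimal sketch of that approach (not the experimental patch itself):

```cpp
#include <sys/mman.h>
#include <cerrno>
#include <cstring>
#include <iostream>

// Lock current and future mappings (e.g. the hash table) into RAM so the
// kernel swaps other things instead. As noted above, underestimating the
// memory needed for bees plus the hash table can freeze the whole system.
static void lock_all_memory()
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		std::cerr << "mlockall: " << std::strerror(errno) << std::endl;
	}
}
```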
Follow-ups:
- On the lock-ups: it helps the kernel find a lot of deadlock/livelock/just-come-to-a-halt bugs when memory runs out. Sometimes I can SIGKILL the bees process and the system recovers, but quite often it's reboot time.
- Apparently... "when" is... "never"? I tried it at the end of `scan_one_extent`, at the end of the hash table writeback loop, and after writing each extent of the hash table. All reduced performance. Forced writeback on the hash table is crippling for other users of the system as it blocks other IO to the same filesystem (on 5.10 and 5.13). Dropping the extent from the cache after reading slows bees benchmarks down by about 10% (on 5.14).
- The memcg thrashing could be due to the use of sysctl `vm.dirty_ratio` instead of `vm.dirty_bytes` (similarly for `vm.dirty_background_bytes`), as the former is computed from the changing amount of available memory rather than a fixed RAM or memcg size.
Dirty ratio is probably not a very useful knob to turn on modern systems with RAM in the 2- or 3-digit GB range: if it starts to block, it will block for a very long time until the dirty data is flushed (e.g. with 128 GB of RAM, a ratio of 20% lets around 25 GB of dirty data pile up before writers block). So I'm using the bytes setting instead of the ratio.
Usually I'm quite happy with the performance; kernel 5.10 LTS seems to have gotten a lot of fixes around lock contention. I added some IO priority patches per bees thread (#135), but I haven't really measured the impact; it just "feels" fine on a system that otherwise does desktop and gaming workloads.
Also, I've created a systemd slice for maintenance workloads which has `{CPU,IO}Weight=25`, and that seems to work well enough. I stopped using memcg limits for it because they may spike latency.
I've also disabled bcache writeback because it caused more headache than benefit: while write throughput is better, latency seems worse under heavy write workloads, so I'm using read caching only now. Also, I've almost lost btrfs again due to hiccups in my SSD firmware: sometimes the SSD would be detected on boot, bcache would detect it, load, and start dirty writeback, and then the SSD would suddenly detach from the bus; bcache then detaches the btrfs partitions from its dirty cache (which is a really dumb thing to do here), resulting in a broken btrfs even after re-attaching the partitions to btrfs, because the dirty data has been purged from bcache. There are no SSD hiccups during normal operation, just after reboots. It may be a board firmware bug or an SSD firmware bug; I'm not sure.
Last time I tried, I could run a Gentoo system upgrade (which compiles from source on a RAM disk) while playing a game, with bees kicking in after the compiled packages were installed. I believe it was bees that caused some occasional stutter in the game, but overall it was quite a smooth experience.
Conclusion: I really don't care much whether bees takes an hour more or less to do its job, but I do care about system responsiveness while it is doing it. If your test patches involve a lot more IO overhead, they should probably not go mainline; if they just increase the run time while freeing resources at the same time, that behavior should maybe at least be made optionally available.
It seems to only increase bees IOPS and CPU usage. There's a tiny improvement in cache usage at the beginning and end of the run.
The "un-readahead" version does get hugepages much faster, and that might help some large-memory processes...or it might be the thing that is burning CPU and slowing bees down.
I'll leave it alone for now and get csum scanning up and running instead. We don't have to worry about page cache if we're not reading any pages.
Yep, I think it's not worth the effort; it's probably just an example of the 80:20 rule. But csum scanning means we can no longer dedupe compressed data with non-compressed data, right?
We can still read the compressed data and compute btrfs-compatible csums to match uncompressed data.
One conclusion from my experiments with csum scanning is that bees really needs to intelligently schedule the dedupes. The csum scanner can sift through a filesystem in extent bytenr order to find duplicates at rates approaching 100GB/s--fast enough to scan 100TB in less than an hour--but the dedupe commands resulting from that scan will take months to a year to execute in random order (just reading the data in order, with scrub, takes over a week). Nothing beats "read the csum tree in order" for scanning speed on a single core, and one core can flood us with so many duplicate candidates that we have to stop scanning for days to catch up, so there's no need to do any better (also I tried several methods to do better, and they were all at least an order of magnitude worse because of higher tree-search costs per extent).
My current design approach is to run multiple scan tiers in parallel, with threads allocated to the topmost tier on the list that still has unscanned extents.
Each scan has its own cycle state, so if we are scanning tier 3 and some new data appears in tier 1, we go scan the new tier 1 and 2 data, and then come back to where we were in tier 3.
`beescrawl.dat` wouldn't be used for this--the new scan state file (name TBD, suggestions welcome) would have 3 lines in it, for the 3 scanners' extent selection parameters, min/max transid, start time, and bytenr position. The csum scanning time is negligible compared to the dedupe time, so it doesn't matter that we are reading the entire csum tree 3 times (and we can still select csum tree pages by transid, so "the entire tree" is really "all the pages of the tree that are new since the start of the scan" anyway).
The cost of deduping an extent is dominated by the relatively constant costs of updating the subvol trees, so bigger extents dedupe faster. Extents in tier 1 dedupe at above-average speeds (large size divided by constant dedupe cost), so we do them first. Extents in tier 2 are somewhere in the middle, deduping at average speeds. Extents in tier 3 dedupe at speeds that start slow and get worse exponentially, and also cause nasty side effects like free space fragmentation and metadata tree growth (one possible future bees feature would be to notice when the tier scanner exits a block group after freeing a lot of small extents, and throw a btrfs balance at it on the way out).
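To make the scan-state idea above concrete, here is a sketch of one possible per-tier record; all field names are hypothetical, not from bees:

```cpp
#include <cstdint>
#include <ctime>

// One record per scan tier: roughly what the "3 lines" of the proposed state
// file might carry (extent selection parameters, transid range, start time,
// current position in bytenr order).
struct TierScanState {
	uint64_t min_extent_size;  // extent size selection for this tier
	uint64_t max_extent_size;
	uint64_t min_transid;      // only visit metadata newer than this
	uint64_t max_transid;
	time_t   started_at;       // when this pass over the csum tree began
	uint64_t bytenr;           // current scan position
};

static TierScanState scan_tiers[3];  // tier 1 = largest extents, tier 3 = smallest
```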
Big, high-write-volume filesystems may never complete their first bees pass at tier 3. Some may never even _enter_ tier 3, spending all of their lives catching up to new data in tiers 1 and 2. Keeping the dedupe rate high is important in these cases.
Extents in the first 2 size tiers cannot be compressed (max compressed extent is 128K, therefore anything larger is uncompressed), so we can dedupe them by csums without looking at anything else. In tier 3, if we didn't get a csum match on the encoded data (which can happen some of the time, so dedupe hit rate isn't zero on csums alone), we have to look at the extent refs for each extent to see if it is compressed, and if it is, we would read the data blocks and compute the btrfs-equivalent csums to try to match those. If it's not compressed, then the csums were sufficient and we move on to the next extent without reading data.
This structure makes some tuning knobs easy, e.g.
Maybe a Bloom filter could help here? Could it find csums that are most likely duplicates and throw away those that aren't? If I got it right, Bloom filters allow for identifying one set of items with a 100% hit/confidence rate while putting the rest into a set of "maybe they match but maybe not, do your expensive tests". But it may explode memory usage... It could eliminate some of the brute-force attempts in your planned three tiers. Or you could use it to prefer likely duplicates over brute-force duplicates. Or use Bloom filters for the cost function. Or to avoid hash collisions. I'm not sure, you're the algorithmic genius here. ;-)
The purpose of the tiers is to emit dedupe commands in decreasing order of extent size, so that when bees runs in a fixed amount of time (e.g. one hour overnight in a maintenance window) it dedupes the maximum amount of data possible during that time.
In theory, we can do this by scanning the filesystem and making a list of dedupe commands based on the csum data, then sorting the commands by extent size, then executing the dedupe commands in decreasing size order; however, the command queue for a big filesystem can require many gigabytes to store, and it starts to become out of date as soon as it is created. Also, while we are building this sorted command queue, we are not doing any dedupes, so our dedupe rate at the start of our maintenance window is zero. The longer the queue is, the more out of date the commands become, so that when we finally execute commands the data might no longer be there.
So instead, I'm changing the scan order so that it visits extents in decreasing size order (aka increasing cost-per-byte order). Dedupe commands can be executed as soon as they are created (or on a very short queue just long enough to keep dedupe worker threads busy) because they will be created in close-enough-to-optimal execution order. We don't need to be precise with the sorting--a few power-of-2-sized buckets are adequate, and we also want to maintain roughly sequential IO patterns so we prefer to minimize order changes. We want "millions of bytes" extents to go first, and "thousands of bytes" extents to go last, and for the rest we don't care very much.
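A sketch of the coarse power-of-2 bucketing described above; the exact boundaries are illustrative (only the 128K compressed-extent limit comes from btrfs):

```cpp
#include <cstdint>

// Assign an extent to a coarse size tier: big extents first, small ones last.
static int size_tier(uint64_t extent_bytes)
{
	if (extent_bytes >= 8ULL * 1024 * 1024)
		return 1;  // "millions of bytes": cheapest freed space per dedupe
	if (extent_bytes > 128ULL * 1024)
		return 2;  // medium, necessarily uncompressed extents
	return 3;          // small (possibly compressed) extents, most expensive
}
```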
It turns out that "brute force scan" (where tier 1 discards 99% of the extents it reads sequentially from the tree and tier 2 discards 50%) is faster than sampling or sorting algorithms because seeking is extremely expensive on btrfs (even on NVMe). We can scan and discard up to 500,000 extent metadata items per second on one CPU core. So tier 1 will read 100 extent records, discard 99 of them, read csums for the 100th extent and check for duplicates, and waste only 198 microseconds per extent. Tier 2 and tier 3 will read the same 100 records, discard half of them, read csums for the other half, and waste only 2 microseconds per extent. The tiers can run on different cores so the time wasted can overlap. A dedupe command takes milliseconds to run (in the worst cases it takes seconds for each dedupe, and holds locks all over the filesystem preventing any other work from being done), so the amount of time wasted by tiered scanning is far below 1%.
This moves all the high-throughput dedupe to the front, so we can get more freed space in the first hour. If there's no new data then we'll continue processing older data that costs more per byte of freed space; however, if there is new data, we can quickly scan it for cheap free space at the beginning of the next maintenance window.
It's the opposite. Bloom filters are a bitmap that tells us whether a hash may or may not be in the table (1) or is definitely not in the table (0). They are useful when:
A good Bloom use case is a malware URL query service, where a web browser checks a URL against a web service before retrieving it. Most URLs are not (known) malware, so most URLs we might check are not in the hash table. A Bloom filter can eliminate most of those on the client side. The web service publishes a Bloom filter table (which is mostly 0 bits so it compresses well) which the web browser downloads from time to time. The browser checks each URL against the Bloom filter and avoids a web service lookup request most of the time.
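For reference, a minimal generic Bloom filter sketch matching the description above; the size and hashing are illustrative and unrelated to bees:

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <functional>

// k hash probes per key: a clear bit proves "definitely not present",
// set bits only mean "possibly present" (false positives allowed).
class BloomFilter {
	static constexpr std::size_t BITS = 1u << 20;  // 1 Mibit bitmap (~128 KiB)
	static constexpr int K = 4;                    // probes per key
	std::bitset<BITS> bits_;

	static std::size_t probe(uint64_t key, int i)
	{
		// Derive K different probe positions from one 64-bit key.
		return std::hash<uint64_t>{}(key ^ (0x9e3779b97f4a7c15ULL * (i + 1))) % BITS;
	}

public:
	void insert(uint64_t key)
	{
		for (int i = 0; i < K; ++i)
			bits_.set(probe(key, i));
	}

	bool maybe_contains(uint64_t key) const
	{
		for (int i = 0; i < K; ++i)
			if (!bits_.test(probe(key, i)))
				return false;  // definitely absent
		return true;                   // possibly present
	}
};
```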
bees doesn't meet any of those criteria:
There are some relatives of the Bloom filter that can handle deletes and store additional information, but they all cost more time and memory than a Bloom filter. Bloom is already too expensive even if its cost is zero.
In theory it could, but any memory we spend on Bloom takes memory away from the hash table. It's better to make the hash table bigger, or increase the number of entries by making each individual entry smaller, or select a subsample of csums per extent. Small filesystems need only about 40 bits of hash and 48 bits of physical address (and usually the bottom 12 bits of those are zero). Big (for btrfs) filesystems need only a few bytes more. We could also optionally keep only blocks with `hash % 4 == 0` to stretch the hash table further (and drop small extent matches in the process, but if the user turns off tier 3 then those extents won't be deduped anyway). We could also do things like detect when there aren't enough hash bits, pause dedupe, resize the hash table, and resume dedupe. Lots of micro-optimizations like this are possible.
Given the raw speed of csum matching, we can also run the csum scan multiple times, with different permutations of block group order. The hash matching effectively has a sliding window over the data, so changing the scan order will put different data in the window at the same time, and eventually hit a match that would not be possible if the data was scanned in a different order. The first pass will eliminate most of the big duplicate extents (because they're large and bees needs only one matching csum), so the later scans will be mostly tier 3 data--a lot of IO effort for not very much disk space. The cost of each scan will be the same, while the number of undetected duplicate blocks will progress toward zero. At some point the user will have to decide to give up. This is a way to dedupe 100TB of disk space with a 16MB hash table--a neat trick, but maybe not that practical?
What's the difference and how would we tell?
Cost estimation kicks in after we've identified an extent to modify, when we're trying to decide which of several possible command sequences is the best one. This is after we've found extents with matching blocks, and we are trying to decide which one is better to keep, or whether the recovered space is worth the IO effort.
Ordinary statistical sampling will suffice to estimate tier boundaries. bees will have the extent metadata in memory already; it just has to keep stats on the number of extents of each size as it goes.
We can only avoid hash collisions by using a different hash function. That doesn't necessarily mean we have to abandon the btrfs csums--we can combine multiple btrfs csums into larger blocks to get better collision resistance.
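A sketch of that combination; the mixing function and group size are illustrative, only the idea of folding several adjacent block csums into one wider key comes from the comment above:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fold `group` consecutive per-block csums (e.g. btrfs crc32c values) into one
// 64-bit key, so a collision on any single 32-bit csum is no longer enough to
// cause a false match.
static uint64_t combined_key(const std::vector<uint32_t> &block_csums,
                             std::size_t start, std::size_t group)
{
	uint64_t h = 1469598103934665603ULL;      // FNV-1a offset basis
	for (std::size_t i = start; i < start + group && i < block_csums.size(); ++i) {
		h ^= block_csums[i];
		h *= 1099511628211ULL;            // FNV-1a prime
	}
	return h;
}
```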
We can also maintain a blacklist of very common collisions, e.g. for database formats that use crc32c on their own pages, so every page has the same csum and we'd see that csum thousands of times. It would end up on the "top 10 most popular csums" list, and we'd know that any time we saw that csum we'd need to switch to reading the data blocks.
Something like that already has to be done for zero-filled data blocks--they all have the same csum, but we must make sure we dedupe them with holes instead of creating billions of references to zero-filled data blocks.
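A sketch of that special case; the helper is hypothetical and just illustrates punching a hole instead of deduping an all-zero block:

```cpp
#include <fcntl.h>
#include <linux/falloc.h>   // FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE
#include <cstddef>
#include <cstdint>

// If a block is entirely zero, replace it with a hole rather than creating yet
// another shared reference to a zero-filled extent.
static bool punch_hole_if_zero(int fd, off_t offset, const uint8_t *block, std::size_t blocksize)
{
	for (std::size_t i = 0; i < blocksize; ++i)
		if (block[i] != 0)
			return false;  // not a zero block: take the normal dedupe path

	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
	                 offset, static_cast<off_t>(blocksize)) == 0;
}
```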
Thanks for the elaborate explanation; it all looks very well thought out. I think I understand your plans much better now, and it looks promising. Looking forward to testing the new implementation. :-)