Skip to content

Commit

Permalink
bees: use helper function for readahead
Browse files Browse the repository at this point in the history
There seem to be multiple ways to do readahead in Linux, and only some
of them work.  Hopefully reading the actual data is one of them.

This is an attempt to avoid page-by-page reads in the generic dedupe code.
We load both extents into the VFS cache (read sequentially) and hope they
are still there by the time we call dedupe on them.

We also call readahead(2) and hopefully that either helps or does nothing.

Signed-off-by: Zygo Blaxell <[email protected]>
  • Loading branch information
Zygo Blaxell committed Jun 12, 2021
1 parent 0afd285 commit 20b8f8a
Show file tree
Hide file tree
Showing 4 changed files with 34 additions and 5 deletions.
2 changes: 1 addition & 1 deletion src/bees-context.cc
Original file line number Diff line number Diff line change
Expand Up @@ -344,7 +344,7 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
}

// OK we need to read extent now
readahead(bfr.fd(), bfr.begin(), bfr.size());
bees_readahead(bfr.fd(), bfr.begin(), bfr.size());

map<off_t, pair<BeesHash, BeesAddress>> insert_map;
set<off_t> noinsert_set;
Expand Down
8 changes: 4 additions & 4 deletions src/bees-types.cc
Original file line number Diff line number Diff line change
Expand Up @@ -385,8 +385,8 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
BEESTRACE("e_second " << e_second);

// Preread entire extent
readahead(second.fd(), e_second.begin(), e_second.size());
readahead(first.fd(), e_second.begin() + first.begin() - second.begin(), e_second.size());
bees_readahead(second.fd(), e_second.begin(), e_second.size());
bees_readahead(first.fd(), e_second.begin() + first.begin() - second.begin(), e_second.size());

auto hash_table = ctx->hash_table();

Expand All @@ -405,7 +405,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
BEESCOUNT(pairbackward_hole);
break;
}
readahead(second.fd(), e_second.begin(), e_second.size());
bees_readahead(second.fd(), e_second.begin(), e_second.size());
#else
// This tends to repeatedly process extents that were recently processed.
// We tend to catch duplicate blocks early since we scan them forwards.
Expand Down Expand Up @@ -514,7 +514,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
BEESCOUNT(pairforward_hole);
break;
}
readahead(second.fd(), e_second.begin(), e_second.size());
bees_readahead(second.fd(), e_second.begin(), e_second.size());
}
BEESCOUNT(pairforward_try);

Expand Down
25 changes: 25 additions & 0 deletions src/bees.cc
Original file line number Diff line number Diff line change
Expand Up @@ -371,6 +371,31 @@ bees_sync(int fd)
BEESCOUNTADD(sync_ms, sync_timer.age() * 1000);
}

void
bees_readahead(int const fd, off_t offset, size_t size)
{
Timer readahead_timer;
BEESNOTE("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
BEESTOOLONG("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
// This might not do anything?
DIE_IF_NON_ZERO(readahead(fd, offset, size));

This comment has been minimized.

Copy link
@kakra

kakra Jun 22, 2021

Contributor

readahead() may ("may", because I didn't check the kernel lately) simply ignore offset and size, and just set a flag in the kernel FD to increase RA size from 64k to 128k for this file descriptor. Also, if size is below a page boundary, it will be rounded down, effectively making sub-page sizes into a no-op. The man pages about these functions seem generally wrong, a look at the kernel source reveals how these functions really work.

This comment has been minimized.

Copy link
@Zygo

Zygo Jun 22, 2021

Owner

ksys_readahead() since kernel 4.19 (and also in the previous implementation) is just a bunch of error checking followed by:

ret = vfs_fadvise(f.file, offset, count, POSIX_FADV_WILLNEED);

So readahead and POSIX_FADV_WILLNEED are identical (on btrfs--they do have different "this filesystem doesn't support readahead" error behavior elsewhere).

We probably don't want POSIX_FADV_SEQUENTIAL because it's not particularly helpful . FADV_WILLNEED tells the kernel exactly what to read anyway, so there is no need to help the kernel guess better. We could even call readahead from the crawl master thread so that the data is in the cache by the time the worker task executes.

The question that remains is whether the pread() loop is helping or not:

  • The kernel can discard WILLNEED requests if there is not enough cache space available, so the data isn't in cache and bees will do seeking reads when it comes to the final read operation.
  • readahead is asynchronous--timing data from the current code indicates bees often ends up blocked waiting for the read, because the kernel can't get the data from the storage before bees needs it.
  • The kernel can't discard a read() request due to low memory, which is why the loop seemed better--but the kernel could discard the cached pages immediately after the read() returns, if the kernel uses its own heuristics and determines the pages are not likely to be accessed again. Since bees doesn't keep the pages itself, that would be bad--two reads from backing store instead of one.
  • While looking at the kernel code, I noticed btrfs also does some weird things to readahead pages: lowering their IO priority, so they always run at idle priority, regardless of the requesting process's IO priority? That could have other side effects, like not using queues in block devices, that would significantly reduce performance with readahead compared to pread (the latter would submit read requests at bees's process IO priority instead of idle priority).

Reverse engineering the kernel code is non-trivial due to all the layers of behavior bees will interact with. We still need some controlled tests, i.e. run bees on a prepared filesystem image, measure the performance, reset to the original image, for each variation of the code:

  1. Try the existing code (readahead() and pread() loop)
  2. Try readahead() alone without the pread() loop.
  3. Try just the pread() loop, no readahead or posix_fadvise.
  4. Try an empty bees_readahead() function (i.e. a baseline performance for normalization).

Hopefully there's a clear winner, and it's not something ambiguous like "option 1 dedupes fastest but burns more CPU per block deduped," or "option 2 wins with bfq while option 3 wins with mq-deadline."

Obviously it would be better to make bees schedule its IO properly, sort the read requests in physical order and have per-drive worker threads execute them. This is just a hack to make the existing terrible code work a little better in the short term.

// Make sure this data is in page cache
// Note spelling: readahead vs read ahead
BEESNOTE("read ahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
while (size) {
static uint8_t dummy[BEES_READAHEAD_SIZE];
size_t this_read_size = min(size, sizeof(dummy));
// Ignore errors and short reads.
// It turns out our size parameter isn't all that accurate.
pread(fd, dummy, this_read_size, offset);
BEESCOUNT(readahead_count);
BEESCOUNTADD(readahead_bytes, this_read_size);
offset += this_read_size;
size -= this_read_size;
}
BEESCOUNTADD(readahead_ms, readahead_timer.age() * 1000);
}

BeesStringFile::BeesStringFile(Fd dir_fd, string name, size_t limit) :
m_dir_fd(dir_fd),
m_name(name),
Expand Down
4 changes: 4 additions & 0 deletions src/bees.h
Original file line number Diff line number Diff line change
Expand Up @@ -119,6 +119,9 @@ const bool BEES_SERIALIZE_RESOLVE = false;
// Workaround for tree mod log bugs
const bool BEES_SERIALIZE_BALANCE = false;

// Workaround for silly dedupe / ineffective readahead behavior
const size_t BEES_READAHEAD_SIZE = 1024 * 1024;

// Flags
const int FLAGS_OPEN_COMMON = O_NOFOLLOW | O_NONBLOCK | O_CLOEXEC | O_NOATIME | O_LARGEFILE | O_NOCTTY;
const int FLAGS_OPEN_DIR = FLAGS_OPEN_COMMON | O_RDONLY | O_DIRECTORY;
Expand Down Expand Up @@ -880,6 +883,7 @@ extern const char *BEES_USAGE;
extern const char *BEES_VERSION;
string pretty(double d);
void bees_sync(int fd);
void bees_readahead(int fd, off_t offset, size_t size);
string format_time(time_t t);

#endif

0 comments on commit 20b8f8a

Please sign in to comment.