Instead of doing traditional reference counting for shareable trees, we will track whether or not individual blocks are shared via the snapshot tree and the drop trees.
Upon snapshot we will insert an item to map the origin and snapshot `objectid`'s, as well as their `generation` and snapshot id. The root item of the origin will be updated with the last snapshot `generation` and snapshot id, and this will be used to determine if a block is shared.
When free'ing a shared block we will simply increment that block's drop count in the drop tree. Once the drop count equals the theoretical minimum reference count for that block we will insert a free object into the garbage collection tree to do the work of determining whether the block can be freed.
Data will work the same way it currently does: if we COW down from the owning root then we will unconditionally use the `FULL_BACKREF` logic on the leaf and add extent references for the bytenr of the leaf. When the leaf is free'd via the GC tree we will drop the extent references for that bytenr.
The `btrfs_header` will now include a `snapshot_id`, which will be taken from the owning root. Additionally, the `btrfs_file_extent_item` will also grow a `snapshot_id`.
The `snapshot_id` will be taken from the owning root's `snapshot_id` that will be added to the `btrfs_root_item`. It will start at 0 and be incremented at every snapshot of that root. When a new tree block is created, its `generation` will be the `transaction->transid` and its `snapshot_id` will be `btrfs_root_item_snapshot_id(&root->root_item)`.
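As a rough sketch, the stamping of a freshly allocated tree block might look like the following. `btrfs_set_header_snapshot_id()` and `btrfs_root_item_snapshot_id()` are assumed accessors that do not exist yet; they would be generated alongside the existing header and root item helpers:

```
/*
 * Sketch only: stamp a freshly allocated tree block with its
 * generation and snapshot_id.  btrfs_set_header_snapshot_id() and
 * btrfs_root_item_snapshot_id() are hypothetical accessors mirroring
 * the existing btrfs_set_header_generation() style helpers.
 */
static void stamp_new_tree_block(struct btrfs_trans_handle *trans,
                                 struct btrfs_root *root,
                                 struct extent_buffer *buf)
{
        btrfs_set_header_generation(buf, trans->transid);
        btrfs_set_header_snapshot_id(buf,
                        btrfs_root_item_snapshot_id(&root->root_item));
}
```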
This also will allow us to implement transaction commit-less snapshotting. Because each block will be updated with its own `snapshot_id` we can easily track modifications to a root within the same transaction, thus allowing us to decouple the work of snapshot creation from the transaction commit.
The `snapshot_id` will be placed in the `btrfs_snapshot_item`, as described below, when a snapshot is taken. The target of the snapshot will have its `btrfs_root_item_snapshot_id(&root->root_item)` increased, thus causing all new blocks that are created after the snapshot to carry a brand new `snapshot_id`.
The `snapshot_id` in the `btrfs_file_extent_item` will be used to calculate the potential owners of the file extent using the `btrfs_snapshot_item` as described below.
On snapshot we will no longer push references for metadata, as metadata will not be reference counted. Instead a new item will be inserted into the snapshot tree at snapshot creation time:
```
struct btrfs_key {
        .objectid = target_subvol_id,
        .type = BTRFS_SNAPSHOT_ITEM_KEY,
        .offset = snapshot_id,
};

struct btrfs_snapshot_item {
        /* The generation when the snapshot was taken. */
        u64 generation;

        /* The snapshot_id of the source root when the snapshot was taken. */
        u64 snapshot_id;

        /*
         * The number of metadata bytes in the source subvolume at snapshot
         * time, this will be used to determine when we can delete this item.
         */
        u64 shared_bytes;
};

struct btrfs_key {
        .objectid = snapshot_id,
        .type = BTRFS_SNAPSHOT_REF_ITEM_KEY,
        .offset = target_subvol_id,
};
```
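To make the layout concrete, here is a hedged sketch of what the insertions at snapshot creation could look like. `insert_snapshot_items()` and its parameter names are hypothetical; endianness conversion and most error handling are elided:

```
/*
 * Sketch: insert the snapshot item and its back reference at snapshot
 * creation time.  Illustrative only; endianness conversion and most
 * error handling are elided.
 */
static int insert_snapshot_items(struct btrfs_trans_handle *trans,
                                 struct btrfs_root *snapshot_tree,
                                 u64 target_subvol_id, u64 snapshot_id,
                                 u64 src_snapshot_id, u64 shared_bytes)
{
        struct btrfs_snapshot_item item = {
                .generation = trans->transid,
                .snapshot_id = src_snapshot_id,
                .shared_bytes = shared_bytes,
        };
        struct btrfs_key key = {
                .objectid = target_subvol_id,
                .type = BTRFS_SNAPSHOT_ITEM_KEY,
                .offset = snapshot_id,
        };
        int ret;

        ret = btrfs_insert_item(trans, snapshot_tree, &key, &item,
                                sizeof(item));
        if (ret)
                return ret;

        /* The back reference used to walk from a snapshot to its source. */
        key.objectid = snapshot_id;
        key.type = BTRFS_SNAPSHOT_REF_ITEM_KEY;
        key.offset = target_subvol_id;
        return btrfs_insert_item(trans, snapshot_tree, &key, NULL, 0);
}
```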
This will allow us to find who references any block based on its `btrfs_header_owner`, `generation`, and `snapshot_id`. `should_cow_block` will be changed to be essentially:

```
if (btrfs_header_generation(buf) < trans->transid)
        return 1;
if (btrfs_header_snapshot_id(buf) <= root->last_snapshot_id)
        return 1;
```
We will no longer need to rely on `BTRFS_ROOT_FORCE_COW` as any modification that occurs after the snapshot operation will require a COW. `snapshot_id` for new blocks will be set to `root->last_snapshot_id + 1`.
The following is an example snapshot graph:

```
gen 1
┌───┐   snap 1    ┌───┐
│ A │◄────────────┤ B │
└─▲─┘             └───┘
  │
  │     gen 10            gen 10
  │     snap 2   ┌───┐   snap 1   ┌───┐
  └──────────────┤ C │◄───────────┤ D │
                 └───┘            └───┘
```
If a block owned by A had a `generation` of 1 and a `snapshot_id` of 1 then we know that B, C, and D potentially have references to that block. D is questionable, because we could have COW'ed that block in C before we took the snapshot of D. In order to determine if D points at our block we must look it up and check. This will be handled by the garbage collection tree.
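A sketch of how that walk might build the set of potential referencing roots follows. The `for_each_snapshot_item()` iterator and the snapshot item accessors are hypothetical stand-ins for a `btrfs_search_slot()` loop over the snapshot tree, and the comparison predicate is inferred from the example above:

```
/*
 * Sketch: collect every root that could possibly reference a block,
 * given the block's owner, generation and snapshot_id.  The
 * for_each_snapshot_item() iterator and the btrfs_snapshot_*()
 * accessors are hypothetical.
 */
static void collect_potential_refs(u64 owner, u64 gen, u64 snap_id,
                                   struct ulist *roots)
{
        struct btrfs_snapshot_item *si;
        u64 snap_subvol;

        for_each_snapshot_item(owner, si, snap_subvol) {
                /* The snapshot predates the block, it cannot see it. */
                if (btrfs_snapshot_generation(si) < gen ||
                    btrfs_snapshot_snapshot_id(si) < snap_id)
                        continue;
                ulist_add(roots, snap_subvol, 0, GFP_NOFS);
                /*
                 * Chained snapshots (D in the graph above) may or may
                 * not still point at the block, but they are candidates.
                 */
                collect_potential_refs(snap_subvol, gen, snap_id, roots);
        }
}
```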
The `BTRFS_SNAPSHOT_REF_ITEM_KEY` will simply allow us to find the original subvolume a snapshot was taken of; this will be used for snapshot deletion.
The drop tree is what will be used to determine if a block can potentially be freed. At `update_ref_for_cow` time we will determine if the block is shared. If the block is shared we will modify the drop count for the block. Drop items will look like this:
```
struct btrfs_key {
        .objectid = bytenr,
        .type = BTRFS_DROP_ITEM_KEY,
        .offset = owner,
};

struct btrfs_drop_item {
        /* The number of trees that no longer reference this block. */
        u64 drop_count;

        /* The first key in this node/leaf. */
        struct btrfs_disk_key first_key;
};
```
At every `update_ref_for_cow` on a shared block we will increase the `drop_count` for the block we are COW'ing.

When we update the `drop_count` we will calculate the minimum reference count for the block, and if our `drop_count` is equal to the minimum reference count of the block then we insert an item into our garbage collection root to process the free'ing of the block. The garbage collection behavior is documented in the garbage collection trees document.
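Putting those pieces together, the accounting at COW time might look roughly like this. `drop_tree_inc_count()`, `min_reference_count()` and `gc_tree_insert_free()` are hypothetical helpers standing in for searches and insertions against the drop and garbage collection trees:

```
/*
 * Sketch of the drop accounting done from update_ref_for_cow() when
 * the block being COW'ed is shared.  All three helpers called below
 * are hypothetical.
 */
static int account_shared_cow(struct btrfs_trans_handle *trans,
                              struct extent_buffer *buf)
{
        u64 bytenr = buf->start;
        u64 owner = btrfs_header_owner(buf);
        u64 count, min_refs;
        int ret;

        /* Bump the drop_count in the btrfs_drop_item for this block. */
        ret = drop_tree_inc_count(trans, bytenr, owner, &count);
        if (ret)
                return ret;

        min_refs = min_reference_count(owner,
                                       btrfs_header_generation(buf),
                                       btrfs_header_snapshot_id(buf));
        /* Every guaranteed referencer has dropped it, hand it to GC. */
        if (count == min_refs)
                ret = gc_tree_insert_free(trans, bytenr, owner);
        return ret;
}
```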
As indicated above, we use the minimum reference count for our block based on the dependency graph from the snapshot tree. Take the following tree as our example file system:

```
   ┌────┐
   │ B  ├──────────────┐
   ├────┤              │
   │    │ ┌───┐        │       ┌───┐
   │    │ │ A │        │       │ C │
  ┌┼────┼─┼───┴──┐     ├┬─────┬┴───┴─┐
  ││    │ │      │     ││     │      │
  ││    │ │      │     ││     │      │
┌─▼▼┐  ┌▼─▼┐  ┌──▼┐  ┌─▼▼┐  ┌─▼─┐  ┌─▼─┐
│ 1 │  │ 2 │  │ 3 │  │ 4 │  │ 5 │  │ 6 │
└───┘  └───┘  └───┘  └───┘  └───┘  └───┘
```

- A - original subvolume
- B - snapshot of A at generation 10, snapshot 1
- C - snapshot of B at generation 20, snapshot 1
For this example we'll assume block 2 has a generation and snapshot id of 1. We use the directed graph to know that our maximum reference count is 3, because the generation is < 10 and < 20. However, for this case, the minimum reference count is 2. This is because C is a snapshot of B, and so we can't know for certain if 2 was ever referenced by C without checking.
Block 3 is an example of this case. Let's assume that block 3 also has a generation and snapshot id of 1. This gives it the same theoretical maximum reference count of 3. However at generation 20 we COW'ed that block from B, and now it has a `drop_count` of 1. Let's say that 4 is the COW of 3, and has a generation of 20 and a snapshot id of 1. At this point the snapshot C is taken. By our directed graph we know that C could possibly reference 3, but we know it doesn't in fact reference it. This is what we use as the minimum reference count.
First degree snapshots, that is direct snapshots of a subvolume, have guaranteed reference counts based on their generation+snapshot id pair. However chained snapshots, that is snapshots of snapshots, may not actually refer to blocks in the original subvolume.
This is the reason we require the garbage collection roots, as determining who still refers to the block in this case is complicated.
The normal case, where minimum reference count == maximum reference count, means we could bypass the garbage collection tree if we desired, because we know that once our `drop_count` == our reference count the block can be freed. The GC tree is only strictly necessary for the case where minimum reference count != maximum reference count.
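One way to express the min/max calculation, under the same hypothetical snapshot tree iterator as the earlier sketch: direct snapshots contribute to both counts, while chained snapshots only widen the maximum. `count_chained_snapshots()` is also hypothetical and would recurse through the snapshot items of each direct snapshot:

```
/*
 * Sketch: compute the minimum and maximum reference counts for a
 * block from the snapshot tree.  Direct snapshots of the owner are
 * guaranteed references; chained snapshots are only potential ones.
 * for_each_snapshot_item() and count_chained_snapshots() are
 * hypothetical.
 */
static void reference_count_range(u64 owner, u64 gen, u64 snap_id,
                                  u64 *min_refs, u64 *max_refs)
{
        struct btrfs_snapshot_item *si;
        u64 snap_subvol;

        *min_refs = 1;  /* The owning root itself. */
        *max_refs = 1;

        for_each_snapshot_item(owner, si, snap_subvol) {
                if (btrfs_snapshot_generation(si) < gen ||
                    btrfs_snapshot_snapshot_id(si) < snap_id)
                        continue;
                /* A first degree snapshot must have seen the block. */
                (*min_refs)++;
                (*max_refs)++;
                /* Snapshots of the snapshot may have COW'ed away. */
                *max_refs += count_chained_snapshots(snap_subvol, gen);
        }
}
```

Run against the example above, block 2 would get a minimum of 2 (A and B) and a maximum of 3 (C might also reference it).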
Data is more complicated, as we can have things like reflink and bookend extents create a special kind of chaos that is hard to reason out cleanly with the snapshot tree and drop trees.
Instead we will keep the rules for data relatively the same in practice, with just a few slight tweaks.
- `update_ref_for_cow` will continue to call `btrfs_inc_ref` on shared leaf nodes. This is to make sure that new roots get their appropriate references propagated.
- `update_ref_for_cow` will call `btrfs_inc_ref` with `full_backref` set unconditionally when we COW from the owner root, as sketched below.
- The `FULL_BACKREF` reference will be dropped at garbage collection time for that particular metadata block.
This will keep the data reference counting relatively unchanged from the extent tree v1 setup.
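A sketch of the leaf handling described in the list above. `btrfs_inc_ref()` and its `full_backref` argument are the existing helper; the surrounding function is illustrative only:

```
/*
 * Sketch of the data reference handling when COW'ing a shared leaf.
 * btrfs_inc_ref() is the existing helper; everything else here is
 * illustrative.
 */
static int push_leaf_data_refs(struct btrfs_trans_handle *trans,
                               struct btrfs_root *root,
                               struct extent_buffer *buf)
{
        /* Nodes are no longer reference counted at all. */
        if (btrfs_header_level(buf) != 0)
                return 0;

        /* COW'ing down from the owner: take FULL_BACKREF references. */
        if (btrfs_header_owner(buf) == root->root_key.objectid)
                return btrfs_inc_ref(trans, root, buf, 1);

        /* A new root sharing the leaf: take normal data references. */
        return btrfs_inc_ref(trans, root, buf, 0);
}
```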
We will be adding a `snapshot_id` to the `btrfs_file_extent_item` in order to facilitate the mechanics of finding referencing roots, but it won't be used for anything beyond that.
This is a con of the snapshot tree and drop trees design: we will have to walk the entire tree to clean up all blocks we reference. Unshared metadata blocks will simply be freed; shared metadata blocks will have their drop count updated. Data extents will have their references dropped if we have a root reference for them, otherwise they'll remain untouched.
- `update_ref_for_cow` - if the block is shared, update the drop count for the block. We no longer call `btrfs_inc_ref` for nodes, only for leaves. We also no longer do the `FULL_BACKREF` conversion for nodes, we only add a `FULL_BACKREF` reference for any file extents in leaves.
- `update_ref_for_cow` will also update the `shared_bytes` counter on the snapshot item if we are not the owner; this will allow us to garbage collect the snapshot items themselves (see the sketch after this list).
- Updating the `drop_count` on a block will calculate the min and max reference counts at `drop_count` update time, and if our `drop_count` matches the minimum reference count then we can insert a garbage collection item.
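One plausible reading of the `shared_bytes` update in the second item above is that the counter counts down as a non-owner COWs away from the source's blocks, so the snapshot item can be removed once nothing is shared anymore. A sketch under that assumption, with a hypothetical helper:

```
/*
 * Sketch: shared_bytes maintenance from update_ref_for_cow().  The
 * assumption here (not spelled out in the design) is that
 * shared_bytes counts down as the sharing root COWs away from the
 * source's blocks; once it hits zero the snapshot item itself can be
 * garbage collected.  snapshot_item_sub_shared_bytes() is
 * hypothetical.
 */
static int update_shared_bytes(struct btrfs_trans_handle *trans,
                               struct btrfs_root *root,
                               struct extent_buffer *buf)
{
        /* The owner's own COWs do not change what a snapshot shares. */
        if (btrfs_header_owner(buf) == root->root_key.objectid)
                return 0;

        return snapshot_item_sub_shared_bytes(trans,
                                              btrfs_header_owner(buf),
                                              buf->len);
}
```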