
downstairs: switch to unix-excl VFS #773

Draft · wants to merge 7 commits into main

Conversation

@faithanalog (Contributor) commented Jun 2, 2023

See #771 for the details on this. This is the implementation.

This is the code that actually does the change:

    // Open the per-extent metadata database using SQLite's "unix-excl" VFS,
    // which holds an exclusive lock on the database and keeps the WAL index
    // in heap memory instead of a -shm file.
    let metadb: Connection = Connection::open_with_flags_and_vfs(
        path,
        OpenFlags::default(),
        "unix-excl",
    )?;

The rest of this PR strips references to the shm file out of the codebase, since that file doesn't exist when using unix-excl, and updates a few test scripts to shut down running downstairs before running dumps.

@jmpesp (Contributor) left a comment

Looks good to me, but unfortunately this doesn't work: any snapshot of a disk / volume we take means the ZFS snapshot is captured while the database is open, and trying to launch a read-only downstairs from that snapshot fails:

[ Jun  2 09:31:47 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/downstairs.sh"). ]
{"msg":"Opened existing region file \"/data/regions/eeb75f44-c435-4cff-8c93-3d6d4dc5ff58/.zfs/snapshot/580a7b89-f74d-4d2d-87f9-78f83b58e946/region.json\"","v":0,"name":"crucible","level":30,"time":"2023-06-02T09:31:47.376914861-07:00","hostname":"oxz_crucible_oxp_ed9a1cbd-55d2-ce10-eed0-f6f4aed4bcf7","pid":15654}
{"msg":"Error: Open of db file \"/data/regions/eeb75f44-c435-4cff-8c93-3d6d4dc5ff58/.zfs/snapshot/580a7b89-f74d-4d2d-87f9-78f83b58e946/00/000/000.db\" for extent#0 returned: unable to open database file","v":0,"name":"crucible","level":50,"time":"2023-06-02T09:31:47.377533137-07:00","hostname":"oxz_crucible_oxp_ed9a1cbd-55d2-ce10-eed0-f6f4aed4bcf7","pid":15654}
Error: Open of db file "/data/regions/eeb75f44-c435-4cff-8c93-3d6d4dc5ff58/.zfs/snapshot/580a7b89-f74d-4d2d-87f9-78f83b58e946/00/000/000.db" for extent#0 returned: unable to open database file
[ Jun  2 09:31:47 Stopping because service exited with an error. ]
[ Jun  2 09:31:47 Failing too quickly, throttling. ]
[ Jun  2 09:31:48 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/downstairs.sh"). ]

@@ -2943,6 +2914,10 @@ mod test {

#[test]
fn validate_repair_files_quad_duplicate() {
// XXX I don't know why this test exists. we definitely shouldn't be
// involving shm in anything anymore. but what was this supposed to even
// be doing? idk.
Contributor:

I think it's a test for the case where there are 4 files (previously expected to be 2 or 4), but one is a duplicate of another, instead of 4 unique files.

Contributor:

Yeah, exactly, it was checking to be sure that the count was 4 and that we had 4 unique file names.

wait "$ds2_pid" || true
fi

# Stop everything before dump so dump can access the extent metadata
Contributor:

The previous comment's mention of not allowing IOs to complete is good context here; that should be kept.

Contributor Author:

I deleted it because I'm pretty sure it's actually wrong and that the downstairs isn't actually processing any IO at the time. Willing to be told that's not the case.

Contributor:

For a "--lossy" downstairs, the intent was that that downstairs was doing IO slower
than the other three downstairs. Slow enough that indeed it had not completed all
the IOs that the other two downstairs had completed. This functionality is important
for the test as we are testing a repair here, so to perform a repair, there must be some
difference between the three downstairs :)

Contributor Author:

I thought the point of lossy was that it would straight up drop the IOs rather than just do them slower?

Contributor:

Perhaps poorly named, yeah. No, we don't drop any IOs, but the downstairs will "skip over" some as it walks the work list. It will also add a tiny delay into completing them. So, eventually the IOs will finish, but it will take longer than the other downstairs.
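
A minimal sketch of the behaviour described above, with hypothetical names (`Job`, `complete`, the 25% skip rate); this is an illustration, not the actual crucible code:

    use std::{thread, time::Duration};

    use rand::Rng;

    /// Stand-in for a downstairs work item; purely illustrative.
    struct Job;

    fn complete(_job: Job) {
        // ... perform the IO and ack it ...
    }

    /// One walk of the work list for a "--lossy" downstairs: sometimes leave
    /// a job in place for a later pass, and add a tiny delay before
    /// completing the rest, so this downstairs lags the others but
    /// eventually finishes everything.
    fn lossy_walk(work_list: &mut Vec<Job>, rng: &mut impl Rng) {
        let mut i = 0;
        while i < work_list.len() {
            if rng.gen_bool(0.25) {
                // "Skip over" this job for now; it stays on the list.
                i += 1;
                continue;
            }
            thread::sleep(Duration::from_millis(5));
            complete(work_list.remove(i));
        }
    }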

@faithanalog (Contributor Author)

any snapshot of a disk / volume we take means the ZFS snapshot is captured while the database is open, and trying to launch a read-only downstairs from that snapshot fails:

ah bullshit, ok. I'll ponder/poke at that

@leftwo added this to the FCS milestone Jun 6, 2023
@leftwo added the FCS and mvp labels Jun 6, 2023
@faithanalog (Contributor Author) commented Jun 7, 2023

The change just pushed does work for reading from RO filesystems, but I have some misgivings (see the comments). I really don't like that it would allow you to run crucible-downstairs dump without error against a currently-open region. Feedback requested, plus suggestions for how we can implement extent- or region-level locking correctly.

Personally I'm thinking of trying to take an fcntl lock on the extent block data file and seeing how that goes.
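
For context, one way to do a read-only open against a read-only filesystem with rusqlite, assuming SQLite's `immutable=1` URI parameter; this is a sketch only, not necessarily what the pushed change does:

    use rusqlite::{Connection, OpenFlags};

    /// Open an extent's metadata database read-only, telling SQLite the file
    /// is immutable so no locking or journal recovery is attempted. Only
    /// safe when the file really cannot change, e.g. inside a ZFS snapshot.
    fn open_metadb_immutable(path: &str) -> rusqlite::Result<Connection> {
        Connection::open_with_flags(
            format!("file:{path}?immutable=1"),
            OpenFlags::SQLITE_OPEN_READ_ONLY | OpenFlags::SQLITE_OPEN_URI,
        )
    }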

@faithanalog marked this pull request as draft June 7, 2023
@jclulow (Collaborator) commented Jun 7, 2023

I would use flock, not fcntl. It's difficult to guarantee that fcntl locks are being used correctly in library routines and in the face of multiple threads.
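
A minimal sketch of the flock approach using the nix crate; the function name, paths, and error handling are illustrative, not crucible code:

    use std::fs::File;
    use std::os::unix::io::AsRawFd;

    use anyhow::{bail, Context, Result};
    use nix::fcntl::{flock, FlockArg};

    /// Take an exclusive, non-blocking flock on an extent's data file so that
    /// a second process (e.g. `crucible-downstairs dump`) fails fast instead
    /// of reading a region that a running downstairs has open.
    fn open_and_lock_extent(path: &str) -> Result<File> {
        let file = File::options()
            .read(true)
            .write(true)
            .open(path)
            .with_context(|| format!("open {path}"))?;

        match flock(file.as_raw_fd(), FlockArg::LockExclusiveNonblock) {
            Ok(()) => Ok(file),
            Err(e) => bail!("{path} is locked by another process: {e}"),
        }
    }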

@jclulow (Collaborator) commented Jun 7, 2023

I'm also not sure if you'd be able to lock files in a ZFS snapshot, so it may first be worth investigating a way to tell if what you're looking at is a snapshot or a live file system, and only doing the immutable open in the snapshot case?

@faithanalog (Contributor Author)

I'm also not sure if you'd be able to lock files in a ZFS snapshot, so it may first be worth investigating a way to tell if what you're looking at is a snapshot or a live file system, and only doing the immutable open in the snapshot case?

Yeah, I'd prefer that, I think. I could scan all the snapshots' mount points and check if the path is a descendant of any of those. Does that seem reasonable, or is there something better I ought to do?
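
A minimal sketch of that descendant check, with illustrative names; enumerating the snapshot mount points is assumed to happen elsewhere:

    use std::path::Path;

    /// Given the mount points of all ZFS snapshot automounts (enumerated
    /// elsewhere), a region path is read-only if it is a descendant of any
    /// of them.
    fn path_is_under_snapshot(path: &Path, snapshot_mounts: &[&Path]) -> bool {
        snapshot_mounts.iter().any(|m| path.starts_with(m))
    }

    /// A cruder shortcut that works for paths like the ones in the log above:
    /// look for a literal `.zfs/snapshot` component pair in the path itself.
    fn path_names_snapshot(path: &Path) -> bool {
        let comps: Vec<_> = path.components().map(|c| c.as_os_str()).collect();
        comps.windows(2).any(|w| w[0] == ".zfs" && w[1] == "snapshot")
    }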

@jclulow (Collaborator) commented Jun 7, 2023

I'm also not sure if you'd be able to lock files in a ZFS snapshot, so it may first be worth investigating a way to tell if what you're looking at is a snapshot or a live file system, and only doing the immutable open in the snapshot case?

Yeah, I'd prefer that, I think. I could scan all the snapshots' mount points and check if the path is a descendant of any of those. Does that seem reasonable, or is there something better I ought to do?

I have written a demonstration program of one way that I think is probably acceptable for determining whether a particular path is part of a live file system or a snapshot, and which one:

https://github.com/jclulow/junk-zfswtf/blob/main/src/main.rs

$ target/release/zfswtf \
        /rpool/images/jclulow/work/gimlet/ramdisk/.zfs/snapshot/os/etc/motd \
        /rpool/images/jclulow/work/gimlet/ramdisk/etc/motd \
        / \
        /.zfs/snapshot/2021-05-03-23\:40\:12/

looking at path: "/rpool/images/jclulow/work/gimlet/ramdisk/.zfs/snapshot/os/etc/motd"
f_basetype = "zfs"
f_fsid = 0x40d00d1
st_dev = 0x103000100d8
st_dev major() -> 259 (0x103)
st_dev minor() -> 65752 (0x100d8)
snapshot of:      rpool/images/jclulow/work/gimlet/ramdisk

looking at path: "/rpool/images/jclulow/work/gimlet/ramdisk/etc/motd"
f_basetype = "zfs"
f_fsid = 0x40d00d1
st_dev = 0x103000100d1
st_dev major() -> 259 (0x103)
st_dev minor() -> 65745 (0x100d1)
live file system: rpool/images/jclulow/work/gimlet/ramdisk

looking at path: "/"
f_basetype = "zfs"
f_fsid = 0x40d0002
st_dev = 0x10300010002
st_dev major() -> 259 (0x103)
st_dev minor() -> 65538 (0x10002)
live file system: rpool/ROOT/helios-1.0.20582-12

looking at path: "/.zfs/snapshot/2021-05-03-23:40:12/"
f_basetype = "zfs"
f_fsid = 0x40d0002
st_dev = 0x103000100da
st_dev major() -> 259 (0x103)
st_dev minor() -> 65754 (0x100da)
snapshot of:      rpool/ROOT/helios-1.0.20582-12

I have tried to make the comments in the code cover how this works. Let me know what you think, or if any of it doesn't make sense, etc.
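
A minimal sketch of the key observation in the output above (a path inside a `.zfs/snapshot/<name>` automount reports a different st_dev than the corresponding live path), using only std; this is not the zfswtf program itself, which also maps the device back to a dataset name:

    use std::io;
    use std::os::unix::fs::MetadataExt;
    use std::path::Path;

    /// Compare the device numbers of two paths; a snapshot automount and the
    /// live dataset it came from sit on different mounts and so differ.
    fn on_different_mounts(a: &Path, b: &Path) -> io::Result<bool> {
        Ok(std::fs::metadata(a)?.dev() != std::fs::metadata(b)?.dev())
    }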

@faithanalog (Contributor Author)

Read through that today and it all makes sense to me. I'll look at implementing it in crucible.

@leftwo removed the FCS label Jul 7, 2023
@leftwo modified the milestones: FCS, MVP Jul 7, 2023
Labels: mvp (Required for MVP)