-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
downstairs: switch to unix-excl VFS #773
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, but unfortunately this doesn't work: any snapshot of a disk / volume we take means the ZFS snapshot will be taken when the database is open, and trying to launch a read-only downstairs from this snapshot doesn't work:
[ Jun 2 09:31:47 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/downstairs.sh"). ]
{"msg":"Opened existing region file \"/data/regions/eeb75f44-c435-4cff-8c93-3d6d4dc5ff58/.zfs/snapshot/580a7b89-f74d-4d2d-87f9-78f83b58e946/region.json\"","v":0,"name":"crucible","level":30,"time":"2023-06-02T09:31:47.376914861-07:00","hostname":"oxz_crucible_oxp_ed9a1cbd-55d2-ce10-eed0-f6f4aed4bcf7","pid":15654}
{"msg":"Error: Open of db file \"/data/regions/eeb75f44-c435-4cff-8c93-3d6d4dc5ff58/.zfs/snapshot/580a7b89-f74d-4d2d-87f9-78f83b58e946/00/000/000.db\" for extent#0 returned: unable to open database file","v":0,"name":"crucible","level":50,"time":"2023-06-02T09:31:47.377533137-07:00","hostname":"oxz_crucible_oxp_ed9a1cbd-55d2-ce10-eed0-f6f4aed4bcf7","pid":15654}
Error: Open of db file "/data/regions/eeb75f44-c435-4cff-8c93-3d6d4dc5ff58/.zfs/snapshot/580a7b89-f74d-4d2d-87f9-78f83b58e946/00/000/000.db" for extent#0 returned: unable to open database file
[ Jun 2 09:31:47 Stopping because service exited with an error. ]
[ Jun 2 09:31:47 Failing too quickly, throttling. ]
[ Jun 2 09:31:48 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/downstairs.sh"). ]
@@ -2943,6 +2914,10 @@ mod test { | |||
|
|||
#[test] | |||
fn validate_repair_files_quad_duplicate() { | |||
// XXX I don't know why this test exists. we definitely shouldnt be | |||
// involving shm in anything anymore. but what was this supposed to even | |||
// be doing? idk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a test for the case where there are 4 files (previously expected to be 2 or 4), but one is a duplicate of the other, instead of 4 unique files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, exactly, it was checking to be sure that the count was 4 and that we had 4 unique file names.
wait "$ds2_pid" || true | ||
fi | ||
|
||
# Stop everything before dump so dump can access the extent metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous comment's mention of not allowing IOs to complete is good context here, that should be kept.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i deleted it because im pretty sure it's actually wrong and that downstairs isn't actually processing any IO at the time. Willing to be told that's not the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a "--lossy" downstairs, the intent was that that downstairs was doing IO slower
than the other three downstairs. Slow enough that indeed it had not completed all
the IOs that the other two downstairs had completed. This functionality is important
for the test as we are testing a repair here, so to perform a repair, there must be some
difference between the three downstairs :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i thought the point of lossy was that it would straight up drop the IOs rather than just do them slower?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps poorly named, yeah.
No, we don't drop any IOs, but the downstairs will "skip over" some as it walks the work list
It will also add a tiny delay into completing them. So, eventually the IOs will finish, but it
will take longer than the other downstairs.
ah bullshit, ok. I'll ponder/poke at that |
the change just pushed does work for reading from RO filesystems. but i have some misgivings (see the comments). I really don't like that it would allow you to run personally im thinking of giving a try on pulling a fnctl lock on the extent block data file and see how that goes. |
I would use flock, not fcntl. It's difficult to guarantee that fcntl locks are being used correctly in library routines and in the face of multiple threads. |
I'm also not sure if you'd be able to lock files in a ZFS snapshot, so it may first be worth investigating a way to tell if what you're looking at is a snapshot or a live file system, and only doing the immutable open in the snapshot case? |
yeah i'd prefer that I think. I could scan all the snapshots' mount points and check if its a descendant of any of those. Does that seem reasonable or is there something better I ought to do? |
I have written a demonstration program of one way that I think is probably acceptable for determining whether a particular path is part of a live file system or a snapshot, and which one: https://github.com/jclulow/junk-zfswtf/blob/main/src/main.rs
I have tried to make the comments in the code cover how this works. Let me know what you think, or if any of it doesn't make sense, etc. |
read through that today and it all makes sense to me. ill look at implementing it into crucible |
see #771 for the details on this. This is the implementation.
This is the code that actually does the change:
The rest of this PR is stripping out the
shm
file from the rest of the codebase since it doesn't exist when using unix-excl, and updating a few test scripts to shut down running downstairs before running dumps.