-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TRIM support - replaces #5925 and #7363 #8255
Conversation
Codecov Report
@@ Coverage Diff @@
## master #8255 +/- ##
==========================================
- Coverage 78.33% 78.29% -0.05%
==========================================
Files 380 382 +2
Lines 115719 116678 +959
==========================================
+ Hits 90653 91350 +697
- Misses 25066 25328 +262
Continue to review full report at Codecov.
|
This has been refreshed with fixes to issues discovered during testing. There are still some deadlock issues which occur mainly when the |
@@ -6649,6 +6700,53 @@ zpool_do_resilver(int argc, char **argv) | |||
return (for_each_pool(argc, argv, B_TRUE, NULL, scrub_callback, &cb)); | |||
} | |||
|
|||
/* | |||
* zpool trim [-s|-r <rate>] <pool> ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: missing -p
flag
print_trim_status(trim_prog, (vs->vs_space - | ||
vs->vs_alloc) + zpool_slog_space(nvroot), | ||
trim_rate, trim_start_time, trim_stop_time); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's model getting the trim stats off the way this is handled for checkpoint, scan, and removal stats. Specifically, with a new ZPOOL_CONFIG_TRIM_STATS
key containing a uint64 array. For consistency let's try and get all this code to work in basically the same way.
@@ -539,6 +539,40 @@ bio_is_fua(struct bio *bio) | |||
#endif | |||
} | |||
|
|||
/* | |||
* bio_set_discard - Set the appropriate flags in a bio to indicate | |||
* that the specific random of sectors should be discarded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/random/range
@@ -383,6 +383,8 @@ typedef struct dmu_buf { | |||
#define DMU_POOL_OBSOLETE_BPOBJ "com.delphix:obsolete_bpobj" | |||
#define DMU_POOL_CONDENSING_INDIRECT "com.delphix:condensing_indirect" | |||
#define DMU_POOL_ZPOOL_CHECKPOINT "com.delphix:zpool_checkpoint" | |||
#define DMU_POOL_TRIM_START_TIME "trim_start_time" | |||
#define DMU_POOL_TRIM_STOP_TIME "trim_stop_time" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This need to converted this to a single entry containing an array of uint64's for TRIM. Just like DMU_POOL_SCAN
or DMU_POOL_REMOVING
. Something like this (I may have missed some needed entries).
#define DMU_POOL_TRIM "org.open-zfs:trim"
...
/*
* All members must be uint64_t, for byteswap purposes.
*/
typedef struct spa_trim_phys {
uint64_t tr_state; /* dsl_scan_state_t */
uint64_t tr_start_time;
uint64_t tr_end_time;
uint64_t tr_rate;
uint64_t tr_to_trim; /* bytes that need to be trimmed */
uint64_t tr_trimmed; /* bytes that have been trimmed */
} spa_trim_phys_t;
#define ZPOOL_CONFIG_TRIM_PROG "trim_prog" | ||
#define ZPOOL_CONFIG_TRIM_RATE "trim_rate" | ||
#define ZPOOL_CONFIG_TRIM_START_TIME "trim_start_time" | ||
#define ZPOOL_CONFIG_TRIM_STOP_TIME "trim_stop_time" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can be dropped since we want to move this all in to spa_trim_phys_t
.
kcondvar_t spa_man_trim_done_cv; /* manual trim has completed */ | ||
/* For details on trim start/stop times see spa_get_trim_prog. */ | ||
uint64_t spa_man_trim_start_time; | ||
uint64_t spa_man_trim_stop_time; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add a spa_trim_phys_t
here instead and use those values. Like spa_removing_phys
above. We might also want to consider moving the manual and auto trim state in to their own structures as well.
@@ -118,6 +118,8 @@ extern "C" { | |||
#define ESC_ZFS_BOOTFS_VDEV_ATTACH "bootfs_vdev_attach" | |||
#define ESC_ZFS_POOL_REGUID "pool_reguid" | |||
#define ESC_ZFS_HISTORY_EVENT "history_event" | |||
#define ESC_ZFS_TRIM_START "trim_start" | |||
#define ESC_ZFS_TRIM_FINISH "trim_finish" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but it reminds be that we forget to add these for removal and initialize.
kmutex_t vdev_trim_zios_lock; | ||
kcondvar_t vdev_trim_zios_cv; | ||
uint64_t vdev_trim_zios; /* # of in-flight async trim zios */ | ||
boolean_t vdev_trim_zios_stop; /* see zio_trim_should_bypass */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to move these up under the vdev_initialize_*
members, and even slightly renaming the new fields for consistency would be help but isn't critical.
* a couple unused pointers is too low to justify that risk. | ||
*/ | ||
void (*dfl_ck_func)(uint64_t, uint64_t, void *); | ||
void *dfl_ck_arg; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constraint really doesn't apply to us, nothing on Linux will call this, We should either drop this or wrap it and the assigned zio_trim_check
functions with __sun__
to make it clear this is dead code.
zc.zc_cookie = (uintptr_t)&tci; | ||
|
||
if (zfs_ioctl(hdl, ZFS_IOC_POOL_TRIM, &zc) == 0) | ||
return (0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be converted to a new in/out nvlist style ioctl and use libzfs_core.c
and lzc_ioctl
.
metaslab_t *ms = msp; | ||
boolean_t held = MUTEX_HELD(&ms->ms_lock); | ||
if (!held) | ||
mutex_enter(&ms->ms_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this routine is a no-op on nondebug bits, can we also make it not grab/drop the mutex on nondebug?
* ========================================================================== | ||
*/ | ||
static uint64_t | ||
metaslab_ff_alloc(metaslab_t *msp, uint64_t size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are the changes to metaslab_*_alloc() related to TRIM? Could you break them out into a separate PR? (FYI, I would support this change.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're referring to the removal of metaslab_ff_alloc()
, this appears to have been done as part of porting your openzfs/openzfs@8629fdc patch from the OpenZFS PR. The removal of this function does certainly seem like something that ought to be in its own standalone PR.
} | ||
} | ||
/* | ||
* If there's a trim ongoing, punch out the holes that will | ||
* be filled back in in metaslab_trim_done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about be filled in by metaslab_trim_done
?
*/ | ||
if (msp->ms_trimming_ts != NULL) { | ||
range_tree_walk(msp->ms_trimming_ts, range_tree_remove, | ||
msp->ms_allocatable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we aren't able to allocate space that is free but in the process of being trimmed. Do we update ms_weight and ms_max_size to reflect that (and be consistent with ms_allocatable)? If not, I think metaslab_group_alloc_normal() will get confused.
Rather than deal with the implications of ms_trimming_ts everywhere, could we instead prevent allocations from metaslabs that we are in the middle of trimming? Like ms_initializing
does? (see vdev_initialize_ms_mark()
)
@@ -3265,8 +3287,7 @@ metaslab_group_alloc_normal(metaslab_group_t *mg, zio_alloc_list_t *zal, | |||
* we may end up in an infinite loop retrying the same | |||
* metaslab. | |||
*/ | |||
ASSERT(!metaslab_should_allocate(msp, asize)); | |||
|
|||
/* ASSERT(!metaslab_should_allocate(msp, asize)); XXX */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this is where we will get confused because some space is not allocatable, but that is not reflected in all the metadata about the metaslab.
@@ -3937,6 +3969,7 @@ metaslab_claim_concrete(vdev_t *vd, uint64_t offset, uint64_t size, | |||
VERIFY3U(range_tree_space(msp->ms_allocatable) - size, <=, | |||
msp->ms_size); | |||
range_tree_remove(msp->ms_allocatable, offset, size); | |||
metaslab_trim_remove(msp, offset, size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think that the trimsets should all be empty at this point. Can we assert that, instead of trying to remove this?
* 1) Keeping the trim commands reasonably small. | ||
* 2) Keeping the ability to rollback back for as many txgs as possible. | ||
* 3) Waiting around too long that the user starts to get uneasy about not | ||
* seeing any space being freed after they remove some files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that space which has been freed but not yet trimmed is in fact available for allocation (and thus "free"). That's good. However, this comment implies that this space is not counted as free in the user-visible accounting somewhere. That seems wrong to me - if the space is available for allocation then it is by definition free and should be displayed as such.
d70be42
to
9efe6c0
Compare
Status update: This has just been refreshed with some fixes to the various deadlocks observed while running ztest (which have been left as separate commits). Reviewers should look at 1eda737 and the related follow-on commits in particular. With these changes in place, zloop can now generally survive an overnight run. I've not yet done any actual testing to make sure this plays well (i.e. actually works as opposed to simply not panicking) with allocation classes or non-trivial cases of device removal (other than those which are performed via ztest, which are plenty). We (@behlendorf and myself) will begin addressing the review feedback shortly. Also, some of the new ZTS tests need fixing. [EDIT]: Note that there were actually 3 (force) pushes in rapid succession because there was an experimental fix attempt left in by accident and also to rebase on current master with recent changes to metaslab.c. Also, I'll note that as new fixes are added, they're commited to the dweeezil:ntrim3-next branch. |
Update dkio.h from Nexenta's version to pick up DKIOCFREE and add their dkioc_free_util.h header for TRIM support. Requires-builders: none
Ported by: Tim Chase <[email protected]> Porting notes: The trim kstats are in zfs/<pool> along with the other per-pool stats. The kstats can be cleared by writing to the kstat file. Null format parameters to strftime() were replaced with "%c". Added vdev trace support. New dfl_alloc() function in the SPL is used to allocate arrays of dkioc_free_list_t objects since they may be large enough to require virtual memory. Other changes: Suppressed kstat creation for pools with "$" names. The changes to vdev_raidz_map_alloc() have been minimized in order to allow more conflict-free merging with future changes (ABD). Added the following module parameters: zfs_trim - Enable TRIM zfs_trim_min_ext_sz - Minimum size to trim zfs_txgs_per_trim - Transaction groups over which to batch trims Requires-builders: none
Requires-builders: none
The extended zpool iostat options -wlqr will display information about automatic and manual TRIMs. This commit also fixes a completely unrelated bug in which the IOS_LATENCY row in the vsx_type_to_nvlist array was missing an entry for the scrub nvlist. Requires-builders: none
…ents. Requires-builders: none
The blkdev_issue_discard() function has been available for a long time by the kernel but it only supports synchronous discards. The __blkdev_issue_discard() function provides an asynchronous interface but was added in the 4.6 kernel. Only supporting synchronously discards can potentially limit performance when processing a large number of small extents. To avoid this an asynchronous discard implementation has been added to vdev_disk.c which builds on existing functionality. The kernel provided synchronous version remains the default pending additional functional and performance testing. Due to different mechamism used for submitting TRIM commands there were not being properly accounted for in the extended statistics. Resolve this by allow for aggregated stats to be returned as part of the TRIM zio. This allows for far better visibility in to the discard request sizes. Minor documentation updates. Signed-off-by: Brian Behlendorf <[email protected]> Requires-builders: none
Requires-builders: none
Signed-off-by: Brian Behlendorf <[email protected]> Requires-builders: none
Porting Notes: Man page changes dropped for the moment. This can be reconsiled when the final version is merged to OpenZFS. They are accurate now, only worded a little differently. Requires-builders: none
1) Removed the first-fit allocator. 2) Moved the autotrim metaslab scheduling logic into vdev_auto_trim. 2a) As a consequence of openzfs#2, metaslab_trimset_t was rendered superfluous. New trimsets are simple range_tree_t's. 3) Made ms_trimming_ts remove extents it is working on from ms_tree and then add them back in. 3a) As a consequence of openzfs#3, undone all the direct changes to the allocators and removed metaslab_check_trim_conflict and range_tree_find_gap. Porting Notes: * Removed WITH_*_ALLOCATOR macros and aligned remaining allocations with OpenZFS. Unused wariables warnings resolved with the gcc __attribute__ ((unused__ keyword. * Added missing calls for ms_condensing_cv. Signed-off-by: Brian Behlendorf <[email protected]> Requires-builders: none
Porting Notes: * metaslab_sync changes already applied. * resync of test cases needed Requires-builders: none
1) Simplified the SM_FREE spacemap writing while a trim is active. 2) Simplified the range_tree_verify in metaslab_check_free. 3) Clarified comment above metaslab_trim_all. 4) Substituted 'flust out' with 'drop' in comment in metaslab_trim_all. 5) Moved ms_prev_ts clearing up to ms_cur_ts claring in metaslab_trim_all. 6) Added recomputation of metaslab weight when metaslab is loaded. 7) Moved dmu_tx_commit inside of spa_trim_update_time. 8) Made the smallest allowable manual trim rate 1/1000th of a metaslab size. 9) Switched to using hrtime_t in manual trim timing logic. 10) Changed "limited" to "preserve_spilled" in vdev_auto_trim. 11) Moved vdev_notrim setting into zio_vdev_io_assess.a Porting Notes: * vdev_disk.c and zio.c hunks already applied. * nsec_per_tick -> MSEC2NSEC(1) Requires-builders: none
… wanting to condense. Requires-builders: none
Requires-builders: none
Requires-builders: none
Some storage backends such as large thinly-provisioned SANs are very slow for large trims. Manual trim now supports "zpool trim -p" (partial trim) to skip metaslabs for which there is no spacemap. Signed-off-by: Tim Chase <[email protected]> Requires-builders: none
The existing test cases were split in to multiple test cases and refactored. There are now test cases for the following: zpool_trim_001_pos - Verify manual TRIM zpool_trim_002_pos - Verify manual trim can be interrupted zpool_trim_003_pos - Verify 'zpool trim -s' rate limiting zpool_trim_004_pos - Verify 'zpool trim -p' partial TRIM works zpool_trim_005_neg - Verify bad parameters to 'zpool trim' zpool_trim_006_neg - Verify bad parameters to 'zpool trim -r' autotrim_001_pos - Verify 'autotrim=on' pool data integrity autotrim_002_pos - Verify various pool geometries manualtrim_001_pos - Verify manual trim pool data integrity manualtrim_002_pos - Verify various pool geometries manualtrim_003_pos - Verify 'zpool import|export' manualtrim_004_pos - Verify 'zpool online|offline|replace' manualtrim_005_pos - Verify TRIM and scrub run concurrently Signed-off-by: Brian Behlendorf <[email protected]> Requires-builders: none
* Rename TRIM taskq threads to be more concise for Linux. * Fix divide by zero panic Signed-off-by: Brian Behlendorf <[email protected]> Requires-builders: none
* Fixed missing taskq_destroy when exporting a pool which is being actively trimmed. * Add auto/manual TRIM coverage to ztest. * Temporarily disable manualtrim_004_pos. Signed-off-by: Brian Behlendorf <[email protected]> Requires-builders: none
Signed-off-by: Chunwei Chen <[email protected]> Requires-builders: none
Signed-off-by: Chunwei Chen <[email protected]> Requires-builders: none
Signed-off-by: Chunwei Chen <[email protected]> Requires-builders: none
Signed-off-by: Tim Chase <[email protected]> Requires-builders: none
Requires-builders: none
Requires-builders: none
NOTE: should be squashed into a previous commit Requires-builders: none
Also, comment-out the ASSERT(!metaslab_should_allocate(msp, asize)); in metaslab_group_alloc_normal(). It seems that the additional metaslab_group_sort() performed by trim makes this assertion invalid. Requires-builders: none
Requires-builders: none
Requires-builders: none
Also, stop trimming in places where initialization is stopped: spa_vdev_remove_log() and spa_vdev_remove_top(). Requires-builders: none
NOTE: This should be its own separate PR. It was discovered during debugging of the TRIM work. Requires-builders: none
Requires-builders: none
In trim_stop_set() and trim_stop_wait(). Requires-builders: none
... and related locking Don't take the spa_auto_trim_lock in the sync path: The auto trim taskq is now stopped and started in response to changes in the "autotrim" property asynchronously by the spa async task. If the spa_auto-trim_lock is taken in the sync task, the system can deadlock as follows: Non-sync context task: Acquires spa_auto_trim_lock via spa_vdev_enter() or some other path. Sync task: syncing a change to the "autotrim" property attempts to take spa_auto_trim_lock and blocks. Non-sync context task: blocks attempting to take a spa config lock which won't be released until the sync task completes. Deadlock. Also, since the auto trim taskq is now started asynchronously, it may not yet be ready yet when the sync task calls spa_auto_trim(). Modified spa_auto_trim so it will simply return if the taskq is not available yet (it will do its thing on the next pass). Also, avoid starting auto trim taskqs for the "import$" pseudo-pool. Also, don't attempt to create the taskq if a spa unload is in-progress. Requires-builders: none
The change in the previous commit to not attempt to create the taskq if a spa unload is in-progress caused the auto trim taskq to not be started upon import. Requires-builders: none
Rather than using the default value for the property, we need to use the value which might be set by "zpool create -o autotrim=". Requires-builders: none
Long zloop runs were occasionally hitting the "ASSERT(spa->spa_man_trim_taskq != NULL)" in spa_man_trim_taskq_destroy(). It's not clear to me how this was happening because the only place "spa->spa_man_trim_taskq" is cleared is in spa_man_trim_taskq_destroy() itself which is only called from the (single) spa_async_thread() or from spa_unload(). To that end, this commit adds a non-NULL check in spa_async_thread() analagous to the tests which were added when support for stopping/starting the auto trim taskqs were added to spa_async_thread(). NOTE: Yes, that means I consider this to be a band-aid.
@@ -2110,6 +2159,70 @@ again. | |||
Starts a resilver. If an existing resilver is already running it will be | |||
restarted from the beginning. Any drives that were scheduled for a deferred | |||
resilver will be added to the new one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add for newlines:
Starts a resilver. If an existing resilver is already running it will be
restarted from the beginning. Any drives that were scheduled for a deferred
resilver will be added to the new one.
+.It Xo
+.Nm
.Cm trim
.Op Fl p
.Op Fl r Ar rate | Fl s
Include
|
You'll want to add the new |
# You may not use this file except in compliance with the License. | ||
# | ||
# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE | ||
# or http://www.opensolaris.org/os/licensing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opensolaris.org is no more. Please refer to http://opensource.org/licenses/CDDL-1.0
This comment has been minimized.
This comment has been minimized.
tests = ['autotrim_001_pos', 'autotrim_002_pos', 'manualtrim_001_pos', | ||
'manualtrim_002_pos', 'manualtrim_003_pos', 'manualtrim_004_pos', | ||
'manualtrim_005_pos'] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add tags = ['functional', 'trim']
This comment has been minimized.
This comment has been minimized.
I'm going to be closing this PR. @behlendorf has been doing some major re-working of the original TRIM work to, among other things:
Expect a new PR to be posted soon. |
UNMAP/TRIM support is a frequently-requested feature to help prevent performance from degrading on SSDs and on various other SAN-like storage back-ends. By issuing UNMAP/TRIM commands for sectors which are no longer allocated the underlying device can often more efficiently manage itself. This TRIM implementation is modeled on the `zpool initialize` feature which writes a pattern to all unallocated space in the pool. The new `zpool trim` command uses the same vdev_xlate() code to calculate what sectors are unallocated, the same per- vdev TRIM thread model and locking, and the same basic CLI for a consistent user experience. The core difference is that instead of writing a pattern it will issue UNMAP/TRIM commands for those extents. The zio pipeline was updated to accommodate this by adding a new ZIO_TYPE_TRIM type and associated spa taskq. This new type makes is straight forward to add the platform specific TRIM/UNMAP calls to vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs. This makes it possible to largely avoid changing the pipieline, one exception is that TRIM zio's may exceed the 16M block size limit since they contain no data. In addition to the manual `zpool trim` command, a background automatic TRIM was added and is controlled by the 'autotrim' property. It relies on the exact same infrastructure as the manual TRIM. However, instead of relying on the extents in a metaslab's ms_allocatable range tree, a ms_trim tree is kept per metaslab. When 'autotrim=on', ranges added back to the ms_allocatable tree are also added to the ms_free tree. The ms_free tree is then periodically consumed by an autotrim thread which systematically walks a top level vdev's metaslabs. Since the automatic TRIM will skip ranges it considers too small there is value in occasionally running a full `zpool trim`. This may occur when the freed blocks are small and not enough time was allowed to aggregate them. An automatic TRIM and a manual `zpool trim` may be run concurrently, in which case the automatic TRIM will yield to the manual TRIM. Reviewed-by: Jorgen Lundman <[email protected]> Reviewed-by: Tim Chase <[email protected]> Reviewed-by: Matt Ahrens <[email protected]> Reviewed-by: George Wilson <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Contributions-by: Saso Kiselkov <[email protected]> Contributions-by: Tim Chase <[email protected]> Contributions-by: Chunwei Chen <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #8419 Closes #598
Motivation and Context
Frequently-requested feature to help performance on SSDs and on various other SAN-like storage back-ends which support UNMAP/TRIM.
Description
This is a port of Nextenta's TRIM support to the current master branch - post-device removal and post-device initialization.
How Has This Been Tested?
New ZTS tests. Also additions to ztest.
Types of changes
Checklist:
Signed-off-by
.