-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OpenZFS - 6363 Add UNMAP/TRIM functionality (v2) #7363
Conversation
This happens a lot in ztest. Also |
@kpande, @tuxoko was nice enough to offer to help address some of the remaining issues with TRIM. Thanks! Step one is getting all the old and new tests passing.
This looks like one of those remaining issues which needs to be understood and fixed. In the master branch we aren't seeing anything like this so it's likely TRIM related.
It also looks like several of the ZTS runs resulted in kernel panics. So there's definitely some work to do root causing these issues. |
module/zfs/vdev_queue.c
Outdated
|
||
/* trim I/Os have no single meaningful offset */ | ||
if (zio->io_priority != ZIO_PRIORITY_AUTO_TRIM || | ||
zio->io_priority != ZIO_PRIORITY_MAN_TRIM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per previous review, the comparison should be &&
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, ||
doesn't make sense.
Update dkio.h from Nexenta's version to pick up DKIOCFREE and add their dkioc_free_util.h header for TRIM support.
Ported by: Tim Chase <[email protected]> Porting notes: The trim kstats are in zfs/<pool> along with the other per-pool stats. The kstats can be cleared by writing to the kstat file. Null format parameters to strftime() were replaced with "%c". Added vdev trace support. New dfl_alloc() function in the SPL is used to allocate arrays of dkioc_free_list_t objects since they may be large enough to require virtual memory. Other changes: Suppressed kstat creation for pools with "$" names. The changes to vdev_raidz_map_alloc() have been minimized in order to allow more conflict-free merging with future changes (ABD). Added the following module parameters: zfs_trim - Enable TRIM zfs_trim_min_ext_sz - Minimum size to trim zfs_txgs_per_trim - Transaction groups over which to batch trims
The extended zpool iostat options -wlqr will display information about automatic and manual TRIMs. This commit also fixes a completely unrelated bug in which the IOS_LATENCY row in the vsx_type_to_nvlist array was missing an entry for the scrub nvlist.
The blkdev_issue_discard() function has been available for a long time by the kernel but it only supports synchronous discards. The __blkdev_issue_discard() function provides an asynchronous interface but was added in the 4.6 kernel. Only supporting synchronously discards can potentially limit performance when processing a large number of small extents. To avoid this an asynchronous discard implementation has been added to vdev_disk.c which builds on existing functionality. The kernel provided synchronous version remains the default pending additional functional and performance testing. Due to different mechamism used for submitting TRIM commands there were not being properly accounted for in the extended statistics. Resolve this by allow for aggregated stats to be returned as part of the TRIM zio. This allows for far better visibility in to the discard request sizes. Minor documentation updates. Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Porting Notes: Man page changes dropped for the moment. This can be reconsiled when the final version is merged to OpenZFS. They are accurate now, only worded a little differently.
1) Removed the first-fit allocator. 2) Moved the autotrim metaslab scheduling logic into vdev_auto_trim. 2a) As a consequence of openzfs#2, metaslab_trimset_t was rendered superfluous. New trimsets are simple range_tree_t's. 3) Made ms_trimming_ts remove extents it is working on from ms_tree and then add them back in. 3a) As a consequence of openzfs#3, undone all the direct changes to the allocators and removed metaslab_check_trim_conflict and range_tree_find_gap. Porting Notes: * Removed WITH_*_ALLOCATOR macros and aligned remaining allocations with OpenZFS. Unused wariables warnings resolved with the gcc __attribute__ ((unused__ keyword. * Added missing calls for ms_condensing_cv. Signed-off-by: Brian Behlendorf <[email protected]>
Porting Notes: * metaslab_sync changes already applied. * resync of test cases needed
1) Simplified the SM_FREE spacemap writing while a trim is active. 2) Simplified the range_tree_verify in metaslab_check_free. 3) Clarified comment above metaslab_trim_all. 4) Substituted 'flust out' with 'drop' in comment in metaslab_trim_all. 5) Moved ms_prev_ts clearing up to ms_cur_ts claring in metaslab_trim_all. 6) Added recomputation of metaslab weight when metaslab is loaded. 7) Moved dmu_tx_commit inside of spa_trim_update_time. 8) Made the smallest allowable manual trim rate 1/1000th of a metaslab size. 9) Switched to using hrtime_t in manual trim timing logic. 10) Changed "limited" to "preserve_spilled" in vdev_auto_trim. 11) Moved vdev_notrim setting into zio_vdev_io_assess.a Porting Notes: * vdev_disk.c and zio.c hunks already applied. * nsec_per_tick -> MSEC2NSEC(1)
… wanting to condense.
Some storage backends such as large thinly-provisioned SANs are very slow for large trims. Manual trim now supports "zpool trim -p" (partial trim) to skip metaslabs for which there is no spacemap.
The existing test cases were split in to multiple test cases and refactored. There are now test cases for the following: zpool_trim_001_pos - Verify manual TRIM zpool_trim_002_pos - Verify manual trim can be interrupted zpool_trim_003_pos - Verify 'zpool trim -s' rate limiting zpool_trim_004_pos - Verify 'zpool trim -p' partial TRIM works zpool_trim_005_neg - Verify bad parameters to 'zpool trim' zpool_trim_006_neg - Verify bad parameters to 'zpool trim -r' autotrim_001_pos - Verify 'autotrim=on' pool data integrity autotrim_002_pos - Verify various pool geometries manualtrim_001_pos - Verify manual trim pool data integrity manualtrim_002_pos - Verify various pool geometries manualtrim_003_pos - Verify 'zpool import|export' manualtrim_004_pos - Verify 'zpool online|offline|replace' manualtrim_005_pos - Verify TRIM and scrub run concurrently Signed-off-by: Brian Behlendorf <[email protected]>
* Rename TRIM taskq threads to be more concise for Linux. * Fix divide by zero panic Signed-off-by: Brian Behlendorf <[email protected]>
Rather than hacking `vdev_raidz_map_alloc()` to get the child offsets calculate the values directly. Signed-off-by: Isaac Huang <[email protected]>
* Fixed missing taskq_destroy when exporting a pool which is being actively trimmed. * Add auto/manual TRIM coverage to ztest. * Temporarily disable manualtrim_004_pos. Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Hi there, After some investigation it turns out there is ENOSPC returned by metaslab_alloc. I am creating zfs pool in a sparse file image. I can easily reproduce it by copying two sets of small files - the linux kernel tree (including its git) and my ccache directory. I hope this helps. Best regards, |
VERIFY(!msp->ms_condensing); | ||
if (msp->ms_loaded) { | ||
range_tree_walk(msp->ms_trimming_ts, range_tree_add, | ||
msp->ms_tree); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if this has already been addressed elsewhere. During the time the trimming was being done (initiated by metaslab_exec_trim
), ms_trimming_ts
is removed from msp->ms_tree
. If an allocation request occurs on such an mg
, isn't it possible that mg->mg_no_free_space
could be set (because of the unavailable range---the range being trimmed)?
And if so, is there code that would trigger a re-evaluation of the validity of that mg_no_free_space
flag (now that said extent is again available for allocation)? If not, that could explain @KernelKyupCom's ENOSPACE
issue.
Hi guys,
The important part here is to export the pool immediately after the trim is issued. The issue occurs when the trim complete event is rised and SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY is being displatched to have the spa_async_thread already suspended or not yet scheduled when it gets suspended by the export ioctl. This way the export path will not take care of destroying the man_trim_task at all and it remains out there forever. diff --git a/module/zfs/spa.c b/module/zfs/spa.c
index 9702df0c26f3..f609d834acb1 100644
--- a/module/zfs/spa.c
+++ b/module/zfs/spa.c
@@ -4624,7 +4624,34 @@ spa_export_common(char *pool, int new_state, nvlist_t **oldconfig,
*/
spa_open_ref(spa, FTAG);
mutex_exit(&spa_namespace_lock);
+
+ /*
+ * Before we suspend the async thread wait if there is ongoing trim.
+ * Otherwise we may not execute the man_trim_tasq_destroy scheduled by
+ * man_trim_done because the async thread is suspended. This results in
+ * tasq thread leak. Drop the mutex after the trim completes and
+ * reacquire it later when checking for still existing taskq with async
+ * thread suspended. This way we give chance to get it regullary cleaned
+ * by the done callback async request if the asyn thread is still
+ * running. Eventually if the request havent been executed before we
+ * suspend the async_thread explixitly clean up the man_trim taskq.
+ */
+ mutex_enter(&spa->spa_man_trim_lock);
+ while (spa->spa_num_man_trimming > 0)
+ cv_wait(&spa->spa_man_trim_done_cv, &spa->spa_man_trim_lock);
+ mutex_exit(&spa->spa_man_trim_lock);
+
spa_async_suspend(spa);
+ /*
+ * After the suspend, chech if there is still man trim taskq which is
+ * about to leak otherwise adn destroy it. This way its work will be
+ * commited by the txg sync work flushed below.
+ */
+ mutex_enter(&spa->spa_man_trim_lock);
+ if (spa->spa_man_trim_taskq)
+ spa_man_trim_taskq_destroy(spa);
+ mutex_exit(&spa->spa_man_trim_lock);
+
if (spa->spa_zvol_taskq) {
zvol_remove_minors(spa, spa_name(spa), B_TRUE);
taskq_wait(spa->spa_zvol_taskq); Also I suppose the same should be done to address the auto trim ? I have also changed the trim task que thread pool name formatting in order to get closer what is going on. That way it is much easier to identify who is who and what happens, since the task->comm is being truncated to 16 chars by default and the "_man_trim"/"_auto_trim" suffix gets truncated at all (or at least in my case). That's just a suggestion about the naming convention. diff --git a/module/zfs/spa_misc.c b/module/zfs/spa_misc.c
index 8a91b40510da..ab0443f5537e 100644
--- a/module/zfs/spa_misc.c
+++ b/module/zfs/spa_misc.c
@@ -2257,7 +2257,7 @@ spa_auto_trim_taskq_create(spa_t *spa)
ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
ASSERT(spa->spa_auto_trim_taskq == NULL);
- (void) snprintf(name, MAXPATHLEN, "%s_auto_trim", spa->spa_name);
+ (void) snprintf(name, MAXPATHLEN, "atrim_%s", spa->spa_name);
spa->spa_auto_trim_taskq = taskq_create(name, 1, minclsyspri, 1,
spa->spa_root_vdev->vdev_children, TASKQ_DYNAMIC);
VERIFY(spa->spa_auto_trim_taskq != NULL);
@@ -2283,7 +2283,7 @@ spa_man_trim_taskq_create(spa_t *spa)
*/
return;
}
- (void) snprintf(name, MAXPATHLEN, "%s_man_trim", spa->spa_name);
+ (void) snprintf(name, MAXPATHLEN, "mtrim_%s", spa->spa_name);
spa->spa_man_trim_taskq = taskq_create(name, 1, minclsyspri, 1,
spa->spa_root_vdev->vdev_children, TASKQ_DYNAMIC);
VERIFY(spa->spa_man_trim_taskq != NULL); I am open for any critics about this fix. Best regards, |
@tuxoko Would it be possible to update this PR to @dweeezil's post-removal rebase? It might be helpful to see the test suite results. |
@aerusso |
is there any chance this will be ready for 0.8.0 release? or 0.8.x? |
Maybe not the best place to ask, but what alternatives are there in the meantime until the change is incorporated? How about not giving the entire SSD to ZFS, but keeping a part of it for some other filesystem, or just swap - would that allow the kernel to TRIM the entire device or only the non-ZFS part? |
switch to freebsd https://blog.plein.org/2014/04/15/freebsd-9-2-supports-zfs-trimunmap/ |
Hmm, apparently it’s possible to do this from a VM ... and I have TrueOS on a partition I can either boot from or run in a VM. I’ll just have to remember to export the pool from linux first, and maybe unmount the swap partition too.
Sent from a pocket terminal
… On 25 Oct 2018, at 13:08, mailinglists35 ***@***.***> wrote:
Maybe not the best place to ask, but what alternatives are there in the meantime until the change is incorporated? How about not giving the entire SSD to ZFS, but keeping a part of it for some other filesystem, or just swap - would that allow the kernel to TRIM the entire device or only the non-ZFS part?
switch to freebsd https://blog.plein.org/2014/04/15/freebsd-9-2-supports-zfs-trimunmap/
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
|
It would be only the non zfs part |
It would be only the non zfs part
Well, that's not a big deal if I can do the ZFS part by importing the pool under a FreeBSD clone. Would importing the pool be enough, or would I'd have to do a scrub or something similar?
|
I suggest we stop discussing this here and now before a mod will lock this thread. Use the mailing lists for this. |
Replaced with #8255. |
From #5925
Rebased from @dweeezil 's ntrim2-next
Added trim related manpage changes.
Description
Motivation and Context
How Has This Been Tested?
Types of changes
Checklist:
Signed-off-by
.