Move ARC data buffers out of vmalloc #2129
Conversation
Call me intrigued. How much testing have you done yourself?
@DeHackEd Not much actually.
@tuxoko Excellent work. I had started working on something called sgbuf that was intended to replace the zio buffers entirely, but that stalled as higher priority things took time away from completion. Targeting just the arc buffers like you have done is not only simpler, but strikes at the heart of the problem. I will review what you have done over the next day to provide code comments.
I'll be taking a closer look as well. Before going into detail, I want to say that metadata buffers really should be considered for scatter/gather as well. A lot of people (myself included) have metadata-heavy workloads. I very nearly allocate all of my ARC to metadata on my backup target. I understand the need for contiguous buffers for things like ZAP data structures, but it may be possible to find a happy middle ground.
@tuxoko I had a brief chat with @ahrens about this in #zfsonlinux on freenode. Since it was a public conversation, I have reproduced it below, with noise from parallel conversations omitted:
Thanks for your time. P.S. I didn't notice that the ({ ... }) construct is GCC-only, though.
@tuxoko Nice work! I've only skimmed the patches thus far but this looks like a very workable approach in general. In particular, leaving the metadata linear was a nice way to simplify the problem and minimize the required changes. Since the vast majority of metadata is small (16k or less) we could just shift that to the Linux slab or kmalloc it for now. Further optimization may be needed but that can certainly happen in a second or third step. One other thing I like about this proposed change is that it lays some of the groundwork which is needed to integrate the ARC with the page cache. That's still going to be a complicated job and tricky to get right, but having everything already tracked in the ARC as pages helps considerably. I'd like to dig into these further but to be honest it may be a while before I'm able to give these changes the detailed attention they deserve. I need to focus on getting the next tag finalized. However, when that's done memory management is the next area I wanted to focus on, so your timing is good. In the meanwhile, if others such as @ryao and @ahrens have the time to weigh in on the changes that would be helpful!
So I just installed this onto my workstation. I told it to use 3/4 of my RAM (6 GB of 8 total) as ARC. I did:
IO from the gzip operation was pretty bad, and this is just a simple mirror of two 7200 RPM 1TB disks so the system had it pretty rough, but it was stable and the ARC shrank cleanly for the Windows VM to launch. First impressions actually running the patch are good! More to come later.
One big problem I've had so far is trying to test it. Since the code is kernel-space only, I wrote some glue to provide the kernel scatterlist functionality to userspace and tried forcing use of abd at all times, but now it's faulting because the pointer bit trick appears to cause scatterlist abd_t to be matched when they are not. Option 2 is to just build a kernel driver with all the assertions enabled and play with it. I've tried that as well, and so far so good, but it's not as thorough as ztest. I still plan to try making this work. Just wanted to provide an update. Slight edit: experimenting with github issue references
Update: I added my own testing layer which SEEMS to work out. If you're interested my patch is (deleted) -- it comes with no warranty and the duct tape is a design decision. I've had a stack trace involving abd itself:
#4  0x00007f9cd5dc8048 in dbuf_sync_leaf (dr=0x7f9ca4094bf0, tx=0x7f9c9c3ecf30) at ../../module/zfs/dbuf.c:2464
2464            ASSERT(ABD_TO_LINEAR(db->db.db_data) != dr->dt.dl.dr_data);
...
#28 0x00007f9cd5dc7d31 in dbuf_sync_indirect (dr=0x7f9ca4059e90, tx=0x7f9c9c3ecf30) at ../../module/zfs/dbuf.c:2430
#29 0x00007f9cd5dc89d1 in dbuf_sync_list (list=0x7f9ca40a7ed8, tx=0x7f9c9c3ecf30) at ../../module/zfs/dbuf.c:2600
#30 0x00007f9cd5dc7d31 in dbuf_sync_indirect (dr=0x7f9ca40a7e60, tx=0x7f9c9c3ecf30) at ../../module/zfs/dbuf.c:2430
#31 0x00007f9cd5dc89d1 in dbuf_sync_list (list=0x7f9ca40beb10, tx=0x7f9c9c3ecf30) at ../../module/zfs/dbuf.c:2600
#32 0x00007f9cd5df7fb9 in dnode_sync (dn=0x7f9ca40be950, tx=0x7f9c9c3ecf30) at ../../module/zfs/dnode_sync.c:705
#33 0x00007f9cd5dd845e in dmu_objset_sync_dnodes (list=0x1353830, newlist=0x0, tx=0x7f9c9c3ecf30) at ../../module/zfs/dmu_objset.c:947
#34 0x00007f9cd5dd8d6f in dmu_objset_sync (os=0x1353460, pio=0x7f9c9c414800, tx=0x7f9c9c3ecf30) at ../../module/zfs/dmu_objset.c:1067
#35 0x00007f9cd5dfd939 in dsl_dataset_sync (ds=0x1353010, zio=0x7f9c9c414800, tx=0x7f9c9c3ecf30) at ../../module/zfs/dsl_dataset.c:1353
#36 0x00007f9cd5e0f3ea in dsl_pool_sync (dp=0x12463a0, txg=123) at ../../module/zfs/dsl_pool.c:471
#37 0x00007f9cd5e44c2e in spa_sync (spa=0x1243ad0, txg=123) at ../../module/zfs/spa.c:6234
#38 0x00007f9cd5e52a78 in txg_sync_thread (dp=0x12463a0) at ../../module/zfs/txg.c:562
...
(gdb)
Plus some checksum errors as reported by (unpatched) ZDB and some "buffer modified while frozen" errors. Anyone else want to try?
@DeHackEd I didn't use ABD in userspace because I was lazy and didn't bother to dig into the userspace code. But if you think it would help debugging, then it would certainly be worth coming up with a userspace ABD.
And lastly, the pointer trick was also a product of my laziness, because I didn't want to touch too many things where only linear buffers exist. I'll clean it up and use a proper structure to hold both linear and scatter buffers when I have time... Thanks
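For what it's worth, one obvious shape for such a structure is an explicit type flag plus a union over the two backing representations. This is only a rough sketch of the idea, not the layout this patch set ends up with; all field names here are made up:

```c
#include <linux/scatterlist.h>

/* Sketch only: an explicit dual-typed buffer instead of a tagged pointer. */
typedef struct abd {
	int			abd_is_linear;	/* which union member is valid */
	size_t			abd_size;	/* usable length in bytes */
	size_t			abd_offset;	/* offset into the first segment */
	union {
		void			*abd_buf;	/* linear: slab/kmalloc buffer */
		struct scatterlist	*abd_sgl;	/* scatter: list of pages */
	} abd_u;
} abd_t;

#define	ABD_IS_LINEAR(abd)	((abd)->abd_is_linear != 0)
```

Making the type explicit avoids stealing pointer bits and keeps the type-checking ASSERTs straightforward.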
Thanks for the feedback. I've updated that line in my own code. One other little thing I should mention: right about the same time you posted your pull request there was an update pushed to the main repository that updates arc.c fairly extensively, to the point that your tree doesn't apply cleanly anymore. Hand merging is straightforward, but you may want to consider rebasing your tree if you have the time to do so. My ztest runs have been encouraging enough that I'm going to try this patch on one or two more (real/bare metal) machines.
There were some ARC changes so it would be helpful if you refreshed these patches. We also absolutely want this code to be stressed in the userspace builds if possible. Being able to generate random workloads with ztest has proven to be a very good way to catch subtle issues which might otherwise be missed.
Rebased and added userspace scatter ABD
Remove container_of kernel code |
Sorry about that... |
I think this last update (tuxoko/zfs@b003f01) fixed one ztest error I was getting, in which zdb checking the pool gives checksum errors on some blocks. There are a few more suspicious ztest errors though. For example: lt-ztest: ../../cmd/ztest/ztest.c:2263: Assertion `dmu_read(zd->zd_os, object, offset, blocksize, data, 1) == 0 (0x34 == 0x0)' failed. (0x34 == 52 == EBADE == checksum error) It's hard to tell if this patch is responsible or if it's something in ZFS itself. There are a few ztest bugs open.
Hi @DeHackEd. What's the difference between lt-ztest and ztest? I keep getting this when running ztest:
That assertion is a known issue (#1964). lt-ztest means that the built binary was re-linked to support running from the build directory instead of being installed into /usr/local/sbin or whatever prefix you used.
@tuxoko There are a few known ztest failures which need to be resolved. You hit one of them; it would be great if someone had the time to run these to ground.
Hi @behlendorf
Here the db->db.db_data is freed. Could you shed some light on this matter? |
@tuxoko To my knowledge there aren't any outstanding issues regarding this. Although, very occasionally I have seen people report non-reproducible crashes in the user evict code, so there may be something subtle lurking.
Rebase to master and add two patches for missing things. There are still a lot of checksum errors when running ztest, and all of them are data blocks.
I've seen those, and have been trying to diagnose them myself. (This is the sort of reason why I wanted ztest to run in the first place.) Thanks for the rebase. I'm going to be giving it a try as well.
So the original SHA256 implementation cannot be used as-is for a scatter ABD.
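For context on why that is: a straight SHA256(buf, size) call needs one contiguous mapping, whereas a scatter ABD has to be digested one mapped segment at a time. A minimal sketch of that pattern, assuming an illumos-style SHA2 API and a hypothetical abd_iterate_rfunc() segment-walking helper (the real helper names and signatures in this patch set may differ):

```c
#include <sys/sha2.h>

/* Feed one mapped segment of the (possibly scattered) buffer into the digest. */
static int
sha256_feed_segment(const void *buf, uint64_t size, void *private)
{
	SHA2Update((SHA2_CTX *)private, buf, size);
	return (0);	/* 0 = keep iterating over the remaining segments */
}

/* Hypothetical: checksum an ABD without linearizing it first. */
static void
abd_checksum_sha256(abd_t *abd, uint64_t size, zio_cksum_t *zcp)
{
	SHA2_CTX ctx;

	SHA2Init(SHA256, &ctx);
	(void) abd_iterate_rfunc(abd, size, sha256_feed_segment, &ctx);
	SHA2Final(zcp, &ctx);	/* endianness fixup of the digest omitted here */
}
```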
Nice find. Glad I pressed for ztest support. SHA256 problems would probably bite anyone using dedup. I've pulled a copy and am running ztest on it extensively right now. Will report back in a few hours if anything turns up.
Though I got side-tracked quite a bit yesterday, I want to say that testing results have been largely positive and I intend to continue using this patch on my own systems.
@tuxoko the large block patches have been merged to master. If you could refresh this patch stack to apply cleanly, and handle the larger block size, we can focus on getting this in next. I'd really like to push to get this in the next tag.
@behlendorf
@tuxoko
Overall this seems to be working fine alongside the large block changes (local manual merge). Side note: when observing _abd_alloc_scatter() calls while using large record sizes (16M), in addition to the expected large allocations, I see an unexpected number of 12K and 16K scatter allocations coming through. These appear to be mostly coming from vdev I/O aggregation of 4K metadata (note my vdev ashift is 12). Just curious whether we want/need these transient metadata aggregation buffers to be scatter? I don't think it's a problem, I just wasn't expecting to see them.
@don-brady I think this is best. If we're going to allow the aggregation, then my gut feeling would be that it makes the most sense to use the scatter lists for this. Allocating pages and mapping them atomically to be zeroed/copied (as is done) should be very efficient and avoids any potential lock contention in a shared allocator like the slab. We can always revisit this if that turns out not to be the case! As another related aside, it's never been 100% clear that the existing vdev queue aggregation under Linux is always a win. In many cases this might be wasted work which the block device elevator would have efficiently taken care of. Although there are certainly exceptions, like gap spanning, which may make the whole thing worthwhile. But that's neither here nor there for this issue!
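To illustrate the "allocate pages and map them atomically to be zeroed/copied" pattern referred to above, here is a sketch only, assuming a page-per-entry scatterlist; this is not the actual aggregation code from the patch stack:

```c
#include <linux/highmem.h>
#include <linux/kernel.h>
#include <linux/scatterlist.h>
#include <linux/string.h>

/* Copy `len` bytes from a linear source into a page-backed scatterlist. */
static void
copy_to_scatter(struct scatterlist *sgl, int nents, const void *src, size_t len)
{
	struct scatterlist *sg;
	size_t done = 0;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		size_t n = min_t(size_t, sg->length, len - done);
		void *dst = kmap_atomic(sg_page(sg));	/* short-lived mapping */

		memcpy((char *)dst + sg->offset, (const char *)src + done, n);
		kunmap_atomic(dst);

		done += n;
		if (done == len)
			break;
	}
}
```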
@don-brady @behlendorf
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Merge up to 5dc8b73 (Illumos 5765 - add support for estimating send stream size with lzc_s...)
Signed-off-by: Chunwei Chen <[email protected]>
Conflicts:
	module/zfs/arc.c
	module/zfs/dmu.c
	module/zfs/dmu_send.c
	module/zfs/vdev_disk.c
	module/zfs/zap_micro.c
	module/zfs/zil.c
Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
@tuxoko: it appears that this stack is somehow affecting space calculation for L2ARC devices. Could you take a look at #3400 and weigh in? I'm now seeing the problem off of abd_next without any of the alterations brought by #3115 or additional changes from @kernelOfTruth's work.
Also use the same optimization of abd_buf_segment in abd_get_offset. Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
In dsl_scan_scrub_cb and spa_load_verify_cb, we originally always allocated a linear ABD. Now we try to allocate a scatter ABD according to the BUFC type of the blkptr, to reduce unnecessary SPL slab allocation. Also, in zio_ddt_read_start we match the parent zio->io_data ABD type for the same reason.
Signed-off-by: Chunwei Chen <[email protected]>
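Roughly speaking, the choice described in that commit message looks like the following. This is a sketch with hypothetical abd_alloc_scatter()/abd_alloc_linear() helper names; BP_GET_BUFC_TYPE() is the existing macro that classifies a block pointer as data or metadata:

```c
/*
 * Sketch only: pick the ABD type for a scrub/verify read from the block
 * pointer's buffer-contents type, so data blocks land in page-backed
 * scatter buffers and metadata stays on the linear/slab path.
 */
static abd_t *
scan_read_abd_alloc(const blkptr_t *bp)
{
	uint64_t size = BP_GET_PSIZE(bp);

	if (BP_GET_BUFC_TYPE(bp) == ARC_BUFC_DATA)
		return (abd_alloc_scatter(size));
	return (abd_alloc_linear(size));
}
```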
Hi guys, I've just updated this pull request. I'll submit a new and rebased pull request after I see some results from the buildbot.
@tuxoko nice work. Hopefully later this week I'll have some time to start on a careful review. In the meanwhile I'll instruct the buildbot to go through and test all the patches in this stack.
Adaptation to openzfs#2129 in l2arc_compress_buf(l2arc_buf_hdr_t *l2hdr), from:
	/*
	 * Compression succeeded, we'll keep the cdata around for
	 * writing and release it afterwards.
	 */
+	if (rounded > csize) {
+		bzero((char *)cdata + csize, rounded - csize);
+		csize = rounded;
+	}
to:
	/*
	 * Compression succeeded, we'll keep the cdata around for
	 * writing and release it afterwards.
	 */
	if (rounded > csize) {
		abd_zero_off(cdata, rounded - csize, csize);
		csize = rounded;
	}
Adapted to openzfs#3216 and openzfs#2129 in l2arc_compress_buf(l2arc_buf_hdr_t *l2hdr), from:
	/*
	 * Compression succeeded, we'll keep the cdata around for
	 * writing and release it afterwards.
	 */
+	if (rounded > csize) {
+		bzero((char *)cdata + csize, rounded - csize);
+		csize = rounded;
+	}
to:
	/*
	 * Compression succeeded, we'll keep the cdata around for
	 * writing and release it afterwards.
	 */
	if (rounded > csize) {
		abd_zero_off(cdata, rounded - csize, csize);
		csize = rounded;
	}
ZFSonLinux: openzfs#3114 openzfs#3400 openzfs#3433
From the current head I ended up needing this in order to build it properly:
Signed-off-by: Chunwei Chen <[email protected]>
Closing; refreshed as pull request #3441. Let's move additional review and comments into the updated version.
…to be written to l2arc device

If we don't account for that, then we might end up overwriting the disk area of buffers that have not been evicted yet, because l2arc_evict operates in terms of disk addresses. The discrepancy between the write size calculation and the actual increment to l2ad_hand was introduced in commit e14bb3258d05c1b1077e2db7cf77088924e56919.

Also, consistently use asize / a_sz for the allocated size and psize / p_sz for the physical size. The latter accounts for possible size reduction because of compression, whereas the former accounts for possible size expansion because of alignment requirements.

The code still assumes that either the underlying storage subsystems or the hardware is able to do read-modify-write when an L2ARC buffer size is not a multiple of a disk's block size. This is true for 4KB-sector disks that provide 512B sector emulation, but may not be true in general. In other words, we currently do not have any code to make sure that an L2ARC buffer, whether compressed or not, which is used for physical I/O has a suitable size.

Modified to account for the changes of openzfs#2129 (ABD), openzfs#3115 (Illumos 5408 - managing ZFS cache devices requires lots of RAM) and Illumos 5701 - zpool list reports incorrect "alloc" value for cache devices.
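As a concrete illustration of the asize vs. psize distinction above (numbers picked arbitrarily; vdev_psize_to_asize() is what ZFS actually uses, this is just the underlying alignment arithmetic):

```c
#include <sys/sysmacros.h>	/* P2ROUNDUP */

/* Illustration only: asize is psize rounded up to the vdev's sector size. */
static uint64_t
l2arc_write_asize(uint64_t psize, uint64_t ashift)
{
	return (P2ROUNDUP(psize, 1ULL << ashift));
}

/*
 * e.g. l2arc_write_asize(9728, 12) == 12288: a 9728-byte compressed buffer
 * occupies three whole 4 KiB sectors, so the write hand must advance by
 * 12288 bytes, not 9728.
 */
```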
_Warning_
This patchset requires further testing and review.
DO NOT USE IN PRODUCTION.
Introduce ABD: linear/scatter dual-typed buffer for ARC
zfsonlinux currently uses a vmalloc-backed slab for ARC buffers. There are some
major problems with this approach. One is that 32-bit systems have only a
small amount of vmalloc address space. Another is that fragmentation in the slab
will easily trigger OOM on a busy system.
With ABD, we use scatterlists to allocate data buffers. In this approach we can
allocate from HIGHMEM, which alleviates vmalloc space pressure on 32-bit systems. Also,
we don't have to rely on the slab, so there's no fragmentation issue.
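As a rough illustration of that scatter path (a minimal sketch under the assumption of one page per scatterlist entry; error handling and the surrounding ABD bookkeeping are left out, and this is not the patch's actual allocator):

```c
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

/* Allocate n data pages (highmem allowed) and describe them in a scatterlist. */
static struct scatterlist *
alloc_scatter_pages(int n)
{
	struct scatterlist *sgl;
	int i;

	sgl = kcalloc(n, sizeof (*sgl), GFP_KERNEL);
	sg_init_table(sgl, n);

	for (i = 0; i < n; i++) {
		/* __GFP_HIGHMEM lets pages come from ZONE_HIGHMEM on 32-bit */
		struct page *page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);

		sg_set_page(&sgl[i], page, PAGE_SIZE, 0);
	}
	return (sgl);
}
```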
But for metadata buffers, we still use linear buffers from the slab. The reason for
this is that there are a lot of *_phys pointers that point directly into metadata
buffers, so it's kind of impractical to change all of that code. And metadata
should only be a small percentage, so we probably don't have to worry about
the 32-bit and fragmentation issues, right?
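For example (illustrative only, not code from this patch set), metadata consumers all over the tree cast a dbuf's data pointer straight to an on-disk structure, which only works if the buffer is virtually contiguous:

```c
#include <sys/dmu.h>
#include <sys/dmu_objset.h>

/* The buffer behind db_data is overlaid directly with the on-disk struct. */
static uint64_t
objset_type_of(dmu_buf_t *db)
{
	objset_phys_t *osp = db->db_data;	/* direct overlay, no copy */

	return (osp->os_type);
}
```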
This patchset is divided into the following parts:
Edit:
I forgot to mention...
There are some parts that haven't been made able to handle scatter buffers yet,
compression and raidz Gaussian elimination to name a few.
Enabling them will cost you an extra copy compared to running without this patch set.
Another thing is the xuio path, which I totally skipped.
Todo: