-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new option: cpu_cache #120
Conversation
Hey, thanks for the pr! This is an interesting idea grouping contiguous uploads together. Having to upload a whole row for a single glyph change doesn't seem ideal though. If we go further with this, I wonder if we could have a separate optional layer that providers an Regarding the alignment issues it might be worth looking at (or maybe working with) https://github.com/hecrj/wgpu_glyph to see how that was addressed there. |
I also had an idea of a "lite" version of this. That simply checks if the draw-cache is empty, if so skips all individual uploads performs a single rect update covering all rows. This has the advantage of not requiring texture information to be held in memory (at least not long term). If we cut down the uploading use cases to:
The lite optimisation will help 1 & 4 directly. 2 is already fairly optimal, at least when it comes to uploads. Leaving just 3 to be less efficiently handled with multiple uploads. On the face of it this idea seems attractive and should similarly improve your "first time" benchmark. What do you think? |
I'm glad this stimulated some interesting thoughts :)
I had considered the idea of more "optimally" batching these uploads, but consciously decided against it. My intuition here is that, in this sort of situation where you're trying to maintain a GPU cache, the synchronization cost of telling the GPU, "I want you to copy this data," far outweighs the cost of performing the actual transfer for pretty much any row size you'd have. In other words, I think that in practice the cost of sending a whole row for a single (or small number of) glyph change(s) will work out to be less than the cost of either 1) sending the small changes individually (with some alignment logic thrown in), or 2) figuring out how to group the small changes together, and potentially sending a larger number of total updates.
With this premise, I'd argue that 2 is actually non-optimal (and also does not address alignment issues). The thought here is that keeping the CPU-side cache both lets you limit GPU calls and fixes alignment issues with almost no additional overhead other than the RAM usage. However, I could very well be wrong! I'm basing this all on my intuition formed around some tangential benchmarks. I'm happy to work to prove this intuition out, though, by creating some benchmarks that would be representative of the different use-cases we can hit with different approaches. Separately, thanks for this recommendation:
This https://github.com/hecrj/wgpu_glyph/blob/master/src/pipeline/cache.rs#L46 has made me realize that the important part of this buffer-size restriction is actually the buffer size and not the size of the region of the texture being copied into. This is something that could be, in theory, addressed in draw-cache by having some buffer-alignment option that forces the region being drawn onto into alignment while otherwise keeping everything else the same (or else, perhaps less efficiently, by copying the region returned into your own buffer first). |
I agree that more complex partial update cases require real world benchmarks to guide optimisation. But doing a single rect update on initial population & full-reorder should be a win in any case. It should be an "easy" optimisation to what already exists too. |
Hi Alex, I spent the week running a few benchmarks on this branch and on master (including #121). I wanted to make the scenarios being tested at least somewhat "realistic" but a consequence of this is that I created the tests in an ad-hoc fashion and did not make them readily reproducible. Nonetheless I think the results are illuminating. MethodologyThere were three benchmarks which I'm calling "Resize", "Jitter" and "Large jitter". I tested them on this branch and your master with wgpu handling the rendering. I ran each test with the Vulkan and DirectX 12 backends and instrumented them with Superliminal. I measured the time that it took to "update the texture cache" - which included calling I performed a The three benchmarks were as follows: ResizeThis benchmark asks the question: What's the performance when you do something fairly minor, like resize your window, thus forcing a new text layout? In most cases I tested the answer was: performance on the whole is good because after a bit of resizing you wind up with all of the possible glyphs you need in the cache. To make it more interesting, I experimented with a set of glyphs + cache-size such that you could never fit all possible glyphs for all text layouts into the cache. The result was a body of text, 625 characters, repeated 3 times in three different sizes, in a 512x512 cache. A random sized text boundary was chosen every frame, which caused an average of 29.3 glyph updates per frame. JitterThis benchmark asks the question: What's the performance when the cache experiences a reasonably large change? This could be caused by e.g. someone resizing their text or someone pasting some new text into place. This was simulated with a body of text, 625 characters, in a 256x256 cache. The text had a random jitter between 0.0 and 1.0 applied to its x and y positions, which caused an average of 97.2 glyph updates per frame. Large jitterMuch like the last, but with ~10x the number of changes. This was done with a body of text, 625 characters, repeated 9 times in different sizes, in a 1024x1024 cache with jitter applied as before. This caused an average of 1236.1 glyph updates per frame. Results
DiscussionThe clear takeaway here is that the penalty for performing a GPU operation is large and the additional cost for copying more data is small. Even when ~100x the data is being copied, the operations only took about 2x as long. Clearly this makes doing 10-1000x fewer operations worth while. Even when only a small number of glyphs were updated, this branch still outperformed master, but as soon as you have more than a very small set of changes, the performance advantage in this approach is fairly evident. One question that remains is, are these numbers remotely meaningful? In other words, while using a CPU-side cache gives clearly better performance, is it actually worth the trouble? While you aren't likely to run into a scenario where you have to update this number of glyphs frame-after-frame-after-frame, I'd argue that it's entirely possible that you could still hit these situations on occasion (e.g. when resizing your text or pasting a chunk of previously unseen characters) and the penalty for doing so - skipping quite a number of frames - is severe enough that it should be minimized as much as possible. The other question is, of course, does this mean that this is the right approach for this draw-cache library? And to that I say: I don't know! It's still entirely possible to apply this methodology if you are calling into draw-cache, without having to add it to the library itself, so it's not a matter of necessity. On one hand, I wouldn't be hugely bothered if I had to add a CPU-side cache onto my program. On the other hand, this really feels like it's the core use-case of this library, and the changes required to support this aren't particularly large. In the end, I'd say it comes down to your vision for what you think draw-cache should be doing for its users, @alexheretic. Either way, I'll personally be happy with the functionality it provides :) |
Thanks for doing the investigation! I think you make a good case for having the draw cache in ram to reduce writes per frame. Another little detail is that with multithreading enabled uploads can actually be done concurrently with rasterization... But I'm more convinced by your numbers overall. Also after thinking about it some more, storing the texture in ram would be a fairly unsurprising thing for this lib to do. So I don't have anything against it, the currently design just didn't need it. If minimising uploads is the aim, I wonder if we could go further and guarantee that there is at most one upload per frame. So we take the smallest rect that covers all changed glyphs and upload that. Following on from that it makes me think a different API would make sense, as the current callback would be executed only 0 or 1 times. So instead it could simply return a result that contains the texture updates. fn cache_queued<F, U>(&mut self, fonts: &[F]) -> Result<TextureUpdate, CacheWriteErr> I was also thinking of optimisations this could allow for when we reorder the draw cache. Currently in that case we clear and re-process, which reorders the glyph tall-to-short. It also means we must rasterize all the glyphs again. With the texture in ram we could do this by moving the existing glyph data into a new texture reordered position (rasterize anything we haven't got) as an optimisation. |
I think this approach would work quite well, and it makes the API quite attractive too. And yes, given the cost of rasterization, I think the additional opportunities for optimization are pretty interesting as well. It feels like we've hit on something here, but that something looks like a pretty major change. Unless you're looking for help, I'm happy to let you take the lead here as I'm sure you're better equipped than I to drive this. Either way, I think this branch is far enough away from this vision that I'll close the PR. If you'd like any help moving this forward, just say the word! |
Hi Alex, thanks for the super useful library!
I've been using glyph-brush (and previously
rusttype::gpu_cache
) with wgpu-rs. I was doing so by calling theCommandEncoder
methodcopy_buffer_to_texture
for each value returned bycache_queued
. When I updated wgpu to version 0.6.0, I hit a snag, as it started enforcing buffer copies to align to 256 bytes.I had also previously noticed that this strategy came with some performance problems as well, as it required one copy to the GPU per changed glyph. So when rendering 100 glyphs for the first time, the result was 14.1ms time spent caching glyphs, the majority of which was spent in these GPU operations:
(sorry, it's a little hard to see where the time is going there, but all of those
wgpu
operations there under theupdate_font_cache::{{closure}}
are the buffer-to-texture copies)To help solve both of these problems (after switching to glyph_brush_draw_cache), I added a new option:
cpu_cache
. From the rust doc comment on the option:The result is this draft PR. Now rendering 100 glyphs for the first time takes 5.6ms, with my profiler not even registering the copy-to-texture operation:
If you think this approach has legs, then I'll do the work to get this working with the
multithread
option and add some tests. Thanks!