
Avoid unnecessarily disabling CUDA graphs #119

Conversation

Nexesenex
Owner

As discussed in PR ggml-org#6766, CUDA graphs were being disabled in the presence of long prompts. This fixes the issue by preventing the consecutive-update counter from incrementing unnecessarily for tokens in which CUDA graphs are disabled due to batch size > 1.

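A minimal sketch of the guard described above, for context. All names here (`graph_state`, `MAX_CONSECUTIVE_UPDATES`, `maybe_use_cuda_graph`, the assumed threshold value) are illustrative stand-ins, not the actual ggml-cuda identifiers:

```cpp
// Illustrative sketch only: batch-size-> 1 tokens skip CUDA graphs
// without touching the consecutive-update counter.
struct graph_state {
    int  consecutive_updates = 0;
    bool graphs_disabled     = false;
};

constexpr int MAX_CONSECUTIVE_UPDATES = 4; // assumed threshold

bool maybe_use_cuda_graph(graph_state & s, int batch_size, bool update_required) {
    if (s.graphs_disabled) {
        return false;
    }
    if (batch_size > 1) {
        // Prompt processing: graphs are skipped for this token, but the
        // counter is deliberately left untouched, so a long prompt can no
        // longer trip the permanent-disable threshold below.
        return false;
    }
    if (update_required) {
        // Token generation with a changed graph: a genuine update.
        if (++s.consecutive_updates >= MAX_CONSECUTIVE_UPDATES) {
            s.graphs_disabled = true; // too many updates in a row
            return false;
        }
    } else {
        s.consecutive_updates = 0; // graph reused unchanged; reset the run
    }
    return true;
}
```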
Nexesenex merged commit 8d3619c into Nexesenex:sidestream on May 15, 2024
3 of 6 checks passed
Nexesenex pushed a commit that referenced this pull request Dec 22, 2024
* Adding iq4_0_r4 - q4_0 repacked

We get PP-512(LLaMA-3.1-8B) = 278 t/s on a Ryzen-7950X CPU,
so ~5-6% faster than iq4_nl_x4.

* q4_0_r4: NEON

Here we get 115.8 t/s, so also ~5% better than iq4_nl_x4.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
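For context on what "repacked" means here: q4_0 stores each row as a sequence of 32-weight blocks, and the `_r4` variants interleave the blocks of 4 consecutive rows so a matrix-multiply kernel can work on 4 rows per pass. A minimal sketch under those assumptions; the struct layout and function below are illustrative, not the actual code from this commit:

```cpp
#include <cstdint>
#include <cstring>

#define QK4_0 32 // weights per q4_0 block

// Standard q4_0 block: one fp16 scale plus 32 4-bit quants.
typedef struct {
    uint16_t d;
    uint8_t  qs[QK4_0 / 2];
} block_q4_0;

// Hypothetical repacked block: the same block index of 4 consecutive
// rows stored side by side, so one SIMD pass covers 4 rows.
typedef struct {
    uint16_t d[4];
    uint8_t  qs[4 * QK4_0 / 2];
} block_q4_0_r4;

// Repack 4 consecutive rows of n_blocks q4_0 blocks each (illustrative;
// real kernels typically also interleave nibbles within the block).
static void repack_q4_0_r4(const block_q4_0 * src, block_q4_0_r4 * dst, int n_blocks) {
    for (int ib = 0; ib < n_blocks; ++ib) {
        for (int row = 0; row < 4; ++row) {
            const block_q4_0 * b = src + row * n_blocks + ib;
            dst[ib].d[row] = b->d;
            memcpy(dst[ib].qs + row * (QK4_0 / 2), b->qs, QK4_0 / 2);
        }
    }
}
```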