
Reduce block_indices and block_offsets computation to once per forward pass #102

Conversation

DamianSzwichtenberg

Currently, block_indices and block_offsets are computed for each LLM layer. This is unnecessary, as they don't change during a forward pass.
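
A minimal sketch of the idea in PyTorch (hypothetical names; the actual vLLM/HPU code differs): derive block_indices and block_offsets from the slot mapping once per forward pass and hand them to every attention layer, instead of recomputing them inside each layer.

```python
import torch

def compute_block_mapping(slot_mapping: torch.Tensor, block_size: int):
    # Map each token's flat cache slot to (block index, offset within block).
    # Both depend only on slot_mapping, so they are identical for every layer.
    block_indices = torch.div(slot_mapping, block_size, rounding_mode="floor")
    block_offsets = slot_mapping % block_size
    return block_indices, block_offsets

class Model(torch.nn.Module):
    def __init__(self, layers, block_size: int = 128):  # assumed block size
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)
        self.block_size = block_size

    def forward(self, hidden_states, slot_mapping):
        # Before: each layer issued its own integer div/mod on slot_mapping.
        # After: compute once here and reuse in all layers.
        block_indices, block_offsets = compute_block_mapping(
            slot_mapping, self.block_size)
        for layer in self.layers:
            hidden_states = layer(hidden_states, block_indices, block_offsets)
        return hidden_states
```

With N layers this replaces N integer div/mod pairs per step with one.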

@DamianSzwichtenberg force-pushed the cache-indices-and-offsets branch from 88a96c3 to c3e775a on July 15, 2024 11:54
@szutenberg

AR: @madamczykhabana, please test it and merge.

@madamczykhabana

In testing I saw an 8.4% perf boost with accuracy at 100.14%.

@madamczykhabana madamczykhabana merged commit bdb430f into HabanaAI:habana_next Jul 17, 2024
michalkuligowski pushed a commit that referenced this pull request on Sep 26, 2024 (#289):

Re-implements the following PRs for the current habana_main:
- #102 (removing div_i32 operations from each layer)
- #115 (removing scatter for reshape&cache in the prompt case)

Accuracy (GSM8K on Llama3.1-8B-Instruct):
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
| gsm8k_cot_llama | 3 | flexible-extract | 8 | exact_match | ↑ | 0.8415 | ± | 0.0101 |
| | | strict-match | 8 | exact_match | ↑ | 0.8400 | ± | 0.0101 |

I've benchmarked this change on Llama3.1-8B-Instruct: on average, a +2.50% throughput gain (+558.14 tok/s, ~21594 tok/s -> ~22152 tok/s) is observed across all prefill buckets on G2, with up to a +4.40% throughput increase (+956.79 tok/s, ~25031 -> ~25988 tok/s) in compute-bound scenarios.
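
Purely illustrative of the #115 part (hypothetical names, not the actual habana_main kernels): when a prompt fills KV-cache blocks front-to-back, the per-token offsets within each block are simply 0..block_size-1, so the key/value writes can use a plain indexed copy instead of a scatter with explicit per-token offsets.

```python
import torch

def write_kv_prompt(key_cache: torch.Tensor, key: torch.Tensor,
                    block_indices: torch.Tensor) -> torch.Tensor:
    # key_cache: (num_total_blocks, block_size, head_dim)
    # key:       (num_tokens, head_dim), where
    #            num_tokens == len(block_indices) * block_size.
    # Assumes the prompt fills whole blocks; a real implementation must also
    # handle a partially filled last block.
    num_blocks = block_indices.numel()
    block_size = key_cache.shape[1]
    key_cache[block_indices] = key.view(num_blocks, block_size, -1)
    return key_cache

key_cache = torch.zeros(16, 4, 8)   # 16 blocks of 4 slots, head_dim 8
key = torch.randn(8, 8)             # 8 prompt tokens -> 2 full blocks
write_kv_prompt(key_cache, key, torch.tensor([3, 7]))
```

The contiguous layout is what makes the scatter redundant in the prompt case; decode still needs per-token offsets, since each new token lands at an arbitrary position within its block.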
zhouyu5 pushed a commit to zhouyu5/vllm-fork that referenced this pull request on Sep 27, 2024 (HabanaAI#289).