
Reduce block_indices and block_offsets computation to once per forward pass #102

Conversation

DamianSzwichtenberg

Currently, block_indices and block_offsets are computed for each LLM layer. This is unnecessary, as they don't change during a forward pass.
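
A minimal sketch of the idea in PyTorch (hypothetical names; the actual vLLM/HPU code differs): derive block_indices and block_offsets from the slot mapping once per forward pass and hand them to every attention layer, instead of recomputing them inside each layer.

```python
import torch

def compute_block_mapping(slot_mapping: torch.Tensor, block_size: int):
    # Map each token's flat cache slot to (block index, offset within block).
    # Both depend only on slot_mapping, so they are identical for every layer.
    block_indices = torch.div(slot_mapping, block_size, rounding_mode="floor")
    block_offsets = slot_mapping % block_size
    return block_indices, block_offsets

class Model(torch.nn.Module):
    def __init__(self, layers, block_size: int = 128):  # assumed block size
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)
        self.block_size = block_size

    def forward(self, hidden_states, slot_mapping):
        # Before: each layer issued its own integer div/mod on slot_mapping.
        # After: compute once here and reuse in all layers.
        block_indices, block_offsets = compute_block_mapping(
            slot_mapping, self.block_size)
        for layer in self.layers:
            hidden_states = layer(hidden_states, block_indices, block_offsets)
        return hidden_states
```

With N layers this replaces N integer div/mod pairs per step with one.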

@DamianSzwichtenberg force-pushed the cache-indices-and-offsets branch from 88a96c3 to c3e775a on July 15, 2024 11:54
@szutenberg

AR: @madamczykhabana, please test it and merge.

@madamczykhabana

In testing I saw an 8.4% perf boost with accuracy at 100.14%.

@madamczykhabana madamczykhabana merged commit bdb430f into HabanaAI:habana_next Jul 17, 2024
michalkuligowski pushed a commit that referenced this pull request on Sep 26, 2024 (#289):

Re-implements the following PRs for the current habana_main:
- #102 (removing div_i32 operations from each layer)
- #115 (removing scatter for reshape&cache in the prompt case)

Accuracy (GSM8K on Llama3.1-8B-Instruct):
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
| gsm8k_cot_llama | 3 | flexible-extract | 8 | exact_match | ↑ | 0.8415 | ± | 0.0101 |
| | | strict-match | 8 | exact_match | ↑ | 0.8400 | ± | 0.0101 |

I've benchmarked this change on Llama3.1-8B-Instruct: on average, a +2.50% throughput gain (+558.14 tok/s, ~21594 tok/s -> ~22152 tok/s) is observed across all prefill buckets on G2, with up to a +4.40% throughput increase (+956.79 tok/s, ~25031 -> ~25988 tok/s) in compute-bound scenarios.
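
Purely illustrative of the #115 part (hypothetical names, not the actual habana_main kernels): when a prompt fills KV-cache blocks front-to-back, the per-token offsets within each block are simply 0..block_size-1, so the key/value writes can use a plain indexed copy instead of a scatter with explicit per-token offsets.

```python
import torch

def write_kv_prompt(key_cache: torch.Tensor, key: torch.Tensor,
                    block_indices: torch.Tensor) -> torch.Tensor:
    # key_cache: (num_total_blocks, block_size, head_dim)
    # key:       (num_tokens, head_dim), where
    #            num_tokens == len(block_indices) * block_size.
    # Assumes the prompt fills whole blocks; a real implementation must also
    # handle a partially filled last block.
    num_blocks = block_indices.numel()
    block_size = key_cache.shape[1]
    key_cache[block_indices] = key.view(num_blocks, block_size, -1)
    return key_cache

key_cache = torch.zeros(16, 4, 8)   # 16 blocks of 4 slots, head_dim 8
key = torch.randn(8, 8)             # 8 prompt tokens -> 2 full blocks
write_kv_prompt(key_cache, key, torch.tensor([3, 7]))
```

The contiguous layout is what makes the scatter redundant in the prompt case; decode still needs per-token offsets, since each new token lands at an arbitrary position within its block.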
zhouyu5 pushed a commit to zhouyu5/vllm-fork that referenced this pull request on Sep 27, 2024 (HabanaAI#289).