I'm working on returning the softmax of the LLM's attention layers. As far as I know, only the transformers library supports this feature, but it is not efficient enough. I saw a "return_softmax" option in vllm_flash_attn, but it seems it cannot be used. Would you be able to add this feature?
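For reference, this is how the attention softmax can be obtained today with Hugging Face transformers (a minimal sketch; the model choice and tensor shapes are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    # output_attentions=True returns the softmax-normalized attention
    # weights of every layer alongside the logits.
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len).
print(len(outputs.attentions), outputs.attentions[0].shape)
```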
Motivation
The attention softmax is important evidence for humans to understand how an LLM works: it lets us determine which input tokens were important for inferring each output token. Further features, such as visualization of the model's evidence, can be built on top of it.
Pitch
For this RFC in particular, we propose the following changes:
1. Add a new function llm.forward() that accepts requests, runs prefill, and returns the attention softmax (see the sketch below).
2. Modify the prefill kernels to support outputting the softmax.
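A minimal sketch of what the proposed interface could look like. Note that llm.forward(), the return_softmax flag, and the attention_softmax field are hypothetical names introduced here for illustration, not existing vLLM API:

```python
# Hypothetical usage sketch for the proposed API. llm.forward(),
# return_softmax, and output.attention_softmax do not exist in vLLM
# today; the names are placeholders for this RFC.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Run prefill on the prompt and return the attention softmax in
# addition to the usual outputs.
output = llm.forward(
    prompts=["The quick brown fox jumps over the lazy dog."],
    return_softmax=True,  # proposed flag
)

# attention_softmax (proposed field): one tensor per layer with shape
# (num_heads, prompt_len, prompt_len), so a human can inspect which
# prompt tokens each position attended to.
for layer_idx, probs in enumerate(output.attention_softmax):
    print(f"layer {layer_idx}: {tuple(probs.shape)}")
```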
Alternatives
No response
Additional context
No response