
Passing in attn_score_scaling_factor into tvm_wrapper #126

Merged: 1 commit into flashinfer-ai:main on Feb 19, 2024

Conversation

rickzx (Contributor) commented on Feb 19, 2024

In GPT-2, the attention calculation supports an additional feature, scale_attn_by_inverse_layer_idx: it applies a per-layer scaling factor to the attention scores before the softmax is taken.

This PR adds support for this scaling factor as an additional parameter in tvm_wrapper.

See: apache/tvm#16606
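
For context, here is a minimal NumPy sketch (not flashinfer's tvm_wrapper API itself) of where such a scaling factor enters the attention computation; in GPT-2, scale_attn_by_inverse_layer_idx corresponds to a factor of 1 / (layer_idx + 1) applied on top of the usual 1 / sqrt(d) normalization:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, attn_score_scaling_factor=1.0):
    d = q.shape[-1]
    # The extra factor scales the attention scores *before* the softmax,
    # in addition to the standard 1/sqrt(d) normalization.
    scores = (q @ k.T) / np.sqrt(d) * attn_score_scaling_factor
    return softmax(scores) @ v

# For GPT-2's scale_attn_by_inverse_layer_idx, the per-layer factor is 1/(layer_idx + 1).
layer_idx = 3
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 64)) for _ in range(3))
out = attention(q, k, v, attn_score_scaling_factor=1.0 / (layer_idx + 1))
```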

rickzx (Contributor, Author) commented on Feb 19, 2024

cc: @MasterJH5574

MasterJH5574 (Collaborator) left a comment


Thank you @rickzx!

cc @yzh119 for another look

yzh119 (Collaborator) commented on Feb 19, 2024

Thanks for the patch.

yzh119 merged commit f1f6a0d into flashinfer-ai:main on Feb 19, 2024