
[Text Generation] Turn off the (currently) inefficient external KV cache logic when internal KV cache management enabled #1175

Merged: 99 commits merged into main from feature/damian/optimize_update_kv_cache on Aug 28, 2023

Conversation

@dbogunowicz (Contributor) commented on Aug 9, 2023

Running the external KV cache loop in parallel with the internal KV cache logic (when the latter is enabled) does not make sense (a sketch of the resulting gating logic follows the list below):

  1. It slows down our most efficient pipeline mode.
  2. In principle, the external cache could be kept around for testing, but as of now the cache values returned by the internal KV cache are nonsensical, so there is no ground truth to compare the external cache against.
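
For illustration only, here is a minimal sketch of the kind of gating this PR introduces. The names used here (internal_kv_cache_enabled, update_external_kv_cache, kv_cache_state) are hypothetical stand-ins, not the actual deepsparse internals:

# Hypothetical sketch: skip the external KV cache bookkeeping when the
# engine manages the cache internally. All names below are illustrative.
def process_engine_outputs(self, logits, kv_cache_state):
    if not self.internal_kv_cache_enabled:
        # Only maintain the numpy-side (external) cache when the engine
        # is NOT managing the KV cache internally; otherwise the extra
        # copy is pure overhead.
        self.update_external_kv_cache(kv_cache_state)
    return logits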

This PR has been tested using our internal (modest) LLM testing harness.

Overall, this PR accelerates inference by roughly 25%:

import time

from deepsparse import Pipeline

prompts = ["..."]  # placeholder; the actual prompts are not shown in this PR

opt = Pipeline.create(
    task="opt",
    model_path="/home/ubuntu/damian/sparseml/deployment_opt",
    max_generated_tokens=512,
    prompt_processing_sequence_length=3,
    use_deepsparse_cache=False,  # or True
)

num_runs = 100
counter = 0.0
for _ in range(num_runs):
    start = time.time()
    output = opt(sequences=prompts[0])
    counter += time.time() - start

print(prompts[0] + output.sequences[0])
# divide by the number of runs, not the loop variable (which ends at 99)
print(f"Time taken: {counter / num_runs} seconds")
Time taken: 21.29 seconds  # use_deepsparse_cache = False

Time taken: 15.99 seconds  # use_deepsparse_cache = True
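
As a quick check of the ~25% figure: (21.29 - 15.99) / 21.29 ≈ 0.249, i.e. about a 24.9% reduction in wall-clock time per inference.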

Edit: confirmed by a more exhaustive investigation

[Figure: benchmark results from the more exhaustive investigation]

SageMoore and others added 30 commits July 20, 2023 16:48
rahul-tuli previously approved these changes Aug 23, 2023

@rahul-tuli (Member) left a comment:

LGTM!

bfineran previously approved these changes Aug 24, 2023
Base automatically changed from feature/damian/optimize_decoder to main August 24, 2023 13:17
@dbogunowicz dbogunowicz dismissed stale reviews from bfineran and rahul-tuli August 24, 2023 13:17

The base branch was changed.

rahul-tuli previously approved these changes Aug 24, 2023

@rahul-tuli (Member) left a comment:

(n+1)st time is the charm 🚀

bfineran previously approved these changes Aug 24, 2023
@dbogunowicz dbogunowicz dismissed stale reviews from bfineran and rahul-tuli via 5bf6cf4 August 25, 2023 07:55
@dbogunowicz dbogunowicz force-pushed the feature/damian/optimize_update_kv_cache branch 2 times, most recently from 227ebe0 to 519bf1b on August 25, 2023 07:58
@dbogunowicz dbogunowicz merged commit a6d46be into main Aug 28, 2023
@dbogunowicz dbogunowicz deleted the feature/damian/optimize_update_kv_cache branch August 28, 2023 14:56