Add real page pool tests for trie_attention_cache #902
Previously, we tested with mocked page pools so that the tests would run faster. This PR splits trie_attention_cache_tests.py into two files:
trie_attention_cache/mock_pool_tests.py contains the old tests, which continue to use a mocked page pool to verify that the trie correctly accounts for pages and evictions.
trie_attention_cache/real_pool_tests.py will contain new tests for page-copying prefix sharing, which avoids recomputing the entire last page's worth of KV when branching on a token. Because the page is actually copied, these tests cannot mock the page pool and must allocate the real buffer, which makes them slower. I kept them separate from the old tests so that each of the roughly 30 existing tests does not pay about 5 seconds of buffer setup.
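To illustrate the idea behind page-copying prefix sharing, here is a minimal sketch in plain Python. It is not the code in this PR: `ToyPagePool`, `branch_last_page`, and `PAGE_SIZE` are hypothetical names, and lists of strings stand in for the real KV buffer. It only shows why a test of this behavior needs real page contents rather than a mock.

```python
PAGE_SIZE = 4  # hypothetical tokens-per-page value, chosen only for illustration


class ToyPagePool:
    """Hypothetical stand-in for a real page pool; each page is a list of KV slots."""

    def __init__(self, num_pages):
        self.pages = [[None] * PAGE_SIZE for _ in range(num_pages)]
        self.free = list(range(num_pages))

    def acquire(self):
        # Hand out the lowest-numbered free page.
        return self.free.pop(0)

    def copy_page(self, src):
        # Allocate a new page and copy the source page's contents into it.
        dst = self.acquire()
        self.pages[dst] = list(self.pages[src])
        return dst


def branch_last_page(pool, src_page, shared_len, new_token_kv):
    """Copy a partially shared page, keep its first shared_len slots, then write the new token."""
    dst = pool.copy_page(src_page)
    page = pool.pages[dst]
    # Clear everything past the shared prefix so the branch starts clean.
    for i in range(shared_len, PAGE_SIZE):
        page[i] = None
    page[shared_len] = new_token_kv
    return dst


pool = ToyPagePool(num_pages=2)
first = pool.acquire()
pool.pages[first] = ["kv_a", "kv_b", "kv_c", None]

# Branch after the first two tokens: their KV is reused instead of recomputed.
branched = branch_last_page(pool, first, shared_len=2, new_token_kv="kv_x")
```

A mocked pool can confirm that a page was acquired, but only a pool with real storage can confirm that the copied page holds the shared prefix and that the source page is left untouched.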
This PR also replaces some of the nuisance print statements with logging.debug.
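As a rough sketch of what that replacement buys, the snippet below uses only the standard `logging` module. The logger name and the `evict_page` function are hypothetical and do not come from this PR; the point is that debug messages stay silent unless the level is lowered.

```python
import io
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("trie_cache_sketch")


def evict_page(page_index):
    # Before: print(f"evicting page {page_index}")
    logger.debug("evicting page %d", page_index)
    return page_index


# Capture log output in memory so the effect of the level is visible.
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))

logger.setLevel(logging.INFO)
evict_page(3)
quiet_output = stream.getvalue()  # nothing is written at INFO level

logger.setLevel(logging.DEBUG)
evict_page(3)
debug_output = stream.getvalue()  # the debug message now appears
```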
This is a step toward implementing beam search (required by MLPerf). Edit: MLPerf only requires beam search for GPT-J. Thanks @stbaione