feat(tools): Add Zipfian cache testing tool #640
Conversation
Signed-off-by: Ubuntu <[email protected]>
parser.add_argument('-c', '--count', type=int, default=100000, help='total number of incrby operations')
parser.add_argument('-u', '--uri', type=str, default='redis://localhost:6379', help='Redis server URI')
parser.add_argument('-a', '--alpha', type=float, default=1.0, help='alpha value being used for the Zipf distribution')
parser.add_argument('-n', '--number', type=int, default=30, help='the number of values to be used in the distribution')
It's worth noting here that if we really want to see the number of misses go up, we'll have to fill our cache, which means we'll need a far higher count.
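For illustration only (these parameter values are my own, not from the thread), a run sized to overflow a small cache might look like:

python tools/cache_testing.py -u redis://localhost:6379 -a 1.0 -n 1000000 -c 100000000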
tools/cache_testing.py
Outdated
distMap = [x / zeta[-1] for x in zeta]

# Generate an array of uniform 0-1 pseudo-random values:
u = np.random.random(numSamples)
I understand that you took the implementation from here:
https://stackoverflow.com/questions/31027739/python-custom-zipf-number-generator-performing-poorly
It's fine, but I am worried about the memory usage of this program - we will need to generate traces of hundreds of millions or billions of records to cause evictions. To solve this, we should convert rand_zipf into a generator that yields a batch of np.random.random(numSamples) each time, so we produce one batch at a time instead of materializing the whole trace, without redoing the preprocessing step on every batch. If you know what I am talking about, please make the adjustments; otherwise leave as is.
I have fixed the numSamples bottleneck by rewriting the code to look something like this:

# Calculate Zeta values from 1 to n:
tmp = np.power(np.arange(1, n+1), -alpha)
zeta = np.r_[0.0, np.cumsum(tmp)]
# Store the translation map:
distMap = [x / zeta[-1] for x in zeta]
# Generate values from the distribution based on 0-1 pseudo-random values:
for i in range(numSamples):
    yield np.searchsorted(distMap, np.random.random())
However, it's worth noting that there is another significant bottleneck - distMap. It is also likely to be a huge data structure (if we really want to start seeing cache misses, that is). I think it might be a bit more challenging to remove that bottleneck, but it may be possible (I still have to spend some more time thinking about how, if at all, it can be done). Do you think I should spend time on it? Another workaround for forcing misses out of the cache is simply to limit the amount of memory it is allowed.
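One hedged mitigation, not from the thread: keeping distMap as a NumPy array instead of a Python list stores 8 bytes per entry rather than a pointer plus a boxed float, and np.searchsorted accepts it directly:

import numpy as np

n, alpha = 10**6, 1.0  # illustrative sizes, not from the PR
tmp = np.power(np.arange(1, n + 1), -alpha)
zeta = np.r_[0.0, np.cumsum(tmp)]
distMap = zeta / zeta[-1]  # float64 array: ~8 bytes/entry vs ~32+ for a list of floats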
I actually thought to leave the numSamples batch as is, but instead to have:

while True:
    u = np.random.random(numSamples)
    v = np.searchsorted(distMap, u)
    samples = [t - 1 for t in v]
    yield samples

and pass the pipeline length as numSamples.
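Combining the two snippets above, a minimal sketch of the batched generator (names follow the thread; this is illustrative, not the final PR code):

import numpy as np

def rand_zipf(n, alpha, numSamples):
    # One-time preprocessing: cumulative zeta values normalized into [0, 1].
    tmp = np.power(np.arange(1, n + 1), -alpha)
    zeta = np.r_[0.0, np.cumsum(tmp)]
    distMap = [x / zeta[-1] for x in zeta]
    # Endless stream of batches; pass the pipeline length as numSamples.
    while True:
        u = np.random.random(numSamples)
        v = np.searchsorted(distMap, u)
        yield [t - 1 for t in v]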
I would not worry about distMap right now. Yes, it's big, but I think it's smaller than the total number of writes we're going to send to the server.
Ah, I understand. Shall fix accordingly!
tools/cache_testing.py
Outdated
distribution_values = rand_zipf(args.number, args.alpha, args.count)
for idx, val in enumerate(distribution_values):
    r.incrby(str(val), 1)
Following my previous comment, we should use the pipelining capabilities of Redis/DF and send data in batches. Otherwise, sending a billion items will take lots of time. Can you please add a pipeline length parameter p?
Finally, I see you read statistics from info(), which is fine, but now that I think of it, using incr does not bring additional value because you do not read its response to know whether it was a hit or a miss. In that case, we should change it to set <key> <val> NX, which will allow us to increase the value length and put pressure on the server to evict items. val can be a constant string of a specified length (a -d argument).
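A hedged sketch of what the write loop could become (the args.p and args.d names follow the suggested parameters; everything else is an assumption, reusing the batch generator sketched earlier):

val = 'x' * args.d  # constant payload of the requested length
gen = rand_zipf(args.number, args.alpha, args.p)
sent = 0
while sent < args.count:
    values = next(gen)
    pipe = r.pipeline(transaction=False)  # plain pipelining, no MULTI/EXEC
    for v in values:
        pipe.set(str(v), val, nx=True)  # SET <key> <val> NX
    pipe.execute()
    sent += len(values)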
Can add pipelining!
I was also thinking about how incr doesn't tell us whether the item was a hit or a miss - there is no hit/miss information in the response, at least not from redis-py. I think the value length increases sound like a great idea, I'll change it to that!
Just a small correction - the value length should be constant. I did not mean to append values, but just to use a constant

val = 'x' * args.d

for the "set key val" command. The thought is that with larger values it will be easier to add memory pressure on the server, and we will need fewer keys and fewer samples.
Yep!
Do we also want to support multiprocessing/multiple workers? Shall I add a -w flag for that?
I think we can avoid multi-processing for now.
Thanks! Overall looks good! There are some comments that should allow us to make the tool more usable.
No need for multiprocessing, but if you want, you can look at how other Python scripts use asyncio; you can then use multiple connections, which will increase their effective throughput.
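A rough sketch of that asyncio route (assumes redis-py >= 4.2 for redis.asyncio; the worker helpers and counts are hypothetical, not from the PR):

import asyncio
import redis.asyncio as aredis

async def worker(uri, gen, val, batches_per_worker):
    r = aredis.from_url(uri)  # one connection per worker
    for _ in range(batches_per_worker):
        values = next(gen)  # shared synchronous Zipf batch generator
        pipe = r.pipeline(transaction=False)
        for v in values:
            pipe.set(str(v), val, nx=True)
        await pipe.execute()
    await r.aclose()  # use close() on redis-py < 5

async def main(uri, gen, val, batches_per_worker, workers):
    # Several connections in one event loop raise effective throughput.
    await asyncio.gather(*(worker(uri, gen, val, batches_per_worker)
                           for _ in range(workers)))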
Signed-off-by: Ubuntu <[email protected]>
Force-pushed from fcf852e to 3b50043
Hey @romange, I seem to have implemented pipelining and it works on my end (albeit, for some reason, on my machine the pipelined version seems to run slower than the non-pipelined one).
tools/cache_testing.py
Outdated
total_count = 0
for idx, values in enumerate(distribution_values_generator):
    total_count += len(values)
    p = r.pipeline()
add transaction=False
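For context (redis-py behavior, not stated in the thread): pipelines default to wrapping queued commands in MULTI/EXEC, and transaction=False sends a plain pipeline without that overhead:

p = r.pipeline(transaction=False)  # plain pipelining, no MULTI/EXEC wrapper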
Oh wow, that sped things up significantly. Thanks!
🙏
Signed-off-by: Ubuntu <[email protected]>
Adds a cache testing tool that produces values according to a Zipfian distribution, as per issue #253.