-
-
Notifications
You must be signed in to change notification settings - Fork 374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance: Use faster hashing algo for cache key generation #3508
Conversation
@@ -452,7 +452,7 @@ private function shouldRemoveUseName(string $last, string $newNameLastName, bool | |||
private function createOldToNewTypes(Node $node, array $oldToNewClasses): array | |||
{ | |||
$oldToNewClasses = $this->resolveOldToNewClassCallbacks($node, $oldToNewClasses); | |||
$cacheKey = md5(serialize($oldToNewClasses)); | |||
$cacheKey = \hash('crc32c', \serialize($oldToNewClasses)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the length seems got shorter, ref https://3v4l.org/e6hLl , I think current md5 is just fine.
In other area, we use sha1()
and sha1_file()
for config key and value and I think that's just fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep the key length got shorter, from 128 bit to 32 bit, but honestly i don't think that 128 bit keys are needed here to prevent collision.
From what i've seen there are not so many different $oldToNewClasses
permutations, the chance of collisions in that case are pretty low, refer to https://preshing.com/20110504/hash-collision-probabilities/
And crc32c
hashing is a lot faster then md5, the reason i choose that algo was from the performance comparison done here: https://php.watch/articles/php-hash-benchmark
Why do i think we should change this usage of md5? Because it showed up in the traces that while md5
was called overall ~300.000, but the ~55.000 calls from this place accounted for 96% of the cpu time in this method. The reason is that probably for the other calls to md5 the data input was a lot shorter and thus it had basically no effect on the performance, but here we feed it the serialized data of an array that could be quite long
This was in the blackfire trace before this change:
After this change neither the md5
calls, nor the hash
calls appear in the trace because they were not significant enough in terms of performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I look into it, crc seems not cryptographycally, so I think it is better to use common just works, less performance is fine if we got more confident with data integrity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Md5 isn't cryptographic secure either (and sha1 too).
I don't think its relevant for the use-case here though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
md5 is cryptographically, while not secure for password, it is fine for cache use case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched back to a 128bit hash, as i think the point about the collisions is fair enough.
Now it uses the xxh128
algorithm if it is available (as it was only introduced in php 8.1), but i think it matches the needs perfectly:
xxHash is an extremely fast hashing algorithm that is not designed for cryptographic purposes, but provides excellent randomness and dispersion of output, and uniqueness of to minimize collisions.
see: https://php.watch/versions/8.1/xxHash
And i totally agree with @staabm as in this case cryptographically secure hashing is not needed, what we need is fast hashing with a low chance of collision
So for systems running on php 8.1 they see the same performance gains as with the original change, system running on older php versions still see a slight improvement by switching from md5
to the slightly (nearly 2x) faster md4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @keulinho , let's give it a try
* Use faster hashing algo for cache key generation * Use 128 bit hashing algo again
The serialized array can become quite big and md5 slows down the bigger the input data.
On our test run this saved ~2s: https://blackfire.io/profiles/compare/0abc026f-4855-4bb0-ac84-c4a2eef825e0/graph

I guess now we are at the limit of the "quick fixes" in terms of performance improvements without actually touching the architecture of the whole system