Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue #759 : The output of the hash generators is misleading #880

Open
wants to merge 3 commits into
base: 2.0
Choose a base branch
from

Conversation

kcassam
Copy link

@kcassam kcassam commented Apr 2, 2024

md5(), sha1() and sha256() generators is misleading

What is the reason for this PR?

Author's checklist

Summary of changes

In order to leverage the full output space of the respective hashes, I use the the function random_bytes($length)
In order to keep those functions seedable, I add a $seed parameter that can be used to pass a parameter so that the function will always return the same hash given the same seed.

  • md5 produces a 128 bits hash value, so random_bytes is called with a $length of 16 (128/8)
  • sha1 produces a 160 bits hash value, so random_bytes is called with a $length of 20 (160/8)
  • sha256 produces a 256 bits hash value, so random_bytes is called with a $length of 32 (256/8)

When this one is merged, I can take car of #761

Review checklist

  • All checks have passed
  • Changes are added to the CHANGELOG.md
  • Changes are approved by maintainer

md5(), sha1() and sha256() generators is misleading
@kcassam kcassam changed the title Fix https://github.com/FakerPHP/Faker/issues/759 : The output of the Fix issue #759 : The output of the Apr 3, 2024
@kcassam kcassam changed the title Fix issue #759 : The output of the Fix issue #759 : The output of the hash generators is misleading Apr 3, 2024
@TimWolla
Copy link

TimWolla commented Apr 3, 2024

This change is incorrect, because it prevents seeding when using these functions, whereas the previous implementation was seedable.

@kcassam kcassam marked this pull request as draft April 3, 2024 20:47
…ge the full output space of the respective hashes
@kcassam kcassam marked this pull request as ready for review April 3, 2024 22:04
@kcassam
Copy link
Author

kcassam commented Apr 3, 2024

This change is incorrect, because it prevents seeding when using these functions, whereas the previous implementation was seedable.

@TimWolla I have corrected, seeding is OK now and I added some new test to garantee seedibility.
However, the expected result will change after updating the library, so it can break some tests

I don't know if this is considered as BC and if this is OK.
Otherwise, we can keep the tests with a change on expected values.

Copy link

@TimWolla TimWolla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than that bias remark I suspect that using ext/bcmath is a non-starter for this library.


I'm also not sure whether it's worth fixing the issue with the current architecture of FakerPHP. The fix is likely going to be pretty ugly when relying on the pre-PHP 8.2 standard library. See also: #15 (comment)

Comment on lines 256 to 258
return array_reduce(range(1, $n), static function ($carry, $item) {
return bcmul($carry, self::numberBetween());
}, '1');
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is heavily biased, which is easy to see by performing an exhaustive search for a smaller number range:

<?php
for ($a = 0; $a < 16; $a++) 
for ($b = 0; $b < 16; $b++) 
for ($c = 0; $c < 16; $c++) 
@$results[$a * $b * $c]++;

var_dump($results);

If any of the multiplied numbers is zero, the result will also be zero, which makes zero the most likely result. But the others are not uniformly distributed either:

https://3v4l.org/VJnZK

@kcassam kcassam marked this pull request as draft April 4, 2024 09:08
@kcassam
Copy link
Author

kcassam commented Apr 4, 2024

Thanks for the reviews @TimWolla. I have another idea, I will submit it later on.

@kcassam kcassam marked this pull request as ready for review April 4, 2024 13:34
@kcassam
Copy link
Author

kcassam commented Apr 4, 2024

@TimWolla the PR is ready for review.
✅ seedable (at the cost of a small BC)
✅ seedibility tested
✅ no bias
✅ ext/bcmath not used
✅ no more multiplyNumberBetweenNTimes clumsy/ugly function
✅ issue #759 solved

@kcassam
Copy link
Author

kcassam commented Apr 5, 2024

@TimWolla Alternatively, to avoid any BC, we could keep the legacy methods without modifying them and create new ones with a slightly different name and with the new behavior, for example :

    public static function md5()
    {
        return md5(self::numberBetween());
    }

    public static function maxEntropyMd5($seed = null)
    {
        return md5($seed ?? random_bytes(128 / 8));
    }

We could use bin2hex instead of hash functions for performance reason. I let you decide if it is a better treadoff. bin2hex is about 5 times faster than sha256. https://onlinephp.io/c/c006b. We would have to complete/crop the seed to the desired length, but that is easy to do with substr and str_pad

Copy link

stale bot commented Apr 22, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 1 week if no further activity occurs. Thank you for your contributions.

@kcassam
Copy link
Author

kcassam commented Apr 24, 2024

@TimWolla hi, is it possible to review this PR ?

@stale stale bot removed the lifecycle/stale label Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The output of the md5(), sha1() and sha256() generators is misleading
2 participants