Refactor String#char_bytesize_at #5501
Conversation
Force-pushed from 24442fc to cceb19b.
The code looks clearer, but this method is used pretty often and therefore needs to be as performant as possible. Can you provide benchmarks to compare the implementations?
Right, I actually had this thought right after opening this PR. 🤦♀️ A quick bench to the tune of

```crystal
string = String.new(Bytes[104, 101, 108, 108, 111, 32, 255, 32, 255, 32, 119, 111, 114, 108, 100, 33])

Benchmark.ips do |bench|
  bench.report("old") { (0...16).map { |index| string.old(index) } }
  bench.report("new") { (0...16).map { |index| string.new(index) } }
end
```

shows a 6% slowdown.
I’ll work on this PR to make it faster, but in the meantime: how can I best create a plausible set of representative strings for benching?
Interestingly, on a random string of bytes the new implementation is actually significantly faster, which strongly suggests I’m missing something here:

```crystal
string = String.new(Random.new.random_bytes(256))

Benchmark.ips do |bench|
  bench.report("old") { string.size.times { |index| string.old(index) } }
  bench.report("new") { string.size.times { |index| string.new(index) } }
end
```

🤔
The individual benchmark measurements are so small (only nanoseconds!) that there won’t be any reliable results. You could wrap the individual examples in a loop. Did you look at other implementations? No need to reinvent every wheel.
Ah, yes: the first benchmark called `string.size` in the loop. When I removed it, a random 256-byte string takes exactly the same amount of time:

```crystal
string = String.new(Random.new.random_bytes(256))

Benchmark.ips do |bench|
  bench.report("old") { 256.times { |index| string.old(index) } }
  bench.report("new") { 256.times { |index| string.new(index) } }
end
```

Hmmm, I thought this was exactly what … I’ll look at your other pointers, thanks so much!
OK, I think the culprit here might be that the whole block is compiled out as a no-op. 🤦♂️ (This is a very instructive start to the new year, thanks for bearing with me…)
OK, just for posterity:

```crystal
string = String.new(Random.new.random_bytes(256))

Benchmark.ips do |bench|
  bench.report("old") { (0...256).map { |index| string.old(index) } }
  bench.report("new") { (0...256).map { |index| string.new(index) } }
end
```
Not bad. 😉 I’m still not sure why the current implementation uses byte values like `0x90`, `0xc2` and `0xf4` for comparisons.
Unfortunately, the results from …
@straight-shoota this is absolutely not true. Not every iteration of the benchmark is timed separately. First the benchmark warms up and calculates approximately how fast each iteration is. It then uses its approximation to work out approximately how many times the benchmark needs to run to reach 100ms, then runs the benchmark in 100ms blocks. That means only one Time measurement every 100ms. There is no need to use …
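The 100ms-block scheme described above can be sketched in Ruby (a hypothetical illustration, not `Benchmark::IPS`'s actual code; real libraries add warmup and calibration on top of this):

```ruby
require "benchmark"

# Sketch of batched benchmarking: instead of timing every iteration
# (where timer overhead dominates nanosecond-scale work), run a whole
# batch per time reading and divide. The batch size stands in for the
# count a real harness would calibrate to fill ~100ms.
def rough_ips(batch_size = 100_000, &work)
  elapsed = Benchmark.realtime { batch_size.times(&work) }
  batch_size / elapsed # iterations per second observed for this batch
end

rate = rough_ips { [1, 2, 3].sum }
```

The point is that only one clock read happens per batch, so the measurement error per iteration shrinks as the batch grows.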
Oh sorry, I mixed that up. Yes, this is only true for … I guess I’m doing too many things simultaneously... no real multithreading support.
Oh, and 1.8ns is way too short for 256 iterations of anything. You can't fit more than 10 clock cycles in 1.8ns even at 4GHz. It's obviously being optimized out.
@RX14 The 1.8 ns is just reading …
@straight-shoota how?
We need a JMH-style blackhole class to get accurate benchmark results.
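A minimal sketch of the blackhole idea, in Ruby standing in for Crystal (the class name and API here are illustrative, not from JMH or any real library): every benchmark result is folded into state that is observable after the run, so an optimizer cannot prove the measured work is dead and delete it.

```ruby
# A "blackhole" consumes values so the work producing them stays live.
class Blackhole
  attr_reader :sink

  def initialize
    @sink = 0
  end

  # Folding each value into @sink ties it to state read after the
  # benchmark, defeating dead-code elimination.
  def consume(value)
    @sink ^= value.hash
    nil
  end
end

hole = Blackhole.new
256.times { |i| hole.consume(i * 3) }
# Reading hole.sink at the end keeps the whole computation chain live.
```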
OK, I reimplemented it for speed and the results for a random 10000-byte string are quite optimistic:
Thanks so much for the pointer to …
I'd assume … I think you can safely skip the …
The values are because of how UTF-8 works. I can't explain because I'm on the cellphone, but you can find it on Wikipedia. I find the old code easier to read and understand, shorter, faster to compile and faster for LLVM to optimize (I know this because I know how the compiler works). Please don't change this method. As a side note, we need a general benchmark suite for the whole standard library which we can easily run against this kind of PR. The benchmarks should also show the number of allocations, so ips should be improved in that regard. I'll open an issue about this later.
Right, I assumed the compiler would compile it out anyway, and it made the code nicely symmetric.
I even linked Wikipedia from this PR. 😉 I gave it some thought and I assume the values follow from the UTF-8 well-formedness rules. @asterite Am I right in the above? If so, can I add the relevant comments to the original source (instead of this refactoring)? The values seemed a bit arbitrary on my first read of the method (even though I more-or-less know how UTF-8 works).
Oh, I read it too quickly to note the Wikipedia link. Sorry.
Now I can't remember why this is true (but it is). So yes, adding comments, and maybe even using …
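For the record, the magic constants do fall out of the UTF-8 definition: `0xc2` is the smallest valid two-byte lead (`0xc0`/`0xc1` would only produce overlong encodings of ASCII), `0xf0` followed by a second byte below `0x90` would be an overlong four-byte sequence, and `0xf4` with a second byte of `0x90` or above would encode past U+10FFFF. Ruby's own encoder (used here as a stand-in for Crystal) shows the boundary byte sequences:

```ruby
# Boundary codepoints and their UTF-8 bytes; Array#pack("U") performs
# standard UTF-8 encoding.
{
  0x0080   => "first 2-byte codepoint",
  0x0800   => "first 3-byte codepoint",
  0x10000  => "first 4-byte codepoint",
  0x10ffff => "last valid codepoint",
}.each do |cp, note|
  bytes = [cp].pack("U").bytes.map { |b| format("%02x", b) }.join(" ")
  puts format("U+%04X -> %-12s (%s)", cp, bytes, note)
end
# U+0080   -> c2 80       : hence 0xc2 as the minimum 2-byte lead
# U+10000  -> f0 90 80 80 : hence the 0xf0/0x90 pairing
# U+10FFFF -> f4 8f bf bf : hence 0xf4 as the maximum lead byte
```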
Excellent, thanks! I assumed what this method returns is ‘the size of the UTF-8 byte sequence starting at `byte_index`’. But for a byte sequence matching

```crystal
if first == 0xf0 && second < 0x90
  return 3
end
```

it returns 3. To the best of my reading such a byte sequence is not valid, as three-byte sequences should have a leading byte in the `0xe0`–`0xef` range. Similarly, the current implementation will return `3` for the `first == 0xf4 && second >= 0x90` case. Let me know if I misunderstood what the method should return on invalid sequences or whether there is room for improvement.
The current implementation will return …
The method alone doesn't matter; it matters in the context it's used. It returns the number of bytes occupied by a UTF-8 codepoint starting from the given index, regardless of whether it ends up being well-formed in the end.
So why does it check whether the second and third (but not fourth) bytes are well-formed? If this doesn’t care whether the UTF-8 codepoint is well-formed (nor whether there even are four bytes in a four-byte codepoint), it could check just the first byte and be done. 🤔
Sorry, I said it wrong. It counts the total number of UTF-8 codepoints. If there's an invalid sequence, it counts the maximum number of bytes that form it (until it becomes invalid). Sorry for not putting too much effort into this issue, but this is something that's working fine and doesn't need a refactor or improvement (at least none is apparent). I'm trying to focus on fixing bugs and getting new stuff done.
(that is, unless you find a bug in this method that you can reproduce with a snippet of code, then we can discuss how to fix it)
Hm, reading the method:

```crystal
if first < 0x80
  return 1
end
```

• a valid one-byte sequence returns `1`

```crystal
if first < 0xc2
  return 1
end
```

• this is either a non-UTF-8 byte or a continuation byte on the first position, so it’s an invalid sequence, returns `1`

```crystal
second = unsafe_byte_at(byte_index + 1)
if (second & 0xc0) != 0x80
  return 1
end
```

• any potentially UTF-8 (or not) sequence but with an invalid second byte returns `1`

```crystal
if first < 0xe0
  return 2
end
```

• a valid two-byte sequence returns `2`

```crystal
third = unsafe_byte_at(byte_index + 2)
if (third & 0xc0) != 0x80
  return 2
end
```

• a three-or-more-byte-suggesting leading byte + a valid second byte + an invalid third byte returns `2`

```crystal
if first < 0xf0
  return 3
end
```

• a valid three-byte sequence returns `3`

```crystal
if first == 0xf0 && second < 0x90
  return 3
end
```

• a sequence with exactly `0xf0` as the leading byte and a second byte below `0x90` returns `3` 🤔

```crystal
if first == 0xf4 && second >= 0x90
  return 3
end
```

• a sequence with exactly `0xf4` as the leading byte and a second byte of `0x90` or above returns `3` 🤔

```crystal
return 4
```

• any other sequence, including three-byte sequences not covered above, and sequences with an invalid leading byte, returns `4`

I’m puzzled by the cases denoted by 🤔, because they do feel like errors to me. In particular, I would expect any sequence starting with …
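The branch structure walked through above can be transcribed as follows (a hypothetical Ruby sketch for illustration; Crystal's actual method reads the string's byte buffer via `unsafe_byte_at` rather than taking an array):

```ruby
# Ruby transcription of the branches discussed above, operating on a
# plain array of bytes.
def char_bytesize_at(bytes, i)
  first = bytes[i]
  return 1 if first < 0x80              # ASCII byte
  return 1 if first < 0xc2              # continuation byte / overlong lead
  second = bytes[i + 1]
  return 1 if (second & 0xc0) != 0x80   # second byte isn't a continuation
  return 2 if first < 0xe0              # two-byte sequence
  third = bytes[i + 2]
  return 2 if (third & 0xc0) != 0x80    # third byte isn't a continuation
  return 3 if first < 0xf0              # three-byte sequence
  return 3 if first == 0xf0 && second < 0x90   # puzzling case one
  return 3 if first == 0xf4 && second >= 0x90  # puzzling case two
  4
end

char_bytesize_at([0xf0, 0x8f, 0x80, 0x80], 0) # => 3, despite 0xf0 promising 4 bytes
```

Running the puzzling inputs through this transcription makes the inconsistency concrete: `f0 8f 80 80` and `f4 90 80 80` both get counted as three bytes even though their lead bytes announce four-byte sequences.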
I totally understand this, and I feel really bad pulling you away from other stuff. I feel like I already wasted a lot of your time on this. 😢 I’m just trying to figure out how this method is supposed to work, because it looks like the checks are not very consistent in what they’re supposed to accept and what they’re supposed to reject. I’m perfectly fine if your answer is ‘I still think the above looks correct, I don’t have the time to explain why’ – it just looks to me like the method is not very consistent and thus potentially buggy.
Yes, it seems there are some cases not considered by the algorithm, probably also not covered by specs. We should also check …
Awesome explanation by the way, you made it super clear that there's something wrong. I think we need to check that the first byte is valid for 2, 3 and 4 byte sequences before checking the other bytes.
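That suggestion could look something like this (a hypothetical Ruby sketch, not the code that was eventually merged): classify the lead byte's valid range first, so bytes like `0xc0`, `0xc1` and `0xf5`–`0xff` are rejected before any continuation bytes are inspected.

```ruby
# Expected sequence length from the lead byte alone; nil marks bytes
# that can never start a well-formed UTF-8 sequence.
def expected_sequence_length(first)
  case first
  when 0x00..0x7f then 1
  when 0xc2..0xdf then 2
  when 0xe0..0xef then 3
  when 0xf0..0xf4 then 4
  else nil # 0x80..0xc1 (continuations / overlong leads) and 0xf5..0xff
  end
end
```

With the lead byte classified up front, the continuation-byte checks only need to confirm that the promised number of `10xxxxxx` bytes actually follows.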
The case for this change was explained in detail and it should be accepted.
But it needs a rebase against master. @chastell are you still up for it?
This is really readable, nice!
I rebased this PR on current master.

While the proposed implementation is certainly a lot more readable, and apparently more performant as well, it is not accurate. It manages to recognize the general format of UTF-8 sequences, but it fails to account for malformed data. For example, some codepoints can be represented by different, technically equivalent byte sequences. Only the most minimal encoding is considered well-formed, though; all others are to be rejected. Then there's a range of invalid codepoints that must be rejected as well.

Of course, this could all be fixed by adding more conditions to the algorithm. But this would taint the improvements in performance and readability, so I believe it's not worth progressing. Instead, I propose to look for a more efficient and proven UTF-8 decoding algorithm in #11873. This is a very important piece of code, so efficiency should be highly valued.

As a productive outcome of my evaluation of this PR, I have extracted the specs I used for verification into #11872.
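The overlong-encoding problem mentioned in the review can be demonstrated in a couple of lines of Ruby (used here as a stand-in for Crystal): `0xc0 0xaf` has the structural shape of a two-byte sequence for U+002F (`/`), but because the shortest encoding of `/` is the single byte `0x2f`, well-formed UTF-8 must reject it.

```ruby
# An overlong two-byte encoding of "/" (U+002F); only the shortest form
# is well-formed UTF-8, so validators must reject this sequence.
overlong_slash = [0xc0, 0xaf].pack("C*").force_encoding(Encoding::UTF_8)
shortest_slash = "/".dup.force_encoding(Encoding::UTF_8)

p overlong_slash.valid_encoding?  # => false
p shortest_slash.valid_encoding?  # => true
```

Historically, accepting overlong forms has been a security problem (e.g. smuggling `/` or NUL past byte-level filters), which is why the well-formedness rules forbid them outright.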
I tried to wrap my mind around how `String#char_bytesize_at` works and ended up refactoring it to this version to facilitate understanding. I used the UTF-8 bit patterns to make it clearer what’s happening in this method.

Note: I’m not sure why the current implementation uses the `0x90`, `0xc2` and `0xf4` values for comparisons, so my assumption that this method counts the bytesize of a UTF-8 sequence might be wrong.