Vectorized isascii using simple loop 25+bytes/cycle for large strings #48568
Conversation
Unfortunately you do need a short circuit. Otherwise something like
It really should be possible to make the loop have early termination without messing anything up though. |
That said, this seems to be pretty easy to fix.
I'm seeing it as roughly 25% slower in the worst case (a medium-size string, e.g. 10000 all-ASCII characters), but it exits quite quickly in the good cases. My guess is that you were trying to make the early-exit versions exit way too early. With AVX-512 each clock cycle can scan 64 chars at a time, so by the time you pipeline 4 of those, you need a pretty big chunk before the cost of the conditional stops being significant. |
I have updated the PR as reflected in the first post to use a 1024 chunk size, which seems to strike the right balance. |
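For a rough sense of the arithmetic behind that choice (figures taken from the discussion above; treat them as back-of-the-envelope estimates rather than measurements):

```julia
# Rough numbers behind the 1024-byte chunk choice.
vector_bytes     = 64                            # one AVX-512 register holds 64 bytes
unroll_factor    = 4                             # ~4 compares in flight per unrolled iteration
bytes_per_iter   = vector_bytes * unroll_factor  # 256 bytes consumed per loop iteration
chunk_size       = 1024
iters_per_branch = chunk_size ÷ bytes_per_iter   # 4 iterations between early-exit checks
```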
It looks like |
I wouldn't say that. The tests were ok for the algorithm used. But when you add a more complex algorithm with more branches like in this case, then you might have to extend the tests which now do not have full coverage. |
A potential fix for the OOB:

```julia
function isascii(s::AbstractString)
    chunk_size = 1024
    bytes = codeunits(s)
    l = ncodeunits(s)
    l < 2*chunk_size && return _isascii(bytes, 1, l)
    for n = 1:chunk_size:(l-chunk_size)
        _isascii(bytes, n, n + chunk_size - 1) || return false
    end
    # handle the last chunk explicitly
    return _isascii(bytes, l-chunk_size+1, l)
end
```

which results in
So the short circuiting is very beneficial on my machine, even for longer strings! I had to try to push it past the 64MiB of L3 this CPU has to begin to see diminishing returns, due to hitting main memory:
|
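As a quick aside on the OOB fix above: one hedged way to sanity-check the tail handling is to test lengths around chunk boundaries (the lengths below are just illustrative picks around multiples of 1024):

```julia
# Exercise lengths just below, at, and above a chunk boundary, with the only
# non-ASCII character placed in the final (possibly partial) chunk.
for n in (2047, 2048, 2049, 3000)
    @assert isascii("S"^n)                 # all-ASCII string of length n
    @assert !isascii("S"^(n-1) * "λ")      # non-ASCII byte lands in the last chunk
end
```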
In case one cares about making |
That's likely an artifact due to
|
This is because for short strings every single op/instruction counts toward the bytes/second figure. But more importantly, in this case there is an added layer of indirection. If you look at the size of @code_native for the loop, it ends up being a giant function, which makes it unlikely to get inlined. |
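One hedged way to eyeball that code size (the `_isascii` signature below is an assumption matching the chunk helper discussed in this thread, not something taken from the PR itself):

```julia
using InteractiveUtils  # for code_native (already loaded in the REPL)

# Dump the native code to a buffer and count lines as a crude size metric.
io = IOBuffer()
code_native(io, _isascii, Tuple{Base.CodeUnits{UInt8,String}, Int, Int})
println(count(==('\n'), String(take!(io))), " lines of assembly")
```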
Well, that certainly depends on your CPU - I have a small number of funny blocks like this:

```asm
; │┌ @ REPL[6]:3 within `_isascii`
	add	r15, r14
	kxnorq	k1, k0, k0
	mov	rax, -1024
	vpternlogd	zmm0, zmm0, zmm0, 255
	kxnorq	k2, k0, k0
	kxnorq	k3, k0, k0
	kxnorq	k4, k0, k0
	.p2align	4, 0x90
.LBB0_31:                               # %vector.body145
                                        # =>This Inner Loop Header: Depth=1
; ││ @ REPL[6]:4 within `_isascii`
; ││┌ @ bool.jl:38 within `&`
	vpcmpltb	k1 {k1}, zmm0, zmmword ptr [r15 + rax + 8]
	vpcmpltb	k2 {k2}, zmm0, zmmword ptr [r15 + rax + 72]
	vpcmpltb	k3 {k3}, zmm0, zmmword ptr [r15 + rax + 136]
	vpcmpltb	k4 {k4}, zmm0, zmmword ptr [r15 + rax + 200]
	add	rax, 256
	jne	.LBB0_31
# %bb.32:                               # %middle.block144
; ││└
; ││ @ REPL[6]:5 within `_isascii`
	kandq	k0, k2, k1
	kandq	k0, k3, k0
	kandq	k0, k4, k0
	kortestq	k0, k0
	setb	al
	jmp	.LBB0_17
.LBB0_2:
	mov	al, 1
	jmp	.LBB0_17
.LBB0_5:                                # %vector.main.loop.iter.check
	movabs	rcx, 9223372036854774784
; │└
```

which is only possible due to the inlining (& possibly unrolling) in the parent function (if inlined). Regardless, I wonder how representative this benchmark really is - it's comparing a string of only |
@ndinsmore Regarding your fix with
Those register masks aren't cheap! Though arguably, this much performance ought to be overkill 😂 |
I think that you likely have AVX-512 but the fact that this is an |
@Seelengrab As a further note to illustrate how fragile this performance can be |
Sure, here's the results:
and the code:

```julia
julia> using BenchmarkTools

julia> function benchmark_isascii(fun)
           for p = 1:14
               n = (2 * 2^(p-1)) - 1
               s = 'S'^n
               s2 = 'λ' * 'S'^(n-1)
               b = @benchmark $fun($s) & $fun($s2) seconds=1
               cpu_info = Sys.cpu_info()
               cpu_ghz = mean(i.speed for i in Sys.cpu_info()) / 1_000
               parse_time_ns = time(median(b))
               GB_per_second = 2*n / parse_time_ns
               bytes_per_cycle = GB_per_second / cpu_ghz
               print("$fun -> $n bytes $GB_per_second GB/second @ $cpu_ghz GHz -> $bytes_per_cycle bytes/cycle\n")
           end
       end
benchmark_isascii (generic function with 1 method)

julia> function _isascii_loop(bytes, first, last)
           r = true
           for n = first:last
               @inbounds r &= bytes[n] < UInt8(0x80)
           end
           return r
       end
_isascii_loop (generic function with 1 method)

julia> function isascii_loop_nested(s::AbstractString)
           chunk_size = 1024
           bytes = codeunits(s)
           l = ncodeunits(s)
           start = 1
           chunk_end = ifelse(l < chunk_size, l, start + chunk_size - 1)
           fastmin(a, b) = ifelse(a < b, a, b)
           r = true
           while start <= l
               @inline _isascii_loop(bytes, start, chunk_end) || return false
               start += chunk_size
               chunk_end = fastmin(l, chunk_end + chunk_size)
           end
           return true
       end
isascii_loop_nested (generic function with 1 method)

julia> function isascii_seelengrab(s::AbstractString)
           chunk_size = 1024
           bytes = codeunits(s)
           l = ncodeunits(s)
           l < 2*chunk_size && return _isascii(bytes, 1, l)
           for n = 1:chunk_size:(l-chunk_size)
               _isascii(bytes, n, n + chunk_size - 1) || return false
           end
           # handle the last chunk explicitly
           return _isascii(bytes, l-chunk_size+1, l)
       end
isascii_seelengrab (generic function with 1 method)

julia> @inline function _isascii(bytes, first, last)
           r = true
           for n = first:last
               @inbounds r &= bytes[n] < UInt8(0x80)
           end
           return r
       end
_isascii (generic function with 1 method)
```
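A hedged example of driving the harness above (output omitted; the numbers depend entirely on the machine). Note that `benchmark_isascii` calls `mean`, so `using Statistics` is needed in addition to BenchmarkTools:

```julia
julia> using Statistics  # provides mean, used inside benchmark_isascii

julia> benchmark_isascii(isascii)              # the existing Base implementation

julia> benchmark_isascii(isascii_seelengrab)   # chunked variant defined above

julia> benchmark_isascii(isascii_loop_nested)  # while-loop variant defined above
```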
The code size isn't that large actually:
Since everything still fits comfortably in my terminal buffer, there definitely aren't too many blocks here :) |
The following code is almost 40% faster than
The main point is the different
The code is also more compact:
|
@matthias314 that is a brilliant reduction. I have changed the PR to reflect that. |
I kept |
Co-authored-by: matthias314 <[email protected]>
Sorry for posting repeatedly, but after looking at the docstring of
|
Some comments:
|
Yes, I've noticed the
The fallback is back to |
We made the decision to only apply this to String and SubString.
|
Yes it is, even though a modified version of the loop using isascii(::AbstractChar) would work. |
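For reference, a character-based version for generic AbstractStrings (a sketch of the idea only, not necessarily the code chosen in the PR) could be as simple as:

```julia
# Hedged sketch: iterate characters and rely on isascii(::AbstractChar),
# leaving the fast chunked byte loop for String/SubString{String}.
isascii_generic(s::AbstractString) = all(isascii, s)
```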
Sorry, I meant |
That is a good point: should it be a non-breaking export (like it is now), or an internal |
According to his post, @StefanKarpinski seems to lean towards the first option. |
Co-authored-by: matthias314 <[email protected]>
@jakobnissen would you check that the added tests are to your liking? Beyond that, I think this is ready to go. |
Looks good to me! |
This still has an issue at least in terms of naming/docs: the |
What about renaming |
This changes `isascii` to a simple loop that checks the whole string. LLVM is doing a disturbingly good job vectorizing this function, which slightly hurts it with small strings because the overhead of loading the function is higher. The benchmarking below shows that the loop-based method is 50x faster than the current method.

The funny thing is that I had a fancy `isascii` built to use the `UInt64` trick, and was just doing some final benchmarking when I realized the simple loop gave the best results. This does make me a little worried that the result here is very sensitive to the optimizations that it gets. Another note is that any attempt at checking early whether the string has encountered a non-ASCII character just dramatically slows down the overall function.

UPDATE: as per @oscardssmith, `isascii` now looks at chunks. I did a little optimization and 1024 seemed to be the right chunk size. `isascii` is the version now in this PR and was refined by @matthias314. The benchmark should be an average of the two extremes: 1) all ASCII, 2) non-ASCII first character.
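For context, the "UInt64 trick" mentioned above usually refers to checking 8 bytes at a time by masking the high bit of every byte; a minimal sketch of that idea (not the author's exact implementation) looks like this:

```julia
# Sketch of the classic 8-bytes-at-a-time ASCII check ("the UInt64 trick").
# Shown only to illustrate the idea, not the code that was benchmarked in this PR.
function isascii_uint64(s::String)
    n = ncodeunits(s)
    i = 1
    ok = true
    GC.@preserve s begin
        p = pointer(s)
        # 8 bytes are all ASCII iff no byte has its high bit set, i.e. the
        # 64-bit word ANDed with 0x8080_8080_8080_8080 is zero.
        while ok && i + 7 <= n
            word = unsafe_load(Ptr{UInt64}(p + i - 1))
            ok = word & 0x8080_8080_8080_8080 == 0
            i += 8
        end
    end
    ok || return false
    # Handle any remaining tail bytes one at a time.
    while i <= n
        codeunit(s, i) < 0x80 || return false
        i += 1
    end
    return true
end
```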
benchmark code
results