Vectorized isascii using simple loop 25+bytes/cycle for large strings #48568

Merged
merged 20 commits into from
Mar 3, 2023

Conversation

ndinsmore
Contributor

@ndinsmore ndinsmore commented Feb 7, 2023

This changes isascii to a simple loop that checks the whole string. LLVM does a disturbingly good job of vectorizing this function, which slightly hurts it on small strings because the overhead of loading the larger function is higher. The benchmarks below show that the loop-based method is about 50x faster than the current method for large strings.

The funny thing is that I had a fancy isascii built to use the UInt64 trick, and was just doing some final benchmarking when I realized the simple loop gave the best results. This does make me a little worried that the result here is very sensitive to the optimizations that it gets.
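
For readers unfamiliar with the "UInt64 trick": the usual idea is to test eight bytes at once by masking the high bit of every byte in a 64-bit word. The sketch below is only an illustration of that idea, not the code that was benchmarked for this PR; the name isascii_words and the choice to take a Vector{UInt8} are assumptions made for the example.

```julia
# Hypothetical word-at-a-time check; not the implementation from this PR.
function isascii_words(v::Vector{UInt8})
    # Bit 7 of each of the eight bytes in a UInt64.
    high_bits = 0x8080808080808080
    nwords = length(v) ÷ 8
    # Reinterpret the first 8*nwords bytes as UInt64 words without copying.
    words = reinterpret(UInt64, view(v, 1:8nwords))
    for w in words
        (w & high_bits) == 0 || return false
    end
    # Check the remaining 0-7 tail bytes one at a time.
    for i in (8nwords + 1):length(v)
        v[i] < 0x80 || return false
    end
    return true
end

isascii_words(collect(codeunits("plain ASCII text, longer than one word")))  # true
isascii_words(collect(codeunits("a"^20 * "é")))                              # false
```

The point made above is that the plain byte loop, once auto-vectorized, can beat this kind of hand-written word loop.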

Another note is that any attempt at checking early whether the string has hit a non-ASCII character dramatically slows down the overall function.

UPDATE: As per @oscardssmith, isascii now looks at chunks. I did a little optimization, and 1024 seemed to be the right chunk size.

isascii is the version now in this PR and was refined by @matthias314

The benchmark is meant to be an average of the two extremes: (1) all ASCII, and (2) a non-ASCII first character.

benchmark code

using BenchmarkTools
function benchmark_isascii(fun)
    for p=1:14
        n = (2 * 2^(p-1))-1
        s='S'^n
        s2 = 'λ' * 'S'^(n-1)
        b = @benchmark $fun($s)&$fun($s2) seconds=1
        cpu_info = Sys.cpu_info()
        cpu_ghz= mean(i.speed for i in Sys.cpu_info()) /1_000
        parse_time_ns = time(median(b))
        GB_per_second= 2*n / parse_time_ns
        bytes_per_cycle = GB_per_second / cpu_ghz
        print("$fun -> $n bytes $GB_per_second GB/second @ $cpu_ghz GHz -> $bytes_per_cycle bytes/cycle\n")
    end
end
##
function isascii_nochunks(s::AbstractString)
    bytes = codeunits(s)
    l = ncodeunits(s)
    r = UInt8(0)
    for n = 1:l
        @inbounds r |= bytes[n]
    end
    return r < 0x80
end

function _isascii(bytes, first, last)
    r = UInt8(0)
    for n = first:last
        @inbounds r |= bytes[n]
    end
    return r < 0x80
end

function isascii(s::AbstractString)
    chunk_size = 1024
    bytes = codeunits(s)
    l = ncodeunits(s)
    start = 1
    fastmin(a,b) = ifelse(a < b, a, b)
    while start <= l
        @inline _isascii(bytes, start, fastmin(l, start + chunk_size)) || return false
        start += chunk_size
    end
    return true
end

isascii_all(c::Char) = bswap(reinterpret(UInt32, c)) < 0x80
isascii_all(s::AbstractString) = all(isascii_all, s)
isascii_all(c::AbstractChar) = UInt32(c) < 0x80

##
 benchmark_isascii(isascii_all)
##
 benchmark_isascii(isascii_nochunks)
##
 benchmark_isascii(isascii)
##

results

isascii_all -> 1 bytes 0.06623796354912871 GB/second @ 2.7 GHz -> 0.024532579092269892 bytes/cycle
isascii_all -> 3 bytes 0.21423568080176733 GB/second @ 2.7 GHz -> 0.079346548445099 bytes/cycle
isascii_all -> 7 bytes 0.3949880668257757 GB/second @ 2.7 GHz -> 0.14629187660213913 bytes/cycle
isascii_all -> 15 bytes 0.7953936797000536 GB/second @ 2.7 GHz -> 0.2945902517407606 bytes/cycle
isascii_all -> 31 bytes 1.3267578400808178 GB/second @ 2.7 GHz -> 0.4913917926225251 bytes/cycle
isascii_all -> 63 bytes 1.792814953381155 GB/second @ 2.7 GHz -> 0.6640055382893166 bytes/cycle
isascii_all -> 127 bytes 2.0539571873077573 GB/second @ 2.7 GHz -> 0.7607248841880582 bytes/cycle
isascii_all -> 255 bytes 2.442238909701151 GB/second @ 2.7 GHz -> 0.9045329295189447 bytes/cycle
isascii_all -> 511 bytes 2.66197056596995 GB/second @ 2.7 GHz -> 0.9859150244333148 bytes/cycle
isascii_all -> 1023 bytes 2.7845331630383012 GB/second @ 2.7 GHz -> 1.0313085789030745 bytes/cycle
isascii_all -> 2047 bytes 2.8371448371448373 GB/second @ 2.7 GHz -> 1.0507943841277174 bytes/cycle
isascii_all -> 4095 bytes 2.8627466210967842 GB/second @ 2.7 GHz -> 1.0602765263321423 bytes/cycle
isascii_all -> 8191 bytes 2.893238748417861 GB/second @ 2.7 GHz -> 1.07156990682143 bytes/cycle
isascii_all -> 16383 bytes 2.8853469531525184 GB/second @ 2.7 GHz -> 1.068647019686118 bytes/cycle

isascii_nochunks -> 1 bytes 0.16941096588015617 GB/second @ 2.7 GHz -> 0.06274480217783561 bytes/cycle
isascii_nochunks -> 3 bytes 0.44490675384501077 GB/second @ 2.7 GHz -> 0.16478027920185584 bytes/cycle
isascii_nochunks -> 7 bytes 0.8653977307954616 GB/second @ 2.7 GHz -> 0.3205176780723932 bytes/cycle
isascii_nochunks -> 15 bytes 1.569987389659521 GB/second @ 2.7 GHz -> 0.5814768109850077 bytes/cycle
isascii_nochunks -> 31 bytes 2.8405009669398655 GB/second @ 2.7 GHz -> 1.0520373951629132 bytes/cycle
isascii_nochunks -> 63 bytes 5.303299492385786 GB/second @ 2.7 GHz -> 1.9641849971799208 bytes/cycle
isascii_nochunks -> 127 bytes 10.465010351966875 GB/second @ 2.7 GHz -> 3.875929759987731 bytes/cycle
isascii_nochunks -> 255 bytes 18.675950486295314 GB/second @ 2.7 GHz -> 6.917018698627894 bytes/cycle
isascii_nochunks -> 511 bytes 34.2668152350081 GB/second @ 2.7 GHz -> 12.691413050003 bytes/cycle
isascii_nochunks -> 1023 bytes 58.472315145922245 GB/second @ 2.7 GHz -> 21.656413017008237 bytes/cycle
isascii_nochunks -> 2047 bytes 76.48904580152671 GB/second @ 2.7 GHz -> 28.32927622278767 bytes/cycle
isascii_nochunks -> 4095 bytes 97.76940343334071 GB/second @ 2.7 GHz -> 36.21089016049656 bytes/cycle
isascii_nochunks -> 8191 bytes 99.35550547582335 GB/second @ 2.7 GHz -> 36.79833536141605 bytes/cycle
isascii_nochunks -> 16383 bytes 98.05788949825984 GB/second @ 2.7 GHz -> 36.31773685120734 bytes/cycle

isascii -> 1 bytes 0.12132643748098569 GB/second @ 2.7 GHz -> 0.044935717585550254 bytes/cycle
isascii -> 3 bytes 0.31225833420420107 GB/second @ 2.7 GHz -> 0.11565123489044483 bytes/cycle
isascii -> 7 bytes 0.6210432456531431 GB/second @ 2.7 GHz -> 0.23001601690857149 bytes/cycle
isascii -> 15 bytes 1.2457223937901678 GB/second @ 2.7 GHz -> 0.4613786643667288 bytes/cycle
isascii -> 31 bytes 2.430222011908987 GB/second @ 2.7 GHz -> 0.900082226632958 bytes/cycle
isascii -> 63 bytes 4.547217078749592 GB/second @ 2.7 GHz -> 1.6841544736109597 bytes/cycle
isascii -> 127 bytes 8.880743635787473 GB/second @ 2.7 GHz -> 3.289164309550916 bytes/cycle
isascii -> 255 bytes 15.564374711582834 GB/second @ 2.7 GHz -> 5.76458322651216 bytes/cycle
isascii -> 511 bytes 30.51297732921594 GB/second @ 2.7 GHz -> 11.301102714524422 bytes/cycle
isascii -> 1023 bytes 42.45936692642148 GB/second @ 2.7 GHz -> 15.725691454230176 bytes/cycle
isascii -> 2047 bytes 79.80871569659996 GB/second @ 2.7 GHz -> 29.558783591333317 bytes/cycle
isascii -> 4095 bytes 113.6582026746599 GB/second @ 2.7 GHz -> 42.09563062024441 bytes/cycle
isascii -> 8191 bytes 136.41434343065026 GB/second @ 2.7 GHz -> 50.52383090024083 bytes/cycle
isascii -> 16383 bytes 156.27125780921037 GB/second @ 2.7 GHz -> 57.878243633040874 bytes/cycle
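
As a sanity check on how the numbers above are derived, here is the same arithmetic as benchmark_isascii written out on its own. The helper names are hypothetical; the input values are copied from the 16383-byte rows above.

```julia
# Hypothetical helpers mirroring the arithmetic in benchmark_isascii.
# Each sample scans two strings, which the benchmark counts as 2n bytes.
throughput_gb_per_s(n, time_ns) = 2n / time_ns    # bytes per ns equals GB/s
bytes_per_cycle(gb_per_s, ghz) = gb_per_s / ghz

# Values taken from the 16383-byte rows above (2.7 GHz machine).
chunked  = bytes_per_cycle(156.27125780921037, 2.7)   # isascii (chunked)
baseline = bytes_per_cycle(2.8853469531525184, 2.7)   # isascii_all
speedup  = chunked / baseline                         # a bit over 50x
```

So the "50x" figure refers to the largest string sizes; for the shortest strings the rows above show a much smaller gap.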

@oscardssmith
Member

Unfortunately you do need a short circuit. Otherwise something like s='α'*'a'^10000000000 gives

julia> @btime isascii(s)
  12.182 ns (0 allocations: 0 bytes)
false
julia> @btime isascii_loop(s)
  367.009 ms (0 allocations: 0 bytes)
false

It really should be possible to make the loop have early termination without messing anything up though.

@oscardssmith
Member

That said, this seems to be pretty easy to fix.

function isascii_inner(bytes, lo, hi)
	ret = true
	for n = lo:hi
		@inbounds ret &= bytes[n] < UInt8(0x80)
	end
	return ret
end

function my_isascii_loop(s::AbstractString)
	bytes = codeunits(s)
	len = ncodeunits(s)
	for n = 1:2048:len
		isascii_inner(bytes, n, min(len,n+2047)) || return false
	end
	return true
end

I'm seeing it as roughly 25% slower in the worst case (a medium-size string, e.g. 10000 bytes of all ASCII), but it exits quite quickly in the good cases. My guess is that you were trying to make the early-exit versions exit way too early. With AVX-512, each clock cycle can scan 64 chars at a time, so by the time you pipeline 4 of those, you need a pretty big vector before the cost of the conditional stops being giant.
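
To make the early-exit behaviour concrete, here is an instrumented sketch of the same chunked scan that also reports how many bytes were examined. The function name and the returned tuple are assumptions for illustration; the 2048-byte chunk matches the code above.

```julia
# Hypothetical instrumented variant: returns (result, bytes_examined).
function isascii_chunked_count(s::AbstractString; chunk_size::Int = 2048)
    bytes = codeunits(s)
    len = ncodeunits(s)
    examined = 0
    for lo in 1:chunk_size:len
        hi = min(len, lo + chunk_size - 1)
        ok = true
        for n in lo:hi              # branch-free inner loop, as in isascii_inner
            @inbounds ok &= bytes[n] < 0x80
        end
        examined += hi - lo + 1
        ok || return (false, examined)   # the only branch: once per chunk
    end
    return (true, examined)
end

isascii_chunked_count('α' * 'a'^100_000)   # (false, 2048)
isascii_chunked_count('a'^100_000)         # (true, 100000)
```

With a non-ASCII first character, the scan stops after one chunk regardless of the string length, while the all-ASCII case still runs the vectorizable inner loop over every byte.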

@inkydragon inkydragon added the strings "Strings!" label Feb 7, 2023
@ndinsmore
Contributor Author

ndinsmore commented Feb 7, 2023

I have updated the PR as reflected in the first post to use a 1024 chunk size, which seems to strike the right balance.

@jakobnissen jakobnissen added the needs tests Unit tests are required for this change label Feb 8, 2023
@jakobnissen
Contributor

It looks like isascii is not well tested in Julia. The fact that tests did not catch the OOB access as mentioned by @matthias314 is worrisome. Tests should be added for this PR for various cases, IMO including strings of length 0, 1, chunk_length, chunk_length+1 and some multiple of chunk_length.
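
A minimal sketch of what such tests could look like, using the Test standard library. The chunk value of 1024 is taken from the discussion above; the exact set of lengths and positions is an assumption, not the test suite that was eventually added.

```julia
using Test

# Assumption: 1024 is the chunk size discussed in this thread.
chunk = 1024

@testset "isascii near chunk boundaries" begin
    for n in (0, 1, chunk - 1, chunk, chunk + 1, 2chunk, 2chunk + 1, 5chunk)
        @test isascii('a'^n)
        # Put one non-ASCII character at the start, at the end,
        # and on either side of the first chunk boundary.
        for pos in (1, chunk, chunk + 1, n)
            1 <= pos <= n || continue
            s = 'a'^(pos - 1) * 'λ' * 'a'^(n - pos)
            @test !isascii(s)
        end
    end
end
```

Placing the non-ASCII character at the very last position is the case most likely to expose an off-by-one in the final chunk.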

@KristofferC
Member

It looks like isascii is not well tested in Julia.

I wouldn't say that. The tests were ok for the algorithm used. But when you add a more complex algorithm with more branches like in this case, then you might have to extend the tests which now do not have full coverage.

@Seelengrab
Contributor

Seelengrab commented Feb 8, 2023

A potential fix for the OOB:

function isascii(s::AbstractString)
    chunk_size = 1024
    bytes = codeunits(s)
    l = ncodeunits(s)
    l < 2*chunk_size && return _isascii(bytes, 1, l)
    for n = 1:chunk_size:(l-chunk_size)
        _isascii(bytes, n, n + chunk_size - 1) || return false
    end
    # handle the last chunk explicitly
    return _isascii(bytes, l-chunk_size+1, l)
end

which results in

julia> ##
        benchmark_isascii(isascii_all)
isascii_all -> 1 bytes 0.21437768240343347 GB/second @ 3.1595 GHz -> 0.06785177477557634 bytes/cycle
isascii_all -> 3 bytes 0.6417558886509636 GB/second @ 3.0919583333333334 GHz -> 0.20755644787719657 bytes/cycle
isascii_all -> 7 bytes 1.390258449304175 GB/second @ 3.098125 GHz -> 0.4487418839795602 bytes/cycle
isascii_all -> 15 bytes 2.3975999999999997 GB/second @ 3.169375 GHz -> 0.7564898442121869 bytes/cycle
isascii_all -> 31 bytes 3.1046710195881464 GB/second @ 3.1718333333333333 GHz -> 0.9788253963285628 bytes/cycle
isascii_all -> 63 bytes 3.637151162790698 GB/second @ 3.103875 GHz -> 1.1718098063841804 bytes/cycle
isascii_all -> 127 bytes 4.1398838174273855 GB/second @ 3.2747083333333333 GHz -> 1.2641992495293124 bytes/cycle
isascii_all -> 255 bytes 3.8182137217055625 GB/second @ 3.1081666666666665 GHz -> 1.2284456180081171 bytes/cycle
isascii_all -> 511 bytes 3.931249930516181 GB/second @ 3.1722916666666667 GHz -> 1.239246054145772 bytes/cycle
isascii_all -> 1023 bytes 3.807754143565659 GB/second @ 3.1895 GHz -> 1.1938404588699354 bytes/cycle
isascii_all -> 2047 bytes 3.8225957049486463 GB/second @ 3.251 GHz -> 1.175821502598784 bytes/cycle
isascii_all -> 4095 bytes 3.815414876546405 GB/second @ 3.1093333333333333 GHz -> 1.2270845443438265 bytes/cycle
isascii_all -> 8191 bytes 3.8623779050185245 GB/second @ 3.118 GHz -> 1.2387356975684813 bytes/cycle
isascii_all -> 16383 bytes 3.886832740213523 GB/second @ 3.1801666666666666 GHz -> 1.2222103894597316 bytes/cycle

julia> ##
        benchmark_isascii(isascii_loop)
isascii_loop -> 1 bytes 0.47058823529411764 GB/second @ 3.1695833333333336 GHz -> 0.1484700624038231 bytes/cycle
isascii_loop -> 3 bytes 1.2048192771084336 GB/second @ 3.1607916666666664 GHz -> 0.3811764279861639 bytes/cycle
isascii_loop -> 7 bytes 2.0289855072463765 GB/second @ 3.0977916666666667 GHz -> 0.6549780377676709 bytes/cycle
isascii_loop -> 15 bytes 2.592560553633218 GB/second @ 3.104125 GHz -> 0.8351985031637638 bytes/cycle
isascii_loop -> 31 bytes 4.735321100917431 GB/second @ 3.10375 GHz -> 1.5256773583302234 bytes/cycle
isascii_loop -> 63 bytes 8.583481228668942 GB/second @ 3.255125 GHz -> 2.6369129384183227 bytes/cycle
isascii_loop -> 127 bytes 17.62809457579972 GB/second @ 3.1031666666666666 GHz -> 5.6806792768031755 bytes/cycle
isascii_loop -> 255 bytes 31.380745399056696 GB/second @ 3.1742083333333335 GHz -> 9.886164392399163 bytes/cycle
isascii_loop -> 511 bytes 44.69008771929824 GB/second @ 3.0928333333333335 GHz -> 14.449562230737158 bytes/cycle
isascii_loop -> 1023 bytes 68.61375126390293 GB/second @ 3.1016666666666666 GHz -> 22.121574829845112 bytes/cycle
isascii_loop -> 2047 bytes 104.50972722593927 GB/second @ 3.1801666666666666 GHz -> 32.86297171823467 bytes/cycle
isascii_loop -> 4095 bytes 127.63129467831614 GB/second @ 3.093375 GHz -> 41.25956105493713 bytes/cycle
isascii_loop -> 8191 bytes 164.22062233840705 GB/second @ 3.101625 GHz -> 52.946640015607 bytes/cycle
isascii_loop -> 16383 bytes 177.4039354000371 GB/second @ 3.1093333333333333 GHz -> 57.05529654803938 bytes/cycle

julia> benchmark_isascii(isascii_loop_short)
isascii_loop_short -> 1 bytes 0.267828418230563 GB/second @ 3.0984166666666666 GHz -> 0.0864404136189655 bytes/cycle
isascii_loop_short -> 3 bytes 0.7309756097560975 GB/second @ 3.06225 GHz -> 0.23870539954481101 bytes/cycle
isascii_loop_short -> 7 bytes 1.3010232558139536 GB/second @ 3.1350416666666665 GHz -> 0.41499392803836854 bytes/cycle
isascii_loop_short -> 15 bytes 1.8025285972305842 GB/second @ 3.17275 GHz -> 0.5681281529369109 bytes/cycle
isascii_loop_short -> 31 bytes 3.357534320907266 GB/second @ 3.24475 GHz -> 1.0347590171530214 bytes/cycle
isascii_loop_short -> 63 bytes 6.746978892529137 GB/second @ 3.0800833333333335 GHz -> 2.190518295239567 bytes/cycle
isascii_loop_short -> 127 bytes 12.861902585199857 GB/second @ 3.1730833333333335 GHz -> 4.053439898689453 bytes/cycle
isascii_loop_short -> 255 bytes 24.504578313253013 GB/second @ 3.1025833333333335 GHz -> 7.898120914265965 bytes/cycle
isascii_loop_short -> 511 bytes 43.69544148548394 GB/second @ 3.228125 GHz -> 13.53585796258941 bytes/cycle
isascii_loop_short -> 1023 bytes 70.73558026407227 GB/second @ 3.1051666666666664 GHz -> 22.779962513253913 bytes/cycle
isascii_loop_short -> 2047 bytes 93.51465610536738 GB/second @ 3.0985 GHz -> 30.180621625098397 bytes/cycle
isascii_loop_short -> 4095 bytes 183.54341926729987 GB/second @ 3.176375 GHz -> 57.78392641526894 bytes/cycle
isascii_loop_short -> 8191 bytes 225.66651125333783 GB/second @ 3.1062083333333335 GHz -> 72.65015318891075 bytes/cycle
isascii_loop_short -> 16383 bytes 258.68121650189914 GB/second @ 3.111625 GHz -> 83.13380195296642 bytes/cycle

julia> versioninfo()
Julia Version 1.10.0-DEV.496
Commit 07c4244caa (2023-02-05 01:47 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 7900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 24 on 24 virtual cores
Environment:
  JULIA_NUM_THREADS = 24

So the short circuiting is very beneficial on my machine, even for longer strings! I had to try to push it past the 64MiB of L3 this CPU has to begin to see diminishing returns, due to hitting main memory:

isascii_loop_short -> 16777215 bytes 228.72978002576704 GB/second @ 3.1910416666666666 GHz -> 71.67871933953658 bytes/cycle
isascii_loop_short -> 33554431 bytes 165.98039661850325 GB/second @ 3.0955416666666666 GHz -> 53.61917702669266 bytes/cycle
isascii_loop_short -> 67108863 bytes 143.83534849477755 GB/second @ 3.2654583333333336 GHz -> 44.04752221963049 bytes/cycle
isascii_loop_short -> 134217727 bytes 120.51988843056374 GB/second @ 3.1855 GHz -> 37.83389999389852 bytes/cycle
isascii_loop_short -> 268435455 bytes 99.18517613943506 GB/second @ 3.2829583333333336 GHz -> 30.212133712561602 bytes/cycle
isascii_loop_short -> 536870911 bytes 99.34236215407893 GB/second @ 3.0989166666666663 GHz -> 32.05712603460745 bytes/cycle

@matthias314
Contributor

matthias314 commented Feb 8, 2023

In case one cares about making isascii faster: According to @Seelengrab's benchmarks, isascii_loop is noticeably faster than isascii_loop_short for short strings, roughly shorter than chunk_size. How come? The two functions do the same thing for strings of this length.

@Seelengrab
Contributor

Seelengrab commented Feb 8, 2023

That's likely an artifact due to _isascii not inlining. If I annotate the sub function with @inline, the performance is the same and improves a bit in the higher ones:

julia> benchmark_isascii(_isascii_loop)
_isascii_loop -> 1 bytes 0.39370078740157477 GB/second @ 3.1849583333333333 GHz -> 0.1236125393796072 bytes/cycle
_isascii_loop -> 3 bytes 1.001669449081803 GB/second @ 3.181125 GHz -> 0.31487899692146737 bytes/cycle
_isascii_loop -> 7 bytes 1.7907810499359795 GB/second @ 3.16 GHz -> 0.566702863903791 bytes/cycle
_isascii_loop -> 15 bytes 2.363564668769716 GB/second @ 3.1655 GHz -> 0.7466639294802451 bytes/cycle
_isascii_loop -> 31 bytes 4.186468200270636 GB/second @ 3.1529166666666666 GHz -> 1.327808071976943 bytes/cycle
_isascii_loop -> 63 bytes 7.453941908713692 GB/second @ 3.1700416666666666 GHz -> 2.351370326482678 bytes/cycle
_isascii_loop -> 127 bytes 15.51358629130967 GB/second @ 3.2835 GHz -> 4.724710306474697 bytes/cycle
_isascii_loop -> 255 bytes 28.170083102493077 GB/second @ 3.0805 GHz -> 9.144646356920331 bytes/cycle
_isascii_loop -> 511 bytes 37.728391401037804 GB/second @ 3.165 GHz -> 11.920502812334218 bytes/cycle
_isascii_loop -> 1023 bytes 63.11992551210428 GB/second @ 3.2256666666666667 GHz -> 19.56802485649611 bytes/cycle
_isascii_loop -> 2047 bytes 103.57684264218312 GB/second @ 3.155125 GHz -> 32.82812650598095 bytes/cycle
_isascii_loop -> 4095 bytes 212.3491897543126 GB/second @ 3.0968333333333335 GHz -> 68.5697830324458 bytes/cycle
_isascii_loop -> 8191 bytes 250.18482156771074 GB/second @ 3.1687083333333335 GHz -> 78.95482804014593 bytes/cycle
_isascii_loop -> 16383 bytes 273.6738006031253 GB/second @ 3.1684583333333336 GHz -> 86.37443570710002 bytes/cycle

@ndinsmore
Contributor Author

In case one cares about making isascii faster: According to @Seelengrab's benchmarks, isascii_loop is noticeably faster than isascii_loop_short for short strings, roughly shorter than chunk_size. How come? The two functions do the same thing for strings of this length.

This is because for short strings every single op/instruction counts for bytes/second. But more importantly in this case it is because there is an added redirection. If you look at the size of @code_native for the loop it ends up being a giant function, which makes it unlikely to get inlined.
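
For reference, a sketch of the two ways inlining can be requested for the inner loop. All names here are hypothetical; the call-site form of @inline is assumed to be available (Julia 1.8 or newer).

```julia
# Hypothetical names; the loop body is the one used in this thread.
function scan_plain(bytes, first, last)
    r = true
    for n in first:last
        @inbounds r &= bytes[n] < 0x80
    end
    return r
end

# Option 1: annotate the definition, asking the compiler to inline it into callers.
@inline function scan_inlined(bytes, first, last)
    r = true
    for n in first:last
        @inbounds r &= bytes[n] < 0x80
    end
    return r
end

# Option 2 (Julia 1.8 or newer): annotate one call site only.
check_callsite(s::String) = @inline scan_plain(codeunits(s), 1, ncodeunits(s))
check_definition(s::String) = scan_inlined(codeunits(s), 1, ncodeunits(s))
```

Both wrappers return the same results; whether the loop was actually inlined can be inspected with @code_llvm or @code_native on the wrapper.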

@Seelengrab
Contributor

If you look at the size of @code_native for the loop it ends up being a giant function, which makes it unlikely to get inlined.

Well, that certainly depends on your CPU - I have a small number of funny blocks like this:

; │┌ @ REPL[6]:3 within `_isascii`
	add	r15, r14
	kxnorq	k1, k0, k0
	mov	rax, -1024
	vpternlogd	zmm0, zmm0, zmm0, 255
	kxnorq	k2, k0, k0
	kxnorq	k3, k0, k0
	kxnorq	k4, k0, k0
	.p2align	4, 0x90
.LBB0_31:                               # %vector.body145
                                        # =>This Inner Loop Header: Depth=1
; ││ @ REPL[6]:4 within `_isascii`
; ││┌ @ bool.jl:38 within `&`
	vpcmpltb	k1 {k1}, zmm0, zmmword ptr [r15 + rax + 8]
	vpcmpltb	k2 {k2}, zmm0, zmmword ptr [r15 + rax + 72]
	vpcmpltb	k3 {k3}, zmm0, zmmword ptr [r15 + rax + 136]
	vpcmpltb	k4 {k4}, zmm0, zmmword ptr [r15 + rax + 200]
	add	rax, 256
	jne	.LBB0_31
# %bb.32:                               # %middle.block144
; ││└
; ││ @ REPL[6]:5 within `_isascii`
	kandq	k0, k2, k1
	kandq	k0, k3, k0
	kandq	k0, k4, k0
	kortestq	k0, k0
	setb	al
	jmp	.LBB0_17
.LBB0_2:
	mov	al, 1
	jmp	.LBB0_17
.LBB0_5:                                # %vector.main.loop.iter.check
	movabs	rcx, 9223372036854774784
; │└

which is only possible due to the inlining (& possibly unrolling) in the parent function (if inlined).

Regardless, I wonder how representative this benchmark really is - it's comparing a string of only S and a string of only λ, so the worst case for one and the best case for the other. Both cases are also going to be extremely friendly on the branch predictor & caching effects, regardless of size, since the content is always the same.
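
One way to get less uniform inputs is to randomize both the content and the position of the non-ASCII character. The generator below is a hypothetical sketch of that idea and was not used for any numbers in this thread.

```julia
using Random

# Hypothetical generator: `count` strings of n characters each. Roughly half
# of them get a single 'λ' at a random position; the rest stay pure ASCII.
function random_cases(rng, n, count)
    cases = String[]
    for _ in 1:count
        chars = rand(rng, ' ':'~', n)        # printable ASCII characters
        if rand(rng, Bool)
            chars[rand(rng, 1:n)] = 'λ'
        end
        push!(cases, String(chars))
    end
    return cases
end

rng = MersenneTwister(1234)
cases = random_cases(rng, 4096, 100)
count(isascii, cases)    # number of pure-ASCII cases
```

With BenchmarkTools, a fresh case could then be picked per sample, for example with something like `@benchmark isascii(s) setup=(s = rand($cases))`.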

@Seelengrab
Contributor

@ndinsmore Regarding your fix with min, that leaves a lot of performance on the table:

julia> benchmark_isascii(isascii_min)
isascii_min -> 1 bytes 0.3976143141153081 GB/second @ 3.1706666666666665 GHz -> 0.12540400991862116 bytes/cycle
isascii_min -> 3 bytes 0.9331259720062208 GB/second @ 3.1020416666666666 GHz -> 0.3008102637798936 bytes/cycle
isascii_min -> 7 bytes 1.5087378640776699 GB/second @ 3.104125 GHz -> 0.48604288296304754 bytes/cycle
isascii_min -> 15 bytes 2.028455284552846 GB/second @ 3.1062083333333335 GHz -> 0.6530325937204832 bytes/cycle
isascii_min -> 31 bytes 3.9087807959570435 GB/second @ 3.211625 GHz -> 1.2170726021739908 bytes/cycle
isascii_min -> 63 bytes 7.5122767190393684 GB/second @ 3.1057916666666667 GHz -> 2.418796083351599 bytes/cycle
isascii_min -> 127 bytes 14.370294784580498 GB/second @ 3.1045833333333337 GHz -> 4.628735402361185 bytes/cycle
isascii_min -> 255 bytes 26.77567140600316 GB/second @ 3.1062916666666665 GHz -> 8.619818831995223 bytes/cycle
isascii_min -> 511 bytes 46.39954462659381 GB/second @ 3.099875 GHz -> 14.968198597231762 bytes/cycle
isascii_min -> 1023 bytes 66.05567993770077 GB/second @ 3.1082916666666667 GHz -> 21.251441956391083 bytes/cycle
isascii_min -> 2047 bytes 102.04397095404408 GB/second @ 3.1702083333333335 GHz -> 32.18841168294747 bytes/cycle
isascii_min -> 4095 bytes 157.1784401796652 GB/second @ 3.1790833333333337 GHz -> 49.44143443225201 bytes/cycle
isascii_min -> 8191 bytes 211.88219968363265 GB/second @ 3.1795833333333334 GHz -> 66.63835398253418 bytes/cycle
isascii_min -> 16383 bytes 257.06526612104466 GB/second @ 3.21 GHz -> 80.08263742088619 bytes/cycle

julia> benchmark_isascii(isascii_seelengrab)
isascii_seelengrab -> 1 bytes 0.4175365344467641 GB/second @ 3.2627916666666663 GHz -> 0.12796910655142374 bytes/cycle
isascii_seelengrab -> 3 bytes 1.0812759055685708 GB/second @ 3.191625 GHz -> 0.33878538536594077 bytes/cycle
isascii_seelengrab -> 7 bytes 1.7330855018587359 GB/second @ 3.2445416666666667 GHz -> 0.534154182596536 bytes/cycle
isascii_seelengrab -> 15 bytes 2.460591133004926 GB/second @ 3.1650833333333335 GHz -> 0.7774174875874545 bytes/cycle
isascii_seelengrab -> 31 bytes 4.692272727272727 GB/second @ 3.1760833333333336 GHz -> 1.4773770820264143 bytes/cycle
isascii_seelengrab -> 63 bytes 8.128506787330316 GB/second @ 3.09825 GHz -> 2.623580016890282 bytes/cycle
isascii_seelengrab -> 127 bytes 16.600654878847415 GB/second @ 3.176625 GHz -> 5.225878055750179 bytes/cycle
isascii_seelengrab -> 255 bytes 28.339643652561247 GB/second @ 3.20125 GHz -> 8.852680563080437 bytes/cycle
isascii_seelengrab -> 511 bytes 37.997386987196236 GB/second @ 3.264375 GHz -> 11.640018988993678 bytes/cycle
isascii_seelengrab -> 1023 bytes 62.40331389996931 GB/second @ 3.1880833333333336 GHz -> 19.573928086353654 bytes/cycle
isascii_seelengrab -> 2047 bytes 109.4381029372137 GB/second @ 3.248 GHz -> 33.69399720973328 bytes/cycle
isascii_seelengrab -> 4095 bytes 226.3476203729474 GB/second @ 3.2520416666666665 GHz -> 69.60169750990708 bytes/cycle
isascii_seelengrab -> 8191 bytes 265.1621750811782 GB/second @ 3.25175 GHz -> 81.5444530118177 bytes/cycle
isascii_seelengrab -> 16383 bytes 291.38884678479263 GB/second @ 3.1832083333333334 GHz -> 91.53935786529605 bytes/cycle

Those register masks aren't cheap! Though arguably, this much performance ought to be overkill 😂

@ndinsmore
Contributor Author

Well, that certainly depends on your CPU - I have a small number of funny blocks like this:

I think that you likely have AVX-512, but the fact that this is .LBB0_31: shows that there are a lot of blocks.

@ndinsmore
Contributor Author

ndinsmore commented Feb 8, 2023

@Seelengrab
Let me know how the current implementation, which is equivalent to isascii_loop_nested, does on your computer. It seems to beat everything because of the @inline.

As a further note to illustrate how fragile this performance can be: isascii_loop_nested is named that way because it started out with the whole _isascii_loop function carefully pasted inside, and it did not get any of the SIMD optimizations.

@Seelengrab
Contributor

Seelengrab commented Feb 8, 2023

Sure, here's the results:

julia> benchmark_isascii(isascii_loop_nested)
isascii_loop_nested -> 1 bytes 0.3076923076923077 GB/second @ 3.1762916666666667 GHz -> 0.09687155336563058 bytes/cycle
isascii_loop_nested -> 3 bytes 0.7577749683944375 GB/second @ 3.1770416666666663 GHz -> 0.2385159050147084 bytes/cycle
isascii_loop_nested -> 7 bytes 1.321928166351607 GB/second @ 3.16925 GHz -> 0.4171107253613969 bytes/cycle
isascii_loop_nested -> 15 bytes 1.8447319778188538 GB/second @ 3.108 GHz -> 0.5935431074063235 bytes/cycle
isascii_loop_nested -> 31 bytes 3.5057223796033994 GB/second @ 3.1112083333333334 GHz -> 1.126804123672228 bytes/cycle
isascii_loop_nested -> 63 bytes 7.048654708520179 GB/second @ 3.10775 GHz -> 2.2680893599936223 bytes/cycle
isascii_loop_nested -> 127 bytes 13.162058212058213 GB/second @ 3.1755 GHz -> 4.144877408930315 bytes/cycle
isascii_loop_nested -> 255 bytes 25.706268958543983 GB/second @ 3.1150833333333336 GHz -> 8.252193025936378 bytes/cycle
isascii_loop_nested -> 511 bytes 45.93931469792606 GB/second @ 3.1609583333333333 GHz -> 14.533350286044914 bytes/cycle
isascii_loop_nested -> 1023 bytes 62.501121730846066 GB/second @ 3.1578333333333335 GHz -> 19.792406733787743 bytes/cycle
isascii_loop_nested -> 2047 bytes 93.82083333333334 GB/second @ 3.1142083333333335 GHz -> 30.126704218568122 bytes/cycle
isascii_loop_nested -> 4095 bytes 157.67493520918177 GB/second @ 3.1763333333333335 GHz -> 49.64055049087473 bytes/cycle
isascii_loop_nested -> 8191 bytes 211.06855228485549 GB/second @ 3.1750833333333337 GHz -> 66.47653939314625 bytes/cycle
isascii_loop_nested -> 16383 bytes 259.2550366773427 GB/second @ 3.1818333333333335 GHz -> 81.47976638542015 bytes/cycle

julia> benchmark_isascii(isascii_seelengrab)
isascii_seelengrab -> 1 bytes 0.4385964912280702 GB/second @ 3.1745 GHz -> 0.13816238501435507 bytes/cycle
isascii_seelengrab -> 3 bytes 0.977198697068404 GB/second @ 3.1764583333333336 GHz -> 0.3076378137291493 bytes/cycle
isascii_seelengrab -> 7 bytes 1.6590747330960853 GB/second @ 3.1916666666666664 GHz -> 0.5198145377846743 bytes/cycle
isascii_seelengrab -> 15 bytes 2.1249112845990066 GB/second @ 3.1744583333333334 GHz -> 0.669377595001459 bytes/cycle
isascii_seelengrab -> 31 bytes 4.130574098798398 GB/second @ 3.2448333333333337 GHz -> 1.2729695717700131 bytes/cycle
isascii_seelengrab -> 63 bytes 8.009426751592356 GB/second @ 3.107 GHz -> 2.577865063274012 bytes/cycle
isascii_seelengrab -> 127 bytes 15.116703440873039 GB/second @ 3.25 GHz -> 4.651293366422474 bytes/cycle
isascii_seelengrab -> 255 bytes 28.450531022917833 GB/second @ 3.2112916666666664 GHz -> 8.859528805257854 bytes/cycle
isascii_seelengrab -> 511 bytes 39.60901202381416 GB/second @ 3.3138333333333336 GHz -> 11.952626472005479 bytes/cycle
isascii_seelengrab -> 1023 bytes 66.7050034404797 GB/second @ 4.013125 GHz -> 16.621710871323398 bytes/cycle
isascii_seelengrab -> 2047 bytes 110.02576524398495 GB/second @ 3.1782916666666665 GHz -> 34.61789438581574 bytes/cycle
isascii_seelengrab -> 4095 bytes 225.56095394097557 GB/second @ 3.214375 GHz -> 70.17256976581001 bytes/cycle
isascii_seelengrab -> 8191 bytes 264.864652379151 GB/second @ 3.23325 GHz -> 81.91901411247228 bytes/cycle
isascii_seelengrab -> 16383 bytes 290.41587428115923 GB/second @ 3.1745 GHz -> 91.48397362770805 bytes/cycle

and the code:

Code
julia> using BenchmarkTools

julia> function benchmark_isascii(fun)
           for p=1:14
               n = (2 * 2^(p-1))-1
               s='S'^n
               s2 = 'λ' * 'S'^(n-1)
               b = @benchmark $fun($s)&$fun($s2) seconds=1
               cpu_info = Sys.cpu_info()
               cpu_ghz= mean(i.speed for i in Sys.cpu_info()) /1_000
               parse_time_ns = time(median(b))
               GB_per_second= 2*n / parse_time_ns
               bytes_per_cycle = GB_per_second / cpu_ghz 
               print("$fun -> $n bytes $GB_per_second GB/second @ $cpu_ghz GHz -> $bytes_per_cycle bytes/cycle\n")
           end
       end
benchmark_isascii (generic function with 1 method)

julia> function _isascii_loop(bytes, first, last)
           r = true
           for n = first:last
               @inbounds r &= bytes[n] < UInt8(0x80)
           end
           return r
       end
_isascii_loop (generic function with 1 method)

julia> function isascii_loop_nested(s::AbstractString)
           chunk_size = 1024
           bytes = codeunits(s)
           l = ncodeunits(s)
           start = 1
           chunk_end = ifelse(l < chunk_size, l, start + chunk_size - 1)
           fastmin(a,b) = ifelse(a<b,a,b)
           r = true
           while start <= l
               @inline _isascii_loop(bytes, start, chunk_end) || return false
               start += chunk_size
               chunk_end = fastmin(l,chunk_end+chunk_size)
           end
           return true
       end
isascii_loop_nested (generic function with 1 method)

julia> function isascii_seelengrab(s::AbstractString)
           chunk_size = 1024
           bytes = codeunits(s)
           l = ncodeunits(s)
           l < 2*chunk_size && return _isascii(bytes, 1, l)
           for n = 1:chunk_size:(l-chunk_size)
               _isascii(bytes, n, n + chunk_size - 1) || return false
           end
           # handle the last chunk explicitly
           return _isascii(bytes, l-chunk_size+1, l)
       end
isascii_seelengrab (generic function with 1 method)

julia> @inline function _isascii(bytes, first, last)
           r = true
           for n = first:last
               @inbounds r &= bytes[n] < UInt8(0x80)
           end
           return r
       end
_isascii (generic function with 1 method)

I think that you likely have AVX-512, but the fact that this is .LBB0_31: shows that there are a lot of blocks.

The code size isn't that large actually:

julia> @code_native binary=true dump_module=false isascii_seelengrab("foo")
	.text
; code origin: 00007f8fb1f1b800, code size: 864

Since everything still fits comfortably in my terminal buffer, there definitely aren't too many blocks here :)

@matthias314
Contributor

matthias314 commented Feb 8, 2023

The following code is almost 40% faster than isascii_loop_nested on my machine:

function _isascii_loop2(bytes, first, last)
    r = UInt8(0)
    for n = first:last
        @inbounds r |= bytes[n]
    end
    return r < 0x80
end

function isascii_loop_nested2(s::AbstractString)
    chunk_size = 1024
    bytes = codeunits(s)
    l = length(bytes)
    start = 1
    while start <= l
        @inline _isascii_loop2(bytes, start, min(start+chunk_size-1, l)) || return false
        start += chunk_size
    end
    return true
end

The main point is the different _isascii_loop2. I don't know if fastmin instead of min would make a difference in isascii_loop_nested2.
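
The reason the two inner loops agree: OR-ing bytes together leaves the high bit set exactly when at least one byte has it set, so a single comparison at the end replaces one comparison per byte. A small sketch that checks this exhaustively over all byte pairs (the helper names are hypothetical):

```julia
# Two reductions over the same bytes (hypothetical names).
or_reduce(bytes)  = reduce(|, bytes; init = 0x00) < 0x80   # as in _isascii_loop2
and_reduce(bytes) = all(b -> b < 0x80, bytes)              # as in _isascii_loop

# Exhaustive check over all 65536 byte pairs.
agree = all(or_reduce((a, b)) == and_reduce((a, b))
            for a in 0x00:0xff, b in 0x00:0xff)
```

The OR form keeps the loop body to a load and an OR per vector lane, which may explain the smaller code size reported below.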

n = 2^14-1

s = String(rand(' ':'~', n))  # ascii
@btime isascii_loop_nested($s)    # 329.387 ns
@btime isascii_loop_nested2($s)   # 195.208 ns

s = String(rand(' ':'ø', n))  # not ascii (with the first non-ascii character occurring soon)
@btime isascii_loop_nested($s)    # 18.661 ns
@btime isascii_loop_nested2($s)   # 11.618 ns

The code is also more compact:

julia> @code_native binary=true dump_module=false isascii_loop_nested("a")
; code origin: 00007f08b4225c40, code size: 464
julia> @code_native binary=true dump_module=false isascii_loop_nested2("a")
; code origin: 00007f08b423f840, code size: 352
Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores

@ndinsmore
Contributor Author

ndinsmore commented Feb 8, 2023

@matthias314 that is a brilliant reduction. I have changed the PR to reflect that.

@ndinsmore
Contributor Author

ndinsmore commented Feb 8, 2023

I don't know if fastmin instead of min would make a difference in isascii_loop_nested2.

I kept fastmin in place because in my experience it gives the optimizer just a little more help to vectorize through that section.

@matthias314
Contributor

Sorry for posting repeatedly, but after looking at the docstring of codeunit I realized that code units need not be UInt8. They can also be UInt16 or UInt32. So I suggest something like the following:

function _isascii_loop4(s, first, last)
    r = zero(codeunit(s))
    for n in first:last
        @inbounds r |= codeunit(s, n)
    end
    return r < 0x80
end

function isascii_loop_nested4(s::AbstractString)
    chunk_size = 1024 ÷ sizeof(codeunit(s))
    l = ncodeunits(s)
    start = 1
    while start <= l
        @inline _isascii_loop4(s, start, min(start+chunk_size-1, l)) || return false
        start += chunk_size
    end
    return true
end
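A small sketch of the quantities this version relies on, assuming only documented Base functions (codeunit, ncodeunits, sizeof). For a String the code unit type is UInt8; the UInt16 line is just arithmetic showing how the chunk length would scale for a wider code unit, not a claim about any particular string type:

```julia
s = "abc"

# codeunit(s) returns the code unit *type*; for String this is UInt8.
@assert codeunit(s) === UInt8

# zero(...) of that type gives an accumulator of matching width.
@assert zero(codeunit(s)) === 0x00

# Chunk length in code units so that a chunk always spans 1024 bytes.
@assert 1024 ÷ sizeof(codeunit(s)) == 1024   # UInt8 code units
@assert 1024 ÷ sizeof(UInt16) == 512         # what a 16-bit code unit would give

# ncodeunits counts code units, not characters.
@assert ncodeunits("λ") == 2
```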

@matthias314
Contributor

matthias314 commented Feb 16, 2023

Some comments:

  • Like @Seelengrab I think that because of inlining the Val(N) doesn't make the code faster.
  • On my machine, replacing the loop over start:N:stop-N by a while loop gives a small, but noticeable speed-up.
  • If you want isascii to work for an arbitrary AbstractVector v (corrected from AbstractString), then you cannot assume that indexing starts at 1. You have to go from firstindex(v) to lastindex(v).
  • Also, if isascii(v::AbstractVector) (corrected from isascii(v::AbstractString)) is supposed to be used by others, then a docstring would be helpful.
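To make the indexing point concrete, here is a minimal sketch of a loop that runs from firstindex(v) to lastindex(v). ShiftedBytes and is7bit_generic are hypothetical names invented for this example (packages such as OffsetArrays.jl provide real arrays with non-1-based indices); this is not the code in the PR:

```julia
# Hypothetical vector type whose indices start at offset + 1 (illustration only).
struct ShiftedBytes <: AbstractVector{UInt8}
    data::Vector{UInt8}
    offset::Int
end
Base.size(v::ShiftedBytes) = size(v.data)
Base.axes(v::ShiftedBytes) =
    (Base.IdentityUnitRange(v.offset + 1 : v.offset + length(v.data)),)
Base.IndexStyle(::Type{ShiftedBytes}) = IndexLinear()
Base.getindex(v::ShiftedBytes, i::Int) = v.data[i - v.offset]

# Index-agnostic 7-bit check: never assumes that the first index is 1.
function is7bit_generic(v::AbstractVector{<:Unsigned})
    r = zero(eltype(v))
    for i in firstindex(v):lastindex(v)
        r |= v[i]
    end
    return r < 0x80
end

v = ShiftedBytes(collect(codeunits("abc")), 10)   # valid indices are 11:13
@assert firstindex(v) == 11 && lastindex(v) == 13
@assert is7bit_generic(v)
@assert !is7bit_generic(ShiftedBytes(UInt8[0x61, 0xff], -5))
```

A loop written as 1:length(v) would index out of range on such a vector, which is the failure mode the comment above warns about.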

@Seelengrab
Contributor

On my machine, replacing the loop over start:N:stop-N by a while loop gives a small, but noticeable speed-up.

Yes, I've noticed the while thing as well. That's actually what led me to the let version, which gets the same semantics I think (as explained above). I'll have to benchmark again, but can't do that until the weekend.

If you want isascii to work for an arbitrary AbstractString v, then you cannot assume that indexing starts at 1. You have to go from firstindex(v) to lastindex(v).

The fallback is back to all(isascii, s) now, isn't it?

@ndinsmore
Contributor Author

If you want isascii to work for an arbitrary AbstractString v, then you cannot assume that indexing starts at 1. You have to go from firstindex(v) to lastindex(v).

We made the decision to apply this only to String and SubString.

Also, if isascii(v::AbstractString) is supposed to be used by others, then a docstring would be helpful.

isascii(::AbstractString) is now unchanged in this PR

@ndinsmore
Contributor Author

The fallback is back to all(isascii, s) now, isn't it?

Yes it is.

Even though a modified version of the loop using isascii(::AbstractChar) would work.

@matthias314
Contributor

matthias314 commented Feb 16, 2023

Sorry, I meant AbstractVector, not AbstractString. Currently, the code for isascii(cu::AbstractVector) assumes that indexing starts at 1, and this method doesn't have a docstring.

@ndinsmore
Contributor Author

That is a good point. Should it be a non-breakable export (like it is now), or an internal _isascii?

@matthias314
Contributor

According to his post, @StefanKarpinski seems to lean towards the first option.

@ndinsmore
Contributor Author

@jakobnissen would you check that the added tests are to your liking?

Beyond that, I think this is ready to go.

@jakobnissen
Contributor

Looks good to me!

@oscardssmith oscardssmith added performance Must go faster and removed needs tests Unit tests are required for this change labels Mar 3, 2023
@oscardssmith oscardssmith merged commit 778947f into JuliaLang:master Mar 3, 2023
@StefanKarpinski
Member

This still has an issue at least in terms of naming/docs: the isascii(cu::AbstractVector{CU<:Integer}) method purports to work with vectors of code units, but that doesn't make sense because we don't know what generic code units mean. It happens that for UTF-{8,16,32} code units from 0-127 represent ASCII code points (also from 0-127), but that need not be the case and is only the case for the UTF family because it is based on ASCII. Consider (again) EBCDIC, which has a UInt8 code unit, but has some ASCII characters above 0x7f and some non-ASCII characters below. The generic isascii method I was suggesting was one that operates on code points not code units.
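A short sketch of the distinction being drawn here, with hypothetical helper names (is7bit_units, isascii_codepoints) that are not part of Base or the PR. For a UTF-8 String the code-unit check and the code-point check agree, but a raw 7-bit test on bytes says nothing reliable about an encoding such as EBCDIC, where the letter 'A' is the byte 0xC1:

```julia
# Hypothetical helpers, for illustration only.

# Code-unit view: is every unit below 0x80? This is only a 7-bit test.
is7bit_units(units) = all(<(0x80), units)

# Code-point view: is every character an ASCII character?
isascii_codepoints(s::AbstractString) = all(isascii, s)

# For UTF-8 strings the two views coincide.
for s in ("Julia", "λ-calculus")
    @assert is7bit_units(codeunits(s)) == isascii_codepoints(s)
end

# In EBCDIC, 'A' is encoded as 0xC1, so the 7-bit test rejects a byte
# sequence whose decoded text would be pure ASCII.
ebcdic_A = UInt8[0xC1]
@assert !is7bit_units(ebcdic_A)
```

This is why a name along the lines of a 7-bit check describes the vector method more accurately than isascii does.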

@matthias314
Copy link
Contributor

What about renaming isascii(cu::AbstractVector{CU<:Integer}) to something like is7bit?

Labels
performance Must go faster strings "Strings!"
8 participants