It would be nice if the Unicode standard library exported some string-comparison functions for normalized (canonical equivalence) and case-folded string comparison. Currently you have to allocate a new copy of each string with `Unicode.normalize` in order to do this, which seems a bit wasteful.

Here is a sample implementation, using the lower-level `utf8proc_decompose_char` function to decompose one character at a time into small (16-byte) buffers. We could also define infix versions `==ᵘ(a,b) = equal_decomposed(a,b)` and `==ᶜ(a,b) = equal_decomposed(a,b, casefold=true)`.
```julia
import Base.Unicode: utf8proc_error, UTF8PROC_DECOMPOSE, UTF8PROC_CASEFOLD, UTF8PROC_STRIPMARK

function decompose_char!(codepoint::Union{Integer,Char}, dest::Vector{UInt32}, options::Integer)
    ret = @ccall utf8proc_decompose_char(codepoint::UInt32, dest::Ptr{UInt32}, length(dest)::Int, options::Cint, C_NULL::Ptr{Cint})::Int
    ret < 0 && utf8proc_error(ret)
    return ret
end

function equal_decomposed(s1::AbstractString, s2::AbstractString; casefold::Bool=false, stripmark::Bool=false)
    function decompose_next_char!(c, state, d, options, s)
        n = decompose_char!(c, d, options)
        if n > length(d) # may be possible in future Unicode versions?
            n = decompose_char!(c, resize!(d, n), options)
        end
        return 1, n, iterate(s, state)
    end
    options = UTF8PROC_DECOMPOSE
    casefold && (options |= UTF8PROC_CASEFOLD)
    stripmark && (options |= UTF8PROC_STRIPMARK)
    i1, i2 = iterate(s1), iterate(s2)
    d1, d2 = Vector{UInt32}(undef, 4), Vector{UInt32}(undef, 4) # codepoint buffers
    n1 = n2 = 0 # lengths of codepoint buffers
    j1 = j2 = 1 # indices in d1, d2
    while true
        if j1 > n1
            i1 === nothing && return i2 === nothing && j2 > n2
            j1, n1, i1 = decompose_next_char!(UInt32(i1[1]), i1[2], d1, options, s1)
        end
        if j2 > n2
            i2 === nothing && return false
            j2, n2, i2 = decompose_next_char!(UInt32(i2[1]), i2[2], d2, options, s2)
        end
        d1[j1] == d2[j2] || return false
        j1 += 1; j2 += 1
    end
end
```
Currently the performance advantage over simply doing `normalize_string(a, :NFC) == normalize_string(b, :NFC)` is not as great as I would like: only about a factor of two. It could probably be sped up in the common case of ASCII strings, at least.
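For context, a minimal sketch of the allocating baseline that `equal_decomposed` is meant to avoid, written against the current `Unicode` stdlib API (`Unicode.normalize`, the successor to `normalize_string`); the example strings here are illustrative, not from the original report:

```julia
using Unicode

s1 = "no\u00ebl"   # "noël" with precomposed ë (NFC form)
s2 = "noe\u0308l"  # "noël" with e + combining diaeresis (NFD form)

s1 == s2  # false: the codepoint sequences differ

# Canonical-equivalence comparison: allocates two new normalized strings.
Unicode.normalize(s1, :NFC) == Unicode.normalize(s2, :NFC)  # true

# Case-insensitive comparison via case folding also allocates
# (full case folding maps ß to "ss"):
Unicode.normalize("Straße", casefold=true) == Unicode.normalize("STRASSE", casefold=true)  # true
```

Each call to `Unicode.normalize` heap-allocates a fresh string, whereas the proposal above compares one decomposed character at a time through small stack-friendly buffers.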