unicode string comparison (normalized and casefolded) #40969

stevengj · 2021-05-27T12:52:30Z

It would be nice if the Unicode standard library exported some string-comparison functions for normalized (canonical equivalence) and case-folded string comparison. Currently you have to allocate a new copy of each string with Unicode.normalize in order to do this, which seems a bit wasteful.

Here is a sample implementation, using the lower-level utf8proc_decompose_char function to decompose one character at a time to small (16-byte) buffers. We could also define infix versions ==ᵘ(a,b) = equal_decomposed(a,b) and ==ᶜ(a,b) = equal_decomposed(a,b, casefold=true).

import Base.Unicode: utf8proc_error, UTF8PROC_DECOMPOSE, UTF8PROC_CASEFOLD, UTF8PROC_STRIPMARK

function decompose_char!(codepoint::Union{Integer,Char}, dest::Vector{UInt32}, options::Integer)
    ret = @ccall utf8proc_decompose_char(codepoint::UInt32, dest::Ptr{UInt32}, length(dest)::Int, options::Cint, C_NULL::Ptr{Cint})::Int
    ret < 0 && utf8proc_error(ret)
    return ret
end

function equal_decomposed(s1::AbstractString, s2::AbstractString; casefold::Bool=false, stripmark::Bool=false)
    function decompose_next_char!(c, state, d, options, s)
        n = decompose_char!(c, d, options)
        if n > length(d) # may be possible in future Unicode versions?
            n = decompose_char!(c, resize!(d, n), options)
        end
        return 1, n, iterate(s, state)
    end
    options = UTF8PROC_DECOMPOSE
    casefold && (options |= UTF8PROC_CASEFOLD)
    stripmark && (options |= UTF8PROC_STRIPMARK)
    i1,i2 = iterate(s1),iterate(s2)
    d1,d2 = Vector{UInt32}(undef, 4), Vector{UInt32}(undef, 4) # codepoint buffers
    n1 = n2 = 0 # lengths of codepoint buffers
    j1 = j2 = 1 # indices in d1, d2
    while true
        if j1 > n1
            i1 === nothing && return i2 === nothing && j2 > n2
            j1, n1, i1 = decompose_next_char!(UInt32(i1[1]), i1[2], d1, options, s1)
        end
        if j2 > n2
            i2 === nothing && return false
            j2, n2, i2 = decompose_next_char!(UInt32(i2[1]), i2[2], d2, options, s2)
        end
        d1[j1] == d2[j2] || return false
        j1 += 1; j2 += 1
    end
end

Currently the performance advantage over simply doing normalize_string(a, :NFC) == normalize_string(b, :NFC) is not as great as I would like, just a factor of two. Probably it could be sped up in the common case of ASCII strings, at least.

The text was updated successfully, but these errors were encountered:

laborg · 2022-02-10T18:40:18Z

I guess this can be closed now that #42493 has landed.

stevengj added the unicode Related to unicode characters and encodings label May 27, 2021

stevengj closed this as completed Feb 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unicode string comparison (normalized and casefolded) #40969

unicode string comparison (normalized and casefolded) #40969

stevengj commented May 27, 2021

laborg commented Feb 10, 2022

unicode string comparison (normalized and casefolded) #40969

unicode string comparison (normalized and casefolded) #40969

Comments

stevengj commented May 27, 2021

laborg commented Feb 10, 2022