Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode string comparison (normalized and casefolded) #40969

Closed
stevengj opened this issue May 27, 2021 · 1 comment
Closed

unicode string comparison (normalized and casefolded) #40969

stevengj opened this issue May 27, 2021 · 1 comment
Labels
unicode Related to unicode characters and encodings

Comments

@stevengj
Copy link
Member

It would be nice if the Unicode standard library exported some string-comparison functions for normalized (canonical equivalence) and case-folded string comparison. Currently you have to allocate a new copy of each string with Unicode.normalize in order to do this, which seems a bit wasteful.

Here is a sample implementation, using the lower-level utf8proc_decompose_char function to decompose one character at a time to small (16-byte) buffers. We could also define infix versions ==ᵘ(a,b) = equal_decomposed(a,b) and ==ᶜ(a,b) = equal_decomposed(a,b, casefold=true).

import Base.Unicode: utf8proc_error, UTF8PROC_DECOMPOSE, UTF8PROC_CASEFOLD, UTF8PROC_STRIPMARK

function decompose_char!(codepoint::Union{Integer,Char}, dest::Vector{UInt32}, options::Integer)
    ret = @ccall utf8proc_decompose_char(codepoint::UInt32, dest::Ptr{UInt32}, length(dest)::Int, options::Cint, C_NULL::Ptr{Cint})::Int
    ret < 0 && utf8proc_error(ret)
    return ret
end

function equal_decomposed(s1::AbstractString, s2::AbstractString; casefold::Bool=false, stripmark::Bool=false)
    function decompose_next_char!(c, state, d, options, s)
        n = decompose_char!(c, d, options)
        if n > length(d) # may be possible in future Unicode versions?
            n = decompose_char!(c, resize!(d, n), options)
        end
        return 1, n, iterate(s, state)
    end
    options = UTF8PROC_DECOMPOSE
    casefold && (options |= UTF8PROC_CASEFOLD)
    stripmark && (options |= UTF8PROC_STRIPMARK)
    i1,i2 = iterate(s1),iterate(s2)
    d1,d2 = Vector{UInt32}(undef, 4), Vector{UInt32}(undef, 4) # codepoint buffers
    n1 = n2 = 0 # lengths of codepoint buffers
    j1 = j2 = 1 # indices in d1, d2
    while true
        if j1 > n1
            i1 === nothing && return i2 === nothing && j2 > n2
            j1, n1, i1 = decompose_next_char!(UInt32(i1[1]), i1[2], d1, options, s1)
        end
        if j2 > n2
            i2 === nothing && return false
            j2, n2, i2 = decompose_next_char!(UInt32(i2[1]), i2[2], d2, options, s2)
        end
        d1[j1] == d2[j2] || return false
        j1 += 1; j2 += 1
    end
end

Currently the performance advantage over simply doing normalize_string(a, :NFC) == normalize_string(b, :NFC) is not as great as I would like, just a factor of two. Probably it could be sped up in the common case of ASCII strings, at least.

@stevengj stevengj added the unicode Related to unicode characters and encodings label May 27, 2021
@laborg
Copy link
Contributor

laborg commented Feb 10, 2022

I guess this can be closed now that #42493 has landed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

No branches or pull requests

2 participants