Regex::find_from_utf16/ucs2 incorrectly matching against bytes instead of u16 words #100

Closed
jedel1043 opened this issue Jan 5, 2025 · 1 comment
@jedel1043
Contributor

jedel1043 commented Jan 5, 2025

The find functions for UTF-16 and UCS-2 strings incorrectly match against individual bytes instead of whole u16 words in some cases.

Reproducer

use regress::Regex;

fn main() {
    let input = "赔".encode_utf16().collect::<Vec<_>>(); // U+8D54
    let re = Regex::new(r"[A-Z]").unwrap(); // 0x41 - 0x5A

    let matched = re.find_from_utf16(&input, 0).collect::<Vec<_>>();
    dbg!(matched.is_empty()); // prints false; expected true, since "赔" is not in [A-Z]

    let matched = re.find_from_ucs2(&input, 0).collect::<Vec<_>>();
    dbg!(matched.is_empty()); // prints false; expected true for the same reason
}

In this case, the regex [A-Z] is interpreting "赔" (the single u16 word 0x8D54) as the two bytes [0x54, 0x8D], and the byte 0x54 matches as the "T" character.
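To make the byte-level misreading concrete, here is a minimal standalone sketch that uses only the standard library. It is not regress's actual matching code. Both helper functions are hypothetical and exist only to contrast comparing a whole u16 word against a range with comparing its individual little-endian bytes, which is the behavior the reproducer suggests.

```rust
use std::ops::RangeInclusive;

/// Hypothetical helper: the expected behavior, comparing the whole u16 word.
fn unit_in_range(unit: u16, range: &RangeInclusive<u16>) -> bool {
    range.contains(&unit)
}

/// Hypothetical helper: mimics the suspected bug by testing each
/// little-endian byte of the word against the range on its own.
fn any_byte_in_range(unit: u16, range: &RangeInclusive<u16>) -> bool {
    unit.to_le_bytes()
        .iter()
        .any(|&b| range.contains(&u16::from(b)))
}

fn main() {
    // The class [A-Z] expressed as a range of u16 values (0x41..=0x5A).
    let upper = u16::from(b'A')..=u16::from(b'Z');

    for unit in "赔".encode_utf16() {
        println!("word = {:#06X}, le bytes = {:02X?}", unit, unit.to_le_bytes());
        println!("  whole-word match: {}", unit_in_range(unit, &upper));
        println!("  per-byte match:   {}", any_byte_in_range(unit, &upper));
    }
}
```

The whole-word comparison rejects 0x8D54, while the per-byte comparison accepts it because the low byte 0x54 is "T".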

@ridiculousfish
Owner

Yikes, great find.
