RegexQuery Exception in Parsing Regular Expressions. #2287

Closed
MochiXu opened this issue Dec 21, 2023 · 9 comments · Fixed by #2288

@MochiXu
Contributor

MochiXu commented Dec 21, 2023

During the integration of Tantivy into ClickHouse, I encountered a need to adapt SQL's LIKE syntax. This led me to consider using Tantivy's RegexQuery. The proposed solution involves converting the LIKE matching strings from SQL statements into standard regular expressions, and then using Tantivy's RegexQuery to search for these expressions.
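
A minimal sketch of the LIKE-to-regex conversion described above, using only the Rust standard library. The helper names `like_to_regex` and `push_literal` are hypothetical and not part of Tantivy or ClickHouse. It assumes `%` means "any run of characters", `_` means "any single character", and a backslash escapes the next character. A literal `%` is emitted unescaped, since `%` is not a regex metacharacter.

```rust
/// Hypothetical helper: translate a SQL LIKE pattern into a regex string.
/// Assumes `\` is the LIKE escape character.
fn like_to_regex(like: &str) -> String {
    let mut out = String::with_capacity(like.len() * 2);
    let mut chars = like.chars();
    while let Some(c) = chars.next() {
        match c {
            // LIKE wildcards map to their regex equivalents.
            '%' => out.push_str(".*"),
            '_' => out.push('.'),
            // An escaped character is always treated as a literal.
            '\\' => match chars.next() {
                Some(next) => push_literal(&mut out, next),
                None => push_literal(&mut out, '\\'),
            },
            other => push_literal(&mut out, other),
        }
    }
    out
}

/// Append `c` as a literal, escaping only common regex metacharacters.
fn push_literal(out: &mut String, c: char) {
    const META: &str = r"\.+*?()|[]{}^$";
    if META.contains(c) {
        out.push('\\');
    }
    out.push(c);
}

fn main() {
    for like in ["%\\%%", "camp_aigns", "a.b%"] {
        println!("{:?} -> {:?}", like, like_to_regex(like));
    }
}
```

Note that in most regex dialects `.` does not match a newline by default, whereas LIKE's `_` and `%` usually do, so this sketch is only faithful for single-line values.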

However, I have come across an issue. While certain regular expressions run effortlessly in Python, they fail to execute in Tantivy.

Here's a specific example:

I aim to find all strings containing the percentage symbol (%). In Python, this task is straightforward, but it seems to encounter difficulties when attempted in Tantivy.

import re

docs = [
"Ancient empires rise and fall, shaping history's course.",
"Artistic expressions reflect diverse cultural heritages.",
"Social movements transform societies, forging new paths.",
"Economies fluctuate, % reflecting the complex interplay of global forces.",
"Strategic military camp%aigns alter the balance of power.",
]
compiled_regex = re.compile(".*\\%.*")
for line in docs:
    if compiled_regex.match(line):
        print(line)

========== terminal ==========
Economies fluctuate, % reflecting the complex interplay of global forces.
Strategic military camp%aigns alter the balance of power.

Process finished with exit code 0

use tantivy::collector::Count;
use tantivy::query::RegexQuery;
use tantivy::schema::{Schema, INDEXED, TEXT, FAST};
use tantivy::{Index, Document};


fn main() {
    let mut schema_builder = Schema::builder();
    let row_id = schema_builder.add_u64_field("row_id", FAST|INDEXED);
    let text = schema_builder.add_text_field("text", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);

    let mut index_writer = index.writer_with_num_threads(8, 1024 * 1024 * 1024).unwrap();

    let str_vec: Vec<String> = vec![
        "Ancient empires rise and fall, shaping history's course.".to_string(),
        "Artistic expressions reflect diverse cultural heritages.".to_string(),
        "Social movements transform societies, forging new paths.".to_string(),
        "Economies fluctuate, % reflecting the complex interplay of global forces.".to_string(),
        "Strategic military camp%aigns alter the balance of power.".to_string(),
        ];
    
    for i in 0..str_vec.len() {
        let mut temp = Document::default();
        temp.add_u64(row_id, i as u64);
        temp.add_text(text, &str_vec[i]);
        let _ = index_writer.add_document(temp);
    }
    index_writer.commit().unwrap();

    let reader = index.reader().unwrap();
    let searcher = reader.searcher();
    let regex_query = RegexQuery::from_pattern(".*\\%.*", text).unwrap();
    let regex_result = searcher.search(&regex_query, &Count).expect("failed to search");

    println!("regex query result count:{:?}", regex_result);
}

========== terminal ==========
thread 'main' panicked at tests/regex_search.rs:37:65:
called `Result::unwrap()` on an `Err` value: InvalidArgument(".*\\%.*")
stack backtrace:
   0: rust_begin_unwind
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:597:5
   1: core::panicking::panic_fmt
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/result.rs:1652:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/result.rs:1077:23
   4: regrex_search::main
             at ./tests/regrex_search.rs:37:23
   5: core::ops::function::FnOnce::call_once
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I would like to understand the similarities and differences between the regular expressions used in RegexQuery and standard regular expressions. Where can I find information on this topic?

@adamreichold
Collaborator

Tantivy uses tantivy_fst::Regex, whose docs contain some information on how its syntax and matching differ from the usual regex semantics.

@adamreichold
Collaborator

Using only

fn main() {
    tantivy_fst::Regex::new(".*\\%.*").unwrap();
}

as a test program, I cannot reproduce the above error though.

@MochiXu
Contributor Author

MochiXu commented Dec 22, 2023

It seems that the version of tantivy I am using is 0.21.1, which depends on tantivy-fst version 0.4.0. I have confirmed that tantivy-fst version 0.5.0 can correctly parse this regular expression, while version 0.4.0 throws an error. Could you tell me which versions of tantivy use tantivy-fst version 0.5.0? I am considering upgrading my tantivy version to resolve this issue.
@adamreichold

@adamreichold
Collaborator

I don't think there is a released version of tantivy yet which uses tantivy-fst@0.5.0. That said, using tantivy-fst@0.4.0, I was able to reproduce your error, i.e.

thread 'main' panicked at src/main.rs:2:40:
called `Result::unwrap()` on an `Err` value: Syntax(Parse(Error { kind: EscapeUnrecognized, pattern: ".*\\%.*", span: Span(Position(o: 2, l: 1, c: 3), Position(o: 4, l: 1, c: 5)) }))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

and I think it should be fixed by using the pattern ".*%.*" instead.

@MochiXu
Contributor Author

MochiXu commented Dec 22, 2023

Although version 0.4.0 of tantivy-fst can parse ".*%.*", it fails to find any results during the actual string matching process.

@adamreichold
Collaborator

Ok, to avoid chasing fixed bugs, could you try a Git dependency against this repo here to see whether that works using the original pattern?

@adamreichold
Collaborator

And one other idea might be to use an escape sequence for the code point, e.g. \u0025.

@MochiXu
Contributor Author

MochiXu commented Dec 22, 2023

I tried using the 'raw' tokenizer for indexing strings and found that I could successfully search for results in version 0.21.1 of Tantivy. Could it be that the default tokenizer filters out these special characters and symbols?

use tantivy::collector::Count;
use tantivy::query::RegexQuery;
use tantivy::schema::{Schema, INDEXED, TEXT, FAST, TextOptions, TextFieldIndexing, IndexRecordOption};
use tantivy::{Index, Document};


fn main() {
    let mut schema_builder = Schema::builder();

    let text_options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer("raw")
                .set_index_option(IndexRecordOption::Basic)
    );

    let row_id = schema_builder.add_u64_field("row_id", FAST|INDEXED);
    let text = schema_builder.add_text_field("text", text_options);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);



    let mut index_writer = index.writer_with_num_threads(8, 1024 * 1024 * 1024).unwrap();

    let str_vec: Vec<String> = vec![
        "Ancient empires rise and fall, shaping 🐶history's course.".to_string(),
        "Artistic expres🐶sions reflect diverse cultural heritages.".to_string(),
        "Social movements transform societies, forging new paths.".to_string(),
        "Economies🐶 fluctuate, % reflecting the complex interplay of global forces.".to_string(),
        "Strategic military 🐶%🐈 camp%aigns alter the bala🚀nce of power.".to_string(),
        ];
    
    for i in 0..str_vec.len() {
        let mut temp = Document::default();
        temp.add_u64(row_id, i as u64);
        temp.add_text(text, &str_vec[i]);
        let _ = index_writer.add_document(temp);
    }
    index_writer.commit().unwrap();

    let reader = index.reader().unwrap();
    let searcher = reader.searcher();

    let cat_regex_query = RegexQuery::from_pattern(".*🐈.*", text).unwrap();
    let dog_regex_query = RegexQuery::from_pattern(".*🐶.*", text).unwrap();
    let percent_regex_query = RegexQuery::from_pattern(".*%.*", text).unwrap();

    let cat_regex_result = searcher.search(&cat_regex_query, &Count).expect("failed to search");
    let dog_regex_result = searcher.search(&dog_regex_query, &Count).expect("failed to search");
    let percent_regex_result = searcher.search(&percent_regex_query, &Count).expect("failed to search");

    println!("cat regex result count:{:?}", cat_regex_result);
    println!("dog result count:{:?}", dog_regex_result);
    println!("'%' result count:{:?}", percent_regex_result);
}
========== terminal ==========
Finished release [optimized + debuginfo] target(s) in 2.21s

cat regex result count:1
dog result count:4
'%' result count:2

@adamreichold
Collaborator

Indeed, the default SimpleTokenizer splits text using char::is_alphanumeric, which treats '%' as punctuation, so the character never ends up in an indexed term, cf. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=9ba1cda49d001fa10e00a2d631337ace
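
To illustrate the point, here is a rough standard-library sketch of what splitting on non-alphanumeric characters does to one of the sample documents. The function `split_alphanumeric` is a hypothetical stand-in and not Tantivy's actual tokenizer code; it ignores any further filters (such as lowercasing) that the default tokenizer pipeline may apply.

```rust
/// Hypothetical stand-in for a tokenizer that splits on every
/// character for which `char::is_alphanumeric` returns false.
fn split_alphanumeric(text: &str) -> Vec<&str> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|tok| !tok.is_empty())
        .collect()
}

fn main() {
    let doc = "Strategic military 🐶%🐈 camp%aigns alter";

    // '%' and the emoji are not alphanumeric, so they act as separators.
    println!("'%' alphanumeric: {}", '%'.is_alphanumeric());
    println!("'🐶' alphanumeric: {}", '🐶'.is_alphanumeric());

    // Word-like tokens: none of them can contain '%'.
    let tokens = split_alphanumeric(doc);
    println!("tokens: {:?}", tokens);
    let hits = tokens.iter().filter(|t| t.contains('%')).count();
    println!("tokens containing '%': {}", hits);

    // With a "raw"-style tokenizer the whole string is a single term,
    // so a pattern like `.*%.*` has something to match against.
    println!("whole string contains '%': {}", doc.contains('%'));
}
```

Under these assumptions, a regex is only ever compared against individual terms, which is why `.*%.*` finds nothing on a field indexed with the default tokenizer but works once the field uses the "raw" tokenizer.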
