RegexQuery Exception in Parsing Regular Expressions. #2287

MochiXu · 2023-12-21T12:09:58Z

During the integration of Tantivy into ClickHouse, I encountered a need to adapt SQL's LIKE syntax. This led me to consider using Tantivy's RegexQuery. The proposed solution involves converting the LIKE matching strings from SQL statements into standard regular expressions, and then using Tantivy's RegexQuery to search for these expressions.
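As a rough illustration of the conversion described above, the sketch below translates a LIKE pattern into a regex string. The helper name `like_to_regex` and the `REGEX_META` set are hypothetical, and backslash is assumed to be the LIKE escape character. Only characters that are regex metacharacters receive a backslash. Python's `re.fullmatch` stands in for the matcher here; the syntax and anchoring rules of RegexQuery may differ.

```python
import re

# Characters assumed to need a backslash in common regex dialects.
REGEX_META = set(".^$*+?()[]{}|\\")


def like_to_regex(like_pattern: str) -> str:
    """Translate a SQL LIKE pattern into a regex string (hypothetical helper)."""
    out = []
    i = 0
    while i < len(like_pattern):
        ch = like_pattern[i]
        if ch == "\\" and i + 1 < len(like_pattern):
            # Backslash escape in LIKE: take the next character literally.
            i += 1
            ch = like_pattern[i]
            out.append("\\" + ch if ch in REGEX_META else ch)
        elif ch == "%":
            out.append(".*")  # LIKE '%' matches any run of characters
        elif ch == "_":
            out.append(".")   # LIKE '_' matches exactly one character
        elif ch in REGEX_META:
            out.append("\\" + ch)
        else:
            out.append(ch)
        i += 1
    return "".join(out)


# LIKE '%\%%' means "contains a literal percent sign".
print(like_to_regex("%\\%%"))  # .*%.*
```

Escaping only known metacharacters, rather than every non-alphanumeric character, keeps the output acceptable to stricter regex parsers that reject unrecognized escapes.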

However, I have come across an issue. While certain regular expressions run effortlessly in Python, they fail to execute in Tantivy.

Here's a specific example:

I aim to find all strings containing the percentage symbol (%). In Python, this task is straightforward, but it seems to encounter difficulties when attempted in Tantivy.

import re

docs = [
    "Ancient empires rise and fall, shaping history's course.",
    "Artistic expressions reflect diverse cultural heritages.",
    "Social movements transform societies, forging new paths.",
    "Economies fluctuate, % reflecting the complex interplay of global forces.",
    "Strategic military camp%aigns alter the balance of power.",
]
# The string ".*\\%.*" is the regex .*\%.* (any line containing a literal '%').
compiled_regex = re.compile(".*\\%.*")
for line in docs:
    if compiled_regex.match(line):
        print(line)

========== terminal ==========
Economies fluctuate, % reflecting the complex interplay of global forces.
Strategic military camp%aigns alter the balance of power.

Process finished with exit code 0

use tantivy::collector::Count;
use tantivy::query::RegexQuery;
use tantivy::schema::{Schema, INDEXED, TEXT, FAST};
use tantivy::{Index, Document};


fn main() {
    let mut schema_builder = Schema::builder();
    let row_id = schema_builder.add_u64_field("row_id", FAST|INDEXED);
    let text = schema_builder.add_text_field("text", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);

    let mut index_writer = index.writer_with_num_threads(8, 1024 * 1024 * 1024).unwrap();

    let str_vec: Vec<String> = vec![
        "Ancient empires rise and fall, shaping history's course.".to_string(),
        "Artistic expressions reflect diverse cultural heritages.".to_string(),
        "Social movements transform societies, forging new paths.".to_string(),
        "Economies fluctuate, % reflecting the complex interplay of global forces.".to_string(),
        "Strategic military camp%aigns alter the balance of power.".to_string(),
        ];
    
    for i in 0..str_vec.len() {
        let mut temp = Document::default();
        temp.add_u64(row_id, i as u64);
        temp.add_text(text, &str_vec[i]);
        let _ = index_writer.add_document(temp);
    }
    index_writer.commit().unwrap();

    let reader = index.reader().unwrap();
    let searcher = reader.searcher();
    let regex_query = RegexQuery::from_pattern(".*\\%.*", text).unwrap();
    let regex_result = searcher.search(&regex_query, &Count).expect("failed to search");

    println!("regex query result count:{:?}", regex_result);
}

========== terminal ==========
thread 'main' panicked at tests/regex_search.rs:37:65:
called `Result::unwrap()` on an `Err` value: InvalidArgument(".*\\%.*")
stack backtrace:
   0: rust_begin_unwind
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:597:5
   1: core::panicking::panic_fmt
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/result.rs:1652:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/result.rs:1077:23
   4: regrex_search::main
             at ./tests/regrex_search.rs:37:23
   5: core::ops::function::FnOnce::call_once
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I would like to understand the similarities and differences between the regular expressions used in RegexQuery and standard regular expressions. Where can I find information on this topic?

adamreichold · 2023-12-21T12:22:13Z

Tantivy uses tantivy_fst::Regex, the docs of which contain some information on how its syntax and matching differ from the usual regex semantics.

adamreichold · 2023-12-21T12:28:37Z

Using only

fn main() {
    tantivy_fst::Regex::new(".*\\%.*").unwrap();
}

as a test program, I cannot reproduce the above error though.

MochiXu · 2023-12-22T08:23:41Z

It seems that the version of tantivy I am using is 0.21.1, which depends on tantivy-fst version 0.4.0. I have confirmed that tantivy-fst version 0.5.0 can correctly parse this regular expression, while version 0.4.0 throws an error. Could you tell me which versions of tantivy use tantivy-fst version 0.5.0? I am considering upgrading my tantivy version to resolve this issue.
@adamreichold

adamreichold · 2023-12-22T08:38:53Z

I don't think there is a released version of tantivy yet which uses tantivy-fst@0.5.0. That said, using tantivy-fst@0.4.0, I was able to reproduce your error, i.e.

thread 'main' panicked at src/main.rs:2:40:
called `Result::unwrap()` on an `Err` value: Syntax(Parse(Error { kind: EscapeUnrecognized, pattern: ".*\\%.*", span: Span(Position(o: 2, l: 1, c: 3), Position(o: 4, l: 1, c: 5)) }))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

and I think it should be fixed by using the pattern ".*%.*" instead.

MochiXu · 2023-12-22T08:58:37Z

Although version 0.4.0 of tantivy-fst can parse ".*%.*", it fails to find any results during the actual string matching process.

adamreichold · 2023-12-22T09:03:08Z

Ok, to avoid chasing fixed bugs, could you try a Git dependency against this repo here to see whether that works using the original pattern?

adamreichold · 2023-12-22T09:04:20Z

And one other idea might be to use an escape sequence for the code point, e.g. \u0025.
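
For comparison, a small sketch of the three spellings discussed so far (backslash-escaped, plain, and code-point escape), checked with Python's `re` module. This only shows that they are equivalent in Python; whether tantivy-fst 0.4.0 accepts the `\u0025` form is not confirmed in this thread. The names `patterns` and `matching_lines` are illustrative.

```python
import re

lines = [
    "Economies fluctuate, % reflecting the complex interplay of global forces.",
    "Social movements transform societies, forging new paths.",
    "Strategic military camp%aigns alter the balance of power.",
]

# Three spellings of "contains a percent sign".
patterns = {
    "escaped": r".*\%.*",
    "plain": r".*%.*",
    "code point": r".*\u0025.*",
}


def matching_lines(pattern):
    """Return the lines that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.fullmatch(line)]


results = {name: matching_lines(p) for name, p in patterns.items()}
for name, matched in results.items():
    print(f"{name}: {len(matched)} matching lines")
```

Python tolerates a backslash before punctuation such as `%`, which is why the original pattern compiles there while a stricter parser reports it as an unrecognized escape.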

MochiXu · 2023-12-22T10:11:05Z

I tried using the 'raw' tokenizer for indexing strings and found that I could successfully search for results in version 0.21.1 of Tantivy. Could it be that the default tokenizer filters out these special characters and symbols?

use tantivy::collector::Count;
use tantivy::query::RegexQuery;
use tantivy::schema::{Schema, INDEXED, TEXT, FAST, TextOptions, TextFieldIndexing, IndexRecordOption};
use tantivy::{Index, Document};


fn main() {
    let mut schema_builder = Schema::builder();

    let text_options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_tokenizer("raw")
                .set_index_option(IndexRecordOption::Basic)
    );

    let row_id = schema_builder.add_u64_field("row_id", FAST|INDEXED);
    let text = schema_builder.add_text_field("text", text_options);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);



    let mut index_writer = index.writer_with_num_threads(8, 1024 * 1024 * 1024).unwrap();

    let str_vec: Vec<String> = vec![
        "Ancient empires rise and fall, shaping 🐶history's course.".to_string(),
        "Artistic expres🐶sions reflect diverse cultural heritages.".to_string(),
        "Social movements transform societies, forging new paths.".to_string(),
        "Economies🐶 fluctuate, % reflecting the complex interplay of global forces.".to_string(),
        "Strategic military 🐶%🐈 camp%aigns alter the bala🚀nce of power.".to_string(),
        ];
    
    for i in 0..str_vec.len() {
        let mut temp = Document::default();
        temp.add_u64(row_id, i as u64);
        temp.add_text(text, &str_vec[i]);
        let _ = index_writer.add_document(temp);
    }
    index_writer.commit().unwrap();

    let reader = index.reader().unwrap();
    let searcher = reader.searcher();

    let cat_regex_query = RegexQuery::from_pattern(".*🐈.*", text).unwrap();
    let dog_regex_query = RegexQuery::from_pattern(".*🐶.*", text).unwrap();
    let percent_regex_query = RegexQuery::from_pattern(".*%.*", text).unwrap();

    let cat_regex_result = searcher.search(&cat_regex_query, &Count).expect("failed to search");
    let dog_regex_result = searcher.search(&dog_regex_query, &Count).expect("failed to search");
    let percent_regex_result = searcher.search(&percent_regex_query, &Count).expect("failed to search");

    println!("cat regex result count:{:?}", cat_regex_result);
    println!("dog result count:{:?}", dog_regex_result);
    println!("'%' result count:{:?}", percent_regex_result);
}
========== terminal ==========
Finished release [optimized + debuginfo] target(s) in 2.21s

cat regex result count:1
dog result count:4
'%' result count:2

adamreichold · 2023-12-22T10:37:17Z

Indeed, the default SimpleTokenizer splits text using char::is_alphanumeric, which returns false for '%', so the character is treated as a separator and never ends up in an indexed term, cf. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=9ba1cda49d001fa10e00a2d631337ace
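
To make the effect concrete: RegexQuery matches the pattern against indexed terms, not against the original document text, so a character that the tokenizer drops can never be found. The sketch below approximates the splitting behaviour in Python. `simple_tokens` and `raw_tokens` are hypothetical stand-ins, not Tantivy APIs, and Python's `str.isalnum` is assumed to agree with Rust's `char::is_alphanumeric` only for the characters used here.

```python
from typing import List


def simple_tokens(text: str) -> List[str]:
    """Split on every non-alphanumeric character (approximation of SimpleTokenizer)."""
    tokens, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif current:
            # A separator ends the current token and is itself discarded.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def raw_tokens(text: str) -> List[str]:
    """Keep the whole input as a single term (approximation of the 'raw' tokenizer)."""
    return [text]


doc = "Strategic military camp%aigns alter the balance of power."
print(simple_tokens(doc))
print(any("%" in term for term in simple_tokens(doc)))  # False
print(any("%" in term for term in raw_tokens(doc)))     # True
```

With the splitting tokenizer, "camp%aigns" becomes the two terms "camp" and "aigns", so no term contains '%'. With the raw tokenizer the entire string is one term and `.*%.*` can match it.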

adamreichold mentioned this issue Dec 22, 2023

Forward regex parser errors to enable understanding their reason. #2288

Merged

PSeitz closed this as completed in #2288 Dec 22, 2023
