RegexQuery Exception in Parsing Regular Expressions. #2287
Comments
Tantivy uses |
Using only

    fn main() {
        tantivy_fst::Regex::new(".*\\%.*").unwrap();
    }

as a test program, I cannot reproduce the above error though. |
It seems that the version of |
I don't think there is a released version of

    thread 'main' panicked at src/main.rs:2:40:
    called `Result::unwrap()` on an `Err` value: Syntax(Parse(Error { kind: EscapeUnrecognized, pattern: ".*\\%.*", span: Span(Position(o: 2, l: 1, c: 3), Position(o: 4, l: 1, c: 5)) }))
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

and I think it should be fixed by using the pattern |
Although version |
Ok, to avoid chasing fixed bugs, could you try a Git dependency against this repo here to see whether that works using the original pattern? |
And one other idea might be to use an escape sequence for the code point, e.g. |
I tried using the 'raw' tokenizer for indexing strings and found that I could successfully search for results in version 0.21.1 of Tantivy. Could it be that the default tokenizer filters out these special characters and symbols?

    use tantivy::collector::Count;
    use tantivy::query::RegexQuery;
    use tantivy::schema::{Schema, INDEXED, TEXT, FAST, TextOptions, TextFieldIndexing, IndexRecordOption};
    use tantivy::{Index, Document};

    fn main() {
        let mut schema_builder = Schema::builder();
        let text_options = TextOptions::default()
            .set_indexing_options(
                TextFieldIndexing::default()
                    .set_tokenizer("raw")
                    .set_index_option(IndexRecordOption::Basic)
            );
        let row_id = schema_builder.add_u64_field("row_id", FAST|INDEXED);
        let text = schema_builder.add_text_field("text", text_options);
        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema);
        let mut index_writer = index.writer_with_num_threads(8, 1024 * 1024 * 1024).unwrap();
        let str_vec: Vec<String> = vec![
            "Ancient empires rise and fall, shaping 🐶history's course.".to_string(),
            "Artistic expres🐶sions reflect diverse cultural heritages.".to_string(),
            "Social movements transform societies, forging new paths.".to_string(),
            "Economies🐶 fluctuate, % reflecting the complex interplay of global forces.".to_string(),
            "Strategic military 🐶%🐈 camp%aigns alter the bala🚀nce of power.".to_string(),
        ];
        for i in 0..str_vec.len() {
            let mut temp = Document::default();
            temp.add_u64(row_id, i as u64);
            temp.add_text(text, &str_vec[i]);
            let _ = index_writer.add_document(temp);
        }
        index_writer.commit().unwrap();
        let reader = index.reader().unwrap();
        let searcher = reader.searcher();
        let cat_regex_query = RegexQuery::from_pattern(".*🐈.*", text).unwrap();
        let dog_regex_query = RegexQuery::from_pattern(".*🐶.*", text).unwrap();
        let percent_regex_query = RegexQuery::from_pattern(".*%.*", text).unwrap();
        let cat_regex_result = searcher.search(&cat_regex_query, &Count).expect("failed to search");
        let dog_regex_result = searcher.search(&dog_regex_query, &Count).expect("failed to search");
        let percent_regex_result = searcher.search(&percent_regex_query, &Count).expect("failed to search");
        println!("cat regex result count:{:?}", cat_regex_result);
        println!("dog result count:{:?}", dog_regex_result);
        println!("'%' result count:{:?}", percent_regex_result);
    }

========== terminal ==========
Finished release [optimized + debuginfo] target(s) in 2.21s
cat regex result count:1
dog result count:4
'%' result count:2 |
Indeed the default |
During the integration of Tantivy into ClickHouse, I encountered a need to adapt SQL's LIKE syntax. This led me to consider using Tantivy's RegexQuery. The proposed solution involves converting the LIKE matching strings from SQL statements into standard regular expressions, and then using Tantivy's RegexQuery to search for these expressions.

However, I have come across an issue. While certain regular expressions run effortlessly in Python, they fail to execute in Tantivy.

Here's a specific example:
I aim to find all strings containing the percentage symbol (%). In Python, this task is straightforward, but it seems to encounter difficulties when attempted in Tantivy.

I would like to understand the similarities and differences between the regular expressions used in RegexQuery and standard regular expressions. Where can I find information on this topic?
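
For the LIKE-to-regex conversion described in this issue, a minimal sketch might look like the following. It is not taken from Tantivy or ClickHouse: the helper names `like_to_regex` and `push_literal` are hypothetical, and it assumes that `\` is the LIKE escape character and that the resulting regex is matched against the whole term. The point relevant to this thread is that `%` is not a regex metacharacter, so a literal `%` is emitted unescaped instead of as `\%`, the form that was rejected with `EscapeUnrecognized` in the panic output quoted in the comments.

```rust
/// Converts a SQL LIKE pattern into a regular expression string.
///
/// Assumptions (not from the thread): `\` is the LIKE escape character,
/// and the regex engine matches the pattern against the whole term,
/// so no explicit `^`/`$` anchors are added.
fn like_to_regex(like: &str) -> String {
    let mut regex = String::with_capacity(like.len() * 2);
    let mut chars = like.chars();
    while let Some(c) = chars.next() {
        match c {
            // LIKE wildcard: any sequence of characters.
            '%' => regex.push_str(".*"),
            // LIKE wildcard: exactly one character.
            '_' => regex.push('.'),
            // LIKE escape: the next character is taken literally.
            // A trailing backslash is treated as a literal backslash.
            '\\' => {
                let literal = chars.next().unwrap_or('\\');
                push_literal(&mut regex, literal);
            }
            other => push_literal(&mut regex, other),
        }
    }
    regex
}

/// Appends `c` as a literal, adding a backslash only for characters that
/// are regex metacharacters. Characters such as `%` are deliberately left
/// unescaped, since some parsers reject unnecessary escapes.
fn push_literal(regex: &mut String, c: char) {
    const META: &str = r"\.+*?()|[]{}^$";
    if META.contains(c) {
        regex.push('\\');
    }
    regex.push(c);
}
```

Under these assumptions, the LIKE pattern `%\%%` ("contains a percent sign") becomes `.*%.*`, the same pattern the 'raw' tokenizer example in the comments passes to `RegexQuery::from_pattern`. Note that `.` may not match newlines in a given regex engine, whereas LIKE wildcards typically do, so multi-line values would need separate handling.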