v0.19.1 consumes much more memory than v0.18 #1820
Thanks, @Enter-tainer, for the report; that's unexpected indeed. Can you also share the schema of the index?
```rust
use std::sync::Arc;

use cang_jie::{CangJieTokenizer, TokenizerOption};
use miette::{IntoDiagnostic, Result};
use once_cell::sync::Lazy;
use tantivy::{
    directory::MmapDirectory,
    schema::{
        IndexRecordOption, Schema, SchemaBuilder, TextFieldIndexing, TextOptions, FAST, STORED,
    },
    Index,
};

pub fn get_schema() -> Schema {
    static SCHEMA: Lazy<Schema> = Lazy::new(|| {
        let zh_text_indexing = TextFieldIndexing::default()
            .set_tokenizer("jieba") // Set custom tokenizer
            .set_index_option(IndexRecordOption::WithFreqsAndPositions);
        let zh_text_options = TextOptions::default()
            .set_indexing_options(zh_text_indexing)
            .set_stored();
        let mut builder = SchemaBuilder::new();
        builder.add_i64_field("id", STORED | FAST);
        builder.add_text_field("text", zh_text_options);
        builder.add_text_field("sender_name", STORED);
        builder.add_date_field("send_time", STORED);
        builder.build()
    });
    SCHEMA.clone()
}

pub fn create_or_open_index(path: &str) -> Result<Index> {
    let dir = MmapDirectory::open(path).into_diagnostic()?;
    let index = Index::open_or_create(dir, get_schema()).into_diagnostic()?;
    let tokenizer = CangJieTokenizer {
        worker: Arc::new(jieba_rs::Jieba::new()),
        option: TokenizerOption::Unicode,
    };
    index.tokenizers().register("jieba", tokenizer);
    Ok(index)
}
```
```toml
[dependencies]
cang-jie = "0.14.0"
jieba-rs = { version = "0.6", features = ["tfidf", "textrank"] }
miette = { version = "5.5.0", features = ["fancy"] }
once_cell = "1.17.0"
tantivy = "0.18" # I'm using v0.18 now, so v0.18 here.
```

A typical entry may look like:

```json
{
  "id": 308,
  "text": "It would be OK if the driver of mouse and keyboard are working.",
  "send_time": 1502877989,
  "sender": "SOMEONE"
}
```
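For illustration (not part of the original report), here is a minimal sketch of how one such entry could be written to the index built above. It assumes a 0.19-style API (`doc!`, `DateTime::from_timestamp_secs`, an `Option`-returning `get_field`), an illustrative 50 MB writer heap, and that the JSON `sender` value goes into the schema's `sender_name` field:

```rust
use tantivy::{doc, DateTime, Index};

// Hypothetical helper: index a single chat entry against the schema above.
fn index_sample_entry(index: &Index) -> tantivy::Result<()> {
    let schema = index.schema();
    let id = schema.get_field("id").unwrap();
    let text = schema.get_field("text").unwrap();
    let sender_name = schema.get_field("sender_name").unwrap();
    let send_time = schema.get_field("send_time").unwrap();

    // 50 MB indexing budget; the writer flushes segments once it is reached.
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(
        id => 308_i64,
        text => "It would be OK if the driver of mouse and keyboard are working.",
        sender_name => "SOMEONE",
        send_time => DateTime::from_timestamp_secs(1502877989)
    ))?;
    writer.commit()?;
    Ok(())
}
```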
I'm not sure if this is related to the custom tokenizer or something else, but after I switched to v0.18, the memory consumption (RSS) became ~90M. That looks good to me.
What's your parameter for `doc_store_cache_size`?
Oh, I thought it was in bytes! I think maybe that is the problem. I also found that v0.18 doesn't have `doc_store_cache_size`. Thank you again for your kind reply!
Thank you for the update!
* fix doc store cache docs (addresses an issue reported in #1820)
* rename doc_store_cache_size
Confirmed. I switched to a smaller cache size and it works.
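For reference, a minimal sketch of what "a smaller cache size" could look like, assuming the v0.19 `IndexReaderBuilder::doc_store_cache_size` setter (its value counts cached doc-store blocks, not bytes); the helper name and the value `1` are illustrative:

```rust
use tantivy::{Index, IndexReader, ReloadPolicy};

// Hypothetical helper: open a reader that keeps only one decompressed
// doc-store block in its cache instead of the default.
fn open_reader_with_small_cache(index: &Index) -> tantivy::Result<IndexReader> {
    index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .doc_store_cache_size(1) // number of cached blocks, not bytes
        .try_into()
}
```

Note that the follow-up fix referenced above renames `doc_store_cache_size`, so the setter's name may differ in later tantivy versions.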
Describe the bug
When building a chat history searcher using tantivy v0.19.1, I found that it takes quite a lot of memory (~800M) for only 80k entries. I used bytehound to analyze the memory usage, but I cannot figure out why. I tried v0.18 and it only takes ~80M.
To Reproduce
I cannot share my code because it contains sensitive data. But I will provide the flamegraph. Hope that helps.
Full backtrace for the most memory-consuming part: