v0.19.1 consumes much more memory than v0.18 #1820
Thanks, @Enter-tainer, for the report; that's unexpected indeed. Can you also share the schema of the index?
```rust
use std::sync::Arc;

use cang_jie::{CangJieTokenizer, TokenizerOption};
use miette::{IntoDiagnostic, Result};
use once_cell::sync::Lazy;
use tantivy::{
    directory::MmapDirectory,
    schema::{
        IndexRecordOption, Schema, SchemaBuilder, TextFieldIndexing, TextOptions, FAST, STORED,
    },
    Index,
};

pub fn get_schema() -> Schema {
    static SCHEMA: Lazy<Schema> = Lazy::new(|| {
        let zh_text_indexing = TextFieldIndexing::default()
            .set_tokenizer("jieba") // Set custom tokenizer
            .set_index_option(IndexRecordOption::WithFreqsAndPositions);
        let zh_text_options = TextOptions::default()
            .set_indexing_options(zh_text_indexing)
            .set_stored();
        let mut builder = SchemaBuilder::new();
        builder.add_i64_field("id", STORED | FAST);
        builder.add_text_field("text", zh_text_options);
        builder.add_text_field("sender_name", STORED);
        builder.add_date_field("send_time", STORED);
        builder.build()
    });
    SCHEMA.clone()
}

pub fn create_or_open_index(path: &str) -> Result<Index> {
    let dir = MmapDirectory::open(path).into_diagnostic()?;
    let index = Index::open_or_create(dir, get_schema()).into_diagnostic()?;
    let tokenizer = CangJieTokenizer {
        worker: Arc::new(jieba_rs::Jieba::new()),
        option: TokenizerOption::Unicode,
    };
    index.tokenizers().register("jieba", tokenizer);
    Ok(index)
}
```
```toml
[dependencies]
cang-jie = "0.14.0"
jieba-rs = { version = "0.6", features = ["tfidf", "textrank"] }
miette = { version = "5.5.0", features = ["fancy"] }
once_cell = "1.17.0"
tantivy = "0.18" # I'm using v0.18 now, so v0.18 here.
```

A typical entry may look like:

```json
{
  "id": 308,
  "text": "It would be OK if the driver of mouse and keyboard are working.",
  "send_time": 1502877989,
  "sender": "SOMEONE"
}
```
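For illustration (not part of the original report), here is a minimal sketch of how one such entry could be written to the index built above. It assumes a 0.19-style API (`doc!`, `DateTime::from_timestamp_secs`, an `Option`-returning `get_field`), an illustrative 50 MB writer heap, and that the JSON `sender` value goes into the schema's `sender_name` field:

```rust
use tantivy::{doc, DateTime, Index};

// Hypothetical helper: index a single chat entry against the schema above.
fn index_sample_entry(index: &Index) -> tantivy::Result<()> {
    let schema = index.schema();
    let id = schema.get_field("id").unwrap();
    let text = schema.get_field("text").unwrap();
    let sender_name = schema.get_field("sender_name").unwrap();
    let send_time = schema.get_field("send_time").unwrap();

    // 50 MB indexing budget; the writer flushes segments once it is reached.
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(
        id => 308_i64,
        text => "It would be OK if the driver of mouse and keyboard are working.",
        sender_name => "SOMEONE",
        send_time => DateTime::from_timestamp_secs(1502877989)
    ))?;
    writer.commit()?;
    Ok(())
}
```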
I'm not sure if this is related to the custom tokenizer or something else, but after I switched to v0.18, the memory consumption (RSS) became ~90M. That looks good to me.
What's your parameter for `doc_store_cache_size`?
Oh, I thought it was in bytes! I think maybe that is the problem. I also found that v0.18 doesn't have `doc_store_cache_size`. Thank you again for your kind reply!
Thank you for the update!
* fix doc store cache docs (addresses an issue reported in #1820)
* rename doc_store_cache_size
Confirmed. I switched to a smaller cache size and it works.
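For reference, a minimal sketch of what "a smaller cache size" could look like, assuming the v0.19 `IndexReaderBuilder::doc_store_cache_size` setter (its value counts cached doc-store blocks, not bytes); the helper name and the value `1` are illustrative:

```rust
use tantivy::{Index, IndexReader, ReloadPolicy};

// Hypothetical helper: open a reader that keeps only one decompressed
// doc-store block in its cache instead of the default.
fn open_reader_with_small_cache(index: &Index) -> tantivy::Result<IndexReader> {
    index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .doc_store_cache_size(1) // number of cached blocks, not bytes
        .try_into()
}
```

Note that the follow-up fix referenced above renames `doc_store_cache_size`, so the setter's name may differ in later tantivy versions.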
Describe the bug
When building a chat history searcher using tantivy v0.19.1, I found that it takes quite a lot of memory (~800M) for only 80k entries. I used bytehound to analyze the memory usage, but I cannot figure out why. I tried v0.18 and it only takes ~80M.
To Reproduce
I cannot share my code because it contains sensitive data. But I will provide the flamegraph. Hope that helps.
Full backtrace for the most memory-consuming part: