v0.19.1 consumes much more memory than v0.18 #1820

Closed
Enter-tainer opened this issue Jan 22, 2023 · 7 comments

Comments

@Enter-tainer

Enter-tainer commented Jan 22, 2023

Describe the bug

  • What did you do?
    While building a chat history searcher with tantivy v0.19.1, I found that it uses a lot of memory (~800 MB) for only 80k entries. I used bytehound to analyze the memory usage but could not figure out why. I tried v0.18 and it only takes ~80 MB.

To Reproduce

If your bug is deterministic, can you give a minimal reproducing code?
Some bugs are not deterministic. Can you describe with precision in which context it happened?
If this is possible, can you share your code?

I cannot share my code because it contains sensitive data. But I will provide the flamegraph. Hope that helps.

[Flamegraph image: v0.19]

full backtrace for the most memory-consuming part:

#0 [libc.so.6] 0x00007F80A0D21A5F
#1 [libc.so.6] 0x00007F80A0C9F8FC
#2 [tg-search] std::sys::unix::thread::Thread::new::thread_start [thread.rs:108]
#3 [tg-search] <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once [boxed.rs:1987]
#4 [tg-search] <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once [boxed.rs:1987]
#5 [tg-search] core::ops::function::FnOnce::call_once{{vtable.shim}} [function.rs:251]
#6 [tg-search] std::thread::Builder::spawn_unchecked_::{{closure}} [mod.rs:550]
#7 [tg-search] std::panic::catch_unwind [panic.rs:137]
#8 [tg-search] std::panicking::try [panicking.rs:447]
#9 [tg-search] std::panicking::try::do_call [panicking.rs:483]
#10 [tg-search] <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once [unwind_safe.rs:271]
#11 [tg-search] std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}} [mod.rs:551]
#12 [tg-search] std::sys_common::backtrace::__rust_begin_short_backtrace [backtrace.rs:121]
#13 [tg-search] tantivy::directory::watch_event_router::WatchCallbackList::broadcast::{{closure}} [watch_event_router.rs:87]
#14 [tg-search] tantivy::directory::watch_event_router::WatchCallback::call [watch_event_router.rs:16]
#15 [tg-search] tantivy::reader::IndexReaderBuilder::try_into::{{closure}} [mod.rs:89]
#16 [tg-search] tantivy::reader::InnerIndexReader::reload [mod.rs:243]
#17 [tg-search] tantivy::reader::InnerIndexReader::create_searcher [mod.rs:230]
#18 [tg-search] tantivy::core::searcher::SearcherInner::new [searcher.rs:263]
#19 [tg-search] core::iter::traits::iterator::Iterator::collect [iterator.rs:1836]
#20 [tg-search] <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter [result.rs:2075]
#21 [tg-search] core::iter::adapters::try_process [mod.rs:164]
#22 [tg-search] <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter::{{closure}} [result.rs:2075]
#23 [tg-search] core::iter::traits::iterator::Iterator::collect [iterator.rs:1836]
#24 [tg-search] <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter [mod.rs:2757]
#25 [tg-search] <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter [spec_from_iter.rs:33]
#26 [tg-search] <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter [spec_from_iter_nested.rs:43]
#27 [tg-search] <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend [spec_extend.rs:18]
#28 [tg-search] alloc::vec::Vec<T,A>::extend_desugared [mod.rs:2857]
#29 [tg-search] <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next [mod.rs:178]
#30 [tg-search] core::iter::traits::iterator::Iterator::try_for_each [iterator.rs:2299]
#31 [tg-search] <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::try_fold [mod.rs:195]
#32 [tg-search] <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold [map.rs:117]
#33 [tg-search] core::iter::traits::iterator::Iterator::try_fold [iterator.rs:2238]
#34 [tg-search] core::iter::adapters::map::map_try_fold::{{closure}} [map.rs:91]
#35 [tg-search] tantivy::core::searcher::SearcherInner::new::{{closure}} [searcher.rs:265]
#36 [tg-search] tantivy::core::segment_reader::SegmentReader::get_store_reader [segment_reader.rs:138]
#37 [tg-search] tantivy::store::reader::StoreReader::open [reader.rs:121]
#38 [tg-search] lru::LruCache<K,V>::new [lib.rs:208]
#39 [tg-search] hashbrown::map::HashMap<K,V>::with_capacity [map.rs:326]
#40 [tg-search] hashbrown::map::HashMap<K,V,S>::with_capacity_and_hasher [map.rs:422]
#41 [tg-search] hashbrown::raw::RawTable<T>::with_capacity [mod.rs:411]
#42 [tg-search] hashbrown::raw::RawTable<T,A>::with_capacity_in [mod.rs:481]
#43 [tg-search] hashbrown::raw::RawTable<T,A>::fallible_with_capacity [mod.rs:460]
#44 [tg-search] hashbrown::raw::RawTableInner<A>::fallible_with_capacity [mod.rs:1109]
#45 [tg-search] hashbrown::raw::RawTableInner<A>::new_uninitialized [mod.rs:1080]
#46 [tg-search] hashbrown::raw::alloc::inner::do_alloc [alloc.rs:62]
#47 [tg-search] <hashbrown::raw::alloc::inner::Global as hashbrown::raw::alloc::inner::Allocator>::allocate [alloc.rs:47]
#48 [tg-search] alloc::alloc::alloc [alloc.rs:99]
#49 [libbytehound.so] malloc
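
Reading the trace bottom-up, the allocation comes from `lru::LruCache::new` calling hashbrown's `HashMap::with_capacity` inside `StoreReader::open`, so the table appears to be sized up front for the requested number of entries. Below is a minimal std-only sketch of that effect. It assumes that `std::collections::HashMap` behaves like the hashbrown map in the trace. The helper name `reserved_capacity` and the `u64` key/value types are hypothetical and chosen only for illustration.

```rust
use std::collections::HashMap;

/// Builds an empty map sized for `entries` items, the way a cache
/// constructor that takes a capacity would, and returns how many
/// slots were actually reserved before anything is inserted.
fn reserved_capacity(entries: usize) -> usize {
    // Hypothetical key/value types; a real cache stores other data.
    let map: HashMap<u64, u64> = HashMap::with_capacity(entries);
    map.capacity()
}

fn main() {
    // The reserved capacity grows with the requested one, even though
    // the map stays empty.
    for entries in [100usize, 10_000, 1_000_000] {
        println!(
            "requested {:>9} -> reserved {}",
            entries,
            reserved_capacity(entries)
        );
    }
}
```

If this reading is right, a very large capacity costs memory as soon as the reader is opened, and once per segment, because `SearcherInner::new` opens a store reader for each segment reader.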
@fmassot
Contributor

fmassot commented Jan 22, 2023

Thanks, @Enter-tainer for the report; that's unexpected indeed. Can you also share the schema of the index?

@Enter-tainer
Author

schema.rs:

use std::sync::Arc;

use cang_jie::{CangJieTokenizer, TokenizerOption};
use miette::{IntoDiagnostic, Result};
use once_cell::sync::Lazy;
use tantivy::{
    directory::MmapDirectory,
    schema::{
        IndexRecordOption, Schema, SchemaBuilder, TextFieldIndexing, TextOptions, FAST, STORED,
    },
    Index,
};

pub fn get_schema() -> Schema {
    static SCHEMA: Lazy<Schema> = Lazy::new(|| {
        let zh_text_indexing = TextFieldIndexing::default()
            .set_tokenizer("jieba") // Set custom tokenizer
            .set_index_option(IndexRecordOption::WithFreqsAndPositions);
        let zh_text_options = TextOptions::default()
            .set_indexing_options(zh_text_indexing)
            .set_stored();
        let mut builder = SchemaBuilder::new();
        builder.add_i64_field("id", STORED | FAST);
        builder.add_text_field("text", zh_text_options);
        builder.add_text_field("sender_name", STORED);
        builder.add_date_field("send_time", STORED);
        builder.build()
    });
    SCHEMA.clone()
}

pub fn create_or_open_index(path: &str) -> Result<Index> {
    let dir = MmapDirectory::open(path).into_diagnostic()?;
    let index = Index::open_or_create(dir, get_schema()).into_diagnostic()?;
    let tokenizer = CangJieTokenizer {
        worker: Arc::new(jieba_rs::Jieba::new()),
        option: TokenizerOption::Unicode,
    };
    index.tokenizers().register("jieba", tokenizer);
    Ok(index)
}

Cargo.toml

[dependencies]
cang-jie = "0.14.0"
jieba-rs = { version = "0.6", features = ["tfidf", "textrank"] }
miette = { version = "5.5.0", features = ["fancy"] }
once_cell = "1.17.0"
tantivy = "0.18" # i'm using v0.18 now, so v0.18 here.

A typical entry may look like:

  {
    "id": 308,
    "text": "It would be OK if the driver of mouse and keyboard are working.",
    "send_time": 1502877989,
    "sender": "SOMEONE"
  }

@Enter-tainer
Author

Enter-tainer commented Jan 22, 2023

I'm not sure whether this is related to the custom tokenizer or something else. But after I switched to v0.18, the memory consumption (RSS) dropped to ~90 MB. That looks good to me.

@PSeitz
Contributor

PSeitz commented Jan 22, 2023

What's your parameter for doc_store_cache_size? That cache is for the number of decompressed blocks. This needs more documentation.

    pub(crate) fn new(
        schema: Schema,
        index: Index,
        segment_readers: Vec<SegmentReader>,
        generation: TrackedObject<SearcherGeneration>,
        doc_store_cache_size: usize,
    ) -> io::Result<SearcherInner>
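
To see why the unit matters, here is a rough back-of-the-envelope sketch comparing the same number read as bytes and read as blocks. Everything in it is an assumption for illustration: the function name `estimated_cache_bytes`, the 16 KiB decompressed block size, the segment count, and the example value are not taken from this thread, and real blocks can be smaller or larger.

```rust
/// Rough upper bound on the memory the doc store caches could hold
/// once full: blocks per cache * bytes per block * number of segments.
/// Saturates instead of overflowing for absurdly large inputs.
fn estimated_cache_bytes(num_blocks: u64, block_bytes: u64, num_segments: u64) -> u64 {
    num_blocks
        .saturating_mul(block_bytes)
        .saturating_mul(num_segments)
}

fn main() {
    // Assumption: a decompressed block is about 16 KiB.
    const ASSUMED_BLOCK_BYTES: u64 = 16 * 1024;
    // Assumption: the index has 8 segments, each with its own cache.
    const ASSUMED_SEGMENTS: u64 = 8;
    // Hypothetical value a user might pass, believing it means bytes.
    let configured: u64 = 100_000;

    let as_blocks = estimated_cache_bytes(configured, ASSUMED_BLOCK_BYTES, ASSUMED_SEGMENTS);

    println!("read as bytes : {} bytes in total", configured);
    println!("read as blocks: up to {} bytes in total", as_blocks);
}
```

Under these assumptions the two readings differ by several orders of magnitude, which is why a value that looks modest as a byte count can be far too large as a block count.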

@Enter-tainer
Author

That cache is for the number of decompressed blocks

Oh, I thought it was in bytes! I think that may be the problem. I also found that v0.18 doesn't have doc_store_cache_size, which would explain why memory consumption is normal there. Thank you for your suggestion! I'll test on v0.19 to confirm this when I have time. Let me just close this for now.

Thank you again for your kind reply!

@fulmicoton
Collaborator

Thank you for the update!

PSeitz added a commit that referenced this issue Jan 23, 2023
addresses an issue reported in #1820
PSeitz added a commit that referenced this issue Jan 23, 2023
* fix doc store cache docs

addresses an issue reported in #1820

* rename doc_store_cache_size
@Enter-tainer
Author

I'll test on v0.19 to confirm this when I have time.

Confirmed. I switched to a smaller cache size and it works.

Hodkinson pushed a commit to Hodkinson/tantivy that referenced this issue Jan 30, 2023
* fix doc store cache docs

addresses an issue reported in quickwit-oss#1820

* rename doc_store_cache_size