Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OutOfMemory exception with low system memory usage #359

Closed
mskarlin opened this issue Sep 27, 2024 · 27 comments
Closed

OutOfMemory exception with low system memory usage #359

mskarlin opened this issue Sep 27, 2024 · 27 comments

Comments

@mskarlin
Copy link

We're struggling to diagnose an issue where we are instantiating several Index objects (built on the same shared directory) in asynchronous coroutines, and setting the following stack trace:

ValueError: Failed to open file for read: 'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot allocate memory" }, filepath: "<index file path>" }

We have a relatively small index (~5Gb), and even with several instantiated Index objects, the system RAM usage is low (on a machine with 60Gb of memory, we're only seeing ~3Gb used). But when running with 4 concurrent coroutines we see the above issue.

In trying to reproduce, I made a tiny index with ~100 files, and ran 100 simultaneous coroutines, each of which with their own Index object, and was not able to reproduce the error.

Do you have any suggestions on what to investigate next?

@cjrh
Copy link
Collaborator

cjrh commented Sep 28, 2024

We're running with thousands of different indexes on a shared box, but we're using process pools instead of your async handling. We don't see memory growth in these long lived processes (that access different indexes over time), but to be fair we have not yet upgraded to 0.22.

As for suggestions, this might be a good application of the FIL memory profiler: https://pythonspeed.com/fil/docs/index.html

In essence, you can run your application in a staging environment with a copy of your "real" index, do what you need to do to make memory grow, and then when memory is high, kill the process. The memory profiler will dump a file showing where allocations were held. You get a flamegraph view:

https://pythonspeed.com/fil/docs/interpreting-output.html

@cjrh
Copy link
Collaborator

cjrh commented Sep 28, 2024

It would also be good document more specifics about your situation, i.e., package version, platform, etc.

@cjrh
Copy link
Collaborator

cjrh commented Oct 2, 2024

There was an actual memory leak a few versions back, in PyO3: PyO3/pyo3#3285. So it is important to know which version you're using. There are also a few other memory leak issues on the PyO3 tracker, so you might have a look there too to see if anything might be affecting you.

@cjrh
Copy link
Collaborator

cjrh commented Oct 16, 2024

@mskarlin is this still an issue for you? Without a reproducer there isn't much we can do I think.

@jamesbraza
Copy link

jamesbraza commented Oct 20, 2024

Hi @cjrh yes it is still an issue (I work with @mskarlin on https://github.com/Future-House/paper-qa), sorry for the delayed response.

I hit it with tantivy.Index.open on https://github.com/Future-House/paper-qa/blob/81f8c5f1395aef58d1d5317c478ce5b24949f88a/paperqa/agents/search.py#L176:

ValueError: Failed to open file for read: 'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot allocate memory" }, filepath:
"/path/to/indexes/index_name/index/abc123.fieldnorm" }'

I also seem to get it on .fast and .term. I think it's related to quickwit-oss/tantivy#1563. Here's my system info:

> python --version --version
Python 3.12.7 (main, Oct 11 2024, 14:40:10) [GCC 11.4.0]
> pip list | grep tantivy
tantivy                   0.22.0
> cat /etc/os-release | grep -e PRETTY_NAME -e VERSION
PRETTY_NAME="Ubuntu 22.04.3 LTS"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
> ulimit -a | grep virtual
virtual memory              (kbytes, -v) unlimited
> cat /proc/meminfo | grep -e Vm -e Mem
MemTotal:       131305920 kB
MemFree:        124926700 kB
MemAvailable:   126850976 kB
VmallocTotal:   13743895347199 kB
VmallocUsed:      390780 kB
VmallocChunk:          0 kB

Using this answer, I run: strace -e trace=open,mmap,brk,mmap2,munmap -f python a.py 2>strace.log, and in my strace.log I see:

[pid  8370] mmap(NULL, 14815, PROT_READ, MAP_SHARED, 25, 0) = 0x7fd5742b4000
[pid  8370] mmap(NULL, 145, PROT_READ, MAP_SHARED, 25, 0) = 0x7fd5742b3000
[pid  8370] mmap(NULL, 126, PROT_READ, MAP_SHARED, 25, 0) = 0x7fd5742b2000
[pid  8370] mmap(NULL, 41685, PROT_READ, MAP_SHARED, 25, 0) = 0x7fd5742a7000
[pid  8370] mmap(NULL, 90245, PROT_READ, MAP_SHARED, 25, 0) = 0x7fd574290000
[pid  8370] mmap(NULL, 8720, PROT_READ, MAP_SHARED, 25, 0) = 0x7fd57428d000
[pid  8370] mmap(NULL, 46066, PROT_READ, MAP_SHARED, 25, 0) = 0x7fd574281000
[pid  8370] mmap(NULL, 145, PROT_READ, MAP_SHARED, 25, 0) = 0x7fd574280000
[pid  8370] mmap(NULL, 126, PROT_READ, MAP_SHARED, 25, 0) = 0x7fd57427f000
[pid  8370] mmap(NULL, 21925, PROT_READ, MAP_SHARED, 25, 0) = 0x7fd574279000
[pid  8370] mmap(NULL, 32802, PROT_READ, MAP_SHARED, 25, 0) = -1 ENOMEM (Cannot allocate memory)

There is a series of successful mmap calls, and then we hit ENOMEM when trying to allocate about 32 KB.

Please let me know how to investigate this, I am not sure what to look into next.

One detail that may matter is our index was built on a macOS machine, and is being used on Ubuntu, might that affect things?

@cjrh
Copy link
Collaborator

cjrh commented Oct 20, 2024

Thanks for the detail. Unfortunately, I can't think of an obvious solution for this.

One detail that may matter is our index was built on a macOS machine, and is being used on Ubuntu, might that affect things?

Are you able to run Index.open on the macOS machine to check whether it produces the same error?

@cjrh
Copy link
Collaborator

cjrh commented Oct 20, 2024

Hmm, actually, I am slowly remembering something we saw in prod a while back. We also saw this OutOfMemory error on one of our larger indexes that had a very large number of segments. For example, each of these clusters

8b6d10f00a4a4363b6dbbbdaf034679b.term
8db1b848a5164252843cd706ae8c804a.fast
8db1b848a5164252843cd706ae8c804a.fieldnorm
8db1b848a5164252843cd706ae8c804a.idx
8db1b848a5164252843cd706ae8c804a.pos
8db1b848a5164252843cd706ae8c804a.store
8db1b848a5164252843cd706ae8c804a.term
935c493b48364e1d8cd8cb571b59a513.fast
935c493b48364e1d8cd8cb571b59a513.fieldnorm
935c493b48364e1d8cd8cb571b59a513.idx
935c493b48364e1d8cd8cb571b59a513.pos
935c493b48364e1d8cd8cb571b59a513.store
935c493b48364e1d8cd8cb571b59a513.term
9f9a01da2e7f441ab08528aaf44634f9.fast
9f9a01da2e7f441ab08528aaf44634f9.fieldnorm
9f9a01da2e7f441ab08528aaf44634f9.idx
9f9a01da2e7f441ab08528aaf44634f9.pos
9f9a01da2e7f441ab08528aaf44634f9.store
9f9a01da2e7f441ab08528aaf44634f9.term
d8968eb9e0c844deb1f0cdae38a6ee2a.fast
d8968eb9e0c844deb1f0cdae38a6ee2a.fieldnorm
d8968eb9e0c844deb1f0cdae38a6ee2a.idx
d8968eb9e0c844deb1f0cdae38a6ee2a.pos
d8968eb9e0c844deb1f0cdae38a6ee2a.store
d8968eb9e0c844deb1f0cdae38a6ee2a.term

That share a UUID name is a segment. So what we figured was happening might be that each segment's files are separately being mmap'ed, and eventually we just run out of virtual address space causing the OutOfMemory error.

So what we did, which worked, was we recreated this tantivy index with a much larger memory limit on the writer. Basically (handwavy explanation), the writer accumulates data in memory over multiple document additions, and when the specified memory limit is reached, it writes out that accumulated data as a segment. A higher memory limit means fewer segments.

This resolved our problem, but we didn't do a deeper dive to see precisely why the original memory error was being raised in the first place. The virt mem thing was just a hunch.

So perhaps the best option for you to try is to regenerate your problematic tantivy index with a much larger memory limit on the writer.

@cjrh
Copy link
Collaborator

cjrh commented Oct 20, 2024

I just checked, we're currently using 1GB as the heap setting for the writer. I am wondering whether I should set this as the default in this package. It might be better to default to bigger (avoiding the hard-to-debug OutOfMemory error), rather than the current default.

@jamesbraza
Copy link

Thank you so much for the replies @cjrh I am following what you're saying.

For reference my index is built from about 5.5k PDFs and goes to this many files:

> ls /path/to/index | wc -l
21001
> ls /path/to/index | head -n 8
000a60b7148d4e8285cd78737474a3da.fast
000a60b7148d4e8285cd78737474a3da.fieldnorm
000a60b7148d4e8285cd78737474a3da.idx
000a60b7148d4e8285cd78737474a3da.pos
000a60b7148d4e8285cd78737474a3da.store
000a60b7148d4e8285cd78737474a3da.term
000d73f244864638833d735c8a76ebd7.fast
000d73f244864638833d735c8a76ebd7.fieldnorm

Are you able to run Index.open on the macOS machine to check whether it produces the same error?

Last night I rebuilt the index on the Ubuntu machine itself. What happens is:

  • Index built on macOS laptop, running on macOS laptop: OutOfMemory errors don't happen
  • Index built on Ubuntu server, running on Ubuntu server: OutOfMemory errors happen
  • Index built on Ubuntu server, running on Ubuntu server: OutOfMemory errors happen

So actually I don't think it's an "index across OS's compatibility" issue any more.

So perhaps the best option for you to try is to regenerate your problematic tantivy index with a much larger memory limit on the writer

This sounds quite reasonable, let's do this. Can you give me some rough guidance on how I can regenerate the index using a much larger memory limit?

We instantiate our writer like this here:

index: tantivy.Index
writer: tantivy.IndexWriter = index.writer()

Should I configure the writer's heap_size argument? https://github.com/quickwit-oss/tantivy-py/blob/0.22.0/tantivy/tantivy.pyi#L342

@jamesbraza
Copy link

jamesbraza commented Oct 20, 2024

Firstly I want to say again @cjrh thank you again for your help here, basically your help is very awesome :)

So perhaps the best option for you to try is to regenerate your problematic tantivy index with a much larger memory limit on the writer

I decided to be proactive here and took a guess at how to do this, making the below change and rebuilding the index:

index: tantivy.Index
--- writer: tantivy.IndexWriter = index.writer()
+++ writer: tantivy.IndexWriter = index.writer(heap_size=1_000_000_000)

Then looking at the new index's size:

> ls /path/to/index | wc -l
19273

And that did not resolve the issue, I still get the OutOfMemory error:

ValueError: Failed to open file for read: 'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot allocate memory" }, filepath:
"/path/to/index/14a007c93a624cd4ade83137b518b779.fieldnorm" }'

How can I correctly adjust the writer's memory limit?

@wallies
Copy link
Collaborator

wallies commented Oct 20, 2024

@jamesbraza I think from memory it is

writer: tantivy.IndexWriter = index.writer(1_000_000_000)

This was the original issue that was raised for this #118

@jamesbraza
Copy link

Okay thank you @wallies, I tried that just now (both position arg and keyword arg variations of heap_size) and the error persists :/

Can you suggest me more things to try?

@jamesbraza
Copy link

So I am actually pretty stalled out due to this error, so sharing a thought

So what we figured was happening might be that each segment's files are separately being mmap'ed, and eventually we just run out of virtual address space causing the OutOfMemory error.

Perhaps there is some way tantivy IndexWriter.commit can validate against this error type? Like during the build tanvity can check how much space is available and how much it has assigned. Just an idea.

@jamesbraza
Copy link

Do you think it could be arm64 (my Mac where everything works) vs amd64 (server getting the IoError)?

@cjrh
Copy link
Collaborator

cjrh commented Oct 21, 2024

@jamesbraza

Do you think it could be arm64

I don't know. But In my case, we're all on amd64.

Then looking at the new index's size:

ls /path/to/index | wc -l
19273

And that did not resolve the issue, I still get the OutOfMemory error:

So something is wrong here. Increasing the writer memory threshold should definitely result in fewer segments. Unless there is some other reason your segments are small. How often are you calling commit?

@cjrh
Copy link
Collaborator

cjrh commented Oct 21, 2024

Can you show a snippet of code that shows the loop where the index being created (docs being added to a writer)?

@cjrh
Copy link
Collaborator

cjrh commented Oct 21, 2024

Can you also show the output of ulimit -a on the problem amd64 machine pls. On my dev machine it looks like this:

real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 124938
max locked memory           (kbytes, -l) 4007836
max memory size             (kbytes, -m) unlimited
open files                          (-n) 65536
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 124938
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

@cjrh
Copy link
Collaborator

cjrh commented Oct 21, 2024

Here is a small test file I made to generate indexes with a few of the settings configurable as command line args:

import argparse
import random
import string
from tantivy import SchemaBuilder, Index, Document

def run(path, heap_size=15_000_000, doc_count=1000, string_length=10, num_text_fields=10) -> Index:
    text_fieldnames = [f"text{i}" for i in range(num_text_fields)]
    schema = (
        SchemaBuilder()
        .add_unsigned_field("order", fast=True)
        .add_text_field("text", stored=True)
        .build()
    )

    schema_b = (
        SchemaBuilder()
        .add_unsigned_field("order", fast=True)
    )

    for fieldname in text_fieldnames:
        schema_b.add_text_field(fieldname, stored=True)

    schema = schema_b.build()

    index = Index(schema, path)
    writer = index.writer(heap_size)

    for i in range(doc_count):
        doc = Document()
        doc.add_unsigned("order", i)
        for fieldname in text_fieldnames:
            doc.add_text(fieldname, "".join(random.choices(string.ascii_lowercase, k=string_length)))
        writer.add_document(doc)

    writer.commit()
    writer.wait_merging_threads()
    index.reload()
    return index


def main(args):
    import os
    import tempfile
    with tempfile.TemporaryDirectory() as path:
        index = run(path, args.heap, args.count, args.length, args.fields)

        # Print out some diagnostics
        items = os.listdir(path)
        print(len(items))
        print(items[:10])

        # Check that we can open the index
        index = Index.open(path)
        print(index)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Tantivy Python segment test")
    parser.add_argument("--heap", type=int, default=15_000_000, help="Heap size")
    parser.add_argument("--count", type=int, default=1000, help="Document count")
    parser.add_argument("--length", type=int, default=10, help="String length")
    parser.add_argument("--fields", type=int, default=10, help="Number of text fields")
    args = parser.parse_args()
    main(args)

Running this with these settings, for example (python 3.12, tantivy 0.22.0):

$ python main.py --count 1_000_000 --length 1000

Takes a while to run, but produces a large index, nearly 10 GB. The important thing to note is that when it completes, there are 70 files in the index folder, not tens of thousands.

During the indexing, the number of files goes up and down, because tantivy also merges segments together as part of the indexing process. This is me listing out the status of the tmp index folder a few times during the indexing (the tmp dir was tmpisktkptr):

$ du -d0 -h tmpisktkptr/
1,4G	tmpisktkptr/
(venv) /tmp  [venv://h/c/t/t/v:3.12.3] 
$ ls -l tmpisktkptr/ | wc -l
38
(venv) /tmp  [venv://h/c/t/t/v:3.12.3] 
$ ls -l tmpisktkptr/ | wc -l
20
(venv) /tmp  [venv://h/c/t/t/v:3.12.3] 
$ ls -l tmpisktkptr/ | wc -l
20
(venv) /tmp  [venv://h/c/t/t/v:3.12.3] 
$ du -d0 -h tmpisktkptr/
6,2G	tmpisktkptr/
(venv) /tmp  [venv://h/c/t/t/v:3.12.3] 
$ ls -l tmpisktkptr/ | wc -l
44
(venv) /tmp  [venv://h/c/t/t/v:3.12.3] 
$ ls -l tmpisktkptr/ | wc -l
56
(venv) /tmp  [venv://h/c/t/t/v:3.12.3] 
$ du -d0 -h tmpisktkptr/
6,9G	tmpisktkptr/
(venv) /tmp  [venv://h/c/t/t/v:3.12.3] 
$ ls -l tmpisktkptr/ | wc -l
26
(venv) /tmp  [venv://h/c/t/t/v:3.12.3] 
$ du -d0 -h tmpisktkptr/
7,4G	tmpisktkptr/

So I definitely think that your problem is due to ending up with a very large number of files in your tantivy index.

Perhaps you can run the above script on the mac and see whether you end up with a large number of files or not. And if not, compare what the script is doing to what our live code is doing.

@cjrh
Copy link
Collaborator

cjrh commented Oct 21, 2024

If the same invocation of the script produces a large number of files on mac, but not on amd64, then that suggests there is a bug in the mac build somehow, and we can escalate to the experts in the tantivy project 😄

@jamesbraza
Copy link

jamesbraza commented Oct 21, 2024

How often are you calling commit?

Once per input PDF, there are about 5.5k of them.


Can you show a snippet of code that shows the loop where the index being created (docs being added to a writer)?

Yes it's here: https://github.com/Future-House/paper-qa/blob/v5.3.0/paperqa/agents/search.py#L404-L481

commit gets called within add_document, which is invoked near the bottom of that snippet.


Can you also show the output of ulimit -a on the problem amd64 machine pls.

Here is the Ubuntu server, which hits the IoError issue:

> ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 512038
max locked memory           (kbytes, -l) 16413240
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 512038
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
> uname -a
Linux sn4622119851 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
> python -c 'import multiprocessing as mp; print(mp.cpu_count())'
48

Here is my Mac, which does not hit the IoError issue:

> ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8176
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       6000
-n: file descriptors                256
> uname -a
Darwin Jamess-MacBook-Pro.local 23.5.0 Darwin Kernel Version 23.5.0: Wed May  1 20:13:18 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6030 arm64
> python -c 'import multiprocessing as mp; print(mp.cpu_count())'
11

I just rebuilt the indexes using now writer(heap_size=int(15e9)) on both my Mac and the Ubuntu server. Per this comment:

Increasing the writer memory threshold should definitely result in fewer segments.

It seems the trend does hold, but just not as dramatically as you stated:

> ls /path/to/normal-index/index/ | wc -l
21001
> ls /path/to/1GBheap-index/index/ | wc -l
19273
> ls /path/to/15GBheap-index/index/ | wc -l
15343

Even with a 15 GB heap size, at inference time the error persists:

           Starting paper search for 'proton antiport serotonin transporter essential residues'.
           Starting paper search for 'serotonin transporter proton antiport mechanism'.
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/259ec6cdcc6a49388d2e301fc6d3cfe8.term" }\'')
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/259ec6cdcc6a49388d2e301fc6d3cfe8.term" }\'')
           Starting paper search for 'HER2 positive breast cancer cells upregulated genes SREBF1 SREBP1c NR1H2 ACLY FASN ABCA1'.
           Starting paper search for 'HER2 positive breast cancer gene expression'.
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/259ec6cdcc6a49388d2e301fc6d3cfe8.term" }\'')
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/259ec6cdcc6a49388d2e301fc6d3cfe8.term" }\'')
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/259ec6cdcc6a49388d2e301fc6d3cfe8.term" }\'')
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/259ec6cdcc6a49388d2e301fc6d3cfe8.term" }\'')
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/329655a70d9648ab8dd88ad3fba4ff5e.fieldnorm" }\'')
[15:14:40] Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/329655a70d9648ab8dd88ad3fba4ff5e.term" }\'')
           Starting paper search for 'mucin-degrading bacteria enriched in mouse gut microbiome fiber-free conditions'.
           Starting paper search for 'B. intestinihominis fiber-free conditions'.
           Starting paper search for 'C. symbiosum fiber-free conditions'.
           Starting paper search for 'B. thetaiotaomicron fiber-free conditions'.
           Starting paper search for 'A. muciniphila fiber-free conditions'.
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/86532e330292440e9d50c41993060b56.store" }\'')
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/86532e330292440e9d50c41993060b56.store" }\'')
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/9fa60161320944e0bbf0ec0a382f4bd7.fast" }\'')
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/9fa60161320944e0bbf0ec0a382f4bd7.fast" }\'')
           Starting paper search for 'ZBTB16 effect on PAX6 promoters'.
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/9fa60161320944e0bbf0ec0a382f4bd7.fast" }\'')
           Starting paper search for 'drosophila melanogaster embryos heat shock gene expression'.
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/9fa60161320944e0bbf0ec0a382f4bd7.idx" }\'')
           Encountered exception during tool call for tool paper_search: ValueError('Failed to open file for read: \'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot
           allocate memory" }, filepath: "/path/to/15GBheap-index/index/9fa60161320944e0bbf0ec0a382f4bd7.idx" }\'')

Please note that it only seems to be a subset of index files lead to the IoError, it's not all of them. To share, here's me running:

> du -h /path/to/15GBheap-index/index/9fa60161320944e0bbf0ec0a382f4bd7.idx
4.0K    /path/to/15GBheap-index/index/9fa60161320944e0bbf0ec0a382f4bd7.idx

I know this is a lot to take in, but can you please give some guidance on what to try next?

Also, I want to restate your efforts are very much appreciated 🙏

@jamesbraza
Copy link

Do you think we should not be calling commit so much? It seems you:

writer.commit()
writer.wait_merging_threads()
index.reload()

We don't do wait_merging_threads or reload, not sure if that matters or not

@jamesbraza
Copy link

jamesbraza commented Oct 22, 2024

Okay I was talking in-person at FutureHouse and I want to highlight something to help make it clear:

  1. We build successfully, and do not see the IoError during build, regardless of heap_size
  2. Using the built Index at "inference time" (when we use the index for search) when calling Index.open concurrently:
    • On my Mac with concurrency of 20: no IoError issues
    • On the Ubuntu server with concurrency of 20: hits the IoError issues

With this new mental clarity, I decided to play around and discovered this (with an index built on the 5.5k papers using the default heap_size), the below happens on the Ubuntu server (and not on my Mac):

In [1]: from paperqa.agents.search import SearchDocumentStorage, SearchIndex
In [2]: index_1 = await SearchIndex(index_name="normal-index").index
In [3]: index_2 = await SearchIndex(index_name="normal-index").index
In [4]: index_3 = await SearchIndex(index_name="normal-index").index
In [5]: index_4 = await SearchIndex(index_name="normal-index").index
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 index_4 = await SearchIndex(index_name="normal-index").index

File /path/to/.venv/lib/python3.12/site-packages/paperqa/agents/search.py:176, in SearchIndex.index(self)
    174 index_path = await self.index_filename
    175 if await (index_path / "meta.json").exists():
--> 176     self._index = Index.open(path=str(index_path))  # type: ignore[assignment]
    177     print("success", self._index)
    178 else:
    179     # NOTE: this creates the above meta.json file

ValueError: Failed to open file for read: 'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot allocate memory" }, filepath: "/path/to/normal-index/index/9ac6cd3a75304ff794ef4377ae3ba1b2.store" }'

At this point, running free -h we see there's still lots of RAM left:

> free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi       7.5Gi        60Gi        12Mi        57Gi       116Gi
Swap:            9Gi       307Mi       9.7Gi

So, seeing this, can you help me figure out what is going wrong?

@cjrh
Copy link
Collaborator

cjrh commented Oct 22, 2024

So, seeing this, can you help me figure out what is going wrong?

No

At this point, running free -h we see there's still lots of RAM left:

Yes, my strong suspicion is that you're hitting some kind of virtual memory limit from mmap calls being made for too many segments. So the physical memory in your machine doesn't matter.

It seems the trend does hold, but just not as dramatically as you stated:
ls /path/to/normal-index/index/ | wc -l
21001
ls /path/to/1GBheap-index/index/ | wc -l
19273
ls /path/to/15GBheap-index/index/ | wc -l
15343

These are still too many I think.

Either you should try to run my test script that I posted above, and check whether you get a similar number of files as I do, or you will have to give me a script that I can easily run on my machine to verify that I get the large number of files that you see. If I can reproduce the issue on my computer it is very likely I will figure out what is going on. But failing that, it's very difficult. A very simple reproducer that both you and I can run on our machines and compare results would be very helpful.

We build successfully, and do not see the IoError during build, regardless of heap_size
On the Ubuntu server with concurrency of 20: hits the IoError issues

Out of interest, have you tried building the index on the Ubuntu server, and when doing so how many segments (files) to you end up with in the index directory when doing so? Is it similar to what you end up with on the mac?

Do you think we should not be calling commit so much?
We don't do wait_merging_threads or reload, not sure if that matters or not

You must call wait_merging_threads, and after the commit just like the sequence I showed. These threads are combining segments so that you end up with fewer segments. This is important and could be directly related to ending up with so many segments. You should add this line.

The frequency of calling commit shouldn't really matter for how many segments you end up with, because segments get merged together periodically anyway. However, calling commit too often will make things slow. So if you have a lot of data ready to index it's better to call commit less frequently; although, below some frequency it won't matter any more.

@jamesbraza
Copy link

jamesbraza commented Oct 25, 2024

Allllright, coming from Future-House/paper-qa#628 (comment) on adding in use of wait_merge_threads:

  • Index file count: 1000X smaller (21004 to 22 files)
  • Index size: 40% smaller (650 to 389 MiB)
  • Index build time: ~2X longer (anecdotally)
  • (Runtime) tantivy.Searcher.search performance: unchanged
  • (Runtime) tantivy.Searcher.search speed: definitely faster (but didn't quantify this)

So it turns out by not calling wait_merging_threads, we were missing out on tons of compaction.

It turns out wait_merging_threads really matters.

I re-ran the original code that lead to the IoError, and it no longer occurs 👍 😍 . So this case is resolved 🚀 🌕 .


I guess, I am left wondering, how can tantivy-py make this easier for "the next dev"?

Is there some way we can change the Python bindings to make this more seamless, or obvious that wait_merging_threads is important? Perhaps a docs modification might be good too

@cjrh
Copy link
Collaborator

cjrh commented Oct 27, 2024

For sure, the first port of call should be the documentation. The very first examples seen by a new user should include all the parts, including the merging threads. I can add this to the docs over the next few days for any that are missing.

In terms of changing the code to make it easier for users, one of the design goals we have is to mimic the tantivy (upstream crate) interface as much as possible, so that knowledge of how one works can be transferred over to the other. So I wouldn't want to change the existing calls themselves, for example I wouldn't want to call wait_merging_threads inside the commit wrapper in the python code.

However, we should be able to add new methods, perhaps in a new python module namespace that automates some of these typical state management steps. For example, we could expose a new context manager (which rust doesn't have) that lets the user add docs and then in the finalizer section of the context manager we can call commit and wait_merging_threads. This can be explained as some kind of pythonic extension of the raw tantivy interface. I'll think about it more.

For now, I am happy to close this issue if you are, since the original issue is now solved.

@jamesbraza
Copy link

I have slowly done some validations over the past week, and indeed things are looking great now. I would like to say thank you again for your support here, it lead to a happy ending.


Per your thoughts on next steps, they all sound fine, so defer to you here. Feel free to close this out if you'd like

@cjrh
Copy link
Collaborator

cjrh commented Nov 4, 2024

Good to hear 😄

Yes when I do my next pass over the docs I'll make a special callout about the wait_merging_threads call.

@cjrh cjrh closed this as completed Nov 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants