Slow loading of the pipe scispacy_linker #402

Open
vlievin opened this issue Oct 25, 2021 · 9 comments
Labels
enhancement New feature or request

Comments

@vlievin

vlievin commented Oct 25, 2021

Hi, loading a UMLS linker is particularly slow (~20-30s). It is a real issue when testing the code. I have included the profiler output below. Is there anything we can do to speed up the loading of the linker?

Profiler output

   Ordered by: internal time
   List reduced from 951 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   19.741   19.741   53.338   53.338 /Users/-/Library/Caches/pypoetry/virtualenvs/fz-openqa-rEqQaPFC-py3.8/lib/python3.8/site-packages/scispacy/linking_utils.py:55(__init__)
        1   18.422   18.422   25.783   25.783 /Users/-/Library/Caches/pypoetry/virtualenvs/fz-openqa-rEqQaPFC-py3.8/lib/python3.8/site-packages/scispacy/candidate_generation.py:116(load_approximate_nearest_neighbours_index)
  3359672   16.912    0.000   16.912    0.000 /Users/-/anaconda3/lib/python3.8/json/decoder.py:343(raw_decode)
  3359672    3.847    0.000   24.272    0.000 /Users/-/anaconda3/lib/python3.8/json/decoder.py:332(decode)
     4023    3.202    0.001    3.202    0.001 {method 'decompress' of 'zlib.Decompress' objects}
  3359672    2.840    0.000   30.230    0.000 /Users/-/Library/Caches/pypoetry/virtualenvs/fz-openqa-rEqQaPFC-py3.8/lib/python3.8/site-packages/scispacy/linking_utils.py:65(<genexpr>)
  3359672    2.818    0.000   28.086    0.000 /Users/-/anaconda3/lib/python3.8/json/__init__.py:299(loads)
  6719602    2.603    0.000    2.603    0.000 {method 'match' of 're.Pattern' objects}
        6    2.251    0.375    2.251    0.375 {method 'do_handshake' of '_ssl._SSLSocket' objects}
        6    1.206    0.201    1.206    0.201 {method 'read' of '_ssl._SSLSocket' objects}
        6    1.122    0.187    1.122    0.187 {method 'connect' of '_socket.socket' objects}
  9300568    1.002    0.000    1.002    0.000 {method 'add' of 'set' objects}
  3359671    0.867    0.000    1.565    0.000 <string>:1(__new__)
     4033    0.763    0.000    0.763    0.000 {built-in method zlib.crc32}
  3360030    0.704    0.000    0.704    0.000 {built-in method __new__ of type object at 0x10c379808}
  3359928    0.679    0.000    0.679    0.000 {method 'startswith' of 'str' objects}
  6719344    0.581    0.000    0.581    0.000 {method 'end' of 're.Match' objects}
        2    0.525    0.262    0.525    0.262 {method 'astype' of 'numpy.ndarray' objects}
        5    0.474    0.095    4.703    0.941 /Users/-/Library/Caches/pypoetry/virtualenvs/fz-openqa-rEqQaPFC-py3.8/lib/python3.8/site-packages/numpy/lib/format.py:699(read_array)
        2    0.369    0.184    0.369    0.184 {method 'copy' of 'numpy.ndarray' objects}


Code to reproduce the above results

import cProfile
import pstats
from time import time

import spacy
from scispacy.abbreviation import AbbreviationDetector  # type: ignore
# Imported for its side effect: registers the "scispacy_linker" factory.
from scispacy.linking import EntityLinker  # type: ignore


def load_spacy_model(model_name: str):
   """Load a ScispaCy model"""
    model = spacy.load(
        model_name,
        disable=[
            "tok2vec",
            "tagger",
            "parser",
            "attribute_ruler",
            "lemmatizer",
        ],
    )

    return model


def add_scispacy_linker(model):
    """add the entity linker (slow loading)"""
    model.add_pipe(
        "scispacy_linker",
        config={"linker_name": "umls"},
    )
    return model

# load the model (ok loading time)
model = load_spacy_model(model_name="en_core_sci_sm")

# load the linker (slow loading time) + profiling
profiler = cProfile.Profile()
profiler.enable()
t0 = time()
model = add_scispacy_linker(model)
duration = time() - t0
profiler.disable()
stats = pstats.Stats(profiler).sort_stats("time")
stats.print_stats(20)
@MichalMalyska
Contributor

Do you need the full linker for the test? If not, you could always make up a "test" version of it with a much smaller index, since the index is what takes the most time to load.

@vlievin
Author

vlievin commented Oct 25, 2021

Hi @MichalMalyska, thank you for your reply! Ideally, we want to test using the same model. Is there any computation that happens during loading that we could cache? Or is the duration simply caused by loading the weights?
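
One way to limit the cost in a test suite, without changing the linker itself, is to build the pipeline once per process and reuse it. The sketch below is only an illustration: `build_pipeline`, `get_pipeline` and `_PIPELINES` are hypothetical names, and the `spacy.load` / `add_pipe` calls simply mirror the reproduction script in the opening post. It does not make the first load any faster; it only ensures the load is paid for once.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical process-wide cache: (model_name, linker_name) -> loaded pipeline.
_PIPELINES: Dict[Tuple[str, str], Any] = {}


def build_pipeline(model_name: str, linker_name: str) -> Any:
    """Load the spaCy model and attach the linker (the slow step)."""
    # Local imports keep this helper importable without paying the import cost up front.
    import spacy
    from scispacy.linking import EntityLinker  # noqa: F401  (registers "scispacy_linker")

    nlp = spacy.load(model_name)
    nlp.add_pipe("scispacy_linker", config={"linker_name": linker_name})
    return nlp


def get_pipeline(
    model_name: str = "en_core_sci_sm",
    linker_name: str = "umls",
    build: Callable[[str, str], Any] = build_pipeline,
) -> Any:
    """Return a cached pipeline, building it only on the first request."""
    key = (model_name, linker_name)
    if key not in _PIPELINES:
        _PIPELINES[key] = build(model_name, linker_name)
    return _PIPELINES[key]
```

Every test that calls `get_pipeline()` then shares the same object, so tests must not mutate it. A session-scoped pytest fixture would achieve the same effect.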

@MichalMalyska
Contributor

I think it is just loading weights. The UMLS index is quite beefy from what I remember (~2 GB or something).

@dakinggg
Collaborator

It should only be really slow the first time, because it needs to download some large files, including the umls index. These files are then cached, and subsequent calls should be fast. Is this not what you are experiencing?

@MichalMalyska
Contributor

I think for me it's usually ~15-20 seconds to load it in, but this is on a laptop.

@vlievin
Author

vlievin commented Oct 27, 2021

Hi @dakinggg, the files are indeed cached, so the time is spent simply loading the UMLS index.
@MichalMalyska, yes, this is approximately what I get (profiling output in the opening post).

The profiler shows that most of the time is spent decoding JSON objects:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
3359672   16.912    0.000   16.912    0.000 .../python3.8/json/decoder.py:343(raw_decode)

I am wondering if there is a more efficient way to store, load and query the data. Furthermore, the current solution is very memory intensive (RAM usage peaks at 8 GB when running the above example).

Two ideas for improvement are:

  1. pyarrow to store the alias list
  2. faiss to improve upon the current nearest neighbour search (at least in terms of speed)?

These are only suggestions, as I don't know enough about the inner workings of scispacy. For my project this issue is not critical, but it might be a nice improvement for the library.
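
The profile in the opening post shows about 3.36 million calls to `json.loads`, which suggests the data is decoded line by line on every load. A standard-library alternative to the two ideas above is to decode once and keep a binary cache next to the source file. The sketch below is a minimal illustration under stated assumptions: `load_records` and both paths are hypothetical, it does not touch scispacy's own loading code, and whether it is actually faster would need to be measured. It also does nothing for peak memory, and pickle files should only be loaded from trusted locations.

```python
import json
import pickle
from pathlib import Path
from typing import Any, List


def load_records(jsonl_path: Path, cache_path: Path) -> List[Any]:
    """Return the records of a JSON Lines file, reusing a pickle cache when it is fresh."""
    # Use the cache only if it is at least as recent as the source file.
    if cache_path.exists() and cache_path.stat().st_mtime >= jsonl_path.stat().st_mtime:
        with cache_path.open("rb") as f:
            return pickle.load(f)

    # Slow path: decode every non-empty line as JSON.
    with jsonl_path.open("r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

    # Store the decoded records so the next load can skip JSON decoding.
    with cache_path.open("wb") as f:
        pickle.dump(records, f, protocol=pickle.HIGHEST_PROTOCOL)
    return records
```

The first call pays the full decoding cost and writes the cache; later calls read the pickle instead, until the source file changes.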

@MichalMalyska
Contributor

@vlievin are you aware of how aliasing is supported in faiss?
I always wanted to take a look and try adding this to scispacy, but that was the barrier I could not find an answer to.
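
If "aliasing" here means mapping each indexed vector back to an alias string, and from there to the concepts that share that alias, one library-agnostic option is to keep that mapping outside the index in plain lookup tables keyed by row position. The sketch below is an assumption about the intended setup, not scispacy's or faiss's actual design: `nearest_rows` is a brute-force stand-in for whatever vector index is used, and the vectors, aliases and concept IDs are made-up toy data.

```python
from typing import Dict, List, Sequence, Set, Tuple


def nearest_rows(vectors: Sequence[Sequence[float]], query: Sequence[float], k: int) -> List[int]:
    """Brute-force stand-in for a vector index: row positions of the k best dot products."""
    scores = [sum(a * b for a, b in zip(vec, query)) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda row: -scores[row])[:k]


def link(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    row_to_alias: Sequence[str],
    alias_to_concepts: Dict[str, Set[str]],
    k: int = 2,
) -> List[Tuple[str, Set[str]]]:
    """Resolve the nearest rows to (alias, concept IDs) pairs via side tables."""
    return [
        (row_to_alias[row], alias_to_concepts[row_to_alias[row]])
        for row in nearest_rows(vectors, query, k)
    ]


# Toy data: row i of `vectors` is the embedding of row_to_alias[i].
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
row_to_alias = ["heart attack", "aspirin", "myocardial infarction"]
alias_to_concepts = {
    "heart attack": {"CUI_A"},            # made-up concept IDs
    "myocardial infarction": {"CUI_A"},   # two aliases, same concept
    "aspirin": {"CUI_B"},
}

matches = link([1.0, 0.0], vectors, row_to_alias, alias_to_concepts, k=2)
```

With this layout the index only needs to return integer row positions, so swapping the search backend would not affect how aliases resolve to concepts.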

@vlievin
Author

vlievin commented Oct 29, 2021

Hi @MichalMalyska, I am only getting started with faiss, so unfortunately I don't know about aliasing in faiss. But if I get a definite answer in the near future, I'll let you know here.

I am not an expert with nmslib either. So please take these suggestions for what they are: ideas and not recommendations.

dakinggg added the enhancement label on Feb 2, 2022
@ddofer

ddofer commented Dec 3, 2024

I'll add that, for me at least, it re-downloads some of the files every single time. Check whether it works for you offline; if not, then that's part of it! See #535.
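
The profile in the opening post does list six SSL handshakes and socket connects, so some network traffic occurs even with a warm cache. One way to check this without physically going offline is to make outgoing connections fail while the linker loads. The sketch below is a minimal illustration: `block_network` and `NetworkBlocked` are hypothetical names, and it only intercepts `socket.socket.connect`, so code that connects by other means (for example `connect_ex`) or that swallows exceptions would not be caught.

```python
import socket
from contextlib import contextmanager


class NetworkBlocked(RuntimeError):
    """Raised when code tries to open a connection inside block_network()."""


@contextmanager
def block_network():
    """Temporarily make every socket.connect() call raise NetworkBlocked."""
    original_connect = socket.socket.connect

    def refuse(self, address):
        raise NetworkBlocked(f"attempted connection to {address!r}")

    socket.socket.connect = refuse
    try:
        yield
    finally:
        # Always restore the real connect, even if the body raised.
        socket.socket.connect = original_connect
```

Wrapping the `add_scispacy_linker(model)` call from the opening post in `with block_network():` after a first successful load would then raise `NetworkBlocked` if loading still tries to reach the network.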

4 participants