Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Execute preprocessing and parsing in parallel #439

Merged
merged 3 commits into from
Jun 19, 2020

Conversation

HiromuHota
Copy link
Contributor

@HiromuHota HiromuHota commented Jun 10, 2020

Description of the problems or issues

Is your pull request related to a problem? Please describe.

Currently, preprocessor and parser are executed in a complete sequential order.
i.e., preprocess N docs (and load them into a queue), then parse N docs.
This has two drawbacks:

  1. the progress bar shows nothing during preprocessing.
  2. the machine RAM has to be large enough to hold N preprocessed docs at a time.

They become more serious when N is large and/or each HTML file is large.

Does your pull request fix any issue.

Fix #435

Description of the proposed changes

A clear and concise description of what you propose.

This PR

  • places a cap on the in_queue so that only a certain number of documents are loaded to in_queue.
  • executes preprocessor and parser in parallel (ie the main process does preprocessing and child process(es) do parsing in parallel).

Test plan

A clear and concise description of how you test the new changes.

For the 1st issue: I manually check the progress bar starts showing progress right after starting parse.apply.

Checklist

  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • I have updated the CHANGELOG.rst accordingly.

@codecov-commenter
Copy link

codecov-commenter commented Jun 17, 2020

Codecov Report

Merging #439 into master will increase coverage by 0.06%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #439      +/-   ##
==========================================
+ Coverage   83.22%   83.28%   +0.06%     
==========================================
  Files          88       88              
  Lines        4559     4564       +5     
  Branches      837      837              
==========================================
+ Hits         3794     3801       +7     
  Misses        572      572              
+ Partials      193      191       -2     
Flag Coverage Δ
#unittests 83.28% <100.00%> (+0.06%) ⬆️
Impacted Files Coverage Δ
src/fonduer/utils/udf.py 88.67% <100.00%> (+1.17%) ⬆️
src/fonduer/parser/visual_linker.py 84.18% <0.00%> (+0.22%) ⬆️
...c/fonduer/parser/preprocessors/doc_preprocessor.py 88.09% <0.00%> (+2.38%) ⬆️

@HiromuHota HiromuHota changed the title Cap the number of docs in in_queue Execute preprocessing and parsing in parallel Jun 17, 2020
@HiromuHota
Copy link
Contributor Author

Tests passed, but extremely slow.
While the pytest step took 18m to finish on Linux on the master branch (8edfacd), it took 28m on 26ac90f.

@senwu
Copy link
Collaborator

senwu commented Jun 17, 2020

Great! How about we set maxsize as argument in config and user can override it.

Is there a way to automatically estimate the maxsize?

@HiromuHota HiromuHota force-pushed the fix/435 branch 2 times, most recently from 3ba328e to e2dbfa8 Compare June 17, 2020 22:51
Hiromu Hota added 2 commits June 17, 2020 17:32
Currently, preprocessor and parser are executed in a complete sequential order.
i.e., preprocess N docs (and load them into a queue), then parse N docs.
This has two drawbacks:
  1. the progress bar shows nothing during preprocessing.
  2. the machine RAM may not be large enought o hold N preprocessed docs.
They become more serious when N is large and/or each HTML file is large.
@HiromuHota
Copy link
Contributor Author

Good point. I think the current magic number (maxsize=parallelism * 2) is good enough, or at least there would be little benefit from optimizing it.

This maxsize is the buffer size between preprocessing and parsing.
Let's assume preprocessing is slower than parsing , the in_queue is always close to empty and Parser will be waiting. This case, there is no point in optimizing maxsize. maxsize=1 would be good enough.
Let's assume preprocessing is faster than parsing, the in_queue will be piling up, and up to maxsize. If maxsize is large, it could eat up the whole RAM (#435). Having a cap on queue by setting a low number for maxsize can prevents this, but Preprocessor will be waiting.
So what should actually be optimized is the speed of both preprocessing and parsing, in other words, they have to be balanced for the best performance and the best use of memory.

If you remember that only one process does preprocessing while multiple child process(es) do parsing in parallel, you may want to parallelize preprocessing too.
But then you would wonder what's the best parallelism for preprocessing and one for parsing. This will be a nightmare.
By the time we start thinking about the whole pipeline optimization, we have to leverage parallel/distributed computing frameworks like Dask, PySpark instead of direct use of multiprocessing.

@HiromuHota
Copy link
Contributor Author

Regarding performance, the current commit (b01a9d5) took 21m in the pytest step on Linux. I think this is within an allowable margin of error. The performance on GitHub Actions fluctuates anyway.

@senwu senwu requested review from lukehsiao and senwu June 18, 2020 03:29
@HiromuHota
Copy link
Contributor Author

I can clearly confirm that the first issue has been resolved, but still seeing memory issue.
I thought I was seeing a memory issue on spaCy (explosion/spaCy#3618) and (explosion/spaCy#4486), but this issue has been fixed at v2.1.9.

@HiromuHota
Copy link
Contributor Author

The 2nd issue: "the machine RAM has to be large enough to hold N preprocessed docs at a time" is not the case.
I ran only the preprocessing part as:

from multiprocessing import Queue

queue = Queue()
i = 0
for doc in doc_preprocessor:
    print(i)
    i += 1
    queue.put(doc)

I can see an increase of the memory usage, but this did not kill the python process.

Running the following (preprocessing + parsing), ie parser.apply(doc_preprocessor) kills it.
So I suspected a memory leak somewhere in parsing, and I updated spaCy from 2.1.9 to 2.2.4, SQLAlchemy from 1.3.13 to 1.3.17, but still having an issue.

@HiromuHota
Copy link
Contributor Author

HiromuHota commented Jun 18, 2020

Now I have a better picture of the issue.
The memory issue is actually two-hold:
2. Loading all the preprocessed documents onto the memory at a time.
3. Parser itself is very memory-hungry. (e.g., a machine with 1GB of RAM barely processes ONSMS04099-1.html, 1.3MB on disk, in the hardware tutorial).

This PR addresses only the 2nd issue (and the 1st issue: the progress bar).

@HiromuHota
Copy link
Contributor Author

It is typically difficult to test the parallel execution of multiple things (preprocessing and parsing in this case).
I hard-coded print in udf.py for testing purpose and got the following result when executing test_e2e.py:

Main process put        112823 into in_queue (in_queue.qsize: 0)
1-th worker process got 112823 from in_queue (in_queue.qsize: 0)
Main process put        2N3906-D into in_queue (in_queue.qsize: 0)
0-th worker process got 2N3906-D from in_queue (in_queue.qsize: 0)
Main process put        2N3906 into in_queue (in_queue.qsize: 1)
Main process put        2N4123-D into in_queue (in_queue.qsize: 2)
Main process put        2N4124 into in_queue (in_queue.qsize: 3)
Main process put        2N6426-D into in_queue (in_queue.qsize: 4)
1-th worker process got 2N3906 from in_queue (in_queue.qsize: 4)
Main process put        2N6427 into in_queue (in_queue.qsize: 4)
0-th worker process got 2N4123-D from in_queue (in_queue.qsize: 4)
Main process put        AUKCS04635-1 into in_queue (in_queue.qsize: 4)
1-th worker process got 2N4124 from in_queue (in_queue.qsize: 4)
Main process put        BC182-D into in_queue (in_queue.qsize: 4)
Main process put        BC182 into in_queue (in_queue.qsize: 4)
0-th worker process got 2N6426-D from in_queue (in_queue.qsize: 4)
1-th worker process got 2N6427 from in_queue (in_queue.qsize: 3)
Main process put        BC337-D into in_queue (in_queue.qsize: 4)
0-th worker process got AUKCS04635-1 from in_queue (in_queue.qsize: 4)
Main process put        BC337 into in_queue (in_queue.qsize: 4)
0-th worker process got BC182-D from in_queue (in_queue.qsize: 3)
1-th worker process got BC182 from in_queue (in_queue.qsize: 2)
0-th worker process got BC337-D from in_queue (in_queue.qsize: 1)
1-th worker process got BC337 from in_queue (in_queue.qsize: 0)

in_queue.qsize() returns "the approximate size of the queue" (https://docs.python.org/3/library/queue.html#queue.Queue.qsize) but you can roughly see that the number of documents in in_queue is capped (by 4 in this case) and that preprocessing and parsing are executed in parallel.

@senwu
Copy link
Collaborator

senwu commented Jun 19, 2020

Great! Glad to have 1st issue fixed!

For the 3rd issue, what's the bottleneck? Spacy or other parts of the parsing procedure? My guess is that Spacy is the bottleneck here (Correct me If I am wrong.)

Also, one thing I am aware of is that we create Spacy model for parsing each document which is not friendly for memory.

@HiromuHota HiromuHota marked this pull request as ready for review June 19, 2020 00:35
@HiromuHota
Copy link
Contributor Author

Good question. I'd like to address the 3rd issue (Parser itself is very memory-hungry) in a different issue/PR.
This could be spaCy, lxml, SQLalchemy, or the parsed Document itself.

IMO, before addressing the 3rd issue, there should be a guideline of how much memory (and PARALLEL) is recommended for how large the HTML file could be.
For example, 1GB+ RAM is a must and 2GB+ is recommended for ~1MB of HTML files and PARALLEL=1.
Another example could be 4GB+ is recommended for ~1MB of HTML files and PARALLEL=2.
This recommendation could also be affected by if visual=True, if lingual=True, or etc.

@senwu
Copy link
Collaborator

senwu commented Jun 19, 2020

Agreed! Let's address the 3rd issue in another PR. Before giving any recommendation for memory usage, I think we would be better to do some tests.

Also, is there any smart queue we can use which means we can control the pool size by estimating the total memory usage of all documents in the pool?

@lukehsiao lukehsiao added this to the v0.8.3 milestone Jun 19, 2020
@lukehsiao lukehsiao added the enhancement New feature or request label Jun 19, 2020
Copy link
Contributor

@lukehsiao lukehsiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great step forward. Looking forward the reduced parser memory usage, too.

@senwu senwu removed their request for review June 19, 2020 09:02
@senwu senwu self-requested a review June 19, 2020 09:03
Copy link
Collaborator

@senwu senwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lukehsiao lukehsiao merged commit b0ac254 into HazyResearch:master Jun 19, 2020
@HiromuHota HiromuHota deleted the fix/435 branch June 19, 2020 18:18
@HiromuHota
Copy link
Contributor Author

For future reference, disabling the global cache for IDs (ie collect_ids=False) in lxml may help solve the remaining issue (https://benbernardblog.com/tracking-down-a-freaky-python-memory-leak-part-2/).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants