Execute preprocessing and parsing in parallel #439
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #439      +/-   ##
==========================================
+ Coverage   83.22%   83.28%   +0.06%
==========================================
  Files          88       88
  Lines        4559     4564       +5
  Branches      837      837
==========================================
+ Hits         3794     3801       +7
  Misses        572      572
+ Partials      193      191       -2
```
Great! How about we set a maxsize on the queue? Is there a way to automatically estimate the maxsize?
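One rough way to estimate a maxsize automatically is to divide a RAM budget by an estimated per-document footprint. The sketch below is hypothetical and not part of this PR; the available-bytes figure could come from e.g. `psutil.virtual_memory().available`, and the overhead factor is a guessed in-memory blow-up over the on-disk document size:

```python
def estimate_maxsize(available_bytes, sample_doc_bytes,
                     overhead_factor=10, budget_fraction=0.5):
    """Estimate a queue maxsize from a RAM budget (illustrative sketch).

    available_bytes:  RAM currently available (e.g. from psutil).
    sample_doc_bytes: on-disk size of a representative document.
    overhead_factor:  assumed in-memory blow-up of a preprocessed doc.
    budget_fraction:  share of available RAM to devote to the queue.
    """
    per_doc = sample_doc_bytes * overhead_factor
    # Never return 0: the queue must hold at least one document.
    return max(1, int(available_bytes * budget_fraction // per_doc))
```

For example, with 1 GB available and 1 MB documents, this yields a maxsize of 50. All the constants here are assumptions that would need measuring against real documents.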
Force-pushed from 3ba328e to e2dbfa8 (Compare)
Currently, the preprocessor and parser are executed in a completely sequential order, i.e., preprocess N docs (and load them into a queue), then parse N docs. This has two drawbacks: 1. the progress bar shows nothing during preprocessing; 2. the machine RAM may not be large enough to hold N preprocessed docs. Both become more serious when N is large and/or each HTML file is large.
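The bounded-queue idea behind this PR can be sketched as follows: a producer preprocesses documents into a size-limited queue while the consumer parses them, so RAM holds at most `maxsize` preprocessed docs at once. This is a minimal single-consumer illustration with placeholder `preprocess`/`parse` functions, not the project's actual implementation:

```python
import queue
import threading

def preprocess(doc):
    # placeholder preprocessing step (hypothetical)
    return doc.upper()

def parse(doc):
    # placeholder parsing step (hypothetical)
    return len(doc)

def run_pipeline(docs, maxsize=2):
    # Bounded queue: put() blocks once maxsize items are waiting,
    # so preprocessing cannot run arbitrarily far ahead of parsing.
    in_queue = queue.Queue(maxsize=maxsize)
    results = []

    def producer():
        for doc in docs:
            in_queue.put(preprocess(doc))  # blocks when the queue is full
        in_queue.put(None)  # sentinel: no more documents

    t = threading.Thread(target=producer)
    t.start()
    while True:
        item = in_queue.get()
        if item is None:
            break
        results.append(parse(item))
    t.join()
    return results
```

Because the consumer drains the queue as it parses, the progress bar can advance from the first parsed document instead of waiting for all N to be preprocessed.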
Good point about the current magic number (…). Bear in mind that only one process does preprocessing while multiple child processes do parsing in parallel; you may want to parallelize preprocessing too.
Regarding performance, the current commit (b01a9d5) took 21 min in the pytest step on Linux. I think this is within an allowable margin of error; the performance on GitHub Actions fluctuates anyway.
I can clearly confirm that the first issue has been resolved, but I am still seeing the memory issue.
The 2nd issue, "the machine RAM has to be large enough to hold N preprocessed docs at a time", is no longer the case.
I can see an increase in memory usage, but it did not kill the Python process when running the following (preprocessing + parsing), i.e. …
Now I have a better picture of the issue. This PR addresses only the 2nd issue (and the 1st issue: the progress bar).
It is typically difficult to test the parallel execution of multiple things (preprocessing and parsing in this case).
Great! Glad to have the 1st issue fixed! For the 3rd issue, what's the bottleneck: Spacy, or other parts of the parsing procedure? My guess is that Spacy is the bottleneck here (correct me if I am wrong). Also, one thing I am aware of is that we create a Spacy model for parsing each document, which is not memory-friendly.
Good question. I'd like to address the 3rd issue (the Parser itself is very memory-hungry) in a different issue/PR. IMO, before addressing the 3rd issue, there should be a guideline on how much memory (and which PARALLEL setting) is recommended for a given HTML file size.
Agreed! Let's address the 3rd issue in another PR. Before giving any recommendation for memory usage, I think we had better do some tests first. Also, is there a smart queue we could use that controls the pool size by estimating the total memory usage of all documents in the pool?
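Such a "smart queue" could be bounded by total payload bytes rather than item count. A minimal sketch using `threading.Condition`; the per-item size is a caller-supplied estimate (e.g. `len()` of the raw HTML), and none of this exists in the project today:

```python
import threading
from collections import deque

class ByteBudgetQueue:
    """Queue bounded by total payload bytes instead of item count (sketch)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.items = deque()
        self.cond = threading.Condition()

    def put(self, item, size):
        with self.cond:
            # Block until the item fits in the byte budget. An oversized
            # item is still admitted when the queue is empty, to avoid
            # deadlocking on a single huge document.
            while self.used + size > self.max_bytes and self.items:
                self.cond.wait()
            self.items.append((item, size))
            self.used += size
            self.cond.notify_all()

    def get(self):
        with self.cond:
            while not self.items:
                self.cond.wait()
            item, size = self.items.popleft()
            self.used -= size
            self.cond.notify_all()
            return item
```

The accuracy of this scheme is only as good as the per-document size estimate, which is exactly the hard part discussed above.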
This is a great step forward. Looking forward to the reduced parser memory usage, too.
LGTM
For future reference: disabling the global cache for IDs (i.e. …)
Description of the problems or issues
Is your pull request related to a problem? Please describe.
Currently, the preprocessor and parser are executed in a completely sequential order,
i.e., preprocess N docs (and load them into a queue), then parse N docs.
This has two drawbacks:
1. the progress bar shows nothing during preprocessing;
2. the machine RAM may not be large enough to hold N preprocessed docs.
Both become more serious when N is large and/or each HTML file is large.
Does your pull request fix any issue?
Fix #435
Description of the proposed changes
A clear and concise description of what you propose.
This PR bounds `in_queue` so that only a certain number of documents are loaded into `in_queue` at a time.
Test plan
A clear and concise description of how you test the new changes.
For the 1st issue: I manually checked that the progress bar starts showing progress right after starting
`parse.apply`.
Checklist