Resolve a memory leak by large data on out_queue (related to #494) #545
Conversation
Fix the add & commit process under multi-threading.
This change reduces memory usage and prevents memory leaks because out_queue holds less data.
Codecov Report
@@ Coverage Diff @@
## master #545 +/- ##
=======================================
Coverage 86.07% 86.08%
=======================================
Files 92 92
Lines 4776 4779 +3
Branches 899 899
=======================================
+ Hits 4111 4114 +3
Misses 475 475
Partials 190 190
Flags with carried forward coverage won't be shown.
LGTM. 👍
Description of the problems or issues
Is your pull request related to a problem? Please describe.
Fonduer accelerates document parsing with multi-processing. Each process gets documents from `in_queue` (shared memory) and puts the parsed data and the document name into `out_queue` (shared memory). This is a well-known pattern, but it can hang because of a memory leak in the shared memory.
The previous code put the parsed (relatively large) data into `out_queue`. Another process gets the data from `out_queue` and commits it to the Postgres DB.
See also #494
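The problematic pattern can be sketched as follows. This is an illustrative reconstruction with invented names (`parse`, `worker`, `committer` are not Fonduer's actual API), using threads and `queue.Queue` for brevity in place of multi-processing: workers put the large parsed payload on `out_queue`, and a single committer drains it. When parsing outpaces committing, `out_queue` buffers many large payloads at once, which is the memory blow-up described above.

```python
import queue
import threading

def parse(doc_name):
    # Stand-in for real parsing: returns a relatively large object.
    return {"doc": doc_name, "data": "x" * 1_000_000}

def worker(in_queue, out_queue):
    while True:
        doc_name = in_queue.get()
        if doc_name is None:                 # poison pill: stop this worker
            break
        out_queue.put((doc_name, parse(doc_name)))  # large payload queued

def committer(out_queue, n_docs, committed):
    for _ in range(n_docs):
        doc_name, parsed = out_queue.get()   # large payload dequeued here
        committed.append(doc_name)           # stand-in for the DB commit

in_q, out_q, committed = queue.Queue(), queue.Queue(), []
docs = [f"doc{i}.html" for i in range(8)]
for d in docs:
    in_q.put(d)
in_q.put(None)

t1 = threading.Thread(target=worker, args=(in_q, out_q))
t2 = threading.Thread(target=committer, args=(out_q, len(docs), committed))
t1.start(); t2.start(); t1.join(); t2.join()
```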
Does your pull request fix any issue?
See #494
Description of the proposed changes
Change the `out_queue` input to only the document name, without the parsed data. Instead of committing data via `out_queue`, each worker process commits its parsed data before putting the document name into `out_queue`.
.Test plan
Run the existing tests and monitor Python memory usage.
In my case (3000 HTML files, 12 MB total), Python memory usage dropped from 1.4 GB to about 700 MB.
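For a reproducible in-process check (the 1.4 GB and 700 MB figures above came from monitoring the whole process externally, e.g. with `ps` or `top`), one portable option is to measure peak Python heap usage with the standard-library `tracemalloc`; the workload below is a generic stand-in:

```python
import tracemalloc

tracemalloc.start()
# Stand-in workload: hold ten ~1 MB strings at once.
payloads = ["x" * 1_000_000 for _ in range(10)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
```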
Checklist