Resolve a memory leak by large data on out_queue (related to #494) #545

Merged (4 commits) on Jun 10, 2021

YasushiMiyata

@YasushiMiyata YasushiMiyata commented May 17, 2021

Description of the problems or issues

Is your pull request related to a problem? Please describe.
Fonduer speeds up document parsing with multiprocessing.
Each worker takes documents from in_queue (shared memory) and puts the parsed data and the document name on out_queue (shared memory).
This is a well-known pattern, but it can hang because of a memory leak in shared memory.
The previous code put the parsed (relatively large) data on out_queue.
Another process then took the data from out_queue and committed it to the PostgreSQL DB.
See also #494.

Does your pull request fix any issue?
See #494

Description of the proposed changes

Put only the document name on out_queue, not the parsed data.
Instead of passing the data through out_queue to be committed elsewhere, each worker commits its own parsed data before putting the document name on out_queue.
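
The proposed flow can be sketched roughly as follows. This is a minimal illustration, not Fonduer's code: it uses threads and the standard-library `queue.Queue` to stay self-contained (Fonduer uses multiprocessing), and `FakeStore`, `parse`, `worker`, and `run` are hypothetical names, with `FakeStore` standing in for the PostgreSQL session.

```python
import queue
import threading


class FakeStore:
    """Hypothetical stand-in for the database that workers commit to."""

    def __init__(self):
        self._lock = threading.Lock()
        self.rows = {}

    def commit(self, name, parsed):
        with self._lock:
            self.rows[name] = parsed


def parse(html):
    # Placeholder for the real (expensive) parsing step.
    return html.upper()


def worker(in_queue, out_queue, store):
    while True:
        item = in_queue.get()
        if item is None:  # sentinel: no more documents
            break
        name, html = item
        store.commit(name, parse(html))  # the worker commits its own result
        out_queue.put(name)  # only the small document name crosses the queue


def run(docs, n_workers=4):
    in_queue, out_queue, store = queue.Queue(), queue.Queue(), FakeStore()
    workers = [
        threading.Thread(target=worker, args=(in_queue, out_queue, store))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for item in docs.items():
        in_queue.put(item)
    for _ in workers:
        in_queue.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    done = []
    while not out_queue.empty():
        done.append(out_queue.get())
    return sorted(done), store.rows
```

Calling `run({"a.html": "<p>x</p>"})` returns the list of finished document names together with the rows the workers committed; the parsed data itself never passes through `out_queue`.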

Test plan

Run the existing tests and monitor Python memory usage.
In my case (3000 HTML files, 12 MB total), Python memory usage dropped from 1.4 GB to 700 MB.
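
One possible way to monitor memory while running a workload is the standard-library `tracemalloc` module. The sketch below is only an assumption about how such a measurement could be taken, not how the numbers above were obtained; `peak_memory_mib` and `build_payload` are hypothetical helpers, and `tracemalloc` only sees Python allocations in the current process, so child processes would need to be measured separately.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak traced Python memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def build_payload(n):
    # Hypothetical workload standing in for a parsing run:
    # allocates n distinct 1 KiB byte strings.
    return [bytes(1024) for _ in range(n)]
```

For example, `peak_memory_mib(build_payload, 2000)` returns the payload and the peak memory traced during the call, which can be compared before and after a change.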

Checklist

  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • I have updated the CHANGELOG.rst accordingly.

@YasushiMiyata YasushiMiyata marked this pull request as ready for review May 17, 2021 06:59

YasushiMiyata commented May 21, 2021

Fixed the add & commit step for parallel workers. parser.py, labeler.py, and featurizer.py generate data in multiple workers, but the data was added and committed in a single process through out_queue. So I updated the code
from

doc(html) -- in_queue -- th1: parser -- out_queue(doc name, parsed data) -- writer(parsed data)
                      |- th2: parser -|
                      |- th3: parser -|
                      |- th4: parser -|

to

doc(html) -- in_queue -- th1: parser -- writer(data) -- out_queue(doc name)
                      |- th2: parser -- writer(data) -|
                      |- th3: parser -- writer(data) -|
                      |- th4: parser -- writer(data) -|

This change reduces memory usage and prevents memory leaks because out_queue holds less data.
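
To illustrate why the queue holds less data: Python's multiprocessing queues pickle each item before sending it, so the pickled size is a rough proxy for what sits in the queue. The sketch below compares the two payload shapes; `queue_payload_bytes`, `doc_name`, and `parsed_data` are made-up illustrations, not Fonduer objects.

```python
import pickle


def queue_payload_bytes(item):
    """Approximate the bytes a multiprocessing queue would carry for item."""
    return len(pickle.dumps(item))


doc_name = "report_0001"
# Hypothetical parsed output: many sentences per document.
parsed_data = {"sentences": [f"sentence number {i}" for i in range(1000)]}

before = queue_payload_bytes((doc_name, parsed_data))  # old: name + parsed data
after = queue_payload_bytes(doc_name)  # new: document name only
```

Here `after` is a few dozen bytes regardless of document size, while `before` grows with the amount of parsed data.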

codecov-commenter commented May 21, 2021

Codecov Report

Merging #545 (3d05728) into master (b1d72be) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #545   +/-   ##
=======================================
  Coverage   86.07%   86.08%           
=======================================
  Files          92       92           
  Lines        4776     4779    +3     
  Branches      899      899           
=======================================
+ Hits         4111     4114    +3     
  Misses        475      475           
  Partials      190      190           

Flag        Coverage Δ
unittests   86.08% <100.00%> (+<0.01%) ⬆️

Impacted Files                        Coverage Δ
src/fonduer/features/featurizer.py    86.02% <100.00%> (ø)
src/fonduer/parser/parser.py          93.41% <100.00%> (ø)
src/fonduer/supervision/labeler.py    70.37% <100.00%> (ø)
src/fonduer/utils/udf.py              88.88% <100.00%> (+0.31%) ⬆️

@senwu senwu left a comment

LGTM. 👍

@senwu senwu merged commit 9d794b9 into HazyResearch:master Jun 10, 2021
senwu pushed a commit that referenced this pull request Jun 10, 2021