Resolve a memory leak by large data on out_queue (related to #494) #545

YasushiMiyata · 2021-05-17T06:29:52Z

Description of the problems or issues

Is your pull request related to a problem? Please describe.
Fonduer accelerates document parsing with multiprocessing.
Each process gets documents from in_queue (shared memory) and puts the parsed data and the document name on out_queue (shared memory).
This is a well-known pattern, but it can hang because of a memory leak in shared memory.
The previous code put the parsed (relatively large) data on out_queue.
Another process then got the data from out_queue and committed it to the Postgres DB.
See also #494.

Does your pull request fix any issue.
See #494

Description of the proposed changes

Change out_queue so that it carries only the document name, not the parsed data.
Instead of committing data received through out_queue, each worker process commits its own parsed data before putting the document name on out_queue.
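
For readers who want to see the shape of this change, here is a minimal sketch. It is not the Fonduer code: it uses `threading` and `queue.Queue` as stand-ins for the multiprocessing setup, a plain dict as a stand-in for the database, and the names `worker`, `run`, `parse` and `commit` are hypothetical.

```python
import queue
import threading


def worker(in_queue, out_queue, parse, commit):
    """Parse each document, commit the result locally, then report only its name."""
    while True:
        item = in_queue.get()
        if item is None:  # sentinel: no more documents for this worker
            break
        name, html = item
        parsed = parse(html)
        commit(name, parsed)  # the large parsed object never leaves the worker
        out_queue.put(name)   # only a short string goes onto out_queue


def run(docs, n_workers=4):
    """Run the workers over (name, html) pairs; return finished names and the store."""
    in_queue, out_queue = queue.Queue(), queue.Queue()
    store = {}  # hypothetical stand-in for the database
    lock = threading.Lock()

    def commit(name, parsed):
        with lock:
            store[name] = parsed

    workers = [
        threading.Thread(target=worker, args=(in_queue, out_queue, str.split, commit))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for doc in docs:
        in_queue.put(doc)
    for _ in workers:
        in_queue.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    done = sorted(out_queue.get() for _ in docs)
    return done, store
```

In this sketch the consumer of out_queue only needs the names to track progress, so the queue stays small regardless of how large each parsed document is.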

Test plan

Run the existing tests and monitor Python memory usage.
In my case (3,000 HTML files, 12 MB total), Python memory usage dropped from 1.4 GB to 700 MB.
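
One possible way to take a similar measurement from inside Python is the standard-library `tracemalloc` module. This is only a sketch, and probably not how the numbers above were obtained: `tracemalloc` sees allocations made by the Python interpreter in the current process only, so it would not account for child processes. The helper name `peak_memory_bytes` is hypothetical.

```python
import tracemalloc


def peak_memory_bytes(func, *args, **kwargs):
    """Call func and return (result, peak traced memory in bytes) for this process."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

Wrapping the same parsing call before and after the change would give two peak values to compare.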

Checklist

I have updated the documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.
I have updated the CHANGELOG.rst accordingly.

YasushiMiyata · 2021-05-21T08:01:14Z

Fix the add & commit process for multi-threading. parser.py, labeler.py and featurizer.py generate data on multiple threads, but they added and committed the data in a single process through out_queue. So I updated the code
from

doc(html) -- in_queue -- th1: parser -- out_queue(doc name, parsed data) -- writer(parsed data)
                      |- th2: parser -|
                      |- th3: parser -|
                      |- th4: parser -|

to

doc(html) -- in_queue -- th1: parser -- writer(data) -- out_queue(doc name)
                      |- th2: parser -- writer(data) -|
                      |- th3: parser -- writer(data) -|
                      |- th4: parser -- writer(data) -|

This change reduces memory usage and prevents memory leaks because out_queue holds less data.
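
To give a rough feel for the difference in queue traffic: `multiprocessing.Queue` pickles every object passed to `put`, so the pickled size approximates what crosses the queue. The sketch below compares the two payload shapes; the document name and the `parsed_data` dict are made-up stand-ins, not real Fonduer objects.

```python
import pickle


def queue_payload_size(item):
    """Approximate number of bytes serialized when item is put on a multiprocessing queue."""
    return len(pickle.dumps(item))


doc_name = "report_0001.html"  # hypothetical document name
# Hypothetical stand-in for the parsed output of one document.
parsed_data = {"sentences": [f"sentence number {i}" for i in range(10_000)]}

old_size = queue_payload_size((doc_name, parsed_data))  # before: name plus parsed data
new_size = queue_payload_size(doc_name)                 # after: name only
print(f"before: {old_size} bytes, after: {new_size} bytes")
```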

codecov-commenter · 2021-05-21T08:23:47Z

Codecov Report

Merging #545 (3d05728) into master (b1d72be) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #545   +/-   ##
=======================================
  Coverage   86.07%   86.08%           
=======================================
  Files          92       92           
  Lines        4776     4779    +3     
  Branches      899      899           
=======================================
+ Hits         4111     4114    +3     
  Misses        475      475           
  Partials      190      190

Flag	Coverage Δ
unittests	`86.08% <100.00%> (+<0.01%)`	⬆️

Impacted Files	Coverage Δ
src/fonduer/features/featurizer.py	`86.02% <100.00%> (ø)`
src/fonduer/parser/parser.py	`93.41% <100.00%> (ø)`
src/fonduer/supervision/labeler.py	`70.37% <100.00%> (ø)`
src/fonduer/utils/udf.py	`88.88% <100.00%> (+0.31%)`	⬆️

senwu

LGTM. 👍

YasushiMiyata added 2 commits May 17, 2021 14:54

Resolve a memory leak caused by large data on out_queue (related to H…

7a9594c

…azyResearch#494)

Add HazyResearch#545 change log to CHANGELOG.rst

5c283cf

YasushiMiyata marked this pull request as ready for review May 17, 2021 06:59

YasushiMiyata added 2 commits May 17, 2021 16:15

Fix HazyResearch#545 CHANGELOG.rst

bb39209

Add multi-thread support for Parser._add, Labeler._add and Featurizer…

3d05728

…._add

senwu approved these changes Jun 10, 2021

senwu merged commit 9d794b9 into HazyResearch:master Jun 10, 2021

senwu pushed a commit that referenced this pull request Jun 10, 2021

Add #545 change log to CHANGELOG.rst

39f2e1c

senwu pushed a commit that referenced this pull request Jun 10, 2021

Fix #545 CHANGELOG.rst

5ab8d4f
