Make Fetcher pipelien faster #217

wkelly17 · 2024-08-19T21:57:16Z

adjust fetcher pipeline to reduce glob overhead, and use a thread pool for IO ops

Please include a summary of your changes.

Small change - I replace the -t (trace) argument with -l (log level that matches to standard py log levels.). The current debug or warning wasn't granular enough. Debug spits out too much, so I added some higher level logs under info and made that the default.
At the start of each loop -> We're just gonna call one rglob(*). There were already multiple call for *.wav, and several for nested mp3 that was likely causing pretty significant overhead. I wasn't completely sure whether the workers had any interdependence or not (i.e. the files created by one worker affected another). I tried to refactor on the "safe side" by assumin so. To avoid having to looks for new files with an rglob, the mutable set of all_files is passes around. When I worker finihses, it updates the set by union and by exclusion based on what it has created or deleted. I did some ad hoc testing, and I don't think we're at the scale where reading into memory matters too much yet. I testing rglob on my own comptuer from home directory and read 2.5 million files before it started to churn. You can also fit about 50 million paths into a set in about 1.5 mins on my machine, and then that's only about 2gb of memory (sort I think it should be fine to just read the whole generator into memory based on those ad hoc tests).
Since I made the assumption the workers might be interdependent (I just couldn't tell looking only at the code level), I decided to parallelize the work in each worker. I think the majority of the input is most likely IO, which makes it a pretty good case for green threads based on some stuff Craig sent me, so basically each worker was refactored to internally leverage a thread pool.
I adjusted the base docker image for the build to python 3.12 since ther was an bug report for prior to 3.12 with regards to some rglob internals causing significant slowdowns (see Path.rglob performance issues in deeply nested directories compared to glob.glob(recursive=True) python/cpython#102613). And then it bumped the node base version since they are bundled together, but node 14 is definitely EOL so it probably needed bumping too.

Some of this stuff was fairly new to me, so it may not be the most pythonic, and it may not even quite be correct, but I figured it was a good first stab of the two things that almost certainly are the where the largest amount of time is being spent.

Please feel free to review and revise. I would have tested it more, but I don't have a local directory that matches what the workers are looking for mocked, and I don't have ssh access to run it against the dev cdn or anything of that nature.

I'm sure there is room for improvement, but I think that'd require some changes I can't control (i.e. hashes vs timestamps) and a more thoughtful rearchitecting.

This change is

* ignore hash or query params in load more button (#190) * Added LogRocket script (#193) * Added LogRocket script * break long line * simplify null check Co-Authored-By: Will Kelly <[email protected]> Co-Authored-By: Maxim <[email protected]>

…ink.org Deploy to prod

…ink.org Audiobieldev.walink.org

…l for IO ops

sonarqubecloud · 2024-08-19T22:02:04Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

danparisd

LGTM

The base branch was changed.

mXaln and others added 5 commits October 31, 2023 06:13

Deploy to Prod (#194)

bc9c44f

* ignore hash or query params in load more button (#190) * Added LogRocket script (#193) * Added LogRocket script * break long line * simplify null check Co-Authored-By: Will Kelly <[email protected]> Co-Authored-By: Maxim <[email protected]>

Merge pull request #204 from Bible-Translation-Tools/audiobieldev.wal…

7b9a16d

…ink.org Deploy to prod

Merge pull request #212 from Bible-Translation-Tools/audiobieldev.wal…

b849ffb

…ink.org Audiobieldev.walink.org

adjust fetcher pipeline to reduce glob overhead, and use a thread poo…

9957c20

…l for IO ops

adjust base dockerfile to newer node/python

2de6186

wkelly17 requested review from PurpleGuitar and mXaln August 19, 2024 21:57

forgot to bump pip version

026f38c

danparisd previously approved these changes Aug 21, 2024

View reviewed changes

danparisd changed the base branch from audio.bibleineverylanguage.org to audiobieldev.walink.org August 21, 2024 14:42

danparisd approved these changes Aug 21, 2024

View reviewed changes

danparisd merged commit 796ad83 into audiobieldev.walink.org Aug 21, 2024
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make Fetcher pipelien faster #217

Make Fetcher pipelien faster #217

wkelly17 commented Aug 19, 2024 •

edited

Loading

sonarqubecloud bot commented Aug 19, 2024

danparisd left a comment

Make Fetcher pipelien faster #217

Make Fetcher pipelien faster #217

Conversation

wkelly17 commented Aug 19, 2024 • edited Loading

Please include a summary of your changes.

sonarqubecloud bot commented Aug 19, 2024

Quality Gate passed

danparisd left a comment

Choose a reason for hiding this comment

wkelly17 commented Aug 19, 2024 •

edited

Loading