Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Fetcher pipelien faster #217

Merged
merged 6 commits into from
Aug 21, 2024
Merged

Conversation

wkelly17
Copy link
Contributor

@wkelly17 wkelly17 commented Aug 19, 2024

adjust fetcher pipeline to reduce glob overhead, and use a thread pool for IO ops

Please include a summary of your changes.

  1. Small change - I replace the -t (trace) argument with -l (log level that matches to standard py log levels.). The current debug or warning wasn't granular enough. Debug spits out too much, so I added some higher level logs under info and made that the default.
  2. At the start of each loop -> We're just gonna call one rglob(*). There were already multiple call for *.wav, and several for nested mp3 that was likely causing pretty significant overhead. I wasn't completely sure whether the workers had any interdependence or not (i.e. the files created by one worker affected another). I tried to refactor on the "safe side" by assumin so. To avoid having to looks for new files with an rglob, the mutable set of all_files is passes around. When I worker finihses, it updates the set by union and by exclusion based on what it has created or deleted. I did some ad hoc testing, and I don't think we're at the scale where reading into memory matters too much yet. I testing rglob on my own comptuer from home directory and read 2.5 million files before it started to churn. You can also fit about 50 million paths into a set in about 1.5 mins on my machine, and then that's only about 2gb of memory (sort I think it should be fine to just read the whole generator into memory based on those ad hoc tests).
  3. Since I made the assumption the workers might be interdependent (I just couldn't tell looking only at the code level), I decided to parallelize the work in each worker. I think the majority of the input is most likely IO, which makes it a pretty good case for green threads based on some stuff Craig sent me, so basically each worker was refactored to internally leverage a thread pool.
  4. I adjusted the base docker image for the build to python 3.12 since ther was an bug report for prior to 3.12 with regards to some rglob internals causing significant slowdowns (see Path.rglob performance issues in deeply nested directories compared to glob.glob(recursive=True) python/cpython#102613). And then it bumped the node base version since they are bundled together, but node 14 is definitely EOL so it probably needed bumping too.

Some of this stuff was fairly new to me, so it may not be the most pythonic, and it may not even quite be correct, but I figured it was a good first stab of the two things that almost certainly are the where the largest amount of time is being spent.

Please feel free to review and revise. I would have tested it more, but I don't have a local directory that matches what the workers are looking for mocked, and I don't have ssh access to run it against the dev cdn or anything of that nature.

I'm sure there is room for improvement, but I think that'd require some changes I can't control (i.e. hashes vs timestamps) and a more thoughtful rearchitecting.


This change is Reviewable

mXaln and others added 5 commits October 31, 2023 06:13
* ignore hash or query params in load more button (#190)

* Added LogRocket script (#193)

* Added LogRocket script

* break long line

* simplify null check

Co-Authored-By: Will Kelly <[email protected]>
Co-Authored-By: Maxim <[email protected]>
@wkelly17 wkelly17 requested review from PurpleGuitar and mXaln August 19, 2024 21:57
Copy link

danparisd
danparisd previously approved these changes Aug 21, 2024
Copy link
Contributor

@danparisd danparisd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@danparisd danparisd changed the base branch from audio.bibleineverylanguage.org to audiobieldev.walink.org August 21, 2024 14:42
@danparisd danparisd dismissed their stale review August 21, 2024 14:42

The base branch was changed.

@danparisd danparisd merged commit 796ad83 into audiobieldev.walink.org Aug 21, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants