-
Notifications
You must be signed in to change notification settings - Fork 469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract common code paths in indexing pipeline #2275
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #2275 +/- ##
============================================
+ Coverage 63.41% 63.82% +0.41%
- Complexity 1287 1296 +9
============================================
Files 196 197 +1
Lines 11463 11251 -212
Branches 1457 1421 -36
============================================
- Hits 7269 7181 -88
+ Misses 3672 3593 -79
+ Partials 522 477 -45 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx @lintool LGTM
public int memoryBuffer = 16384; | ||
|
||
@Option(name = "-threads", metaVar = "[num]", usage = "Number of indexing threads.") | ||
public int threads = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking out loud: what about defaulting to Runtime.getRuntime().availableProcessors()
so that indexing concurrency would automatically try to use all available resources, without requiring users to pass the right number of threads? (Possibly for another PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, potentially - but:
(1) defaulting to consuming all available resources can be rude ;)
(2) it depends on the corpus as well... e.g., if the corpus only has 10 files, then 16 threads isn't going to help since we only do per-file parallelism.
Tests are proceeding, but not finished yet. Stand by. |
Major refactoring of indexing pipeline (IndexCollection, IndexHnswDenseVectors, and IndexInvertedDenseVectors), extracting common code paths into AbstractIndexer.
@tteofili @MXueguang still running tests, but ready for review in parallel.