Refactor the parsing of the text index builder #1695

Merged
merged 19 commits into from
Jan 22, 2025

Contributor

@Flixtastic Flixtastic commented Dec 28, 2024

Split up large functions, modernize code, choose better names and add some documentation

@Flixtastic
Contributor Author

One question: I didn't find a way to convert the result of absl::StrSplit to a std::range or std::view, and therefore resorted to using another cppcoro generator. I've seen the idea of avoiding these generators, but am I right that the only way to use these new iterators is to write classes that implement them?
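To make the "class that implements an iterator" option from the question concrete, here is a minimal sketch of a hand-written lazy tokenizer that works in a range-based for loop without a coroutine. `TokenView` and `collectTokens` are hypothetical names chosen for illustration; they are not part of QLever or Abseil, and the sketch splits on a single delimiter character only, with no normalization.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not from QLever or Abseil): a lazy view over the
// non-empty tokens of `text`, separated by a single delimiter character.
class TokenView {
 public:
  class Iterator {
   public:
    Iterator(std::string_view rest, char delim, bool atEnd)
        : rest_(rest), delim_(delim), atEnd_(atEnd) {
      if (!atEnd_) advance();  // Position on the first token, if any.
    }
    std::string_view operator*() const { return current_; }
    Iterator& operator++() {
      advance();
      return *this;
    }
    // Only the comparison against `end()` is meaningful in this sketch.
    bool operator!=(const Iterator& other) const {
      return atEnd_ != other.atEnd_;
    }

   private:
    // Skip leading delimiters, then cut the next token off the front.
    void advance() {
      std::size_t start = rest_.find_first_not_of(delim_);
      if (start == std::string_view::npos) {
        atEnd_ = true;
        return;
      }
      rest_.remove_prefix(start);
      current_ = rest_.substr(0, rest_.find(delim_));
      rest_.remove_prefix(current_.size());
    }

    std::string_view rest_;
    std::string_view current_;
    char delim_;
    bool atEnd_;
  };

  TokenView(std::string_view text, char delim) : text_(text), delim_(delim) {}
  Iterator begin() const { return Iterator(text_, delim_, false); }
  Iterator end() const { return Iterator({}, delim_, true); }

 private:
  std::string_view text_;
  char delim_;
};

// Materializes the tokens; only used to demonstrate the view.
std::vector<std::string> collectTokens(std::string_view text, char delim) {
  std::vector<std::string> result;
  for (std::string_view token : TokenView(text, delim)) {
    result.emplace_back(token);
  }
  return result;
}
```

The tokens are views into the original text, so the text must outlive the loop. A full standard-library range would additionally need the iterator type aliases and a sentinel comparison; this sketch only supports what a range-based for loop requires.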

codecov bot commented Dec 28, 2024

Codecov Report

Attention: Patch coverage is 88.00000% with 12 lines in your changes missing coverage. Please review.

Project coverage is 89.86%. Comparing base (acb6633) to head (349be6d).
Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/index/IndexImpl.Text.cpp | 75.00% | 9 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1695   +/-   ##
=======================================
  Coverage   89.86%   89.86%           
=======================================
  Files         389      390    +1     
  Lines       37308    37339   +31     
  Branches     4204     4205    +1     
=======================================
+ Hits        33527    33556   +29     
+ Misses       2485     2483    -2     
- Partials     1296     1300    +4     

Member

@joka921 joka921 left a comment

Thank you very much, this absolutely goes in the right direction.
I have some initial comments on the cleanup; let me know if you need further advice.

src/index/IndexImpl.Text.cpp (outdated, resolved)
src/parser/WordsAndDocsFileParser.h (outdated, resolved)
src/parser/WordsAndDocsFileParser.h (outdated, resolved)
src/parser/WordsAndDocsFileParser.cpp (outdated, resolved)
src/parser/WordsAndDocsFileParser.cpp (outdated, resolved)
src/parser/WordsAndDocsFileParser.cpp (outdated, resolved)
src/parser/WordsAndDocsFileParser.cpp (outdated, resolved)
test/WordsAndDocsFileLineCreator.h (outdated, resolved)
test/WordsAndDocsFileLineCreator.h (outdated, resolved)
test/WordsAndDocsFileParserTest.cpp (outdated, resolved)
Flixtastic and others added 2 commits January 9, 2025 12:44
…sts in WordsAndDocsFileParserTest.cpp. Renamed methods in WordsAndDocsFileLineCreator.h to reduce ambiguity. Incorporated requested small changes of PR.
@Flixtastic Flixtastic requested a review from joka921 January 9, 2025 15:54
Signed-off-by: Johannes Kalmbach <[email protected]>
Member

@joka921 joka921 left a comment

Only small suggestions.

Also have a look at the SonarCloud issues.

src/parser/WordsAndDocsFileParser.h (outdated, resolved)
src/parser/WordsAndDocsFileParser.h (outdated, resolved)
src/parser/WordsAndDocsFileParser.h (outdated, resolved)
test/WordsAndDocsFileParserTest.cpp (outdated, resolved)
ASSERT_EQ(std::get<0>(testLine), std::get<0>(expectedResult.at(i)));
ASSERT_EQ(std::get<1>(testLine), std::get<1>(expectedResult.at(i)));
ASSERT_EQ(std::get<2>(testLine), std::get<2>(expectedResult.at(i)));
ASSERT_EQ(std::get<3>(testLine), std::get<3>(expectedResult.at(i)));
Member

That is much better with the helper functions. (There are even cleaner ways with better error messages in GoogleTest, but this refactoring is nice because now all improvements can be applied locally!)

Member

@joka921 joka921 left a comment

Also a very small suggestion.

src/index/IndexImpl.Text.cpp (outdated, resolved)
@Flixtastic Flixtastic requested a review from joka921 January 10, 2025 18:00
@Flixtastic
Contributor Author

One possible solution to the current coverage problem is to create a file IndexImplHelpers.h and a corresponding .cpp, move the helper methods there, and test them separately. This would lead to even more references being passed to the functions.

I am currently unsure whether to do this. Alternatively, there may be another way to reduce the nesting at the places where the helper functions are now used.

Member

@joka921 joka921 left a comment

Only a very small thing, otherwise this now looks much cleaner.

@@ -53,8 +53,7 @@ cppcoro::generator<WordsFileLine> IndexImpl::wordsInTextRecords(
 std::string_view textView = text;
 textView = textView.substr(0, textView.rfind('"'));
 textView.remove_prefix(1);
-auto normalizedWords = tokenizeAndNormalizeText(textView, localeManager);
-for (auto word : normalizedWords) {
+for (auto word : tokenizeAndNormalizeText(textView, localeManager)) {
 WordsFileLine wordLine{word, false, contextId, 1};
Member

Do we benefit from a std::move(word) here?
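One way to reason about that question is to count copies directly. The sketch below uses a hypothetical `CopyCounter` type and a hypothetical `Line` aggregate as stand-ins; the real `WordsFileLine` fields and the actual type of `word` are not shown in this conversation, so both are assumptions. It only illustrates that brace-initializing an aggregate from an lvalue copies the member, while wrapping it in `std::move` does not.

```cpp
#include <string>
#include <utility>

// Global counter so the number of copies is observable from outside.
int g_copies = 0;

// Hypothetical string wrapper that records every copy construction.
struct CopyCounter {
  std::string value;
  explicit CopyCounter(std::string v) : value(std::move(v)) {}
  CopyCounter(const CopyCounter& other) : value(other.value) { ++g_copies; }
  CopyCounter(CopyCounter&& other) noexcept : value(std::move(other.value)) {}
};

// Hypothetical stand-in for the aggregate built in the loop; the field
// names and types are assumptions, not the real `WordsFileLine`.
struct Line {
  CopyCounter word;
  bool isEntity;
  int contextId;
  int score;
};

// Builds one `Line` from a local word and reports how many copies happened.
int copiesWhenBuildingLine(bool useMove) {
  g_copies = 0;
  CopyCounter word{std::string("example")};
  if (useMove) {
    Line line{std::move(word), false, 0, 1};  // Member is move-constructed.
    (void)line;
  } else {
    Line line{word, false, 0, 1};  // Member is copy-constructed.
    (void)line;
  }
  return g_copies;
}
```

Whether the saved copy matters in practice depends on the element type: for an owning string it avoids an allocation for longer words, while for a `std::string_view` or a very short string the difference is likely negligible. Because the loop variable in the diff is taken by value, moving from it would not affect the generator's own state.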

@Flixtastic Flixtastic requested a review from joka921 January 21, 2025 11:49
@joka921 joka921 changed the title from "Better parsing for the words- and docsfile" to "Refactor the parsing of the text index builder" Jan 22, 2025
@joka921 joka921 merged commit 3213257 into ad-freiburg:master Jan 22, 2025
23 of 24 checks passed
2 participants