-
-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zimcheck: better and faster redirect loop check #312
zimcheck: better and faster redirect loop check #312
Conversation
CI fails because of the dependence on openzim/libzim#716 |
openzim/libzim#716 was merged and participated in the nightly CI build. I reran the CI here today. Macos picked up the change and passed, but Linux failed since it refers to the xenial version of deps, which are no longer being built by the kiwix-build CI. |
@legoktm Something for you? |
d87dd35
to
ddf3cab
Compare
Codecov Report
@@ Coverage Diff @@
## master #312 +/- ##
==========================================
+ Coverage 43.54% 44.25% +0.71%
==========================================
Files 23 23
Lines 2338 2368 +30
Branches 1304 1318 +14
==========================================
+ Hits 1018 1048 +30
Misses 1304 1304
Partials 16 16
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
@veloman-yunkan Two small "problems" are reported by Codefactor. |
ddf3cab
to
ffd1714
Compare
Not anymore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could do even better on the algorithm.
Your algorithm loop twice on the redirection table to resolve the redirect. Once in detectLoopStatus
to know if a redirection chain is looping or not. And a second one in resolveLoopStatus
to mark the redirection chain as looping or not.
If we not mark the redirection loop with LOOP
/NONLOOP
, we could mark the loopStatus only with a integer indicating the number of the resolveLoopStatus
run. And then only store if the nth run of detectLoopStatus
resolve to a loop or not. When reporting, we loop on the loopStatus for what run has covered it and look if the run is loop or not.
Something like (not tested):
class RedirectionTable
{
private: // types
enum LoopStatus
{
UNKNOWN,
NONLOOP,
BASE_RUN_NUMBER
};
public: // functions
RedirectionTable() {
// NONLOOP (1) is not looping
runStatus.push_back(false);
}
void addRedirectionEntry(zim::entry_index_type targetEntryIndex)
{
redirTable.push_back(targetEntryIndex);
loopStatus.push_back(LoopStatus::UNKNOWN);
}
void addItem()
{
redirTable.push_back(redirTable.size());
loopStatus.push_back(LoopStatus::NONLOOP);
}
size_t size() const { return redirTable.size(); }
bool isInRedirectionLoop(zim::entry_index_type i)
{
if ( loopStatus[i] == UNKNOWN )
{
resolveLoopStatus(i);
}
return runStatus[loopStatus[i]-1];
}
private: // functions
void resolveLoopStatus(zim::entry_index_type i)
{
auto runNumber = self.nextRunNumber++;
while (true)
{
if ( loopStatus[i] != LoopStatus::UNKNOWN ) {
if (loopStatus[i] == runNumber) {
// The current run is a loop
runStatus.push_back(true);
} else {
// The current run is the same status than the first know status found.
runStatus.push_back(runStatus[loopStatus[i]-1]);
};
return;
}
loopStatus[i] = runNumber;
i = redirTable[i];
}
}
private: // data
std::vector<zim::entry_index_type> redirTable;
std::vector<uint32_t> loopStatus;
std::vector<runStatus> runStatus;
uint32_t nextRunNumber = BASE_RUN_NUMBER;
};
} // unnamed namespace
void test_redirect_loop(const zim::Archive& archive, ErrorLogger& reporter) {
reporter.infoMsg("[INFO] Checking for redirect loops...");
RedirectionTable redirTable;
for(const auto& entry: archive.iterByPath())
{
if ( entry.isRedirect() )
redirTable.addRedirectionEntry(entry.getRedirectEntryIndex());
else
redirTable.addItem();
}
for(zim::entry_index_type i = 0; i < redirTable.size(); ++i )
{
if(redirTable.isInRedirectionLoop(i)){
const auto entry = archive.getEntryByPath(i);
reporter.addMsg(MsgId::REDIRECT_LOOP, {{"entry_path", entry.getPath()}});
}
}
src/zimcheck/checks.cpp
Outdated
std::deque<zim::entry_index_type> redirTable; | ||
std::deque<LoopStatus> loopStatus; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why using std::deque
? std::vector
already have push_back
and we never front_pop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of std::vector
's higher memory usage requirements during growth. When you incrementally populate an std::vector
with N>>1
elements its high-watermark memory usage is at about 3/2N (~N/2 of previous storage + ~N new storage during the last reallocation). std::deque
typically being implemented as a vector of dynamically allocated fixed size blocks is devoid of that disadvantage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could reserve the space once at initialization as we know the number of entries to add
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but then we will need to separately track the count of items already added. What are the cons of std::deque
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but then we will need to separately track the count of items already added.
With std::vector::reserve
we can pre-allocate the memory for the vector without changing the semantic.
Or I misunderstand your argument...
What are the cons of std::deque?
The semantic. std::deque
is a queue. Which means we use to to pass elements in order (add with push_back
, read with pop_front
). Here we have the semantic of a growing vector, so we should use a vector.
We may have the hack of using deque for memory allocation improvement. But as we can reserve the memory with
redirTable.reserve(archive.getAllEntryCount())
, there is no real reason to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With
std::vector::reserve
we can pre-allocate the memory for the vector without changing the semantic.
Or I misunderstand your argument...
Indeed. Totally forgot about reserve()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
You try to save runtime by using more memory ( |
I have no opinion on this specific algorithm question because I don't have the numbers. But, regarding zimcheck, the speed is definitly the priority point to take in consideration. That said, the problem reported by @veloman-yunkan is real: the memory usage should be "reasonable" and should not scale proportionaly to the number of errors/warnings or even size of the ZIM. To conclude here, if we can do significantly faster without using significantly more memory, then this is welcome IMO. Otherwise, probably not. |
My idea will have a memory usage growing proportionally to the number of entry in the ZIM file. (8*nbEntries+1*nbRedirectionChains). It is more than @veloman-yunkan solution (5*nbEntries) but it is in the proportion of what we already do in libzim:
On top of that, we have to remember that the performance improvement is theoretical for now, we must do some testing to be sure that one solution is more performant than the other. |
This one can be halved.
During redirect loop detection the working set is limited to the data maintained in 1 the latter, BTW, could be reduced to |
I agree that you current algorithm is already a big improvement compared to the previous one. Let's move on with it. We may come back on this when/if we want to improve it further more. Let's fix the discussion about std::deque/std::vector and we will be good. |
src/zimcheck/checks.cpp
Outdated
explicit RedirectionTable(size_t entryCount) | ||
{ | ||
loopStatus.reserve(entryCount); | ||
redirTable.reserve(entryCount); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent indentation
Please fix the indentation and rebase-fixup. |
With this implementation deep chains of redirections are not mis-reported as loops. Besides it is faster compared to the old implementation for the following reasons: - Redirection info is read from every entry/dirent exactly once; all subsequent processing is with minimal in-memory data required for the task. - When a standalone loop redirection is performed (for example, `zimcheck -L` with no other option) the auxiliary effient-order-to-by-path-order conversion table is not computed. In this case, in addition to shorter runtime, the memory usage is lower, too.
c0755b7
to
ddb1cf3
Compare
The Packages CI flows fail since they don't see the change brought by openzim/libzim#716. |
@kelson42 Should I merge? Or we better resolve the issue with the Packages builds first? |
Fixes #161
Depends on openzim/libzim#716
With this implementation deep chains of redirections are not mis-reported as loops. Besides it is faster compared to the old
implementation for the following reasons:
Redirection info is read from every entry/dirent exactly once; all subsequent processing is with minimal in-memory data required for the task.
When a standalone loop redirection is performed (for example,
zimcheck -L
with no other option) the auxiliary effient-order-to-by-path-order conversion table is not computed. In this case, in addition to shorter runtime, the memory usage is lower, too.Benchmark data for
zimcheck -L
stackoverflow.com_en_all_2022-05.zim
: