zimcheck: better and faster redirect loop check #312

veloman-yunkan · 2022-07-27T15:34:26Z

Fixes #161

With this implementation deep chains of redirections are not mis-reported as loops. Besides it is faster compared to the old
implementation for the following reasons:

Redirection info is read from every entry/dirent exactly once; all subsequent processing is with minimal in-memory data required for the task.
When a standalone loop redirection is performed (for example, zimcheck -L with no other option) the auxiliary effient-order-to-by-path-order conversion table is not computed. In this case, in addition to shorter runtime, the memory usage is lower, too.

Benchmark data for zimcheck -L stackoverflow.com_en_all_2022-05.zim:

	Run time	Memory usage
Old implementation (cold start):	631 seconds	1097 MiB
Old implementation (warm start):	500 seconds	1097 MiB
This PR (cold start):	287 seconds	927 MiB
This PR (warm start):	275 seconds	927 MiB

veloman-yunkan · 2022-07-27T15:36:37Z

CI fails because of the dependence on openzim/libzim#716

veloman-yunkan · 2022-08-03T08:19:03Z

openzim/libzim#716 was merged and participated in the nightly CI build. I reran the CI here today. Macos picked up the change and passed, but Linux failed since it refers to the xenial version of deps, which are no longer being built by the kiwix-build CI.

kelson42 · 2022-08-03T09:18:52Z

@legoktm Something for you?

codecov · 2022-08-04T14:51:30Z

Codecov Report

Merging #312 (c0755b7) into master (e7853d1) will increase coverage by 0.71%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #312      +/-   ##
==========================================
+ Coverage   43.54%   44.25%   +0.71%     
==========================================
  Files          23       23              
  Lines        2338     2368      +30     
  Branches     1304     1318      +14     
==========================================
+ Hits         1018     1048      +30     
  Misses       1304     1304              
  Partials       16       16

Impacted Files	Coverage Δ
src/zimcheck/checks.cpp	`96.57% <100.00%> (+0.32%)`	⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

kelson42 · 2022-08-06T14:18:05Z

@veloman-yunkan Two small "problems" are reported by Codefactor.

veloman-yunkan · 2022-08-06T17:24:53Z

@veloman-yunkan Two small "problems" are reported by Codefactor.

Not anymore

mgautierfr

I wonder if we could do even better on the algorithm.
Your algorithm loop twice on the redirection table to resolve the redirect. Once in detectLoopStatus to know if a redirection chain is looping or not. And a second one in resolveLoopStatus to mark the redirection chain as looping or not.

If we not mark the redirection loop with LOOP/NONLOOP, we could mark the loopStatus only with a integer indicating the number of the resolveLoopStatus run. And then only store if the nth run of detectLoopStatus resolve to a loop or not. When reporting, we loop on the loopStatus for what run has covered it and look if the run is loop or not.

Something like (not tested):

class RedirectionTable
{
private: // types
    enum LoopStatus
    {
        UNKNOWN,
        NONLOOP,
        BASE_RUN_NUMBER
    };


public: // functions
    RedirectionTable() {
       // NONLOOP (1) is not looping
       runStatus.push_back(false);
    }
    void addRedirectionEntry(zim::entry_index_type targetEntryIndex)
    {
        redirTable.push_back(targetEntryIndex);
        loopStatus.push_back(LoopStatus::UNKNOWN);
    }

    void addItem()
    {
        redirTable.push_back(redirTable.size());
        loopStatus.push_back(LoopStatus::NONLOOP);
    }

    size_t size() const { return redirTable.size(); }

    bool isInRedirectionLoop(zim::entry_index_type i)
    {
        if ( loopStatus[i] == UNKNOWN )
        {
            resolveLoopStatus(i);
        }

        return runStatus[loopStatus[i]-1];
    }

private: // functions
    void resolveLoopStatus(zim::entry_index_type i)
    {
        auto runNumber = self.nextRunNumber++;
        while (true)
        {
            if ( loopStatus[i] != LoopStatus::UNKNOWN ) {
                if (loopStatus[i] == runNumber) {
                  // The current run is a loop
                  runStatus.push_back(true);
                } else {
                  // The current run is the same status than the first know status found.
                  runStatus.push_back(runStatus[loopStatus[i]-1]);
                };
                return;
            }
            loopStatus[i] = runNumber;
            i = redirTable[i];
        }
    }

private: // data
    std::vector<zim::entry_index_type> redirTable;
    std::vector<uint32_t> loopStatus;
    std::vector<runStatus> runStatus;
    uint32_t nextRunNumber = BASE_RUN_NUMBER;
};

} // unnamed namespace

void test_redirect_loop(const zim::Archive& archive, ErrorLogger& reporter) {
    reporter.infoMsg("[INFO] Checking for redirect loops...");

    RedirectionTable redirTable;
    for(const auto& entry: archive.iterByPath())
    {
        if ( entry.isRedirect() )
            redirTable.addRedirectionEntry(entry.getRedirectEntryIndex());
        else
            redirTable.addItem();
    }

    for(zim::entry_index_type i = 0; i < redirTable.size(); ++i )
    {
        if(redirTable.isInRedirectionLoop(i)){
            const auto entry = archive.getEntryByPath(i);
            reporter.addMsg(MsgId::REDIRECT_LOOP, {{"entry_path", entry.getPath()}});
        }
    }

mgautierfr · 2022-08-09T12:58:48Z

src/zimcheck/checks.cpp

+    std::deque<zim::entry_index_type> redirTable;
+    std::deque<LoopStatus> loopStatus;


Why using std::deque ? std::vector already have push_back and we never front_pop

Because of std::vector's higher memory usage requirements during growth. When you incrementally populate an std::vector with N>>1 elements its high-watermark memory usage is at about 3/2N (~N/2 of previous storage + ~N new storage during the last reallocation). std::deque typically being implemented as a vector of dynamically allocated fixed size blocks is devoid of that disadvantage.

We could reserve the space once at initialization as we know the number of entries to add

Yes, but then we will need to separately track the count of items already added. What are the cons of std::deque?

Yes, but then we will need to separately track the count of items already added.

With std::vector::reserve we can pre-allocate the memory for the vector without changing the semantic.
Or I misunderstand your argument...

What are the cons of std::deque?

The semantic. std::deque is a queue. Which means we use to to pass elements in order (add with push_back, read with pop_front). Here we have the semantic of a growing vector, so we should use a vector.
We may have the hack of using deque for memory allocation improvement. But as we can reserve the memory with
redirTable.reserve(archive.getAllEntryCount()), there is no real reason to use it.

With std::vector::reserve we can pre-allocate the memory for the vector without changing the semantic.
Or I misunderstand your argument...

Indeed. Totally forgot about reserve()

veloman-yunkan · 2022-08-09T14:17:34Z

I wonder if we could do even better on the algorithm.

You try to save runtime by using more memory (uint32_t instead of uint_8). Sometimes, higher memory usage can result in slower execution. So, though your approach will definitely work, it's not a fact that it will be faster. And I think that memory usage is more important than the runtime - if a program is simply slow you can wait till it completes, whereas if it requires too much memory it simply doesn't run.

kelson42 · 2022-08-10T06:46:30Z

I have no opinion on this specific algorithm question because I don't have the numbers. But, regarding zimcheck, the speed is definitly the priority point to take in consideration. That said, the problem reported by @veloman-yunkan is real: the memory usage should be "reasonable" and should not scale proportionaly to the number of errors/warnings or even size of the ZIM. To conclude here, if we can do significantly faster without using significantly more memory, then this is welcome IMO. Otherwise, probably not.

mgautierfr · 2022-08-10T07:08:15Z

That said, the problem reported by @veloman-yunkan is real: the memory usage should be "reasonable" and should not scale proportionally to the number of errors/warnings or even size of the ZIM

My idea will have a memory usage growing proportionally to the number of entry in the ZIM file. (8*nbEntries+1*nbRedirectionChains). It is more than @veloman-yunkan solution (5*nbEntries) but it is in the proportion of what we already do in libzim:

We mmap the urlPtrList (8*nbEntries)
We mmap the titlePtrList (4*nbEntries)
For fast iteration, we also create a vector (8*nbEntries) to store the entries order

On top of that, we have to remember that the performance improvement is theoretical for now, we must do some testing to be sure that one solution is more performant than the other.

veloman-yunkan · 2022-08-11T10:07:50Z

My idea will have a memory usage growing proportionally to the number of entry in the ZIM file. (8nbEntries+1nbRedirectionChains). It is more than @veloman-yunkan solution (5*nbEntries) but it is in the proportion of what we already do in libzim:
* We mmap the urlPtrList (8*nbEntries)

* We mmap the titlePtrList (4*nbEntries)

mmap-ed memory shouldn't be accounted in this analysis as 100% memory consumption since it can be easily discarded by the OS if there is not enough RAM for the real working set size.

* For fast iteration, we also create a vector (8*nbEntries) to store the entries order

This one can be halved.

On top of that, we have to remember that the performance improvement is theoretical for now, we must do some testing to be sure that one solution is more performant than the other.

During redirect loop detection the working set is limited to the data maintained in RedirectionTable (i.e. (8*nbEntries+1*nbRedirectionChains) vs (5*nbEntries)¹. The size of that data, the size of the CPU data cache, the latency of the memory bus and the access patterns (coming from the redirections) will affect the actual performance. I don't think that spending time on evaluating the performance on multiple ZIM files and diverse hardware is justified in the pursuit of (some potential) extra speed-up in one of the cheapest checks of zimcheck (at the cost of sacrificing the memory usage improvement brought by this PR).

¹ the latter, BTW, could be reduced to (4*nbEntries) by not maintaining a separate loopStatus table and reusing redirTable for that purpose, but I didn't like that hack because of the reduced clarity.

mgautierfr · 2022-08-11T14:06:08Z

I agree that you current algorithm is already a big improvement compared to the previous one. Let's move on with it. We may come back on this when/if we want to improve it further more.

Let's fix the discussion about std::deque/std::vector and we will be good.

mgautierfr · 2022-08-11T14:49:11Z

src/zimcheck/checks.cpp

+    explicit RedirectionTable(size_t entryCount)
+    {
+      loopStatus.reserve(entryCount);
+      redirTable.reserve(entryCount);
+    }


Inconsistent indentation

mgautierfr · 2022-08-11T14:50:16Z

Please fix the indentation and rebase-fixup.
You can directly merge the PR, no need for another review round.

With this implementation deep chains of redirections are not mis-reported as loops. Besides it is faster compared to the old implementation for the following reasons: - Redirection info is read from every entry/dirent exactly once; all subsequent processing is with minimal in-memory data required for the task. - When a standalone loop redirection is performed (for example, `zimcheck -L` with no other option) the auxiliary effient-order-to-by-path-order conversion table is not computed. In this case, in addition to shorter runtime, the memory usage is lower, too.

veloman-yunkan · 2022-08-12T08:59:35Z

The Packages CI flows fail since they don't see the change brought by openzim/libzim#716.

veloman-yunkan · 2022-08-12T09:03:49Z

@kelson42 Should I merge? Or we better resolve the issue with the Packages builds first?

veloman-yunkan mentioned this pull request Jul 27, 2022

Added Entry::getRedirectEntryIndex() openzim/libzim#716

Merged

veloman-yunkan requested a review from mgautierfr July 27, 2022 15:36

kelson42 force-pushed the zimcheck_faster_standalone_redirect_loop_check branch from d87dd35 to ddf3cab Compare August 4, 2022 14:48

veloman-yunkan force-pushed the zimcheck_faster_standalone_redirect_loop_check branch from ddf3cab to ffd1714 Compare August 6, 2022 17:22

mgautierfr reviewed Aug 9, 2022

View reviewed changes

veloman-yunkan requested a review from mgautierfr August 11, 2022 14:20

mgautierfr reviewed Aug 11, 2022

View reviewed changes

mgautierfr self-requested a review August 11, 2022 14:51

mgautierfr approved these changes Aug 11, 2022

View reviewed changes

veloman-yunkan force-pushed the zimcheck_faster_standalone_redirect_loop_check branch from c0755b7 to ddb1cf3 Compare August 12, 2022 08:49

kelson42 merged commit 43a80f4 into master Aug 12, 2022

kelson42 deleted the zimcheck_faster_standalone_redirect_loop_check branch August 12, 2022 09:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

zimcheck: better and faster redirect loop check #312

zimcheck: better and faster redirect loop check #312

veloman-yunkan commented Jul 27, 2022 •

edited by kelson42

Loading

veloman-yunkan commented Jul 27, 2022

veloman-yunkan commented Aug 3, 2022

kelson42 commented Aug 3, 2022

codecov bot commented Aug 4, 2022 •

edited

Loading

kelson42 commented Aug 6, 2022

veloman-yunkan commented Aug 6, 2022

mgautierfr left a comment

mgautierfr Aug 9, 2022

veloman-yunkan Aug 9, 2022

mgautierfr Aug 10, 2022

veloman-yunkan Aug 11, 2022

mgautierfr Aug 11, 2022

veloman-yunkan Aug 11, 2022

veloman-yunkan Aug 11, 2022

veloman-yunkan commented Aug 9, 2022

kelson42 commented Aug 10, 2022

mgautierfr commented Aug 10, 2022

veloman-yunkan commented Aug 11, 2022

mgautierfr commented Aug 11, 2022

mgautierfr Aug 11, 2022

mgautierfr commented Aug 11, 2022

veloman-yunkan commented Aug 12, 2022

veloman-yunkan commented Aug 12, 2022

		std::deque<zim::entry_index_type> redirTable;
		std::deque<LoopStatus> loopStatus;

zimcheck: better and faster redirect loop check #312

zimcheck: better and faster redirect loop check #312

Conversation

veloman-yunkan commented Jul 27, 2022 • edited by kelson42 Loading

veloman-yunkan commented Jul 27, 2022

veloman-yunkan commented Aug 3, 2022

kelson42 commented Aug 3, 2022

codecov bot commented Aug 4, 2022 • edited Loading

Codecov Report

kelson42 commented Aug 6, 2022

veloman-yunkan commented Aug 6, 2022

mgautierfr left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

veloman-yunkan commented Aug 9, 2022

kelson42 commented Aug 10, 2022

mgautierfr commented Aug 10, 2022

veloman-yunkan commented Aug 11, 2022

mgautierfr commented Aug 11, 2022

Choose a reason for hiding this comment

mgautierfr commented Aug 11, 2022

veloman-yunkan commented Aug 12, 2022

veloman-yunkan commented Aug 12, 2022

veloman-yunkan commented Jul 27, 2022 •

edited by kelson42

Loading

codecov bot commented Aug 4, 2022 •

edited

Loading