
Add new Kiwix download mirror

benoit74 edited this page Aug 17, 2023 · 8 revisions

Kiwix has many mirrors available to ease retrieval of ZIMs, nightly builds and other artifacts (see https://download.kiwix.org/README for details).

The current list of mirrors is available at https://download.kiwix.org/mirrors.html

MirrorBrain is used to manage these mirrors.

This document describes how to add a new mirror to the list of existing ones.

It does not describe how to create a mirror, only how to add it to this list.

Prerequisites

The following information is needed from the mirror owner in order to add a new mirror:

  • operator name + URL (who has to be credited for making this mirror available)
  • URLs:
    • rsync: mandatory to allow MirrorBrain to check mirror status (but could be opened only to our IP)
    • http and/or ftp: must be public (sic)
  • location (country code + continent code, could be inferred from server IP) of the mirror
  • admin email address + name (to contact in case of issue)

Tests

Before configuring anything, confirm that:

  • you can communicate with the admin email
  • operator URL is working
  • rsync URL is OK: rsync -avn rsync://xxxx
  • HTTP/FTP URLs are working
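The rsync and HTTP checks above can be sketched as two small shell functions. This is only an illustration: the function names, the RSYNC and CURL override variables and the 30-second timeouts are assumptions, not part of any Kiwix tooling.

```shell
# Pre-flight checks for a candidate mirror (illustrative sketch).
# RSYNC and CURL can be overridden, e.g. to point at a different binary.
RSYNC=${RSYNC:-rsync}
CURL=${CURL:-curl}

# List the content of the rsync module without transferring anything
# (-n is a dry run); give up after 30 seconds without I/O.
check_rsync() {
    "$RSYNC" -avn --timeout=30 "$1"
}

# Fetch only the headers of an HTTP(S) URL and fail on HTTP errors.
check_url() {
    "$CURL" --fail --silent --show-error --head --max-time 30 "$1"
}
```

Usage would look like check_rsync rsync://xxxx followed by check_url with the public HTTP URL given by the mirror owner; a non-zero exit status means the check failed.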

Score / Priority

Internally at Kiwix, we have to decide on the score we give to the mirror. The score works as a reversed priority: mirrors with a lower score have lower priority. The following rule of thumb is used:

  • 100: servers with limited bandwidth (based on some tests on our side; we prefer to give very low priority to slow mirrors to avoid users raising issues for something we have no control over)
  • 500: Kiwix master mirror (since it is used by other mirrors to retrieve data, we prefer that end-users do not rely on this one)
  • 3000: good mirrors
  • 5000: very good mirrors

MirrorBrain database update

We first have to add the mirror to the MirrorBrain (MB) database.

Open a shell on the apache container of the mirrorbrain-web-deployment pod located in the zim namespace.
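One possible way to open that shell with kubectl is sketched below. It assumes the pod is managed by a Deployment named mirrorbrain-web-deployment (the page only gives the pod name); the mb_exec helper and the KUBECTL override variable are hypothetical names used for illustration.

```shell
# KUBECTL can be overridden, e.g. to use a wrapper or another binary.
KUBECTL=${KUBECTL:-kubectl}

# Run a command in the apache container, in the zim namespace.
# Assumption: the pod belongs to a Deployment named
# mirrorbrain-web-deployment, so "deploy/<name>" resolves to one of its pods.
mb_exec() {
    "$KUBECTL" -n zim exec -it deploy/mirrorbrain-web-deployment -c apache -- "$@"
}
```

With this helper, mb_exec bash opens an interactive shell and mb_exec mb list runs a single command without staying in the container.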

Install vim: apt update && apt install -y vim

Some useful commands:

  • mb list => show the list of configured mirror identifiers
  • mb show <identifier> => show one mirror configuration
  • mb edit <identifier> => edit one mirror configuration

You first need to decide on the mirror identifier; usually, we use the mirror hostname.

Then, you should create the new mirror with mb new (spoiler: read till the end, this won't work):

mb new <identifier> -c <country_code> -r <continent_code> -H <http_url> -R <rsync_url> -e <admin_email> -a <admin_name> --operator-url=<operator_url> --operator-name=<operator_name>

E.g.

mb new mirror-sites-fr.mblibrary.info -c FR -r EU -H https://mirror-sites-fr.mblibrary.info/mirror-sites/download.kiwix.org/ -R rsync://mirror-sites-fr.mblibrary.info/download.kiwix.org/ -e [email protected] -a "Dr. Mamdouh Barakat" --operator-url="https://www.mbgroup.global/" --operator-name="MB Group"

Unfortunately, this command is broken at the stage where it tries to retrieve the country code, continent code and coordinates based on the mirror IP. You can run the command above to identify the Python file that has to be updated manually to return dummy values instead of retrieving them from the GeoIP databases (normally /usr/local/lib/python2.7/dist-packages/mb/geoip.py).

Update this file manually to return dummy values ("fr", "eu", and "(0.000,0.000)" below):

  • lookup_country_code function will return "fr"
  • lookup_region_code function will return "eu"
  • lookup_coordinates function will return (0.000,0.000)

Run the mb new ... command as planned above. In theory there is no need to pass the -c and -r arguments since they are overridden, but they are mandatory.

Edit the configuration with mb edit <identifier> to:

  • set country to its real value (must be lower-case)
  • set region to its real value (continent, must be lower-case)
  • set enabled to True
  • set statusBaseurl to True

Theoretically, you then have to run mb test to confirm the mirror configuration, but this command fails with HTTPS mirrors.

Update DB script

A cronjob, mb-update-db, updates the MirrorBrain DB with the latest mirror status multiple times per day.

This cronjob uses the following script: https://github.com/kiwix/container-images/blob/main/mirrorbrain/bin/update_mirrorbrain_db.sh

This script must be updated to scan new mirrors (either for ALLDIRS if all directories are mirrored, or ZIMDIRS/WMDIRS if only a portion of the data is mirrored).

Beware that adding new mirrors increases the cronjob duration, so its schedule might need to be adapted (to be discussed). Do not worry too much: the cronjob configuration prevents two jobs from running in parallel, since scanMirror operations cannot be run in parallel.

Push your modifications to the main branch and wait for CI completion (to rebuild MB image).

Relaunch

Relaunch the mirrorbrain-web-deployment pod so that it uses the latest image (this also discards any local modifications you have made, which is a good thing).

This is mandatory because the cronjobs do not pull the image (imagePullPolicy is IfNotPresent); only mirrorbrain-web-deployment always pulls the new image (imagePullPolicy is Always). This is done on purpose to ensure that all pods use the same image, since they all run on the same node.
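A minimal sketch of the relaunch, assuming the pod is managed by a Deployment named mirrorbrain-web-deployment in the zim namespace (deleting the pod by hand is another option). The relaunch_mirrorbrain function and the KUBECTL override variable are illustrative names.

```shell
KUBECTL=${KUBECTL:-kubectl}

# Trigger a restart of the deployment (the new pod pulls the image again),
# then wait until the rollout has completed.
relaunch_mirrorbrain() {
    "$KUBECTL" -n zim rollout restart deployment/mirrorbrain-web-deployment &&
    "$KUBECTL" -n zim rollout status deployment/mirrorbrain-web-deployment
}
```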

Monitor

Wait for a full run of the mb-update-db cronjob after the image update; check the logs for information regarding the scan of the new mirror.
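The log check could be done along these lines. The sketch assumes that the cronjob runs in the zim namespace and that the scan logs mention the mirror identifier; neither is stated on this page, and the function names are hypothetical.

```shell
KUBECTL=${KUBECTL:-kubectl}

# List jobs, oldest first, to find the latest job created by mb-update-db.
list_jobs() {
    "$KUBECTL" -n zim get jobs --sort-by=.metadata.creationTimestamp
}

# Print the log lines of a given job that mention a given mirror.
# $1: job name (as shown by list_jobs), $2: mirror identifier
job_mirror_logs() {
    "$KUBECTL" -n zim logs "job/$1" | grep -F -- "$2"
}
```

For example, job_mirror_logs <job_name> <identifier> prints nothing and returns a non-zero status when the mirror never appears in the logs.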

Once the scan is complete, check the list of mirrors for about 3 files in various folders (e.g. zims, nightly builds, ...). For every file:

Check that https://mirror.download.kiwix.org/mirrors.html includes your mirror (beware that it will take time for this file to be mirrored to other mirrors).
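The per-file check can be scripted as sketched below. It relies on the MirrorBrain convention of appending .mirrorlist to a file URL to get the list of mirrors serving that file; that this is enabled on download.kiwix.org is an assumption here, and the file_served_by helper is a hypothetical name.

```shell
CURL=${CURL:-curl}

# Succeed if the mirror list page of a file mentions the mirror hostname.
# $1: URL of the file on the load-balancer, $2: mirror hostname
file_served_by() {
    "$CURL" --fail --silent --location "$1.mirrorlist" | grep -qF -- "$2"
}
```

Calling file_served_by <file_url> <mirror_hostname> for each sampled file returns 0 when the new mirror is listed for that file.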

Open a private access to our rsync master server

By default, people are using rsync on download.kiwix.org as mirroring source, but this mirror allows only 3 concurrent connections since we do not want to be overwhelmed.

For official mirrors, we open access to a private mirror master.download.kiwix.org as mirroring source, whitelisted based on the target mirror IP, with a reserved seat for every official mirror.

To open access to the new official mirror, edit the file https://github.com/kiwix/k8s/blob/main/zim/rsyncd/rsyncd.yaml:

See https://github.com/kiwix/k8s/commit/fb48b67a4eb4498566471a711d2898e8eb84a042 for a sample change.

Deploy this change manually, first shutting down rsyncd and then starting it again:

  • set the number of replicas to 0 for the deployment (https://github.com/kiwix/k8s/blob/fb48b67a4eb4498566471a711d2898e8eb84a042/zim/rsyncd/rsyncd.yaml#L86); this is mandatory because rsyncd uses a node port, so we cannot have one container terminating while another one is being created
  • apply the file: kubectl apply -f zim/rsyncd/rsyncd.yaml (this will also update the configmap with the configuration changes made above)
  • set the number of replicas to 1 for the deployment
  • apply the file: kubectl apply -f zim/rsyncd/rsyncd.yaml
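The four steps above can be sketched as follows, to be run from a checkout of the k8s repository. The helper names and the MANIFEST and KUBECTL variables are illustrative; set_replicas rewrites every "replicas:" line of the file, which assumes the manifest contains a single deployment.

```shell
KUBECTL=${KUBECTL:-kubectl}
MANIFEST=${MANIFEST:-zim/rsyncd/rsyncd.yaml}

# Rewrite every "replicas: N" line of the manifest to the given count.
# Goes through a temporary file to stay portable across sed variants.
set_replicas() {
    tmp=$(mktemp) &&
    sed "s/^\([[:space:]]*replicas:\)[[:space:]]*[0-9][0-9]*/\1 $1/" "$MANIFEST" > "$tmp" &&
    cat "$tmp" > "$MANIFEST" &&
    rm -f "$tmp"
}

# Apply the manifest (deployment and configmap).
apply_rsyncd() {
    "$KUBECTL" apply -f "$MANIFEST"
}
```

The sequence is then set_replicas 0, apply_rsyncd, set_replicas 1, apply_rsyncd. Given the node port constraint, it is probably worth confirming that the old pod is gone before the second apply.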

Do not forget to push your changes to GitHub.

You then have to inform the mirror owner that they have been granted access to master.download.kiwix.org:

Email template

Dear XXX,

I am pleased to inform you that your mirror is now part of the Kiwix downloads load-balancer. You should already be getting some traffic.

As mentioned before, official mirrors get a reserved slot on the rsync server (anonymous access is limited and frequently clogged). You are thus invited to change your rsync configuration to point to the master.download.kiwix.org module instead of download.kiwix.org. This module will only work from your IP (xx.xxx.xxx.xxx).

From:

rsync -vzrlptD --delete master.download.kiwix.org::download.kiwix.org/zim/ ./zim/

To:

rsync -vzrlptD --delete master.download.kiwix.org::master.download.kiwix.org/zim/ ./zim/

Please let us know once everything is fine on your end and please do not hesitate to contact us should you notice any unexpected behavior.

All the best,