Some general remarks:
- why is the layout garbled in terms of line breaks? (maybe my fault)
- "per default" reads like a literal translation from French; "by default" is probably more appropriate
- other comments are inline in bold italic underline
> docker run --rm -it ghcr.io/openzim/mwoffliner:1.13.0 mwoffliner --help
starting redis-server in the background…
Create a fancy HTML dump of a Mediawiki instance in a ZIM file
Why "fancy"? Why "HTML dump"?
Usage: mwoffliner
Example, as a system tool:
mwoffliner --mwUrl=https://en.wikipedia.org/ -
[email protected]
Or, as a node script:
node mwoffliner.js --mwUrl=https://en.wikipedia.org/ [email protected]
Or, as a npm script: '
npm run mwoffliner -- --mwUrl=https://en.wikipedia.org/ [email protected]
Options:
--version Show version number [boolean]
--help Show help [boolean]
--mwUrl Mediawiki base URL. [required]
More detail would be welcome (should I include the /wiki/ path?)
--adminEmail Email of the mwoffliner user which will be put in
the HTTP user-agent string [required]
What is it used for? Does the user need to exist in the Mediawiki instance?
--articleList List of articles to include. Can be a comma seperated
list of titles or a local path or http(s) URL
to a file with one title (in UTF8) per line
seperated => separated (typo)
--articleListToIgnore List of articles to ignore. Can be a comma seperated
list of titles or a local path or http(s) URL
to a file with one title (in UTF8) per line
seperated => separated (typo)
--customZimFavicon Use this option to give a path to a PNG favicon,
it will be used in place of the Mediawiki logo.
This can be a local path or an HTTP(S) url
Favicons are not used anymore if I'm not mistaken, is it used as an illustration? What is the expected resolution? Is it automatically scaled to fill both resolutions (48 and 96)?
--customZimTitle Allow to configure a custom ZIM file title.
--customZimDescription Allow to configure a custom ZIM file description.
Max length is 80 chars.
--customZimLongDescription Allow to configure a custom ZIM file long description.
Max length is 4000 chars.
--customZimTags Allow to configure custom ZIM file tags (semi-colon
separated).
--customZimLanguage Allow to configure a custom ISO639-3 content language
code.
--customMainPage Allow to configure a custom page as welcome page.
--filenamePrefix For the part of the ZIM filename which is before
the format & date parts.
--format Specify a flavour for the scraping. If missing,
scrape all article contents. Each --format argument
will cause a new local file to be created but
options can be combined. Supported options are:
* novid: no video & audio content
* nopic: no pictures (implies "novid")
* nopdf: no PDF files
* nodet: only the first/head paragraph (implies "novid")
Format names can also be aliased using a ":"
Example: "... --format=nopic:mini --format=novid,nopdf"
What is the format alias used for?
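To make the question concrete, here is a minimal Python sketch of how I read the `--format` syntax from the help text above: commas combine flavours for one output file, and an optional ":" suffix attaches an alias. `parse_format` is a hypothetical helper, not mwoffliner code; it assumes the alias applies to the whole comma-separated combination, and it deliberately does nothing with the alias because the help does not say what it is for.

```python
# Flavour names taken from the "Supported options" list in the help text.
KNOWN_FLAVOURS = {"novid", "nopic", "nopdf", "nodet"}


def parse_format(value):
    """Split one --format value into (flavours, alias).

    Assumption: everything after the first ':' is the alias for the
    whole combination; alias is None when no ':' is present.
    """
    spec, _, alias = value.partition(":")
    flavours = [name for name in spec.split(",") if name]
    unknown = [name for name in flavours if name not in KNOWN_FLAVOURS]
    if unknown:
        raise ValueError(f"unsupported flavour(s): {unknown}")
    return flavours, alias or None


# The two values from the help example, one output file each.
for arg in ["nopic:mini", "novid,nopdf"]:
    print(arg, "->", parse_format(arg))
```

If this reading is right, the help could simply state where the alias ends up.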
--keepEmptyParagraphs Keep all paragraphs, even empty ones.
--mwWikiPath Mediawiki wiki base path (per default "/wiki/")
--mwApiPath Mediawiki API path (per default "/w/api.php")
--mwRestApiPath Mediawiki Rest API path (per default "/api/rest_v1")
Rest => REST
--mwModulePath Mediawiki module load path (per default "/w/load.php")
Are we speaking about https://www.mediawiki.org/wiki/Manual:Load.php? If yes, I would suggest renaming it to "Mediawiki ResourceLoader path"
--mwDomain Mediawiki user domain (thought for private wikis)
--mwUsername Mediawiki username (thought for private wikis)
--mwPassword Mediawiki user password (thought for private wikis)
--minifyHtml Try to reduce the size of the HTML
--outputDirectory Directory to write the downloaded content
--publisher ZIM publisher meta data, per default 'Kiwix'
--redis Redis path (redis:// URL or path to UNIX socket)
--requestTimeout Request timeout - in seconds(default value is 120 seconds)
--resume Do not overwrite if ZIM file already created
It is not clear: will it restart from the last article processed?
--speed Multiplicator for the number of parallel HTTP requests
on Parsoid backend (per default the number of
CPU cores). The default value is 1.
--verbose Print information to the stdout if the level is
"info" or "log", and to the stderr, if the level is
warn or error. The option can be empty or one of
"info", "log", "warn", "error", or "quiet". Option
with an empty value is equal to "info".The default
level is "error". If you choose the lower level
then you will see messages also from the more high
levels. For example, if you use warn then you will
see warnings and errors.
I really don't understand what goes to stdout, what goes to stderr, and what the default is. And why is it named "verbose"? Such flags are usually booleans, but here we can set a value as well?
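For what it's worth, this is my best reading of the description above, written as a small Python sketch. `resolve_level` and `stream_for` are hypothetical names, not mwoffliner functions, and the ordering info < log < warn < error is my assumption based on the order in which the help lists the levels.

```python
# Assumed ordering, least to most severe; "quiet" silences everything.
LEVELS = ["info", "log", "warn", "error", "quiet"]


def resolve_level(value):
    """Map the raw option to a level.

    value is None when --verbose is absent and "" when it is given
    without a value, as described in the help text.
    """
    if value is None:
        return "error"  # documented default level
    if value == "":
        return "info"   # empty value is equal to "info"
    if value not in LEVELS:
        raise ValueError(f"unknown level: {value}")
    return value


def stream_for(message_level, chosen):
    """Return "stdout", "stderr", or None when the message is hidden."""
    if chosen == "quiet" or LEVELS.index(message_level) < LEVELS.index(chosen):
        return None
    return "stdout" if message_level in ("info", "log") else "stderr"


for chosen in ["info", "warn", "quiet"]:
    shown = {lvl: stream_for(lvl, chosen) for lvl in LEVELS[:4]}
    print(chosen, shown)
```

If that matches the intended behaviour, a short table in the help (level, stream, shown by default) would be much easier to read than the current paragraph.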
--withoutZimFullTextIndex Don't include a fulltext search index to the ZIM
--webp Convert all jpeg, png and gif images to webp format
--addNamespaces Force additional namespace (comma separated numbers)
--getCategories [WIP] Download category pages
What does "WIP" mean here (i.e. what works and what does not)?
--osTmpDir Override default operating system temporary directory
path environment variable
--customFlavour A custom processor that can filter and process articles
(see extensions/*.js)
Should it be a path to the custom processor JS? (not clear)
--optimisationCacheUrl S3 url, including credentials and bucket name
Not clear: the description should also state that this is a cache, something like "S3 URL of a bucket in which the scraper will cache i_dont_know_what; the URL must include credentials (keyId and secretAccessKey) as well as the bucket name, e.g. https://s3.myprovider.com/?keyId=THISISAKEYID&secretAccessKey=THISISASECRETKEY&bucketName=this-is-my-bucket".
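As an illustration of the URL shape suggested in the comment above, here is a minimal Python sketch that splits such a URL into its parts. The query parameter names (`keyId`, `secretAccessKey`, `bucketName`) come from my example URL and are an assumption about what the option expects; `parse_cache_url` is a hypothetical helper, not part of mwoffliner.

```python
from urllib.parse import parse_qs, urlsplit

# Parameter names assumed from the example URL, not confirmed by the help.
REQUIRED = ("keyId", "secretAccessKey", "bucketName")


def parse_cache_url(url):
    """Extract endpoint, credentials and bucket name from the URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # blank values are dropped by default
    missing = [key for key in REQUIRED if key not in query]
    if missing:
        raise ValueError(f"missing query parameter(s): {missing}")
    config = {key: query[key][0] for key in REQUIRED}
    config["endpoint"] = f"{parts.scheme}://{parts.netloc}"
    return config


example = (
    "https://s3.myprovider.com/"
    "?keyId=THISISAKEYID"
    "&secretAccessKey=THISISASECRETKEY"
    "&bucketName=this-is-my-bucket"
)
print(parse_cache_url(example))
```

Listing the required parameters in the help text, the way `REQUIRED` does here, would already answer most of my question.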