Improve the README #1983

Closed
kelson42 opened this issue Feb 2, 2024 · 0 comments · Fixed by #1996

kelson42 commented Feb 2, 2024

Some general remarks:
    - Why is the layout gibberish in terms of line breaks? (Maybe my fault.)
    - "per default" reads like a literal translation from French; "by default" is probably more appropriate.
    - Other comments are inline, in bold italic underline.

> docker run --rm -it ghcr.io/openzim/mwoffliner:1.13.0 mwoffliner --help
starting redis-server in the background…
Create a fancy HTML dump of a Mediawiki instance in a ZIM file
Why "fancy"? Why "HTML dump"? 
  Usage: mwoffliner
  Example, as a system tool:
  mwoffliner --mwUrl=https://en.wikipedia.org/ -[email protected]
  Or, as a node script:
  node mwoffliner.js --mwUrl=https://en.wikipedia.org/ [email protected]
  Or, as a npm script: '
  npm run mwoffliner -- --mwUrl=https://en.wikipedia.org/ [email protected]

Options:
  --version                   Show version number                      [boolean]
  --help                      Show help                                [boolean]
  --mwUrl                     Mediawiki base URL.                     [required]
More precision would be welcome (should I include the /wiki/?)
  --adminEmail                Email of the mwoffliner user which will be put in
                              the HTTP user-agent string              [required]
What is it used for? Does the user need to exist in the Mediawiki instance?
  --articleList               List of articles to include. Can be a comma sepera
seperated => separated (typo)
                              ted list of titles or a local path or http(s) URL
                              to a file with one title (in UTF8) per line
  --articleListToIgnore       List of articles to ignore. Can be a comma seperat
seperated => separated (typo)
                              ed list of titles or a local path or http(s) URL t
                              o a file with one title (in UTF8) per line
  --customZimFavicon          Use this option to give a path to a PNG favicon, i
                              t will be used in place of the Mediawiki logo. Thi
                              s can be a local path or an HTTP(S) url
Favicons are no longer used, if I'm not mistaken. Is it used as an illustration? What is the expected resolution? Is it automatically scaled to fill both resolutions (48 and 96)?
  --customZimTitle            Allow to configure a custom ZIM file title.
  --customZimDescription      Allow to configure a custom ZIM file description.
                              Max length is 80 chars.
  --customZimLongDescription  Allow to configure a custom ZIM file long descript
                              ion. Max length is 4000 chars.
  --customZimTags             Allow to configure custom ZIM file tags (semi-colo
                              n separated).
  --customZimLanguage         Allow to configure a custom ISO639-3 content langu
                              age code.
  --customMainPage            Allow to configure a custom page as welcome page.
  --filenamePrefix            For the part of the ZIM filename which is before t
                              he format & date parts.
  --format                    Specify a flavour for the scraping. If missing, sc
                              rape all article contents. Each --format argument
                              will cause a new local file to be created but opti
                              ons can be combined. Supported options are:
                               * novid: no video & audio content
                               * nopic: no pictures (implies "novid")
                               * nopdf: no PDF files
                               * nodet: only the first/head paragraph (implies "novid")

                              Format names can also be aliased using a ":"
                              Example: "... --format=nopic:mini --format=novid,nopdf"
What is the format alias used for?
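
To make the question concrete, here is a minimal sketch of how I read the `--format` syntax from the help text alone: flags combined with ",", an optional alias after ":", and "novid" implied by "nopic" and "nodet". The `parse_format` function and `IMPLIES` table are hypothetical names for illustration, not mwoffliner's actual code, and the sketch treats the alias as a plain label because its real purpose is exactly what is unclear.

```python
# Hypothetical parser for one --format value; not mwoffliner's implementation.
# The implication table is copied from the help text ("implies novid").
IMPLIES = {"nopic": {"novid"}, "nodet": {"novid"}}


def parse_format(value):
    """Split 'flag1,flag2:alias' into (sorted flags incl. implied ones, alias or None)."""
    flags_part, _, alias = value.partition(":")
    flags = set()
    for flag in flags_part.split(","):
        flag = flag.strip()
        if flag:
            flags.add(flag)
            flags |= IMPLIES.get(flag, set())
    return sorted(flags), (alias or None)


if __name__ == "__main__":
    # The two values from the help text example; each would yield its own file.
    for value in ("nopic:mini", "novid,nopdf"):
        print(value, "->", parse_format(value))
```

If this reading is right, the help text should say what the alias ends up being used for.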
  --keepEmptyParagraphs       Keep all paragraphs, even empty ones.
  --mwWikiPath                Mediawiki wiki base path (per default "/wiki/")
  --mwApiPath                 Mediawiki API path (per default "/w/api.php")
  --mwRestApiPath             Mediawiki Rest API path (per default "/api/rest_v1
                              ")
Rest => REST
  --mwModulePath              Mediawiki module load path (per default "/w/load.p
                              hp")
Are we speaking about https://www.mediawiki.org/wiki/Manual:Load.php? If yes, I would suggest renaming it to "Mediawiki ResourceLoader path".
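
For the four path options, a short sketch of how I would expect them to combine with `--mwUrl`, assuming the paths are simply resolved against the base URL. That joining rule is my assumption, not something the help text states; `resolve_endpoints` and `DEFAULT_PATHS` are hypothetical names, and only the default values come from the help output.

```python
from urllib.parse import urljoin

# Defaults as printed in the help output.
DEFAULT_PATHS = {
    "mwWikiPath": "/wiki/",
    "mwApiPath": "/w/api.php",
    "mwRestApiPath": "/api/rest_v1",
    "mwModulePath": "/w/load.php",
}


def resolve_endpoints(mw_url, overrides=None):
    """Resolve each path option against the base URL (assumed behaviour)."""
    paths = {**DEFAULT_PATHS, **(overrides or {})}
    return {name: urljoin(mw_url, path) for name, path in paths.items()}


if __name__ == "__main__":
    for name, url in resolve_endpoints("https://en.wikipedia.org/").items():
        print(f"{name}: {url}")
```

With this joining rule a trailing /wiki/ on the base URL would be ignored, because the paths are absolute. Whether mwoffliner behaves this way is what the `--mwUrl` description should make explicit.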
  --mwDomain                  Mediawiki user domain (thought for private wikis)
  --mwUsername                Mediawiki username (thought for private wikis)
  --mwPassword                Mediawiki user password (thought for private wikis
                              )
  --minifyHtml                Try to reduce the size of the HTML
  --outputDirectory           Directory to write the downloaded content
  --publisher                 ZIM publisher meta data, per default 'Kiwix'
  --redis                     Redis path (redis:// URL or path to UNIX socket)
  --requestTimeout            Request timeout - in seconds(default value is 120
                              seconds)
  --resume                    Do not overwrite if ZIM file already created
It is not clear: will it restart from the last article processed?
  --speed                     Multiplicator for the number of parallel HTTP requ
                              ests on Parsoid backend (per default the number of
                               CPU cores). The default value is 1.
  --verbose                   Print information to the stdout if the level is "i
                              nfo" or "log", and to the stderr, if the level is
                              warn or error. The option can be empty or one of "
                              info", "log", "warn", "error", or "quiet". Option
                              with an empty value is equal to "info".The default
                               level is "error". If you choose the lower level t
                              hen you will see messages also from the more high
                              levels. For example, if you use warn then you will
                               see warnings and errors.
I absolutely don't get what goes to stdout, what goes to stderr, or what the default is. And why is it named "verbose"? Such flags are usually booleans, but here we can set a value as well?
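
My best reading of the `--verbose` description, written out as a sketch: levels ordered info < log < warn < error < quiet, a message printed when its level is at or above the configured one, "info" and "log" going to stdout and "warn" and "error" to stderr. The function names are hypothetical, and "quiet prints nothing" is my assumption since the help text does not say so.

```python
import sys

# Ordered from most to least verbose, in the order the help text lists them.
LEVELS = ["info", "log", "warn", "error", "quiet"]


def effective_level(option):
    """Map the --verbose value to a level: flag absent -> 'error', empty -> 'info'."""
    if option is None:  # flag not passed at all
        return "error"
    if option == "":  # flag passed without a value
        return "info"
    if option not in LEVELS:
        raise ValueError(f"unknown level: {option}")
    return option


def is_printed(message_level, option):
    """A message is shown when its level is at or above the configured level."""
    if message_level == "quiet":  # 'quiet' is a setting, not a message level
        return False
    return LEVELS.index(message_level) >= LEVELS.index(effective_level(option))


def stream_for(message_level):
    """'info' and 'log' go to stdout; 'warn' and 'error' go to stderr."""
    return sys.stdout if message_level in ("info", "log") else sys.stderr
```

If that is the intended behaviour, the description could be reduced to a small table of level, stream and default.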
  --withoutZimFullTextIndex   Don't include a fulltext search index to the ZIM
  --webp                      Convert all jpeg, png and gif images to webp forma
                              t
  --addNamespaces             Force additional namespace (comma separated number
                              s)
  --getCategories             [WIP] Download category pages
What does "WIP" mean (i.e. what works and what does not)?
  --osTmpDir                  Override default operating system temporary direct
                              ory path environment variable
  --customFlavour             A custom processor that can filter and process art
                              icles (see extensions/*.js)
Should it be a path to the custom processor JS file? (Not clear.)
  --optimisationCacheUrl      S3 url, including credentials and bucket name
Not clear; the description should also state that this is a cache, something like: "S3 URL to a bucket under which the scraper will cache i_dont_know_what; the URL must include credentials (keyId and secretAccessKey) as well as the bucket name", e.g. https://s3.myprovider.com/?keyId=THISISAKEYID&secretAccessKey=THISISASECRETKEY&bucketName=this-is-my-bucket
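
To illustrate the URL shape suggested above, a minimal sketch that splits such a URL into its parts with the Python standard library. The `parse_cache_url` helper and the returned key names are hypothetical; the parameter names (keyId, secretAccessKey, bucketName) are the ones from my example URL, and I am assuming they all travel in the query string.

```python
from urllib.parse import parse_qs, urlsplit

REQUIRED = ("keyId", "secretAccessKey", "bucketName")


def parse_cache_url(url):
    """Extract endpoint, credentials and bucket name from the cache URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # blank values are dropped, so they count as missing
    missing = [name for name in REQUIRED if name not in query]
    if missing:
        raise ValueError("missing query parameter(s): " + ", ".join(missing))
    return {
        "endpoint": f"{parts.scheme}://{parts.netloc}",
        "key_id": query["keyId"][0],
        "secret_access_key": query["secretAccessKey"][0],
        "bucket_name": query["bucketName"][0],
    }


if __name__ == "__main__":
    example = (
        "https://s3.myprovider.com/?keyId=THISISAKEYID"
        "&secretAccessKey=THISISASECRETKEY&bucketName=this-is-my-bucket"
    )
    print(parse_cache_url(example))
```

An explicit error for a missing parameter, as in the sketch, is the kind of feedback the option description should let users anticipate.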
@kelson42 kelson42 added this to the 1.14.0 milestone Feb 2, 2024
@kelson42 kelson42 self-assigned this Feb 2, 2024