Oliver Beckstein edited this page Nov 22, 2021 · 20 revisions

The search functionality is provided by Algolia through its DocSearch service. We are running DocSearch v3.

Configuration

Documentation

Hosted search

We are using the hosted search option where Algolia runs the docsearch-scraper.

Specific issues

docsearch-scraper

You can also run the scraper yourself and then serve the resulting index. Running it yourself is also the recommended approach for debugging. If we do this, here are links to get started:

Relevant issues

For details, look through the comments on these issues:

  • add search box #73
  • restrict DocSearch to relevant parts of the site #77
  • sitemapindex #79
  • update to v3 #211

Configuration

For v3, use the crawler interface https://crawler.algolia.com/

To change the configuration, make a PR against https://github.com/algolia/docsearch-configs/blob/master/configs/mdanalysis.json. The syntax is explained at https://docsearch.algolia.com/docs/config-file/

Selectors

For anything to be indexed, it must match one of the CSS selectors:

  • levels are mapped to heading tags
  • text is mapped to p, li, and similar tags
  • inspect the rendered documentation with the Firefox Web Developer Tools (or similar) to see which elements and CSS classes wrap the content that should be indexed
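As a rough illustration of the level/text mapping described above, the following sketch classifies HTML elements by tag name alone. It is a simplified stand-in, not the scraper's own logic: the real selectors also require a container such as `.body > .section`, and that ancestor check is left out here. The names `TAG_TO_LEVEL`, `LevelCollector`, and `collect_levels`, as well as the sample HTML, are made up for this example.

```python
from html.parser import HTMLParser

# Simplified stand-in for the selectors: tag name -> DocSearch record level.
# The real selectors also require a container such as ".body > .section";
# that ancestor check is deliberately left out of this sketch.
TAG_TO_LEVEL = {
    "h1": "lvl0", "h2": "lvl1", "h3": "lvl2", "h4": "lvl3", "h5": "lvl4",
    "p": "text", "li": "text",
}


class LevelCollector(HTMLParser):
    """Collect (level, text) pairs for tags listed in TAG_TO_LEVEL."""

    def __init__(self):
        super().__init__()
        self.records = []    # finished (level, text) pairs
        self._level = None   # level of the element currently open, if any
        self._chunks = []    # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_LEVEL:
            self._level = TAG_TO_LEVEL[tag]
            self._chunks = []

    def handle_data(self, data):
        if self._level is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Close the record only when the end tag belongs to the open level.
        if self._level is not None and TAG_TO_LEVEL.get(tag) == self._level:
            text = "".join(self._chunks).strip()
            if text:
                self.records.append((self._level, text))
            self._level = None


def collect_levels(html):
    """Return the (level, text) pairs found in an HTML string."""
    parser = LevelCollector()
    parser.feed(html)
    parser.close()
    return parser.records


# Made-up sample page; the <span> is not covered by any level or text rule.
sample = """
<div class="body"><div class="section">
  <h1>Trajectory analysis</h1>
  <p>Load a <code>Universe</code> first.</p>
  <h2>RMSD</h2>
  <ul><li>Select atoms</li></ul>
  <span>not indexed</span>
</div></div>
"""
records = collect_levels(sample)
for level, text in records:
    print(level, text)
```

Content that ends up in none of the levels (like the `<span>` here) is the kind of element to look for when a page yields no records.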

Example selectors

"selectors": {
    "lvl0": "[itemprop='articleBody'] > .section h1, .page h1, .post h1, .body > .section h1",
    "lvl1": "[itemprop='articleBody'] > .section h2, .page h2, .post h2, .body > .section h2",
    "lvl2": "[itemprop='articleBody'] > .section h3, .page h3, .post h3, .body > .section h3",
    "lvl3": "[itemprop='articleBody'] > .section h4, .page h4, .post h4, .body > .section h4",
    "lvl4": "[itemprop='articleBody'] > .section h5, .page h5, .post h5, .body > .section h5",
    "text": "[itemprop='articleBody'] > .section p, .page p, .post p, .body > .section p, [itemprop='articleBody'] > .section li, .page li, .post li, .body > .section li"
  },

mdanalysis.json

Snapshot of mdanalysis.json

{
  "index_name": "mdanalysis",
  "sitemap_urls": [
    "https://www.mdanalysis.org/sitemapindex.xml"
  ],
  "start_urls": [
    "https://docs.mdanalysis.org",
    "https://userguide.mdanalysis.org",
    "https://www.mdanalysis.org"
  ],
  "stop_urls": [
    "https://www.mdanalysis.org/.*?//.*?",
    "https://www.mdanalysis.org/blog",
    "https://www.mdanalysis.org/mdanalysis",
    "https://www.mdanalysis.org/docs",
    "https://docs.mdanalysis.org/stable/.*",
    "https://docs.mdanalysis.org/.*index.html$",
    "https://userguide.mdanalysis.org/stable/.*",
    "https://userguide.mdanalysis.org/.*-dev.*/.*",
    "https://www.mdanalysis.org/.*index.html$",
    "\\/_"
  ],
  "selectors": {
    "lvl0": "[itemprop='articleBody'] > .section h1, .page h1, .post h1, .body > .section h1",
    "lvl1": "[itemprop='articleBody'] > .section h2, .page h2, .post h2, .body > .section h2",
    "lvl2": "[itemprop='articleBody'] > .section h3, .page h3, .post h3, .body > .section h3",
    "lvl3": "[itemprop='articleBody'] > .section h4, .page h4, .post h4, .body > .section h4",
    "lvl4": "[itemprop='articleBody'] > .section h5, .page h5, .post h5, .body > .section h5",
    "text": "[itemprop='articleBody'] > .section p, .page p, .post p, .body > .section p, [itemprop='articleBody'] > .section li, .page li, .post li, .body > .section li, [itemprop='articleBody'] > .section dt, .body > .section dt"
  },
  "conversation_id": [
    "569445928"
  ],
  "nb_hits": 18529
}
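To predict whether a given page will be skipped, the stop_urls patterns from the snapshot above can be tried against a URL by hand. The sketch below assumes that each pattern is applied as an unanchored regular-expression search; that matches the example scraper output further down, but it is an assumption about the scraper, not something stated in the config. `first_stop_match` is a hypothetical helper name.

```python
import re

# Patterns copied from the "stop_urls" list in the mdanalysis.json snapshot.
# In the JSON file the last entry is written "\\/_", i.e. the regex \/_ .
STOP_URLS = [
    r"https://www.mdanalysis.org/.*?//.*?",
    r"https://www.mdanalysis.org/blog",
    r"https://www.mdanalysis.org/mdanalysis",
    r"https://www.mdanalysis.org/docs",
    r"https://docs.mdanalysis.org/stable/.*",
    r"https://docs.mdanalysis.org/.*index.html$",
    r"https://userguide.mdanalysis.org/stable/.*",
    r"https://userguide.mdanalysis.org/.*-dev.*/.*",
    r"https://www.mdanalysis.org/.*index.html$",
    r"\/_",
]


def first_stop_match(url, patterns=STOP_URLS):
    """Return the first pattern that matches url, or None if no pattern does."""
    for pattern in patterns:
        # Assumption: patterns are unanchored regex searches.
        if re.search(pattern, url):
            return pattern
    return None


# URLs taken from the example scraper output in the debugging section.
for url in [
    "https://docs.mdanalysis.org/stable/index.html",
    "https://www.mdanalysis.org/distopia/genindex.html",
    "https://www.mdanalysis.org/pages/privacy/",
]:
    print(url, "->", first_stop_match(url) or "not stopped")
```

Under this assumption the first two URLs are stopped and the third is crawled, which agrees with the "Ignored" and "records" lines in the example output.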

Working with sitemaps

When making a PR

Please:

Debugging search (v2)

Run a local version of the scraper that has index submission to Algolia disabled (to avoid running into the limits of the free plan). For example, install https://github.com/orbeckst/docsearch-scraper/tree/dryrun

Have the config file handy (e.g., by cloning https://github.com/algolia/docsearch-configs).

Run the scraper and check the output:

./docsearch run ../docsearch-configs/configs/mdanalysis.json 2>&1 | tee RUN.log
less RUN.log

Example output

> DocSearch: https://www.mdanalysis.org 0 records)
> Ignored: from start url https://userguide.mdanalysis.org/stable/index.html
> Ignored: from start url https://docs.mdanalysis.org/stable/index.html
> DocSearch: https://www.mdanalysis.org/pages/privacy/ 12 records)
> DocSearch: https://www.mdanalysis.org/pages/used-by/ 30 records)
...
...
> DocSearch: https://www.mdanalysis.org/2015/12/15/The_benefit_of_social_coding/ 6 records)
> DocSearch: https://www.mdanalysis.org/distopia/search.html 0 records)
> Ignored from sitemap: https://www.mdanalysis.org/distopia/genindex.html
> Ignored from sitemap: https://www.mdanalysis.org/distopia/index.html
> DocSearch: https://www.mdanalysis.org/distopia/api/vector_triple.html 0 records)
> DocSearch: https://www.mdanalysis.org/distopia/api/helper_functions.html 0 records)
> DocSearch: https://www.mdanalysis.org/distopia/api/distopia.html 0 records)
> DocSearch: https://www.mdanalysis.org/distopia/building_distopia.html 0 records)
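For a long RUN.log it can help to tally the line types instead of paging through the file. The sketch below assumes the log lines look exactly like the example output above (including the unmatched closing parenthesis after "records"); it also strips ANSI colour codes in case the log contains them. `summarize` and the counter keys are names invented for this example.

```python
import re
from collections import Counter

ANSI = re.compile(r"\x1b\[[0-9;]*m")  # colour codes, in case the log has them
SCRAPED = re.compile(r"^> DocSearch: (\S+) (\d+) records\)")
IGNORED = re.compile(r"^> Ignored:? from (start url|sitemap):? (\S+)")


def summarize(lines):
    """Count scraped and ignored pages; also return the pages with 0 records."""
    counts = Counter()
    empty_pages = []
    for raw in lines:
        line = ANSI.sub("", raw).strip()
        match = SCRAPED.match(line)
        if match:
            url, n_records = match.group(1), int(match.group(2))
            if n_records > 0:
                counts["indexed"] += 1
            else:
                counts["zero_records"] += 1
                empty_pages.append(url)
            continue
        match = IGNORED.match(line)
        if match:
            # "start url" -> "ignored_start_url", "sitemap" -> "ignored_sitemap"
            counts["ignored_" + match.group(1).replace(" ", "_")] += 1
    return counts, empty_pages


# A few lines copied from the example output above.
sample_log = [
    "> DocSearch: https://www.mdanalysis.org 0 records)",
    "> Ignored: from start url https://userguide.mdanalysis.org/stable/index.html",
    "> Ignored: from start url https://docs.mdanalysis.org/stable/index.html",
    "> DocSearch: https://www.mdanalysis.org/pages/privacy/ 12 records)",
    "> DocSearch: https://www.mdanalysis.org/pages/used-by/ 30 records)",
    "> Ignored from sitemap: https://www.mdanalysis.org/distopia/genindex.html",
    "> DocSearch: https://www.mdanalysis.org/distopia/building_distopia.html 0 records)",
]
counts, empty_pages = summarize(sample_log)
print(dict(counts))
print(empty_pages)
```

To run it on a real log, pass the open file instead of the sample list, e.g. `summarize(open("RUN.log"))`.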

Interpretation of results

  • lines with N records where N > 0: this is desired and shows that the scraper collected data records for the index
  • lines with 0 records: the rules do not seem to correctly catch elements on the page for scraping
  • Ignored: from start url: the scraper followed a start URL but the page matched a stop_url
  • Ignored from sitemap: the scraper picked up the URL from the sitemap (which is good!) but the page matched a stop_url
  • Missing pages (e.g., nothing on the User Guide): check the sitemap file
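One way to check the sitemap file is to list its `<loc>` entries and see whether every site that should be searchable appears. The sketch below parses a sitemap or sitemap index given as a string, using the standard sitemaps.org namespace. The sample XML is invented for illustration and is not the actual content of sitemapindex.xml; `sitemap_locations` and `missing_sites` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap and sitemap index files (sitemaps.org protocol).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locations(xml_text):
    """Return all <loc> URLs from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS) if loc.text]


def missing_sites(xml_text, expected_prefixes):
    """Return the expected URL prefixes that no <loc> entry starts with."""
    locations = sitemap_locations(xml_text)
    return [
        prefix
        for prefix in expected_prefixes
        if not any(loc.startswith(prefix) for loc in locations)
    ]


# Invented sample: a sitemap index that forgets the User Guide.
sample_index = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.mdanalysis.org/sitemap.xml</loc></sitemap>
  <sitemap><loc>https://docs.mdanalysis.org/sitemap.xml</loc></sitemap>
</sitemapindex>
"""

# The expected prefixes are the start_urls from mdanalysis.json.
missing = missing_sites(
    sample_index,
    [
        "https://www.mdanalysis.org",
        "https://docs.mdanalysis.org",
        "https://userguide.mdanalysis.org",
    ],
)
print(missing)
```

For the live file, the XML text could be fetched first (for example with `urllib.request.urlopen`) and passed to the same functions.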