Search
The search functionality is provided by Algolia and is known as Algolia DocSearch. We are running DocSearch v3.
- "mdanalysis" dashboard (application ID Y8HJT3NO22)
- crawler configuration
We are using the hosted search option where Algolia runs the docsearch-scraper.
- docsearch v2 (legacy) docs and FAQ (v2): we were still on v2 while Algolia migrated everyone from v2 to v3
  - update: we migrated to v3 in November 2021
- docsearch v3 docs
- docsearch Discourse forum
- Experience and advice for dealing with indexing of code in software documentation: should one index `pre` and `code` tags? The FAQ advice is: don't, but you could add selectors for valuable occurrences (although we cannot easily do that in Sphinx-generated docs).
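If we ever did want to index selected code elements, the change would be additional entries in the `selectors` block. The fragment below is only a sketch, not our actual configuration; the `dl.class dt` selector is an assumption about how (older) Sphinx marks up API signatures and would need to be verified against the built HTML:

```json
"selectors": {
  "text": "[itemprop='articleBody'] > .section p, [itemprop='articleBody'] > .section dl.class dt"
}
```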
One can also run the scraper oneself and then serve the resulting index; this is also the recommended way to debug. If we do this, here are links to get started:
For details, look through the issue comments
- add search box #73
- restrict DocSearch to relevant parts of the site #77
- sitemapindex #79
- update to v3 #211
For v3, use the crawler interface https://crawler.algolia.com/
To change the configuration, make a PR against https://github.com/algolia/docsearch-configs/blob/master/configs/mdanalysis.json. The syntax is explained at https://docsearch.algolia.com/docs/config-file/
In order for anything to be indexed, it must match one of the CSS selectors:
- levels are mapped to heading tags
- text is mapped to p, li, and similar tags
- examine the produced documentation with the Firefox Web Developer Tool or similar to see which CSS elements apply to the content that should be indexed
Example selectors:

```json
"selectors": {
  "lvl0": "[itemprop='articleBody'] > .section h1, .page h1, .post h1, .body > .section h1",
  "lvl1": "[itemprop='articleBody'] > .section h2, .page h2, .post h2, .body > .section h2",
  "lvl2": "[itemprop='articleBody'] > .section h3, .page h3, .post h3, .body > .section h3",
  "lvl3": "[itemprop='articleBody'] > .section h4, .page h4, .post h4, .body > .section h4",
  "lvl4": "[itemprop='articleBody'] > .section h5, .page h5, .post h5, .body > .section h5",
  "text": "[itemprop='articleBody'] > .section p, .page p, .post p, .body > .section p, [itemprop='articleBody'] > .section li, .page li, .post li, .body > .section li"
},
```
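To see which elements such selectors would pick up, one can inspect the generated HTML programmatically instead of clicking through the browser dev tools. The following is a minimal sketch using only the Python standard library; it approximates a selector like `.section h1` (a real check would run a proper CSS selector library over the actual built pages, and the HTML snippet here is made up for illustration):

```python
from html.parser import HTMLParser

class SectionHeadingFinder(HTMLParser):
    """Collect heading text that appears inside a div with class 'section'.

    Rough stand-in for selectors like '.section h1'; it deliberately
    ignores other selector parts such as [itemprop='articleBody'].
    """
    def __init__(self):
        super().__init__()
        self.section_depth = 0    # nesting level of .section divs
        self.current_heading = None
        self.headings = []        # (tag, text) pairs found inside sections

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "section" in classes:
            self.section_depth += 1
        elif tag in ("h1", "h2", "h3", "h4", "h5") and self.section_depth > 0:
            self.current_heading = tag

    def handle_endtag(self, tag):
        # simplification: assumes all divs in the snippet are .section divs
        if tag == "div" and self.section_depth > 0:
            self.section_depth -= 1
        elif tag == self.current_heading:
            self.current_heading = None

    def handle_data(self, data):
        if self.current_heading and data.strip():
            self.headings.append((self.current_heading, data.strip()))

html_snippet = """
<div class="section">
  <h1>Overview</h1>
  <p>Intro text.</p>
  <div class="section"><h2>Installation</h2></div>
</div>
<h1>Outside any section</h1>
"""
parser = SectionHeadingFinder()
parser.feed(html_snippet)
print(parser.headings)  # headings inside .section divs only
```

The heading outside any `.section` div is not collected, which mirrors why content outside the selector scope never makes it into the index.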
Snapshot of mdanalysis.json:
```json
{
  "index_name": "mdanalysis",
  "sitemap_urls": [
    "https://www.mdanalysis.org/sitemapindex.xml"
  ],
  "start_urls": [
    "https://docs.mdanalysis.org",
    "https://userguide.mdanalysis.org",
    "https://www.mdanalysis.org"
  ],
  "stop_urls": [
    "https://www.mdanalysis.org/.*?//.*?",
    "https://www.mdanalysis.org/blog",
    "https://www.mdanalysis.org/mdanalysis",
    "https://www.mdanalysis.org/docs",
    "https://docs.mdanalysis.org/stable/.*",
    "https://docs.mdanalysis.org/.*index.html$",
    "https://userguide.mdanalysis.org/stable/.*",
    "https://userguide.mdanalysis.org/.*-dev.*/.*",
    "https://www.mdanalysis.org/.*index.html$",
    "\\/_"
  ],
  "selectors": {
    "lvl0": "[itemprop='articleBody'] > .section h1, .page h1, .post h1, .body > .section h1",
    "lvl1": "[itemprop='articleBody'] > .section h2, .page h2, .post h2, .body > .section h2",
    "lvl2": "[itemprop='articleBody'] > .section h3, .page h3, .post h3, .body > .section h3",
    "lvl3": "[itemprop='articleBody'] > .section h4, .page h4, .post h4, .body > .section h4",
    "lvl4": "[itemprop='articleBody'] > .section h5, .page h5, .post h5, .body > .section h5",
    "text": "[itemprop='articleBody'] > .section p, .page p, .post p, .body > .section p, [itemprop='articleBody'] > .section li, .page li, .post li, .body > .section li, [itemprop='articleBody'] > .section dt, .body > .section dt"
  },
  "conversation_id": [
    "569445928"
  ],
  "nb_hits": 18529
}
```
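The `stop_urls` entries are regular expressions that exclude URLs from crawling. A small sketch of that filtering (patterns copied from the snapshot above; the real crawler's matching semantics are assumed here to behave like `re.search`):

```python
import re

# stop_urls copied from the mdanalysis.json snapshot above
STOP_URLS = [
    r"https://www.mdanalysis.org/.*?//.*?",
    r"https://www.mdanalysis.org/blog",
    r"https://www.mdanalysis.org/mdanalysis",
    r"https://www.mdanalysis.org/docs",
    r"https://docs.mdanalysis.org/stable/.*",
    r"https://docs.mdanalysis.org/.*index.html$",
    r"https://userguide.mdanalysis.org/stable/.*",
    r"https://userguide.mdanalysis.org/.*-dev.*/.*",
    r"https://www.mdanalysis.org/.*index.html$",
    r"\/_",
]

def is_stopped(url):
    """Return True if any stop_url pattern matches the URL
    (assuming search-anywhere semantics)."""
    return any(re.search(pattern, url) for pattern in STOP_URLS)

# /stable/ pages are excluded so that only the versioned canonical
# pages get indexed; a regular versioned page passes through
print(is_stopped("https://docs.mdanalysis.org/stable/index.html"))
print(is_stopped("https://docs.mdanalysis.org/dev/analysis.html"))
```

This makes it easy to test why a particular page was (or was not) skipped before touching the live configuration.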
- sitemap.org protocol definition (defines sitemap)
- validators
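For reference, a sitemapindex is just a small XML file that points at the per-site sitemaps, as defined by the sitemap.org protocol. A minimal sketch (the child sitemap URLs here are illustrative, not necessarily the exact files we serve):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.mdanalysis.org/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://userguide.mdanalysis.org/sitemap.xml</loc>
  </sitemap>
</sitemapindex>
```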
When opening the configuration PR, please:
- provide enough information so that others can review your pull request.
- double check the dedicated documentation available here
- try to implement the recommendations
- please feature a sitemap, it will be the most complete source of truth for our crawling.
- Allow edits from maintainer
Run a local version of the scraper that has index submission to Algolia disabled (to avoid running into the limits of the free plan). For example, install https://github.com/orbeckst/docsearch-scraper/tree/dryrun
Have the config file handy (e.g., by cloning https://github.com/algolia/docsearch-configs).
Run the scraper and check the output:

```sh
./docsearch run ../docsearch-configs/configs/mdanalysis.json 2>&1 | tee RUN.log
less RUN.log
```
```
> DocSearch: https://www.mdanalysis.org 0 records)
> Ignored: from start url https://userguide.mdanalysis.org/stable/index.html
> Ignored: from start url https://docs.mdanalysis.org/stable/index.html
> DocSearch: https://www.mdanalysis.org/pages/privacy/ 12 records)
> DocSearch: https://www.mdanalysis.org/pages/used-by/ 30 records)
...
...
> DocSearch: https://www.mdanalysis.org/2015/12/15/The_benefit_of_social_coding/ 6 records)
> DocSearch: https://www.mdanalysis.org/distopia/search.html 0 records)
> Ignored from sitemap: https://www.mdanalysis.org/distopia/genindex.html
> Ignored from sitemap: https://www.mdanalysis.org/distopia/index.html
> DocSearch: https://www.mdanalysis.org/distopia/api/vector_triple.html 0 records)
> DocSearch: https://www.mdanalysis.org/distopia/api/helper_functions.html 0 records)
> DocSearch: https://www.mdanalysis.org/distopia/api/distopia.html 0 records)
> DocSearch: https://www.mdanalysis.org/distopia/building_distopia.html 0 records)
```
- lines with `N records)` where N > 0: this is desired and shows that the scraper collected data records for the index
- lines with `0 records)`: the rules do not correctly match elements on the page, so nothing was collected for scraping
- `Ignored: from start url`: the scraper started by following links from a start_url but then hit a stop_url
- `Ignored from sitemap:`: the scraper started from the sitemap (which is good!) but then hit a stop_url
- missing pages (e.g., nothing from the User Guide): check the sitemap file!
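Checking the sitemap can itself be scripted. The sketch below lists the `<loc>` entries of a sitemapindex with the Python standard library; the XML content is a made-up stand-in for what https://www.mdanalysis.org/sitemapindex.xml serves (in practice one would fetch and feed in the real file):

```python
import xml.etree.ElementTree as ET

# stand-in for the content of https://www.mdanalysis.org/sitemapindex.xml
SITEMAPINDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.mdanalysis.org/sitemap.xml</loc></sitemap>
  <sitemap><loc>https://userguide.mdanalysis.org/sitemap.xml</loc></sitemap>
</sitemapindex>"""

# sitemap.org elements live in this XML namespace
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAPINDEX)
locs = [loc.text for loc in root.findall("sm:sitemap/sm:loc", NS)]
for url in locs:
    print(url)
```

If a whole site (such as the User Guide) is missing from the scraper output, its sitemap should be the first thing to inspect this way.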