Document Elasticsearch index integration #3

Open
sharun-s opened this issue Jun 24, 2018 · 1 comment
@sharun-s
Owner

To support more advanced search using more indexes than just the current title/url index, as discussed in kiwix/kiwix-js#290, three steps are required and need to be documented properly:

  1. Get the exported Cirrus index dump that corresponds to the local archive from https://dumps.wikimedia.org/other/cirrussearch/current/

  2. Import it into a local, barebones Elasticsearch instance (single shard, no replication)

  3. Add a new Advanced search page that's hooked up to the Elasticsearch REST APIs

@sharun-s sharun-s self-assigned this Jun 24, 2018
@sharun-s
Owner Author

sharun-s commented Jun 25, 2018

Installation:
Check that Java is installed and JAVA_HOME is set.
Install the Elasticsearch version that Wikipedia uses, not the latest. Version can be found here -
tar -xvf elasticsearch-5.5.2.tar.gz
Download an index from here
Note: there are two indexes per archive, content and general. Content holds the article index, while general holds the indexed talk, template, and user content.
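
The Java check from the first installation step can be scripted before anything is unpacked. A minimal sketch; `check_java` is a hypothetical helper name, not something shipped with Elasticsearch:

```shell
# check_java: hypothetical helper, not part of Elasticsearch.
# Succeeds only if JAVA_HOME is set and contains an executable bin/java.
check_java() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no executable java under $JAVA_HOME/bin" >&2
        return 1
    fi
    echo "using $JAVA_HOME/bin/java"
}
```

Running `check_java && tar -xvf elasticsearch-5.5.2.tar.gz` stops early if the prerequisite is missing.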

Install plugins - analysis-icu, search-extra [Unclear if this step is required]
./bin/elasticsearch-plugin install <plugin-name>
Start Elasticsearch from the decompressed directory
./bin/elasticsearch
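
Elasticsearch takes a while to come up, so the following steps can fail if run immediately after starting it. A rough sketch of a wait loop, assuming the node listens on the default localhost:9200; `wait_for_es` is a hypothetical helper name:

```shell
# wait_for_es: hypothetical helper that polls the root endpoint until the
# node answers or the retry budget runs out.
# Usage: wait_for_es [host:port] [retries]
wait_for_es() {
    es_host="${1:-localhost:9200}"
    retries="${2:-30}"
    while [ "$retries" -gt 0 ]; do
        # -s silences progress output, -f makes curl fail on HTTP errors
        if curl -sf "http://$es_host/" > /dev/null; then
            echo "elasticsearch is up on $es_host"
            return 0
        fi
        retries=$((retries - 1))
        sleep 1
    done
    echo "elasticsearch did not answer on $es_host" >&2
    return 1
}
```

Call `wait_for_es` in a second terminal (or after starting the node in the background) before creating the index.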

Set up the index to use a single shard with no replication
Create a settings.json file, or pass the JSON string directly on the command line. A file is better, as the settings string can get quite long -
{"settings":{"index":{"number_of_shards":1, "number_of_replicas":0}}}
curl -XPUT localhost:9200/<indexname> -H 'Content-Type:application/json' -d @settings.json
Check that the index has been created
curl -XGET localhost:9200/_cat/indices?v
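
The settings file can also be generated rather than typed by hand. A small sketch, with `write_settings` as a hypothetical helper name and the shard and replica counts as parameters:

```shell
# write_settings: hypothetical helper that prints the index settings JSON
# for a given shard and replica count (defaults match the single-node setup).
write_settings() {
    shards="${1:-1}"
    replicas="${2:-0}"
    printf '{"settings":{"index":{"number_of_shards":%d,"number_of_replicas":%d}}}\n' \
        "$shards" "$replicas"
}

# Typical use (not executed here, it needs a running node):
#   write_settings 1 0 > settings.json
#   curl -XPUT "localhost:9200/<indexname>" \
#        -H 'Content-Type: application/json' -d @settings.json
```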

Load the dump [example uses the wikiquote content index]
zcat enwikiquote-xvy-cirrussearch-content.json.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/<indexname>/_bulk --data-binary @- > /dev/null'
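
The `-L 2` in the command above matters because the bulk format comes in line pairs: an action line followed by the document it applies to. Splitting a chunk in the middle of a pair breaks the import. A sketch of a sanity check that can be run on a sample before loading; `check_bulk_pairs` is a hypothetical helper, and it assumes the action lines start with `{"index"`:

```shell
# check_bulk_pairs: hypothetical sanity check for bulk-format input on stdin.
# Assumes every odd line is an action line starting with {"index" and every
# even line is the document that belongs to it.
check_bulk_pairs() {
    awk '
        BEGIN { bad = 0 }
        NR % 2 == 1 && index($0, "{\"index\"") != 1 {
            printf "line %d is not an action line\n", NR
            bad = 1
            exit
        }
        END {
            if (!bad && NR % 2 != 0) {
                printf "odd number of lines (%d)\n", NR
                bad = 1
            }
            exit bad
        }
    '
}
```

For example, `zcat enwikiquote-xvy-cirrussearch-content.json.gz | head -n 2000 | check_bulk_pairs` checks the first 1000 pairs and exits non-zero on the first mismatch.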

Run a test query in browser
open localhost:9200/<indexname>/page/_search?q=category:Indians&_source=opening_text&size=10

To run geo-related queries, the coordinates field (which contains the lat and lon fields) must be mapped to type geo_point. This is done by creating a mappings JSON object, which is passed just like the settings object during index creation. If an index has already been created, the default mapping (auto-created during the bulk import) can be inspected at
localhost:9200/<indexname>/_mappings?pretty
Check whether the coordinates field is of type geo_point. If not, the index has to be deleted and recreated with an appropriately modified mappings object, and the data reimported.
NOTE: In some cases data is read as lat, lon and in others as lon, lat; see Warning note.
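
A sketch of what an index-creation body with an explicit geo_point mapping might look like. This is an assumption-laden example, not the mapping Wikimedia uses: it assumes the Elasticsearch 5.x layout (mappings keyed by the document type, `page` in the URLs here) and that the point lives at `coordinates.coord`, as in the geo-distance query in this thread. `index_body` is a hypothetical helper name. All other fields are left to dynamic mapping.

```shell
# index_body: hypothetical helper that prints an index-creation body with
# both the single-shard settings and an explicit geo_point mapping.
# Assumes the Elasticsearch 5.x layout, where mappings are keyed by the
# document type ("page" in the URLs used in this thread).
index_body() {
    cat <<'EOF'
{
  "settings": {
    "index": { "number_of_shards": 1, "number_of_replicas": 0 }
  },
  "mappings": {
    "page": {
      "properties": {
        "coordinates": {
          "properties": {
            "coord": { "type": "geo_point" }
          }
        }
      }
    }
  }
}
EOF
}
```

The output can be saved with `index_body > settings.json` and passed to the same `curl -XPUT` call used to create the index, before the bulk import runs.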

Example of a geo-distance query on a wikivoyage index - finds all locations within 100 km of Chennai

curl -XGET "localhost:9200/voyage/page/_search?pretty" -H "Content-Type:application/json" -d'
{
    "query": {
        "bool" : {
            "must" : {
                "match_all" : {}
            },
            "filter" : {
                "geo_distance" : {
                    "distance" : "100km",
                    "coordinates.coord" : {
                        "lat" : 13.083889,
                        "lon" : 80.27001
                    }
                }
            }
        }
    }
}
'   

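To print just the coordinates of each hit, pipe the above output to jq '.hits.hits[]._source.coordinates[0].coord'. To print the titles instead, use jq '.hits.hits[]._source.title' (assuming the indexed documents carry a title field).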
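
To try the geo-distance query for other locations, the request body can be generated from a latitude, longitude, and radius. A minimal sketch; `geo_query` is a hypothetical helper, and the field name `coordinates.coord` is taken from the query above:

```shell
# geo_query: hypothetical helper that prints a geo_distance query body.
# Usage: geo_query <lat> <lon> [distance]
geo_query() {
    lat="$1"
    lon="$2"
    distance="${3:-100km}"
    cat <<EOF
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "geo_distance": {
          "distance": "$distance",
          "coordinates.coord": { "lat": $lat, "lon": $lon }
        }
      }
    }
  }
}
EOF
}
```

For example, `geo_query 13.083889 80.27001 | curl -XGET "localhost:9200/voyage/page/_search?pretty" -H "Content-Type:application/json" -d @-` reproduces the Chennai query, with curl reading the body from stdin.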
