- Version: 0.3.1
- Author: Nick Dunn
- Build Date: 2012-03-06
- Requirements: Symphony 2.2.5
The ElasticSearch extension integrates Symphony with ElasticSearch to provide powerful indexing and search for your site.
- Add the `elasticsearch` folder to your Extensions directory
- Enable the extension from the Extensions page
- Check that a directory was created at `/workspace/elasticsearch`
- Create mapping files in `/workspace/elasticsearch/mappings`
- Send your mappings to ElasticSearch (using the ElasticSearch > Mappings page)
- Install ElasticSearch
- Configure the Symphony extension
- Fulltext search data source
- Autocomplete
- Multilingual Search
- Logging and analysis
You will need to install ElasticSearch (ES) on your server:
Linux:
# replace 0.18.7 with latest stable tag
cd ~
wget https://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.18.7.tar.gz -O elasticsearch.tar.gz
tar -xf elasticsearch.tar.gz
rm elasticsearch.tar.gz
sudo mv elasticsearch-* elasticsearch
sudo mv elasticsearch /usr/local/share
Mac OSX:
brew install elasticsearch
There are several ES plugins you will find useful. They are all easy to install and use.
This plugin installs a `service` shortcut to start/stop the ES service on your server. Install the plugin (assumes the paths above, which will be different for an OS X Homebrew install):
curl -L http://github.com/elasticsearch/elasticsearch-servicewrapper/tarball/master | tar -xz
mv *servicewrapper*/service /usr/local/share/elasticsearch/bin/
rm -Rf *servicewrapper*
sudo /usr/local/share/elasticsearch/bin/service/elasticsearch install
sudo ln -s `readlink -f /usr/local/share/elasticsearch/bin/service/elasticsearch` /usr/local/bin/rcelasticsearch
You can now `start`, `stop` or `restart` ElasticSearch using the following command from anywhere:
service elasticsearch start
ElasticSearch runs on port 9200 by default, therefore a successful installation should yield some Douglas Adams gold at:
http://localhost:9200/
The elasticsearch-head plugin provides a UI for browsing your ES cluster, its indexes and content. Use it to test queries and explore new things.
# installs from https://github.com/Aconex/elasticsearch-head
sudo /usr/local/share/elasticsearch/bin/plugin -install Aconex/elasticsearch-head
Once installed you can view the plugin at:
http://localhost:9200/_plugin/head/
The elasticsearch-mapper-attachments plugin allows you to index the contents of binary files such as Word, PDF and others. Once installed you can use a field type of `attachment` when configuring section mappings (more on this later).
# replace 1.2.0 with latest stable tag https://github.com/elasticsearch/elasticsearch-mapper-attachments
sudo /usr/local/share/elasticsearch/bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/1.2.0
By default ElasticSearch runs on port 9200 and is open and public. If running ElasticSearch on a public webserver, you can lock down access using Basic HTTP authentication. This is provided by the `elasticsearch-http-basic` plugin. Install by downloading the .jar file to your ES plugins directory:
mkdir /usr/local/share/elasticsearch/plugins/http-basic
cd /usr/local/share/elasticsearch/plugins/http-basic
wget https://github.com/downloads/Asquera/elasticsearch-http-basic/elasticsearch-http-basic-1.0.3.jar
Then add the plugin configuration to your `elasticsearch.yml` file:
http.basic.enabled: true
http.basic.user: "my_username"
http.basic.password: "my_password"
Restart ElasticSearch:
service elasticsearch restart
The root of your ElasticSearch server (e.g. http://localhost:9200/) will still return JSON, so you can easily check server status. But other requests will be blocked. Add your username and password to the System > Preferences page in Symphony.
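As a rough illustration of what any client has to send once the plugin is enabled, the Python sketch below builds a request carrying an HTTP Basic `Authorization` header. The `authed_request` helper and the `/my-site/_search` path are assumptions made for the example; the extension itself reads the credentials from System > Preferences.

```python
import base64
from urllib.request import Request


def authed_request(url, username, password):
    """Build a request with an HTTP Basic Authorization header.

    Hypothetical helper for illustration only; it is not part of the extension.
    """
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return Request(url, headers={"Authorization": "Basic " + token})


# Example values taken from the configuration shown above.
request = authed_request(
    "http://localhost:9200/my-site/_search", "my_username", "my_password"
)
```

Passing `request` to `urllib.request.urlopen` would then send the credentials with the query.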
Before we go any further, you should know that ElasticSearch is powerful. It uses Lucene under the hood, so it supports a ton of things like word stemming, stop words, ngrams, wildcards, accent folding, more like this, synonyms and more. I have written this ElasticSearch extension to provide you with a set of sensible defaults for fulltext search. If you want to change the way this works, then it's simply a case of modifying JSON files. But the idea is that this extension should give you excellent results 90% of the time.
On installation the extension will have created a directory in your workspace folder named `elasticsearch` containing the following:
/workspace
    /elasticsearch
        .htaccess
        index.json
        /mappings
The `.htaccess` file keeps your files private. `index.json` is a JSON document which contains the configuration passed when an ElasticSearch index is created. Specifically, this config file specifies two custom analysers (`symphony_fulltext` and `symphony_autocomplete`) and two custom filters (`custom_synonyms` and `custom_stop`). More on these later.
(If your permissions prevented these files from being created, create them now by copying the files from the `templates` directory of the extension.)
It's probably best to first describe the nomenclature you need to be familiar with when using ElasticSearch.
ElasticSearch runs as a service on your webserver. If you are running a single server you are running a single ElasticSearch cluster. A single cluster can house the search for more than one site, each of which is stored in an index. When you use this extension, it will create a single index for your site (e.g. `my-site`). Within an index are types. These map nicely onto Symphony sections, e.g. `articles`, `products` or `comments`. ElasticSearch stores documents (Symphony entries) which are made up of fields.
Fields within a document can be strings, numbers, dates, arrays/collections, or several others. Although ElasticSearch will automatically create a new type when you throw a new type of document at it, it is usually best to define the structure of a type first. This is very much like defining a section in Symphony: you define the fields, and the properties of these fields. The structure of a type is known as a mapping, and is formatted in a JSON file.
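To make the nomenclature concrete, here is a small Python sketch of how index, type and document fit together in ElasticSearch's REST addressing (`/index/type/id`). The `document_url` helper is hypothetical and the values are the examples used above; the extension builds these requests for you.

```python
def document_url(host, index, doc_type, entry_id):
    """Return the REST address of one document (a Symphony entry).

    index    -> one per site, e.g. "my-site"
    doc_type -> one per Symphony section, e.g. "articles"
    entry_id -> the Symphony entry ID
    """
    return f"{host.rstrip('/')}/{index}/{doc_type}/{entry_id}"


url = document_url("http://localhost:9200/", "my-site", "articles", 2)
print(url)
```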
The final things to understand are query types, analysers, tokenisers and filters. Stay with me, OK?
A query type is how to query ElasticSearch e.g. text, boolean, wildcard, fuzzy. This extension just uses two types: query_string and match_all.
Analysers are the logic that is run against both the content you are indexing (an entry) and what you are searching for (a keyword). An analyser comprises a tokeniser, which specifies how the tokens (usually words) are broken up (usually based on spaces between words), and filters, which work their magic on each word (such as removing stop words, reducing a word to its stem, or replacing with a synonym).
ElasticSearch provides a suite of analysers which all have different combinations of tokenisers and filters. To prevent you from having to read, understand and apply these, this extension provides two custom analysers which are good for most situations. They are called `symphony_fulltext` and `symphony_autocomplete` and are used for fulltext search and search input autocomplete respectively.
They are configured in the `index.json` file in your workspace directory.
The fulltext analyser uses a suite of filters to strip down text into its most basic form:
- `stop` applies Lucene's default stop words list
- `asciifolding` converts accented characters, e.g. `é` becomes `e`
- `snowball` applies word stemming for European languages, e.g. `library` and `libraries` become `librari`
- `lowercase` makes all words case-insensitive
- `custom_synonyms` applies a list of user-defined synonyms
- `custom_stop` applies a list of user-defined stop words
The autocomplete analyser is more forgiving than the fulltext analyser and just applies `asciifolding` and `lowercase` filters.
It is important to note that the same analyser must be applied both to the indexed entry and to the search keywords. For example, if the indexed entry contains the text `School Library`, it would be indexed as `school librari`. If a user searched for `School Library` and the keywords were left unanalysed, the entry would not be matched! The user's input keywords must also be run through the same analyser, so that `school` and `librari` can be matched.
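The idea can be sketched in a few lines of Python. This is a toy stand-in, not the real `symphony_fulltext` analyser: the stop word list is a tiny sample, `toy_stem` only imitates the snowball behaviour described above for words ending in `y` or `ies`, and the filter order is an assumption.

```python
import unicodedata

# Tiny sample list; Lucene's default stop word list is longer.
STOP_WORDS = {"a", "an", "and", "of", "the"}


def fold_ascii(token):
    """Strip accents, e.g. 'é' becomes 'e' (imitates asciifolding)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token):
    """Crude imitation of stemming: 'library' and 'libraries' -> 'librari'."""
    if token.endswith("ies"):
        return token[:-3] + "i"
    if token.endswith("y"):
        return token[:-1] + "i"
    return token


def analyse(text):
    """Tokenise on whitespace, then lowercase, fold, drop stop words, stem."""
    tokens = [fold_ascii(token.lower()) for token in text.split()]
    return [toy_stem(token) for token in tokens if token not in STOP_WORDS]


def matches(indexed_text, keywords):
    """True when every analysed keyword appears among the analysed index terms."""
    indexed_terms = set(analyse(indexed_text))
    return all(term in indexed_terms for term in analyse(keywords))


print(analyse("School Library"))
print(matches("School Library", "school libraries"))
```

Because both sides pass through `analyse`, a search for `school libraries` finds an entry containing `School Library`, whereas the raw word `library` is not among the indexed terms at all.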
This is where you put your new knowledge to the test and map your Symphony sections into ElasticSearch types. This is achieved by creating two files in the `workspace/elasticsearch/mappings` directory for each section you want to index. Let's assume you want to index a section named `Articles` which has four fields:
- Title (input)
- Content (textarea)
- Is Published (checkbox)
- Document (file upload)
Begin by creating a file named `articles.json` in the mappings directory. This file will define the fields that the ElasticSearch document will contain when it indexes an article. You decide that you only want the Title, Content and Document fields indexed for search:
{
    "articles": {
        "properties": {
            "title": {
                "type" : "multi_field",
                "store": "yes",
                "fields": {
                    "title": {"type" : "string"},
                    "symphony_fulltext" : {"type" : "string", "analyzer": "symphony_fulltext"},
                    "symphony_autocomplete" : {"type" : "string", "analyzer": "symphony_autocomplete"}
                },
                "boost": 3.0
            },
            "content": {
                "type" : "multi_field",
                "store": "yes",
                "fields": {
                    "content": {"type" : "string"},
                    "symphony_fulltext" : {"type" : "string", "analyzer": "symphony_fulltext"}
                },
                "symphony_highlight": "yes"
            },
            "document": {
                "type" : "multi_field",
                "store": "yes",
                "fields": {
                    "document": {"type" : "attachment"},
                    "symphony_fulltext" : {"type" : "attachment", "analyzer": "symphony_fulltext"}
                },
                "symphony_highlight": "yes"
            }
        }
    }
}
Wow, what is all this about? It's easy. It's an object that matches the handle of your section. Each field that you want indexed is in there. Each field could be defined as a core type such as string, number or date, but we're doing something clever and defining them as a `multi_field` type. This means that each field can be indexed in several (three) different ways:
- (default) uses the same name as the field, and just indexes the field as normal, as if we weren't using a `multi_field` type at all
- `symphony_fulltext` indexes the field again but runs its content through the aforementioned `symphony_fulltext` analyser. Fields that define a `symphony_fulltext` index are searched against for fulltext search when using the search datasource bundled with this extension
- `symphony_autocomplete` indexes the field again but runs its content through the aforementioned `symphony_autocomplete` analyser. Fields that define a `symphony_autocomplete` index are searched against for autocomplete/suggest search when using the autocomplete search datasource bundled with this extension
Adding a `boost` property for `title` ranks this field three times more important than other fields in the section. Adding a `symphony_highlight` property for `content` and `document` configures ElasticSearch to return excerpts from these fields with the search terms highlighted. See Example XML response for more.
(Note: `symphony_highlight` is a custom property you won't find in the ElasticSearch docs; it is used only by this extension.)
Creating the JSON mapping is the first of two steps. The second involves converting Symphony entry data from an array into the JSON that ElasticSearch expects. Again, it's easy. For your Articles section, create a file named `articles.php`, also in `workspace/elasticsearch/mappings`.
<?php
class elasticsearch_articles {
    public function mapData(Array $data, Entry $entry) {
        $json = array();
        // var_dump($data);
        $json['_boost'] = 1;
        // returning false prevents an unpublished entry from being indexed
        if($data['is-published']['value'] !== 'yes') return false;
        $json['title'] = $data['title']['value'];
        $json['content'] = $data['content']['value'];
        // the attachment field type expects the file contents base64-encoded
        $json['document'] = base64_encode(file_get_contents($data['document']['file']));
        return $json;
    }
}
First of all, the class name should match the section handle, prefixed with `elasticsearch_`. Hyphens become underscores. The `mapData` method is provided with the entry's data as an array, and the raw `Entry` object if you need it (you usually won't). This method should return an array, which becomes the JSON document, containing the data for all fields you specified in the mapping JSON file above.
Adding a `_boost` (note the underscore) to the object will boost this section above others in search results.
If you need to prevent some entries from being indexed, then check them here. In our example an entry has an Is Published checkbox, so we must check that its value is `yes`, otherwise the entry should not be indexed. Return `false` from the method to prevent the entry being indexed.
Files can be indexed if you've got the `attachment` plugin installed. Send the file's contents as a base64-encoded string.
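The class naming rule is easy to get wrong for hyphenated section handles, so here is a one-function Python sketch of it. It assumes the `elasticsearch_` prefix seen in the `elasticsearch_articles` example applies to every section; `mapper_class_name` and the `blog-posts` handle are made up for illustration.

```python
def mapper_class_name(section_handle):
    """Derive the expected PHP class name from a Symphony section handle.

    Assumes the "elasticsearch_" prefix from the articles example;
    hyphens in the handle become underscores.
    """
    return "elasticsearch_" + section_handle.replace("-", "_")


print(mapper_class_name("articles"))
print(mapper_class_name("blog-posts"))
```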
First things first, we need to create our master index. Navigate to System > Preferences in Symphony and find the ElasticSearch settings.
- `Host` is the full hostname of your ElasticSearch server. If running on the same webserver as Symphony use `http://localhost:9200/`
- `Index Name` is the handle of your index. Maybe `your-site-name`
When you Save Changes, Symphony will connect to ElasticSearch and create the index. This process sends the `index.json` document, which configures the index as it is created. If you modify the `index.json` file (e.g. you add new stop words or synonyms) then you must recreate the index from scratch. The easiest way to do this is simply to change the `Index Name` and save changes. The old index will be destroyed.
Creating the mapping JSON and PHP files is one step, but the JSON needs to be sent to ElasticSearch for it to build the mapping internally. To do this, navigate to the ElasticSearch > Mappings page in Symphony and you will see a list of sections you have written mappings for.
Select the row and choose `Rebuild Mapping` from the With Selected menu. This will create the mapping type in ElasticSearch and you will be able to index entries!
From the ElasticSearch > Mappings page, select a row and choose `Reindex Entries` from the With Selected menu. This will cycle through all entries in the section and batch-submit them for indexing. This should occur in near real time.
All configuration options are stored in the Symphony config file.
- `host` root URL of your ElasticSearch server (e.g. `http://localhost:9200/`)
- `index-name` index name (e.g. `my-site`)
- `reindex-batch-size` (default `20`) number of entries to reindex simultaneously when manually reindexing a section
- `reindex-batch-delay` (default `0`) number of seconds between each batch (increase if you find this activity hogs server resources; the pause allows the server to recover between each batch!)
- `per-page` (default `20`) default number of entries per page of search results
- `sort` (default `_score`) default sort field (use ElasticSearch fields like `_score` or `_id` for best results)
- `direction` (default `desc`) default sort direction
- `highlight-fragment-size` (default `200`) maximum number of characters of each excerpt highlight
- `highlight-per-field` (default `1`) maximum number of highlights returned per field
- `build-entry-xml` (default `no`) whether to build full entry XML (all fields) in search results
- `default-sections` (default ``) default list of section handles to search in (comma-delimited)
- `default-language` (default ``) default list of languages to search in (comma-delimited)
- `logging` (default `yes`) whether to log each keyword search
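To show how `reindex-batch-size` and `reindex-batch-delay` interact, here is a minimal Python sketch of batched submission. It is not the extension's code: `batches`, `reindex` and the `index_batch` callback are hypothetical names, and the real reindexing happens from the Mappings page.

```python
import time


def batches(entry_ids, batch_size=20):
    """Yield consecutive slices of at most batch_size entry IDs."""
    for start in range(0, len(entry_ids), batch_size):
        yield entry_ids[start:start + batch_size]


def reindex(entry_ids, index_batch, batch_size=20, batch_delay=0):
    """Submit entries batch by batch, pausing batch_delay seconds after each."""
    submitted = 0
    for batch in batches(entry_ids, batch_size):
        index_batch(batch)  # stand-in for the request sent to ElasticSearch
        submitted += len(batch)
        if batch_delay:
            time.sleep(batch_delay)  # give the server time to recover
    return submitted


sent = []
total = reindex(list(range(45)), sent.append, batch_size=20, batch_delay=0)
print(total, [len(batch) for batch in sent])
```

A larger delay spreads the load over more time; a larger batch size means fewer, heavier requests.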
You have installed ElasticSearch, configured the Symphony extension, and mapped your Symphony sections to be indexed.
Fulltext search across your indexed sections can be achieved using the custom ElasticSearch data source included with this extension. Attach this data source to your search results page and invoke it using the following GET parameters:
- `keywords` the string to search on e.g. `foo bar`
- `page` the results page number
- `sort` (default `_score`) the field to sort results by
- `direction` (default `desc`) either `asc` or `desc`
- `per-page` (default `20`) number of results per page
- `sections` a comma-delimited list of section handles to search within (only indexed sections will work) e.g. `articles,comments`
- `language` an optional string referring to the language code of your indexed fields (see Multilingual search)
The datasource executes a `query_string` query against any `multi_field` field with a field name of `symphony_fulltext`.
Your search form might look like this:
<form action="/search/" method="get">
    <label>Search <input type="text" name="keywords" /></label>
    <input type="hidden" name="per-page" value="10" />
    <input type="hidden" name="sections" value="articles,comments,categories" />
</form>
Note that all of these variables (except for `keywords`) have defaults in `config.php`. Change them in your config file and omit them from the URL.
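If you build search links in code rather than with a form, the same GET parameters apply. The Python sketch below assembles such a URL; `search_url` is a hypothetical helper, and the `/search/` path is simply the one used in the form above.

```python
from urllib.parse import urlencode


def search_url(base, keywords, **params):
    """Build a search page URL from keywords plus optional parameters.

    Keyword arguments use underscores (per_page) and are converted to the
    hyphenated names the data source expects (per-page). Lists are joined
    with commas, as required for the sections and language parameters.
    """
    query = {"keywords": keywords}
    for name, value in params.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        query[name.replace("_", "-")] = value
    # safe="," keeps the comma-delimited lists readable in the URL
    return base + "?" + urlencode(query, safe=",")


print(search_url("/search/", "foo bar", per_page=10, sections=["articles", "comments"]))
```

Any parameter left out falls back to the default in the config file.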
The XML returned from this data source looks like this:
<elasticsearch took="54ms" max-score="0.7293">
    <keywords>foo bar</keywords>
    <pagination total-entries="5" total-pages="1" entries-per-page="20" current-page="1" />
    <facets>
        <facet handle="filtered-sections">
            <term handle="articles" entries="3" active="yes">Articles</term>
            <term handle="comments" entries="2" active="yes">Comments</term>
        </facet>
        <facet handle="all-sections">
            <term handle="articles" entries="100" active="yes">Articles</term>
            <term handle="comments" entries="391" active="yes">Comments</term>
        </facet>
    </facets>
    <entries>
        <entry id="2" section="articles" score="0.7293">
            <highlight field="title">My favourite words are <strong class="highlight">foo</strong> and <strong class="highlight">bar</strong>, but don't tell fred!</highlight>
        </entry>
        <entry id="4" section="comments" score="0.6213">...</entry>
        <entry id="3" section="articles" score="0.5004">...</entry>
        <entry id="1" section="articles" score="0.4277">...</entry>
        <entry id="5" section="comments" score="0.2651">...</entry>
    </entries>
</elasticsearch>
The query returns two facets which are used as a breakdown of entries across sections. `filtered-sections` lists the sections for which entries were found, and how many. `all-sections` lists all sections and how many entries, regardless of the search query. The `@active` attribute is `yes` if the search is running on that section:
- if `?sections=articles,comments` is passed on the querystring then these sections will be used
- if not, the `default-sections` list from Symphony's `manifest/config.php` file will be used
- if not, all indexed sections are used
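The fallback order can be written out as a short Python sketch. The function names are hypothetical and the extension's own implementation may differ in detail, for example in how it treats handles that are not indexed.

```python
def split_handles(value):
    """Turn 'articles, comments' into ['articles', 'comments']."""
    return [handle.strip() for handle in (value or "").split(",") if handle.strip()]


def resolve_sections(query_sections, default_sections, indexed_sections):
    """Pick the sections to search, following the fallback order above."""
    requested = split_handles(query_sections)
    if requested:
        return requested  # 1. ?sections= on the querystring
    defaults = split_handles(default_sections)
    if defaults:
        return defaults  # 2. default-sections from the config file
    return list(indexed_sections)  # 3. every indexed section


indexed = ["articles", "comments", "products"]
print(resolve_sections("articles,comments", "", indexed))
print(resolve_sections(None, "", indexed))
```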
However, this output is not sufficient to build a search results page (SERP): you need the entries themselves. You can achieve this in one of two ways:
- use the `$ds-elasticsearch` output parameter from this data source to chain additional data sources to return the full entry XML
- set `build-entry-xml` to `yes` in Symphony's `manifest/config.php`, and the entry's fields will be appended to the XML
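In Symphony you would normally consume this XML with XSLT. Purely to illustrate the structure, the Python sketch below reads a trimmed copy of the response and extracts the ordered hits and the `filtered-sections` counts. The `summarise` function and the shortened sample are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response format, for illustration only.
SAMPLE = """<elasticsearch took="54ms" max-score="0.7293">
  <pagination total-entries="5" total-pages="1" entries-per-page="20" current-page="1" />
  <facets>
    <facet handle="filtered-sections">
      <term handle="articles" entries="3" active="yes">Articles</term>
      <term handle="comments" entries="2" active="yes">Comments</term>
    </facet>
  </facets>
  <entries>
    <entry id="2" section="articles" score="0.7293" />
    <entry id="4" section="comments" score="0.6213" />
  </entries>
</elasticsearch>"""


def summarise(xml_text):
    """Return (hits, counts): ordered (section, id) pairs and per-section totals."""
    root = ET.fromstring(xml_text)
    hits = [(entry.get("section"), int(entry.get("id"))) for entry in root.iter("entry")]
    facet = root.find("facets/facet[@handle='filtered-sections']")
    counts = {term.get("handle"): int(term.get("entries")) for term in facet.findall("term")}
    return hits, counts


hits, counts = summarise(SAMPLE)
print(hits)
print(counts)
```

The `(section, id)` pairs are the same information a chained data source needs in order to fetch the full entries.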
There is a second `ElasticSearch: Suggest` data source provided by this extension which, given a partial search term, will perform a wildcard search and return suggested phrases. This can be used for a basic autocomplete search box.
Create a new page, give it a page type of `XML`, attach the suggest data source, and use this XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes" encoding="UTF-8" indent="yes" />
    <xsl:template match="/">
        <xsl:copy-of select="data/elasticsearch-suggest/words" />
    </xsl:template>
</xsl:stylesheet>
This query will be run against any `multi_field` field with a name of `symphony_autocomplete`, so only specify this name for fields that make sense for autocomplete. Choose fields like post titles, product SKUs and people's names.
Pass the following querystring parameters:
- `keywords` the string to search on e.g. `foo bar`
- `sort` (default `_score`) the field to sort results by
- `sections` a comma-delimited list of section handles to search within (only indexed sections will work) e.g. `articles,comments`
The XML result looks like:
<words>
    <word>
        <raw>The Story of Foo and his Bar!</raw>
        <highlighted>The Story Of <strong>Foo</strong> and his <strong>Bar</strong>!</highlighted>
    </word>
    ...
</words>
The `raw` element contains plain text while `highlighted` contains the string with matching full words highlighted. The result is entity-encoded to make JavaScript processing easier (treat it as plain text).
While ElasticSearch does not support multilingual content out of the box, it is still possible to index and search your multilingual entries by adhering to a simple naming convention that this extension uses.
Let's say you have an Articles section with two multilingual fields: Title and Content. When you create the section mapping, you can map each of these fields for each language. For example mapping the two fields for English and German:
{
    "articles": {
        "properties": {
            "title_en": {
                "type" : "multi_field",
                "store": "yes",
                "fields": {
                    "title_en": {"type" : "string"},
                    ...
                }
            },
            "title_de": {
                "type" : "multi_field",
                "store": "yes",
                "fields": {
                    "title_de": {"type" : "string"},
                    ...
                }
            },
            "content_en": {
                "type" : "multi_field",
                "store": "yes",
                "fields": {
                    "content_en": {"type" : "string"},
                    ...
                }
            },
            "content_de": {
                "type" : "multi_field",
                "store": "yes",
                "fields": {
                    "content_de": {"type" : "string"},
                    ...
                }
            }
        }
    }
}
And the PHP mapper (note the structure for getting the per-language data might vary depending on which multilingual extension you are using in Symphony):
<?php
class elasticsearch_articles {
    public function mapData(Array $data, Entry $entry) {
        $json = array();
        // title
        $json['title_en'] = $data['title']['value']['en'];
        $json['title_de'] = $data['title']['value']['de'];
        // content
        $json['content_en'] = $data['content']['value']['en'];
        $json['content_de'] = $data['content']['value']['de'];
        return $json;
    }
}
Two Symphony fields, mapped to four ElasticSearch fields.
To search a specific language only, you can pass a `language` URL parameter to your search page. For example:
http://localhost/search/?sections=articles&keywords=foo+bar&language=en
Multiple languages can be searched at once:
http://localhost/search/?sections=articles&keywords=foo+bar&language=en,de
Omit the `language` parameter to search all fields. If `language` is omitted you can specify a default using the `default-language` property in the Symphony config file. The same convention also applies to the autocomplete data source.
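The naming convention is mechanical, which a short Python sketch makes plain. The helper names are hypothetical, and the `*_en.symphony_fulltext` pattern is assumed from the convention described here rather than taken from the extension's source.

```python
def parse_languages(param):
    """Turn the language parameter 'en,de' into ['en', 'de']."""
    return [code.strip() for code in param.split(",") if code.strip()]


def localised_field_names(fields, languages):
    """Expand Symphony field handles into one ElasticSearch field per language."""
    return [f"{field}_{lang}" for field in fields for lang in languages]


def fulltext_patterns(languages):
    """Field patterns a language-restricted fulltext search would target."""
    return [f"*_{lang}.symphony_fulltext" for lang in languages]


languages = parse_languages("en,de")
print(localised_field_names(["title", "content"], languages))
print(fulltext_patterns(languages))
```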
If you have never looked over a search log, then shame on you. Do yourself a favour and read Lou Rosenfeld's Search Analytics For Your Site to be instantly convinced that optimising search will benefit you and your users.
You can configure Google Analytics to track searches on your site. It will show you which terms were searched for, and which pages people started searching from (which usually means that page should contain information regarding their search term!). However, Google Analytics isn't a dedicated search term analytics tool and doesn't give you the granular breakdown that analytics nerds so desperately desire.
To this end, this extension logs every search query it makes (you can disable logging in the config) for you to pore over in your spare time. Logs are broken down by:
- Session Logs shows each individual user session. Use this to spot behaviours such as pogo-sticking, the difference between mobile and desktop use, and how users correct search terms (which can suggest synonyms to add to the index)
- Query Logs shows the most popular search terms, so you can see which terms are used the most and the least, whether they return many hits, and whether people are prepared to sift through many pages
- support for multilang using naming conventions e.g. *_en.symphony_fulltext