Indexing in SearchPress

Matthew Boynes edited this page Apr 19, 2020 · 5 revisions

Bulk Indexing

When you first activate SearchPress, you need to index your content, that is, send all of it to Elasticsearch. There are two ways to do this: via the WordPress Admin or via WP-CLI.

Syncing via the WordPress Admin

SearchPress provides a tool to sync content under Tools → SearchPress. This tool will sync content using the WordPress cron system in batches of 500 posts, roughly once per minute. If your site has 20,000 posts, that should take 40-45 minutes to complete. For a site that size, you should consider using WP-CLI, which will be considerably faster.
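As a rough check on the figure above, the arithmetic can be sketched in a few lines of PHP. The function name is hypothetical and not part of SearchPress; it assumes exactly one batch per minute, so real runs will take somewhat longer because cron fires only roughly once per minute.

```php
<?php
/**
 * Hypothetical helper (not part of SearchPress): estimate how many minutes
 * a cron-based sync needs, assuming one batch is processed per minute.
 * The batch size must be greater than zero.
 */
function estimate_cron_sync_minutes( int $total_posts, int $batch_size = 500 ): int {
	// Round up so that a partial final batch still counts as one cron run.
	return (int) ceil( $total_posts / $batch_size );
}

echo estimate_cron_sync_minutes( 20000 ), "\n";      // 40
echo estimate_cron_sync_minutes( 20000, 250 ), "\n"; // 80
```

Halving the batch size doubles the number of cron runs, which is worth keeping in mind before lowering it on a large site.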

If you need to increase or decrease the bulk indexing size, you can do so in your theme (or a plugin) using the following example:

add_action(
	'searchpress_loaded',
	function() {
		// Decrease cron bulk indexing size to 250.
		SP_Sync_Meta()->bulk = 250;
	}
);

Syncing via WP-CLI

If you have WP-CLI installed, SearchPress automatically registers a command for tasks such as syncing your content. This is the preferred method for syncing content if your site has more than a thousand posts.

To remove any extant data, add the mapping from scratch, and run a full sync on your content, the command would be:

wp searchpress index --flush --put-mapping

This deletes the current index if there is one (--flush), adds the mapping (--put-mapping), and indexes all content. Content is synced in batches of 2,000 by default; you can change the batch size with the --bulk flag, e.g. --bulk=1000. We found 2,000 to be a good default, but a larger or smaller batch may work better for your specific content. If batches are too large, you may run into size issues, either PHP memory or the size of the payload sent to Elasticsearch; if they are too small, HTTP overhead may keep indexing from being as fast as it could be. The command reports its progress and estimates the time remaining based on the overall average velocity.

Domain

SearchPress uses the current domain (specifically, get_site_url()) when interacting with Elasticsearch. Depending on your host, wp-config, and so on, WP-CLI may resolve a different domain than the one you see when you visit the live site. To avoid a mismatch, set the --url flag explicitly, e.g. --url=example.com.

Post Updates

When a post or attachment is created or updated, it is automatically synced to Elasticsearch. When it is trashed/deleted, it is automatically removed from the Elasticsearch index.

SP_Post Object

SearchPress prepares a post for indexing by populating an object of the class SP_Post. The SP_Post object is very flexible and allows you to alter any of the data prior to indexing.

Indexing Post Meta

Early versions of SearchPress indexed almost all post meta with the post. Starting in the 0.4 release, SearchPress only indexes the post meta that it is explicitly told to index. Further, it only indexes post meta in the data types that a site's developer plans to use. The principal reasons behind this change are to improve performance and to prevent a "mapping explosion".

Data type casting will only be attempted for a key if the opt-in callback specifies that type for the key in question (see example below for the full list of possible types). However, the data type will still only be indexed if the type casting is successful. For example, attempting to index the meta value "WordPress" as a long would fail, since it is not a numeric value. This failure is silent, for better or worse, but type casting is overall quite forgiving.

If a meta key is allowed to be indexed, the meta value will always be indexed as an unanalyzed string (post_meta.*.raw) and that type need not be specified. This is primarily for compatibility with ES_WP_Query, which depends on that key in EXISTS queries, among others.
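The behavior described in the two paragraphs above can be pictured with a small standalone sketch. This is an illustration only and not SearchPress's actual casting code: the function name is hypothetical, and it covers just the long type, using is_numeric() as a stand-in for whatever check the plugin performs.

```php
<?php
/**
 * Illustrative sketch only; NOT SearchPress's actual implementation.
 * Shows the idea of "always index raw, add a typed value only if the cast works".
 */
function sketch_cast_meta_as_long( $value ): array {
	// The unanalyzed string is always present for an allowed key.
	$indexed = [ 'raw' => (string) $value ];

	// The typed value is added only when the cast can succeed.
	// A non-numeric value is skipped silently, with no error raised.
	if ( is_numeric( $value ) ) {
		$indexed['long'] = (int) $value;
	}

	return $indexed;
}

var_dump( sketch_cast_meta_as_long( 'WordPress' ) ); // Only 'raw' is present.
var_dump( sketch_cast_meta_as_long( '42' ) );        // Both 'raw' and 'long' are present.
```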

How to index post meta

add_filter(
    'sp_post_allowed_meta',
    function( $allowed_meta ) {
        // Tell SearchPress to index 'some_meta_key' post meta when encountered.
        $allowed_meta['some_meta_key'] = [
            'value',    // Index as an analyzed string.
            'boolean',  // Index as a boolean value.
            'long',     // Index as a "long" (integer).
            'double',   // Index as a "double" (floating point number).
            'date',     // Index as a GMT date-only value in the format Y-m-d.
            'datetime', // Index as a GMT datetime value in the format Y-m-d H:i:s.
            'time',     // Index as a GMT time-only value in the format H:i:s.
        ];
        return $allowed_meta;
    }
);
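The listing above enables every available type for a single key in order to document them. In practice you would likely opt in only to the types you plan to query. A minimal sketch, assuming two hypothetical meta keys named price and event_date:

```php
add_filter(
	'sp_post_allowed_meta',
	function( $allowed_meta ) {
		// 'price' and 'event_date' are hypothetical keys used for illustration.
		$allowed_meta['price']      = [ 'double' ];           // Cast to a floating point number.
		$allowed_meta['event_date'] = [ 'date', 'datetime' ]; // Cast to date-only and datetime values.
		return $allowed_meta;
	}
);
```

Both keys are still indexed as unanalyzed strings in addition to the types listed, so there is no need to request that type.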

Actions and Filters

Here are some of the actions and filters available for customization in SP_Post:

Filters:

  • sp_post_allowed_meta: Filters which post meta should be indexed and the data types it should be cast to.
    • array $meta: The meta to be indexed, as 'meta_key' => (array) $data_types. See the docs above on how to index post meta.
    • integer $post_id: The ID of the current post.
  • sp_post_pre_index: Filters the data that will be sent to Elasticsearch immediately before it is converted to JSON.
  • sp_post_should_be_indexed: Determines whether this post should be indexed.
    • boolean $should_be_indexed: Defaults to true. Return false to prevent this post from being indexed.
    • SP_Post $sp_post: The current SP_Post object.
  • sp_post_index_filtered_data: Determines whether the title and content should be filtered prior to indexing.
    • boolean $should_filter_data: Defaults to false. If true, the title and post_content will be filtered as if they were being displayed on the front end of the site.
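
As a rough illustration of how two of these filters might be used, the sketch below turns on front-end filtering and skips one post type. The internal_note post type is hypothetical, and reading the post ID as $sp_post->post_id is an assumption about the SP_Post class that you should verify against the plugin source before relying on it.

```php
// Run the title and content through front-end display filters before indexing.
add_filter( 'sp_post_index_filtered_data', '__return_true' );

// Keep a hypothetical 'internal_note' post type out of the index.
add_filter(
	'sp_post_should_be_indexed',
	function( $should_be_indexed, $sp_post ) {
		// Assumption: the post ID is readable as $sp_post->post_id.
		if ( 'internal_note' === get_post_type( $sp_post->post_id ) ) {
			return false;
		}
		return $should_be_indexed;
	},
	10,
	2 // Request both arguments: the current value and the SP_Post object.
);
```

Returning the incoming $should_be_indexed value in the default case lets other callbacks on the same filter keep their say.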