
Is Postgres full-text search as flexible and easy to implement as Elasticsearch in a Rails app? #139

Closed
monfresh opened this issue Mar 13, 2014 · 28 comments

Comments

@monfresh
Member

Postgres also has full-text search capabilities (which we haven't researched yet), and we hope it will be able to provide the same functionality as Elasticsearch. If that's the case, it will allow us to remove another dependency.

@nickborromeo

This could be a quick resource for full-text search in Postgres: http://railscasts.com/episodes/343-full-text-search-in-postgresql

@monfresh
Member Author

Thanks! It's on my to-do list for next week.

@monfresh monfresh changed the title Replace Elasticsearch with Postgres Is Postgres full-text search as flexible and easy to implement as Elasticsearch in a Rails app? May 1, 2014
@monfresh
Member Author

monfresh commented May 1, 2014

I've been doing some initial research on implementing Postgres full-text search (FTS) in Rails, and from what I could find so far, it seems to be less flexible and less performant than Elasticsearch. If anyone out there reading this has experience with Postgres FTS in Rails, I would greatly appreciate your input.

The main reason for wanting to switch is to reduce dependencies, but only if certain functionality and flexibility are retained. IMO, the switch shouldn't sacrifice the following:

  • Maintainability of the code
  • Readability of the code
  • Flexibility in making changes to which fields get searched and adjusting the weights/boost of search results based on various conditions
  • Ability to do AND searches with multiple search parameters (e.g. show me all locations in ZIP code 94403 that also match the text "food pantry")
  • Ability to do geospatial queries
  • Ability to do time-based searches (show me all locations that are open between 9am and noon on Wednesday)
  • Ability to use one model for searching, while being able to search on all the model's associations, including ones 2 levels deep. All searches should return Locations. A Location belongs to an Organization, has many Services, and Services have and belong to many Categories. I should be able to do a generic keyword search that will look up the search term within all of those models. Elasticsearch is able to do this in 8ms with about 2000 records in the database.
  • Ability to define the JSON output via a DSL (nice to have, not required).
  • Maintaining the performance provided by Elasticsearch, or at least keeping queries under 50ms

In terms of Rails integration, there are only 2 gems people mention: pg_search and textacular. I've only tried pg_search so far, and it doesn't seem to support searching on multiple parameters at the same time. For example, I want to be able to find records that match all criteria passed in the search request, like http://ohana-api-demo.herokuapp.com/api/search?keyword=health&location=94402. An issue was opened 10 months ago asking if this feature was supported, but it hasn't been answered yet. Textacular seems to support this, so I'll try it out soon.

The big gotcha, judging by the documentation for both of those gems, is the difficulty of setting up the right indexes for performance, and the inflexibility in making changes that affect those indexes. For example, if you create an index via a Rails migration, then later decide you want to search on different or additional fields, or adjust the weights of certain fields, you have to write a new migration, making sure the old setup is defined properly in the down migration in case you want to roll back. This also comes at a cost: writing to the DB is blocked while the index is created, which surprisingly I've only seen mentioned in this post from Thoughtbot, which recommends creating indexes concurrently in Rails.

In comparison, making changes with Elasticsearch is a lot easier. It's just a matter of making changes to the model's code, especially if you're using a library with a great DSL like Tire.

There isn't much documentation about how to create and update these indexes, and the suggestions vary from author to author. Textacular recommends creating a "view", like in this gist, but doesn't mention how the view gets updated when a record is created or updated. The index documentation on pg_search is very sparse.

Another blog post recommends setting up a tsvector column with a trigger to update the index upon INSERT or UPDATE. The post mentions this setup alongside pg_search. I'm not sure if it would work with textacular.

All in all, I've found the documentation and tutorials around Postgres FTS lacking compared to Elasticsearch, especially around setting up indexes properly. The responsiveness and helpfulness of Karel Minarik, the maintainer of Tire (as well as his ample documentation and examples), is one of the main reasons I stuck with Elasticsearch.

So, all of this is to ask you to please provide any insight if you have experience implementing complex search in Rails with Postgres FTS. Here's the current Elasticsearch search method if you want to see what Postgres should be able to do.

@ahhrrr I see that you've contributed code to Textacular. Would love to hear your thoughts!

Thanks!

@monfresh
Member Author

monfresh commented May 2, 2014

I also wanted to add that we are writing this software for other people to use; people who might not be as technically proficient as we are, so I think it's super important for the app to be as easy to use and customize as possible.

Search cannot be built in a one-size-fits-all fashion, so people who redeploy this app in their communities will most likely want to adjust various search settings, and the easier it is to do so, the better.

I'm not sure if PostGIS is required to combine FTS with geospatial queries, but it's worth noting that PostGIS support on Heroku is "in beta and subject to change in the future," and only available on Production tier plans (https://devcenter.heroku.com/articles/heroku-postgres-extensions-postgis-full-text-search) starting at $50/month.

@spara

spara commented May 2, 2014

Yes, PostGIS is needed for geospatial queries, but installation is easy if you're running 9.3:

CREATE EXTENSION postgis;

That's it. After that, you'll have to create a point column from the lat/longs and you should be good to go as far as geo goes.
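
In case it helps, here is a minimal sketch of that point-column setup. The latitude/longitude column names are assumptions, not taken from this app's schema:

-- Assumed: locations has latitude/longitude columns (hypothetical names).
ALTER TABLE locations ADD COLUMN lonlat geography(Point, 4326);
UPDATE locations
  SET lonlat = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)::geography;

-- Example: all locations within 8 km of a point. With geography values,
-- ST_DWithin takes a distance in meters.
SELECT id, name FROM locations
WHERE ST_DWithin(lonlat, ST_SetSRID(ST_MakePoint(-122.3, 37.55), 4326)::geography, 8000);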


@migurski
Contributor

migurski commented May 2, 2014

Hi Moncef, very thorough!

I’m not sure what the JSON bit is about. The application layer should be determining the output JSON structure, not the data store.

Regarding geospatial queries, it really depends on what you're trying to do. PostGIS is needed for geometry predicates (e.g. points inside polygons) and fast spatial indexes, but for the data volumes handled by Ohana there's a lot of headroom to get by with simple/stupid approaches. Sorting a ~2,000 row table by naive 2D distance calculated on the fly should present no noticeable performance impact, for example.
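
For the record, a minimal sketch of that "simple/stupid" approach, assuming hypothetical latitude/longitude columns on locations:

-- Sort ~2,000 rows by naive squared 2D distance, computed on the fly;
-- no PostGIS or spatial index involved.
SELECT id, name
FROM locations
ORDER BY (latitude - 37.55) ^ 2 + (longitude - (-122.3)) ^ 2
LIMIT 30;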

For indexes, I'd love to see some concrete performance numbers. 8ms responses are nice, but anything under 100ms will probably be fine. The data volumes handled by Ohana are so small that fine-tuning indexes might be a waste of effort, in contrast to doing a simple table scan. There’s just not that much data here.

As far as the technical proficiency bit goes, I have my doubts about people’s desire to fine-tune anything in the search index.

@migurski
Contributor

migurski commented May 2, 2014

Also, regarding views: they are kept up-to-date by Postgres, internally, with no action or intervention required on the application’s part. “Consistency” is the “C” in ACID. Making this Postgres’s problem so you can strip index maintenance out of your code completely is why I’m advocating so strongly for this direction.
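
(A hedged clarification: that is true of plain views, which re-run their defining query on every read. A MATERIALIZED VIEW, like the search_index example later in this thread, stores a snapshot and needs an explicit refresh.)

-- A plain view always reflects the current base tables (hypothetical name).
CREATE VIEW location_terms AS
  SELECT id, name FROM locations;

-- A materialized view must be refreshed to pick up changes:
-- REFRESH MATERIALIZED VIEW search_index;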

@spara

spara commented May 2, 2014

I agree with Mike that the application layer should produce the JSON. While Postgres has a number of functions to deal with arrays and aggregate them to produce JSON, it can take a lot of time to get the SQL to produce something close to, but not quite, what you want. I spent a fair bit of time working on this in February.


@monfresh
Member Author

monfresh commented May 2, 2014

@spara and @migurski, thanks for your input! I will try Textacular and see if I can get it to at least pass the search tests. I will try something simple like Geocoder at first for the location queries, then see if using PostGIS instead makes a significant difference. Once the tests pass, I will measure the performance.

Sorry for the confusion about JSON. The JSON is indeed handled by the application, but the Tire gem, for example, provides a DSL to make it easier to construct. If the Location model has 20 fields, 18 of which I want to include in the JSON, I can just do this:

def to_indexed_json
  to_json except: ['updated_at', 'created_at']
end

As opposed to having to manually define a huge page-long hash that contains every single field I want.

@monfresh
Member Author

monfresh commented May 2, 2014

So, I spent the past few hours digging into Textacular, and out of the box, it doesn't provide the same functionality as Elasticsearch/Tire. I opened this issue to ask if what I want to do is possible, and the answer is that it would require some "serious AREL fu", which I certainly don't have. This is what I meant by Elasticsearch being a lot more approachable.

At this point, I can't progress without help. If anyone with experience building complex queries like this could work with me, I'd greatly appreciate it.

@migurski
Contributor

migurski commented May 2, 2014

Do you have the ability to drop down to plain old SQL via your ORM? This is a case where the underlying functionality is a straightforward application of decades-old SQL features ("SELECT stuff FROM place WHERE this AND that…") and the layering of gems on top and dependency on 3rd party developers to add functionality seems to be getting in your way.
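
To illustrate, a sketch only; the WHERE clause and bindings here are illustrative, not this app's actual query:

# ActiveRecord lets you run raw SQL and still get model instances back.
keyword = 'food pantry'
locations = Location.find_by_sql([
  "SELECT DISTINCT locations.*
   FROM locations
   JOIN services ON services.location_id = locations.id
   WHERE to_tsvector('english', services.keywords) @@ plainto_tsquery('english', ?)",
  keyword
])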

@monfresh
Member Author

monfresh commented May 2, 2014

Yes, you can. As a first step, do you know how to join tables two levels deep? For example, I found this example of a Materialized View:

CREATE MATERIALIZED VIEW search_index AS 
SELECT post.id,
       post.title,
       setweight(to_tsvector(post.language::regconfig, post.title), 'A') || 
       setweight(to_tsvector(post.language::regconfig, post.content), 'B') ||
       setweight(to_tsvector('simple', author.name), 'C') ||
       setweight(to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')), 'A') as document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id

In the example above, a Post would belong to an Author, and a Post would have many tags, just like we have a Location that belongs to an Organization and has many Services.

But next, I want to be able to include in that view the names of the Categories that belong to Services. Services have and belong to many Categories, so there is a join table called categories_services that defines which categories are assigned to which services.

How do I add another JOIN in the view above based on the Location table to get the Category names for the Services that belong to a particular Location?
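
If it helps, here's a hedged sketch of those extra joins, reusing the pattern from the example above with this project's table names:

-- Aggregate the category names for each location's services, two
-- levels deep from locations.
SELECT locations.id,
       setweight(to_tsvector('simple', coalesce(string_agg(categories.name, ' '), '')), 'B') AS categories_document
FROM locations
JOIN services ON services.location_id = locations.id
JOIN categories_services ON categories_services.service_id = services.id
JOIN categories ON categories.id = categories_services.category_id
GROUP BY locations.id;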

@spara

spara commented May 2, 2014

I can help with queries. Could you dump your database and send it to me?


@monfresh
Member Author

monfresh commented May 2, 2014

Thanks, @spara. It's probably easiest if you just run the app on your machine. Assuming you have all the prerequisites, just clone the app, then run script/bootstrap. Or, just set up the virtual machine.

@spara

spara commented May 2, 2014

Cool! Forgot that the VM was built from the repo.


@monfresh
Member Author

monfresh commented May 5, 2014

Here's what I've found so far after some more tinkering.
I decided to start simple with just Active Record, and was able to search for text within the fields of various tables, and filter the results, but it's hella slow. Here's a crude method I put together:

def self.find_locations(params)
  locations = Location.order(:name)

  if params[:keyword].present?
    locations = locations.joins(:services => :categories).
      where(
        "locations.name @@ :q OR
        locations.description @@ :q OR
        services.keywords @@ :q OR
        categories.name @@ :q",
        q: params[:keyword]
      ).uniq
  end

  if params[:category].present?
    locations = locations.joins(:services => :categories).
      where(categories: { name: params[:category] }).uniq
  end

  locations = locations.near(params[:location], 5) if params[:location].present?
  locations
end

With this setup, I can do a query that can narrow down results by various parameters:

Location.find_locations(keyword: "union members", location: '94403', category: 'Health')

Performance is not so great, though. The fewer the parameters, the slower the query:

Location.find_locations(keyword: "union members")
Location Load (1019.5ms)

Location.find_locations(keyword: "union members", location: '94403')
Location Load (263.5ms)

Location.find_locations(keyword: "union members", location: '94403', category: 'Health')
Location Load (22.3ms)

These times are with GIN indexes on all the searched fields, for example:

CREATE INDEX index_services_on_keywords ON services USING gin(to_tsvector('english', keywords));
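
One hedged observation, not verified against this app: a clause like services.keywords @@ :q is expanded by Postgres using the default text search configuration, which doesn't match the to_tsvector('english', ...) expression in the index above, so the planner can't use the GIN index and falls back to a sequential scan. For the index to kick in, the query has to repeat the indexed expression verbatim:

-- Matches the expression index, so the planner can use it:
SELECT * FROM services
WHERE to_tsvector('english', keywords) @@ plainto_tsquery('english', 'food pantry');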

By comparison, the same queries are run by Elasticsearch out of the box (without hitting the DB) in the following times: 3ms, 4ms, 2ms.

Next, I tried to speed up the search by using a "view" along with the Textacular gem, as explained here and here.

Here's what my view migration looks like:

class AddSearchesView < ActiveRecord::Migration
  def up
    ActiveRecord::Base.connection.execute <<-SQL
      CREATE VIEW searches AS
        SELECT DISTINCT locations.id AS searchable_id,
                'Location' AS searchable_type,
                locations.name AS term
        FROM locations

        UNION

        SELECT DISTINCT locations.id AS searchable_id,
                'Location' AS searchable_type,
                locations.description AS term
        FROM locations

        UNION

        SELECT DISTINCT locations.id AS searchable_id,
                'Location' AS searchable_type,
                organizations.name AS term
        FROM locations
        JOIN organizations ON organizations.id = locations.organization_id

        UNION

        SELECT DISTINCT locations.id AS searchable_id,
                'Location' AS searchable_type,
                services.name AS term
        FROM locations
        JOIN services ON services.location_id = locations.id

        UNION

        SELECT DISTINCT locations.id AS searchable_id,
                'Location' AS searchable_type,
                services.description AS term
        FROM locations
        JOIN services ON services.location_id = locations.id

        UNION

        SELECT DISTINCT locations.id AS searchable_id,
                'Location' AS searchable_type,
                services.keywords AS term
        FROM locations
        JOIN services ON services.location_id = locations.id

        UNION

        SELECT DISTINCT locations.id AS searchable_id,
                'Location' AS searchable_type,
                categories.name AS term
        FROM locations
        JOIN services ON services.location_id = locations.id
        JOIN categories_services ON categories_services.service_id = services.id
        JOIN categories ON categories.id = categories_services.category_id
    SQL
  end

  def down
    ActiveRecord::Base.connection.execute <<-SQL
      DROP VIEW searches;
    SQL
  end
end

This allows me to search for text in all the same fields as before, but much faster:

Search.new('union members')
Search Load (11.0ms)
Location Load (1.3ms)

Note that with this setup, there are two queries being made: one against the searches view, and then the locations that correspond to the results are fetched.

The Search model looks like this:

class Search < ActiveRecord::Base
  extend Textacular

  belongs_to :searchable, polymorphic: true

  def self.new(query)
    query = query.to_s
    return [] if query.empty?
    search(query).map(&:searchable)
  end
end

This view method works fine when you have very basic requirements and all you need to do is search on one parameter. But we need to be able to do more than that, especially filtering. The problem with this setup is that the search returns an Array instead of an ActiveRecord::Relation, so I can't do further filtering of the results like I did in the Active Record-based code at the top of this comment. I haven't been able to figure out how to take advantage of this view in conjunction with filtering. If someone knows how to do this, please let me know, but read on first, because the view might not be the most performant method.
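
(One hedged idea for staying in Relation land: filter Location by the IDs the view search returns, instead of mapping to the searchable objects directly.)

# Sketch: keep an ActiveRecord::Relation so filters still chain,
# reusing the .near scope from the find_locations method above.
ids = Search.search(query).pluck(:searchable_id)
locations = Location.where(id: ids)
locations = locations.near(params[:location], 5) if params[:location].present?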

After I tried the "view," I decided to give pg_search another try, because I know it returns ActiveRecord::Relations. The README for pg_search warns that although it supports searching columns in associated tables, "it will be impossible to speed up searches with database indexes." One method I read about to improve performance when searching across various tables is to use a tsvector column, as explained here. I tried that approach and came up with these migrations:

First, add the tsvector column to the locations table and add an index for it:

class AddTsvectorColumnToLocations < ActiveRecord::Migration
  def up
    add_column :locations, :search_vector, :tsvector
    add_index  :locations, :search_vector, using: 'gin'
  end

  def down
    remove_column :locations, :search_vector
  end
end

Next, create the function and trigger to fill the search vector:

class CreateFunctionAndTriggerForFillingLocationsSearchVector < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE OR REPLACE FUNCTION fill_search_vector_for_location() RETURNS trigger LANGUAGE plpgsql AS $$
      declare
        location_organization record;
        location_services_keywords record;
        location_services_description record;
        location_services_name record;
        service_categories record;

      begin
        select name into location_organization from organizations where id = new.organization_id;
        select string_agg(keywords, ' ') as keywords into location_services_keywords from services where location_id = new.id;
        select string_agg(description, ' ') as description into location_services_description from services where location_id = new.id;
        select string_agg(name, ' ') as name into location_services_name from services where location_id = new.id;
        select string_agg(categories.name, ' ') as name into service_categories from locations
        JOIN services ON services.location_id = locations.id
        JOIN categories_services ON categories_services.service_id = services.id
        JOIN categories ON categories.id = categories_services.category_id
        WHERE locations.id = new.id;


        new.search_vector :=
          setweight(to_tsvector('pg_catalog.english', coalesce(new.name, '')), 'B')                  ||
          setweight(to_tsvector('pg_catalog.english', coalesce(new.description, '')), 'A')                ||
          setweight(to_tsvector('pg_catalog.english', coalesce(location_organization.name, '')), 'B')        ||
          setweight(to_tsvector('pg_catalog.english', coalesce(location_services_description.description, '')), 'A')    ||
          setweight(to_tsvector('pg_catalog.english', coalesce(location_services_name.name, '')), 'B')  ||
          setweight(to_tsvector('pg_catalog.english', coalesce(location_services_keywords.keywords, '')), 'A') ||
          setweight(to_tsvector('pg_catalog.english', coalesce(service_categories.name, '')), 'B');

        return new;
      end
      $$;
    SQL

    execute <<-SQL
      CREATE TRIGGER locations_search_content_trigger BEFORE INSERT OR UPDATE
        ON locations FOR EACH ROW EXECUTE PROCEDURE fill_search_vector_for_location();
    SQL

    Location.find_each(&:touch)
  end

  def down
    execute <<-SQL
      DROP TRIGGER locations_search_content_trigger ON locations;
      DROP FUNCTION fill_search_vector_for_location();
    SQL
  end
end

The way you use this tsvector column with pg_search is by adding the following code to your model:

include PgSearch

pg_search_scope :search, against: :search_vector,
  using: {
    tsearch: {
      dictionary: 'english',
      any_word: false,
      prefix: true,
      tsvector_column: 'search_vector'
    }
  }

According to the tsvector columns section in the pg_search README, it's supposed to speed up search dramatically. Compared to the "view method," I'm getting mixed results. When the number of results is low, the tsvector column is almost 3 times faster: Location.search(keyword: 'union members') is now down to 3.7ms from 12.3ms. However, for searches that return many results, if you don't use pagination to limit what's returned at once, it's actually almost 5 times slower: Location.search(keyword: 'health'), which returns 939 results, goes from 117ms to 527.6ms. Maybe it's because the view method returned an Array and the tsvector column returns an ActiveRecord::Relation. It could also be that the view method didn't return that many results. I'd have to go back and check the total count.

This is the query it runs:

SELECT "locations".*, ((ts_rank(("locations"."search_vector"), (to_tsquery('english', ''' ' || 'health' || ' ''')), 0))) AS pg_search_rank FROM "locations" WHERE ((("locations"."search_vector") @@ (to_tsquery('english', ''' ' || 'health' || ' ''')))) ORDER BY pg_search_rank DESC, "locations"."id" ASC

It could also be that I didn't properly set up the function. I have zero experience with tsvector columns, so I would greatly appreciate someone looking over how I set it up. I'm assuming that every column you want to search on has to be declared as a separate record. Initially, I thought that I could add different columns from the same table to the same record, but that gave me an error that a record didn't have a particular field.

Note that this is when returning all results at once, but still, compared to Elasticsearch, it's almost 6x slower (Elasticsearch can return 867 results at once in 90ms). If we limit the results to 30 per page (the default), then speeds are back down to high double digits. The API allows the client to set the per_page parameter to up to 100, and if clients exercise that option, then we are starting to creep into the 3-digit territory:

Location.search(keyword: 'health', location: '94403').page(1).per(100)
Location Load (124.3ms)

The other issue I need to look more into is that this tsvector column business is returning a lot more results than Elasticsearch, so I'll have to do some testing to make sure the right results are being returned.

That's where I'm at now, and my brain is pretty fried, so I'm gonna step away from the computer now 😄

In the meantime, I'd greatly appreciate advice from someone who has successfully implemented PG full-text search, preferably in a Rails app. It seems like the tsvector option is better than the view, and it certainly is in terms of Rails integration, but I'd like to hear from someone with more PG experience than me.

Is the tsvector column the way to go? Did I set it up right? Should I create a separate tsvector column for each parameter the API supports? For example, the keyword parameter is for a generic search across multiple tables, which is what I have a tsvector column created for. There is also the language parameter that lets you find locations that match a particular language, and the category parameter to search specifically in the name column of the Categories table. Should I create a separate tsvector column in the Locations table for each of those parameters?

Thanks!

@monfresh
Member Author

monfresh commented May 5, 2014

Another important thing I forgot to mention in terms of performance is that all the times I've been reporting were solely for retrieving records from the database. They don't include the time it takes to generate the JSON, which requires further DB calls, because although Location records are initially returned, the JSON also includes all the associated tables (so that clients can get a Location's address by making one API call instead of two, for example). This JSON presentation is done by the grape-entity library, and it adds 7-10ms on average per record. Here's a sample DB log:

  Address Load (0.7ms)  SELECT "addresses".* FROM "addresses" WHERE "addresses"."location_id" = $1 LIMIT 1  [["location_id", 1217]]
  Contact Load (0.6ms)  SELECT "contacts".* FROM "contacts" WHERE "contacts"."location_id" = $1  [["location_id", 1217]]
  Fax Load (0.5ms)  SELECT "faxes".* FROM "faxes" WHERE "faxes"."location_id" = $1  [["location_id", 1217]]
  MailAddress Load (2.8ms)  SELECT "mail_addresses".* FROM "mail_addresses" WHERE "mail_addresses"."location_id" = $1 LIMIT 1  [["location_id", 1217]]
  Phone Load (0.5ms)  SELECT "phones".* FROM "phones" WHERE "phones"."location_id" = $1  [["location_id", 1217]]
  Service Load (2.8ms)  SELECT "services".* FROM "services" WHERE "services"."location_id" = $1  [["location_id", 1217]]
  Category Load (0.9ms)  SELECT DISTINCT "categories".* FROM "categories" INNER JOIN "categories_services" ON "categories"."id" = "categories_services"."category_id" WHERE "categories_services"."service_id" = $1  [["service_id", 1217]]
  Organization Load (0.5ms)  SELECT "organizations".* FROM "organizations" WHERE "organizations"."id" = $1 LIMIT 1  [["id", 1216]]

This means that a page that returns 30 results would add about 210 to 300ms. Compare that to the times that I've been reporting for Elasticsearch, which include the JSON representation, and it makes Elasticsearch 100 times faster!

If there's a way to generate that JSON without hitting the DB, like Elasticsearch does, I would love to know about it!

@migurski
Contributor

migurski commented May 5, 2014

Great writeup. I'd be curious to see a complete query log from the tsvector method. It sounds like it's very fast at retrieving a list of IDs, but then slow at retrieving the complete linked information for each of those IDs, because it goes back to Postgres for each one in a followup query? Your second comment seems to confirm this.

I suspect that ElasticSearch is fast because you’re actively populating it with complete results in your application code, which speeds up retrieval by putting data where it’s needed ahead of time but also introduces risk by requiring you to maintain consistency between two data stores.

In non-Rails applications, I've seen a few approaches to deal with this. The big source of slow-down here is probably query overhead between the application and the database. One approach to reduce database chatter is batching queries and asking for things in groups, e.g. getting a list of mail_addresses all at once instead of asking for each individually. Another is to cache individual entities in memcache, ask for them individually from the cache if available, set long expiry times on the cached representations, and then proactively delete them when records are updated, unfortunately pulling back some of the consistency responsibility into the app.

Do you have a way of integrating some of the tiny queries for things like addresses & phone numbers into joined columns on the locations table? More aggressively, could you move that information from separate tables into the locations table directly? That would remove a major source of query overhead and result in a DB schema that’s a bit more streamlined.
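
A sketch of that cache-aside idea in Rails terms (the key scheme and method name are hypothetical, not code from this app):

# Cache each location's representation with a long expiry, and
# proactively delete it whenever the record changes.
class Location < ActiveRecord::Base
  after_commit { Rails.cache.delete("location/#{id}/json") }

  def cached_json
    Rails.cache.fetch("location/#{id}/json", expires_in: 12.hours) { as_json }
  end
end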

@migurski
Contributor

migurski commented May 5, 2014

@pui tells me that ActiveRecord’s eager loading feature is designed to address this specific issue, and generates sub-select queries in place of multiple followup queries:

13. Eager Loading Associations

Eager loading is the mechanism for loading the associated records of the objects returned by Model.find using as few queries as possible.

Active Record lets you specify in advance all the associations that are going to be loaded. This is possible by specifying the includes method of the Model.find call. With includes, Active Record ensures that all of the specified associations are loaded using the minimum possible number of queries.
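
Applied to this thread's models, a hedged sketch (the association names are inferred from the query logs in this thread, so treat them as assumptions):

# Eager-load everything the JSON needs in a handful of batched queries.
locations = Location.search('union members')
                    .includes(:organization, :address, :mail_address,
                              :contacts, :phones, :faxes,
                              services: :categories)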

@monfresh
Member Author

monfresh commented May 5, 2014

Duh! I thought eager loading was already happening. Including all the tables speeds things up nicely, but it looks like there's a bug with grape-entity: it's fetching the associated tables twice, once for just the first result, then again for all the results:

Location Load (17.4ms)  SELECT "locations".*, ((ts_rank((to_tsvector('simple', coalesce("locations"."languages"::text, ''))), (to_tsquery('simple', ''' ' || 'french' || ' ''')), 0))) AS pg_search_rank FROM "locations" WHERE (((to_tsvector('simple', coalesce("locations"."languages"::text, ''))) @@ (to_tsquery('simple', ''' ' || 'french' || ' ''')))) ORDER BY pg_search_rank DESC, "locations"."id" ASC LIMIT 1 OFFSET 0
  Organization Load (0.8ms)  SELECT "organizations".* FROM "organizations" WHERE "organizations"."id" IN (2)
  Address Load (0.9ms)  SELECT "addresses".* FROM "addresses" WHERE "addresses"."location_id" IN (3)
  MailAddress Load (0.9ms)  SELECT "mail_addresses".* FROM "mail_addresses" WHERE "mail_addresses"."location_id" IN (3)
  Contact Load (0.8ms)  SELECT "contacts".* FROM "contacts" WHERE "contacts"."location_id" IN (3)
  Phone Load (0.8ms)  SELECT "phones".* FROM "phones" WHERE "phones"."location_id" IN (3)
  Fax Load (1.2ms)  SELECT "faxes".* FROM "faxes" WHERE "faxes"."location_id" IN (3)
  Service Load (0.9ms)  SELECT "services".* FROM "services" WHERE "services"."location_id" IN (3)
  SQL (1.3ms)  SELECT "categories".*, "t0"."service_id" AS ar_association_key_name FROM "categories" INNER JOIN "categories_services" "t0" ON "categories"."id" = "t0"."category_id" WHERE "t0"."service_id" IN (3)
  Location Load (15.0ms)  SELECT "locations".*, ((ts_rank((to_tsvector('simple', coalesce("locations"."languages"::text, ''))), (to_tsquery('simple', ''' ' || 'french' || ' ''')), 0))) AS pg_search_rank FROM "locations" WHERE (((to_tsvector('simple', coalesce("locations"."languages"::text, ''))) @@ (to_tsquery('simple', ''' ' || 'french' || ' ''')))) ORDER BY pg_search_rank DESC, "locations"."id" ASC LIMIT 30 OFFSET 0
  Organization Load (0.9ms)  SELECT "organizations".* FROM "organizations" WHERE "organizations"."id" IN (2, 17, 24, 30, 69, 89, 101, 108, 110, 129, 132, 150, 153, 175, 203, 206, 212, 219, 230, 312, 314, 338, 349, 351, 356, 363, 382, 406, 408, 412)
  Address Load (0.7ms)  SELECT "addresses".* FROM "addresses" WHERE "addresses"."location_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)
  MailAddress Load (1.0ms)  SELECT "mail_addresses".* FROM "mail_addresses" WHERE "mail_addresses"."location_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)
  Contact Load (0.9ms)  SELECT "contacts".* FROM "contacts" WHERE "contacts"."location_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)
  Phone Load (1.0ms)  SELECT "phones".* FROM "phones" WHERE "phones"."location_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)
  Fax Load (0.7ms)  SELECT "faxes".* FROM "faxes" WHERE "faxes"."location_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)
  Service Load (4.9ms)  SELECT "services".* FROM "services" WHERE "services"."location_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)
  SQL (1.7ms)  SELECT "categories".*, "t0"."service_id" AS ar_association_key_name FROM "categories" INNER JOIN "categories_services" "t0" ON "categories"."id" = "t0"."category_id" WHERE "t0"."service_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)

If, instead of using grape-entity, I just define the JSON myself by overriding the as_json method, then it doesn't do this double fetching. So, I'll see if I can define the JSON easily myself, while keeping the method clean and readable. If not, I'll try some other JSON builders like RABL and JBuilder.
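
For what it's worth, a minimal as_json sketch (the association names are assumptions based on the query logs above, not this app's actual serializer):

# Overriding as_json on Location keeps the JSON definition short while
# still including nested associations.
def as_json(options = {})
  super(
    except: [:created_at, :updated_at],
    include: {
      organization: { only: [:id, :name] },
      services: { include: :categories }
    }
  )
end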

@migurski
Contributor

migurski commented May 5, 2014

Schwing!

monfresh added a commit that referenced this issue May 10, 2014
Replace Elasticsearch with Postgres full-text search. Fixes #139
@ylluminate

How's this looking as of April 2015, @monfresh? Any updates or performance improvements? I've been considering using pg_search to simplify an Elasticsearch nightmare we've been having.

@monfresh
Member Author

monfresh commented Apr 7, 2015

It's been working great. We've been using pg_search in production (Heroku) for several months without any issues.

@ylluminate

@monfresh 👍 thanks for the feedback. I'd certainly like more benchmark info, but honestly, avoiding the headaches we've had with Elasticsearch will likely be worth it even if this is a little slower.

@monfresh
Member Author

monfresh commented Apr 8, 2015

Unfortunately, I don't have benchmark comparisons for you, and your numbers will depend on how you configure your server, how much traffic you get, and what kind of caching you implement. What I can tell you is that our Heroku server gets between 20,000 and 50,000 requests per week (they don't all hit the DB), our database response time averages under 15ms, and the app response time has been between 40ms and 80ms on average, according to New Relic.

@orlando

orlando commented May 29, 2015

@monfresh this is a great write-up. Did you make a post out of this thread?

@monfresh
Member Author

@orlando I did not, but that's a good idea.

@thebenedict

@monfresh This is exactly the information I was looking for and couldn't find in the pg_search docs or other blog posts. Well worth making a post if you have the time. Thank you for the thorough work and having this discussion in a public thread.
