Is Postgres full-text search as flexible and easy to implement as Elasticsearch in a Rails app? #139
This could be a quick resource for full-text search in Postgres: http://railscasts.com/episodes/343-full-text-search-in-postgresql
Thanks! It's on my to-do list for next week.
I've been doing some initial research on implementing Postgres full-text search (FTS) in Rails, and from what I could find so far, it seems to be less flexible and less performant than Elasticsearch. If anyone out there reading this has experience with Postgres FTS in Rails, I would greatly appreciate your input. The main reason for wanting to switch is to reduce dependencies, but only if certain functionality and flexibility are retained. In my opinion, the switch shouldn't sacrifice the features described below.
In terms of Rails integration, there are only two gems people mention: pg_search and textacular. I've only tried pg_search so far, and it doesn't seem to support searching on multiple parameters at the same time. For example, I want to be able to find records that match all criteria passed in the search request, like http://ohana-api-demo.herokuapp.com/api/search?keyword=health&location=94402. An issue was opened 10 months ago asking if this feature was supported, but it hasn't been answered yet. Textacular seems to support this, so I'll try it out soon.

The one thing that seems to be the big gotcha, based on the documentation for both of those gems, is the difficulty in setting up the right indexes for performance, and the inflexibility in making changes that affect the indexes. For example, if you create an index via a Rails migration, then later decide you want to search on different or more fields, or adjust the weights of certain fields, you have to set up a new migration, making sure the old setup is defined properly in the migration's `down` method. In comparison, making changes with Elasticsearch is a lot easier: it's just a matter of changing the model's code, especially if you're using a library with a great DSL like Tire.

There isn't much documentation about how to create and update these indexes, and the suggestions vary from author to author. Textacular recommends creating a "view", like in this gist, but doesn't mention how the view gets updated when a record is created or updated. The index documentation on pg_search is very sparse. Another blog post recommends setting up a tsvector column with a trigger to update the index upon INSERT or UPDATE. The post mentions this setup alongside pg_search; I'm not sure if it would work with textacular.

All in all, I've found the documentation and tutorials around Postgres FTS lacking in comparison to Elasticsearch, especially around setting up indexes properly. The responsiveness and helpfulness of Karel Minarik, the maintainer of Tire (as well as his ample documentation and examples), is one of the main reasons I stuck with Elasticsearch.

So, all of this is to ask you to please provide any insight if you have experience implementing complex search in Rails with Postgres FTS. Here's the current Elasticsearch search method if you want to see what Postgres should be able to do. @ahhrrr I see that you've contributed code to Textacular. Would love to hear your thoughts! Thanks!
I also wanted to add that we are writing this software for other people to use; people who might not be as technically proficient as us. So I think it's super important for the app to be as easy to use and customize as possible. Search cannot be built in a one-size-fits-all fashion, so people who redeploy this app in their communities will most likely want to adjust various search settings, and the easier it is to do so, the better. I'm not sure if PostGIS is required to combine FTS with geospatial queries, but it's worth noting that PostGIS support on Heroku is "in beta and subject to change in the future," and only available on Production tier plans starting at $50/month.
Yes, PostGIS is needed for geospatial queries, but installation is easy: `CREATE EXTENSION postgis;` That's it. After that you'll have to create a point column from the existing latitude/longitude data.
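A minimal sketch of what that could look like in a migration; the migration name, the lonlat column, and the latitude/longitude source columns are assumptions, not the project's actual schema:

```ruby
class AddPointColumnToLocations < ActiveRecord::Migration
  def up
    execute <<-SQL
      ALTER TABLE locations ADD COLUMN lonlat geography(Point, 4326);
      UPDATE locations
         SET lonlat = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)::geography;
      CREATE INDEX index_locations_on_lonlat ON locations USING gist (lonlat);
    SQL
  end

  def down
    execute <<-SQL
      DROP INDEX index_locations_on_lonlat;
      ALTER TABLE locations DROP COLUMN lonlat;
    SQL
  end
end
```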
Hi Moncef, very thorough! I'm not sure what the JSON bit is about. The application layer should be determining the output JSON structure, not the data store.

Regarding geospatial queries, it really depends on what you're trying to do. PostGIS is needed for geometry predicates (e.g. points inside polygons) and fast spatial indexes, but for the data volumes handled by Ohana there's a lot of headroom to get by with simple/stupid approaches. Sorting a ~2,000 row table by naive 2D distance calculated on the fly should present no noticeable performance impact, for example.

For indexes, I'd love to see some concrete performance numbers. 8ms responses are nice, but anything under 100ms will probably be fine. The data volumes handled by Ohana are so small that fine-tuning indexes might be a waste of effort compared to doing a simple table scan. There's just not that much data here.

As far as the technical proficiency bit goes, I have my doubts about people's desire to fine-tune anything in the search index.
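A rough sketch of that kind of simple approach, assuming the locations table has plain latitude and longitude columns (hypothetical code, not the project's actual method):

```ruby
# Order a small table by naive squared 2D distance, computed on the fly.
# Good enough for a few thousand rows; no PostGIS or spatial index involved.
class Location < ActiveRecord::Base
  def self.roughly_nearest_to(lat, lng)
    order(sanitize_sql_array(
      ["(latitude - ?) * (latitude - ?) + (longitude - ?) * (longitude - ?)",
       lat, lat, lng, lng]
    ))
  end
end

# Location.roughly_nearest_to(37.5630, -122.3255).limit(30)
```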
Also, regarding views: they are kept up-to-date by Postgres, internally, with no action or intervention required on the application's part. "Consistency" is the "C" in ACID. Making this Postgres's problem so you can strip index maintenance out of your code completely is why I'm advocating so strongly for this direction.
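As a small illustration of that point (the view and column names here are made up, not from the project): a plain, non-materialized view is just a stored query, so reads through it always reflect the current state of the underlying tables, with nothing to refresh:

```ruby
# Hypothetical example: a plain view needs no maintenance code at all.
ActiveRecord::Base.connection.execute <<-SQL
  CREATE VIEW location_terms AS
  SELECT locations.id AS searchable_id, locations.name AS term
  FROM locations;
SQL

# Later INSERTs/UPDATEs on locations are immediately visible through the view:
#   SELECT * FROM location_terms WHERE term ILIKE '%health%';
```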
I agree with Mike that the application layer should produce the JSON.
@spara and @migurski, thanks for your input! I will try Textacular and see if I can get it to at least pass the search tests. I will try something simple like Geocoder at first for the location queries, then see if using PostGIS instead makes a significant difference. Once the tests pass, I will measure the performance.

Sorry for the confusion about JSON. The JSON is indeed handled by the application, but the Tire gem, for example, provides a DSL to make it easier to construct. If the Location model has 20 fields, 18 of which I want to include in the JSON, I can just do this:

```ruby
def to_indexed_json
  to_json except: ['updated_at', 'created_at']
end
```

As opposed to having to manually define a huge page-long hash that contains every single field I want.
So, I spent the past few hours digging into Textacular, and out of the box, it doesn't provide the same functionality as Elasticsearch/Tire. I opened this issue to ask if what I want to do is possible, and the answer is that it would require some "serious AREL fu", which I certainly don't have. This is what I meant by Elasticsearch being a lot more approachable. At this point, I can't progress without help. If anyone with experience building complex queries like this could work with me, I'd greatly appreciate it.
Do you have the ability to drop down to plain old SQL via your ORM? This is a case where the underlying functionality is a straightforward application of decades-old SQL features ("SELECT stuff FROM place WHERE this AND that…"), and the layering of gems on top, with its dependency on 3rd-party developers to add functionality, seems to be getting in your way.
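For what it's worth, dropping to raw SQL from ActiveRecord can look something like the sketch below; the query itself is only an illustration of the shape, borrowing table names mentioned elsewhere in this thread:

```ruby
# Run a hand-written SQL query and get Location objects back.
locations = Location.find_by_sql([
  "SELECT DISTINCT locations.*
     FROM locations
     JOIN services ON services.location_id = locations.id
    WHERE to_tsvector('english', services.keywords) @@ plainto_tsquery('english', ?)",
  "health"
])
```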
Yes, you can. As a first step, do you know how to join tables two levels deep? For example, I found this example of a materialized view:

```sql
CREATE MATERIALIZED VIEW search_index AS
SELECT post.id,
       post.title,
       setweight(to_tsvector(post.language::regconfig, post.title), 'A') ||
       setweight(to_tsvector(post.language::regconfig, post.content), 'B') ||
       setweight(to_tsvector('simple', author.name), 'C') ||
       setweight(to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')), 'A') AS document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id
```

In the example above, a Post would belong to an Author, and a Post would have many Tags, just like we have a Location that belongs to an Organization and has many Services. But next, I want to be able to include in that view the names of the Categories that belong to Services. Services have and belong to many Categories, so there is a join table called `categories_services`. How do I add another JOIN in the view above, based on the Location table, to get the Category names for the Services that belong to a particular Location?
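Purely as an illustration of the shape (not a verified answer, and the column names are guesses), folding the Category names in through the `categories_services` join table could look roughly like this:

```ruby
# Hypothetical sketch of a search_index-style materialized view over Ohana-like tables.
# Table/column names and weights are assumptions.
ActiveRecord::Base.connection.execute <<-SQL
  CREATE MATERIALIZED VIEW location_search_index AS
  SELECT locations.id,
         setweight(to_tsvector('english', coalesce(locations.name, '')), 'A') ||
         setweight(to_tsvector('english', coalesce(locations.description, '')), 'B') ||
         setweight(to_tsvector('english', coalesce(string_agg(services.keywords, ' '), '')), 'B') ||
         setweight(to_tsvector('english', coalesce(string_agg(categories.name, ' '), '')), 'A') AS document
  FROM locations
  LEFT JOIN services ON services.location_id = locations.id
  LEFT JOIN categories_services ON categories_services.service_id = services.id
  LEFT JOIN categories ON categories.id = categories_services.category_id
  GROUP BY locations.id, locations.name, locations.description
SQL
```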
I can help with queries. Could you dump your database and send it to me?
Thanks, @spara. It's probably easiest if you just run the app on your machine. Assuming you have all the prerequisites installed, just clone the app, then run
Cool! Forgot that the VM was built from the repo.
Here's what I've found so far after some more tinkering:

```ruby
def self.find_locations(params)
  locations = Location.order(:name)

  if params[:keyword].present?
    locations = locations.joins(:services => :categories).
      where(
        "locations.name @@ :q OR
         locations.description @@ :q OR
         services.keywords @@ :q OR
         categories.name @@ :q",
        q: params[:keyword]
      ).uniq
  end

  if params[:category].present?
    locations = locations.joins(:services => :categories).
      where(categories: { name: params[:category] }).uniq
  end

  locations = locations.near(params[:location], 5) if params[:location].present?
  locations
end
```

With this setup, I can do a query that narrows down results by various parameters:

```ruby
Location.find_locations(keyword: "union members", location: '94403', category: 'Health')
```

Performance is not so great, though. The fewer the parameters, the slower the query:

```
Location.find_locations(keyword: "union members")
Location Load (1019.5ms)

Location.find_locations(keyword: "union members", location: '94403')
Location Load (263.5ms)

Location.find_locations(keyword: "union members", location: '94403', category: 'Health')
Location Load (22.3ms)
```

These times are with GIN indexes on all the searched fields, for example:

```sql
CREATE INDEX index_services_on_keywords ON services USING gin(to_tsvector('english', keywords));
```

By comparison, the same queries are run by Elasticsearch out of the box (without hitting the DB) in 3ms, 4ms, and 2ms.
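One way to check whether those GIN indexes are actually being used is to look at the query plan. This is just a sketch (not part of the original setup); note that the WHERE clause is written against the same expression the index was built on:

```ruby
# Print the planner's output for a keyword search against services.keywords.
# Illustrative only; adjust table/column names and the search term as needed.
plan = ActiveRecord::Base.connection.execute(<<-SQL)
  EXPLAIN ANALYZE
  SELECT locations.id
  FROM locations
  JOIN services ON services.location_id = locations.id
  WHERE to_tsvector('english', services.keywords) @@ plainto_tsquery('english', 'health')
SQL
plan.each { |row| puts row['QUERY PLAN'] }
```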
Next, I tried to speed up the search by using a "view" along with the Textacular gem, as explained here and here. Here's what my view migration looks like:

```ruby
class AddSearchesView < ActiveRecord::Migration
  def up
    ActiveRecord::Base.connection.execute <<-SQL
      CREATE VIEW searches AS
      SELECT DISTINCT locations.id AS searchable_id,
                      'Location' AS searchable_type,
                      locations.name AS term
      FROM locations
      UNION
      SELECT DISTINCT locations.id AS searchable_id,
                      'Location' AS searchable_type,
                      locations.description AS term
      FROM locations
      UNION
      SELECT DISTINCT locations.id AS searchable_id,
                      'Location' AS searchable_type,
                      organizations.name AS term
      FROM locations
      JOIN organizations ON organizations.id = locations.organization_id
      UNION
      SELECT DISTINCT locations.id AS searchable_id,
                      'Location' AS searchable_type,
                      services.name AS term
      FROM locations
      JOIN services ON services.location_id = locations.id
      UNION
      SELECT DISTINCT locations.id AS searchable_id,
                      'Location' AS searchable_type,
                      services.description AS term
      FROM locations
      JOIN services ON services.location_id = locations.id
      UNION
      SELECT DISTINCT locations.id AS searchable_id,
                      'Location' AS searchable_type,
                      services.keywords AS term
      FROM locations
      JOIN services ON services.location_id = locations.id
      UNION
      SELECT DISTINCT locations.id AS searchable_id,
                      'Location' AS searchable_type,
                      categories.name AS term
      FROM locations
      JOIN services ON services.location_id = locations.id
      JOIN categories_services ON categories_services.service_id = services.id
      JOIN categories ON categories.id = categories_services.category_id
    SQL
  end

  def down
    ActiveRecord::Base.connection.execute <<-SQL
      DROP VIEW searches;
    SQL
  end
end
```

This allows me to search for text in all the same fields as before, but much faster:

```
Search.new('union members')
Search Load (11.0ms)
Location Load (1.3ms)
```

Note that with this setup, there are two queries being made: one against the searches view, and then the locations that correspond to the results are fetched. The Search model looks like this:
```ruby
class Search < ActiveRecord::Base
  extend Textacular
  belongs_to :searchable, polymorphic: true

  def self.new(params)
    query = params.to_s
    return [] if query.empty?
    self.search(query).map!(&:searchable)
  end
end
```

This view method works fine when you have very basic requirements and all you need to do is search on one parameter. But we need to be able to do more than that, especially filtering. The problem with this setup is that this search returns an Array rather than an ActiveRecord::Relation, so it can't be chained with additional scopes.

After I tried the "view," I decided to give pg_search another try, because I know it returns an ActiveRecord::Relation. First, add the tsvector column to the locations table and add an index for it:
```ruby
class AddTsvectorColumnToLocations < ActiveRecord::Migration
  def up
    add_column :locations, :search_vector, :tsvector
    add_index :locations, :search_vector, using: 'gin'
  end

  def down
    remove_column :locations, :search_vector
  end
end
```

Next, create the function and trigger to fill the search vector:
```ruby
class CreateFunctionAndTriggerForFillingLocationsSearchVector < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE OR REPLACE FUNCTION fill_search_vector_for_location() RETURNS trigger LANGUAGE plpgsql AS $$
      declare
        location_organization record;
        location_services_keywords record;
        location_services_description record;
        location_services_name record;
        service_categories record;
      begin
        select name into location_organization from organizations where id = new.organization_id;
        select string_agg(keywords, ' ') as keywords into location_services_keywords from services where location_id = new.id;
        select string_agg(description, ' ') as description into location_services_description from services where location_id = new.id;
        select string_agg(name, ' ') as name into location_services_name from services where location_id = new.id;
        select string_agg(categories.name, ' ') as name into service_categories from locations
          JOIN services ON services.location_id = locations.id
          JOIN categories_services ON categories_services.service_id = services.id
          JOIN categories ON categories.id = categories_services.category_id;

        new.search_vector :=
          setweight(to_tsvector('pg_catalog.english', coalesce(new.name, '')), 'B') ||
          setweight(to_tsvector('pg_catalog.english', coalesce(new.description, '')), 'A') ||
          setweight(to_tsvector('pg_catalog.english', coalesce(location_organization.name, '')), 'B') ||
          setweight(to_tsvector('pg_catalog.english', coalesce(location_services_description.description, '')), 'A') ||
          setweight(to_tsvector('pg_catalog.english', coalesce(location_services_name.name, '')), 'B') ||
          setweight(to_tsvector('pg_catalog.english', coalesce(location_services_keywords.keywords, '')), 'A') ||
          setweight(to_tsvector('pg_catalog.english', coalesce(service_categories.name, '')), 'B');

        return new;
      end
      $$;
    SQL

    execute <<-SQL
      CREATE TRIGGER locations_search_content_trigger BEFORE INSERT OR UPDATE
      ON locations FOR EACH ROW EXECUTE PROCEDURE fill_search_vector_for_location();
    SQL

    Location.find_each(&:touch)
  end

  def down
    execute <<-SQL
      DROP TRIGGER locations_search_content_trigger ON locations;
      DROP FUNCTION fill_search_vector_for_location();
    SQL
  end
end
```

The way you use this tsvector column with pg_search is by adding the following code to your model:
```ruby
include PgSearch

pg_search_scope :search, against: :search_vector,
  using: {
    tsearch: {
      dictionary: 'english',
      any_word: false,
      prefix: true,
      tsvector_column: 'search_vector'
    }
  }
```

According to the tsvector columns section in the pg_search README, it's supposed to speed up search dramatically. Compared to the "view method," I'm getting mixed results. When the number of results is low, the tsvector column is almost 3 times faster. This is the query it runs:

```sql
SELECT "locations".*, ((ts_rank(("locations"."search_vector"), (to_tsquery('english', ''' ' || 'health' || ' ''')), 0))) AS pg_search_rank
FROM "locations"
WHERE ((("locations"."search_vector") @@ (to_tsquery('english', ''' ' || 'health' || ' '''))))
ORDER BY pg_search_rank DESC, "locations"."id" ASC
```

It could also be that I didn't properly set up the function. I have zero experience with tsvector columns, so I would greatly appreciate someone looking over how I set it up. I'm assuming that every column you want to search on has to be declared as a separate record. Initially, I thought that I could add different columns from the same table to the same record, but that gave me an error that a record didn't have a particular field.

Note that this is when returning all results at once, but still, compared to Elasticsearch, it's almost 6x slower (Elasticsearch can return 867 results at once in 90ms). If we limit the results to 30 per page (the default), then speeds are back down to high double digits. The API allows the client to set the number of results per page:

```
Location.search(keyword: 'health', location: '94403').page(1).per(100)
Location Load (124.3ms)
```

The other issue I need to look more into is that this tsvector column business is returning a lot more results than Elasticsearch, so I'll have to do some testing to make sure the right results are being returned.

That's where I'm at now, and my brain is pretty fried, so I'm gonna step away from the computer now 😄 In the meantime, I'd greatly appreciate advice from someone who has successfully implemented PG full-text search, preferably in a Rails app. It seems like the tsvector option is better than the view, and it certainly is in terms of Rails integration, but I'd like to hear from someone with more PG experience than me. Is the tsvector column the way to go? Did I set it up right? Should I create a separate tsvector column for each parameter the API supports? For example, the keyword parameter is for a generic search across multiple tables, which is what I have a tsvector column created for. There is also the

Thanks!
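For anyone double-checking a setup like this, here is one rough way to peek at what actually ended up in the search_vector column for a record and whether it matches a given query. This is only a sketch based on the migrations above, with an arbitrary search term:

```ruby
# Look at the stored tsvector for one location and test it against a query.
row = ActiveRecord::Base.connection.execute(<<-SQL).first
  SELECT name,
         search_vector,
         search_vector @@ plainto_tsquery('english', 'health') AS matches_health
  FROM locations
  LIMIT 1
SQL
puts row['name']
puts row['search_vector']
puts row['matches_health']
```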
Another important thing I forgot to mention in terms of performance: all the times I've been reporting were solely for retrieving records from the database. They don't include the time it takes to generate the JSON, which requires further DB calls, because although Location records are initially returned, the JSON also includes all the associated tables (so that clients can get a Location's address by making one API call instead of two, for example). This JSON presentation is done by the grape-entity library, and results in an additional 7-10ms on average per record. Here's a sample DB log:
This means that a page that returns 30 results would add about 210 to 300ms. Compare that to the times that I've been reporting for Elasticsearch, which include the JSON representation, and it makes Elasticsearch 100 times faster! If there's a way to generate that JSON without hitting the DB, like Elasticsearch does, I would love to know about it!
Great writeup. I'd be curious to see a complete query log from the tsvector method. It sounds like it's very fast at retrieving a list of IDs, but then slow at retrieving the complete linked information for each of those IDs, because it's going back to Postgres for each one in a followup query? Your second comment seems to confirm this. I suspect that ElasticSearch is fast because you're actively populating it with complete results in your application code, which speeds up retrieval by putting data where it's needed ahead of time, but also introduces risk by requiring you to maintain consistency between two data stores.

In non-Rails applications, I've seen a few approaches to deal with this. The big source of slow-down here is probably query overhead between the application and the database. One approach to reduce database chatter is batching queries and asking for things in groups, e.g. getting a list of locations and then fetching all of their related rows in a single query per table.

Do you have a way of integrating some of the tiny queries for things like addresses & phone numbers into joined columns on the locations table? More aggressively, could you move that information from separate tables into the locations table directly? That would remove a major source of query overhead and result in a DB schema that's a bit more streamlined.
@pui tells me that ActiveRecord’s eager loading feature is designed to address this specific issue, and generates sub-select queries in place of multiple followup queries:
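Presumably something along these lines, using includes; the association names here are guesses based on the query log in the next comment:

```ruby
# Hypothetical sketch: eager-load the associations used to build the JSON so each
# one is fetched with a single IN (...) query instead of per-record lookups.
locations = Location.search(keyword: 'health', location: '94403')
                    .page(1).per(30)
                    .includes(:organization, :address, :mail_address,
                              :contacts, :phones, :faxes, services: :categories)
```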
duh! I thought eager loading was already happening. Including all the tables speeds things up nicely, but it looks like there's a bug with grape-entity. It's fetching tables twice: once for just the first result, then again for all the results:

```
Location Load (17.4ms) SELECT "locations".*, ((ts_rank((to_tsvector('simple', coalesce("locations"."languages"::text, ''))), (to_tsquery('simple', ''' ' || 'french' || ' ''')), 0))) AS pg_search_rank FROM "locations" WHERE (((to_tsvector('simple', coalesce("locations"."languages"::text, ''))) @@ (to_tsquery('simple', ''' ' || 'french' || ' ''')))) ORDER BY pg_search_rank DESC, "locations"."id" ASC LIMIT 1 OFFSET 0
Organization Load (0.8ms) SELECT "organizations".* FROM "organizations" WHERE "organizations"."id" IN (2)
Address Load (0.9ms) SELECT "addresses".* FROM "addresses" WHERE "addresses"."location_id" IN (3)
MailAddress Load (0.9ms) SELECT "mail_addresses".* FROM "mail_addresses" WHERE "mail_addresses"."location_id" IN (3)
Contact Load (0.8ms) SELECT "contacts".* FROM "contacts" WHERE "contacts"."location_id" IN (3)
Phone Load (0.8ms) SELECT "phones".* FROM "phones" WHERE "phones"."location_id" IN (3)
Fax Load (1.2ms) SELECT "faxes".* FROM "faxes" WHERE "faxes"."location_id" IN (3)
Service Load (0.9ms) SELECT "services".* FROM "services" WHERE "services"."location_id" IN (3)
SQL (1.3ms) SELECT "categories".*, "t0"."service_id" AS ar_association_key_name FROM "categories" INNER JOIN "categories_services" "t0" ON "categories"."id" = "t0"."category_id" WHERE "t0"."service_id" IN (3)
Location Load (15.0ms) SELECT "locations".*, ((ts_rank((to_tsvector('simple', coalesce("locations"."languages"::text, ''))), (to_tsquery('simple', ''' ' || 'french' || ' ''')), 0))) AS pg_search_rank FROM "locations" WHERE (((to_tsvector('simple', coalesce("locations"."languages"::text, ''))) @@ (to_tsquery('simple', ''' ' || 'french' || ' ''')))) ORDER BY pg_search_rank DESC, "locations"."id" ASC LIMIT 30 OFFSET 0
Organization Load (0.9ms) SELECT "organizations".* FROM "organizations" WHERE "organizations"."id" IN (2, 17, 24, 30, 69, 89, 101, 108, 110, 129, 132, 150, 153, 175, 203, 206, 212, 219, 230, 312, 314, 338, 349, 351, 356, 363, 382, 406, 408, 412)
Address Load (0.7ms) SELECT "addresses".* FROM "addresses" WHERE "addresses"."location_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)
MailAddress Load (1.0ms) SELECT "mail_addresses".* FROM "mail_addresses" WHERE "mail_addresses"."location_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)
Contact Load (0.9ms) SELECT "contacts".* FROM "contacts" WHERE "contacts"."location_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)
Phone Load (1.0ms) SELECT "phones".* FROM "phones" WHERE "phones"."location_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)
Fax Load (0.7ms) SELECT "faxes".* FROM "faxes" WHERE "faxes"."location_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)
Service Load (4.9ms) SELECT "services".* FROM "services" WHERE "services"."location_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)
SQL (1.7ms) SELECT "categories".*, "t0"."service_id" AS ar_association_key_name FROM "categories" INNER JOIN "categories_services" "t0" ON "categories"."id" = "t0"."category_id" WHERE "t0"."service_id" IN (3, 18, 25, 31, 70, 90, 102, 109, 111, 130, 133, 151, 154, 176, 204, 207, 213, 220, 231, 313, 315, 339, 350, 352, 357, 364, 383, 407, 409, 413)
```

If, instead of using grape-entity, I just define the JSON myself by overriding the
Schwing!
Replace Elasticsearch with Postgres full-text search. Fixes #139
It's been working great. We've been using pg_search in production (Heroku) for several months without any issues.
@monfresh 👍 thanks for the feedback. I'd certainly like more benchmark info, but honestly, getting rid of the headaches we've had with ElasticSearch will likely be worth it even if this is a little slower.
I don't have benchmark comparisons for you, unfortunately, and your numbers will depend on how you configure your server, how much traffic you get, and what kind of caching you implement. What I can tell you is that our Heroku server gets between 20,000 and 50,000 requests per week (not all of them hit the DB), our database response time averages under 15ms, and the app response time has averaged between 40ms and 80ms according to New Relic.
@monfresh this is a great write-up. Did you make a post out of this thread?
@orlando I did not, but that's a good idea.
@monfresh This is exactly the information I was looking for and couldn't find in the pg_search docs or other blog posts. Well worth making a post if you have the time. Thank you for the thorough work and for having this discussion in a public thread.
Postgres also has full-text search capabilities (which we haven't researched yet), and we hope it will be able to provide the same functionality as Elasticsearch. If that's the case, it will allow us to remove another dependency.