Feature Request: The ability to "join" parent and children #761
Comments
subscribe |
+1 this would be great |
+1 |
I ran into this, too. Nested documents are a bit too closely tied (specifically that you always get all the nested documents back and not just the matching one(s)) and with parent/child documents I can't get both the matching lower level and the upper level back, either – unless I am missing something. |
+1 |
Lucene 3.6 will support a join query: https://issues.apache.org/jira/browse/LUCENE-3602 |
Lucene 3.6 added query-time joining: https://issues.apache.org/jira/browse/LUCENE-3602 What's the timeline for ES using Lucene 3.6, @kimchy? |
The join query is not really relevant here. Parent-child support is similar to the join aspect; it's a matter of returning a different data set than what is provided now. Note that there will never be a cross-shard join in Elasticsearch, so any join will happen within a shard, which the parent-child support does now. |
[+1] |
+1 |
+1 |
This would be incredibly useful. |
Any update on this? Would love to have this rather than having to use separate requests to get the children. |
+1 |
+1 this would be great to have. |
+1 |
+1 |
+1 |
+1 |
So what would the response actually look like? Don't forget that parents and children are separate documents. Presumably you'd want children grouped with parents somehow? A parent may have millions of matching children - how many of those do we return? The top_hits aggregation #6124 isn't a good solution for this as you would have to aggregate on parent_id, of which there may be millions in the resultset. By far the most efficient way of doing this is in two queries:
While this requires two steps, it gives you all the flexibility you need which would otherwise have to be provided by adding new structures to the query dsl and to the response. Anybody want to flesh out this feature request a bit more? |
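The two queries themselves did not survive in the comment above. One plausible reading, sketched here in Python as plain request bodies, is to first find matching parents with a `has_child` query and then fetch a bounded number of children for those parent IDs with a `has_parent` query wrapping an `ids` query. The type and field names are borrowed from the park example later in this thread; the helper names and the `size` cap are hypothetical, and this is an illustration rather than the approach the maintainers necessarily had in mind.

```python
import json


def parents_query(child_type, child_query):
    """Step 1: search body matching parents with at least one matching child."""
    return {"query": {"has_child": {"type": child_type, "query": child_query}}}


def children_query(parent_type, parent_ids, child_query, size=10):
    """Step 2: search body for matching children of the given parents.

    `size` caps how many children come back, which is one way to handle
    the "millions of matching children" concern.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    child_query,
                    {
                        "has_parent": {
                            "parent_type": parent_type,
                            "query": {"ids": {"values": list(parent_ids)}},
                        }
                    },
                ]
            }
        },
    }


crime = {"term": {"eventType": "crime"}}
step1 = parents_query("park_event", crime)
# Parent IDs would come from the hits of step 1; hard-coded here for illustration.
step2 = children_query("park", ["Balboa", "Glen"], crime, size=5)
print(json.dumps(step2, indent=2))
```

Both bodies would be sent to the `_search` endpoint; the client stitches the two result sets together itself.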
No further feedback. Closing |
Oh, no! This is the exact feature that will help complete my Elasticsearch implementation. Let me give a hypothetical use case for this feature that is analogous to what I need to do in my implementation. Please forgive any mistakes, as I am fairly new to Elasticsearch and brand new to commenting on issues in GitHub.

Use Case: I want to be able to populate a grid of events at parks in a given city, and allow filtering based upon whether the event is at a "safe" park.

Mappings

We want three types here in a grandparent/parent/child relation.

City

curl -XPUT 'http://localhost:9200/parkinfo/city/_mapping' -d '{
  "city" : {
    "_id" : { "path" : "cityName" },
    "properties" : {
      "cityName" : { "type" : "string" },
      "state" : { "type" : "string" }
    }
  }
}'

Park

curl -XPUT 'http://localhost:9200/parkinfo/park/_mapping' -d '{
  "park" : {
    "_parent" : {
      "type" : "city"
    },
    "_id" : { "path" : "parkName" },
    "properties" : {
      "parkName" : { "type" : "string" },
      "address" : { "type" : "string" }
    }
  }
}'

Park Event

curl -XPUT 'http://localhost:9200/parkinfo/park_event/_mapping' -d '{
  "park_event" : {
    "_parent" : {
      "type" : "park"
    },
    "properties" : {
      "eventName" : { "type" : "string" },
      "eventType" : { "type" : "string" },
      "time" : { "type" : "date" }
    }
  }
}'

Data

Let's now consider the data that we'd like to put in this index:

Cities

curl -XPUT 'http://localhost:9200/parkinfo/city/SanDiego?routing=SanDiego' -d '{
  "cityName" : "SanDiego",
  "state" : "California"
}'
curl -XPUT 'http://localhost:9200/parkinfo/city/LosAngeles?routing=LosAngeles' -d '{
  "cityName" : "LosAngeles",
  "state" : "California"
}'

Parks in San Diego

curl -XPUT 'http://localhost:9200/parkinfo/park/Balboa?parent=SanDiego&routing=SanDiego' -d '{
  "parkName" : "Balboa",
  "address" : "1549 El Prado"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park/Glen?parent=SanDiego&routing=SanDiego' -d '{
  "parkName" : "Glen",
  "address" : "2149 Orinda Dr"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park/KateSessions?parent=SanDiego&routing=SanDiego' -d '{
  "parkName" : "KateSessions",
  "address" : "5115 Soledad Rd"
}'

Parks in Los Angeles

curl -XPUT 'http://localhost:9200/parkinfo/park/48thSt?parent=LosAngeles&routing=LosAngeles' -d '{
  "parkName" : "48thSt",
  "address" : "4800 South Hoover"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park/Alma?parent=LosAngeles&routing=LosAngeles' -d '{
  "parkName" : "Alma",
  "address" : "21st and Meyler"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park/Canal?parent=LosAngeles&routing=LosAngeles' -d '{
  "parkName" : "Canal",
  "address" : "200 Linnie Canal and Venice"
}'

Events in Parks in San Diego

curl -XPUT 'http://localhost:9200/parkinfo/park_event/1?parent=Balboa&routing=SanDiego' -d '{
  "eventName" : "Scary Stuff",
  "eventType" : "crime",
  "time" : "2014-08-15T22:58:00"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park_event/2?parent=Balboa&routing=SanDiego' -d '{
  "eventName" : "Bocce Ball Summer 2014",
  "eventType" : "tournament",
  "time" : "2014-08-25T12:00:00"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park_event/3?parent=Glen&routing=SanDiego' -d '{
  "eventName" : "Basketball Summer 2014",
  "eventType" : "tournament",
  "time" : "2014-08-23T12:00:00"
}'

Events in Parks in Los Angeles

curl -XPUT 'http://localhost:9200/parkinfo/park_event/4?parent=48thSt&routing=LosAngeles' -d '{
  "eventName" : "More Scary Stuff",
  "eventType" : "crime",
  "time" : "2014-08-15T22:58:00"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park_event/5?parent=Alma&routing=LosAngeles' -d '{
  "eventName" : "Really Scary Stuff",
  "eventType" : "crime",
  "time" : "2014-06-25T23:14:00"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park_event/6?parent=Canal&routing=LosAngeles' -d '{
  "eventName" : "Weight Lifting Summer 2014",
  "eventType" : "tournament",
  "time" : "2014-08-23T12:00:00"
}'

Filtering Stories/Requirements

As a user I want to be able to display only events that will occur in the next X days in a grid

Filtering Implementation

For these 2 requirements, we need a filter to keep only safe parks and then a query to display events in the next X days and join together the data from the 3 generations.

Safe Park Filter

This filter must do two things: it must exclude a park if its parent city is considered unsafe, and it must exclude a park if that particular park is intrinsically unsafe. The preference would be to be able to do this with a single query. Currently I would expect to have to query cities and save the terms, then use a terms lookup filter. I see the return options as being: none, all, matching, #. None would be the default for all children.

"filter" : {
  "bool" : {
    "type" : "park",
    "must" : {
      "bool" : {
        "type" : "city",
        "return" : "none",
        "must_not" : {
          "has_child" : {
            "type" : "park",
            "min_children": 2,
            "return" : "all",
            "filter" : {
              "must" : {
                "has_child" : {
                  "type" : "park_event",
                  "filter" : {
                    "bool" : {
                      "must" : {
                        "term" : {
                          "eventType" : "crime"
                        },
                        "range" : {
                          "time" : {
                            "gte" : "2014-05-20",
                            "lte" : "now"
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    },
    "must_not" : {
      "has_child" : {
        "type" : "park_event",
        "filter" : {
          "bool" : {
            "must" : {
              "term" : {
                "eventType" : "crime"
              },
              "range" : {
                "time" : {
                  "gte" : "2014-07-23",
                  "lte" : "now"
                }
              }
            }
          }
        }
      }
    }
  }
}

I don't believe that you can list a "type" in a bool filter, but I felt that including it made things much clearer. It may also be required for that sort of functionality.

Returning the data over the next 5 days

As this is to be put into a grid, we would want the data to be denormalized, and perhaps sortable, for use with from and size. If denormalized is false, it would be an array-based return, for cases where people aren't trying to display in a grid.

{
  "denormalized" : "true",
  "filtered": {
    "type" : "park",
    "return" : "matching",
    "query": {
      "has_parent" : {
        "return" : "matching"
      },
      "has_child" : {
        "return" : "matching",
        "range" : {
          "time" : {
            "gte" : "now",
            "lte" : "2014-08-25"
          }
        }
      }
    },
    "filter": "SafeParkFilter"
  }
}

Expected Results

Safe Parks

No parks in Los Angeles should be considered safe because there were multiple parks in LA with crime events in the past 3 months. Balboa Park should also be considered unsafe because of the crime event in the past 4 weeks. This leaves the safe parks as:

Events at Safe Parks

Given that only Glen Park and Kate Sessions Park are safe parks in this scenario, we should only be returning events from those parks which will be held in the next 5 days.
Please let me know if any of this is unclear or doesn't make sense. This is also likely more than the original request, but this feature set would be very powerful and is the gap between what I have currently implemented on my project and the toolset I need to finish. |
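To make the expected results concrete, the two rules can be checked client-side against the sample data from the comment above. This sketch assumes "now" is 2014-08-20 (the thread never states it; the date is chosen so that a 5-day window ends on 2014-08-25, matching the proposed query) and reuses the two lower bounds from the proposed filter. It is a plain Python illustration of the intended semantics, not an Elasticsearch query.

```python
from datetime import datetime

# Sample data copied from the curl commands in the comment above.
PARK_CITY = {
    "Balboa": "SanDiego", "Glen": "SanDiego", "KateSessions": "SanDiego",
    "48thSt": "LosAngeles", "Alma": "LosAngeles", "Canal": "LosAngeles",
}
EVENTS = [  # (park, eventName, eventType, time)
    ("Balboa", "Scary Stuff", "crime", "2014-08-15T22:58:00"),
    ("Balboa", "Bocce Ball Summer 2014", "tournament", "2014-08-25T12:00:00"),
    ("Glen", "Basketball Summer 2014", "tournament", "2014-08-23T12:00:00"),
    ("48thSt", "More Scary Stuff", "crime", "2014-08-15T22:58:00"),
    ("Alma", "Really Scary Stuff", "crime", "2014-06-25T23:14:00"),
    ("Canal", "Weight Lifting Summer 2014", "tournament", "2014-08-23T12:00:00"),
]

NOW = datetime(2014, 8, 20)          # assumption: not stated in the thread
CITY_WINDOW = datetime(2014, 5, 20)  # lower bound of the city-level rule
PARK_WINDOW = datetime(2014, 7, 23)  # lower bound of the park-level rule
UNTIL = datetime(2014, 8, 26)        # assumption: all of 2014-08-25 counts


def parks_with_crime(since):
    """Parks with at least one crime event between `since` and NOW."""
    return {
        park for park, _, kind, time in EVENTS
        if kind == "crime" and since <= datetime.fromisoformat(time) <= NOW
    }


def safe_parks():
    """Parks in safe cities that had no recent crime of their own."""
    crime_parks = parks_with_crime(CITY_WINDOW)
    # A city is unsafe when 2 or more of its parks had crime (min_children: 2).
    unsafe_cities = {
        city for city in set(PARK_CITY.values())
        if sum(1 for p in crime_parks if PARK_CITY[p] == city) >= 2
    }
    recent_crime = parks_with_crime(PARK_WINDOW)
    return sorted(
        park for park, city in PARK_CITY.items()
        if city not in unsafe_cities and park not in recent_crime
    )


def upcoming_events(parks):
    """Names of events at the given parks between NOW and UNTIL."""
    return [
        name for park, name, _, time in EVENTS
        if park in parks and NOW <= datetime.fromisoformat(time) < UNTIL
    ]


safe = safe_parks()
print(safe)                   # ['Glen', 'KateSessions']
print(upcoming_events(safe))  # ['Basketball Summer 2014']
```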
Hi @ILMN-jmccloskey Thanks for the detailed example. It feels very much like you are trying to use Elasticsearch as a relational DB, which isn't the best way to use it. I would definitely avoid using grandparent-parent-child relationships as it is very costly, both with joins and the data required to maintain the relationship. Think about how many times a crime is committed, then how many times your query will run. A much better approach would be to denormalize your data and to update it when you have new events. You want your results to be parks, so you should store all the info you need inside the single park document, including the crimes inside that park and the number of crimes for a particular time period in the city where the park is located. I suggest reading about the various techniques and tradeoffs here: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/modeling-your-data.html |
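As a rough illustration of the denormalized layout suggested above, each park document could embed its city, its own events, and a precomputed crime count for the city. The field names `city`, `events`, and `cityCrimeCount` and the helper `build_park_doc` are hypothetical; the thread does not specify a schema.

```python
def build_park_doc(park, city, events, city_crime_count):
    """Return one self-contained park document.

    `events` holds this park's events only. `city_crime_count` is computed
    by the indexing process and has to be refreshed on every park in the
    city whenever a new crime event arrives.
    """
    return {
        "parkName": park["parkName"],
        "address": park["address"],
        "city": {"cityName": city["cityName"], "state": city["state"]},
        "events": list(events),
        "cityCrimeCount": city_crime_count,
    }


doc = build_park_doc(
    {"parkName": "Balboa", "address": "1549 El Prado"},
    {"cityName": "SanDiego", "state": "California"},
    [{"eventName": "Scary Stuff", "eventType": "crime",
      "time": "2014-08-15T22:58:00"}],
    city_crime_count=1,
)
print(doc["city"]["cityName"], len(doc["events"]))
```

The cost moves from query time to index time: a single search returns everything, but each new event means rewriting at least one park document.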
Thanks for responding. I fully agree with you that the example, as given, doesn't lend itself to normalization. I was trying to be brief (ha!) in the data for the example. Imagine that the city, park and park_event all have anywhere from 10 to 50 fields, which should be able to be updated independently from each other, and you are creating many events per month. This isn't the actual index I am trying to create, only an example for illustrative purposes. I also am not sure how to fill the requirements of a safe park (2 crimes committed at parks in a particular city within the last 3 months) without the parent/child relationship. It seems to me that this is why the parent/child relationship and min_children were created. Assuming that the example was altered to lend itself to normalization using Elasticsearch, are there things that need to be added to demonstrate the value of returning data across documents or perhaps clear up implementation details? Thank you for the link. I will read further to make sure my actual mapping strategy is appropriate for my implementation. Even outside of the number of fields in a type and the ability to update those types independently, it seems to me that I won't be able to look for parks that have some events and not others, or cities that have some events and not others, without a parent/child relationship (**updated: I could if I used multiple queries). Do you have any suggestions in that regard? |
FWIW, we aggregate data into Elasticsearch from many disparate sources, including unstructured, semi-structured (e.g. XML, RESTful services, etc.), and structured (e.g. relational database records), so our basic schema includes master parent documents (e.g. entities, relationships, etc.), and each of them can have many detail child documents, each of which can have dozens and dozens of fields. We do not want to update the master documents, since most of our data ingest pattern is just adding additional details. The performance is more than acceptable. Yes, everyone says not to use Elasticsearch (or Solr) as a relational database replacement, but for data that is primarily write-once/read-many, it is more than an adequate solution, as we've proved with Solr and now Elasticsearch. However, without a simple parent/child join capability baked into Elasticsearch, every Elasticsearch client must do it the hard way and pull unnecessary data across the network. Just my $0.02. |
My situation is similar to @asanderson's. I have a parent document that has one or more child documents containing all the data. They generally don't change but are added all the time. It would be incredibly valuable to search the child documents and get results back in terms of the parent document AND also return the data contained in the children along with the parent. |
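Until something like this exists server-side, the grouping described above has to happen in the client. A minimal sketch of that step follows, assuming the parent and child documents have already been fetched; the input shapes (a dict of parent sources and a list of `(parent_id, child_source)` pairs) and the function name `attach_children` are simplifications chosen for illustration, not the shape of an actual Elasticsearch response.

```python
from collections import defaultdict


def attach_children(parents, children):
    """Nest each child under its parent.

    `parents` maps parent ID -> parent source dict.
    `children` is an iterable of (parent_id, child_source) pairs.
    Children whose parent is not in `parents` are dropped.
    """
    grouped = defaultdict(list)
    for parent_id, child in children:
        grouped[parent_id].append(child)
    return [
        {"id": pid, "parent": source, "children": grouped.get(pid, [])}
        for pid, source in parents.items()
    ]


parks = {"Glen": {"parkName": "Glen"}, "KateSessions": {"parkName": "KateSessions"}}
events = [("Glen", {"eventName": "Basketball Summer 2014"})]
joined = attach_children(parks, events)
print(joined)
```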
Closing in favour of #8153 |
There are many times I would like both the parent and children of a record. Currently to find the children of a query (even a has_child query) requires an individual GET for each returned record.
The simplest solution may be to enhance the has_child query, which already specifies parent and children types, allowing the actual children to be returned along with the parents.
Enhance the query DSL to allow the children/parents of any query results to be joined and returned. Perhaps even allowing additional filtering.
Add a join API call.