Feature Request: The ability to "join" parent and children #761

Closed
merrellb opened this issue Mar 8, 2011 · 28 comments

@merrellb

merrellb commented Mar 8, 2011

There are many times I would like both the parent and the children of a record. Currently, finding the children of a query's results (even for a has_child query) requires an individual GET for each returned record.

  1. The simplest solution may be to enhance the has_child query, which already specifies parent and children types, allowing the actual children to be returned along with the parents.

  2. Enhance the query DSL to allow the children/parents of any query results to be joined and returned. Perhaps even allowing additional filtering.

  3. Add a join API call.

@till

till commented May 5, 2011

subscribe

@bryangreen

+1 this would be great

@mente

mente commented Aug 17, 2011

+1

@abh
Contributor

abh commented Jan 30, 2012

I ran into this, too. Nested documents are a bit too closely tied (specifically that you always get all the nested documents back and not just the matching one(s)) and with parent/child documents I can't get both the matching lower level and the upper level back, either – unless I am missing something.

@gjb83

gjb83 commented Feb 3, 2012

+1

@hlian

hlian commented Feb 9, 2012

Lucene 3.6 will support a join query: https://issues.apache.org/jira/browse/LUCENE-3602

@kevingessner

Lucene 3.6 added query-time joining: https://issues.apache.org/jira/browse/LUCENE-3602

What's the timeline for ES using Lucene 3.6, @kimchy?

@kimchy
Member

kimchy commented Feb 12, 2012

The join query is not really relevant here. Parent/child support is similar to the join aspect; it's a matter of returning a different data set than what is provided now. Note that there will never be a cross-shard join in Elasticsearch, so any join will happen within a shard, which is what the parent/child support does now.
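
To illustrate why a parent/child join can stay inside one shard: a child is routed by its parent's ID, so both documents land on the same shard. The sketch below is only an illustration. It uses CRC32 as a stand-in hash, which is not Elasticsearch's actual routing function, and `shard_for` and `NUM_SHARDS` are hypothetical names.

```python
import zlib

NUM_SHARDS = 5  # assumed shard count, for illustration only


def shard_for(routing: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a routing value to a shard number.

    CRC32 is a stand-in chosen because it is deterministic; Elasticsearch
    uses its own hash function internally.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_shards


# A parent is routed by its own ID; a child is routed by its parent's ID,
# so the two always resolve to the same shard.
parent_shard = shard_for("SanDiego")
child_shard = shard_for("SanDiego")  # child indexed with parent=SanDiego
print(parent_shard == child_shard)
```

With grandchildren, the routing value has to be the grandparent's ID for all three generations to share a shard, which is why the sample data further down in this thread passes an explicit routing parameter.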

@kevingessner

@kimchy Sure, makes sense. I don't actually need full join support -- I really need something more like #792 or #1017, to be able to query the parent's field from a search on the child type.

@dhardy92

[+1]

@Vineeth-Mohan

+1

@Vineeth-Mohan

+1

@nickhoffman

This would be incredibly useful.

@ghost

ghost commented Aug 8, 2012

Any update on this? Would love to have this rather than having to use separate requests to get the children.

@gjb83

gjb83 commented Sep 28, 2012

+1

@keir

keir commented Jun 10, 2013

+1 this would be great to have.

@mvallebr

+1

@isabel12

+1

@chaitanya24

+1

@vedharish

+1

@clintongormley
Contributor

So what would the response actually look like? Don't forget that parents and children are separate documents. Presumably you'd want children grouped with parents somehow? A parent may have millions of matching children - how many of those do we return?

The top_hits aggregation #6124 isn't a good solution for this as you would have to aggregate on parent_id, of which there may be millions in the resultset.

By far the most efficient way of doing this is in two queries:

  1. retrieve the top 10 parents matching the query
  2. use an msearch to find (eg) the top 10 children for each parent id

While this requires two steps, it gives you all the flexibility you need which would otherwise have to be provided by adding new structures to the query dsl and to the response.
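
A minimal sketch of step 2, assuming the parent IDs from step 1 are already in hand. It only builds the newline-delimited body for a multi-search request; `build_children_msearch` is a hypothetical helper, the `has_parent`/`ids` clause reflects the query DSL of the 1.x era as I understand it, and nothing here has been run against a cluster.

```python
import json


def build_children_msearch(parent_ids, parent_type, size=10):
    """Build a newline-delimited multi-search body: one search per parent ID.

    Each search asks for the top `size` children of a single parent.
    """
    lines = []
    for parent_id in parent_ids:
        # Empty header: index and type are taken from the request URL.
        header = {}
        body = {
            "size": size,
            "query": {
                "has_parent": {
                    "parent_type": parent_type,
                    "query": {"ids": {"values": [parent_id]}},
                }
            },
        }
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    # A multi-search body must end with a newline.
    return "\n".join(lines) + "\n"


# Placeholder IDs standing in for the result of step 1.
top_parent_ids = ["Balboa", "Glen"]
payload = build_children_msearch(top_parent_ids, "park")
print(payload)
```

The payload would be sent to the multi-search endpoint of the child type, and the responses come back in the same order as the parent IDs, so they can be zipped back onto the parents on the client side.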

Anybody want to flesh out this feature request a bit more?

@clintongormley
Contributor

No further feedback. Closing

@jason-mccloskey

Oh, no! This is the exact feature that will help complete my Elasticsearch implementation. Let me give a hypothetical use case for this feature that is analogous to what I need to do in my implementation. Please forgive any mistakes, as I am fairly new to Elasticsearch and brand new to commenting on issues in GitHub.

Use Case: I want to be able to populate a grid of events at parks in a given city, and allow filtering based upon whether the event is at a "safe" park.

Mappings

We want three types here in a grandparent/parent/child relation.

City

curl -XPUT 'http://localhost:9200/parkinfo/city/_mapping' -d '{ 
    "city" : {
        "_id" : { "path" : "cityName" },
        "properties" : {
            "cityName" : { "type" : "string" },
            "state" : { "type" : "string" }
        }
    }
}'

Park

curl -XPUT 'http://localhost:9200/parkinfo/park/_mapping' -d '{ 
    "park" : {
        "_parent":{
            "type" :  "city"
        },
        "_id" : { "path" : "parkName" },
        "properties" : {
            "parkName" : { "type" : "string" },
            "address" : { "type" : "string" }
        }
    }
}'

Park Event

curl -XPUT 'http://localhost:9200/parkinfo/park_event/_mapping' -d '{   
    "park_event" : {
        "_parent":{
            "type" :  "park"
        },
        "properties" : {
            "eventName" : { "type" : "string" },
            "eventType" : { "type" : "string" },
            "time" : { "type" : "date" }
        }
    }
}'

Data

Let's now consider the data that we'd like to put in this index:

Cities

curl -XPUT 'http://localhost:9200/parkinfo/city/SanDiego?routing=SanDiego' -d '{
        "cityName" : "SanDiego",
        "state" : "California"
}'
curl -XPUT 'http://localhost:9200/parkinfo/city/LosAngeles?routing=LosAngeles' -d '{
        "cityName" : "LosAngeles",
        "state" : "California"
}'

Parks in San Diego

curl -XPUT 'http://localhost:9200/parkinfo/park/Balboa?parent=SanDiego&routing=SanDiego' -d '{
        "parkName" : "Balboa",
        "address" : "1549 El Prado"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park/Glen?parent=SanDiego&routing=SanDiego' -d '{
        "parkName" : "Glen",
        "address" : "2149 Orinda Dr"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park/KateSessions?parent=SanDiego&routing=SanDiego' -d '{
        "parkName" : "KateSessions",
        "address" : "5115 Soledad Rd"
}'

Parks in Los Angeles

curl -XPUT 'http://localhost:9200/parkinfo/park/48thSt?parent=LosAngeles&routing=LosAngeles' -d '{
        "parkName" : "48thSt",
        "address" : "4800 South Hoover"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park/Alma?parent=LosAngeles&routing=LosAngeles' -d '{
        "parkName" : "Alma",
        "address" : "21st and Meyler"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park/Canal?parent=LosAngeles&routing=LosAngeles' -d '{
        "parkName" : "Canal",
        "address" : "200 Linnie Canal and Venice"
}'

Events in Parks in San Diego

curl -XPUT 'http://localhost:9200/parkinfo/park_event/1?parent=Balboa&routing=SanDiego' -d '{
        "eventName" : "Scary Stuff",
        "eventType" : "crime",
        "time" : "2014-08-15T22:58:00"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park_event/2?parent=Balboa&routing=SanDiego' -d '{
        "eventName" : "Bocce Ball Summer 2014",
        "eventType" : "tournament",
        "time" : "2014-08-25T12:00:00"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park_event/3?parent=Glen&routing=SanDiego' -d '{
        "eventName" : "Basketball Summer 2014",
        "eventType" : "tournament",
        "time" : "2014-08-23T12:00:00"
}'

Events in Parks in Los Angeles

curl -XPUT 'http://localhost:9200/parkinfo/park_event/4?parent=48thSt&routing=LosAngeles' -d '{
        "eventName" : "More Scary Stuff",
        "eventType" : "crime",
        "time" : "2014-08-15T22:58:00"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park_event/5?parent=Alma&routing=LosAngeles' -d '{
        "eventName" : "Really Scary Stuff",
        "eventType" : "crime",
        "time" : "2014-06-25T23:14:00"
}'
curl -XPUT 'http://localhost:9200/parkinfo/park_event/6?parent=Canal&routing=LosAngeles' -d '{
        "eventName" : "Weight Lifting Summer 2014",
        "eventType" : "tournament",
        "time" : "2014-08-23T12:00:00"
}'

Filtering Stories/Requirements

As a user I want to be able to display only events that will occur in the next X days in a grid
The grid shall have the columns: city, state, park name, address, event name, event type, time
As a user I want to be able to filter for events at safe parks
A park is considered safe if it has had no crime event in the past 4 weeks and there have not been crimes at 2 or more parks in its city in the past 3 months
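
To pin down the safe-park rule, here is a small in-memory sketch over the sample data above. It is not an Elasticsearch query, just the rule written out in Python. The cut-off dates are the ones used in the filter later in this comment, and a "now" of 2014-08-20 is an assumption inferred from them; `safe_parks` is a hypothetical name.

```python
from datetime import datetime

# Park -> city, mirroring the parent links in the sample data.
PARK_CITY = {
    "Balboa": "SanDiego", "Glen": "SanDiego", "KateSessions": "SanDiego",
    "48thSt": "LosAngeles", "Alma": "LosAngeles", "Canal": "LosAngeles",
}

# (park, time) for each event of type "crime" in the sample data.
CRIMES = [
    ("Balboa", datetime(2014, 8, 15, 22, 58)),
    ("48thSt", datetime(2014, 8, 15, 22, 58)),
    ("Alma", datetime(2014, 6, 25, 23, 14)),
]


def safe_parks(now, park_since, city_since):
    """Return the sorted names of parks that satisfy both safety rules."""
    # Rule 1: a park with a crime inside the park window is unsafe.
    parks_with_recent_crime = {p for p, t in CRIMES if park_since <= t <= now}

    # Rule 2: a city is unsafe when 2 or more distinct parks had a crime
    # inside the city window; all of its parks are then unsafe.
    crime_parks_by_city = {}
    for park, t in CRIMES:
        if city_since <= t <= now:
            crime_parks_by_city.setdefault(PARK_CITY[park], set()).add(park)
    unsafe_cities = {c for c, parks in crime_parks_by_city.items() if len(parks) >= 2}

    return sorted(
        park for park, city in PARK_CITY.items()
        if park not in parks_with_recent_crime and city not in unsafe_cities
    )


now = datetime(2014, 8, 20)  # assumed value of "now"
print(safe_parks(now, datetime(2014, 7, 23), datetime(2014, 5, 20)))
# prints ['Glen', 'KateSessions']
```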

Filtering Implementation

For these 2 requirements, we need a filter that keeps only safe parks, and then a query that displays events in the next X days and joins together the data from the 3 generations.

Safe Park Filter

This filter must do two things: exclude a park if its parent city is considered unsafe, and exclude a park if the park itself is unsafe. The preference would be to do this with a single query. Currently I would expect to have to query the cities and save the terms, then use a terms lookup filter.

I see return options as being: none, all, matching, #. None would be the default for all children.

"filter" : {
    "bool" : {
        "type" : "park",
        "must" : {
            "bool" : {
                "type" : "city",
                "return" : "none",
                "must_not" : {
                    "has_child" : {
                        "type" : "park",
                        "min_children" : 2,
                        "return" : "all",
                        "filter" : {
                            "has_child" : {
                                "type" : "park_event",
                                "filter" : {
                                    "bool" : {
                                        "must" : [
                                            { "term" : { "eventType" : "crime" } },
                                            { "range" : { "time" : { "gte" : "2014-05-20", "lte" : "now" } } }
                                        ]
                                    }
                                }
                            }
                        }
                    }
                }
            }
        },
        "must_not" : {
            "has_child" : {
                "type" : "park_event",
                "filter" : {
                    "bool" : {
                        "must" : [
                            { "term" : { "eventType" : "crime" } },
                            { "range" : { "time" : { "gte" : "2014-07-23", "lte" : "now" } } }
                        ]
                    }
                }
            }
        }
    }
}

I don't believe that you can list a "type" in a bool filter, but I felt it made things much more clear by including it. It also may be required for that sort of functionality.
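
For comparison, the two-step workaround mentioned before the proposed filter (query the cities first, then filter the parks) might look roughly like the sketch below. It only builds request bodies as Python dicts. The clause names (`has_child` with `min_children`, `has_parent` with `parent_type`, `ids`) are my understanding of the existing query DSL, an `ids` clause stands in for the terms lookup, and the helper names are hypothetical. None of this has been verified against a running cluster.

```python
def crime_since(since):
    """Query clause matching crime events from `since` until now."""
    return {
        "bool": {
            "must": [
                {"term": {"eventType": "crime"}},
                {"range": {"time": {"gte": since, "lte": "now"}}},
            ]
        }
    }


def unsafe_cities_query(since):
    """Step 1: cities with at least 2 parks that each had a crime since `since`."""
    park_with_crime = {"has_child": {"type": "park_event", "query": crime_since(since)}}
    return {
        "query": {
            "has_child": {"type": "park", "min_children": 2, "query": park_with_crime}
        }
    }


def safe_parks_query(unsafe_city_ids, since):
    """Step 2: parks with no recent crime whose city is not in the unsafe list."""
    must_not = [{"has_child": {"type": "park_event", "query": crime_since(since)}}]
    if unsafe_city_ids:
        must_not.append({
            "has_parent": {
                "parent_type": "city",
                "query": {"ids": {"values": list(unsafe_city_ids)}},
            }
        })
    return {"query": {"bool": {"must": [{"match_all": {}}], "must_not": must_not}}}


step1 = unsafe_cities_query("2014-05-20")
# The IDs returned by step 1 would be fed into step 2; this value is a placeholder.
step2 = safe_parks_query(["LosAngeles"], "2014-07-23")
```

This costs one extra round trip and a client-side hand-off of city IDs, which is exactly the overhead the single-query proposal is trying to avoid.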

Returning the data over the next 5 days

As this is to be put into a grid, we would want the data to be denormalized, and perhaps sortable so that from and size can be used for paging. If denormalized is false, the return would be array based, for cases where people aren't trying to display the results in a grid.

{
    "denormalized" : "true",
    "filtered" : {
        "type" : "park",
        "return" : "matching",
        "query" : {
            "has_parent" : {
                "return" : "matching"
            },
            "has_child" : {
                "return" : "matching",
                "range" : {
                    "time" : {
                        "gte" : "now",
                        "lte" : "2014-08-25"
                    }
                }
            }
        },
        "filter" : "SafeParkFilter"
    }
}
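
One possible reading of the two return shapes, sketched client-side in Python. This is an assumption about what "denormalized" and "array based" would mean, not a proposed response format; `denormalize` and `nest` are hypothetical helpers.

```python
def denormalize(city, park, events):
    """Flatten one city, one of its parks, and that park's events into grid rows."""
    return [
        {
            "cityName": city["cityName"],
            "state": city["state"],
            "parkName": park["parkName"],
            "address": park["address"],
            "eventName": event["eventName"],
            "eventType": event["eventType"],
            "time": event["time"],
        }
        for event in events
    ]


def nest(city, park, events):
    """Array-based alternative: children kept as lists under their parent."""
    return {**city, "parks": [{**park, "events": list(events)}]}


city = {"cityName": "SanDiego", "state": "California"}
park = {"parkName": "Glen", "address": "2149 Orinda Dr"}
events = [{
    "eventName": "Basketball Summer 2014",
    "eventType": "tournament",
    "time": "2014-08-23T12:00:00",
}]

rows = denormalize(city, park, events)   # one flat row per event
tree = nest(city, park, events)          # one nested object per city
```

The flat form repeats the city and park fields on every row, which is what makes it sortable and pageable as a grid.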

Expected Results

Safe Parks

No parks in Los Angeles should be considered safe because there were multiple parks in LA with crime events in the past 3 months. Balboa Park should also be considered unsafe because of the crime event in the past 4 weeks. This leaves the safe parks as:
Glen
KateSessions

Events at Safe Parks

Given that only Glen Park and Kate Sessions Park are safe parks in this scenario, we should only return events from those parks that will be held in the next 5 days.

| City | State | Park Name | Address | Event Name | Event Type | Time |
| --- | --- | --- | --- | --- | --- | --- |
| SanDiego | California | Glen | 2149 Orinda Dr | Basketball Summer 2014 | tournament | 2014-08-23T12:00:00 |
| SanDiego | California | KateSessions | 5115 Soledad Rd | Bocce Ball Summer 2014 | tournament | 2014-08-25T12:00:00 |

Please let me know if any of this is unclear or doesn't make sense. This is also likely more than the original request, but this feature set would be very powerful and is the gap between what I have currently implemented on my project and the toolset I need to finish.

@clintongormley
Contributor

Hi @ILMN-jmccloskey

Thanks for the detailed example. It feels very much like you are trying to use Elasticsearch as a relational DB, which isn't the best way to use it. I would definitely avoid using grandparent-parent-child relationships as it is very costly, both with joins and the data required to maintain the relationship.

Think about how many times a crime is committed, then how many times your query will run. A much better approach would be to denormalize your data and to update it when you have new events. You want your results to be parks, so you should store all the info you need inside the single park document, including the crimes inside that park and the number of crimes for a particular time period in the city where the park is located.
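
As a rough sketch of what such a denormalized park document could look like when built on the indexing side. The field names `crimes` and `cityCrimeCount` are hypothetical, as is the helper `build_park_doc`; only the park and city fields come from the mappings earlier in the thread.

```python
def build_park_doc(park, city, park_crimes, city_crime_count):
    """Build one self-contained park document for indexing."""
    return {
        "parkName": park["parkName"],
        "address": park["address"],
        # City fields are copied onto the park so no join is needed at query time.
        "cityName": city["cityName"],
        "state": city["state"],
        # Crimes that happened inside this park (hypothetical field).
        "crimes": list(park_crimes),
        # Precomputed number of crimes in the city for the chosen period
        # (hypothetical field); must be refreshed when a new crime is recorded.
        "cityCrimeCount": city_crime_count,
    }


doc = build_park_doc(
    {"parkName": "Balboa", "address": "1549 El Prado"},
    {"cityName": "SanDiego", "state": "California"},
    [{"eventName": "Scary Stuff", "time": "2014-08-15T22:58:00"}],
    city_crime_count=1,
)
```

The trade-off is that every new crime event means rewriting the affected park documents, in exchange for a single cheap query at read time.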

I suggest reading about the various techniques and tradeoffs here: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/modeling-your-data.html

@jason-mccloskey

Hi @clintongormley

Thanks for responding. I fully agree with you that the example, as given, doesn't lend itself to normalization. I was trying to be brief (ha!) in the data for the example. Imagine that the city, park and park_event all have anywhere from 10 to 50 fields, which should be able to be updated independently from each other, and that you are creating many events per month. This isn't the actual index I am trying to create, only an example for illustrative purposes.

I also am not sure how to fill the requirements of a safe park (2 crimes committed at parks in a particular city within the last 3 months) without the parent/child relationship. It seems to me that this is why the parent/child relationship and min_children were created. Assuming that the example was altered to lend itself to normalization using elasticsearch, are there things that need to be added to demonstrate the value of returning data across documents or perhaps clear up implementation details?

Thank you for the link. I will read further to make sure my actual mapping strategy is appropriate for my implementation. Even outside of the number of fields in a type and the ability to update those types independently, it seems to me that I won't be able to look for parks that have some events and not others, or cities that have some events and not others, without a parent/child relationship (updated: I could if I used multiple queries). Do you have any suggestions in that regard?

@asanderson

FWIW, we aggregate data into Elasticsearch from many different disparate sources including unstructured, semi-structured (e.g. XML, RESTful services, etc.), and structured (e.g. relational database records), so our basic schema includes master parent documents (e.g. entities, relationships, etc.), and each of them can have many detail child documents each of which can have dozens and dozens of fields.

We do not want to update the master documents, since most of our data ingest pattern is just adding additional details. The performance is more than acceptable.

Yes, everyone says not to use Elasticsearch (or Solr) as a relational database replacement, but for data that is primarily write-once/read-many, it is more than an adequate solution as we've proved with Solr and now Elasticsearch.

However, without a simple parent/child join capability baked into Elasticsearch, every Elasticsearch client must do it the hard way and pull unnecessary data across the network.

Just my $0.02.

@kunklejr

kunklejr commented Sep 5, 2014

My situation is similar to @asanderson's. I have a parent document that has one or more child documents containing all the data. They generally don't change but are added all the time. It would be incredibly valuable to search the child documents and get results back in terms of the parent document AND also return the data contained in the children along with the parent.
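
Until something like this exists server-side, the grouping has to happen in the client. A minimal sketch, assuming each child hit has already been reduced to a dict with a `parent_id` and a `source`; that shape and the name `group_children_by_parent` are hypothetical and not the raw search response format.

```python
from collections import defaultdict


def group_children_by_parent(child_hits):
    """Group child documents under the ID of their parent."""
    grouped = defaultdict(list)
    for hit in child_hits:
        grouped[hit["parent_id"]].append(hit["source"])
    return dict(grouped)


child_hits = [
    {"parent_id": "Balboa", "source": {"eventName": "Scary Stuff"}},
    {"parent_id": "Glen", "source": {"eventName": "Basketball Summer 2014"}},
    {"parent_id": "Balboa", "source": {"eventName": "Bocce Ball Summer 2014"}},
]

by_parent = group_children_by_parent(child_hits)
print(sorted(by_parent))
# prints ['Balboa', 'Glen']
```

The parent documents themselves would still need a second request keyed on the collected parent IDs.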

@clintongormley
Contributor

Closing in favour of #8153
