Feature Request: pre-select terms in TermVector request #3924
Comments
Thanks for the quick response. So big shoutout to @bleskes, too: Thank you! No, in my case I am actually after the precise offsets, but typically only for a few terms (say 2-5). To describe this in general terms: I am doing distance-based scoring for things in the text. Can I send you a prototype privately? |
If by privately you mean that you want to push a branch to your repo without a pull request and then discuss it here - sure! Just link the commit or branch here. |
Thanks, for taking the time. I have tried to do that in a low-brow way and the result is at Now, I have two problems so far:
Any suggestions? |
Thanks for the commit! I assumed you wanted to discuss e7c1e9e, since 5ed795c does some percolator things, right? Let me know if I am mistaken. About the speed-up: I could imagine the speedup is not as great as expected because all term info is loaded from disk no matter how many terms you actually return. Disk IO probably influences the performance most. I still think the change is useful for large documents with many terms when only a few are requested. About the EOF: We should decide whether a requested term that is not in the doc should be returned with frequency 0 or not returned at all. I prepared a commit fixing the EOF for the former and added a comment for the latter, just so that you know where the changes have to be made: 27bc8fd What do you think: Return the term with frequency 0, or not return it at all if it is not in the document? |
Thanks for the corrections! Sorry about the confusion with the commit link. I must have gotten something wrong with linking into user-specific commits. I am also a bit confused about how the 5ed795c you link to turned out to be there. I definitely didn't intend to touch any percolator stuff. Is it a coincidence that the first 7 characters match? Anyhow, you seem to have gotten the right commit, and in my use case both solutions work equally well. Though one could argue that, given that disk IO will dominate the runtime of this request anyway, returning a 0 would simply make the result a bit more self-explanatory. Thanks again. |
Would you like to make a pull request for your changes? |
Sorry for the delay. Why don't you go ahead, since you have the latest version? 2013/10/23 Britta Weber [email protected]
Max J. Hoffmann |
ok, I'll do that. |
…req 0

If specific terms are requested, they should be returned with freq 0 if they do not appear in the document. To do this efficiently, we compute a sorted list of terms and then intersect this list with the terms in the documents when writing the terms.

This commit changes an inconsistency of uri parameter handling: Previously, when selected fields were given in the uri parameters, these fields were added to the list of selected terms. This does not make sense since it is not the other way round. Now, uri parameters are always overwritten by the parameters given in the body.

closes elastic#3924 for single term vector request
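The intersection described in the commit message above can be illustrated with a small sketch. This is not the Elasticsearch implementation: the function name `select_terms` and the plain dictionary standing in for a stored term vector are assumptions made for the example. It only shows the idea of walking a sorted list of requested terms alongside the sorted document terms and reporting frequency 0 for requested terms the document does not contain.

```python
from typing import Dict, Iterable, Iterator, Tuple


def select_terms(doc_term_freqs: Dict[str, int],
                 requested: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (term, freq) for every requested term, with freq 0 if absent.

    `doc_term_freqs` is a hypothetical stand-in for a document's term vector.
    """
    # Both lists are sorted so a single forward pass is enough.
    doc_terms = sorted(doc_term_freqs)
    wanted = sorted(set(requested))

    i = 0
    for term in wanted:
        # Advance the document cursor until it reaches or passes `term`.
        while i < len(doc_terms) and doc_terms[i] < term:
            i += 1
        if i < len(doc_terms) and doc_terms[i] == term:
            yield term, doc_term_freqs[term]
        else:
            # Requested but not in the document: report frequency 0.
            yield term, 0


result = list(select_terms({"brown": 1, "fox": 2, "quick": 1},
                           ["fox", "zebra", "brown"]))
print(result)
```

Because both inputs are sorted, the document cursor never moves backwards, so the cost of the merge is linear in the number of document terms plus requested terms. As noted earlier in the thread, loading the term vector from disk is still likely to dominate the runtime.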
…ector api

uri parameters were not all parsed for the multi term vector request. This commit makes sure that all parameters are parsed and used when creating the requests for the multi term vector request.

In order to simplify both code and json request, the request structure now allows two ways to use multi term vectors:

1. Give all parameters for each document requested in the docs array like this:

```
{
  "docs": [
    {
      "_index": "testidx",
      "_type": "test",
      "_id": "2",
      "terms": [ "fox" ],
      "term_statistics": true
    },
    {
      "_index": "testidx",
      "_type": "test",
      "_id": "1",
      "terms": [ "quick", "brown" ],
      "term_statistics": false
    }
  ]
}
```

2. Define a list of ids and give parameters in a separate parameters object like this:

```
{
  "ids": [ "1", "2" ],
  "parameters": {
    "_index": "testidx",
    "_type": "test",
    "terms": [ "brown" ]
  }
}
```

uri parameters are global parameters that are set for both cases. They are overwritten by parameter definitions in the body.

closes elastic#3924 for multi term vector api
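The precedence rule in this commit message (uri parameters act as global defaults, body parameters overwrite them) can be sketched as follows. This is only an illustration under stated assumptions: `expand_multi_request` is a hypothetical helper, not part of Elasticsearch, and the requests are modelled as plain dictionaries using the field names from the JSON examples above.

```python
from typing import Any, Dict, List


def expand_multi_request(uri_params: Dict[str, Any],
                         body: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Expand a multi term vector body into one request dict per document.

    Hypothetical helper: later dict entries win, so body values
    overwrite the global uri parameters.
    """
    requests: List[Dict[str, Any]] = []

    # Form 1: every entry in "docs" carries its own parameters.
    for doc in body.get("docs", []):
        requests.append({**uri_params, **doc})

    # Form 2: a shared "parameters" object applied to every id in "ids".
    shared = body.get("parameters", {})
    for doc_id in body.get("ids", []):
        requests.append({**uri_params, **shared, "_id": doc_id})

    return requests


expanded = expand_multi_request(
    {"terms": ["fox"], "term_statistics": True},  # global uri parameters
    {"ids": ["1", "2"],
     "parameters": {"_index": "testidx", "_type": "test", "terms": ["brown"]}},
)
print(expanded)
```

In this example the `terms` value from the body replaces the one given in the uri, while `term_statistics`, which the body does not mention, is inherited from the uri parameters.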
About the speedup: With the current implementation, we load the whole term vector for a document. This makes sense if you need all the terms or do not know in advance which term is requested, but it also makes the request slow. Also, please take a look at PR #4161. It implements access to term vectors in a script. I implemented it wrong there as well (it loads the term vectors and gets the information from there instead of from the standard DocsEnum), but I can fix that. Also, sorry for the delay. |
I just pushed the script API for term statistics (PR #4161). Could you check if this allows you to do all you need? |
Hi Sorry for the slow reply. The function_score query seems to work as the I think this could still be a nice feature and since parent/child object 2014-01-02 11:33 GMT+01:00 Britta Weber [email protected]:
Max J. Hoffmann |
That would call for a different issue. Just to be sure: Do you need both parent and child statistics in the same script? Can you elaborate a little on what exactly you need maybe with a small example? |
Just played with the script_score feature and thinks this allows a really My parent/child use case is the following (I hope this makes sense). In my Then the 'functions' part of the 'function_score' request could look like 'functions' : [{
2014-02-12 11:01 GMT+01:00 Britta Weber [email protected]:
Max J. Hoffmann |
Hi, sorry for the late reply. Can you check if the Using several documents within a script for computing the score is not supported yet and would need a new issue. Let me know if I can close this one. Thanks! |
Looks good to me.
|
Dear All
I have a feature request regarding the TermVector API. I was really happy to see this commit, which I had fledgingly written as a plugin before. Thanks @brwe! Would it be possible, though, to submit a list of terms and only have the TermVector returned for those? My hope is that this would be considerably faster than the full request. I have a use case where I know the terms for which I need the information before making the request.
Some pointers of how to do it myself are appreciated, too, though I am afraid my solution won't be as efficient.
Best and many, many thanks for the great work.
Max.