[RFC] Enhanced Access to Term-Level Statistics in OpenSearch #8702
Comments
Some early-stage experiments:
1. Extend https://github.com/opensearch-project/OpenSearch/blob/main/plugins/examples/script-expert-scoring/src/main/java/org/opensearch/example/expertscript/ExpertScriptPlugin.java to support several predefined functions that fully cover all use cases, including:
   Pros: Very simple implementation
2. Support functions in script_score
   a. Simple functions exposed in the Painless language
   Pros: Flexible enough to cover complex scripting use cases
   Some errors when binding
Next steps:
|
Some brainstorming with @nknize @jainankitk @rishabhmaurya on how to expose the feature
I'm looking for more ideas / opinions on this because support for this functionality has been debated for a long time and there are real use cases. |
In #8702 (comment), Approach 1 looks like it would tend towards Approach 2, in that the usefulness in having term frequency available to scripting is greatly enhanced by the ability to combine it with other scripting functions. With Approach 1, having term frequency exposed in a custom script engine by itself is not as useful as having it available with all other scripting capabilities in Painless, as in Approach 2. |
This looks promising @noCharger! I pulled the branch down locally and successfully ran some example scripts incorporating |
closing the RFC as the PR is merged. |
Is your feature request related to a problem? Please describe.
In its present state, OpenSearch, a fork of Elasticsearch, offers only constrained access to term-level statistics extracted from Lucene via its scripting functionality. The current process requires setting the similarity model, which can include scripted similarity, at the index level during index creation. This entails defining the settings and mappings for an index, specifying the similarity model for a specific field or for the whole index. Subsequently, during search operations, OpenSearch uses the predefined similarity model to calculate scores for the documents in the index.
This design choice has been made for performance optimization. The similarity model is employed at index time to precompute certain values required at search time. Additionally, considering it influences how the inverted index is stored and queried, altering the similarity settings on a per-query basis is not practical.
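To make the current workflow concrete, here is a minimal sketch of an index-creation request body that attaches a scripted similarity to one field. The index layout, the field name `body`, and the similarity name `scripted_tfidf` are assumptions chosen for illustration. The script follows the TF-IDF scripted-similarity pattern inherited from Elasticsearch, so treat it as an approximation rather than a verified OpenSearch configuration.

```python
import json

# Painless source for a TF-IDF style scripted similarity.
# The variables (doc.freq, field.docCount, term.docFreq, doc.length, query.boost)
# follow the scripted-similarity pattern inherited from Elasticsearch.
SCRIPT_SOURCE = (
    "double tf = Math.sqrt(doc.freq); "
    "double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; "
    "double norm = 1 / Math.sqrt(doc.length); "
    "return query.boost * tf * idf * norm;"
)


def build_index_body(field_name="body", similarity_name="scripted_tfidf"):
    """Build a create-index body; both default names are hypothetical."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    similarity_name: {
                        "type": "scripted",
                        "script": {"source": SCRIPT_SOURCE},
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                # The similarity is bound to the field when the index is created.
                field_name: {"type": "text", "similarity": similarity_name}
            }
        },
    }


index_body = build_index_body()
print(json.dumps(index_body, indent=2))
```

Because the similarity is bound to the field in the mappings, a query has no way to swap in a different scoring formula at search time, which is the limitation this RFC describes.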
Describe the solution you'd like
To enhance OpenSearch's capabilities, we suggest broadening the direct access to detailed statistics like term frequency (termfreq), term frequency-inverse document frequency (tf-idf), total term frequency (totaltermfreq), sum of total term frequencies (sumtotaltermfreq), and payload information. This improved access can spur the creation of more refined information retrieval and ranking algorithms.
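As a rough illustration of what these statistics measure, the following sketch computes them over a toy in-memory corpus for a single field. It is not how Lucene computes them internally. The whitespace tokenizer, the sample documents, and the tf-idf formula (square-root tf and smoothed logarithmic idf, in the style of Lucene's classic TF-IDF similarity) are assumptions made for this example.

```python
import math

# Toy corpus standing in for one text field across three documents (made up).
docs = [
    "the quick brown fox",
    "the lazy dog",
    "the quick quick fox jumps",
]
tokenized = [d.split() for d in docs]


def term_freq(doc_tokens, term):
    """termfreq: occurrences of the term in one document's field."""
    return doc_tokens.count(term)


def doc_freq(all_docs, term):
    """Number of documents whose field contains the term at least once."""
    return sum(1 for tokens in all_docs if term in tokens)


def total_term_freq(all_docs, term):
    """totaltermfreq: occurrences of the term across all documents."""
    return sum(tokens.count(term) for tokens in all_docs)


def sum_total_term_freq(all_docs):
    """sumtotaltermfreq: total number of tokens in the field, over all terms."""
    return sum(len(tokens) for tokens in all_docs)


def tf_idf(all_docs, doc_tokens, term):
    """tf-idf with sqrt tf and smoothed log idf (an assumed formula)."""
    tf = math.sqrt(term_freq(doc_tokens, term))
    idf = math.log((len(all_docs) + 1) / (doc_freq(all_docs, term) + 1)) + 1
    return tf * idf


print(term_freq(tokenized[2], "quick"))
print(total_term_freq(tokenized, "quick"))
print(sum_total_term_freq(tokenized))
print(tf_idf(tokenized, tokenized[2], "quick"))
```

Only `term_freq` depends on the document being scored. The other values are index-wide for the field, which is why they are usually read from index statistics and not recomputed per document.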
We propose augmenting OpenSearch's scripting functionality to include more Lucene ValueSource statistics. This would involve extending existing scripting classes and creating new ones as necessary, leveraging Lucene's existing ValueSource and Similarity classes for the underlying statistics. This new functionality needs to be carefully integrated and thoroughly tested for reliability and performance. This would empower script creators with new tools for customizing information retrieval and ranking in OpenSearch.
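A search request using such a statistic at query time might look like the sketch below. The `script_score` query wrapper is an existing query type. The Painless function name `termFreq`, its `(field, term)` argument order, and the field and term values are assumptions for illustration, not a confirmed API from this RFC.

```python
import json


def build_script_score_query(field, term):
    """Build a script_score search body that reads a term-level statistic.

    NOTE: `termFreq(field, term)` is a hypothetical Painless function name used
    to illustrate the proposal; the name and signature actually exposed may differ.
    """
    return {
        "query": {
            "script_score": {
                # The inner query selects candidate documents.
                "query": {"match": {field: term}},
                "script": {
                    # Dampen the raw frequency with a logarithm to show that the
                    # statistic can be combined with other scripting functions.
                    "source": "Math.log(1 + termFreq(params.field, params.term))",
                    "params": {"field": field, "term": term},
                },
            }
        }
    }


search_body = build_script_score_query("body", "quick")
print(json.dumps(search_body, indent=2))
```

Unlike an index-level similarity, the scoring formula here travels with the request, so it can change from one query to the next without reindexing.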
Describe alternatives you've considered
<BM25> + boost * <value>
Additional context
Related issue: #7558
The proposed enhancement to OpenSearch's scripting functionality will provide a wider range of statistics for use in complex information retrieval and ranking algorithms. This opens up new possibilities for improving the accuracy and relevance of search results, tailoring the retrieval process to specific use cases, and optimizing performance. These statistics can be particularly useful in domains such as information retrieval research, e-commerce, document classification, and others where fine-grained control over the ranking algorithm is desirable.