[RFC] Enhanced Access to Term-Level Statistics in OpenSearch #8702

Closed
noCharger opened this issue Jul 14, 2023 · 7 comments
Labels: enhancement (Enhancement or improvement to existing feature or request), Search (Search query, autocomplete ...etc)

@noCharger
Contributor

noCharger commented Jul 14, 2023

Is your feature request related to a problem? Please describe.


In its present state, OpenSearch, a fork of Elasticsearch, offers only constrained access to term-level statistics extracted from Lucene via its scripting functionality. The current process requires setting the similarity model, which can include scripted similarity, at the index level during index creation. This entails defining the settings and mappings for an index, specifying the similarity model for a specific field or for the whole index. Subsequently, during search operations, OpenSearch uses the predefined similarity model to calculate scores for the documents in the index.

This design choice has been made for performance optimization. The similarity model is employed at index time to precompute certain values required at search time. Additionally, considering it influences how the inverted index is stored and queried, altering the similarity settings on a per-query basis is not practical.
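To make the index-time configuration concrete, here is a minimal sketch of a create-index request body, written as a Python dictionary, that registers a scripted similarity and attaches it to one field. The similarity name `scripted_tfidf` and the field name `body` are placeholders chosen for illustration, and the script is a generic TF-IDF formula rather than anything proposed in this RFC. Note that the script has no per-request parameters, which is the limitation discussed here.

```python
import json

# Hypothetical name for the similarity; any identifier would do.
SIMILARITY_NAME = "scripted_tfidf"

# Painless source evaluated by the scripted similarity. It reads index
# statistics (doc.freq, doc.length, term.docFreq, field.docCount) and
# query.boost, but it cannot read per-request params.
TFIDF_SOURCE = (
    "double tf = Math.sqrt(doc.freq); "
    "double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; "
    "double norm = 1 / Math.sqrt(doc.length); "
    "return query.boost * tf * idf * norm;"
)


def build_index_body(field_name: str) -> dict:
    """Build a create-index body that binds the similarity to one text field."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    SIMILARITY_NAME: {
                        "type": "scripted",
                        "script": {"source": TFIDF_SOURCE},
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                # The similarity is fixed per field when the index is created.
                field_name: {"type": "text", "similarity": SIMILARITY_NAME}
            }
        },
    }


index_body = build_index_body("body")  # "body" is a placeholder field name
print(json.dumps(index_body, indent=2))
```

Changing the formula means changing the index definition, not the search request.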

Describe the solution you'd like


To enhance OpenSearch's capabilities, we suggest broadening the direct access to detailed statistics like term frequency (termfreq), term frequency-inverse document frequency (tf-idf), total term frequency (totaltermfreq), sum of total term frequencies (sumtotaltermfreq), and payload information. This improved access can spur the creation of more refined information retrieval and ranking algorithms.
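As a rough illustration of what these statistics mean, the following sketch computes them over a toy in-memory corpus. The corpus and the function names are invented for this example, and the tf-idf variant shown (tf * ln(N / df)) is one common textbook form, not necessarily the formula Lucene or OpenSearch would use.

```python
import math

# Toy corpus: the tokens of one field for each of three documents
# (illustrative data only).
docs = [
    ["foo", "bar", "foo"],
    ["bar", "baz"],
    ["foo"],
]


def term_freq(doc, term):
    """Number of times `term` occurs in a single document's field."""
    return doc.count(term)


def doc_freq(corpus, term):
    """Number of documents whose field contains `term` at least once."""
    return sum(1 for doc in corpus if term in doc)


def total_term_freq(corpus, term):
    """Occurrences of `term` across all documents in the field."""
    return sum(doc.count(term) for doc in corpus)


def sum_total_term_freq(corpus):
    """Total number of tokens in the field, summed over all terms."""
    return sum(len(doc) for doc in corpus)


def tf_idf(corpus, doc, term):
    """Textbook variant tf * ln(N / df); returns 0.0 for unseen terms."""
    df = doc_freq(corpus, term)
    if df == 0:
        return 0.0
    return term_freq(doc, term) * math.log(len(corpus) / df)


print(term_freq(docs[0], "foo"))        # occurrences in the first document
print(total_term_freq(docs, "foo"))     # occurrences across the corpus
print(sum_total_term_freq(docs))        # all tokens in the field
print(tf_idf(docs, docs[0], "foo"))
```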

We propose augmenting OpenSearch's scripting functionality to include more Lucene ValueSource statistics. This would involve extending existing scripting classes and creating new ones as necessary, leveraging Lucene's existing ValueSource and Similarity classes for the underlying statistics. This new functionality needs to be carefully integrated and thoroughly tested for reliability and performance. This would empower script creators with new tools for customizing information retrieval and ranking in OpenSearch.

Describe alternatives you've considered


  1. Implementing this functionality outside OpenSearch: This would involve pulling data out of OpenSearch, calculating the statistics externally, and then pushing the data back into OpenSearch. However, this approach is likely to be inefficient and would not benefit from the optimizations available within OpenSearch and Lucene.
  2. Relying solely on OpenSearch's existing scripting functionality: While OpenSearch's scripting does provide some access to term-level statistics, it is not flexible enough for tuning and customizing during the fetch phase.
    1. Term vector: As described in Expose term frequency in Painless script score context #7558 (comment), it is not one-pass, since the doc IDs have to be obtained first.
    2. Rank feature: Rank feature scores by adding a weighted value to the original score, for example: <BM25> + boost * <value>
    3. Scripted similarity: As described in Expose term frequency in Painless script score context #7558 (comment), scripted similarity does not allow parameters to be included in the similarity score on a per-query basis. While the multiplier and default_value can be injected through a function_score query, the target term must be in the query context, which is not configurable through params.

Additional context


Related issue: #7558

The proposed enhancement to OpenSearch's scripting functionality will provide a wider range of statistics for use in complex information retrieval and ranking algorithms. This opens up new possibilities for improving the accuracy and relevance of search results, tailoring the retrieval process to specific use cases, and optimizing performance. These statistics can be particularly useful in domains such as information retrieval research, e-commerce, document classification, and others where fine-grained control over the ranking algorithm is desirable.

@noCharger noCharger added enhancement Enhancement or improvement to existing feature or request untriaged labels Jul 14, 2023
@noCharger noCharger self-assigned this Jul 14, 2023
@noCharger noCharger added Search Search query, autocomplete ...etc and removed untriaged labels Jul 14, 2023
@noCharger noCharger moved this from 🆕 New to 👀 In review in Search Project Board Jul 14, 2023
@noCharger
Contributor Author

noCharger commented Jul 14, 2023

@noCharger
Contributor Author

noCharger commented Jul 17, 2023

Some early stage experiments:

1. Extend https://github.com/opensearch-project/OpenSearch/blob/main/plugins/examples/script-expert-scoring/src/main/java/org/opensearch/example/expertscript/ExpertScriptPlugin.java to support several predefined functions that fully cover all use cases, including:

def multiplier = params.multiplier;
for (int x = 0; x < params.fields.length; x++) {
 if (_doc(params.fields[x]) != null) {
   return multiplier * _doc(params.fields[x]).term_freq(params.term);
 }
}

return params.default_value;
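
The control flow of the script above can be mirrored in plain Python to make its behavior easier to reason about. This is only a sketch: a document is modeled as a dictionary from field name to token list, and `score_document` is a hypothetical name, not part of the plugin. One detail worth noting is that the script returns for the first field that exists on the document, even when the term frequency there is 0.

```python
from typing import Mapping, Sequence


def score_document(
    doc_fields: Mapping[str, Sequence[str]],
    fields: Sequence[str],
    term: str,
    multiplier: float,
    default_value: float,
) -> float:
    """Return multiplier * term frequency for the first field present on the document."""
    for name in fields:
        tokens = doc_fields.get(name)
        if tokens is not None:
            # Mirrors `_doc(field).term_freq(term)`: count occurrences of the term.
            return multiplier * tokens.count(term)
    # No listed field exists on this document.
    return default_value


# Illustrative document with two tokenized fields.
doc = {"title": ["foo", "bar"], "body": ["foo", "foo"]}
print(score_document(doc, ["missing", "body"], "foo", 2.0, 1.0))
print(score_document(doc, ["missing"], "foo", 2.0, 1.0))
```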

Pros: Very simple implementation
Cons: It is not flexible and does not support any scripting language, because it only supports predefined functions.
Additional thought: Implementing another ScoreScript with support for any other scripting language from scratch is more difficult than it appears.

2. Support functions in script_score

a. Expose simple functions in the Painless language
b. Experiment with using TermVectors / ValueSource / PostingsEnum

Pros: Flexible enough to cover complex scripting use cases
Cons: Behind the elegant interface lies a challenging implementation based on the Painless reflection mechanism.
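
For orientation, here is a sketch of the kind of search request this approach is aiming at, built as a Python dictionary. The `termFreq('field', 'foo')` call is taken from the scripts in the error output in this comment. Everything else is an assumption made for illustration: the `multiplier` parameter, the `match` sub-query, and the helper name `build_script_score_query`. Nothing in the thread confirms this exact shape.

```python
import json


def build_script_score_query(field: str, term: str, multiplier: float) -> dict:
    """Build a script_score request body that scales term frequency per query."""
    # Assumed script shape: a per-request multiplier combined with termFreq.
    source = f"params.multiplier * termFreq('{field}', '{term}')"
    return {
        "query": {
            "script_score": {
                # Sub-query that selects the documents to be scored.
                "query": {"match": {field: term}},
                "script": {
                    "lang": "painless",
                    "source": source,
                    # Unlike scripted similarity, params can change per request.
                    "params": {"multiplier": multiplier},
                },
            }
        }
    }


request_body = build_script_score_query("field", "foo", 2.0)
print(json.dumps(request_body, indent=2))
```

The point of the sketch is that the multiplier travels with the request, so it can be tuned without recreating the index.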

Some errors encountered when binding Lucene's LeafReaderContext, at compile time and at runtime:


{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "compile error",
        "script_stack": [],
        "script": "\n            termFreq('field', 'foo');\n          ",
        "lang": "painless"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "index1",
        "node": "TRpPyMPvSIW9IDEjvyyZkw",
        "reason": {
          "type": "query_shard_exception",
          "reason": "script_score: the script could not be loaded",
          "index": "index1",
          "index_uuid": "gFl0UNcxQqiPSV1YzSr3yg",
          "caused_by": {
            "type": "script_exception",
            "reason": "compile error",
            "script_stack": [],
            "script": "\n            termFreq('field', 'foo');\n          ",
            "lang": "painless",
            "caused_by": {
              "type": "illegal_argument_exception",
              "reason": "[getLeafReaderContext] has unknown return type [org.apache.lucene.index.LeafReaderContext]. Painless can only support getters with return types that are allowlisted."
            }
          }
        }
      }
    ],
    "caused_by": {
      "type": "script_exception",
      "reason": "compile error",
      "script_stack": [],
      "script": "\n            termFreq('field', 'foo');\n          ",
      "lang": "painless",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "[getLeafReaderContext] has unknown return type [org.apache.lucene.index.LeafReaderContext]. Painless can only support getters with return types that are allowlisted."
      }
    }
  },
  "status": 400
}
{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "compile error",
        "script_stack": [
          "\n            termFreq('field', 'foo'); ...",
          "             ^---- HERE"
        ],
        "script": "\n            termFreq('field', 'foo');\n          ",
        "lang": "painless",
        "position": {
          "offset": 13,
          "start": 0,
          "end": 38
        }
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "index1",
        "node": "FfngdfQ7Tn-crFQzbZTE4g",
        "reason": {
          "type": "query_shard_exception",
          "reason": "script_score: the script could not be loaded",
          "index": "index1",
          "index_uuid": "7QCIjAZFTXaq386R6LGtJw",
          "caused_by": {
            "type": "script_exception",
            "reason": "compile error",
            "script_stack": [
              "\n            termFreq('field', 'foo'); ...",
              "             ^---- HERE"
            ],
            "script": "\n            termFreq('field', 'foo');\n          ",
            "lang": "painless",
            "position": {
              "offset": 13,
              "start": 0,
              "end": 38
            },
            "caused_by": {
              "type": "illegal_argument_exception",
              "reason": "Unknown call [termFreq] with [[org.opensearch.painless.node.EString@22883031, org.opensearch.painless.node.EString@505887d1]] arguments."
            }
          }
        }
      }
    ],
    "caused_by": {
      "type": "script_exception",
      "reason": "compile error",
      "script_stack": [
        "\n            termFreq('field', 'foo'); ...",
        "             ^---- HERE"
      ],
      "script": "\n            termFreq('field', 'foo');\n          ",
      "lang": "painless",
      "position": {
        "offset": 13,
        "start": 0,
        "end": 38
      },
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "Unknown call [termFreq] with [[org.opensearch.painless.node.EString@22883031, org.opensearch.painless.node.EString@505887d1]] arguments."
      }
    }
  },
  "status": 400
}
[2023-07-17T12:01:21,654][INFO ][o.o.p.p.DefaultSemanticAnalysisPhase] [runTask-0] importedMethod: null
[2023-07-17T12:01:21,659][INFO ][o.o.p.p.DefaultSemanticAnalysisPhase] [runTask-0] classBinding: null
[2023-07-17T12:01:21,660][INFO ][o.o.p.p.DefaultSemanticAnalysisPhase] [runTask-0] classBinding: org.opensearch.painless.lookup.PainlessClassBinding@ba1b93dc
[2023-07-17T12:01:21,661][INFO ][o.o.p.p.DefaultSemanticAnalysisPhase] [runTask-0] instanceBinding: null

Next steps:

  1. How is '_doc' / 'ScriptDocValues' made available in the script context for smooth execution? Does it also depend on reflection? If not, how is the async doc value updated?
  2. How are Lucene's CollectionStatistics and TermStatistics exposed in ScriptedSimilarity for Painless script execution?

@noCharger
Contributor Author

Some brainstorming with @nknize @jainankitk @rishabhmaurya on how to expose the feature:

  1. Create a new plugin named 'extra-scripts' or 'customize-script-functions' for approach 1, into which developers can add predefined functions. This plugin may ultimately support user-defined functions.
  2. Consider utilizing the sandbox module, which has the advantage of exposing the functionality without the use of a feature flag. We can eventually include new functionalities into the core.
  3. Other options to feature flag the use of these new functionalities include the allowlist in the painless resource directory, the fielddata flag, exposing it in search pipelines, and so on.

I'm looking for more ideas and opinions on this, because support for this functionality has been debated for a long time and there are actual use cases.

@russcam
Contributor

russcam commented Jul 20, 2023

In #8702 (comment), Approach 1 looks like it would tend towards Approach 2, in that the usefulness in having term frequency available to scripting is greatly enhanced by the ability to combine it with other scripting functions. With Approach 1, having term frequency exposed in a custom script engine by itself is not as useful as having it available with all other scripting capabilities in Painless, as in Approach 2.

@noCharger
Contributor Author

@russcam Here's a prototype for Approach 2 that incorporates these functions:

PoC-TermFreq.mov

Will get it in the repo soon. Thanks @msfroh for inspiring me with the idea of using currying in LeafSearchLookup.

@russcam
Contributor

russcam commented Aug 3, 2023

This looks promising @noCharger! I pulled the branch down locally and successfully ran some example scripts incorporating termFreq.

@mingshl
Contributor

mingshl commented Aug 21, 2023

closing the RFC as the PR is merged.
