[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map #774

Closed
reuschling opened this issue Jun 5, 2024 · 3 comments · Fixed by #1041
@reuschling

As requested, I copied this feature request from ml-commons (opensearch-project/ml-commons#2319), where there is already a short discussion of the issue.

As in my feature request opensearch-project/ml-commons#2277, most documents in my index have the field 'body', and some also have 'title' and 'description'. Because the data is crawled, we cannot guarantee that every document has valid data for each field. Nevertheless, it would be nice if e.g. 'description' were considered when generating an answer, e.g. for hybrid search, whenever it is present.

Currently, the existence of a field specified in "field_map" of the text_embedding processor is mandatory. During indexing, I get the error:
{"create":{"_index":"testindex","_id":"sdfhgsd","status":400,"error":{"type":"illegal_argument_exception","reason":"field [description] has empty string value, cannot process it"}}}

Even if I configure "ignore_failure": true for the processor, the document is not processed at all, i.e. the embeddings for an existing 'body' field are also missing when there is no 'description' or 'title' field. There are also documents with an empty body and only a title, so configuring embeddings for 'body' alone is not an option either. Specifying several text_embedding processors, one for each field, is not allowed either; it fails with the error "type": "json_parse_exception", "reason": "Duplicate field 'text_embedding'...

I tried adding empty strings as field values, but sadly it makes no difference: the processor detects the empty value and rejects the document.

One of the key concepts in OpenSearch/Lucene is that not all documents must follow the same 'data schema'. This is also valid for search, where only documents with matching fields will be returned.

So, in terms of consistency and robustness please allow fields inside "field_map" that don't have to appear in all documents.

{
  "description": "An NLP ingest pipeline for creating sentence embeddings",
  "processors": [
    {
      "text_embedding": {
        "model_id": "A5Xnx44B89YUJ7QK7T3K",
        "field_map": {
          "title": "embedding_tns_title",
          "body": "embedding_tns_body",
          "description": "embedding_tns_description"
        },
        "ignore_failure": true
      }
    }
  ]
}
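
To make the request concrete, here is a minimal Python sketch of the requested semantics. It is an illustration only, not the plugin's actual implementation: the helper name `applicable_field_map` is hypothetical, and it assumes documents are flat dictionaries and that a field counts as usable only when it holds a non-blank string. The idea is that the processor would embed just the "field_map" entries a given document can satisfy and silently skip the rest.

```python
def applicable_field_map(document, field_map):
    """Keep only the field_map entries whose source field holds non-blank text.

    Hypothetical helper: it models the behavior requested in this issue,
    not how the text_embedding processor is implemented.
    """
    selected = {}
    for source_field, embedding_field in field_map.items():
        value = document.get(source_field)
        # Skip fields that are absent, not strings, or blank after trimming.
        if isinstance(value, str) and value.strip():
            selected[source_field] = embedding_field
    return selected


field_map = {
    "title": "embedding_tns_title",
    "body": "embedding_tns_body",
    "description": "embedding_tns_description",
}

# A crawled document with a body, an empty description, and no title.
doc = {"body": "Some crawled text", "description": ""}
print(applicable_field_map(doc, field_map))  # {'body': 'embedding_tns_body'}
```

With this behavior, the document above would still get an embedding for 'body', and a document with only a 'title' would get an embedding for 'title', instead of either one being rejected as a whole.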
@dblock
Member

dblock commented Jul 1, 2024

[Catch All Triage - Attendees 1, 2, 3, 4, 5]

Thanks for opening this.

@yizheliu-amazon
Contributor

Thank you for opening this. I may take a look into it.

@yizheliu-amazon
Contributor

During the investigation, I found a bug affecting the nested list type (issue #1024). I will work on that issue first and come back to this one after it is fixed.
