[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map #774

reuschling · 2024-06-05T09:24:12Z

As requested, I copied this feature request from ml-commons (opensearch-project/ml-commons#2319), where there is already a short discussion about this issue.

As in my feature request opensearch-project/ml-commons#2277, most documents in my index have the field 'body', and sometimes also 'title' and 'description'. Because the data is crawled, we cannot guarantee that every document has valid data in each field. Nevertheless, it would be nice if, for example, 'description' were considered when generating an answer (e.g. for hybrid search) whenever it is present.

Currently, the existence of a field specified in "field_map" of the text_embedding processor is mandatory. During indexing, I get the error:
{"create":{"_index":"testindex","_id":"sdfhgsd","status":400,"error":{"type":"illegal_argument_exception","reason":"field [description] has empty string value, cannot process it"}}}

Even if I configure "ignore_failure": true for the processor, the document is not processed at all, i.e. the embeddings for an existing 'body' field are also missing if there is no 'description' or 'title' field. There are also documents with an empty body and only a title, so configuring embeddings for 'body' alone is not an option either. Specifying several text_embedding processors, one per field, is also not allowed and fails with the error "type": "json_parse_exception", "reason": "Duplicate field 'text_embedding'...

I tried adding empty strings as field values, but sadly it makes no difference; the processor still rejects them.

One of the key concepts in OpenSearch/Lucene is that not all documents must follow the same 'data schema'. This is also valid for search, where only documents with matching fields will be returned.

So, for the sake of consistency and robustness, please allow fields inside "field_map" that do not have to appear in every document.

{
  "description": "An NLP ingest pipeline for creating sentence embeddings",
  "processors": [
    {
      "text_embedding": {
        "model_id": "A5Xnx44B89YUJ7QK7T3K",
        "field_map": {
          "title": "embedding_tns_title",
          "body": "embedding_tns_body",
          "description": "embedding_tns_description"
        },
        "ignore_failure": true
      }
    }
  ]
}
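
To make the requested semantics concrete, here is a minimal Python sketch of the selection logic I have in mind. It is only an illustration, not the plugin's implementation: `fields_to_embed` is a hypothetical helper, and the rule "skip a mapped field if it is missing, not a string, or blank" is my assumption of what "allow missing or empty fields" should mean.

```python
from typing import Any, Dict

# Same mapping as in the pipeline definition above.
FIELD_MAP = {
    "title": "embedding_tns_title",
    "body": "embedding_tns_body",
    "description": "embedding_tns_description",
}


def fields_to_embed(doc: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, str]:
    """Return {target_field: text} for the mapped fields that carry usable text.

    Hypothetical helper: fields that are absent, not strings, or blank are
    skipped instead of failing the whole document.
    """
    selected: Dict[str, str] = {}
    for source_field, target_field in field_map.items():
        value = doc.get(source_field)
        if isinstance(value, str) and value.strip():
            selected[target_field] = value
    return selected


# A crawled document with an empty description and no title:
doc = {"body": "Some crawled text", "description": ""}
result = fields_to_embed(doc, FIELD_MAP)
```

With this rule, the document above would still get an embedding for 'body', and a document that has only a 'title' would get an embedding for 'title' alone.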

dblock · 2024-07-01T16:10:01Z

[Catch All Triage - Attendees 1, 2, 3, 4, 5]

Thanks for opening this.

yizheliu-amazon · 2024-12-10T23:33:36Z

Thank you for opening this. I may take a look into it.

yizheliu-amazon · 2024-12-17T20:13:15Z

During the investigation, I found a bug affecting the nested list type: #1024. I will work on that issue first and come back to this one once it is fixed.

reuschling added enhancement untriaged labels Jun 5, 2024

reuschling mentioned this issue Jun 5, 2024

[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map opensearch-project/ml-commons#2319

Closed

dblock removed the untriaged label Jul 1, 2024

naveentatikonda added this to Vector Search RoadMap Sep 18, 2024

github-project-automation bot moved this to Backlog in Vector Search RoadMap Sep 18, 2024

heemin32 assigned yizheliu-amazon Dec 10, 2024

yizheliu-amazon mentioned this issue Dec 24, 2024

Support empty string for fields in text embedding processor #1041

Merged

5 tasks

vibrantvarun closed this as completed in #1041 Dec 27, 2024

github-project-automation bot moved this from Backlog to ✅ Done in Vector Search RoadMap Dec 27, 2024
