By request, I copied this feature request from ml-commons (opensearch-project/ml-commons#2319), where there is already a short discussion of this issue.
Like in my FR opensearch-project/ml-commons#2277, most documents in my index have the field 'body', and sometimes also 'title' and 'description'. Because the data is crawled, we cannot guarantee that every document has valid data for each field. Nevertheless, it would be nice if, for example, 'description' were taken into account when generating an answer (e.g. for hybrid search) whenever it is present.
Currently, the existence of a field specified in "field_map" of the text_embedding processor is mandatory. During indexing, I get the error:
{"create":{"_index":"testindex","_id":"sdfhgsd","status":400,"error":{"type":"illegal_argument_exception","reason":"field [description] has empty string value, cannot process it"}}}
Even if I configure "ignore_failure": true for the processor, the document is not processed at all, i.e. embeddings for an existing 'body' field are also missing if there is no 'description' or 'title' field. There are also documents with an empty body and only a title, which is a real blocker for configuring embeddings for body alone. Also, specifying several text_embedding processors, one for each field, is not allowed and fails with the error "type": "json_parse_exception", "reason": "Duplicate field 'text_embedding'...
I tried adding empty strings as field values, but sadly it makes no difference: the processor still detects them and rejects the document.
One of the key concepts in OpenSearch/Lucene is that not all documents must follow the same 'data schema'. The same holds for search, where only documents with matching fields are returned.
So, for consistency and robustness, please allow "field_map" to contain fields that do not have to appear in every document.
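For reference, here is a rough sketch of how a pipeline body with one text_embedding processor per field could be built. It assumes that the "Duplicate field" error comes from repeating the "text_embedding" key inside a single JSON object, and that placing each processor in its own entry of the "processors" array avoids it. The model ID and the target vector field names are hypothetical placeholders, and whether this layout works around the failure described above is untested.

```python
import json

# Hypothetical placeholder; replace with the ID of a deployed embedding model.
MODEL_ID = "my-embedding-model-id"


def build_pipeline(field_to_vector):
    """Build a pipeline body with one text_embedding processor per source field.

    Each processor is its own object in the "processors" list, so the key
    "text_embedding" never appears twice within the same JSON object.
    """
    processors = [
        {
            "text_embedding": {
                "model_id": MODEL_ID,
                "field_map": {source_field: vector_field},
                "ignore_failure": True,
            }
        }
        for source_field, vector_field in field_to_vector.items()
    ]
    return {
        "description": "One text_embedding processor per field (sketch)",
        "processors": processors,
    }


# Target vector field names are illustrative assumptions.
pipeline = build_pipeline(
    {
        "body": "body_embedding",
        "title": "title_embedding",
        "description": "description_embedding",
    }
)
print(json.dumps(pipeline, indent=2))
```

The printed JSON is meant as a request body for creating an ingest pipeline; it only illustrates the layout and does not change how the processor treats empty or missing fields.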
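To make the requested behaviour concrete, here is a minimal sketch of the semantics I have in mind. It is not the processor's actual implementation; the function name `usable_field_map` and the target vector field names are hypothetical and only serve as illustration. The idea is that fields which are missing, empty, or blank are skipped, while all remaining fields are still embedded.

```python
def usable_field_map(document, field_map):
    """Return the entries of field_map whose source field holds non-blank text.

    Missing fields, empty strings, and whitespace-only strings are skipped
    instead of causing the whole document to be rejected.
    """
    usable = {}
    for source_field, vector_field in field_map.items():
        value = document.get(source_field)
        if isinstance(value, str) and value.strip():
            usable[source_field] = vector_field
    return usable


# Hypothetical mapping from text fields to vector fields.
FIELD_MAP = {
    "body": "body_embedding",
    "title": "title_embedding",
    "description": "description_embedding",
}

# A crawled document with an empty description and no title at all.
doc = {"body": "some crawled text", "description": ""}

print(usable_field_map(doc, FIELD_MAP))  # {'body': 'body_embedding'}
```

With this behaviour, a document that only has a title would still get a title embedding, and a document without any usable field would simply be indexed without embeddings.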
During the investigation, I found a bug affecting the nested list type (issue #1024). I will work on that issue first and come back to this one after it is fixed.