[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map #774

reuschling · 2024-06-05T09:24:12Z

By demand, I copied this feature request from ml-commons opensearch-project/ml-commons#2319. There is also a small discussion about this issue yet.

Like in my FR opensearch-project/ml-commons#2277, most documents in my index have the field 'body', and sometimes also 'title' and 'description'. Because the data is crawled, we can not make sure that there is valid data for each document. Nevertheless it would be nice if e.g. 'description' will be considered for generating an answer for e.g. hybrid search if there is one.

Currently, the existence of a field specified in "field_map" of the text_embedding processor is mandatory. During indexing, I get the error:
{"create":{"_index":"testindex","_id":"sdfhgsd","status":400,"error":{"type":"illegal_argument_exception","reason":"field [description] has empty string value, cannot process it"}}}

Even if I configure "ignore_failure": true for the processor, the document will not processed at all, i.e. embeddings for an existing 'body' field are missing also if there is no 'description' or 'title' field. There are also documents with empty body but with title only which is a real blocker to configure just embeddings for body. Also, specifying several text_embedding processors - one for each field - is not allowed with the error type": "json_parse_exception", "reason": "Duplicate field 'text_embedding'...

I tried adding empty Strings as fields, but sadly it makes no difference, the processor recognize it.

One of the key concepts in OpenSearch/Lucene is that not all documents must follow the same 'data schema'. This is also valid for search, where only documents with matching fields will be returned.

So, in terms of consistency and robustness please allow fields inside "field_map" that don't have to appear in all documents.

{
  "description": "An NLP ingest pipeline for creating sentence embeddings",
  "processors": [
    {
      "text_embedding": {
        "model_id": "A5Xnx44B89YUJ7QK7T3K",
        "field_map": {
          "title": "embedding_tns_title",
	  "body": "embedding_tns_body",
	  "description": "embedding_tns_description"					
        },
	"ignore_failure": true
      }
    }
  ]
}

dblock · 2024-07-01T16:10:01Z

[Catch All Triage - Attendees 1, 2, 3, 4, 5]

Thanks for opening this.

reuschling added enhancement untriaged labels Jun 5, 2024

reuschling mentioned this issue Jun 5, 2024

[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map opensearch-project/ml-commons#2319

Closed

dblock removed the untriaged label Jul 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map #774

[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map #774

reuschling commented Jun 5, 2024

dblock commented Jul 1, 2024

[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map #774

[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map #774

Comments

reuschling commented Jun 5, 2024

dblock commented Jul 1, 2024