[FEATURE] text_embedding ingest processor: Allow missing or empty fields in field_map #774

Open
reuschling opened this issue Jun 5, 2024 · 1 comment

As requested, I am copying this feature request over from ml-commons (opensearch-project/ml-commons#2319), where there is already a short discussion of the issue.

As in my feature request opensearch-project/ml-commons#2277, most documents in my index have the field 'body', and some also have 'title' and 'description'. Because the data is crawled, we cannot guarantee that every document has valid data in each field. Nevertheless, it would be nice if a field such as 'description' were considered when generating an answer, e.g. for hybrid search, whenever a document has one.

Currently, the existence of a field specified in "field_map" of the text_embedding processor is mandatory. During indexing, I get the error:
{"create":{"_index":"testindex","_id":"sdfhgsd","status":400,"error":{"type":"illegal_argument_exception","reason":"field [description] has empty string value, cannot process it"}}}

Even if I configure "ignore_failure": true for the processor, the document is not processed at all, i.e. no embedding is generated for an existing 'body' field when the 'description' or 'title' field is missing. There are also documents with an empty body and only a title, which rules out configuring embeddings for 'body' alone. Specifying several text_embedding processors, one for each field, is not accepted either and fails with the error "type": "json_parse_exception", "reason": "Duplicate field 'text_embedding'...
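One thing that might be worth checking here: a "Duplicate field" parse error is what a strict JSON parser reports when the same key appears twice in one object. The sketch below builds a pipeline body in which each text_embedding processor is its own entry in the "processors" array, so no key is repeated. It assumes that the duplicate key was the cause of the error, which the report above does not confirm, and it does not show whether "ignore_failure" then limits a failure to a single field. The helper name build_per_field_pipeline is hypothetical.

```python
import json


def build_per_field_pipeline(model_id, field_map):
    """Return a pipeline body with one text_embedding processor per source field.

    Hypothetical helper: it only assembles the JSON body and does not talk
    to a cluster.
    """
    processors = [
        {
            "text_embedding": {
                "model_id": model_id,
                # One source field per processor, so each can fail on its own.
                "field_map": {source: target},
                "ignore_failure": True,
            }
        }
        for source, target in field_map.items()
    ]
    return {
        "description": "An NLP ingest pipeline for creating sentence embeddings",
        "processors": processors,
    }


pipeline = build_per_field_pipeline(
    "A5Xnx44B89YUJ7QK7T3K",
    {
        "title": "embedding_tns_title",
        "body": "embedding_tns_body",
        "description": "embedding_tns_description",
    },
)
print(json.dumps(pipeline, indent=2))
```

Each array entry is a separate JSON object with a single "text_embedding" key, so the body parses without a duplicate-key error. Whether the resulting pipeline behaves as hoped for documents with missing or empty fields would still need to be tested against a real cluster.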

I tried adding empty strings for the missing fields, but sadly that makes no difference: the processor detects the empty value and still fails.

One of the key concepts in OpenSearch/Lucene is that not all documents have to follow the same 'data schema'. The same holds for search, where only documents with matching fields are returned.

So, for consistency and robustness, please allow fields in "field_map" that do not have to appear in every document.

{
  "description": "An NLP ingest pipeline for creating sentence embeddings",
  "processors": [
    {
      "text_embedding": {
        "model_id": "A5Xnx44B89YUJ7QK7T3K",
        "field_map": {
          "title": "embedding_tns_title",
          "body": "embedding_tns_body",
          "description": "embedding_tns_description"
        },
        "ignore_failure": true
      }
    }
  ]
}
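To make the requested behavior concrete, here is a minimal sketch of the selection rule I have in mind: for each document, only those "field_map" entries whose source field exists and contains non-blank text are embedded, and the rest are skipped silently. This is an illustration only, not the processor's actual implementation; the function name present_fields and the sample documents are made up.

```python
FIELD_MAP = {
    "title": "embedding_tns_title",
    "body": "embedding_tns_body",
    "description": "embedding_tns_description",
}


def present_fields(document, field_map):
    """Return the field_map entries whose source field holds non-blank text."""
    selected = {}
    for source, target in field_map.items():
        value = document.get(source)
        # Skip missing fields, non-string values and empty or whitespace-only strings.
        if isinstance(value, str) and value.strip():
            selected[source] = target
    return selected


# A crawled document with a title only: body is empty, description is blank.
doc = {"title": "Only a title", "body": "", "description": "   "}
print(present_fields(doc, FIELD_MAP))  # {'title': 'embedding_tns_title'}
```

With this rule, a document that has only a body would get only the body embedding, and a document with none of the fields would simply be indexed without embeddings instead of failing.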

dblock commented Jul 1, 2024

[Catch All Triage - Attendees 1, 2, 3, 4, 5]

Thanks for opening this.

dblock removed the untriaged label Jul 1, 2024