[8.x] Update UpdateForV9 in AttachmentProcessor (#118186) #118281

Merged
91 changes: 68 additions & 23 deletions docs/reference/ingest/processors/attachment.asciidoc
@@ -19,15 +19,15 @@ representation. The processor will skip the base64 decoding then.
.Attachment options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The field to get the base64 encoded field from
| `target_field` | no | attachment | The field that will hold the attachment information
| `indexed_chars` | no | 100000 | The number of chars being used for extraction to prevent huge fields. Use `-1` for no limit.
| `indexed_chars_field` | no | `null` | Field name from which you can overwrite the number of chars being used for extraction. See `indexed_chars`.
| `properties` | no | all properties | Array of properties to select to be stored. Can be `content`, `title`, `name`, `author`, `keywords`, `date`, `content_type`, `content_length`, `language`
| `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document
| `remove_binary` | no | `false` | If `true`, the binary `field` will be removed from the document
| `resource_name` | no | | Field containing the name of the resource to decode. If specified, the processor passes this resource name to the underlying Tika library to enable https://tika.apache.org/1.24.1/detection.html#Resource_Name_Based_Detection[Resource Name Based Detection].
| Name | Required | Default | Description
| `field` | yes | - | The field to get the base64 encoded field from
| `target_field` | no | attachment | The field that will hold the attachment information
| `indexed_chars` | no | 100000 | The number of chars being used for extraction to prevent huge fields. Use `-1` for no limit.
| `indexed_chars_field` | no | `null` | Field name from which you can overwrite the number of chars being used for extraction. See `indexed_chars`.
| `properties` | no | all properties | Array of properties to select to be stored. Can be `content`, `title`, `name`, `author`, `keywords`, `date`, `content_type`, `content_length`, `language`
| `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document
| `remove_binary` | encouraged | `false` | If `true`, the binary `field` will be removed from the document. This option is not required, but setting it explicitly is encouraged, and omitting it will result in a warning.
| `resource_name` | no | | Field containing the name of the resource to decode. If specified, the processor passes this resource name to the underlying Tika library to enable https://tika.apache.org/1.24.1/detection.html#Resource_Name_Based_Detection[Resource Name Based Detection].
|======

[discrete]
@@ -58,7 +58,7 @@ PUT _ingest/pipeline/attachment
{
"attachment" : {
"field" : "data",
"remove_binary": false
"remove_binary": true
}
}
]
@@ -82,7 +82,6 @@ The document's `attachment` object contains extracted properties for the file:
"_seq_no": 22,
"_primary_term": 1,
"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "ro",
@@ -94,9 +93,6 @@ The document's `attachment` object contains extracted properties for the file:
----
// TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]

NOTE: Keeping the binary as a field within the document might consume a lot of resources. It is highly recommended
to remove that field from the document. Set `remove_binary` to `true` to automatically remove the field.

[[attachment-fields]]
==== Exported fields

@@ -143,7 +139,7 @@ PUT _ingest/pipeline/attachment
"attachment" : {
"field" : "data",
"properties": [ "content", "title" ],
"remove_binary": false
"remove_binary": true
}
}
]
@@ -154,6 +150,59 @@ NOTE: Extracting contents from binary data is a resource intensive operation and
consumes a lot of resources. It is highly recommended to run pipelines
using this processor in a dedicated ingest node.

[[attachment-keep-binary]]
==== Keeping the attachment binary

Keeping the binary as a field within the document might consume a lot of resources. It is highly recommended to remove
that field from the document, by setting `remove_binary` to `true` to automatically remove the field, as in the other
examples shown on this page. If you _do_ want to keep the binary field, explicitly set `remove_binary` to `false` to
avoid the warning you get from omitting it:

[source,console]
----
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information including original binary",
"processors" : [
{
"attachment" : {
"field" : "data",
"remove_binary": false
}
}
]
}
PUT my-index-000001/_doc/my_id?pipeline=attachment
{
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my-index-000001/_doc/my_id
----

The document's `_source` object includes the original binary field:

[source,console-result]
----
{
"found": true,
"_index": "my-index-000001",
"_id": "my_id",
"_version": 1,
"_seq_no": 22,
"_primary_term": 1,
"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "ro",
"content": "Lorem ipsum dolor sit amet",
"content_length": 28
}
}
}
----
// TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]

[[attachment-cbor]]
==== Use the attachment processor with CBOR

@@ -170,7 +219,7 @@ PUT _ingest/pipeline/cbor-attachment
{
"attachment" : {
"field" : "data",
"remove_binary": false
"remove_binary": true
}
}
]
@@ -226,7 +275,7 @@ PUT _ingest/pipeline/attachment
"field" : "data",
"indexed_chars" : 11,
"indexed_chars_field" : "max_size",
"remove_binary": false
"remove_binary": true
}
}
]
@@ -250,7 +299,6 @@ Returns this:
"_seq_no": 35,
"_primary_term": 1,
"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "is",
@@ -274,7 +322,7 @@ PUT _ingest/pipeline/attachment
"field" : "data",
"indexed_chars" : 11,
"indexed_chars_field" : "max_size",
"remove_binary": false
"remove_binary": true
}
}
]
@@ -299,7 +347,6 @@ Returns this:
"_seq_no": 40,
"_primary_term": 1,
"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"max_size": 5,
"attachment": {
"content_type": "application/rtf",
@@ -358,7 +405,7 @@ PUT _ingest/pipeline/attachment
"attachment": {
"target_field": "_ingest._value.attachment",
"field": "_ingest._value.data",
"remove_binary": false
"remove_binary": true
}
}
}
@@ -396,7 +443,6 @@ Returns this:
"attachments" : [
{
"filename" : "ipsum.txt",
"data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
"attachment" : {
"content_type" : "text/plain; charset=ISO-8859-1",
"language" : "en",
@@ -406,7 +452,6 @@ Returns this:
},
{
"filename" : "test.txt",
"data" : "VGhpcyBpcyBhIHRlc3QK",
"attachment" : {
"content_type" : "text/plain; charset=ISO-8859-1",
"language" : "en",
Original file line number Diff line number Diff line change
@@ -196,7 +196,7 @@ public IngestDocument execute(IngestDocument ingestDocument) {
* @param property property to add
* @param value value to add
*/
private <T> void addAdditionalField(Map<String, Object> additionalFields, Property property, String value) {
private void addAdditionalField(Map<String, Object> additionalFields, Property property, String value) {
if (properties.contains(property) && Strings.hasLength(value)) {
additionalFields.put(property.toLowerCase(), value);
}
@@ -233,7 +233,7 @@ public AttachmentProcessor create(
String processorTag,
String description,
Map<String, Object> config
) throws Exception {
) {
String field = readStringProperty(TYPE, processorTag, config, "field");
String resourceName = readOptionalStringProperty(TYPE, processorTag, config, "resource_name");
String targetField = readStringProperty(TYPE, processorTag, config, "target_field", "attachment");
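
The updated `remove_binary` row in the options table describes a three-way behavior: an explicit `true` removes the binary field, an explicit `false` keeps it, and omitting the option falls back to `false` while producing a warning. The part of the factory that implements this is not shown in the diff above, so the following is only a minimal sketch of that behavior under stated assumptions. The class `RemoveBinaryOptionSketch`, the method `resolveRemoveBinary`, the warning text, and the use of a plain list to collect warnings are all hypothetical and are not taken from `AttachmentProcessor`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RemoveBinaryOptionSketch {

    /**
     * Hypothetical helper (not the real factory code): resolves the effective
     * value of "remove_binary" from a processor config map. When the option is
     * absent, it records a warning and falls back to the documented default.
     */
    static boolean resolveRemoveBinary(Map<String, Object> config, List<String> warnings) {
        Object raw = config.get("remove_binary");
        if (raw == null) {
            // Omitted: keep the documented default (false) but warn the user.
            // The message wording here is an assumption for illustration.
            warnings.add("[remove_binary] is not set; defaulting to false. "
                + "Set it explicitly to avoid this warning.");
            return false;
        }
        if (raw instanceof Boolean) {
            // Explicit true or false: use it as-is, no warning.
            return (Boolean) raw;
        }
        throw new IllegalArgumentException("[remove_binary] must be a boolean, got: " + raw);
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();

        // Config that omits the option entirely.
        Map<String, Object> omitted = new HashMap<>();
        omitted.put("field", "data");
        System.out.println("omitted  -> " + resolveRemoveBinary(omitted, warnings));

        // Config that keeps the binary on purpose.
        Map<String, Object> explicitFalse = new HashMap<>(omitted);
        explicitFalse.put("remove_binary", false);
        System.out.println("explicit -> " + resolveRemoveBinary(explicitFalse, warnings));

        System.out.println("warnings recorded: " + warnings.size());
    }
}
```

In this sketch only the first call records a warning, which mirrors why the new "Keeping the attachment binary" section tells users to set `remove_binary` to `false` explicitly rather than leave it out.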