diff --git a/_data-prepper/index.md b/_data-prepper/index.md index 423fe9fe95..e418aa1966 100644 --- a/_data-prepper/index.md +++ b/_data-prepper/index.md @@ -18,42 +18,24 @@ Data Prepper is a server-side data collector capable of filtering, enriching, tr With Data Prepper you can build custom pipelines to improve the operational view of applications. Two common use cases for Data Prepper are trace analytics and log analytics. [Trace analytics]({{site.url}}{{site.baseurl}}/data-prepper/common-use-cases/trace-analytics/) can help you visualize event flows and identify performance problems. [Log analytics]({{site.url}}{{site.baseurl}}/data-prepper/common-use-cases/log-analytics/) equips you with tools to enhance your search capabilities, conduct comprehensive analysis, and gain insights into your applications' performance and behavior. -## Concepts +## Key concepts and fundamentals -Data Prepper includes one or more **pipelines** that collect and filter data based on the components set within the pipeline. Each component is pluggable, enabling you to use your own custom implementation of each component. These components include the following: +Data Prepper ingests data through customizable [pipelines]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/). These pipelines consist of pluggable components that you can customize to fit your needs, even allowing you to plug in your own implementations. A Data Prepper pipeline consists of the following components: -- One [source](#source) -- One or more [sinks](#sink) -- (Optional) One [buffer](#buffer) -- (Optional) One or more [processors](#processor) +- One [source]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/sources/) +- One or more [sinks]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sinks/sinks/) +- (Optional) One [buffer]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/buffers/buffers/) +- (Optional) One or more [processors]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/processors/) -A single instance of Data Prepper can have one or more pipelines. +Each pipeline contains two required components: `source` and `sink`. If a `buffer`, a `processor`, or both are missing from the pipeline, then Data Prepper uses the default `bounded_blocking` buffer and a no-op processor. Note that a single instance of Data Prepper can have one or more pipelines. -Each pipeline definition contains two required components: **source** and **sink**. If buffers and processors are missing from the Data Prepper pipeline, Data Prepper uses the default buffer and a no-op processor. +## Basic pipeline configurations -### Source +To understand how the pipeline components function within a Data Prepper configuration, see the following examples. Each pipeline configuration uses a `yaml` file format. For more information, see [Pipelines]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/) for more information and examples. -Source is the input component that defines the mechanism through which a Data Prepper pipeline will consume events. A pipeline can have only one source. The source can consume events either by receiving the events over HTTP or HTTPS or by reading from external endpoints like OTeL Collector for traces and metrics and Amazon Simple Storage Service (Amazon S3). Sources have their own configuration options based on the format of the events (such as string, JSON, Amazon CloudWatch logs, or open telemetry trace). 
The source component consumes events and writes them to the buffer component. +### Minimal configuration -### Buffer - -The buffer component acts as the layer between the source and the sink. Buffer can be either in-memory or disk based. The default buffer uses an in-memory queue called `bounded_blocking` that is bounded by the number of events. If the buffer component is not explicitly mentioned in the pipeline configuration, Data Prepper uses the default `bounded_blocking`. - -### Sink - -Sink is the output component that defines the destination(s) to which a Data Prepper pipeline publishes events. A sink destination could be a service, such as OpenSearch or Amazon S3, or another Data Prepper pipeline. When using another Data Prepper pipeline as the sink, you can chain multiple pipelines together based on the needs of the data. Sink contains its own configuration options based on the destination type. - -### Processor - -Processors are units within the Data Prepper pipeline that can filter, transform, and enrich events using your desired format before publishing the record to the sink component. The processor is not defined in the pipeline configuration; the events publish in the format defined in the source component. You can have more than one processor within a pipeline. When using multiple processors, the processors are run in the order they are defined inside the pipeline specification. - -## Sample pipeline configurations - -To understand how all pipeline components function within a Data Prepper configuration, see the following examples. Each pipeline configuration uses a `yaml` file format. - -### Minimal component - -This pipeline configuration reads from the file source and writes to another file in the same path. It uses the default options for the buffer and processor. +The following minimal pipeline configuration reads from the file source and writes the data to another file on the same path. It uses the default options for the `buffer` and `processor` components. ```yml sample-pipeline: @@ -65,13 +47,13 @@ sample-pipeline: path: ``` -### All components +### Comprehensive configuration -The following pipeline uses a source that reads string events from the `input-file`. The source then pushes the data to the buffer, bounded by a max size of `1024`. The pipeline is configured to have `4` workers, each of them reading a maximum of `256` events from the buffer for every `100 milliseconds`. Each worker runs the `string_converter` processor and writes the output of the processor to the `output-file`. +The following comprehensive pipeline configuration uses both required and optional components: ```yml sample-pipeline: - workers: 4 #Number of workers + workers: 4 # Number of workers delay: 100 # in milliseconds, how often the workers should run source: file: @@ -88,9 +70,10 @@ sample-pipeline: path: ``` -## Next steps - -To get started building your own custom pipelines with Data Prepper, see [Getting started]({{site.url}}{{site.baseurl}}/clients/data-prepper/get-started/). +In the given pipeline configuration, the `source` component reads string events from the `input-file` and pushes the data to a bounded buffer with a maximum size of `1024`. The `workers` component specifies `4` concurrent threads that will process events from the buffer, each reading a maximum of `256` events from the buffer every `100` milliseconds. Each `workers` component runs the `string_converter` processor, which converts the strings to uppercase and writes the processed output to the `output-file`. 
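For comparison, the following sketch shows one way the same pipeline could publish its output to an OpenSearch index instead of a file. This variation is illustrative only and is not part of the original page: the file path, host, credentials, and index name are placeholders, and the sink options mirror the `opensearch` sink settings used in the pipeline examples elsewhere in this documentation.

```yml
sample-pipeline:
  workers: 4                     # number of worker threads, as described above
  delay: 100                     # milliseconds each worker waits between buffer reads
  source:
    file:
      path: /tmp/sample-input    # placeholder input path
  buffer:
    bounded_blocking:
      buffer_size: 1024          # maximum number of events the buffer holds
      batch_size: 256            # maximum number of events a worker reads at a time
  processor:
    - string_converter:
        upper_case: true         # converts string events to uppercase
  sink:
    - opensearch:
        hosts: [ "https://localhost:9200" ]   # placeholder cluster endpoint
        username: admin                       # placeholder credentials
        password: admin
        index: sample_output                  # placeholder index name
```

Only the `sink` entry changes in this sketch; the source, buffer, and processor behave exactly as described for the comprehensive configuration.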
- +## Next steps +- [Get started with Data Prepper]({{site.url}}{{site.baseurl}}/data-prepper/getting-started/). +- [Get familiar with Data Prepper pipelines]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/). +- [Explore common use cases]({{site.url}}{{site.baseurl}}/data-prepper/common-use-cases/common-use-cases/). diff --git a/_data-prepper/managing-data-prepper/configuring-data-prepper.md b/_data-prepper/managing-data-prepper/configuring-data-prepper.md index d890b741cc..e42a9e9449 100644 --- a/_data-prepper/managing-data-prepper/configuring-data-prepper.md +++ b/_data-prepper/managing-data-prepper/configuring-data-prepper.md @@ -103,8 +103,7 @@ check_interval | No | Duration | Specifies the time between checks of the heap s ### Extension plugins -Since Data Prepper 2.5, Data Prepper provides support for user configurable extension plugins. Extension plugins are shared common -configurations shared across pipeline plugins, such as [sources, buffers, processors, and sinks]({{site.url}}{{site.baseurl}}/data-prepper/index/#concepts). +Data Prepper provides support for user-configurable extension plugins. Extension plugins are common configurations shared across pipeline plugins, such as [sources, buffers, processors, and sinks]({{site.url}}{{site.baseurl}}/data-prepper/index/#key-concepts-and-fundamentals). ### AWS extension plugins diff --git a/_data-prepper/pipelines/cidrcontains.md b/_data-prepper/pipelines/cidrcontains.md new file mode 100644 index 0000000000..898f1bc1f5 --- /dev/null +++ b/_data-prepper/pipelines/cidrcontains.md @@ -0,0 +1,24 @@ +--- +layout: default +title: cidrContains() +parent: Functions +grand_parent: Pipelines +nav_order: 5 +--- + +# cidrContains() + +The `cidrContains()` function is used to check if an IP address is contained within a specified Classless Inter-Domain Routing (CIDR) block or range of CIDR blocks. It accepts two or more arguments: + +- The first argument is a JSON pointer, which represents the key or path to the field containing the IP address to be checked. It supports both IPv4 and IPv6 address formats. + +- The subsequent arguments are strings representing one or more CIDR blocks or IP address ranges. The function checks if the IP address specified in the first argument matches or is contained within any of these CIDR blocks. + +For example, if your data contains an IP address field named `client.ip` and you want to check if it belongs to the CIDR blocks `192.168.0.0/16` or `10.0.0.0/8`, you can use the `cidrContains()` function as follows: + +``` +cidrContains('/client.ip', '192.168.0.0/16', '10.0.0.0/8') +``` +{% include copy-curl.html %} + +This function returns `true` if the IP address matches any of the specified CIDR blocks or `false` if it does not. \ No newline at end of file diff --git a/_data-prepper/pipelines/configuration/buffers/buffers.md b/_data-prepper/pipelines/configuration/buffers/buffers.md index eeb68260ea..287825b549 100644 --- a/_data-prepper/pipelines/configuration/buffers/buffers.md +++ b/_data-prepper/pipelines/configuration/buffers/buffers.md @@ -3,9 +3,13 @@ layout: default title: Buffers parent: Pipelines has_children: true -nav_order: 20 +nav_order: 30 --- # Buffers -Buffers store data as it passes through the pipeline. If you implement a custom buffer, it can be memory based, which provides better performance, or disk based, which is larger in size. 
\ No newline at end of file +The `buffer` component acts as an intermediary layer between the `source` and `sink` components in a Data Prepper pipeline. It serves as temporary storage for events, decoupling the `source` from the downstream processors and sinks. Buffers can be either in-memory or disk based. + +If not explicitly specified in the pipeline configuration, Data Prepper uses the default `bounded_blocking` buffer, which is an in-memory queue bounded by the number of events it can store. The `bounded_blocking` buffer is a convenient option when the event volume and processing rates are manageable within the available memory constraints. + + diff --git a/_data-prepper/pipelines/configuration/processors/add-entries.md b/_data-prepper/pipelines/configuration/processors/add-entries.md index 26b95c7b64..c32e8adb3d 100644 --- a/_data-prepper/pipelines/configuration/processors/add-entries.md +++ b/_data-prepper/pipelines/configuration/processors/add-entries.md @@ -21,8 +21,8 @@ You can configure the `add_entries` processor with the following options. | `metadata_key` | No | The key for the new metadata attribute. The argument must be a literal string key and not a JSON Pointer. Either one string key or `metadata_key` is required. | | `value` | No | The value of the new entry to be added, which can be used with any of the following data types: strings, Booleans, numbers, null, nested objects, and arrays. | | `format` | No | A format string to use as the value of the new entry, for example, `${key1}-${key2}`, where `key1` and `key2` are existing keys in the event. Required if neither `value` nor `value_expression` is specified. | -| `value_expression` | No | An expression string to use as the value of the new entry. For example, `/key` is an existing key in the event with a type of either a number, a string, or a Boolean. Expressions can also contain functions returning number/string/integer. For example, `length(/key)` will return the length of the key in the event when the key is a string. For more information about keys, see [Expression syntax](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). | -| `add_when` | No | A [conditional expression](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"'`, that will be evaluated to determine whether the processor will be run on the event. | +| `value_expression` | No | An expression string to use as the value of the new entry. For example, `/key` is an existing key in the event with a type of either a number, a string, or a Boolean. Expressions can also contain functions returning number/string/integer. For example, `length(/key)` will return the length of the key in the event when the key is a string. For more information about keys, see [Expression syntax]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/). | +| `add_when` | No | A [conditional expression]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"'`, that will be evaluated to determine whether the processor will be run on the event. | | `overwrite_if_key_exists` | No | When set to `true`, the existing value is overwritten if `key` already exists in the event. The default value is `false`. | | `append_if_key_exists` | No | When set to `true`, the existing value will be appended if a `key` already exists in the event. An array will be created if the existing value is not an array. Default is `false`. 
| @@ -135,7 +135,7 @@ When the input event contains the following data: {"message": "hello"} ``` -The processed event will have the same data, with the metadata, `{"length": 5}`, attached. You can subsequently use expressions like `getMetadata("length")` in the pipeline. For more information, see the [`getMetadata` function](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/#getmetadata) documentation. +The processed event will have the same data, with the metadata, `{"length": 5}`, attached. You can subsequently use expressions like `getMetadata("length")` in the pipeline. For more information, see [`getMetadata` function]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/get-metadata/). ### Example: Add a dynamic key diff --git a/_data-prepper/pipelines/configuration/processors/processors.md b/_data-prepper/pipelines/configuration/processors/processors.md index 3000d71670..1fa7120551 100644 --- a/_data-prepper/pipelines/configuration/processors/processors.md +++ b/_data-prepper/pipelines/configuration/processors/processors.md @@ -3,12 +3,14 @@ layout: default title: Processors has_children: true parent: Pipelines -nav_order: 25 +nav_order: 35 --- # Processors -Processors perform an action on your data, such as filtering, transforming, or enriching. +Processors are components within a Data Prepper pipeline that enable you to filter, transform, and enrich events using your desired format before publishing records to the `sink` component. If no `processor` is defined in the pipeline configuration, then the events are published in the format specified by the `source` component. You can incorporate multiple processors within a single pipeline, and they are executed sequentially as defined in the pipeline. + +Prior to Data Prepper 1.3, these components were named *preppers*. In Data Prepper 1.3, the term *prepper* was deprecated in favor of *processor*. In Data Prepper 2.0, the term *prepper* was removed. +{: .note } + -Prior to Data Prepper 1.3, processors were named preppers. Starting in Data Prepper 1.3, the term *prepper* is deprecated in favor of the term *processor*. Data Prepper will continue to support the term *prepper* until 2.0, where it will be removed. -{: .note } \ No newline at end of file diff --git a/_data-prepper/pipelines/configuration/sinks/sinks.md b/_data-prepper/pipelines/configuration/sinks/sinks.md index 0f3af6ab25..51bf3b1c9c 100644 --- a/_data-prepper/pipelines/configuration/sinks/sinks.md +++ b/_data-prepper/pipelines/configuration/sinks/sinks.md @@ -3,20 +3,22 @@ layout: default title: Sinks parent: Pipelines has_children: true -nav_order: 30 +nav_order: 25 --- # Sinks -Sinks define where Data Prepper writes your data to. +A `sink` is an output component that specifies the destination(s) to which a Data Prepper pipeline publishes events. Sink destinations can be services like OpenSearch, Amazon Simple Storage Service (Amazon S3), or even another Data Prepper pipeline, enabling chaining of multiple pipelines. The sink component has the following configurable options that you can use to customize the destination type. -## General options for all sink types +## Configuration options The following table describes options you can use to configure the `sinks` sink. Option | Required | Type | Description :--- | :--- |:------------| :--- -routes | No | String list | A list of routes for which this sink applies. If not provided, this sink receives all events. 
See [conditional routing]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#conditional-routing) for more information. -tags_target_key | No | String | When specified, includes event tags in the output of the provided key. -include_keys | No | String list | When specified, provides the keys in this list in the data sent to the sink. Some codecs and sinks do not allow use of this field. -exclude_keys | No | String list | When specified, excludes the keys given from the data sent to the sink. Some codecs and sinks do not allow use of this field. +`routes` | No | String list | A list of routes to which the sink applies. If not provided, then the sink receives all events. See [conditional routing]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#conditional-routing) for more information. +`tags_target_key` | No | String | When specified, includes event tags in the output under the provided key. +`include_keys` | No | String list | When specified, provides only the listed keys in the data sent to the sink. Some codecs and sinks may not support this field. +`exclude_keys` | No | String list | When specified, excludes the listed keys from the data sent to the sink. Some codecs and sinks may not support this field. + + diff --git a/_data-prepper/pipelines/configuration/sources/dynamo-db.md b/_data-prepper/pipelines/configuration/sources/dynamo-db.md index f75489f103..e465f45044 100644 --- a/_data-prepper/pipelines/configuration/sources/dynamo-db.md +++ b/_data-prepper/pipelines/configuration/sources/dynamo-db.md @@ -92,7 +92,7 @@ Option | Required | Type | Description ## Exposed metadata attributes -The following metadata will be added to each event that is processed by the `dynamodb` source. These metadata attributes can be accessed using the [expression syntax `getMetadata` function](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/#getmetadata). +The following metadata will be added to each event that is processed by the `dynamodb` source. These metadata attributes can be accessed using the [expression syntax `getMetadata` function]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/get-metadata/). * `primary_key`: The primary key of the DynamoDB item. For tables that only contain a partition key, this value provides the partition key. For tables that contain both a partition and sort key, the `primary_key` attribute will be equal to the partition and sort key, separated by a `|`, for example, `partition_key|sort_key`. * `partition_key`: The partition key of the DynamoDB item. diff --git a/_data-prepper/pipelines/configuration/sources/sources.md b/_data-prepper/pipelines/configuration/sources/sources.md index b684db56e9..811b161e16 100644 --- a/_data-prepper/pipelines/configuration/sources/sources.md +++ b/_data-prepper/pipelines/configuration/sources/sources.md @@ -3,9 +3,11 @@ layout: default title: Sources parent: Pipelines has_children: true -nav_order: 15 +nav_order: 20 --- # Sources -Sources define where your data comes from within a Data Prepper pipeline. +A `source` is an input component that specifies how a Data Prepper pipeline ingests events. Each pipeline has a single source that either receives events over HTTP(S) or reads from external endpoints, such as OpenTelemetry Collector or Amazon Simple Storage Service (Amazon S3). Sources have configurable options based on the event format (string, JSON, Amazon CloudWatch logs, OpenTelemetry traces).
The source consumes events and passes them to the [`buffer`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/buffers/buffers/) component. + + diff --git a/_data-prepper/pipelines/contains.md b/_data-prepper/pipelines/contains.md new file mode 100644 index 0000000000..657f66bd28 --- /dev/null +++ b/_data-prepper/pipelines/contains.md @@ -0,0 +1,36 @@ +--- +layout: default +title: contains() +parent: Functions +grand_parent: Pipelines +nav_order: 10 +--- + +# contains() + +The `contains()` function is used to check if a substring exists within a given string or the value of a field in an event. It takes two arguments: + +- The first argument is either a literal string or a JSON pointer that represents the field or value to be searched. + +- The second argument is the substring to be searched for within the first argument. +The function returns `true` if the substring specified in the second argument is found within the string or field value represented by the first argument. It returns `false` if it is not. + +For example, if you want to check if the string `"abcd"` is contained within the value of a field named `message`, you can use the `contains()` function as follows: + +``` +contains('/message', 'abcd') +``` +{% include copy-curl.html %} + +This will return `true` if the field `message` contains the substring `abcd` or `false` if it does not. + +Alternatively, you can also use a literal string as the first argument: + +``` +contains('This is a test message', 'test') +``` +{% include copy-curl.html %} + +In this case, the function will return `true` because the substring `test` is present within the string `This is a test message`. + +Note that the `contains()` function performs a case-sensitive search by default. If you need to perform a case-insensitive search, you can use the `containsIgnoreCase()` function instead. diff --git a/_data-prepper/pipelines/dlq.md b/_data-prepper/pipelines/dlq.md index 3032536e93..ac1d868ea4 100644 --- a/_data-prepper/pipelines/dlq.md +++ b/_data-prepper/pipelines/dlq.md @@ -2,7 +2,7 @@ layout: default title: Dead-letter queues parent: Pipelines -nav_order: 13 +nav_order: 15 --- # Dead-letter queues diff --git a/_data-prepper/pipelines/expression-syntax.md b/_data-prepper/pipelines/expression-syntax.md index b4603e34f9..383b54c19b 100644 --- a/_data-prepper/pipelines/expression-syntax.md +++ b/_data-prepper/pipelines/expression-syntax.md @@ -2,70 +2,41 @@ layout: default title: Expression syntax parent: Pipelines -nav_order: 12 +nav_order: 5 --- -# Expression syntax +# Expression syntax -The following sections provide information about expression syntax in Data Prepper. +Expressions provide flexibility in manipulating, filtering, and routing data. The following sections provide information about expression syntax in Data Prepper. -## Supported operators +## Key terms -Operators are listed in order of precedence (top to bottom, left to right). +The following key terms are used in the context of expressions. -| Operator | Description | Associativity | -|----------------------|-------------------------------------------------------|---------------| -| `()` | Priority Expression | left-to-right | -| `not`
`+`
`-`| Unary Logical NOT
Unary Positive
Unary negative | right-to-left | -| `<`, `<=`, `>`, `>=` | Relational Operators | left-to-right | -| `==`, `!=` | Equality Operators | left-to-right | -| `and`, `or` | Conditional Expression | left-to-right | - -## Reserved for possible future functionality - -Reserved symbol set: `^`, `*`, `/`, `%`, `+`, `-`, `xor`, `=`, `+=`, `-=`, `*=`, `/=`, `%=`, `++`, `--`, `${}` - -## Set initializer +Term | Definition +-----|----------- +**Expression** | A generic component that contains a primary or an operator. Expressions can be nested within other expressions. An expression's imminent children can contain 0–1 operators. +**Expression string** | The highest priority in a Data Prepper expression and supports only one expression string resulting in a return value. An expression string is not the same as an expression. +**Literal** | A fundamental value that has no children. A literal can be one of the following: float, integer, Boolean, JSON pointer, string, or null. See [Literals](#literals). +**Operator** | A hardcoded token that identifies the operation used in an expression. +**Primary** | Can be one of the following: set initializer, priority expression, or literal. +**Statement** | The highest-priority component within an expression string. -The set initializer defines a set or term and/or expressions. - -### Examples - -The following are examples of set initializer syntax. - -#### HTTP status codes - -``` -{200, 201, 202} -``` - -#### HTTP response payloads - -``` -{"Created", "Accepted"} -``` - -#### Handle multiple event types with different keys - -``` -{/request_payload, /request_message} -``` +## Operators -## Priority expression +The following table lists the supported operators. Operators are listed in order of precedence (top to bottom, left to right). -A priority expression identifies an expression that will be evaluated at the highest priority level. A priority expression must contain an expression or value; empty parentheses are not supported. - -### Example - -``` -/is_cool == (/name == "Steven") -``` - -## Relational operators +| Operator | Description | Associativity | +|----------------------|-------------------------------------------------------|---------------| +| `()` | Priority expression | Left to right | +| `not`
`+`
`-`| Unary logical NOT
Unary positive
Unary negative | Right to left | +| `<`, `<=`, `>`, `>=` | Relational operators | Left to right | +| `==`, `!=` | Equality operators | Left to right | +| `and`, `or` | Conditional expression | Left to right | -Relational operators are used to test the relationship of two numeric values. The operands must be numbers or JSON Pointers that resolve to numbers. +### Relational operators -### Syntax +Relational operators compare numeric values or JSON pointers that resolve to numeric values. The operators are used to test the relationship between two operands, determining if one is greater than, less than, or equal to the other. The syntax for using relational operators is as follows: ``` < @@ -73,75 +44,44 @@ Relational operators are used to test the relationship of two numeric values. Th > >= ``` +{% include copy-curl.html %} -### Example +For example, to check if the value of the `status_code` field in an event is within the range of successful HTTP responses (200--299), you can use the following expression: ``` /status_code >= 200 and /status_code < 300 ``` +{% include copy-curl.html %} -## Equality operators +### Equality operators -Equality operators are used to test whether two values are equivalent. +Equality operators are used to test whether two values are equivalent. These operators compare values of any type, including JSON pointers, literals, and expressions. The syntax for using equality operators is as follows: -### Syntax ``` == != ``` +{% include copy-curl.html %} -### Examples -``` -/is_cool == true -3.14 != /status_code -{1, 2} == /event/set_property -``` -## Using equality operators to check for a JSON Pointer - -Equality operators can also be used to check whether a JSON Pointer exists by comparing the value with `null`. +The following are some example equality operators: -### Syntax -``` - == null - != null -null == -null != -``` +- `/is_cool == true`: Checks if the value referenced by the JSON pointer is equal to the Boolean value. +- `3.14 != /status_code`: Checks if the numeric value is not equal to the value referenced by the JSON pointer. +- `{1, 2} == /event/set_property`: Checks if the array is equal to the value referenced by the JSON pointer. -### Example -``` -/response == null -null != /response -``` +### Conditional expressions -## Type check operator +Conditional expressions allow you to combine multiple expressions or values using logical operators to create more complex evaluation criteria. The available conditional operators are `and`, `or`, and `not`. The syntax for using these conditional operators is as follows: -The type check operator tests whether a JSON Pointer is of a certain data type. - -### Syntax -``` - typeof -``` -Supported data types are `integer`, `long`, `boolean`, `double`, `string`, `map`, and `array`. - -#### Example -``` -/response typeof integer -/message typeof string -``` - -### Conditional expression - -A conditional expression is used to chain together multiple expressions and/or values. - -#### Syntax ``` and or not ``` +{% include copy-curl.html %} + +The following are some example conditional expressions: -### Example ``` /status_code == 200 and /message == "Hello world" /status_code == 200 or /status_code == 202 @@ -149,80 +89,80 @@ not /status_code in {200, 202} /response == null /response != null ``` +{% include copy-curl.html %} -## Definitions +### Reserved symbols -This section provides expression definitions. 
+Reserved symbols are symbols that are not currently used in the expression syntax but are reserved for possible future functionality or extensions. Reserved symbols include `^`, `*`, `/`, `%`, `+`, `-`, `xor`, `=`, `+=`, `-=`, `*=`, `/=`, `%=`, `++`, `--`, and `${}`. -### Literal -A literal is a fundamental value that has no children: -- Float: Supports values from 3.40282347 × 1038 to 1.40239846 × 10−45. -- Integer: Supports values from −2,147,483,648 to 2,147,483,647. -- Boolean: Supports true or false. -- JSON Pointer: See the [JSON Pointer](#json-pointer) section for details. -- String: Supports valid Java strings. -- Null: Supports null check to see whether a JSON Pointer exists. +## Syntax components -### Expression string -An expression string takes the highest priority in a Data Prepper expression and only supports one expression string resulting in a return value. An _expression string_ is not the same as an _expression_. +Syntax components are the building blocks of expressions in Data Prepper. They allow you to define sets, specify evaluation order, reference values within events, use literal values, and follow specific white space rules. Understanding these components is crucial for creating and working with expressions effectively in Data Prepper pipelines. -### Statement -A statement is the highest-priority component of an expression string. +### Priority expressions -### Expression -An expression is a generic component that contains a _Primary_ or an _Operator_. Expressions may contain expressions. An expression's imminent children can contain 0–1 _Operators_. +Priority expressions specify the evaluation order of expressions. They are enclosed in parentheses `()`. Priority expressions must contain an expression or value (empty parentheses are not supported). The following is an example priority expression: -### Primary +``` +/is_cool == (/name == "Steven") +``` +{% include copy-curl.html %} -- _Set_ -- _Priority Expression_ -- _Literal_ +### JSON pointers -### Operator -An operator is a hardcoded token that identifies the operation used in an _expression_. +JSON pointers are used to reference values within an event. They start with a leading forward slash `/` followed by alphanumeric characters or underscores that are separated by additional forward slashes `/`. -### JSON Pointer -A JSON Pointer is a literal used to reference a value within an event and provided as context for an _expression string_. JSON Pointers are identified by a leading `/` containing alphanumeric characters or underscores, delimited by `/`. JSON Pointers can use an extended character set if wrapped in double quotes (`"`) using the escape character `\`. Note that JSON Pointers require `~` and `/` characters, which should be used as part of the path and not as a delimiter that needs to be escaped. +JSON pointers can use an extended character set by wrapping the entire pointer in double quotation marks `""` and escaping characters using a backslash `\`. Note that the `~` and `/` characters are considered to be part of the pointer path and do not need to be escaped. The following are some examples of valid JSON pointers: `~0` to represent the literal character `~` or `~1` to represent the literal character `/`. 
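To make the relationship between an event and a JSON pointer concrete, consider a hypothetical event such as `{"client": {"ip": "10.0.0.1"}, "status_code": 200}`. The pointers `/client/ip` and `/status_code` reference its nested and top-level values and can be combined with the operators described on this page, for example in the `add_when` condition of the `add_entries` processor. The sketch below is illustrative only and is not part of the original page; the pipeline name and field names are assumptions.

```yml
pointer-example-pipeline:
  source:
    http:                        # receives JSON events over HTTP
  processor:
    - add_entries:
        entries:
          - key: "is_internal"   # new key added only to matching events
            value: true
            add_when: '/client/ip == "10.0.0.1" and /status_code >= 200'
  sink:
    - stdout:
```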
-The following are examples of JSON Pointers: +#### Shorthand syntax -- `~0` representing `~` -- `~1` representing `/` +The shorthand syntax for a JSON pointer can be expressed using the following regular expression pattern, where `\w` represents any word character (A--Z, a-z, 0--9, or underscore): -#### Shorthand syntax (Regex, `\w` = `[A-Za-z_]`) ``` -/\w+(/\w+)* +/\w+(/\w+)*` ``` +{% include copy-curl.html %} + -#### Example of shorthand - -The following is an example of shorthand: +The following is an example of this shorthand syntax: ``` /Hello/World/0 ``` +{% include copy-curl.html %} -#### Example of escaped syntax +#### Escaped syntax + +The escaped syntax for a JSON pointer can be expressed as follows: -The following is an example of escaped syntax: ``` "/(/)*" ``` +{% include copy-curl.html %} -#### Example of an escaped JSON Pointer +The following is an example of an escaped JSON pointer: -The following is an example of an escaped JSON Pointer: ``` # Path # { "Hello - 'world/" : [{ "\"JsonPointer\"": true }] } "/Hello - 'world\//0/\"JsonPointer\"" ``` +{% include copy-curl.html %} + +### Literals -## White space +Literals are fundamental values that have no children. Data Prepper supports the following literal types: -White space is **optional** surrounding relational operators, regex equality operators, equality operators, and commas. -White space is **required** surrounding set initializers, priority expressions, set operators, and conditional expressions. +- **Float:** Supports values from 3.40282347 x 10^38 to 1.40239846 x 10^-45. +- **Integer:** Supports values from -2,147,483,648 to 2,147,483,647. +- **Boolean:** Supports `true` or `false`. +- **JSON pointer:** See [JSON pointers](#json-pointers) for more information. +- **String:** Supports valid Java strings. +- **Null:** Supports `null` to check if a JSON pointer exists. +### White space rules + +White space is optional around relational operators, regex equality operators, equality operators, and commas. White space is required around set initializers, priority expressions, set operators, and conditional expressions. | Operator | Description | White space required | ✅ Valid examples | ❌ Invalid examples | |----------------------|--------------------------|----------------------|----------------------------------------------------------------|---------------------------------------| @@ -230,53 +170,12 @@ White space is **required** surrounding set initializers, priority expressions, | `()` | Priority expression | Yes | `/a==(/b==200)`
`/a in ({200})` | `/status in({200})` | | `in`, `not in` | Set operators | Yes | `/a in {200}`
`/a not in {400}` | `/a in{200, 202}`
`/a not in{400}` | | `<`, `<=`, `>`, `>=` | Relational operators | No | `/status < 300`
`/status>=300` | | -| `=~`, `!~` | Regex equality pperators | No | `/msg =~ "^\w*$"`
`/msg=~"^\w*$"` | | +| `=~`, `!~` | Regex equality operators | No | `/msg =~ "^\w*$"`
`/msg=~"^\w*$"` | | | `==`, `!=` | Equality operators | No | `/status == 200`
`/status_code==200` | | | `and`, `or`, `not` | Conditional operators | Yes | `/a<300 and /b>200` | `/b<300and/b>200` | | `,` | Set value delimiter | No | `/a in {200, 202}`
`/a in {200,202}`
`/a in {200 , 202}` | `/a in {200,}` | | `typeof` | Type check operator | Yes | `/a typeof integer`
`/a typeof long`
`/a typeof string`
`/a typeof double`
`/a typeof boolean`
`/a typeof map`
`/a typeof array` |`/a typeof /b`
`/a typeof 2` | +## Related articles -## Functions - -Data Prepper supports the following built-in functions that can be used in an expression. - -### `length()` - -The `length()` function takes one argument of the JSON pointer type and returns the length of the value passed. For example, `length(/message)` returns a length of `10` when a key message exists in the event and has a value of `1234567890`. - -### `hasTags()` - -The `hasTags()` function takes one or more string type arguments and returns `true` if all of the arguments passed are present in an event's tags. When an argument does not exist in the event's tags, the function returns `false`. For example, if you use the expression `hasTags("tag1")` and the event contains `tag1`, Data Prepper returns `true`. If you use the expression `hasTags("tag2")` but the event only contains `tag1`, Data Prepper returns `false`. - -### `getMetadata()` - -The `getMetadata()` function takes one literal string argument to look up specific keys in a an event's metadata. If the key contains a `/`, then the function looks up the metadata recursively. When passed, the expression returns the value corresponding to the key. The value returned can be of any type. For example, if the metadata contains `{"key1": "value2", "key2": 10}`, then the function, `getMetadata("key1")`, returns `value2`. The function, `getMetadata("key2")`, returns 10. - -### `contains()` - -The `contains()` function takes two string arguments and determines whether either a literal string or a JSON pointer is contained within an event. When the second argument contains a substring of the first argument, such as `contains("abcde", "abcd")`, the function returns `true`. If the second argument does not contain any substrings, such as `contains("abcde", "xyz")`, it returns `false`. - -### `cidrContains()` - -The `cidrContains()` function takes two or more arguments. The first argument is a JSON pointer, which represents the key to the IP address that is checked. It supports both IPv4 and IPv6 addresses. Every argument that comes after the key is a string type that represents CIDR blocks that are checked against. - -If the IP address in the first argument is in the range of any of the given CIDR blocks, the function returns `true`. If the IP address is not in the range of the CIDR blocks, the function returns `false`. For example, `cidrContains(/sourceIp,"192.0.2.0/24","10.0.1.0/16")` will return `true` if the `sourceIp` field indicated in the JSON pointer has a value of `192.0.2.5`. - -### `join()` - -The `join()` function joins elements of a list to form a string. The function takes a JSON pointer, which represents the key to a list or a map where values are of the list type, and joins the lists as strings using commas (`,`), the default delimiter between strings. - -If `{"source": [1, 2, 3]}` is the input data, as shown in the following example: - - -```json -{"source": {"key1": [1, 2, 3], "key2": ["a", "b", "c"]}} -``` - -Then `join(/source)` will return `"1,2,3"` in the following format: - -```json -{"key1": "1,2,3", "key2": "a,b,c"} -``` -You can also specify a delimiter other than the default inside the expression. For example, `join("-", /source)` joins each `source` field using a hyphen (`-`) as the delimiter. 
+- [Functions]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/functions/) diff --git a/_data-prepper/pipelines/functions.md b/_data-prepper/pipelines/functions.md new file mode 100644 index 0000000000..f0661faba4 --- /dev/null +++ b/_data-prepper/pipelines/functions.md @@ -0,0 +1,18 @@ +--- +layout: default +title: Functions +parent: Pipelines +nav_order: 10 +has_children: true +--- + +# Functions + +Data Prepper offers a range of built-in functions that can be used within expressions to perform common data preprocessing tasks, such as calculating lengths, checking for tags, retrieving metadata, searching for substrings, checking IP address ranges, and joining list elements. These functions include the following: + +- [`cidrContains()`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/cidrcontains/) +- [`contains()`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/contains/) +- [`getMetadata()`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/get-metadata/) +- [`hasTags()`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/has-tags/) +- [`join()`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/join/) +- [`length()`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/length/) \ No newline at end of file diff --git a/_data-prepper/pipelines/get-metadata.md b/_data-prepper/pipelines/get-metadata.md new file mode 100644 index 0000000000..fc89ed51d6 --- /dev/null +++ b/_data-prepper/pipelines/get-metadata.md @@ -0,0 +1,42 @@ +--- +layout: default +title: getMetadata() +parent: Functions +grand_parent: Pipelines +nav_order: 15 +--- + +# getMetadata() + +The `getMetadata()` function takes one literal string argument and looks up specific keys in event metadata. + +If the key contains a `/`, then the function looks up the metadata recursively. When passed, the expression returns the value corresponding to the key. + +The value returned can be of any type. For example, if the metadata contains `{"key1": "value2", "key2": 10}`, then the function `getMetadata("key1")` returns `value2`. The function `getMetadata("key2")` returns `10`. + +#### Example + +```json +{ + "event": { + "metadata": { + "key1": "value2", + "key2": 10 + }, + "data": { + // ... + } + }, + "output": [ + { + "key": "key1", + "value": "value2" + }, + { + "key": "key2", + "value": 10 + } + ] +} +``` +{% include copy-curl.html %} diff --git a/_data-prepper/pipelines/has-tags.md b/_data-prepper/pipelines/has-tags.md new file mode 100644 index 0000000000..d6cb498b11 --- /dev/null +++ b/_data-prepper/pipelines/has-tags.md @@ -0,0 +1,45 @@ +--- +layout: default +title: hasTags() +parent: Functions +grand_parent: Pipelines +nav_order: 20 +--- + +# hasTags() + +The `hasTags()` function takes one or more string type arguments and returns `true` if all of the arguments passed are present in an event's tags. If an argument does not exist in the event's tags, then the function returns `false`. + +For example, if you use the expression `hasTags("tag1")` and the event contains `tag1`, then Data Prepper returns `true`. If you use the expression `hasTags("tag2")` but the event only contains `tag1`, then Data Prepper returns `false`. + +#### Example + +```json +{ + "events": [ + { + "tags": ["tag1"], + "data": { + // ... + } + }, + { + "tags": ["tag1", "tag2"], + "data": { + // ... 
+ } + } + ], + "expressions": [ + { + "expression": "hasTags(\"tag1\")", + "expected_results": [true, true] + }, + { + "expression": "hasTags(\"tag2\")", + "expected_results": [false, true] + } + ] +} +``` +{% include copy-curl.html %} diff --git a/_data-prepper/pipelines/join.md b/_data-prepper/pipelines/join.md new file mode 100644 index 0000000000..3a4d77d5c2 --- /dev/null +++ b/_data-prepper/pipelines/join.md @@ -0,0 +1,16 @@ +--- +layout: default +title: join() +parent: Functions +grand_parent: Pipelines +nav_order: 25 +--- + +# join() + + +The `join()` function joins elements of a list to form a string. The function takes a JSON pointer, which represents the key to a list or map where values are of the list type, and joins the lists as strings using commas `,`. Commas are the default delimiter between strings. + +If `{"source": [1, 2, 3]}` is the input data, as in `{"source": {"key1": [1, 2, 3], "key2": ["a", "b", "c"]}}`, then `join(/source)` returns `"1,2,3"` in the following format: `{"key1": "1,2,3", "key2": "a,b,c"}`. + +You can specify an alternative delimiter inside the expression. For example, `join("-", /source)` joins each source field using a hyphen `-` as the delimiter. diff --git a/_data-prepper/pipelines/length.md b/_data-prepper/pipelines/length.md new file mode 100644 index 0000000000..fca4b10df2 --- /dev/null +++ b/_data-prepper/pipelines/length.md @@ -0,0 +1,24 @@ +--- +layout: default +title: length() +parent: Functions +grand_parent: Pipelines +nav_order: 30 +--- + +# length() + +The `length()` function takes one argument of the JSON pointer type and returns the length of the passed value. For example, `length(/message)` returns a length of `10` when a key message exists in the event and has a value of `1234567890`. + +#### Example + +```json +{ + "event": { + "/message": "1234567890" + }, + "expression": "length(/message)", + "expected_output": 10 +} +``` +{% include copy-curl.html %} diff --git a/_data-prepper/pipelines/pipelines-configuration-options.md b/_data-prepper/pipelines/pipelines-configuration-options.md deleted file mode 100644 index 5667906af1..0000000000 --- a/_data-prepper/pipelines/pipelines-configuration-options.md +++ /dev/null @@ -1,18 +0,0 @@ ---- -layout: default -title: Pipeline options -parent: Pipelines -nav_order: 11 ---- - -# Pipeline options - -This page provides information about pipeline configuration options in Data Prepper. - -## General pipeline options - -Option | Required | Type | Description -:--- | :--- | :--- | :--- -workers | No | Integer | Essentially the number of application threads. As a starting point for your use case, try setting this value to the number of CPU cores on the machine. Default is 1. -delay | No | Integer | Amount of time in milliseconds workers wait between buffer read attempts. Default is `3000`. - diff --git a/_data-prepper/pipelines/pipelines.md b/_data-prepper/pipelines/pipelines.md index e897ed5596..d519f0da80 100644 --- a/_data-prepper/pipelines/pipelines.md +++ b/_data-prepper/pipelines/pipelines.md @@ -10,11 +10,15 @@ redirect_from: # Pipelines -The following image illustrates how a pipeline works. +Pipelines are critical components that streamline the process of acquiring, transforming, and loading data from various sources into a centralized data repository or processing system. The following diagram illustrates how Data Prepper ingests data into OpenSearch. Data Prepper pipeline{: .img-fluid} -To use Data Prepper, you define pipelines in a configuration YAML file. 
Each pipeline is a combination of a source, a buffer, zero or more processors, and one or more sinks. For example: +## Configuring Data Prepper pipelines + +Pipelines are defined in the configuration YAML file. Starting with Data Prepper 2.0, you can define pipelines across multiple YAML configuration files, with each file containing the configuration for one or more pipelines. This gives you flexibility to organize and chain together complex pipeline configurations. To ensure proper loading of your pipeline configurations, place the YAML configuration files in the `pipelines` folder in your application's home directory, for example, `/usr/share/data-prepper`. + +The following is an example configuration: ```yml simple-sample-pipeline: @@ -32,36 +36,36 @@ simple-sample-pipeline: sink: - stdout: ``` +{% include copy-curl.html %} -- Sources define where your data comes from. In this case, the source is a random UUID generator (`random`). - -- Buffers store data as it passes through the pipeline. - - By default, Data Prepper uses its one and only buffer, the `bounded_blocking` buffer, so you can omit this section unless you developed a custom buffer or need to tune the buffer settings. +### Pipeline components -- Processors perform some action on your data: filter, transform, enrich, etc. +The following table describes the components used in the given pipeline. - You can have multiple processors, which run sequentially from top to bottom, not in parallel. The `string_converter` processor transform the strings by making them uppercase. +Option | Required | Type | Description +:--- | :--- |:------------| :--- +`workers` | No | Integer | The number of application threads. Set to the number of CPU cores. Default is `1`. +`delay` | No | Integer | The number of milliseconds that `workers` wait between buffer read attempts. Default is `3000`. +`source` | Yes | String list | `random` generates random numbers by using a Universally Unique Identifier (UUID) generator. +`bounded_blocking` | No | String list | The default buffer in Data Prepper. +`processor` | No | String list | A `string_converter` with an `upper_case` processor that converts strings to uppercase. +`sink` | Yes | `stdout` outputs to standard output. -- Sinks define where your data goes. In this case, the sink is stdout. +## Pipeline concepts -Starting from Data Prepper 2.0, you can define pipelines across multiple configuration YAML files, where each file contains the configuration for one or more pipelines. This gives you more freedom to organize and chain complex pipeline configurations. For Data Prepper to load your pipeline configuration properly, place your configuration YAML files in the `pipelines` folder under your application's home directory (e.g. `/usr/share/data-prepper`). -{: .note } +The following are fundamental concepts relating to Data Prepper pipelines. -## End-to-end acknowledgments +### End-to-end acknowledgments -Data Prepper ensures the durability and reliability of data written from sources and delivered to sinks through end-to-end (E2E) acknowledgments. An E2E acknowledgment begins at the source, which monitors a batch of events set inside pipelines and waits for a positive acknowledgment when those events are successfully pushed to sinks. When a pipeline contains multiple sinks, including sinks set as additional Data Prepper pipelines, the E2E acknowledgment sends when events are received by the final sink in a pipeline chain. 
+Data Prepper ensures reliable and durable data delivery from sources to sinks through end-to-end (E2E) acknowledgments. The E2E acknowledgment process begins at the source, which monitors event batches within pipelines and waits for a positive acknowledgment upon successful delivery to the sinks. In pipelines with multiple sinks, including nested Data Prepper pipelines, the E2E acknowledgment is sent when events reach the final sink in the pipeline chain. Conversely, the source sends a negative acknowledgment if an event cannot be delivered to a sink for any reason. -Alternatively, the source sends a negative acknowledgment when an event cannot be delivered to a sink for any reason. +If a pipeline component fails to process and send an event, then the source receives no acknowledgment. In the case of a failure, the pipeline's source times out, allowing you to take necessary actions, such as rerunning the pipeline or logging the failure. -When any component of a pipeline fails and is unable to send an event, the source receives no acknowledgment. In the case of a failure, the pipeline's source times out. This gives you the ability to take any necessary actions to address the source failure, including rerunning the pipeline or logging the failure. +### Conditional routing +Pipelines also support conditional routing, which enables the routing of events to different sinks based on specific conditions. To add conditional routing, specify a list of named routes using the `route` component and assign specific routes to sinks using the `routes` property. Any sink with the `routes` property will only accept events matching at least one of the routing conditions. -## Conditional routing - -Pipelines also support **conditional routing** which allows you to route events to different sinks based on specific conditions. To add conditional routing to a pipeline, specify a list of named routes under the `route` component and add specific routes to sinks under the `routes` property. Any sink with the `routes` property will only accept events that match at least one of the routing conditions. - -In the following example, `application-logs` is a named route with a condition set to `/log_type == "application"`. The route uses [Data Prepper expressions](https://github.com/opensearch-project/data-prepper/tree/main/examples) to define the conditions. Data Prepper only routes events that satisfy the condition to the first OpenSearch sink. By default, Data Prepper routes all events to a sink which does not define a route. In the example, all events route into the third OpenSearch sink. +In the following example pipeline, `application-logs` is a named route with a condition set to `/log_type == "application"`. The route uses [Data Prepper expressions](https://github.com/opensearch-project/data-prepper/tree/main/examples) to define the condition. Data Prepper routes events satisfying this condition to the first OpenSearch sink. By default, Data Prepper routes all events to sinks without a defined route, as shown in the third OpenSearch sink of the given pipeline: ```yml conditional-routing-sample-pipeline: @@ -84,269 +88,8 @@ conditional-routing-sample-pipeline: hosts: [ "https://opensearch:9200" ] index: all_logs ``` +{% include copy-curl.html %} +## Next steps -## Examples - -This section provides some pipeline examples that you can use to start creating your own pipelines. 
For more pipeline configurations, select from the following options for each component: - -- [Buffers]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/buffers/buffers/) -- [Processors]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/processors/) -- [Sinks]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sinks/sinks/) -- [Sources]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/sources/) - -The Data Prepper repository has several [sample applications](https://github.com/opensearch-project/data-prepper/tree/main/examples) to help you get started. - -### Log ingestion pipeline - -The following example `pipeline.yaml` file with SSL and basic authentication enabled for the `http-source` demonstrates how to use the HTTP Source and Grok Prepper plugins to process unstructured log data: - - -```yaml -log-pipeline: - source: - http: - ssl_certificate_file: "/full/path/to/certfile.crt" - ssl_key_file: "/full/path/to/keyfile.key" - authentication: - http_basic: - username: "myuser" - password: "mys3cret" - processor: - - grok: - match: - # This will match logs with a "log" key against the COMMONAPACHELOG pattern (ex: { "log": "actual apache log..." } ) - # You should change this to match what your logs look like. See the grok documenation to get started. - log: [ "%{COMMONAPACHELOG}" ] - sink: - - opensearch: - hosts: [ "https://localhost:9200" ] - # Change to your credentials - username: "admin" - password: "admin" - # Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate - #cert: /path/to/cert - # If you are connecting to an Amazon OpenSearch Service domain without - # Fine-Grained Access Control, enable these settings. Comment out the - # username and password above. - #aws_sigv4: true - #aws_region: us-east-1 - # Since we are Grok matching for Apache logs, it makes sense to send them to an OpenSearch index named apache_logs. - # You should change this to correspond with how your OpenSearch indexes are set up. - index: apache_logs -``` - -This example uses weak security. We strongly recommend securing all plugins which open external ports in production environments. -{: .note} - -### Trace analytics pipeline - -The following example demonstrates how to build a pipeline that supports the [Trace Analytics OpenSearch Dashboards plugin]({{site.url}}{{site.baseurl}}/observability-plugin/trace/ta-dashboards/). This pipeline takes data from the OpenTelemetry Collector and uses two other pipelines as sinks. These two separate pipelines index trace and the service map documents for the dashboard plugin. - -Starting from Data Prepper 2.0, Data Prepper no longer supports `otel_trace_raw_prepper` processor due to the Data Prepper internal data model evolution. -Instead, users should use `otel_trace_raw`. 
- -```yml -entry-pipeline: - delay: "100" - source: - otel_trace_source: - ssl: false - buffer: - bounded_blocking: - buffer_size: 10240 - batch_size: 160 - sink: - - pipeline: - name: "raw-pipeline" - - pipeline: - name: "service-map-pipeline" -raw-pipeline: - source: - pipeline: - name: "entry-pipeline" - buffer: - bounded_blocking: - buffer_size: 10240 - batch_size: 160 - processor: - - otel_trace_raw: - sink: - - opensearch: - hosts: ["https://localhost:9200"] - insecure: true - username: admin - password: admin - index_type: trace-analytics-raw -service-map-pipeline: - delay: "100" - source: - pipeline: - name: "entry-pipeline" - buffer: - bounded_blocking: - buffer_size: 10240 - batch_size: 160 - processor: - - service_map_stateful: - sink: - - opensearch: - hosts: ["https://localhost:9200"] - insecure: true - username: admin - password: admin - index_type: trace-analytics-service-map -``` - -To maintain similar ingestion throughput and latency, scale the `buffer_size` and `batch_size` by the estimated maximum batch size in the client request payload. -{: .tip} - -### Metrics pipeline - -Data Prepper supports metrics ingestion using OTel. It currently supports the following metric types: - -* Gauge -* Sum -* Summary -* Histogram - -Other types are not supported. Data Prepper drops all other types, including Exponential Histogram and Summary. Additionally, Data Prepper does not support Scope instrumentation. - -To set up a metrics pipeline: - -```yml -metrics-pipeline: - source: - otel_metrics_source: - processor: - - otel_metrics_raw_processor: - sink: - - opensearch: - hosts: ["https://localhost:9200"] - username: admin - password: admin -``` - -### S3 log ingestion pipeline - -The following example demonstrates how to use the S3Source and Grok Processor plugins to process unstructured log data from [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3). This example uses application load balancer logs. As the application load balancer writes logs to S3, S3 creates notifications in Amazon SQS. Data Prepper monitors those notifications and reads the S3 objects to get the log data and process it. - -```yml -log-pipeline: - source: - s3: - notification_type: "sqs" - compression: "gzip" - codec: - newline: - sqs: - queue_url: "https://sqs.us-east-1.amazonaws.com/12345678910/ApplicationLoadBalancer" - aws: - region: "us-east-1" - sts_role_arn: "arn:aws:iam::12345678910:role/Data-Prepper" - - processor: - - grok: - match: - message: ["%{DATA:type} %{TIMESTAMP_ISO8601:time} %{DATA:elb} %{DATA:client} %{DATA:target} %{BASE10NUM:request_processing_time} %{DATA:target_processing_time} %{BASE10NUM:response_processing_time} %{BASE10NUM:elb_status_code} %{DATA:target_status_code} %{BASE10NUM:received_bytes} %{BASE10NUM:sent_bytes} \"%{DATA:request}\" \"%{DATA:user_agent}\" %{DATA:ssl_cipher} %{DATA:ssl_protocol} %{DATA:target_group_arn} \"%{DATA:trace_id}\" \"%{DATA:domain_name}\" \"%{DATA:chosen_cert_arn}\" %{DATA:matched_rule_priority} %{TIMESTAMP_ISO8601:request_creation_time} \"%{DATA:actions_executed}\" \"%{DATA:redirect_url}\" \"%{DATA:error_reason}\" \"%{DATA:target_list}\" \"%{DATA:target_status_code_list}\" \"%{DATA:classification}\" \"%{DATA:classification_reason}"] - - grok: - match: - request: ["(%{NOTSPACE:http_method})? (%{NOTSPACE:http_uri})? 
    - grok:
        match:
          http_uri: ["(%{WORD:protocol})?(://)?(%{IPORHOST:domain})?(:)?(%{INT:http_port})?(%{GREEDYDATA:request_uri})?"]
    - date:
        from_time_received: true
        destination: "@timestamp"

  sink:
    - opensearch:
        hosts: [ "https://localhost:9200" ]
        username: "admin"
        password: "admin"
        index: alb_logs
```

## Migrating from Logstash

Data Prepper supports Logstash configuration files for a limited set of plugins. To run Data Prepper with a Logstash configuration, mount the configuration file into the container:

```bash
docker run --name data-prepper \
    -v /full/path/to/logstash.conf:/usr/share/data-prepper/pipelines/pipelines.conf \
    opensearchproject/opensearch-data-prepper:latest
```

This feature is limited by the feature parity between Logstash and Data Prepper. As of the Data Prepper 1.2 release, the following Logstash plugins are supported:

- HTTP Input plugin
- Grok Filter plugin
- Elasticsearch Output plugin
- Amazon Elasticsearch Output plugin

## Configure the Data Prepper server

Data Prepper provides administrative HTTP endpoints, such as `/list` for listing pipelines and `/metrics/prometheus` for Prometheus-compatible metrics data. The port that serves these endpoints, along with its TLS configuration, is specified in a separate YAML file. The Data Prepper Docker images secure these endpoints by default, but we strongly recommend providing your own configuration file to secure production environments. The following is an example `data-prepper-config.yaml` file:

```yml
ssl: true
keyStoreFilePath: "/usr/share/data-prepper/keystore.jks"
keyStorePassword: "password"
privateKeyPassword: "other_password"
serverPort: 1234
```

To configure the Data Prepper server, run Data Prepper with the additional YAML file:

```bash
docker run --name data-prepper \
    -v /full/path/to/my-pipelines.yaml:/usr/share/data-prepper/pipelines/my-pipelines.yaml \
    -v /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml \
    opensearchproject/data-prepper:latest
```

## Configure peer forwarder

Data Prepper provides an HTTP service that forwards events between Data Prepper nodes for aggregation. This is required for operating Data Prepper in a clustered deployment. Currently, peer forwarding is supported by the `aggregate`, `service_map_stateful`, and `otel_trace_raw` processors. The peer forwarder groups events based on the identification keys provided by these processors. For the `service_map_stateful` and `otel_trace_raw` processors, the identification key is `traceId` by default and cannot be configured. For the `aggregate` processor, the key is configurable using the `identification_keys` option.

Peer forwarder supports peer discovery through one of three options: a static list, a DNS record lookup, or AWS Cloud Map. Peer discovery is configured using the `discovery_mode` option. Peer forwarder also supports SSL for verification and encryption and mTLS for mutual authentication in the peer forwarding service.
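
For example, with a static list of peers, the peer forwarder configuration might look like the following sketch; the `static_endpoints` values are placeholder hostnames that you would replace with the addresses of your own Data Prepper nodes:

```yml
peer_forwarder:
  discovery_mode: static
  static_endpoints: ["data-prepper-node-1.example.com", "data-prepper-node-2.example.com"]
```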

To configure peer forwarder, add the configuration options to the `data-prepper-config.yaml` file described in the [Configure the Data Prepper server](#configure-the-data-prepper-server) section. The following example uses DNS discovery with mutual TLS:

```yml
peer_forwarder:
  discovery_mode: dns
  domain_name: "data-prepper-cluster.my-domain.net"
  ssl: true
  ssl_certificate_file: ""
  ssl_key_file: ""
  authentication:
    mutual_tls:
```

## Pipeline configurations

Starting with Data Prepper 2.5, you can configure shared pipeline components under the reserved `pipeline_configurations` section when all pipelines are defined in a single pipeline configuration YAML file. Shared pipeline configurations can include certain components within [Extension Plugins]({{site.url}}{{site.baseurl}}/data-prepper/managing-data-prepper/configuring-data-prepper/#extension-plugins), as shown in the following example, which references the secrets configurations used by an `opensearch` sink:

```yml
pipeline_configurations:
  aws:
    secrets:
      credential-secret-config:
        secret_id:
        region:
        sts_role_arn:
simple-sample-pipeline:
  ...
  sink:
    - opensearch:
        hosts: [ {% raw %}"${{aws_secrets:host-secret-config}}"{% endraw %} ]
        username: {% raw %}"${{aws_secrets:credential-secret-config:username}}"{% endraw %}
        password: {% raw %}"${{aws_secrets:credential-secret-config:password}}"{% endraw %}
        index: "test-migration"
```

When the same component is defined in both `pipelines.yaml` and `data-prepper-config.yaml`, the definition in `pipelines.yaml` overwrites its counterpart in `data-prepper-config.yaml`. For more information about shared pipeline components, see [AWS secrets extension plugin]({{site.url}}{{site.baseurl}}/data-prepper/managing-data-prepper/configuring-data-prepper/#aws-secrets-extension-plugin).

See [Common use cases]({{site.url}}{{site.baseurl}}/data-prepper/common-use-cases/common-use-cases/) for example configurations.