
Edit for redundant information and sections across Data Prepper #7127

Merged: 32 commits, Aug 1, 2024
Commits
16b2387
Edit for redundant information and sections across Data Prepper
vagimeli May 9, 2024
584e08d
Edit for redundant information and sections across Data Prepper
vagimeli May 9, 2024
458079e
Rewrite expression syntax and reorganize doc structure for readability
vagimeli May 10, 2024
92cd1b8
Rewrite expression syntax and reorganize doc structure for readability
vagimeli May 10, 2024
7228bb2
Rewrite expression syntax and reorganize doc structure for readability
vagimeli May 10, 2024
c2d751f
Rewrite expression syntax and reorganize doc structure for readability
vagimeli May 10, 2024
82373d7
Rewrite expression syntax and reorganize doc structure for readability
vagimeli May 10, 2024
26c2013
Merge branch 'main' into update-index
vagimeli May 10, 2024
050292a
Update _data-prepper/index.md
vagimeli May 10, 2024
01b54ac
Update configuring-data-prepper.md
vagimeli May 10, 2024
fe74f47
Update _data-prepper/pipelines/expression-syntax.md
vagimeli May 10, 2024
eb38891
Update _data-prepper/pipelines/expression-syntax.md
vagimeli May 10, 2024
1cb7fb6
Merge branch 'main' into update-index
vagimeli May 15, 2024
50b5161
Update _data-prepper/pipelines/pipelines.md
vagimeli Jun 28, 2024
940bdf3
Update expression-syntax.md
vagimeli Jun 28, 2024
1ba7767
Create Functions subpages
vagimeli Jun 28, 2024
1e0b30a
Create functions subpages
vagimeli Jun 28, 2024
a5ae73e
Copy edit
vagimeli Jul 10, 2024
6c37337
Merge branch 'main' into update-index
vagimeli Jul 10, 2024
e279143
Merge branch 'main' into update-index
vagimeli Jul 17, 2024
79d8a34
add remaining subpages
vagimeli Jul 17, 2024
b47b8b4
Update _data-prepper/index.md
hdhalter Jul 30, 2024
c647da6
Apply suggestions from code review
hdhalter Jul 30, 2024
365fa7d
Apply suggestions from code review
hdhalter Jul 30, 2024
1d735a6
Apply suggestions from code review
dlvenable Aug 1, 2024
e0f7743
removed-line
hdhalter Aug 1, 2024
1b1f1f1
Merge branch 'main' into update-index
hdhalter Aug 1, 2024
36bf8b8
Merge branch 'main' into update-index
hdhalter Aug 1, 2024
be987b5
Fixed broken link to pipelines
hdhalter Aug 1, 2024
83b4136
Fixed broken links on Update add-entries.md
hdhalter Aug 1, 2024
6bd246f
Fixed broken link in Update dynamo-db.md
hdhalter Aug 1, 2024
475e202
Fixed link syntax in Update index.md
hdhalter Aug 1, 2024
55 changes: 19 additions & 36 deletions _data-prepper/index.md
@@ -18,42 +18,24 @@ Data Prepper is a server-side data collector capable of filtering, enriching, tr

With Data Prepper you can build custom pipelines to improve the operational view of applications. Two common use cases for Data Prepper are trace analytics and log analytics. [Trace analytics]({{site.url}}{{site.baseurl}}/data-prepper/common-use-cases/trace-analytics/) can help you visualize event flows and identify performance problems. [Log analytics]({{site.url}}{{site.baseurl}}/data-prepper/common-use-cases/log-analytics/) equips you with tools to enhance your search capabilities, conduct comprehensive analysis, and gain insights into your applications' performance and behavior.

## Concepts
## Key concepts and fundamentals

Data Prepper includes one or more **pipelines** that collect and filter data based on the components set within the pipeline. Each component is pluggable, enabling you to use your own custom implementation of each component. These components include the following:
Data Prepper ingests data through customizable [pipelines]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/). These pipelines consist of pluggable components that you can customize to fit your needs, even allowing you to plug in your own implementations. A Data Prepper pipeline consists of the following components:

- One [source](#source)
- One or more [sinks](#sink)
- (Optional) One [buffer](#buffer)
- (Optional) One or more [processors](#processor)
- One [source]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/sources/)
- One or more [sinks]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sinks/sinks/)
- (Optional) One [buffer]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/buffers/buffers/)
- (Optional) One or more [processors]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/processors/)

A single instance of Data Prepper can have one or more pipelines.
Each pipeline contains two required components: `source` and `sink`. If a `buffer`, a `processor`, or both are missing from the pipeline, then Data Prepper uses the default `bounded_blocking` buffer and a no-op processor. Note that a single instance of Data Prepper can have one or more pipelines.

Each pipeline definition contains two required components: **source** and **sink**. If buffers and processors are missing from the Data Prepper pipeline, Data Prepper uses the default buffer and a no-op processor.
## Basic pipeline configurations

### Source
To understand how the pipeline components function within a Data Prepper configuration, see the following examples. Each pipeline configuration uses a `yaml` file format. For more information and examples, see [Pipelines]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/).

Source is the input component that defines the mechanism through which a Data Prepper pipeline will consume events. A pipeline can have only one source. The source can consume events either by receiving the events over HTTP or HTTPS or by reading from external endpoints like OTeL Collector for traces and metrics and Amazon Simple Storage Service (Amazon S3). Sources have their own configuration options based on the format of the events (such as string, JSON, Amazon CloudWatch logs, or open telemetry trace). The source component consumes events and writes them to the buffer component.
### Minimal configuration

### Buffer

The buffer component acts as the layer between the source and the sink. Buffer can be either in-memory or disk based. The default buffer uses an in-memory queue called `bounded_blocking` that is bounded by the number of events. If the buffer component is not explicitly mentioned in the pipeline configuration, Data Prepper uses the default `bounded_blocking`.

### Sink

Sink is the output component that defines the destination(s) to which a Data Prepper pipeline publishes events. A sink destination could be a service, such as OpenSearch or Amazon S3, or another Data Prepper pipeline. When using another Data Prepper pipeline as the sink, you can chain multiple pipelines together based on the needs of the data. Sink contains its own configuration options based on the destination type.

### Processor

Processors are units within the Data Prepper pipeline that can filter, transform, and enrich events using your desired format before publishing the record to the sink component. The processor is not defined in the pipeline configuration; the events publish in the format defined in the source component. You can have more than one processor within a pipeline. When using multiple processors, the processors are run in the order they are defined inside the pipeline specification.

## Sample pipeline configurations

To understand how all pipeline components function within a Data Prepper configuration, see the following examples. Each pipeline configuration uses a `yaml` file format.

### Minimal component

This pipeline configuration reads from the file source and writes to another file in the same path. It uses the default options for the buffer and processor.
The following minimal pipeline configuration reads from the file source and writes the data to another file on the same path. It uses the default options for the `buffer` and `processor` components.

```yml
sample-pipeline:
@@ -65,13 +47,13 @@ sample-pipeline:
path: <path/to/output-file>
```

### All components
### Comprehensive configuration

The following pipeline uses a source that reads string events from the `input-file`. The source then pushes the data to the buffer, bounded by a max size of `1024`. The pipeline is configured to have `4` workers, each of them reading a maximum of `256` events from the buffer for every `100 milliseconds`. Each worker runs the `string_converter` processor and writes the output of the processor to the `output-file`.
The following comprehensive pipeline configuration uses both required and optional components:

```yml
sample-pipeline:
workers: 4 #Number of workers
workers: 4 # Number of workers
delay: 100 # in milliseconds, how often the workers should run
source:
file:
Expand All @@ -88,9 +70,10 @@ sample-pipeline:
path: <path/to/output-file>
```

## Next steps

To get started building your own custom pipelines with Data Prepper, see [Getting started]({{site.url}}{{site.baseurl}}/clients/data-prepper/get-started/).
In this pipeline configuration, the `source` component reads string events from the `input-file` and pushes the data to a bounded buffer with a maximum size of `1024`. The `workers` setting specifies `4` concurrent threads that process events from the buffer, each reading a maximum of `256` events from the buffer every `100` milliseconds. Each worker runs the `string_converter` processor, which converts the strings to uppercase and writes the processed output to the `output-file`.

## Next steps

- [Get started with Data Prepper]({{site.url}}{{site.baseurl}}/data-prepper/getting-started/).
- [Get familiar with Data Prepper pipelines]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/).
- [Explore common use cases]({{site.url}}{{site.baseurl}}/data-prepper/common-use-cases/common-use-cases/).
@@ -103,8 +103,7 @@ check_interval | No | Duration | Specifies the time between checks of the heap s

### Extension plugins

Since Data Prepper 2.5, Data Prepper provides support for user configurable extension plugins. Extension plugins are shared common
configurations shared across pipeline plugins, such as [sources, buffers, processors, and sinks]({{site.url}}{{site.baseurl}}/data-prepper/index/#concepts).
Data Prepper provides support for user-configurable extension plugins. Extension plugins are common configurations shared across pipeline plugins, such as [sources, buffers, processors, and sinks]({{site.url}}{{site.baseurl}}/data-prepper/index/#key-concepts-and-fundamentals).
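As a rough, hypothetical sketch (the extension name and values below are placeholders, and AWS-specific plugins are covered in the next section), an extension might be declared in `data-prepper-config.yaml` as follows:

```yml
extensions:
  aws:
    secrets:
      my-credentials:                # arbitrary name referenced by pipeline plugins
        secret_id: my-secret-name    # placeholder AWS Secrets Manager secret ID
        region: us-east-1
        refresh_interval: PT1H       # refresh the cached secret every hour
```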

### AWS extension plugins

24 changes: 24 additions & 0 deletions _data-prepper/pipelines/cidrcontains.md
@@ -0,0 +1,24 @@
---
layout: default
title: cidrContains()
parent: Functions
grand_parent: Pipelines
nav_order: 5
---

# cidrContains()

Check failure on line 9 in _data-prepper/pipelines/cidrcontains.md (GitHub Actions / vale): [OpenSearch.HeadingCapitalization] 'cidrContains()' is a heading and should be in sentence case.

The `cidrContains()` function is used to check if an IP address is contained within a specified Classless Inter-Domain Routing (CIDR) block or range of CIDR blocks. It accepts two or more arguments:

- The first argument is a JSON pointer, which represents the key or path to the field containing the IP address to be checked. It supports both IPv4 and IPv6 address formats.

- The subsequent arguments are strings representing one or more CIDR blocks or IP address ranges. The function checks if the IP address specified in the first argument matches or is contained within any of these CIDR blocks.

For example, if your data contains an IP address field named `client.ip` and you want to check if it belongs to the CIDR blocks `192.168.0.0/16` or `10.0.0.0/8`, you can use the `cidrContains()` function as follows:

```
cidrContains('/client.ip', '192.168.0.0/16', '10.0.0.0/8')
```
{% include copy-curl.html %}

This function returns `true` if the IP address matches any of the specified CIDR blocks or `false` if it does not.
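To illustrate how this might be used in a pipeline condition, the following sketch assumes a `drop_events` processor and a placeholder `client.ip` field:

```yml
processor:
  - drop_events:
      # Drop any event whose client IP falls outside the allowed private ranges.
      drop_when: "not cidrContains('/client.ip', '192.168.0.0/16', '10.0.0.0/8')"
```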
8 changes: 6 additions & 2 deletions _data-prepper/pipelines/configuration/buffers/buffers.md
@@ -3,9 +3,13 @@ layout: default
title: Buffers
parent: Pipelines
has_children: true
nav_order: 20
nav_order: 30
---

# Buffers

Buffers store data as it passes through the pipeline. If you implement a custom buffer, it can be memory based, which provides better performance, or disk based, which is larger in size.
The `buffer` component acts as an intermediary layer between the `source` and `sink` components in a Data Prepper pipeline. It serves as temporary storage for events, decoupling the `source` from the downstream processors and sinks. Buffers can be either in-memory or disk based.

If not explicitly specified in the pipeline configuration, Data Prepper uses the default `bounded_blocking` buffer, which is an in-memory queue bounded by the number of events it can store. The `bounded_blocking` buffer is a convenient option when the event volume and processing rates are manageable within the available memory constraints.
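For example, a minimal sketch of a pipeline that configures the `bounded_blocking` buffer explicitly might look like the following; the file paths and sizes are placeholders:

```yml
sample-pipeline:
  source:
    file:
      path: <path/to/input-file>
  buffer:
    bounded_blocking:
      buffer_size: 1024  # maximum number of events the buffer can hold
      batch_size: 256    # maximum number of events returned in a single read
  sink:
    - file:
        path: <path/to/output-file>
```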


@@ -21,8 +21,8 @@ You can configure the `add_entries` processor with the following options.
| `metadata_key` | No | The key for the new metadata attribute. The argument must be a literal string key and not a JSON Pointer. Either one string key or `metadata_key` is required. |
| `value` | No | The value of the new entry to be added, which can be used with any of the following data types: strings, Booleans, numbers, null, nested objects, and arrays. |
| `format` | No | A format string to use as the value of the new entry, for example, `${key1}-${key2}`, where `key1` and `key2` are existing keys in the event. Required if neither `value` nor `value_expression` is specified. |
| `value_expression` | No | An expression string to use as the value of the new entry. For example, `/key` is an existing key in the event with a type of either a number, a string, or a Boolean. Expressions can also contain functions returning number/string/integer. For example, `length(/key)` will return the length of the key in the event when the key is a string. For more information about keys, see [Expression syntax](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). |
| `add_when` | No | A [conditional expression](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"'`, that will be evaluated to determine whether the processor will be run on the event. |
| `value_expression` | No | An expression string to use as the value of the new entry. For example, `/key` is an existing key in the event with a type of either a number, a string, or a Boolean. Expressions can also contain functions returning number/string/integer. For example, `length(/key)` will return the length of the key in the event when the key is a string. For more information about keys, see [Expression syntax]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/). |
| `add_when` | No | A [conditional expression]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/), such as `/some-key == "test"'`, that will be evaluated to determine whether the processor will be run on the event. |
| `overwrite_if_key_exists` | No | When set to `true`, the existing value is overwritten if `key` already exists in the event. The default value is `false`. |
| `append_if_key_exists` | No | When set to `true`, the existing value will be appended if a `key` already exists in the event. An array will be created if the existing value is not an array. Default is `false`. |

@@ -135,7 +135,7 @@ When the input event contains the following data:
{"message": "hello"}
```

The processed event will have the same data, with the metadata, `{"length": 5}`, attached. You can subsequently use expressions like `getMetadata("length")` in the pipeline. For more information, see the [`getMetadata` function](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/#getmetadata) documentation.
The processed event will have the same data, with the metadata, `{"length": 5}`, attached. You can subsequently use expressions like `getMetadata("length")` in the pipeline. For more information, see [`getMetadata` function]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/get-metadata/).
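A configuration that attaches this metadata might look something like the following sketch, which assumes the `message` field shown above:

```yml
processor:
  - add_entries:
      entries:
        - metadata_key: "length"
          value_expression: "length(/message)"
```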


### Example: Add a dynamic key
10 changes: 6 additions & 4 deletions _data-prepper/pipelines/configuration/processors/processors.md
@@ -3,12 +3,14 @@ layout: default
title: Processors
has_children: true
parent: Pipelines
nav_order: 25
nav_order: 35
---

# Processors

Processors perform an action on your data, such as filtering, transforming, or enriching.
Processors are components within a Data Prepper pipeline that enable you to filter, transform, and enrich events using your desired format before publishing records to the `sink` component. If no `processor` is defined in the pipeline configuration, then the events are published in the format specified by the `source` component. You can incorporate multiple processors within a single pipeline, and they are executed sequentially as defined in the pipeline.

Prior to Data Prepper 1.3, these components were named *preppers*. In Data Prepper 1.3, the term *prepper* was deprecated in favor of *processor*. In Data Prepper 2.0, the term *prepper* was removed.
{: .note }


Prior to Data Prepper 1.3, processors were named preppers. Starting in Data Prepper 1.3, the term *prepper* is deprecated in favor of the term *processor*. Data Prepper will continue to support the term *prepper* until 2.0, where it will be removed.
{: .note }
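The following sketch illustrates sequential execution with two processors; the plugin options shown are illustrative rather than a complete configuration:

```yml
processor:
  - grok:
      match:
        log: ['%{COMMONAPACHELOG}']  # parse the raw log line into structured fields
  - add_entries:
      entries:
        - key: "parsed_by"
          value: "grok"              # placeholder value added after parsing
```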
16 changes: 9 additions & 7 deletions _data-prepper/pipelines/configuration/sinks/sinks.md
@@ -3,20 +3,22 @@ layout: default
title: Sinks
parent: Pipelines
has_children: true
nav_order: 30
nav_order: 25
---

# Sinks

Sinks define where Data Prepper writes your data to.
A `sink` is an output component that specifies the destination(s) to which a Data Prepper pipeline publishes events. Sink destinations can be services like OpenSearch, Amazon Simple Storage Service (Amazon S3), or even another Data Prepper pipeline, enabling chaining of multiple pipelines. The sink component has the following configurable options that you can use to customize the destination type.

## General options for all sink types
## Configuration options

The following table describes the options that you can use to configure a sink.

Option | Required | Type | Description
:--- | :--- |:------------| :---
routes | No | String list | A list of routes for which this sink applies. If not provided, this sink receives all events. See [conditional routing]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#conditional-routing) for more information.
tags_target_key | No | String | When specified, includes event tags in the output of the provided key.
include_keys | No | String list | When specified, provides the keys in this list in the data sent to the sink. Some codecs and sinks do not allow use of this field.
exclude_keys | No | String list | When specified, excludes the keys given from the data sent to the sink. Some codecs and sinks do not allow use of this field.
`routes` | No | String list | A list of routes to which the sink applies. If not provided, then the sink receives all events. See [conditional routing]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#conditional-routing) for more information.
`tags_target_key` | No | String | When specified, includes event tags in the output under the provided key.
`include_keys` | No | String list | When specified, provides only the listed keys in the data sent to the sink. Some codecs and sinks may not support this field.
`exclude_keys` | No | String list | When specified, excludes the listed keys from the data sent to the sink. Some codecs and sinks may not support this field.

Review comment (Collaborator) on the `tags_target_key` row: "under" => "for"?


@@ -92,7 +92,7 @@ Option | Required | Type | Description

## Exposed metadata attributes

The following metadata will be added to each event that is processed by the `dynamodb` source. These metadata attributes can be accessed using the [expression syntax `getMetadata` function](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/#getmetadata).
The following metadata will be added to each event that is processed by the `dynamodb` source. These metadata attributes can be accessed using the [expression syntax `getMetadata` function]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/get-metadata/).

* `primary_key`: The primary key of the DynamoDB item. For tables that only contain a partition key, this value provides the partition key. For tables that contain both a partition and sort key, the `primary_key` attribute will be equal to the partition and sort key, separated by a `|`, for example, `partition_key|sort_key`.
* `partition_key`: The partition key of the DynamoDB item.
6 changes: 4 additions & 2 deletions _data-prepper/pipelines/configuration/sources/sources.md
@@ -3,9 +3,11 @@ layout: default
title: Sources
parent: Pipelines
has_children: true
nav_order: 15
nav_order: 20
---

# Sources

Sources define where your data comes from within a Data Prepper pipeline.
A `source` is an input component that specifies how a Data Prepper pipeline ingests events. Each pipeline has a single source that either receives events over HTTP(S) or reads from external endpoints, such as OpenTelemetry Collector or Amazon Simple Storage Service (Amazon S3). Sources have configurable options based on the event format (string, JSON, Amazon CloudWatch logs, or OpenTelemetry traces). The source consumes events and passes them to the [`buffer`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/buffers/buffers/) component.
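For example, a minimal sketch of an HTTP-based source might look like the following; the port and path shown are illustrative:

```yml
source:
  http:
    port: 2021          # port on which the source listens for events
    path: /log/ingest   # URI path that accepts POSTed log events
```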


36 changes: 36 additions & 0 deletions _data-prepper/pipelines/contains.md
@@ -0,0 +1,36 @@
---
layout: default
title: contains()
parent: Functions
grand_parent: Pipelines
nav_order: 10
---

# contains()

Check failure on line 9 in _data-prepper/pipelines/contains.md (GitHub Actions / vale): [OpenSearch.HeadingCapitalization] 'contains()' is a heading and should be in sentence case.

The `contains()` function is used to check if a substring exists within a given string or the value of a field in an event. It takes two arguments:

- The first argument is either a literal string or a JSON pointer that represents the field or value to be searched.

- The second argument is the substring to be searched for within the first argument.

The function returns `true` if the substring specified in the second argument is found within the string or field value represented by the first argument. It returns `false` if it is not.

For example, if you want to check if the string `"abcd"` is contained within the value of a field named `message`, you can use the `contains()` function as follows:

```
contains('/message', 'abcd')
```
{% include copy-curl.html %}

This will return `true` if the field `message` contains the substring `abcd` or `false` if it does not.

Alternatively, you can also use a literal string as the first argument:

```
contains('This is a test message', 'test')
```
{% include copy-curl.html %}

In this case, the function will return `true` because the substring `test` is present within the string `This is a test message`.

Note that the `contains()` function performs a case-sensitive search by default. If you need to perform a case-insensitive search, you can use the `containsIgnoreCase()` function instead.
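Because `contains()` returns a Boolean, it can also be used in conditional options such as `add_when`. The following sketch assumes an `add_entries` processor and placeholder field and key names:

```yml
processor:
  - add_entries:
      entries:
        - key: "is_test_event"
          value: true
          add_when: "contains('/message', 'test')"
```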