
[BUG] Processing A Nested List as Individual Log Events #5015

Open
Conklin-Spencer-bah opened this issue Oct 2, 2024 · 3 comments
Labels
bug Something isn't working untriaged

Comments

@Conklin-Spencer-bah

Describe the bug
It may be possible to do this, but I have not been able to find anything in the documentation that covers it.

If you have CloudWatch Logs -> Data Firehose -> S3 and pull that into Data Prepper, it ingests the record as a single multi-event message.

The structure seems to be like so:

  {
    "messageType": "DATA_MESSAGE",
    "owner": "123456789",
    "logGroup": "foo",
    "logStream": "bar",
    "logEvents": [
      {"id": "123456", "message": "some log message here", "timestamp": 1727880215114},
      {"id": "789102", "message": "another log message here", "timestamp": 1727880215114},
      {"id": "99999", "message": "yet another log message here", "timestamp": 1727880215114}
    ]
  }

What I was hoping to do was use Data Prepper to read the log message from S3 (like the one above), parse out "logEvents", and treat each entry as an individual log message to publish to both S3 and OpenSearch.

S3, because it will allow me to create a neat prefix structure of accountid/log-group/YYYY/MM/DD/HH.

However, I am not sure it is possible to extract the "logEvents" key, which contains an array of objects, and treat each entry as a separate event.
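For illustration, the desired fan-out can be sketched in plain Python (this is not a Data Prepper API; the `fan_out_log_events` helper is hypothetical and just shows the intended semantics):

```python
import json

def fan_out_log_events(record: str) -> list:
    """Split one CloudWatch Logs / Firehose record into one document
    per entry of logEvents, carrying the envelope fields along.

    Hypothetical helper for illustration only.
    """
    doc = json.loads(record)
    # Envelope fields (owner, logGroup, logStream, ...) minus the array itself.
    common = {k: v for k, v in doc.items() if k != "logEvents"}
    # Each array entry becomes its own document, merged with the envelope.
    return [{**common, **event} for event in doc["logEvents"]]

record = json.dumps({
    "messageType": "DATA_MESSAGE",
    "owner": "123456789",
    "logGroup": "foo",
    "logStream": "bar",
    "logEvents": [
        {"id": "123456", "message": "some log message here", "timestamp": 1727880215114},
        {"id": "789102", "message": "another log message here", "timestamp": 1727880215114},
    ],
})

events = fan_out_log_events(record)
print(len(events))  # 2
```

Each resulting document could then be routed to an S3 prefix built from its own `owner` and `logGroup` fields.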

To Reproduce
Steps to reproduce the behavior:

  1. Create a fake log event like the one above.
  2. Write it to Data Prepper.
  3. Try to parse it.

Expected behavior
I was expecting a feature within Data Prepper to support something like this:

  processor:
    - parse_json:
    - split_string:
        entries:
          - source: "/logEvents[0]"
            delimiter: ","

Environment (please complete the following information):

  • OS: macOS

Additional context
AWS Managed OpenSearch and AWS Managed OSIS are being used. I set up a local container deployment to expedite testing and still see the same issue.

@Conklin-Spencer-bah Conklin-Spencer-bah added bug Something isn't working untriaged labels Oct 2, 2024

JunChatani commented Oct 4, 2024

Hey there, we also stumbled upon this issue and found it odd that you cannot split simple JSON arrays into multiple documents.

I worked around this by using the following hack:

  1. parse the json key logEvents
  2. stringify the key again
  3. manipulate this stringified key:
    a) remove the Array characters in the beginning
    b) remove the Array character at the end
    c) Insert a unique delimiter character
  4. split event by this delimiter character

Then process these split events with a chained pipeline.
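The string manipulation in steps 3–4 can be sketched in plain Python (the ␟ delimiter and the `},{"id` anchor are from the comment above; this only illustrates the mechanics, not Data Prepper's processors):

```python
import re

# Stringified logEvents array, as produced by step 2 (sample input).
log_events_json = '[{"id": "1", "message": "a"},{"id": "2", "message": "b"}]'

# 3a/3b: remove the array brackets at the beginning and the end.
s = re.sub(r'^\[\{', '{', log_events_json)
s = re.sub(r'\}\]$', '}', s)
# 3c: insert a unique delimiter (U+241F) between adjacent objects,
# anchoring on the "id" key to avoid matching "},{" inside a message.
s = re.sub(r'\},\{"id', '}\u241f{"id', s)
# 4: split the event on that delimiter.
parts = s.split('\u241f')
print(parts)  # ['{"id": "1", "message": "a"}', '{"id": "2", "message": "b"}']
```

Each part is then a standalone JSON object that the chained pipeline can parse with `parse_json`.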

I am on mobile so my apologies for the formatting:

  preprocess_pipeline:
    source:
      s3:
        # (bucket/codec details omitted in the original comment)
    processor:
      - parse_json:
          pointer: "logEvents"
      - write_json:
          source: "logEvents"
          target: "logEventsJson"
      - substitute_string:
          entries:
            - source: "logEventsJson"
              from: '\[\{"id'
              to: '{"id'
      - substitute_string:
          entries:
            - source: "logEventsJson"
              from: 'n"\}\]'
              to: 'n"\}'
      - substitute_string:
          entries:
            - source: "logEventsJson"
              from: '\},\{"id'
              to: '}␟{"id'
      - split_event:
          field: logEventsJson
          delimiter_regex: "␟"
    sink:
      - pipeline:
          name: message_process_pipeline

  message_process_pipeline:
    source:
      pipeline:
        name: preprocess_pipeline
    processor:
      - parse_json:
          source: "logEventsJson"
    sink: …

We asked for the AWS service team to support splitting json arrays into multiple docs, but this hack seems to work for us now.

Edit:
formatting seems to remove the backslashes, will try to adjust formatting but I hope you get the idea

oeyh (Collaborator) commented Oct 4, 2024

@JunChatani Nice workaround! Thanks!

> We asked for the AWS service team to support splitting json arrays into multiple docs, but this hack seems to work for us now.

I agree that split_event should support splitting events on JSON arrays. Have you opened a GitHub issue for this already? If not, we can use this issue to track it.

@JunChatani

I haven’t opened an issue yet, perhaps this one can be used to track it then.
