
Add script to transform tuples to human readable format #274

Merged

Conversation

mikaylathompson (Collaborator)

@mikaylathompson mikaylathompson commented Aug 20, 2023

Signed-off-by: Mikayla Thompson [email protected]

Description

This adds a new feature as a follow-up to #266 and #258.
Those two PRs created a new EFS/shared volume attached to the replayer and migration console, and directed the replayer's output tuples to that volume.

The Replayer outputs base64-encoded message bodies in the tuples: this prevents issues where saving a gzipped message body as UTF-8 in a JSON document was corrupting it.

This PR adds a Python script to the migration console that allows the user to specify a file of replayer-generated tuple logs. It loads each log line, extracts the tuple, and processes it:

  • the message bodies are base64-decoded
  • if the headers (Content-Encoding) indicate that the message is gzipped, it's un-gzipped
  • if the headers (Content-Type) indicate that the message is JSON, it's loaded as such
  • if the URI indicates that the message is a _bulk request, it's loaded as bulk JSON (a list of JSON objects)

The processed tuples are then written to a new file. If an output file wasn't specified, the output path is the input path with the file name prefixed with `readable-`.

The script takes a very tolerant approach to errors. If anything fails (e.g. a body can't be base64-decoded, un-gzipped, or parsed as JSON), it reports the error and line number, but still writes the tuple to the output file in whatever most recent state it was able to reach: if the body can't be base64-decoded, the original tuple is written; if it can't be un-gzipped, the base64-decoded version is written (see the sketch below). This is my best attempt to ensure that we tolerate weird data (which could be the OpenSearch cluster messing up its headers, for instance) and don't quietly hide it from a user's future analyses.
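A minimal sketch of that decode-with-fallback chain (the function name `decode_body` and its exact signature are illustrative, not the script's actual helpers):

```python
import base64
import gzip
import json
from typing import Optional


def decode_body(encoded_body: str, content_encoding: Optional[str],
                content_type: Optional[str], is_bulk: bool):
    """Decode a tuple body, keeping the result of the last stage that succeeded."""
    value = encoded_body
    try:
        value = base64.b64decode(encoded_body)               # stage 1: base64-decode
        if content_encoding and "gzip" in content_encoding.lower():
            value = gzip.decompress(value)                   # stage 2: gunzip if Content-Encoding says so
        value = value.decode("utf-8")                        # stage 3: decode bytes to text
        if content_type and "json" in content_type.lower():
            if is_bulk:
                # _bulk bodies are newline-delimited JSON: one document per line
                value = [json.loads(line) for line in value.splitlines() if line.strip()]
            else:
                value = json.loads(value)                    # stage 4: parse the JSON body
    except Exception as e:
        # On any failure, fall back to whatever the most recent successful stage produced
        print(f"warning: body could not be fully decoded: {e}")
    return value
```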

There are many ways this could be improved in the future:

  • operating on an entire directory
  • accepting/outputting streaming data
  • more sophisticated and customizable parsing
  • watching a directory and automatically running on each new file

but it is inherently a short-term tool, and I took a minimal but opinionated approach to giving the user a tool that allows them to make use of this data.

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-1260

Testing

Manual.

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@@ -4,10 +4,12 @@ ENV DEBIAN_FRONTEND noninteractive

RUN apt-get update && \
apt-get install -y --no-install-recommends python3.9 python3-pip python3-dev gcc libc-dev git curl && \
pip3 install opensearch-benchmark
Collaborator Author

Specifying these versions resolves a versioning conflict that was happening in some installs (I think it was all new installs, and my local just had a cached version that happened to work, but I'm not positive why it was mixed). The newest opensearch-benchmark (1.1.0) has a dependency that does not work with the newest version of urllib3 (1.26.x) for this version of Python, and pip wasn't detecting/resolving the issue itself. Locking to these versions fixes it.


codecov bot commented Aug 20, 2023

Codecov Report

Merging #274 (c6fb8da) into main (dd17251) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main     #274   +/-   ##
=========================================
  Coverage     61.41%   61.41%           
  Complexity      593      593           
=========================================
  Files            74       74           
  Lines          2991     2991           
  Branches        279      279           
=========================================
  Hits           1837     1837           
  Misses          986      986           
  Partials        168      168           
Flag Coverage Δ
unittests 61.41% <ø> (ø)

Flags with carried forward coverage won't be shown.

@sumobrian sumobrian (Collaborator) left a comment

The output isn't exactly JSON, but it is human-readable. An optimization would be to output valid JSON. I see two errors in the example I've created below:

  1. Single quotes are used rather than double quotes for keys and values.
  2. Boolean values are emitted as Python's True/False rather than as valid JSON true/false.

This example demonstrates these issues:

root@b4726b93fbf0:~# jq '.' /shared-replayer-output/tuples.log
{
  "instant": {
    "epochSecond": 1692557025,
    "nanoOfSecond": 663395000
  },
  "thread": "main",
  "level": "INFO",
  "loggerName": "OutputTupleJsonLogger",
  "message": "{\"request\":{\"Authorization\":\"Basic YWRtaW46YWRtaW4=\",\"Accept\":\"*/*\",\"User-Agent\":\"curl/7.88.1\",\"Request-URI\":\"/bjp-0820-1343\",\"Host\":\"localhost:9200\",\"Method\":\"PUT\",\"HTTP-Version\":\"HTTP/1.1\",\"body\":\"\"},\"primaryResponse\":{\"content-length\":\"72\",\"content-type\":\"application/json; charset=UTF-8\",\"response_time_ms\":612,\"HTTP-Version\":\"HTTP/1.1\",\"Status-Code\":\"200\",\"body\":\"eyJhY2tub3dsZWRnZWQiOnRydWUsInNoYXJkc19hY2tub3dsZWRnZWQiOnRydWUsImluZGV4IjoiYmpwLTA4MjAtMTM0MyJ9\",\"Reason-Phrase\":\"OK\"},\"shadowResponse\":{\"content-length\":\"72\",\"content-type\":\"application/json; charset=UTF-8\",\"response_time_ms\":608,\"HTTP-Version\":\"HTTP/1.1\",\"Status-Code\":\"200\",\"body\":\"eyJhY2tub3dsZWRnZWQiOnRydWUsInNoYXJkc19hY2tub3dsZWRnZWQiOnRydWUsImluZGV4IjoiYmpwLTA4MjAtMTM0MyJ9\",\"Reason-Phrase\":\"OK\"}}",
  "endOfBatch": false,
  "loggerFqcn": "org.apache.logging.log4j.spi.AbstractLogger",
  "threadId": 1,
  "threadPriority": 5
}
root@b4726b93fbf0:~# ./humanReadableLogs.py /shared-replayer-output/tuples.log --outfile humanreadable.json
Input file: /shared-replayer-output/tuples.log; Output file: humanreadable.json
1it [00:00, 3795.75it/s]
root@b4726b93fbf0:~# jq '.' humanreadable.json
parse error: Invalid numeric literal at line 1, column 11
root@b4726b93fbf0:~# sed -i s/\'/\"/g humanreadable.json
root@b4726b93fbf0:~# jq '.' humanreadable.json
parse error: Invalid numeric literal at line 1, column 419
root@b4726b93fbf0:~# cat humanreadable.json | cut -c 400-440
acknowledged": True, "shards_acknowledged
root@b4726b93fbf0:~# sed -i s/True/\"true\"/g humanreadable.json
root@b4726b93fbf0:~# cat humanreadable.json | cut -c 400-440
acknowledged": "true", "shards_acknowledg
root@b4726b93fbf0:~# jq '.' humanreadable.json
{
  "request": {
    "Authorization": "Basic YWRtaW46YWRtaW4=",
    "Accept": "*/*",
    "User-Agent": "curl/7.88.1",
    "Request-URI": "/bjp-0820-1343",
    "Host": "localhost:9200",
    "Method": "PUT",
    "HTTP-Version": "HTTP/1.1",
    "body": ""
  },
  "primaryResponse": {
    "content-length": "72",
    "content-type": "application/json; charset=UTF-8",
    "response_time_ms": 612,
    "HTTP-Version": "HTTP/1.1",
    "Status-Code": "200",
    "body": {
      "acknowledged": "true",
      "shards_acknowledged": "true",
      "index": "bjp-0820-1343"
    },
    "Reason-Phrase": "OK"
  },
  "shadowResponse": {
    "content-length": "72",
    "content-type": "application/json; charset=UTF-8",
    "response_time_ms": 608,
    "HTTP-Version": "HTTP/1.1",
    "Status-Code": "200",
    "body": {
      "acknowledged": "true",
      "shards_acknowledged": "true",
      "index": "bjp-0820-1343"
    },
    "Reason-Phrase": "OK"
  }
}


The data generated from the replayer is stored on an Elastic File System volume shared between the Replayer and Migration Console.
It is mounted to the Migration Console at the path `/shared_replayer_output`. The Replayer generates files named `output_tuples.log`.
These files are rolled over as they hit 10 Mb to a series of `output_tuples-%d{yyyy-MM-dd-HH:mm}.log` files.
Collaborator

Mb should be MB.

TrafficCapture/README.md (outdated, resolved)
@mikaylathompson (Collaborator Author)

To your comment about non-valid JSON: you're totally right, I'm just printing dicts instead of writing them back out as JSON. Fixing now and will rerun your test case.
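For illustration (not the actual fix), the difference between printing a Python dict and serializing it with `json.dumps`:

```python
import json

tuple_dict = {"acknowledged": True, "index": "bjp-0820-1343"}

print(str(tuple_dict))         # {'acknowledged': True, 'index': 'bjp-0820-1343'} -- not valid JSON
print(json.dumps(tuple_dict))  # {"acknowledged": true, "index": "bjp-0820-1343"} -- valid JSON
```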

Signed-off-by: Mikayla Thompson <[email protected]>
@sumobrian sumobrian merged commit e83ffe8 into opensearch-project:main Aug 21, 2023
6 checks passed
The body of the messages is sometimes gzipped which makes it difficult to represent as text in a JSON. Therefore, the body field of all requests
and responses is base64 encoded before it is logged. This makes the files stable, but not human-readable.

We have provided a utility script that can parse these files and output them to a human-readable format: the bodies are
Collaborator

nit: rephrase to remove "We". In the future, "we" won't be relevant for any readers.

--outfile OUTFILE Path for output human readable tuple file.

# By default, the output file is the same path as the input file, but the file name is prefixed with `readable-`.
$ ./humanReadableLogs.py /shared_replayer_output/tuples.log
Collaborator

nit: humanReadableLogs sounds like the name I'd give the output files from this program. I'd like to see some sort of verb - "makeLogsHumanReadable", "decodeResultJson", "prettyPrintReplayResults" (my preferred, since this isn't for logs but, effectively, for the primary output, right?)

Comment on lines 9 to +10
COPY runTestBenchmarks.sh /root/
RUN chmod ugo+x /root/runTestBenchmarks.sh
COPY humanReadableLogs.py /root/
Collaborator

nit, these could be on one line. Same for the next pair. It shrinks the number of layers and makes it easier to read through docker reports.
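For example (a sketch of the suggestion, not the merged change), the copies could share one layer, with the chmod for both scripts combined as well:

```dockerfile
COPY runTestBenchmarks.sh humanReadableLogs.py /root/
RUN chmod ugo+x /root/runTestBenchmarks.sh /root/humanReadableLogs.py
```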

Comment on lines +18 to +19
# TODO: I'm not positive about the capitalization of the Content-Encoding and Content-Type headers.
# This version worked on my test cases, but not guaranteed to work in all cases.
Collaborator

HTTP spec says that header names are case-insensitive. The replayer (& netty's HTTP classes) make special accommodations to support that, so you'll need to be prepared for any crazy kind of outPUT.
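One way to handle that (a sketch, not the script's actual code) is to normalize header names before looking them up:

```python
from typing import Mapping, Optional


def get_header(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Find an HTTP header by name, ignoring case (header names are case-insensitive per the HTTP spec)."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return lowered.get(name.lower())


# e.g. get_header(request, "content-encoding") matches "Content-Encoding", "CONTENT-ENCODING", etc.
```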

def parse_body_value(raw_value: str, content_encoding: Optional[str],
                     content_type: Optional[str], is_bulk: bool, line_no: int):
    try:
        b64decoded = base64.b64decode(raw_value)
Collaborator

nit: I'd call this something like encodedValue or originalValue. To me, the raw value means that there's been no encoding & clearly, that isn't the case here

    except Exception as e:
        logger.error(f"Body value on line {line_no} could not be decoded to utf-8: {e}. "
                     "Skipping parsing body value.")
        return unzipped
Collaborator

Should you prepend what you're returning with how far you got (for all return statements)? It could be very confusing to a user to see binary - and then not know if it was improperly gunzipped, not decompressed at all, or if the charset was wrong, or if it was a PNG, etc.
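A hypothetical sketch of that idea, returning how far decoding got alongside the value (names here are illustrative, not the script's):

```python
import base64
import gzip


def decode_with_stage(encoded_value: str):
    """Return the decoded body plus the last stage that succeeded, so binary output is explainable."""
    stage, value = "original", encoded_value
    try:
        value = base64.b64decode(encoded_value)
        stage = "base64-decoded"
        value = gzip.decompress(value)
        stage = "gunzipped"
        value = value.decode("utf-8")
        stage = "utf-8-decoded"
    except Exception:
        pass  # keep the last successful stage and value
    return {"decoding_stage": stage, "body": value}
```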

if is_json and len(decoded) > 0:
    if is_bulk:
        try:
            return [json.loads(line) for line in decoded.splitlines()]
Collaborator

Do we expect that bulk json has one json doc on each line? Is that how ES/OS work? That seems pretty awful.
Aren't newlines normally considered just whitespace within JSON?
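For context, the _bulk API does use newline-delimited JSON (NDJSON): each action line and each document sits on its own line, which is why a line-by-line parse is appropriate here. A representative body:

```
{ "index": { "_index": "bjp-0820-1343", "_id": "1" } }
{ "field1": "value1" }
{ "index": { "_index": "bjp-0820-1343", "_id": "2" } }
{ "field1": "value2" }
```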

message = item[LOG_JSON_TUPLE_FIELD]
tuple = json.loads(message)
try:
    is_bulk_path = BULK_URI_PATH in get_element(URI_PATH, tuple, raise_on_error=True)
Collaborator

Is it possible for "_bulk" to be in another part of a path and this not be a bulk request? Are query params included in the path?
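A stricter check (a hypothetical sketch, not the script's current logic) could strip any query string and match only whole path segments:

```python
from urllib.parse import urlparse


def is_bulk_request(uri: str) -> bool:
    # Drop the query string, then require "_bulk" to be a literal path segment
    path = urlparse(uri).path
    return "_bulk" in path.split("/")
```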

Comment on lines +141 to +142
logging.basicConfig(level=logging.INFO)
with logging_redirect_tqdm():
Collaborator

What do these lines do?
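For reference, `logging_redirect_tqdm` comes from `tqdm.contrib.logging`; it temporarily routes standard `logging` output through `tqdm.write` so log lines don't mangle the progress bar. A minimal illustration (not from this PR):

```python
import logging
import time

from tqdm import tqdm
from tqdm.contrib.logging import logging_redirect_tqdm

logging.basicConfig(level=logging.INFO)

# Without the redirect, log lines emitted mid-loop would visually break the tqdm progress bar.
with logging_redirect_tqdm():
    for i in tqdm(range(5)):
        logging.info("processing item %d", i)
        time.sleep(0.1)
```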

usage: humanReadableLogs.py [-h] [--outfile OUTFILE] infile

positional arguments:
infile Path to input logged tuple file.
Collaborator

Why doesn't this work on stdin/stdout? Customers will very likely be compressing old (or all) files. Expanding them first to files could be 100-1K times slower than piping them, not to mention much more difficult to manage the space for.
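A sketch of how stdin/stdout support might look with argparse (hypothetical, not what the script currently does), so compressed files could be piped in with e.g. `zcat old-tuples.log.gz | ./humanReadableLogs.py -`:

```python
import argparse
import sys

parser = argparse.ArgumentParser()
# "-" tells FileType to use stdin/stdout, so the tool can sit in a pipeline
parser.add_argument("infile", nargs="?", type=argparse.FileType("r"), default=sys.stdin)
parser.add_argument("--outfile", type=argparse.FileType("w"), default=sys.stdout)
args = parser.parse_args()

for line in args.infile:
    args.outfile.write(line)  # placeholder: the real tool would transform each tuple here
```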
