Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fetch Migration] Monitoring script for Data Prepper #264

Merged
merged 2 commits into from
Aug 31, 2023

Conversation

kartg
Copy link
Member

@kartg kartg commented Aug 15, 2023

Description

This change adds a monitor.py Python script that can monitor the running Data Prepper pipeline and shut it down once the target document count has been reached (as reported by Data Prepper's documentsSuccess Prometheus metric) and an idle pipeline is detected.

  • Category: New feature

Testing

Unit Tests

$ python -m coverage report --omit "*/tests/*"
Name                  Stmts   Miss  Cover
-----------------------------------------
endpoint_info.py          6      0   100%
index_operations.py      35      0   100%
main.py                 121      1    99%
monitor.py               52      0   100%
utils.py                 13      0   100%
-----------------------------------------
TOTAL                   227      1    99%

Integration Test

  1. Setup step (see the Index Configuration Tool readme):
python index_configuration_tool/main.py -r <input_pipeline_yaml> <output_yaml_path>
  1. Data Prepper container kickoff:
docker run -p 4900:4900 -v <output_yaml_path>:/usr/share/data-prepper/pipelines/pipelines.yaml opensearch-data-prepper:2.4.0-SNAPSHOT
  1. After Data Prepper server has started, monitoring process kickoff:
python index_configuration_tool/monitor.py https://localhost:4900 <target_doc_count_from_report>
  1. Verified that the monitoring process invoked the /shutdown endpoint once the target doc count was reached

Limitations

  • Though the monitoring script correctly stops the Data Prepper server, the Docker container does not terminate. The current theory is that this is due to [BUG] Docker container does not stop running after Data Prepper shuts down data-prepper#3141
  • Currently the endpoint auth information is hard-coded. The fix for this is tracked in JIRA task MIGRATIONS-1285
  • The monitoring script only tracks Data Prepper metrics to identify shutdown state. It does not check the target cluster to validate that the target document count has been reached. Since index readers need to refresh to make the newly ingested documents searchable, there is often a delay (~5 minutes from my local testing) in the document counts reflecting in the _cat/indices count of the target cluster. Handling this scenario is tracked in JIRA task MIGRATIONS-1228

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov
Copy link

codecov bot commented Aug 15, 2023

Codecov Report

Merging #264 (361edef) into main (92405b8) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main     #264   +/-   ##
=========================================
  Coverage     62.14%   62.14%           
  Complexity      619      619           
=========================================
  Files            82       82           
  Lines          3141     3141           
  Branches        292      292           
=========================================
  Hits           1952     1952           
  Misses         1013     1013           
  Partials        176      176           
Flag Coverage Δ
unittests 62.14% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

@kartg kartg force-pushed the fetch-migration-monitoring branch from 39de240 to 92c8ce7 Compare August 16, 2023 18:51
@kartg kartg marked this pull request as ready for review August 16, 2023 18:56
@kartg kartg force-pushed the fetch-migration-monitoring branch 3 times, most recently from e58442a to a35e6d9 Compare August 18, 2023 18:42
@kartg
Copy link
Member Author

kartg commented Aug 18, 2023

The codecov diff hit seems inaccurate since it's looking at Java code which isn't a part of this change.

EDIT: Resolved after rebase on 8/21

@kartg kartg force-pushed the fetch-migration-monitoring branch from a35e6d9 to 6928d2c Compare August 21, 2023 16:28
Copy link
Collaborator

@mikaylathompson mikaylathompson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Biggest issue is around error handling -- if any of the api calls fail, even transiently, the entire thing will crash.

FetchMigration/index_configuration_tool/monitor.py Outdated Show resolved Hide resolved
prev_no_partitions_count = 0
terminal = False
target_doc_count = int(args.target_count)
while not terminal:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How high is the risk that a doc fails to be written for whatever reason, and we're stuck here forever because doc_count is never >= target? Should we also have a timeout value that if doc_count doesn't change for X seconds/minutes, we give up?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a good callout. I'll capture this as a follow up item in JIRA - https://opensearch.atlassian.net/browse/MIGRATIONS-1303

FetchMigration/index_configuration_tool/monitor.py Outdated Show resolved Hide resolved
@kartg kartg force-pushed the fetch-migration-monitoring branch from 6928d2c to e4a5a9f Compare August 23, 2023 22:39
This change adds a monitor.py Python script that can monitor the running Data Prepper pipeline and shut it down once the target document count has been reached (as reported by Data Prepper's documentsSuccess Prometheus metric) and an idle pipeline is detected.

Signed-off-by: Kartik Ganesh <[email protected]>
@kartg kartg force-pushed the fetch-migration-monitoring branch from e4a5a9f to 31ea718 Compare August 30, 2023 19:51
@kartg kartg force-pushed the fetch-migration-monitoring branch from dc480fa to 361edef Compare August 30, 2023 21:19
@kartg kartg merged commit 81f8a99 into opensearch-project:main Aug 31, 2023
6 checks passed
@kartg kartg deleted the fetch-migration-monitoring branch August 31, 2023 17:42
gregschohn added a commit to gregschohn/opensearch-migrations that referenced this pull request Sep 5, 2023
* main:
  [Fetch Migration] Monitoring script for Data Prepper (opensearch-project#264)
  Refactor secrets cache to sdk client (opensearch-project#278)
  Update humanReadableLogs script
  Update primary->source and shadow->target and documentation
  Add shadowRequest and connectionId to tuple
  Update tests for CRLF
  Use CRLF in NettyJsonToByteBufHandler

Signed-off-by: Greg Schohn <[email protected]>

# Conflicts:
#	TrafficCapture/trafficCaptureProxyServer/src/test/java/org/opensearch/migrations/trafficcapture/proxyserver/netty/ExpiringSubstitutableItemPoolTest.java
#	TrafficCapture/trafficReplayer/build.gradle
#	TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/SourceTargetCaptureTuple.java
#	TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/TrafficReplayer.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants