-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
multi-category support, rate limiting, and pagination #121
Conversation
EntityColdStartWorker used COLD_ENTITY_QUEUE_CONCURRENCY instead of ENTITY_COLD_START_QUEUE_CONCURRENCY. This PR fixes the bug. This PR also made changes to AnomalyDetectorSettings.java by * removing unused setting COLD_ENTITY_QUEUE_CONCURRENCY * renaming setting from entity_coldstart to entity_cold_start to be consistent Testing done: 1. Added unit tests for the changed code.
We schedule the next run if there are more requests than allowed in a single batch. Previously, we used filteredRequests.size > batchSize to decide whether a next run is required or not. But filteredRequests may be low despite there are still a lot of requests in the queue. This PR fixes the bug by using requestSize > batchSize instead. This PR also changes the return type of pullRequests as we don’t use its return value. Testing done: 1. Added unit tests for the changed code.
The stats API broadcasts requests to all nodes and renders node responses using toXContent. For the local node, the stats API's calls toXContent on the node response directly. For remote node, the coordinating node gets a serialized content from ADStatsNodeResponse.writeTo, deserializes the content and renders the result using toXContent. ADStatsNodeResponse.writeTo uses StreamOutput::writeGenericValue that only recognizes built-in types (e.g., String or primitive types). It throws exceptions for types such as java.time.Instant and Entity. This PR fixes the bug by converting non-built-in types to built-in types in node stats response. This PR also removes model id from Entity.toString as it is random across different runs and keeps only attributes. This can help locate the same node in hash ring. Testing done: * Verified stats API response on a multi-node cluster for a multiple-category field detector. * Added unit tests.
…ing once Currently, I prune overflow requests in multiple loop iterations. This PR calculates how many requests we should prune directly to avoid this while loop. This PR also: *Put the hardcoded worker name to variables for code reuse. *Fixed a bug in RateLimitedRequestWorker’s constructor that doesn’t assign variables in the setting update consumer. Testing done: * added unit tests for related changes.
PriorityTracker uses a ConcurrentSkipListSet to record priorities/frequencies of entities. I didn’t realize that ConcurrentSkipListSet.first() and ConcurrentSkipListSet.last() method throws exceptions when the set is empty. This PR adds an empty check. Also, we want PriorityCache.selectUpdateCandidate to be side-effect free. Thus, this PR replaces the computeBufferIfAbsent method call with activeEnities.get as computeBufferIfAbsent has side effects by creating a CacheBuffer. Testing done: * created unit tests for related changes.
We throw an EndRunException (endNow = true) whenever there is an SearchPhaseExecutionException. EndRunException (endNow = true) stops a detector immediately. But these query exceptions can happen during the starting phase of the OpenSearch process. To confirm the EndRunException is not a false positive, this PR fixed that by setting the stopNow parameter of these EndRunException to false. Testing done: * created unit tests for related changes.
|
||
if (entityName != null && entityValue != null) { | ||
// single-stream profile request: | ||
// GET _opendistro/_anomaly_detection/detectors/<detectorId>/_profile/init_progress?category_field=<field-name>&entity=<value> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: _opendistro
-> _plugins
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
return Entity.createSingleAttributeEntity(detectorId, entityName, entityValue); | ||
} else if (request.hasContent()) { | ||
/* HCAD profile request: | ||
* GET _opendistro/_anomaly_detection/detectors/<detectorId>/_profile/init_progress |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: _opendistro
-> _plugins
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Some high level question which not covered in previous PR review. Current design will dispatch entities to different work nodes with consistent hashing. Each work node will cache entities in queue and run RCF model, then bulk write anomaly result to index. If current detection interval timeout, we just drop all remaining entities in queue to avoid carry over to next interval. It's possible that some worker node may receive more entities which needs to run longer like need to load checkpoints. So the unbalanced load may cause some worker node drops more entities. But seems not a big concern as we don't guarantee all entities will be detected, there are other reasons can cause dropping entities like worker node queue is full. Have some concern about the horizontal scalability. For example, if user scale cluster from 10 data nodes to 100 data nodes, then will have 100 worker nodes running entities, that means we may send out 100 bulk write AD result requests concurrently. But the AD result index shard setting is still the same, that may bring higher write load and user's general bulk write request may impacted. Confirm with PM/SDM if the concern of horizontal scalable issue is reasonable. |
Will confirm with Sean about the horizontal scalability issue. Result write is not problem, as we are gonna drop results when the index pressure is high. But requests like reading checkpoints are impacted when |
Codecov Report
@@ Coverage Diff @@
## main #121 +/- ##
============================================
- Coverage 79.30% 76.09% -3.21%
- Complexity 2692 2884 +192
============================================
Files 243 264 +21
Lines 11132 12435 +1303
Branches 1014 1222 +208
============================================
+ Hits 8828 9463 +635
- Misses 1888 2455 +567
- Partials 416 517 +101
Flags with carried forward coverage won't be shown. Click here to find out more.
|
@ylwu-amzn, asked Sean. He said creating an issue to track #125 and don't block code commit. |
@ylwu-amzn, please also comment on the issue I created with solutions. |
…ject#121) This PR is a conglomerate of the following PRs. opensearch-project#60 opensearch-project#64 opensearch-project#65 opensearch-project#67 opensearch-project#68 opensearch-project#69 opensearch-project#70 opensearch-project#71 opensearch-project#74 opensearch-project#75 opensearch-project#76 opensearch-project#77 opensearch-project#78 opensearch-project#79 opensearch-project#82 opensearch-project#83 opensearch-project#84 opensearch-project#92 opensearch-project#94 opensearch-project#93 opensearch-project#95 kaituo#1 kaituo#2 kaituo#3 kaituo#4 kaituo#5 kaituo#6 kaituo#7 kaituo#8 kaituo#9 kaituo#10 This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included): https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
…ject#121) This PR is a conglomerate of the following PRs. opensearch-project#60 opensearch-project#64 opensearch-project#65 opensearch-project#67 opensearch-project#68 opensearch-project#69 opensearch-project#70 opensearch-project#71 opensearch-project#74 opensearch-project#75 opensearch-project#76 opensearch-project#77 opensearch-project#78 opensearch-project#79 opensearch-project#82 opensearch-project#83 opensearch-project#84 opensearch-project#92 opensearch-project#94 opensearch-project#93 opensearch-project#95 kaituo#1 kaituo#2 kaituo#3 kaituo#4 kaituo#5 kaituo#6 kaituo#7 kaituo#8 kaituo#9 kaituo#10 This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included): https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
This PR is a conglomerate of the following PRs. #60 #64 #65 #67 #68 #69 #70 #71 #74 #75 #76 #77 #78 #79 #82 #83 #84 #92 #94 #93 #95 kaituo#1 kaituo#2 kaituo#3 kaituo#4 kaituo#5 kaituo#6 kaituo#7 kaituo#8 kaituo#9 kaituo#10 This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included): https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
Description
This PR is a conglomerate of the following PRs.
#60
#64
#65
#67
#68
#69
#70
#71
#74
#75
#76
#77
#78
#79
#82
#83
#84
#92
#94
#93
#95
kaituo#1
kaituo#2
kaituo#3
kaituo#4
kaituo#5
kaituo#6
kaituo#7
kaituo#8
kaituo#9
kaituo#10
This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included):
https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
Issues Resolved
#85 (rate limiting)
#87 (multiple category fields)
#86 (pagination)
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.