
Add documentation for rule-based anomaly detection and imputation #8202

Merged: 39 commits, Sep 13, 2024

Commits (39):
- 942c7d7: Add documentation for rule-based anomaly detection and imputation (kaituo, Sep 9, 2024)
- b2af679: Doc review (vagimeli, Sep 10, 2024)
- e2c656e: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- fe79e71: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- dcbce5a: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- 23bcea3: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- 2754b3b: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- 1a5120d: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- 6c3326d: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- 614b660: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- 3ab815f: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- 596adfa: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- bee1f4c: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- f8ee3d9: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- 28a6b77: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- c318ece: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- 4189083: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- 199fbc3: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- 595b45a: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- 66c48c4: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- d6913fb: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- 45443b9: Update _observing-your-data/ad/index.md (vagimeli, Sep 11, 2024)
- cfc3709: Update _observing-your-data/ad/result-mapping.md (vagimeli, Sep 11, 2024)
- 8a3b25d: Update _observing-your-data/ad/index.md (vagimeli, Sep 12, 2024)
- bc9488a: Update _observing-your-data/ad/index.md (vagimeli, Sep 12, 2024)
- 50eff8b: Update _observing-your-data/ad/index.md (vagimeli, Sep 12, 2024)
- 2c2e06c: Update _observing-your-data/ad/index.md (vagimeli, Sep 12, 2024)
- 894efee: Update index.md (vagimeli, Sep 13, 2024)
- 5738739: Update result-mapping.md (vagimeli, Sep 13, 2024)
- 14dc454: Update _observing-your-data/ad/index.md (vagimeli, Sep 13, 2024)
- 4ad9e02: Update _observing-your-data/ad/index.md (vagimeli, Sep 13, 2024)
- 4d7f738: Merge branch 'main' into 2.17 (vagimeli, Sep 13, 2024)
- 9afca30: Fix links (vagimeli, Sep 13, 2024)
- 0067b5d: Fix links (vagimeli, Sep 13, 2024)
- a99969b: Address editorial feedback (vagimeli, Sep 13, 2024)
- 7ea3d63: Address editorial feedback (vagimeli, Sep 13, 2024)
- 4b42bc2: Merge branch 'main' into 2.17 (vagimeli, Sep 13, 2024)
- f9434ec: Merge branch 'main' into 2.17 (vagimeli, Sep 13, 2024)
- ca49c0c: Update _observing-your-data/ad/index.md (vagimeli, Sep 13, 2024)
41 changes: 40 additions & 1 deletion _observing-your-data/ad/index.md
@@ -76,6 +76,8 @@
- (Optional) To add extra processing time for data collection, specify a **Window delay** value.
- This value tells the detector that the data is not ingested into OpenSearch in real time but with a certain delay. Set the window delay to shift the detector interval to account for this delay.
- For example, say the detector interval is 10 minutes and data is ingested into your cluster with a general delay of 1 minute. Assume the detector runs at 2:00. The detector attempts to get the last 10 minutes of data from 1:50 to 2:00, but because of the 1-minute delay, it only gets 9 minutes of data and misses the data from 1:59 to 2:00. Setting the window delay to 1 minute shifts the interval window to 1:49--1:59, so the detector accounts for all 10 minutes of the detector interval time.
- To avoid missing any data, set the **Window delay** to the upper bound of the expected ingestion delay. This ensures the detector accounts for all data during its interval, reducing the chances of missing relevant information. While setting a longer window delay helps capture all data, setting it too high can hinder real-time anomaly detection, as the detector will always be looking further back in time. Strike a balance to maintain both data accuracy and timely detection.

1. Specify custom results index.
- The Anomaly Detection plugin allows you to store anomaly detection results in a custom index of your choice. To enable this, select **Enable custom results index** and provide a name for your index, for example, `abc`. The plugin then creates an alias prefixed with `opensearch-ad-plugin-result-` followed by your chosen name, for example, `opensearch-ad-plugin-result-abc`. This alias points to an actual index with a name containing the date and a sequence number, like `opensearch-ad-plugin-result-abc-history-2024.06.12-000002`, where your results are stored.
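
These settings map to fields on the detector definition in the Create Detector API. The following request is a minimal sketch; the detector name, index, feature definition, and interval values are illustrative, and whether the results index is passed as the short name or the fully prefixed alias should be verified against the API reference for your version:

```json
POST _plugins/_anomaly_detection/detectors
{
  "name": "test-detector",
  "description": "Detector with a 1-minute window delay and a custom results index",
  "time_field": "timestamp",
  "indices": ["server_log"],
  "feature_attributes": [
    {
      "feature_name": "logVolume",
      "feature_enabled": true,
      "aggregation_query": {
        "logVolume": { "value_count": { "field": "request_id" } }
      }
    }
  ],
  "detection_interval": { "period": { "interval": 10, "unit": "Minutes" } },
  "window_delay": { "period": { "interval": 1, "unit": "Minutes" } },
  "result_index": "opensearch-ad-plugin-result-abc"
}
```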

@@ -164,7 +166,44 @@

Set the number of aggregation intervals from your data stream to consider in a detection window. It’s best to choose this value based on your actual data to see which one leads to the best results for your use case.

- The anomaly detector expects the shingle size to be in the range of 1 and 60. The default shingle size is 8. We recommend that you don't choose 1 unless you have two or more features. Smaller values might increase [recall](https://en.wikipedia.org/wiki/Precision_and_recall) but also false positives. Larger values might be useful for ignoring noise in a signal.
+ The anomaly detector expects the shingle size to be between 1 and 128. The default shingle size is 8. We recommend that you don't choose 1 unless you have two or more features. Smaller values might increase [recall](https://en.wikipedia.org/wiki/Precision_and_recall) but also the number of false positives. Larger values might be useful for ignoring noise in a signal.

#### (Advanced settings) Set an imputation option

The imputation option allows you to address missing data in your streams. You can choose from the following methods to handle gaps:

- **Ignore missing data (default)**: The detector continues without factoring in missing data points, maintaining the existing data flow.
- **Fill with custom values**: Specify a custom value for each feature to replace missing data points, allowing for targeted imputation tailored to your data.
- **Fill with zeros**: Replace missing values with zeros. This is ideal when the absence of data itself indicates a significant event, such as a drop to zero in event counts.
- **Use previous values**: Fill gaps with the last observed value to maintain continuity in your time-series data. This method treats missing data as non-anomalous and carries the previous trend forward.
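
When configuring a detector through the API rather than the UI, these choices correspond to an imputation option on the detector definition. The following fragment is a sketch only; the field names shown (`imputation_option`, `method`, `default_fill`) are assumptions modeled on the plugin's configuration style, so confirm them against the Anomaly Detection API reference for your version:

```json
"imputation_option": {
  "method": "FIXED_VALUES",
  "default_fill": [
    { "feature_name": "logVolume", "data": 0 }
  ]
}
```

For the zero and previous-value behaviors, this sketch assumes `"method": "ZERO"` or `"method": "PREVIOUS"` with no fill values.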


Using these options can improve recall in anomaly detection. For instance, if you're monitoring for drops in event counts, including both partial and complete drops, filling missing values with zeros helps detect significant data absences, improving detection recall.

Note: Be cautious when imputing extensively missing data because excessive gaps can compromise model accuracy. Quality input is critical: poor data quality leads to poor model performance. To determine whether a feature value has been imputed, check the `feature_imputed` field in the anomaly result index. For more information, see [Anomaly result mapping]({{site.url}}{{site.baseurl}}/monitoring-plugins/ad/result-mapping/).


#### (Advanced settings) Suppress anomalies with threshold-based rules


You can suppress anomalies by setting rules that define acceptable differences between the expected and actual values, either as an absolute value or a relative percentage. This helps reduce false anomalies caused by minor fluctuations, allowing you to focus on significant deviations.

Suppose you want to detect substantial changes in log volume while ignoring small variations that aren't meaningful. Without customized settings, the system might generate false alerts for minor changes, making it difficult to identify true anomalies. By setting suppression rules, you can filter out minor deviations and home in on genuinely anomalous patterns.

If you want to suppress anomalies for deviations smaller than 30% from the expected value, you can set the following rules:

```
Ignore anomalies for feature logVolume when the actual value is no more than 30% above the expected value.
Ignore anomalies for feature logVolume when the actual value is no more than 30% below the expected value.
```

Note: Make sure that the feature (for example, `logVolume`) is properly defined in your model because suppression rules are tied to specific features.


If you expect that the log volume should differ by at least 10,000 from the expected value before being considered an anomaly, you can set absolute thresholds:

```
Ignore anomalies for feature logVolume when the actual value is no more than 10000 above the expected value.
Ignore anomalies for feature logVolume when the actual value is no more than 10000 below the expected value.
```

If no custom suppression rules are set, the system defaults to a filter that ignores anomalies with deviations of less than 20% from the expected value for each enabled feature.
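
In the detector definition, suppression rules of this kind can be expressed as rule objects attached to the detector. The following fragment sketches the relative-threshold example above; the field names and enum values (`rules`, `action`, `threshold_type`, `operator`) are assumptions modeled on the plugin's rule configuration, so verify them against the API reference for your version:

```json
"rules": [
  {
    "action": "IGNORE_ANOMALY",
    "conditions": [
      {
        "feature_name": "logVolume",
        "threshold_type": "ACTUAL_OVER_EXPECTED_RATIO",
        "operator": "LTE",
        "value": 0.3
      },
      {
        "feature_name": "logVolume",
        "threshold_type": "EXPECTED_OVER_ACTUAL_RATIO",
        "operator": "LTE",
        "value": 0.3
      }
    ]
  }
]
```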

#### Preview sample anomalies

75 changes: 75 additions & 0 deletions _observing-your-data/ad/result-mapping.md
@@ -80,6 +80,81 @@ Field | Description
`model_id` | A unique ID that identifies a model. If a detector is a single-stream detector (with no category field), it has only one model. If a detector is a high-cardinality detector (with one or more category fields), it might have multiple models, one for each entity.
`threshold` | One of the criteria for a detector to classify a data point as an anomaly is that its `anomaly_score` must surpass a dynamic threshold. This field records the current threshold.

When the imputation option is enabled, the anomaly result output includes a `feature_imputed` array indicating whether each feature was imputed. This information helps you understand which features were modified during the anomaly detection process because of missing data. If no features were imputed, the `feature_imputed` array is omitted from the results.

> Reviewer comment (Collaborator): Above: "result includes" => "results include"?

In the following example, the feature `processing_bytes_max` was imputed, as indicated by its `imputed: true` status:

```json
{
"detector_id": "kzcZ43wBgEQAbjDnhzGF",
"schema_version": 5,
"data_start_time": 1635898161367,
"data_end_time": 1635898221367,
"feature_data": [
{
"feature_id": "processing_bytes_max",
"feature_name": "processing bytes max",
"data": 2322
},
{
"feature_id": "processing_bytes_avg",
"feature_name": "processing bytes avg",
"data": 1718.6666666666667
},
{
"feature_id": "processing_bytes_min",
"feature_name": "processing bytes min",
"data": 1375
},
{
"feature_id": "processing_bytes_sum",
"feature_name": "processing bytes sum",
"data": 5156
},
{
"feature_id": "processing_time_max",
"feature_name": "processing time max",
"data": 31198
}
],
"execution_start_time": 1635898231577,
"execution_end_time": 1635898231622,
"anomaly_score": 1.8124904404395776,
"anomaly_grade": 0,
"confidence": 0.9802940756605277,
"entity": [
{
"name": "process_name",
"value": "process_3"
}
],
"model_id": "kzcZ43wBgEQAbjDnhzGF_entity_process_3",
"threshold": 1.2368549346675202,
"feature_imputed": [
{
"feature_id": "processing_bytes_max",
"imputed": true
},
{
"feature_id": "processing_bytes_avg",
"imputed": false
},
{
"feature_id": "processing_bytes_min",
"imputed": false
},
{
"feature_id": "processing_bytes_sum",
"imputed": false
},
{
"feature_id": "processing_time_max",
"imputed": false
}
]
}
```

If an anomaly detector detects an anomaly, the result has the following format:

```json