cloudwatch 3.3 cloud_design
This design covers our implementation of AWS Cloud Watch.
Implementation of items related to features that we do not support is out of scope. These features are:
- Placement groups
- Spot Instances
- SNS
- VPC
- DynamoDB
- Billing
- ElastiCache
- ElasticMapReduce
- RDS
- SQS
- StorageGateway
This feature relates to the following features in this release:
- EC2 - Data collection for system-defined metrics.
- EBS - Data collection for EBS disk system-defined metrics.
- AS - Support execution of autoscaling policies from alarm triggers, and collection of system metrics.
- ELB - Data collection for ELB system metrics.
This section provides design details relevant to developers.
DB | eucalyptus_cloudwatch | |
Source modules | cloudwatch | Contents under com.eucalyptus.cloudwatch package |
| cloudwatch-common | Contents under com.eucalyptus.cloudwatch.common package |
Service class | CloudWatchService | |
Component ID class | CloudWatch | |
Service type name | cloudwatch | |
Internal URI | /internal/CloudWatch | Internal SOAP |
The following entities will be added:
- Dimension
- Metric Data
- List Metrics
- Alarms
- Alarm History
- Absolute Metric History
So many entities have a Dimension collection as an attribute that a class called AbstractPersistentWithDimensions was created. It adds the following fields to the tables of the entities that extend it.
dim_1_name | character varying(255) | Name of dimension 1 |
dim_1_value | character varying(255) | Value of dimension 1 |
dim_2_name | character varying(255) | Name of dimension 2 |
dim_2_value | character varying(255) | Value of dimension 2 |
dim_3_name | character varying(255) | Name of dimension 3 |
dim_3_value | character varying(255) | Value of dimension 3 |
dim_4_name | character varying(255) | Name of dimension 4 |
dim_4_value | character varying(255) | Value of dimension 4 |
dim_5_name | character varying(255) | Name of dimension 5 |
dim_5_value | character varying(255) | Value of dimension 5 |
dim_6_name | character varying(255) | Name of dimension 6 |
dim_6_value | character varying(255) | Value of dimension 6 |
dim_7_name | character varying(255) | Name of dimension 7 |
dim_7_value | character varying(255) | Value of dimension 7 |
dim_8_name | character varying(255) | Name of dimension 8 |
dim_8_value | character varying(255) | Value of dimension 8 |
dim_9_name | character varying(255) | Name of dimension 9 |
dim_9_value | character varying(255) | Value of dimension 9 |
dim_10_name | character varying(255) | Name of dimension 10 |
dim_10_value | character varying(255) | Value of dimension 10 |
This design is limited to 10 dimensions and may appear to be hard coded, but it allows a collection implementation within a single table, which allows for fast lookups and deletes. There are getter/setter methods for each field, but the important methods for this class are getDimensions() and setDimensions(). Dimensions are generally sorted alphabetically by name before calling setDimensions() to make comparisons faster.
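As an illustration only (not the actual Eucalyptus class), here is a minimal Java sketch of that behavior, assuming dimensions are sorted by name and mapped onto the ten numbered name/value columns; the Dimension type and flatten() helper are hypothetical:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the dimension handling described above; the real
// AbstractPersistentWithDimensions class lives in the cloudwatch source module.
public class DimensionFlattenExample {
    static final int MAX_DIMENSIONS = 10;

    static class Dimension {
        final String name;
        final String value;
        Dimension(String name, String value) { this.name = name; this.value = value; }
    }

    // Sort dimensions alphabetically by name, then assign them to the
    // dim_1_name/dim_1_value ... dim_10_name/dim_10_value style columns.
    static String[][] flatten(List<Dimension> dimensions) {
        List<Dimension> sorted = new ArrayList<>(dimensions);
        sorted.sort(Comparator.comparing(d -> d.name));
        if (sorted.size() > MAX_DIMENSIONS) {
            throw new IllegalArgumentException("At most " + MAX_DIMENSIONS + " dimensions are supported");
        }
        String[][] columns = new String[MAX_DIMENSIONS][2];
        for (int i = 0; i < sorted.size(); i++) {
            columns[i][0] = sorted.get(i).name;   // dim_<i+1>_name
            columns[i][1] = sorted.get(i).value;  // dim_<i+1>_value
        }
        return columns;
    }

    public static void main(String[] args) {
        List<Dimension> dims = List.of(
            new Dimension("InstanceId", "i-1234567"),
            new Dimension("ImageId", "emi-12345"));
        String[][] cols = flatten(dims);
        System.out.println(cols[0][0] + "=" + cols[0][1]); // ImageId=emi-12345 (sorted first)
    }
}
```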
The Metric entity represents metrics together with timed data values. MetricEntity extends AbstractPersistentWithDimensions, so in addition to the dimension fields, all tables representing metric data have the following fields.
account_id | character varying(255) | Id associated with the account |
namespace | character varying(255) | Namespace for the metric |
metric_name | character varying(255) | Name of the metric |
dimension_hash | character varying(255) | hash of all the dimension name/values (more below) |
units | character varying(255) | Enumeration of possible unit values (see spec) |
metric_type | character varying(255) | Enumeration of SYSTEM or CUSTOM metric |
timestamp | timestamp without time zone | Timestamp of the statistic set |
sample_size | double precision | number of data points in the statistic set |
sample_max | double precision | Maximum value in the statistic set |
sample_min | double precision | Minimum value in the statistic set |
sample_sum | double precision | Sum of all the values in the statistic set |
Metric data can be entered either as a single value or as a statistic set, which contains a max, min, sum, and sample count of data points. Since all data is used in calculations, a single data point is stored in the table as a statistic set of size 1 whose max, min, and sum all equal the value.
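For illustration, a hedged sketch of that conversion; the field names mirror the table above, but the class itself is hypothetical:

```java
// Hypothetical sketch: a single data point stored as a statistic set of size 1,
// as described above. Field names do not necessarily match the real entity.
public class StatisticSetExample {
    double sampleSize, sampleMax, sampleMin, sampleSum;

    static StatisticSetExample fromSingleValue(double value) {
        StatisticSetExample s = new StatisticSetExample();
        s.sampleSize = 1.0;   // sample_size
        s.sampleMax = value;  // sample_max
        s.sampleMin = value;  // sample_min
        s.sampleSum = value;  // sample_sum
        return s;
    }

    public static void main(String[] args) {
        StatisticSetExample s = fromSingleValue(20.0);
        System.out.printf("size=%.0f max=%.1f min=%.1f sum=%.1f%n",
            s.sampleSize, s.sampleMax, s.sampleMin, s.sampleSum);
    }
}
```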
The dimension hash is a quick way to search on an exact dimension match. Given a collection of dimensions, the dimension hash is constructed as SHA-1(dim_1_name + "|" + dim_1_value + "|" + dim_2_name + "|" + dim_2_value + "|" + ...), with the dimensions sorted alphabetically by name. The string ends with a final "|", and a null or empty collection hashes against an empty string (no "|" characters).
Note: Due to dimension folding (discussed below) the dimension hash is not always the hash of the dimensions in the table row.
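A minimal sketch of the hash construction described above, assuming a hex-encoded SHA-1 digest; this is an illustrative restatement, not the exact Eucalyptus code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the dimension hash described above: dimensions are
// sorted by name, each name and value is followed by "|", and the resulting
// string is hashed with SHA-1. A null or empty collection hashes the empty string.
public class DimensionHashExample {
    static String dimensionHash(Map<String, String> dimensions) throws NoSuchAlgorithmException {
        StringBuilder sb = new StringBuilder();
        if (dimensions != null) {
            for (Map.Entry<String, String> e : new TreeMap<>(dimensions).entrySet()) {
                sb.append(e.getKey()).append('|').append(e.getValue()).append('|');
            }
        }
        byte[] digest = MessageDigest.getInstance("SHA-1")
            .digest(sb.toString().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(dimensionHash(Map.of("InstanceId", "i-1234567", "ImageId", "emi-12345")));
    }
}
```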
To eventually allow database delegation and scalability, the rows are placed into certain tables depending on dimension hash and metric type. Here are the tables (all have identical fields); a sketch of the table selection follows the list.
- custom_metric_data_0
- custom_metric_data_1
- custom_metric_data_2
- custom_metric_data_3
- custom_metric_data_4
- custom_metric_data_5
- custom_metric_data_6
- custom_metric_data_7
- custom_metric_data_8
- custom_metric_data_9
- custom_metric_data_a
- custom_metric_data_b
- custom_metric_data_c
- custom_metric_data_d
- custom_metric_data_e
- custom_metric_data_f
- system_metric_data_0
- system_metric_data_1
- system_metric_data_2
- system_metric_data_3
- system_metric_data_4
- system_metric_data_5
- system_metric_data_6
- system_metric_data_7
- system_metric_data_8
- system_metric_data_9
- system_metric_data_a
- system_metric_data_b
- system_metric_data_c
- system_metric_data_d
- system_metric_data_e
- system_metric_data_f
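A sketch of one possible selection rule, assuming the shard suffix is taken from the first hex character of the dimension hash; the exact rule in the implementation may differ:

```java
// Hypothetical sketch of shard-table selection: one of the 16 custom_metric_data_*
// or system_metric_data_* tables is chosen from the metric type and a hex digit
// derived from the dimension hash. The first character of the hex-encoded hash
// is assumed here purely for illustration.
public class MetricTableSelectionExample {
    enum MetricType { SYSTEM, CUSTOM }

    static String tableFor(MetricType type, String dimensionHashHex) {
        char shard = Character.toLowerCase(dimensionHashHex.charAt(0)); // '0'..'9', 'a'..'f'
        String prefix = (type == MetricType.CUSTOM) ? "custom_metric_data_" : "system_metric_data_";
        return prefix + shard;
    }

    public static void main(String[] args) {
        // e.g. a SHA-1 hex string beginning with "a3..." lands in custom_metric_data_a
        System.out.println(tableFor(MetricType.CUSTOM, "a3f1c9"));
    }
}
```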
Table Operations
- Create : An aggregation queue will be used to aggregate whatever data points can be combined within the window. For each data point, a table will be selected based on a hash of the dimensions (key:value) and the metric type.
- Read : Search will be done on equality of metric_name, metric_type, namespace, and units. Dimensions will be searched against the dimension hash. Custom metrics require that all dimensions match; system metrics require only a subset.
- Update : Not required
- Delete : removes rows that have a timestamp more than two weeks old (see the cleaner sketch after this list)
- Uniqueness Constraints: None
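A hedged sketch of the two-week purge performed by the cleaner, using plain JDBC for illustration; the real cleaner may go through the persistence layer instead:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of the two-week purge applied to each metric data table.
// The table name comes from the fixed list above (custom_metric_data_0 ... system_metric_data_f).
public class MetricDataCleanerExample {
    static int purgeOldRows(Connection conn, String table) throws SQLException {
        Timestamp cutoff = Timestamp.from(Instant.now().minus(14, ChronoUnit.DAYS));
        String sql = "DELETE FROM " + table + " WHERE timestamp < ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setTimestamp(1, cutoff);
            return ps.executeUpdate(); // number of rows removed
        }
    }
}
```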
The List Metric entity represents a list of all metrics in the system, without data points. It is a list of all metrics one can call GetMetricStatistics against. Data is stored in the list_metrics table. ListMetric extends AbstractPersistentWithDimensions, so in addition to the dimension fields, the table has the following fields.
account_id | character varying(255) | Id associated with the account |
metric_type | character varying(255) | Enumeration of SYSTEM or CUSTOM metric |
metric_name | character varying(255) | Name of the metric |
namespace | character varying(255) | Namespace for the metric |
Table Operations
- Create : The metric info (without metric value), will be persisted after the data is inserted into the metric data tables.
- Read: Searches can be based on metric name, namespace, or a dimension list (dimension subset match is allowed). Result will include the entire dimension list entered with the metric
- Update: Only the update timestamp of the list_metrics row will be updated (after data is inserted into the metric data tables).
- Delete : removes rows that have a timestamp more than two weeks old
- Uniqueness Constraints: Only the combination of account_id, metric_name, namespace, and dimension key/values (in order) needs to be unique.
The Alarm entity represents alarms in the system. Alarms are associated with a metric, so they carry most of the fields associated with one. AlarmEntity extends AbstractPersistentWithDimensions, so in addition to the dimension fields, alarms are stored in the alarms table with the following fields.
account_id | character varying(255) | Id associated with the account |
actions_enabled | boolean | True if the actions associated with this alarm should be executed |
alarm_action_1 | text | ARN of the first action to be taken when the alarm goes into the ALARM state. |
alarm_action_2 | text | ARN of the second action to be taken when the alarm goes into the ALARM state. |
alarm_action_3 | text | ARN of the third action to be taken when the alarm goes into the ALARM state. |
alarm_action_4 | text | ARN of the fourth action to be taken when the alarm goes into the ALARM state. |
alarm_action_5 | text | ARN of the fifth action to be taken when the alarm goes into the ALARM state. |
alarm_configuration_updated_timestamp | timestamp without time zone | Last time this alarm configuration was updated |
alarm_description | character varying(255) | description for the alarm |
alarm_name | character varying(255) | name for the alarm |
comparison_operator | character varying(255) | Enumeration representing the comparison operator used to determine alarm state |
evaluation_periods | integer | The number of consecutive periods an alarm must exceed the threshold before moving to the ALARM state (see the sketch after the table operations below). |
insufficient_data_action_1 | text | ARN of the first action to be taken when the alarm goes into the INSUFFICIENT_DATA state. |
insufficient_data_action_2 | text | ARN of the second action to be taken when the alarm goes into the INSUFFICIENT_DATA state. |
insufficient_data_action_3 | text | ARN of the third action to be taken when the alarm goes into the INSUFFICIENT_DATA state. |
insufficient_data_action_4 | text | ARN of the fourth action to be taken when the alarm goes into the INSUFFICIENT_DATA state. |
insufficient_data_action_5 | text | ARN of the fifth action to be taken when the alarm goes into the INSUFFICIENT_DATA state. |
last_actions_executed_timestamp | timestamp without time zone | The last time an action was executed (used to determine if more than one period has passed) |
metric_name | character varying(255) | Name of the metric |
metric_type | character varying(255) | Enumeration of SYSTEM or CUSTOM metric |
namespace | character varying(255) | Namespace for the metric |
ok_action_1 | text | ARN of the first action to be taken when the alarm goes into the OK state. |
ok_action_2 | text | ARN of the second action to be taken when the alarm goes into the OK state. |
ok_action_3 | text | ARN of the third action to be taken when the alarm goes into the OK state. |
ok_action_4 | text | ARN of the fourth action to be taken when the alarm goes into the OK state. |
ok_action_5 | text | ARN of the fifth action to be taken when the alarm goes into the OK state. |
period | integer | The number of seconds of one evaluation period |
state_reason | character varying(1023) | The reason the alarm is in the current state |
state_reason_data | character varying(1023) | A JSON string representing the reason the alarm is in the current state. |
state_updated_timestamp | timestamp without time zone | Time the alarm state was last updated |
state_value | character varying(255) | Enumeration of current alarm state (ALARM, OK, INSUFFICIENT_DATA) |
statistic | character varying(255) | Enumeration of possible statistics (Average, Maximum, Minimum, Sample Count, Sum) used for alarm state evaluation |
threshold | double precision | The threshold value used for alarm state evaluation |
unit | character varying(255) | Enumeration of valid unit values |
Table Operations
- Create : A row is created when a putMetricAlarm() request is made on an alarm that does not exist yet.
- Read: Searches can be done via the describeAlarms() or describeAlarmsForMetric() requests. Alarms can be searched by alarm name, alarm name prefix, action prefix (ARN of one of the ok, alarm, or insufficient data actions), or state value. Alarms for a certain metric can be searched by metric name, namespace, period, units, or statistic.
- Update: The table is updated by many factors, alarm state evaluation, putMetricAlarm() on an existing alarm, or any of the disableAlarmActions(), enableAlarmActions() or setAlarmState() requests.
- Delete : A row is deleted when a deleteMetricAlarms() request is made on that alarm. Alarm history (shown below) is not deleted.
- Uniqueness Constraints: Only the combination of account_id and alarm_name needs to be unique.
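A hedged sketch of how the comparison_operator, threshold, and evaluation_periods fields described above could drive the ALARM decision; the helper names are hypothetical, while the enumeration values follow the CloudWatch API:

```java
// Hypothetical sketch: the alarm moves to ALARM only when the chosen statistic
// breaches the threshold for the required number of consecutive periods.
public class AlarmStateExample {
    enum ComparisonOperator {
        GreaterThanOrEqualToThreshold, GreaterThanThreshold,
        LessThanThreshold, LessThanOrEqualToThreshold
    }

    static boolean breaches(ComparisonOperator op, double statistic, double threshold) {
        switch (op) {
            case GreaterThanOrEqualToThreshold: return statistic >= threshold;
            case GreaterThanThreshold:          return statistic > threshold;
            case LessThanThreshold:             return statistic < threshold;
            case LessThanOrEqualToThreshold:    return statistic <= threshold;
            default: throw new IllegalArgumentException(op.name());
        }
    }

    // One statistic value per evaluation period, oldest first.
    static boolean shouldAlarm(ComparisonOperator op, double threshold,
                               int evaluationPeriods, double[] periodStatistics) {
        if (periodStatistics.length < evaluationPeriods) {
            return false; // not enough data to evaluate
        }
        for (int i = periodStatistics.length - evaluationPeriods; i < periodStatistics.length; i++) {
            if (!breaches(op, periodStatistics[i], threshold)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[] cpu = { 55.0, 82.0, 91.0 };
        System.out.println(shouldAlarm(ComparisonOperator.GreaterThanThreshold, 80.0, 2, cpu)); // true
    }
}
```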
The Alarm History Entity represents the history of alarm operations in the system. This includes things like alarm creation, deletion, and state update. Alarm history is stored in the alarm_history table, which has the following fields.
account_id | character varying(255) | Id associated with the account |
alarm_name | character varying(255) | name for the alarm |
history_data | character varying(4095) | A JSON string representing this history data item |
history_item_type | character varying(255) | Enumeration of valid history item types (Configuration Update, State Change, Action) |
history_summary | character varying(255) | A summary of the history item. |
timestamp | timestamp without time zone | Timestamp of the alarm history item. |
Table Operations
- Create : A row is created when an alarm action is executed, alarm configuration is updated, an alarm is created or deleted, or the alarm state is updated.
- Read: Searches can be done via the describeAlarmHistory() request. Alarm History can be searched by timestamp (start and end time), alarm name, or history item type.
- Update: Not required.
- Delete : removes rows that have a timestamp more than two weeks old. Rows are not deleted just because an alarm is deleted.
- Uniqueness Constraints: None
Some system metrics that come in are cumulative metrics. For example, DiskReadOps returns the total number of disk reads since the instance started, not the number of reads since the last metric was recorded. To put the non-cumulative value in the metric data, a delta against the last cumulative (absolute) value of the metric is computed; a sketch of this delta computation follows the table operations below. The previous cumulative value of a given metric is stored in the absolute_metric_history table. The fields are shown below.
dimension_name | character varying(255) | The (primary) name of the dimension associated with the metric |
dimension_value | character varying(255) | The (primary) value of the dimension associated with the metric |
last_metric_value | character varying(4095) | The last (cumulative) metric value associated with the metric |
metric_name | character varying(255) | Name of the metric |
namespace | character varying(255) | Namespace of the metric |
timestamp | timestamp without time zone | Timestamp of the last recorded data point for the metric |
Most of the cumulative system metrics that come in have one primary dimension associated with them. Items stored in this table are of this type.
Table Operations
- Create : Every time a new system metric comes in that is cumulative, the table is checked for a previous data point. If none is in the table, a new row is created.
- Read: A read is done to compute the delta when a new cumulative system metric is put in.
- Update: Row is updated when a new cumulative system metric is put in.
- Delete : Right now rows are never deleted, but they should be deleted after two weeks.
- Uniqueness Constraints: dimension_name, dimension_value, metric_name, and namespace must be unique together.
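A minimal in-memory sketch of the delta computation outlined above; the real implementation reads and updates the absolute_metric_history table, which a simple map stands in for here, and the method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the cumulative-to-delta conversion described above.
public class AbsoluteMetricExample {
    private final Map<String, Double> lastValues = new HashMap<>();

    // Returns the non-cumulative delta to record, or empty if this is the first
    // data point seen for the metric (only the history row is created in that case).
    Optional<Double> recordCumulative(String namespace, String metricName,
                                      String dimensionName, String dimensionValue,
                                      double cumulativeValue) {
        String key = namespace + "|" + metricName + "|" + dimensionName + "|" + dimensionValue;
        Double previous = lastValues.put(key, cumulativeValue);
        if (previous == null) {
            return Optional.empty(); // create: no previous row, nothing to record yet
        }
        return Optional.of(cumulativeValue - previous); // update: delta since last value
    }

    public static void main(String[] args) {
        AbsoluteMetricExample history = new AbsoluteMetricExample();
        history.recordCumulative("AWS/EC2", "DiskReadOps", "InstanceId", "i-1234567", 100.0);
        System.out.println(history.recordCumulative("AWS/EC2", "DiskReadOps",
            "InstanceId", "i-1234567", 130.0)); // Optional[30.0]
    }
}
```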
File:put-metric-data-workflow.png
- PutMetricData -> user created request
- Cloud Watch Service -> Eucalyptus implementation of the Cloud Watch Service
- Raw Data Queue -> FIFO queue
- Aggregation -> Function that aggregates a one-minute window of raw data for insertion into the database (see the sketch after this list). Every put_metric_data call also updates the list_metrics table
- Cleaner -> Housekeeping process that deletes metric data older than two weeks from the database
- Data Points Rows / Metric Data Table -> A table is selected based on dimension hash and metric type (processed information)
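A hedged sketch of the one-minute aggregation step named in the legend above; the key layout and class names are illustrative assumptions, not the actual queue implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the one-minute aggregation step: raw data points that
// share the same account, namespace, metric name, units, dimension hash, and
// minute are folded into one statistic set before being written to the shard table.
public class AggregationExample {
    static class StatisticSet {
        double size, max, min, sum;
        StatisticSet(double v) { size = 1; max = v; min = v; sum = v; }
        void merge(double v) { size++; max = Math.max(max, v); min = Math.min(min, v); sum += v; }
    }

    private final Map<String, StatisticSet> window = new HashMap<>();

    void add(String accountId, String namespace, String metricName, String units,
             String dimensionHash, long timestampMillis, double value) {
        long minute = timestampMillis / 60_000L; // truncate to the minute
        String key = String.join("|", accountId, namespace, metricName, units,
            dimensionHash, Long.toString(minute));
        window.merge(key, new StatisticSet(value), (existing, ignored) -> {
            existing.merge(value);
            return existing;
        });
    }

    Map<String, StatisticSet> drain() {
        Map<String, StatisticSet> out = new HashMap<>(window);
        window.clear();
        return out; // one row per key is inserted into the selected metric data table
    }
}
```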
File:list-metrics-workflow.png
- ListMetrics -> user created request
- Cloud Watch Service -> Eucalyptus implementation of the Cloud Watch Service
- Cleaner -> Housekeeping process that deletes metric data older than two weeks from the database
- list_metrics -> Processed information
- PutMetricData -> Previous workflow also populates list_metrics table
File:get-metric-statistics-workflow.png
- GetMetricStatistics -> user created request
- Cloud Watch Service -> Eucalyptus implementation of the Cloud Watch Service
- Aggregation -> Function that converts one or more raw data rows into an aggregated statistic based on the selected period (see the sketch after this list)
- Data Points Rows / Metric Data Table -> A table is selected based on dimension hash and metric type (processed information)
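A hedged sketch of the period aggregation named in the legend above; the class and method names are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of period aggregation for GetMetricStatistics: statistic
// set rows are grouped into buckets of periodSeconds, then combined so that
// Average, Sum, Minimum, Maximum, and SampleCount can be derived per bucket.
public class PeriodAggregationExample {
    static class Row {
        final long timestampMillis;
        final double size, max, min, sum;
        Row(long t, double size, double max, double min, double sum) {
            this.timestampMillis = t; this.size = size; this.max = max; this.min = min; this.sum = sum;
        }
    }

    // Returns bucket start (seconds) -> {sampleCount, max, min, sum}.
    static Map<Long, double[]> aggregate(List<Row> rows, int periodSeconds) {
        Map<Long, double[]> buckets = new TreeMap<>();
        for (Row r : rows) {
            long bucket = (r.timestampMillis / 1000 / periodSeconds) * periodSeconds;
            buckets.merge(bucket,
                new double[] { r.size, r.max, r.min, r.sum },
                (a, b) -> new double[] { a[0] + b[0], Math.max(a[1], b[1]),
                                         Math.min(a[2], b[2]), a[3] + b[3] });
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(0L, 1, 20, 20, 20));
        rows.add(new Row(30_000L, 1, 40, 40, 40));
        double[] b = aggregate(rows, 60).get(0L);
        System.out.println("Average = " + (b[3] / b[0])); // 30.0
    }
}
```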
File:put-metric-alarm-workflow.png
- PutMetricAlarm -> user created request
- Cloud Watch Service -> Eucalyptus implementation of the Cloud Watch Service
- Alarms -> database table used to store alarms
File:describe-alarms-workflow.png
- DescribeAlarms -> user created request
- Cloud Watch Service -> Eucalyptus implementation of the Cloud Watch Service
- Alarms -> database table used to store alarms
File:describe-alarms-for-metric-workflow.png
- DescribeAlarmsForMetric -> user created request
- Cloud Watch Service -> Eucalyptus implementation of the Cloud Watch Service
- Alarms -> database table used to store alarms
File:enable-alarm-actions-workflow.png
- EnableAlarmActions -> user created request
- Cloud Watch Service -> Eucalyptus implementation of the Cloud Watch Service
- Alarms -> database table used to store alarms
File:disable-alarm-actions-workflow.png
- DisableAlarmActions -> user created request
- Cloud Watch Service -> Eucalyptus implementation of the Cloud Watch Service
- Alarms -> database table used to store alarms
File:set-alarm-state-workflow.png
- SetAlarmState -> user created request
- Cloud Watch Service -> Eucalyptus implementation of the Cloud Watch Service
- Alarms -> database table used to store alarms
File:describe-alarm-history-workflow.png
- DescribeAlarmHistory -> user created request
- Cloud Watch Service -> Eucalyptus implementation of the Cloud Watch Service
- Alarm History -> database table used to store alarm history
- Cleaner -> Housekeeping process that deletes metric data older than two weeks from the database
- Alarm CRUD Operation -> A change in state to an alarm, alarm creation, alarm deletion, or executed action.
Global periodic data deletion occurs after 2 weeks.
Once a minute, every alarm is read out of the database and put into a queue for evaluation. There are currently 5 queue workers. Each queue worker evaluates the state of its alarm according to the alarm rules (see alarm-rules.wiki). If alarm actions need to be executed, the queue workers do that as well.
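A minimal sketch of that evaluation loop, assuming a fixed pool of five workers draining a shared queue; the Alarm interface and method names are hypothetical:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the once-a-minute alarm evaluation described above:
// all alarms are read from the database into a queue, and a fixed pool of five
// workers evaluates each alarm's state and fires its actions when required.
public class AlarmEvaluationExample {
    interface Alarm {
        boolean evaluate();      // returns true if actions should be executed
        void executeActions();
    }

    private static final int WORKER_COUNT = 5;
    private final LinkedBlockingQueue<Alarm> queue = new LinkedBlockingQueue<>();
    private final ExecutorService workers = Executors.newFixedThreadPool(WORKER_COUNT);

    // Called once a minute with every alarm read out of the alarms table.
    void evaluateAll(List<Alarm> alarms) {
        queue.addAll(alarms);
        for (int i = 0; i < WORKER_COUNT; i++) {
            workers.submit(() -> {
                Alarm alarm;
                while ((alarm = queue.poll()) != null) {
                    if (alarm.evaluate()) {
                        alarm.executeActions();
                    }
                }
            });
        }
    }
}
```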
Because AWS allows some metrics to be aggregated or "folded", system metrics are inserted into the database with every possible dimension combination. For example, a system metric of "CPUUtilization" with dimensions {ImageId=emi-12345, InstanceId=i-1234567, InstanceType=m1.small} and value 20.0 is actually inserted into the database 8 times, once for each possible dimension subset (a sketch of this folding follows the list):
- {}
- {ImageId=emi-12345}
- {InstanceId=i-1234567}
- {InstanceType=m1.small}
- {ImageId=emi-12345, InstanceId=i-1234567}
- {ImageId=emi-12345, InstanceType=m1.small}
- {InstanceId=i-1234567, InstanceType=m1.small}
- {ImageId=emi-12345, InstanceId=i-1234567, InstanceType=m1.small}
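A hedged sketch of the folding step, generating every dimension subset for a single data point; the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of dimension folding: every subset of the incoming
// dimension map is produced, and the data point is inserted once per subset.
// Three dimensions therefore yield 2^3 = 8 rows, as in the example above.
public class DimensionFoldingExample {
    static List<Map<String, String>> fold(Map<String, String> dimensions) {
        List<Map.Entry<String, String>> entries = new ArrayList<>(dimensions.entrySet());
        List<Map<String, String>> subsets = new ArrayList<>();
        int total = 1 << entries.size();
        for (int mask = 0; mask < total; mask++) {
            Map<String, String> subset = new LinkedHashMap<>();
            for (int i = 0; i < entries.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    subset.put(entries.get(i).getKey(), entries.get(i).getValue());
                }
            }
            subsets.add(subset);
        }
        return subsets;
    }

    public static void main(String[] args) {
        Map<String, String> dims = new LinkedHashMap<>();
        dims.put("ImageId", "emi-12345");
        dims.put("InstanceId", "i-1234567");
        dims.put("InstanceType", "m1.small");
        fold(dims).forEach(System.out::println); // prints 8 subsets, {} first
    }
}
```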
Update: It appears AWS does not aggregate all combinations of system metrics. Using the approach above caused EucaLobo to not function properly. For EBS and EC2 metrics, data is still folded in the metric data tables, but the list_metrics table only shows entries with zero or one dimension. We may wish to reconsider folding metrics in the future and make the metric provider responsible for adding whatever aggregated metrics it wants to allow.
New SOAP / Query API user-facing services are added.
CLC -> Consumes general JVM resources on the cloud controller
DB -> Additional JDBC connections
Alarms -> Additional duty cycle, which will spawn more threads
Must be HA compliant
Filters should be discovered, whether individually, per type or per API.
No upgrade impact noted.
No specific packaging requirements.
- Need feedback from the docs team
Each operation must adhere to the Eucalyptus implementation of the IAM-compatible identity management system.
Testing should cover SOAP and Query APIs
AWS CloudWatch API Reference: http://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/Welcome.html
JIRA: https://eucalyptus.atlassian.net/browse/EUCA-4307