OpenWhisk Metric Support

OpenWhisk distinguishes between system and user metrics (events).

System metrics typically contain information about system performance. They can be sent to Kamon or written to log files in logmarker format. These metrics are typically used by OpenWhisk providers/operators.

User metrics encompass information about action performance, which is sent to Kafka in the form of events. These metrics are intended for OpenWhisk users, but they can also be used for billing or audit purposes. Note that at the moment the events are not directly exposed to users and require an additional Kafka-consumer-based microservice for data processing.

System specific metrics

Configuration

Both capabilities can be enabled or disabled separately during deployment via Ansible configuration in the 'group_vars/all' file of an environment.

There are five configuration options available:

  • metrics_log [true / false (default: true)]

    Enable/disable whether the metric information is written out to the log files in logmarker format.

    Beware: Even if set to false, all messages using the log markers are still written out to the log.

  • metrics_kamon [true / false (default: false)]

    Enable/disable whether metric information is sent to the configured StatsD server.

  • metrics_kamon_tags [true / false (default: false)]

    Enable/disable whether to use the Kamon tags when sending metrics.

    Notice: Tags are supported by only some Kamon backends (OpenTSDB, Datadog, InfluxDB).

  • metrics_kamon_statsd_host [hostname or IP address]

    Hostname or IP address of the StatsD server

  • metrics_kamon_statsd_port [port number (default:8125)]

    Port number of the StatsD server

Example configuration:

metrics_kamon: true
metrics_kamon_tags: false
metrics_kamon_statsd_host: '192.168.99.100'
metrics_kamon_statsd_port: '8125'
metrics_log: true

Testing the StatsD metric support

The Kamon project provides an integrated docker image containing StatsD and a connected Grafana dashboard via this GitHub project. This image is helpful for testing the metrics sent via StatsD.

Please follow these instructions to start the docker image in your local docker environment.

The docker image exposes StatsD via the (standard) port 8125 and a Grafana dashboard via port 8080 on your docker host.

The address of your docker host has to be configured in the metrics_kamon_statsd_host configuration property.
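
To check that the StatsD port on the docker host is reachable before deploying, a single counter can be sent by hand. The following is a minimal sketch, assuming the plain StatsD line format (`<name>:<value>|c`) over UDP; the metric name `openwhisk.counter.smoke_test` and both function names are hypothetical and not part of OpenWhisk.

```python
import socket


def format_statsd_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter in the plain StatsD line format: <name>:<value>|c"""
    return f"{name}:{value}|c".encode("ascii")


def send_test_counter(host: str, port: int = 8125,
                      name: str = "openwhisk.counter.smoke_test") -> int:
    """Send one counter datagram over UDP and return the number of bytes sent.

    The metric name is a made-up example, not a metric emitted by OpenWhisk.
    UDP is fire-and-forget, so a successful send does not prove delivery.
    """
    payload = format_statsd_counter(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


# Usage (address taken from the example configuration above):
# send_test_counter("192.168.99.100")
```

If the counter shows up in the Grafana dashboard, the same host and port should work for metrics_kamon_statsd_host and metrics_kamon_statsd_port.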

Metric Names

All metric names are prefixed with a prefix that you specify and are subject to modification by Graphite, Datadog, or StatsD. For example, if the prefix is openwhisk, metric names look like openwhisk.counter.controller_activation_start. This document assumes that the metric name prefix is openwhisk.

Currently OpenWhisk emits the following types of metrics:
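
The naming scheme can be sketched as a small helper. This is an illustration only, assuming the `<prefix>.<type>.<name>` layout shown in the example above; `metric_name` is a hypothetical function and not part of OpenWhisk or Kamon.

```python
METRIC_TYPES = ("counter", "histogram", "gauge")


def metric_name(metric_type: str, name: str, prefix: str = "openwhisk") -> str:
    """Build a full metric name as <prefix>.<type>.<name>."""
    if metric_type not in METRIC_TYPES:
        raise ValueError(f"unknown metric type: {metric_type}")
    return f"{prefix}.{metric_type}.{name}"


# Usage:
# metric_name("counter", "controller_activation_start")
```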

Counter

Counters record the count of a metric, and their names are prefixed with openwhisk.counter, for example openwhisk.counter.controller_activation_start. A counter only counts and resets to zero upon each flush.

Histograms

Histograms record the distribution of a given metric, and their names are prefixed with openwhisk.histogram, for example openwhisk.histogram.controller_activation_finish. A histogram metric may result in multiple values at the metric aggregator level. For example, in Datadog the following values are recorded for each histogram metric:

  • my_metric.avg - Average of aggregated values during the flush interval.
  • my_metric.count - Count of aggregated values during the flush interval.
  • my_metric.median - Median of aggregated values during the flush interval.
  • my_metric.95percentile - 95th percentile value of aggregated values during the flush interval.
  • my_metric.max - Max of aggregated values during the flush interval.
  • my_metric.min - Min of aggregated values during the flush interval.
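
To make the derived values concrete, the sketch below computes them for the samples of one flush interval. It is an approximation under stated assumptions: the 95th percentile uses a simple nearest-rank rule, which may differ from the interpolation an actual aggregator applies, and `summarize` is a hypothetical helper.

```python
import math
from statistics import mean, median


def summarize(samples):
    """Compute the per-flush aggregates listed above for a non-empty sample list."""
    ordered = sorted(samples)
    # Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "avg": mean(ordered),
        "count": len(ordered),
        "median": median(ordered),
        "95percentile": ordered[rank - 1],
        "max": ordered[-1],
        "min": ordered[0],
    }


# Usage, with four hypothetical measurements taken in one flush interval:
# summarize([10, 20, 30, 40])
```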

Gauges

Gauges record the value of a given metric at a point in time, and their names are prefixed with openwhisk.gauge, for example openwhisk.gauge.loadbalancer_totalHealthyInvoker_counter. A gauge keeps reporting the same value until it is incremented, decremented, or set to a new value. Gauges are useful for reporting metrics like Kafka queue size or disk size.
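
The difference between counters and gauges at flush time can be illustrated with two toy classes. This is a simplified model of the behaviour described above, not how Kamon or StatsD implement it; `ToyCounter` and `ToyGauge` are hypothetical names.

```python
class ToyCounter:
    """Counts events and resets to zero on every flush."""

    def __init__(self):
        self.value = 0

    def increment(self, amount=1):
        self.value += amount

    def flush(self):
        # Report the accumulated count, then start again from zero.
        reported, self.value = self.value, 0
        return reported


class ToyGauge:
    """Holds the latest value and keeps reporting it until it changes."""

    def __init__(self, initial=0):
        self.value = initial

    def set(self, value):
        self.value = value

    def flush(self):
        # Flushing does not reset a gauge.
        return self.value
```

With these classes, a counter that saw three events reports 3 on the first flush and 0 on the next, while a gauge set to 5 reports 5 on every flush until it is set again.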

Metric Details

Below are some of the important metrics emitted by an OpenWhisk setup:

Controller metrics

Metrics below are emitted from within a Controller instance.

Controller Startup
  • openwhisk.counter.controller_startup<controller_id>_counter (counter)
    • Example openwhisk.counter.controller_startup0_counter
    • Records count of controller instance startup
Controller Activation Retrieval During Blocking Invocations
  • openwhisk.counter.controller_blockingActivationDatabaseRetrieval_counter (counter) - Records the count of activations the controller has retrieved from the activation store during blocking invocations
Activation Submission

The following metrics record stats around activation handling within the Controller:

  • Normal actions
    • openwhisk.counter.controller_activation_start (counter) - Records the count of non-blocking activations started.
    • openwhisk.histogram.controller_activation_finish (histogram) - Records the overall time taken for a non-blocking activation to be submitted to the load balancer.
  • Blocking actions
    • openwhisk.counter.controller_blockingActivation_start (counter) - Records the count of blocking activations started.
    • openwhisk.histogram.controller_blockingActivation_finish (histogram) - Records the time taken for a blocking activation to finish or timeout.
Load Balancer

Aggregate metrics for inflight activations.

  • openwhisk.gauge.loadbalancer<controllerId>_activationsInflight_counter (gauge) - Records the number of activations being worked upon for a given controller. As a gauge, it reports the in-flight activation count at a given point in time and keeps reporting that value until it changes.
  • openwhisk.gauge.loadbalancer<controllerId>_memory<invokerType>Inflight_counter (gauge) - Records the amount of memory in use for in-flight activations. This is not the actual runtime memory but the memory specified by the action limits. invokerType defines whether it is a managed or a blackbox invoker.

Metrics below are for current memory capacity

  • openwhisk.histogram.loadbalancer_totalCapacity<invokerType>_counter (histogram) - Current memory capacity for all usable managed and blackbox invokers, total user memory in shard managed by controller. invokerType defines whether it is a managed or a blackbox invoker.

Metrics below are captured within load balancer

  • openwhisk.counter.loadbalancer_activations_counter (counter) - Records the count of activations sent to Kafka.
  • openwhisk.counter.controller_kafka_start (counter) - Records the count of activations sent to Kafka.
  • openwhisk.counter.controller_kafka_error (counter) - Records the count of activations which encountered some failure while submitting to Kafka.
  • openwhisk.histogram.controller_kafka_finish (histogram) - Records the time taken when activation was successfully submitted to Kafka.
  • openwhisk.histogram.controller_kafka_error (histogram) - Records the time taken when activation submission to Kafka resulted in failure.
  • openwhisk.counter.controller_loadbalancer_start (counter) - Records the count of activations submitted to load balancer.
  • openwhisk.histogram.controller_loadbalancer_finish (histogram) - Records the time taken to submit to load balancer.

Metrics below are for invoker state as recorded within load balancer monitoring.

  • openwhisk.gauge.loadbalancer_totalHealthyInvoker<invokerType>_counter (gauge) - Records the count of managed invokers considered healthy based on health pings. invokerType defines whether it is a managed or a blackbox invoker.
  • openwhisk.gauge.loadbalancer_totalUnresponsiveInvoker<invokerType>_counter (gauge) - Records the count of managed invokers considered unresponsive when health pings arrive fine but the invokers do not respond with active-acks in the given time. invokerType defines whether it is a managed or a blackbox invoker.
  • openwhisk.gauge.loadbalancer_totalOfflineInvoker<invokerType>_counter (gauge) - Records the count of managed invokers considered offline when no health pings arrive from the invokers. invokerType defines whether it is a managed or a blackbox invoker.
  • openwhisk.gauge.loadbalancer_totalUnhealthyInvoker<invokerType>_counter (gauge) - Records the count of managed invokers considered unhealthy when health pings arrive fine but the invokers report system errors. invokerType defines whether it is a managed or a blackbox invoker.

Metrics below provide information about completion ack processing in load balancers. Depending on configuration setting metrics_kamon_tags (see above), a base metric with tags or a set of metrics without tags will be emitted.

  • Base metric openwhisk.counter.loadbalancer_completionAck_counter: count of processed regular or forced completion acks.
  • Tag controller_id: the controller's id.
  • Tag type: the exact type of completion ack.
    • Type regular: a regular completion ack sent by an invoker and received in time. Does not include completion acks for healthcheck actions.
    • Type forced: no completion ack was received in time and the timeout forced the completion ack to close.
    • Type healthcheck: a regular completion ack for healthcheck actions sent by an invoker and received in time.
    • Type regularAfterForced: a regular completion ack sent by an invoker and not received in time. The completion ack was already forced.
    • Type forcedAfterRegular: a timeout tries to force a completion ack that has already been closed by a regular completion ack. A race condition that can occur if the regular completion ack is received near the timeout.
  • If metrics_kamon_tags is set to false, a set of metrics will be emitted, constructed using the following scheme: openwhisk.counter.loadbalancer<controller_id>_completionAck_<type>_counter.
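
The two naming variants for completion acks can be sketched as follows. This only mirrors the scheme described above; `completion_ack_metric` is a hypothetical helper, and the way tags are represented (a plain dict) is an assumption for illustration.

```python
COMPLETION_ACK_TYPES = (
    "regular",
    "forced",
    "healthcheck",
    "regularAfterForced",
    "forcedAfterRegular",
)


def completion_ack_metric(controller_id, ack_type, use_tags, prefix="openwhisk"):
    """Return (metric name, tags) for one completion ack type.

    With tags enabled, a single base metric carries controller_id and type as tags.
    Without tags, both values are encoded into the metric name itself.
    """
    if ack_type not in COMPLETION_ACK_TYPES:
        raise ValueError(f"unknown completion ack type: {ack_type}")
    if use_tags:
        name = f"{prefix}.counter.loadbalancer_completionAck_counter"
        return name, {"controller_id": str(controller_id), "type": ack_type}
    name = f"{prefix}.counter.loadbalancer{controller_id}_completionAck_{ack_type}_counter"
    return name, {}
```

For controller 0 and type forced, the untagged variant yields openwhisk.counter.loadbalancer0_completionAck_forced_counter.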

Invoker metrics

Container Init
  • openwhisk.counter.invoker_activationInit_start (counter) - Count of container initializations done.
  • openwhisk.histogram.invoker_activationInit_finish (histogram) - Time taken for successful container initializations.
  • openwhisk.histogram.invoker_activationInit_error (histogram) - Time taken when container initialization failed. The count values of this histogram give insight into the number of failed initializations.
Container Run
  • openwhisk.counter.invoker_activationRun_start (counter) - Count of action executions performed.
  • openwhisk.histogram.invoker_activationRun_finish (histogram) - Time taken for action execution for success case.
  • openwhisk.histogram.invoker_activationRun_error (histogram) - Time taken for action execution for failed cases. Count metrics of this histogram would give insight on failed execution count.
Container Start
  • openwhisk.counter.invoker_containerStart.cold_counter (counter) - Count of cold starts.
  • openwhisk.counter.invoker_containerStart.recreated_counter (counter) - Count of times a container is recreated.
  • openwhisk.counter.invoker_containerStart.warm_counter (counter) - Count of times a warm container is used.
Log Collection
  • openwhisk.counter.invoker_collectLogs_start (counter) - Count of times logs were collected.
  • openwhisk.counter.invoker_collectLogs_error (counter) - Count of failed log collections.
  • openwhisk.histogram.invoker_collectLogs_error (histogram) - Time taken for failed log collection.
  • openwhisk.histogram.invoker_collectLogs_finish (histogram) - Time taken for successful log collection.
Activation Handling
  • openwhisk.counter.invoker_activation_start (counter) - Count of activations handled
Docker Metrics

The following metrics capture stats around various docker command executions.

  • pause
    • openwhisk.counter.invoker_docker.pause_start
    • openwhisk.counter.invoker_docker.pause_error
    • openwhisk.counter.invoker_docker.pause_timeout
    • openwhisk.histogram.invoker_docker.pause_finish
    • openwhisk.histogram.invoker_docker.pause_error
  • ps
    • openwhisk.counter.invoker_docker.ps_start
    • openwhisk.counter.invoker_docker.ps_error
    • openwhisk.counter.invoker_docker.ps_timeout
    • openwhisk.histogram.invoker_docker.ps_finish
    • openwhisk.histogram.invoker_docker.ps_error
  • pull
    • openwhisk.counter.invoker_docker.pull_start
    • openwhisk.counter.invoker_docker.pull_error
    • openwhisk.counter.invoker_docker.pull_timeout
    • openwhisk.histogram.invoker_docker.pull_finish
    • openwhisk.histogram.invoker_docker.pull_error
  • rm
    • openwhisk.counter.invoker_docker.rm_start
    • openwhisk.counter.invoker_docker.rm_error
    • openwhisk.counter.invoker_docker.rm_timeout
    • openwhisk.histogram.invoker_docker.rm_finish
    • openwhisk.histogram.invoker_docker.rm_error
  • run
    • openwhisk.counter.invoker_docker.run_start
    • openwhisk.counter.invoker_docker.run_error
    • openwhisk.counter.invoker_docker.run_timeout
    • openwhisk.histogram.invoker_docker.run_finish
    • openwhisk.histogram.invoker_docker.run_error
  • unpause
    • openwhisk.counter.invoker_docker.unpause_start
    • openwhisk.counter.invoker_docker.unpause_error
    • openwhisk.counter.invoker_docker.unpause_timeout
    • openwhisk.histogram.invoker_docker.unpause_finish
    • openwhisk.histogram.invoker_docker.unpause_error
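
Every docker command above follows the same five-metric pattern, which makes the list easy to generate, for instance when building dashboards or alerts. The sketch below assumes only the names listed above; `docker_metric_names` is a hypothetical helper.

```python
DOCKER_COMMANDS = ("pause", "ps", "pull", "rm", "run", "unpause")


def docker_metric_names(command, prefix="openwhisk"):
    """Return the five metric names emitted for one docker command."""
    if command not in DOCKER_COMMANDS:
        raise ValueError(f"unknown docker command: {command}")
    base = f"invoker_docker.{command}"
    return [
        f"{prefix}.counter.{base}_start",
        f"{prefix}.counter.{base}_error",
        f"{prefix}.counter.{base}_timeout",
        f"{prefix}.histogram.{base}_finish",
        f"{prefix}.histogram.{base}_error",
    ]


# All docker metric names in one flat list:
ALL_DOCKER_METRICS = [
    name for command in DOCKER_COMMANDS for name in docker_metric_names(command)
]
```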

Kafka Metrics

Metrics below are emitted per Kafka topic.

  • openwhisk.histogram.kafka_<topic name>.delay_start - Time delay between when a message was pushed to Kafka and when it is read within a consumer. This metric is recorded for every message read.
  • openwhisk.gauge.kafka_<topic name>_counter - Records the queue size of the topic. By default this metric is emitted every 60 seconds.

Metrics per topic

  • cacheInvalidation - Emitted per controller while reading the cache invalidation messages.
    • openwhisk.histogram.kafka_cacheInvalidation.delay_start
    • openwhisk.histogram.kafka_cacheInvalidation_counter.count
  • health - Emitted per controller while reading the invoker health pings.
    • openwhisk.histogram.kafka_health.delay_start
    • openwhisk.histogram.kafka_health_counter
  • completed<controllerId> - Topic to receive completed activations. This is emitted per controller for its own topic. For example, for controller id 0 the metric names would be:
    • openwhisk.histogram.kafka_completed0.delay_start
    • openwhisk.histogram.kafka_completed0_counter
  • invoker<invokerId> - Topic to receive activations to complete. This is emitted per invoker for its own topic. For example, for invoker id 0 the metric names would be:
    • openwhisk.histogram.kafka_invoker0_counter
    • openwhisk.histogram.kafka_invoker0.delay_start

Database Metrics

Cache Metrics
  • openwhisk.counter.database_cacheHit_counter - Count of cache hits.
  • openwhisk.counter.database_cacheMiss_counter - Count of cache misses.
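
One common use of these two counters is a cache hit ratio per flush interval. The sketch below is a suggestion rather than something OpenWhisk computes itself; `cache_hit_ratio` is a hypothetical helper, and returning 0.0 for an interval without lookups is an arbitrary choice.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of cache lookups that were hits, from the two counter values."""
    total = hits + misses
    if total == 0:
        # No lookups in this interval; report 0.0 instead of dividing by zero.
        return 0.0
    return hits / total


# Usage, with hypothetical counter values for one flush interval:
# cache_hit_ratio(hits=90, misses=10)
```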

Metrics below are emitted for database related operations and follow a pattern

  • openwhisk.counter.database_<operation type>_start - Count of database operations done for given type. Example openwhisk.counter.database_getDocument_start.
  • openwhisk.counter.database_<operation type>_error - Count of database operations done for given type which resulted in error. Example openwhisk.counter.database_getDocument_error.
  • openwhisk.histogram.database_<operation type>_finish - Time taken for successful completion of given database operation. Example openwhisk.histogram.database_getDocument_finish.
  • openwhisk.histogram.database_<operation type>_error - Time taken for failed completion of given database operation. Example openwhisk.histogram.database_getDocument_error.

Operation Types

  • deleteDocument
  • getDocument
  • queryView
  • saveDocument
  • saveDocumentBulk

CosmosDB RU Metrics

When the database used is CosmosDB, metrics related to CosmosDB Request Units (RU) are also emitted.

If Kamon tags are enabled, the metric name is openwhisk.counter.cosmosdb_ru_used with the following tags:

  • mode - read or write
  • collection - Name of the collection. Examples: activations, whisks and subjects
  • action - Type of operation performed. Examples: get, put, del, query and count

If Kamon tags are not enabled, the metric name is of the form openwhisk.counter.cosmosdb.ru.<collection>.<action>
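
The two variants can be compared side by side with a small helper. This is a sketch of the naming rules stated above; `cosmosdb_ru_metric` is hypothetical, and representing tags as a dict is an assumption for illustration.

```python
def cosmosdb_ru_metric(collection, action, mode, use_tags, prefix="openwhisk"):
    """Return (metric name, tags) for a CosmosDB RU counter.

    With tags enabled, mode, collection and action travel as tags.
    Without tags, collection and action are part of the name and mode is not encoded.
    """
    if mode not in ("read", "write"):
        raise ValueError(f"mode must be 'read' or 'write', got: {mode}")
    if use_tags:
        tags = {"mode": mode, "collection": collection, "action": action}
        return f"{prefix}.counter.cosmosdb_ru_used", tags
    return f"{prefix}.counter.cosmosdb.ru.{collection}.{action}", {}
```

For example, a get on the activations collection without tags maps to openwhisk.counter.cosmosdb.ru.activations.get.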

User specific metrics

Configuration

User metrics are enabled by default and can be explicitly disabled by setting the following property in one of the Ansible configuration files:

user_events: false

Supported events

Activation is an event that occurs after each activation. It includes the following execution metadata:

waitTime - internal system hold time
initTime - time it took to initialize an action, e.g. docker init
statusCode - status code of the invocation: 0 - success, 1 - application error, 2 - action developer error, 3 - internal OpenWhisk error
duration - actual time the action code was running
kind - action flavor, e.g. Node.js
conductor - true for conductor backed actions
memory - maximum memory allowed for action container
causedBy - contains the "causedBy" annotation (can be "sequence" or nothing at the moment)
size - size (in bytes) of the invocation response

Metric is any user-specific event produced by the system. At this moment it includes the following information:

ConcurrentRateLimit - a user has exceeded its limit for concurrent invocations.
TimedRateLimit - the user has reached its per minute limit for the number of invocations.
ConcurrentInvocations - the number of in flight invocations per user.

Example events that could be consumed from Kafka. Activation:

{
  "body": {
    "statusCode": 0,
    "duration": 3,
    "name": "whisk.system/invokerHealthTestAction0",
    "waitTime": 583915671,
    "conductor": false,
    "kind": "nodejs:6",
    "initTime": 0,
    "memory": 256,
    "size": 463,
    "causedBy": false
  },
  "eventType": "Activation",
  "source": "invoker0",
  "subject": "whisk.system",
  "timestamp": 1524476122676,
  "userId": "d0888ad5-5a92-435e-888a-d55a92935e54",
  "namespace": "whisk.system"
}

Metric:

{
  "body": {
    "metricName": "ConcurrentInvocations",
    "metricValue": 1
  },
  "eventType": "Metric",
  "source": "controller0",
  "subject": "guest",
  "timestamp": 1524476104419,
  "userId": "23bc46b1-71f6-4ed5-8c54-816aa4f8c502",
  "namespace": "guest"
}
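
A consumer has to distinguish the two event types by their eventType field. The sketch below parses one raw event and extracts a few fields; it assumes events arrive as JSON strings shaped like the examples above and leaves out the Kafka client entirely. `summarize_event` and `STATUS_CODES` are hypothetical names.

```python
import json

# Status code meanings as listed under "Supported events".
STATUS_CODES = {
    0: "success",
    1: "application error",
    2: "action developer error",
    3: "internal OpenWhisk error",
}


def summarize_event(raw: str) -> dict:
    """Parse one user event and return a small, flat summary."""
    event = json.loads(raw)
    body = event["body"]
    summary = {"eventType": event["eventType"], "namespace": event["namespace"]}
    if event["eventType"] == "Activation":
        summary["name"] = body["name"]
        summary["status"] = STATUS_CODES.get(body["statusCode"], "unknown")
    elif event["eventType"] == "Metric":
        summary["name"] = body["metricName"]
        summary["value"] = body["metricValue"]
    else:
        raise ValueError(f"unsupported event type: {event['eventType']}")
    return summary
```

In a real consumer the raw string would come from the Kafka record value, and the summary could then be forwarded to a billing or monitoring backend.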

User-events consumer service

All user metrics can be consumed and published to various services such as Prometheus, Datadog, etc. via Kamon by using the user-events service.