This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

monitoring 4.1 System Events

Steve Jones edited this page Sep 7, 2017 · 1 revision

Document Status: Draft

Overview

Eucalyptus cannot detect every failure, but the class of errors it already knows about and can report accurately covers a high percentage of the operational failures seen in the field. Examples include an SC failing to connect to its SAN, an NC experiencing libvirt failures and needing to restart its libvirt connections, and an OSG failing to authenticate to its backend. Failures of this class need no special out-of-band detection or reporting, so the system itself should report and manage them. Most are already detected and written to the error logs; Eucalyptus will present them to the administrator in a single location, with added capabilities to manage and track them.

Eucalyptus will present a single basic abstraction: events. An event is any system behavior or occurrence of interest to the administrator. Each event has a severity, a timestamp, a "location", a unique identifier, and information that ties it to resources or services as needed. Events are artifacts of the service architecture and framework of Eucalyptus, not of a single service. Any Eucalyptus service can create events, and the system handles them uniformly: they are aggregated and persisted in a consistent fashion for all services.

Events

Events are the base abstraction and have the following properties:

  1. Unique identifier - globally unique identifier for a single event occurrence
  2. Timestamp - time on the originating host at which the event occurred
  3. Service - the name of the service that created the event, as registered in the system via standard Eucalyptus service registration operations
  4. Component - the component type of the source service (e.g. compute, objectstorage, storage, cloudwatch)
  5. Host IP/DNS name - host where the event occurred, as used in the registration entry
  6. Description - free-form text description
  7. Severity - INFO|WARN|ERROR|CRITICAL
  8. (OPTIONAL) Event Code - a numeric code to identify the event type, if applicable (for consistent reporting)
  9. (OPTIONAL) Related Resource (instanceId, volumeId, etc.) - the resource involved in the event, as applicable. For example, an instanceId might be applicable for an event involving a failed volume attachment
  10. (OPTIONAL) Originating correlationId - the correlationId of the request that triggered the event (e.g. a runInstance correlationId)

Example events: Service started, Service shutdown, InstanceRunFailure
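The property list above maps naturally onto a small record type. The following is a minimal sketch in Python, not the Eucalyptus implementation: the class name, field names, the use of a UUID as the unique identifier, and the sample values are all assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Severity(Enum):
    """The four severities listed in the event properties."""
    INFO = "INFO"
    WARN = "WARN"
    ERROR = "ERROR"
    CRITICAL = "CRITICAL"


@dataclass(frozen=True)
class Event:
    # Required properties (field names are hypothetical).
    service: str          # registered service name
    component: str        # component type, e.g. "compute" or "storage"
    host: str             # host IP or DNS name from the registration entry
    description: str      # free-form text
    severity: Severity
    # Optional properties.
    event_code: Optional[int] = None
    related_resource: Optional[str] = None
    correlation_id: Optional[str] = None
    # Generated on creation. A random UUID is an assumption; the design only
    # requires that the identifier be globally unique.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Hypothetical InstanceRunFailure event raised by a node controller.
example = Event(
    service="nc-10.0.0.5",
    component="compute",
    host="10.0.0.5",
    description="InstanceRunFailure: libvirt connection lost",
    severity=Severity.ERROR,
    related_resource="i-12345678",
)
```

Making the record immutable (`frozen=True`) reflects the idea that an event describes something that has already happened and should not change as it moves upstream.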

Events are generated as they occur and are propagated through the system with no specific timeliness guarantees. They will expire a configurable amount of time after they are generated.
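As a rough illustration of time-based expiry, the sketch below drops events older than a configurable lifetime. It is a simplification under stated assumptions: events are reduced to `(timestamp, description)` pairs, and the function names and the 24-hour lifetime are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_expired(generated_at, ttl, now):
    """Return True once at least `ttl` has elapsed since the event was generated."""
    return now - generated_at >= ttl


def drop_expired(events, ttl, now):
    """Keep only the (timestamp, description) pairs that have not expired."""
    return [(ts, desc) for ts, desc in events if not is_expired(ts, ttl, now)]


# Fixed reference time so the example is deterministic.
now = datetime(2017, 9, 7, 12, 0, tzinfo=timezone.utc)
events = [
    (now - timedelta(hours=30), "Service started"),
    (now - timedelta(hours=1), "InstanceRunFailure"),
]
kept = drop_expired(events, ttl=timedelta(hours=24), now=now)
```

With a 24-hour lifetime, only the event generated one hour before `now` remains in `kept`.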

Event Generation, Propagation, and Persistence

The internal service that manages, propagates, and persists events is "Empyrean", the service that manages all service state, state transitions, and service modifications. Because events are common to all services rather than part of a specific one, Empyrean is the logical location. Empyrean runs in every JVM and can be accessed on any host. The Empyrean instance on the CLC aggregates all events to itself, giving the administrator a single point for listing and clearing them. Events may also be persisted at any or all points along the propagation path, but they are propagated only in response to higher-level "DescribeEvents" requests and are managed only at the CLC.

[Figure: EucalyptusEventHandling]

Event Components and Responsibilities

  • Empyrean Service - The JVM-local service that manages service state and transitions for services in that JVM

    • Aggregates all events from services that run in that container/JVM
  • CLC Empyrean Service - Manages state and reporting for all services in the cloud

    • Aggregates all events from all services in the cloud by interacting with other Empyreans and EventHandlers on the Java component hosts and the CCs.
    • Persists events in cloud-events.log
    • Manages flushing the event buffers after the configured timeout for events (examples: 2 days, 24 hours, or 2 weeks)
    • Handles interaction with administrator via API (DescribeEvents, DeleteEvents, etc)
  • EventHandler - A common C library used by the CC/NC to buffer and persist local events as well as aggregated events from "downstream" components.

    • Persists events to a local event log (can be the standard nc|cc.log file or a separate file if needed)
    • Exposes the Events interface in SOAP to upstream components
    • Buffers events generated locally or aggregated from downstream until the upstream components ask for them via invoking DescribeEvents on the handler or buffers fill
    • Could use a persistent buffering mechanism (e.g. Lightning MDB (LMDB) or LevelDB) to ensure events aren't lost on service restart
    • Current working code uses Redis as persistence mechanism on each host
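The pull-based buffering described in the responsibilities above can be sketched as follows. This is an in-memory illustration only, written in Python rather than the C or Java of the real components. The class and method names are hypothetical, there is no SOAP transport or persistence, and dropping the oldest event when the buffer fills is an assumption, since the behavior on a full buffer is not specified here.

```python
from collections import deque


class EventBuffer:
    """Buffers local and downstream events until an upstream caller asks for them."""

    def __init__(self, name, capacity=1000, downstream=()):
        self.name = name
        # Bounded buffer; when full, the oldest entry is dropped (assumption).
        self._buffer = deque(maxlen=capacity)
        self._downstream = list(downstream)

    def record(self, description):
        """Store a locally generated event as a (source, description) pair."""
        self._buffer.append((self.name, description))

    def describe_events(self):
        """Pull from downstream handlers, then hand over and clear the buffer."""
        for handler in self._downstream:
            self._buffer.extend(handler.describe_events())
        drained = list(self._buffer)
        self._buffer.clear()
        return drained


# Hypothetical NC -> CC -> CLC chain.
nc = EventBuffer("nc-1")
cc = EventBuffer("cc-1", downstream=[nc])
clc = EventBuffer("clc", downstream=[cc])

nc.record("libvirt connection restarted")
cc.record("network mode reconfigured")

all_events = clc.describe_events()
```

Nothing moves until the top-level call is made, which mirrors the rule that events are propagated only in response to a DescribeEvents request. Clearing the buffer on hand-off is a simplification; a real handler would also persist events locally so that they survive a restart.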

Administrator Interactions and API

Infrastructure administrators can view and clear events using an event API with associated admin tools:

Service endpoint: <UFS IP>:8773/services/empyrean (also available directly on the CLC)

Action: DescribeEvents

Parameters: Host (optional), Service (optional), StartDate (optional), StopDate (optional), type (optional)

CLI: euca-describe-events
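One way to read the optional DescribeEvents parameters is as filters that are combined with AND, where an omitted parameter matches everything. The sketch below applies that reading to an in-memory list. It is not the service implementation: the function name, the dictionary keys, the inclusive date bounds, and the interpretation of `type` as the event code are all assumptions.

```python
from datetime import datetime


def describe_events(events, host=None, service=None,
                    start_date=None, stop_date=None, event_type=None):
    """Return the events that match every filter that was supplied."""
    def matches(e):
        return (
            (host is None or e["host"] == host)
            and (service is None or e["service"] == service)
            # Date bounds are treated as inclusive (assumption).
            and (start_date is None or e["timestamp"] >= start_date)
            and (stop_date is None or e["timestamp"] <= stop_date)
            # "type" is assumed to correspond to the numeric event code.
            and (event_type is None or e.get("event_code") == event_type)
        )
    return [e for e in events if matches(e)]


# Hypothetical sample data.
sample = [
    {"id": "e1", "host": "10.0.0.5", "service": "nc-1",
     "timestamp": datetime(2017, 9, 1, 8, 0), "event_code": 1001},
    {"id": "e2", "host": "10.0.0.6", "service": "sc-1",
     "timestamp": datetime(2017, 9, 3, 8, 0), "event_code": 2001},
    {"id": "e3", "host": "10.0.0.5", "service": "nc-1",
     "timestamp": datetime(2017, 9, 5, 8, 0)},
]

on_host = describe_events(sample, host="10.0.0.5")
recent_on_host = describe_events(
    sample, host="10.0.0.5", start_date=datetime(2017, 9, 2)
)
```

Calling the function with no filters returns every event, which matches all parameters being optional.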

Action: DeleteEvent - Deletes event(s) from the event log permanently regardless of state or type, typically used to flush prior to normal expiration

Parameters: EventID or All (required)

CLI: euca-delete-event
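The "EventID or All" requirement can be illustrated with a small sketch over an in-memory store keyed by event ID. The function name and signature are hypothetical, and rejecting a call that supplies both or neither, as well as returning a count of zero for an unknown ID, are assumptions rather than documented behavior.

```python
def delete_event(store, event_id=None, delete_all=False):
    """Delete one event by ID, or all events; return the number removed."""
    # Exactly one of event_id / delete_all must be supplied (assumption).
    if (event_id is None) == (not delete_all):
        raise ValueError("specify exactly one of event_id or delete_all")
    if delete_all:
        removed = len(store)
        store.clear()
        return removed
    # Deleting an unknown ID is treated as a no-op (assumption).
    return 1 if store.pop(event_id, None) is not None else 0


# Hypothetical store: event ID -> description.
store = {"e1": "Service started", "e2": "InstanceRunFailure", "e3": "Service shutdown"}

removed_one = delete_event(store, event_id="e2")
removed_rest = delete_event(store, delete_all=True)
```

Deletion here ignores severity and age, in line with the action removing events "regardless of state or type".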


tag:confluence tag:rls-4.1 tag:monitoring



