Write checkable create & delete sla history events #566
base: main
Conversation
I just had an idea how we could name that type of SLA history, after we didn't really come up with a good name for it initially: lifecycle
Please don’t force-push for now.
I have played with this PR in a testing environment and have not encountered any errors so far.

However, I have a more general question: how will the `sla_lifecycle` table be used later on? The `get_sla_ok_percent` function does not honor it and, unless I am missing something, cannot even do so for deleted checkable objects.
```sql
delete_time biguint NOT NULL DEFAULT 0,

CONSTRAINT pk_sla_lifecycle PRIMARY KEY (id, delete_time)
);
```
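For context, the quoted lines are only an excerpt of the table definition. A plausible full shape of the table, pieced together from the columns discussed in this PR (the exact column list and types are assumptions, not the authoritative schema):

```sql
-- Sketch only: columns beyond the quoted delete_time and the primary key are
-- inferred from the discussion (create_time, per-environment scoping) and may
-- not match the final schema. biguint is the schema's own unsigned domain type.
CREATE TABLE sla_lifecycle (
  id binary(20) NOT NULL,              -- host or service ID
  environment_id binary(20) NOT NULL,
  create_time biguint NOT NULL,
  delete_time biguint NOT NULL DEFAULT 0,

  CONSTRAINT pk_sla_lifecycle PRIMARY KEY (id, delete_time)
);
```

The `(id, delete_time)` primary key with `delete_time = 0` meaning "still alive" guarantees at most one alive row per object while allowing any number of historical create/delete pairs.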
I was thinking about creating some indices for the queries within the `SyncCheckablesSlaLifecycle` function. In a huge environment, those queries may take some time. However, I cannot say how huge an environment must become to make those queries slow, or whether Icinga 2 will be the performance bottleneck before that. I have not tested this or anything, just wanted to put the idea out there.
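If such an index turns out to be needed, a minimal sketch might look like the following (the index name and column choice are assumptions, and PostgreSQL partial-index syntax is used for illustration; whether it actually helps depends on the real query plans):

```sql
-- Sketch: the sync queries repeatedly look for "alive" rows (delete_time = 0),
-- e.g. to find entries whose host/service no longer exists. A partial index
-- keeps that scan limited to alive rows instead of the whole history.
CREATE INDEX idx_sla_lifecycle_alive
    ON sla_lifecycle (id)
    WHERE delete_time = 0;
```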
> or if Icinga 2 will be the performance bottleneck before

If there's an issue, you could probably trigger it without Icinga 2 becoming an issue. You can create and delete many hosts and services over time, keeping the active config in Icinga 2 at a constant size but growing this table.
Valid point, I had not considered cloud-native, fast-moving infrastructures with a multitude of short-lived VMs.
That's right, and it will never make use of it either! As we discussed last time, we will only collect the data for the time being, until we get the time to finalise Icinga/icingadb-web#710, which will make use of this newly introduced table, delegate the SLA generation from the SQL procedure to PHP code, and mark the
I've pushed the new transparent screenshots and also changed the main function a bit, so I won't be pushing anything from now on unless someone requests a change.
Why do we need this?
Currently, we generate the SLA history events only when there are, e.g., state change and downtime start/end events for the checkables. Under some circumstances (if a checkable is created once and never deleted) this should be sufficient. However, when you, e.g., delete a host, create it again a couple of days later, and want to generate SLA reports for this host at the end of the week, the result can vary depending on which state the host had before it was deleted. In order to generate SLA reports as accurately as possible, we decided to track the checkable creation and deletion time on top of the existing information. And since Icinga 2 doesn't really know when an object has been deleted (at least not in a simple way), this PR takes care of it.
However, Icinga DB doesn't know when an object has been deleted either; it just takes the time the delete event for that object arrived and puts it into the new table. This means that when you delete checkables while Icinga DB is stopped, the events Icinga DB writes after it is started won't reflect the actual delete/create time. There is no better way to handle this gracefully, though.
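To make the motivation concrete: with creation and deletion times recorded, a report can clamp each object's SLA window to the periods in which it actually existed. A hedged sketch (the table and column names follow this PR; the `GREATEST`/`LEAST` clamping is purely illustrative, not the shipped report logic):

```sql
-- Sketch: restrict a report interval [:start, :end] to the spans in which the
-- checkable actually existed. delete_time = 0 means "still alive".
SELECT GREATEST(create_time, :start) AS span_start,
       LEAST(CASE WHEN delete_time = 0 THEN :end ELSE delete_time END,
             :end) AS span_end
  FROM sla_lifecycle
 WHERE id = :checkable_id
   AND create_time < :end
   AND (delete_time = 0 OR delete_time > :start);
```

Each returned span would then be fed into the SLA calculation instead of assuming the object existed for the whole report period.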
Config sync
The upgrade script for 1.3.0 generates a `created_at` sla lifecycle entry for all existing hosts and services once it is applied, as proposed in #566 (comment). Thus, all special cases, such as the implementation of a custom fingerprint type for services¹ and performing an extra query to retrieve host IDs from the database for runtime-deleted services, have been removed.

Implementation
The new table `sla_lifecycle` has a primary key over `(id, delete_time)`, where `delete_time = 0` means "not deleted yet" (the column has to be `NOT NULL` due to being included in the primary key). `id` is either the service or host ID for which that sla lifecycle is being generated. This ensures that there can only be one row per object stating that the object is currently alive in Icinga 2.

Initial sync
Icinga DB performs a simple `INSERT` statement for `Host` and `Service` types after each initial config dump unconditionally, but matches on hosts and services that don't already have a `create_time` entry with `delete_time = 0` in the `sla_lifecycle` table, and sets their `create_time` timestamp to now. Additionally, it also updates the `delete_time` of each existing `sla_lifecycle` entry whose host/service ID cannot be found in the host/service tables. It's unlikely, but when a given checkable doesn't already have a `create_time` entry in the database, the update query won't update anything. Likewise, the insert statements may also become a no-op if the checkables already have a `create_time` entry with `delete_time = 0`.

Create

Nothing to be done here (all newly created objects will be covered by the bulk `INSERT ... SELECT FROM host/service` queries after the config dump).

Update

Nothing to be done here (the object existed before and continues to exist).
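The bulk statements run after each config dump might look roughly like the following. This is a sketch only: the real queries live in the `SyncCheckablesSlaLifecycle` function and may differ, e.g. in how service rows are joined with their hosts and how the environment is scoped.

```sql
-- Sketch: give every host without an alive lifecycle row a create_time of now.
-- (An analogous statement would run for the service table.)
INSERT INTO sla_lifecycle (id, create_time, delete_time)
SELECT host.id, :now, 0
  FROM host
 WHERE NOT EXISTS (SELECT 1 FROM sla_lifecycle sl
                    WHERE sl.id = host.id AND sl.delete_time = 0);

-- Sketch: mark alive lifecycle rows whose object no longer exists as deleted.
UPDATE sla_lifecycle
   SET delete_time = :now
 WHERE delete_time = 0
   AND NOT EXISTS (SELECT 1 FROM host    WHERE host.id    = sla_lifecycle.id)
   AND NOT EXISTS (SELECT 1 FROM service WHERE service.id = sla_lifecycle.id);
```

Both statements are intentionally idempotent: re-running them after an unchanged config dump matches nothing and changes nothing.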
Delete
Nothing to be done here (all deleted objects will be covered by the general bulk `sla_lifecycle` queries after the config dump).

Runtime updates
Upsert
Performs an `INSERT` with ignore-on-duplicate-key for both create and update events (these look identical in the runtime update stream). If the object is already marked as alive in `sla_lifecycle`, this will do nothing; otherwise, it will mark it as created now (including when an object that was created before this feature was enabled is updated).

Delete
It assumes that there is a `created_at` sla_lifecycle entry for the checkable currently being deleted, and performs a simple `UPDATE` statement setting `delete_time = now` (i.e. it updates the PK of the row), marking the alive row for the object as deleted. If, for whatever reason, there is no corresponding `created_at` entry for this checkable, that update statement becomes a no-op, as the upgrade script and/or the initial config dump should have generated the necessary entries for all existing objects that were created before this feature was available.

Footnotes
1. Before @julianbrost proposed a change in https://github.com/Icinga/icingadb/pull/566#issuecomment-2273088195, services had to implement an additional custom fingerprint type `ServiceFingerprint`, which was used to also retrieve their host IDs when computing the config delta of the initial sync. By introducing this type, the necessity of always performing an extra `SELECT` query to additionally retrieve the host IDs was eliminated, as the host ID is always required for the sla lifecycles to work. ↩