
[8pt] Estimate wait time for advise requests #727

Open
fridex opened this issue Jun 29, 2021 · 23 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/observability Categorizes an issue or PR as relevant to SIG Observability triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@fridex
Contributor

fridex commented Jun 29, 2021

Is your feature request related to a problem? Please describe.

As a Thoth user/operator, I would like to know how long I need to wait for a resolved software stack to be available from the recommender system. To support this, we could expose an estimated time for an advise request to finish. As we have information about the maximum time allocated for advisers and about the number of queued/pending/running advise requests, we can provide an estimate of the time needed to retrieve advise results from the system.

Describe the solution you'd like

Provide a metric that shows the estimated wait time for the adviser to provide results. This can later be exposed on user-api and shown to users (e.g., in the thamos CLI).

The metric can be generalized to other jobs we run: package-extract, provenance-check, ...

@fridex fridex added kind/feature Categorizes issue or PR as related to a new feature. triage/needs-information Indicates an issue needs more information in order to work on it. labels Jun 29, 2021
@pacospace
Contributor

pacospace commented Jun 29, 2021

Is your feature request related to a problem? Please describe.

As a Thoth user/operator, I would like to know how long I need to wait for a resolved software stack to be available from the recommender system. To support this, we could expose an estimated time for an advise request to finish. As we have information about the maximum time allocated for advisers and about the number of queued/pending/running advise requests, we can provide an estimate of the time needed to retrieve advise results from the system.

Describe the solution you'd like

Provide a metric that shows the estimated wait time for the adviser to provide results. This can later be exposed on user-api and shown to users (e.g., in the thamos CLI).

The metric can be generalized to other jobs we run: package-extract, provenance-check, ...

Isn't workflow task latency something that gives an estimate already? We know on average the percentage of workflows successful in a certain duration bucket, from the best case (< 5s) to the worst case (> 900s). Could this issue be made more detailed depending on the recommendation type, number of packages, etc.? wdyt?

@fridex
Contributor Author

fridex commented Jun 30, 2021

Isn't workflow task latency something that gives an estimate already? We know on average the percentage of workflows successful in a certain duration bucket, from the best case (< 5s) to the worst case (> 900s).

If I understand this metric correctly, it is more about putting tasks in a workflow into buckets, so we have information about tasks and their duration.

Could this issue be made more detailed depending on the recommendation type, number of packages, etc.? wdyt?

It might be worth keeping this simple - even a request with one direct dependency can result in a huge resolved software stack. For all the recommendation types we allocate a maximum amount of CPU time per request in the cluster - this upper boundary applies to all the recommendation types; only the latest recommendation type can (but does not necessarily have to) finish sooner. This upper boundary can be used to estimate how much time will be required to serve user requests, based on the quota and resource allocation in the adviser workflow.

Example:

We know we can serve 5 requests in parallel in the backend namespace. Users scheduled 10 advisers.

If 15 minutes are allocated per advise request in the cluster, the first 5 advisers finish in 15 minutes and the other 5 advisers finish in 30 minutes (15+15). With the system in this state, a possible 11th request coming to the system will be satisfied in 45 minutes (but it can be satisfied sooner, or also later - see below) - this is the $SUBJ metric.

As we also run Kebechet in the backend namespace, things might get complicated if the namespace is polluted with Kebechet pods. But having that estimate (and possibly improving it) can still be valuable, so we see how the system behaves and what resource allocation we need to sanely satisfy the userbase we have (estimating SLA).
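
A minimal sketch of the queueing arithmetic in this example, assuming requests are served in fixed batches of parallel slots; the function and constants below are illustrative, not part of Thoth:

```python
import math

PARALLEL_SLOTS = 5    # advise requests the backend namespace can run at once
TIME_LIMIT_MIN = 15   # upper bound allocated per advise request (minutes)

def estimated_wait_minutes(position_in_queue: int) -> int:
    """Worst-case wait for a request entering the queue at the given position."""
    # Requests are served in batches of PARALLEL_SLOTS; each batch takes at
    # most TIME_LIMIT_MIN minutes.
    batches = math.ceil(position_in_queue / PARALLEL_SLOTS)
    return batches * TIME_LIMIT_MIN

# With 10 advisers already scheduled, an 11th request lands in the third
# batch: 3 * 15 = 45 minutes.
print(estimated_wait_minutes(11))  # -> 45
```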

@goern
Member

goern commented Jun 30, 2021

@pacospace is this good to go or do we still need information? If you are happy feel free to change prio etc...

/sig observability

@sesheta sesheta added the sig/observability Categorizes an issue or PR as relevant to SIG Observability label Jun 30, 2021
@pacospace pacospace added triage/needs-information Indicates an issue needs more information in order to work on it. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Jun 30, 2021
@goern goern added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Jul 7, 2021
@sesheta
Member

sesheta commented Aug 6, 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@sesheta sesheta closed this as completed Aug 6, 2021
@sesheta
Member

sesheta commented Aug 6, 2021

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fridex
Contributor Author

fridex commented Aug 9, 2021

/reopen
/remove-lifecycle rotten

@sesheta
Member

sesheta commented Aug 9, 2021

@fridex: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sesheta sesheta reopened this Aug 9, 2021
@pacospace pacospace added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 20, 2021
@sesheta
Member

sesheta commented Sep 19, 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@sesheta sesheta closed this as completed Sep 19, 2021
@sesheta
Member

sesheta commented Sep 19, 2021

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pacospace pacospace reopened this Sep 20, 2021
@goern
Member

goern commented Oct 4, 2021

@pacospace could this be another data-driven development topic?

@pacospace
Contributor

@pacospace could this be another data-driven development topic?

Sure, sounds good!

@goern
Member

goern commented Oct 27, 2021

/project observability

@goern
Member

goern commented Oct 27, 2021

/sig observability

@pacospace
Contributor

pacospace commented Nov 8, 2021

Isn't workflow task latency something that gives an estimate already? We know on average the percentage of workflows successful in a certain duration bucket, from the best case (< 5s) to the worst case (> 900s).

If I understand this metric correctly, it is more about putting tasks in a workflow into buckets, so we have information about tasks and their duration.

Could this issue be made more detailed depending on the recommendation type, number of packages, etc.? wdyt?

It might be worth keeping this simple - even a request with one direct dependency can result in a huge resolved software stack. For all the recommendation types we allocate a maximum amount of CPU time per request in the cluster - this upper boundary applies to all the recommendation types; only the latest recommendation type can (but does not necessarily have to) finish sooner. This upper boundary can be used to estimate how much time will be required to serve user requests, based on the quota and resource allocation in the adviser workflow.

Example:

We know we can serve 5 requests in parallel in the backend namespace. Users scheduled 10 advisers.

If 15 minutes are allocated per advise request in the cluster, the first 5 advisers finish in 15 minutes and the other 5 advisers finish in 30 minutes (15+15). With the system in this state, a possible 11th request coming to the system will be satisfied in 45 minutes (but it can be satisfied sooner, or also later - see below) - this is the $SUBJ metric.

As we also run Kebechet in the backend namespace, things might get complicated if the namespace is polluted with Kebechet pods. But having that estimate (and possibly improving it) can still be valuable, so we see how the system behaves and what resource allocation we need to sanely satisfy the userbase we have (estimating SLA).

What about:

n_p = number of parallel workflows that can run in the namespace (backend)

Workflows running in the backend namespace:

n_a = number of adviser workflows running
n_k = number of Kebechet workflows running
n_pc = number of provenance-checker workflows running

n_p = n_a + n_k + n_pc

tav_a = average time an adviser workflow runs
tav_k = average time a Kebechet workflow runs
tav_pc = average time a provenance-checker workflow runs

t_wait_time_advise = tav_a x n_a + tav_k x n_k + tav_pc x n_pc
                   = tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)

All those metrics are already available in Prometheus, so we can estimate that.
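
A minimal sketch of this estimate, assuming the running counts and average durations have already been queried from Prometheus; the function and parameter names below are illustrative placeholders, not actual metric names:

```python
def estimate_advise_wait_seconds(
    n_a: int,       # running adviser workflows
    n_k: int,       # running Kebechet workflows
    n_pc: int,      # running provenance-checker workflows
    tav_a: float,   # average adviser run time (seconds)
    tav_k: float,   # average Kebechet run time (seconds)
    tav_pc: float,  # average provenance-checker run time (seconds)
) -> float:
    # t_wait_time_advise = tav_a * n_a + tav_k * n_k + tav_pc * n_pc
    return tav_a * n_a + tav_k * n_k + tav_pc * n_pc

# Example: 3 advisers, 1 Kebechet, 1 provenance checker currently running.
print(estimate_advise_wait_seconds(3, 1, 1, 900.0, 300.0, 120.0))  # -> 3120.0
```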

@fridex
Contributor Author

fridex commented Nov 8, 2021

Sounds good. What about also counting requests that are queued?

@pacospace
Contributor

Sounds good. What about also counting requests that are queued?

I have to check how to get that number from Kafka, but in theory we can do that, yes! And do we want to provide this information at the user-API level?

@pacospace
Contributor

pacospace commented Nov 8, 2021

t_wait_time_advise = tav_a x n_a + tav_k x n_k + tav_pc x n_pc = tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)

All those metrics are already available in Prometheus, so we can estimate that.

Based on @fridex's suggestion:

t_wait_time_advise = tav_a x kafka_adviser_requests_queued + tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)

kafka_adviser_requests_queued = number of adviser message requests queued in Kafka.
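
A sketch extending the previous one with the queued Kafka messages; kafka_adviser_requests_queued is assumed to be obtained elsewhere (e.g., from Strimzi metrics), and the function is illustrative only:

```python
def estimate_advise_wait_with_queue_seconds(
    kafka_adviser_requests_queued: int,  # adviser messages waiting in Kafka
    n_a: int, n_k: int, n_pc: int,       # running workflow counts, as before
    tav_a: float, tav_k: float, tav_pc: float,  # average run times (seconds)
) -> float:
    # Time attributed to the workflows already running, plus each queued
    # adviser message expected to take an average adviser run.
    running = tav_a * n_a + tav_k * n_k + tav_pc * n_pc
    queued = tav_a * kafka_adviser_requests_queued
    return queued + running
```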

@pacospace pacospace self-assigned this Nov 8, 2021
@pacospace
Contributor

pacospace commented Nov 9, 2021

t_wait_time_advise = tav_a x n_a + tav_k x n_k + tav_pc x n_pc = tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)
All those metrics are already available in Prometheus, so we can estimate that.

Based on @fridex's suggestion:

t_wait_time_advise = tav_a x kafka_adviser_requests_queued + tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)

kafka_adviser_requests_queued = number of adviser message requests queued in Kafka.

Based on a conversation with @KPostOffice, we should consider:
kafka_adviser_requests_queued = _get_current_offset_from_strimzi_metrics() - _get_investigator_consumer_offset(), using metrics from Strimzi. Moreover, @KPostOffice pointed out that it is important to take partitions into account:

(current_offset_p1 - consumer_offset_p1) + (current_offset_p2 - consumer_offset_p2) + ... + (current_offset_pN - consumer_offset_pN)
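
A minimal sketch of the per-partition lag sum above; the two offset maps are hypothetical stand-ins for whatever exposes the Strimzi log-end offsets and the investigator consumer-group offsets:

```python
from typing import Dict

def kafka_adviser_requests_queued(
    current_offsets: Dict[int, int],   # partition -> latest (log-end) offset
    consumer_offsets: Dict[int, int],  # partition -> investigator consumer offset
) -> int:
    """Total lag: sum over partitions of (current offset - consumer offset)."""
    return sum(
        current - consumer_offsets.get(partition, 0)
        for partition, current in current_offsets.items()
    )

# Example with two partitions: (120 - 115) + (98 - 90) = 13 queued messages.
print(kafka_adviser_requests_queued({0: 120, 1: 98}, {0: 115, 1: 90}))
```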

@fridex
Contributor Author

fridex commented Nov 9, 2021

Sounds interesting 👍🏻 It might be a good idea to discuss this at the tech talk.

@pacospace
Contributor

@harshad16, are Strimzi metrics collected by Prometheus in smaug and aws?

@pacospace pacospace removed their assignment Nov 22, 2021
@harshad16
Member

@pacospace sorry for missing your question here.
I would have to check on this; maybe we need to create a service monitor for it.

@goern goern added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Apr 26, 2022
@harshad16
Member

One way to solve this is to calculate the wait from the Kafka queue and the adviser workflows scheduled per hour.

Acceptance criteria

  • Calculate the wait time from the Kafka queue and scheduled adviser runs.
  • Place the metric in the metrics exporter, or suggest an alternative to persist this new metric (see the sketch after this list).
  • Create the panel in the [thoth user metric](https://github.com/thoth-station/thoth-application/blob/master/grafana-dashboard/base/thoth-service-metrics.json) dashboard.
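
A hedged sketch of the second acceptance criterion, exposing the estimate as a gauge; the use of prometheus_client, the metric name, and the placeholder helper below are all assumptions, not the actual exporter code:

```python
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical metric name; the real exporter may choose a different one.
advise_wait_time = Gauge(
    "thoth_estimated_advise_wait_time_seconds",
    "Estimated wait time before a new advise request is served.",
)

def estimate_wait_seconds() -> float:
    # Placeholder: a real implementation would combine the Kafka queue depth
    # and the adviser workflows scheduled per hour, per the criteria above.
    return 0.0

if __name__ == "__main__":
    start_http_server(8000)  # scrape endpoint for Prometheus
    while True:
        advise_wait_time.set(estimate_wait_seconds())
        time.sleep(60)
```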

Reference:

@harshad16 harshad16 changed the title Estimate wait time for advise requests [8pt] Estimate wait time for advise requests Sep 15, 2022
@harshad16
Copy link
Member

/triage accepted

@sesheta sesheta added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Sep 15, 2022
@harshad16 harshad16 moved this to Backlog in SIG-Observability Sep 22, 2022
@harshad16 harshad16 moved this from 📋 Backlog to 🔖 Ready in SIG-Observability Sep 22, 2022
@codificat codificat moved this to 📋 Backlog in Planning Board Sep 26, 2022