
[8pt] Estimate wait time for advise requests #727

Open
fridex opened this issue Jun 29, 2021 · 23 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/observability Categorizes an issue or PR as relevant to SIG Observability triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@fridex
Contributor

fridex commented Jun 29, 2021

Is your feature request related to a problem? Please describe.

As a Thoth user/operator, I would like to know how long I need to wait for a resolved software stack to be available from the recommender system. To support this, we could expose an estimated time for an advise request to finish. As we have information about the maximum time allocated for advisers and about the number of queued/pending/running advise requests, we can provide an estimate of the time needed to retrieve advise results from the system.

Describe the solution you'd like

Provide a metric that shows the estimated wait time for the adviser to provide results. This can later be exposed on user-api and shown to users (e.g., in the thamos CLI).

The metric can be generalized to other jobs we run: package-extract, provenance-check, ...

@fridex fridex added kind/feature Categorizes issue or PR as related to a new feature. triage/needs-information Indicates an issue needs more information in order to work on it. labels Jun 29, 2021
@pacospace
Contributor

pacospace commented Jun 29, 2021

Is your feature request related to a problem? Please describe.

As a Thoth user/operator, I would like to know how long I need to wait for a resolved software stack to be available from the recommender system. To support this, we could expose an estimated time for an advise request to finish. As we have information about the maximum time allocated for advisers and about the number of queued/pending/running advise requests, we can provide an estimate of the time needed to retrieve advise results from the system.

Describe the solution you'd like

Provide a metric that shows the estimated wait time for the adviser to provide results. This can later be exposed on user-api and shown to users (e.g., in the thamos CLI).

The metric can be generalized to other jobs we run: package-extract, provenance-check, ...

Isn't workflow task latency something that gives an estimate already? We know on average the percentage of workflows successful in a certain duration bucket, from the best case (< 5s) to the worst case (> 900s). Could this issue be made more detailed depending on the recommendation type, number of packages, etc.? wdyt?

@fridex
Contributor Author

fridex commented Jun 30, 2021

Isn't workflow task latency something that gives an estimate already? We know on average the percentage of workflows successful in a certain duration bucket, from the best case (< 5s) to the worst case (> 900s).

If I understand this metric correctly, it is more about putting tasks in a workflow into buckets, so we have information about tasks and their duration.

Could this issue be made more detailed depending on the recommendation type, number of packages, etc.? wdyt?

It might be worth keeping this simple - even a request with one direct dependency can result in a huge resolved software stack. For all the recommendation types we allocate a maximum amount of CPU time per request in the cluster - this upper boundary applies to all the recommendation types; only the latest recommendation type can (but does not necessarily have to) finish sooner. This upper boundary can be used to estimate how much time will be required to serve user requests, based on the quota and resource allocation in the adviser workflow.

Example:

We know we can serve 5 requests in parallel in the backend namespace. Users scheduled 10 advisers.

If 15 minutes are allocated per advise request in the cluster, the first 5 advisers finish in 15 minutes and the other 5 advisers finish in 30 minutes (15+15). With the system in this state, a possible 11th request coming to the system will be satisfied in 45 minutes (but it can be satisfied sooner, or also later - see below) - this is the $SUBJ metric.

As we also run Kebechet in the backend namespace, things might get complicated if the namespace is polluted with Kebechet pods. But having that estimate (and possibly improving it) can still be valuable, so we see how the system behaves and what resource allocation we need to sanely satisfy the userbase we have (estimating SLA).
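
A minimal sketch of the queueing arithmetic in this example, assuming requests are served in fixed batches of parallel slots; the function and constants below are illustrative, not part of Thoth:

```python
import math

PARALLEL_SLOTS = 5    # advise requests the backend namespace can run at once
TIME_LIMIT_MIN = 15   # upper bound allocated per advise request (minutes)

def estimated_wait_minutes(position_in_queue: int) -> int:
    """Worst-case wait for a request entering the queue at the given position."""
    # Requests are served in batches of PARALLEL_SLOTS; each batch takes at
    # most TIME_LIMIT_MIN minutes.
    batches = math.ceil(position_in_queue / PARALLEL_SLOTS)
    return batches * TIME_LIMIT_MIN

# With 10 advisers already scheduled, an 11th request lands in the third
# batch: 3 * 15 = 45 minutes.
print(estimated_wait_minutes(11))  # -> 45
```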

@goern
Member

goern commented Jun 30, 2021

@pacospace is this good to go or do we still need information? If you are happy feel free to change prio etc...

/sig observability

@sesheta sesheta added the sig/observability Categorizes an issue or PR as relevant to SIG Observability label Jun 30, 2021
@pacospace pacospace added triage/needs-information Indicates an issue needs more information in order to work on it. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Jun 30, 2021
@goern goern added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Jul 7, 2021
@sesheta
Member

sesheta commented Aug 6, 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@sesheta sesheta closed this as completed Aug 6, 2021
@sesheta
Member

sesheta commented Aug 6, 2021

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fridex
Contributor Author

fridex commented Aug 9, 2021

/reopen
/remove-lifecycle rotten

@sesheta
Member

sesheta commented Aug 9, 2021

@fridex: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sesheta sesheta reopened this Aug 9, 2021
@pacospace pacospace added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 20, 2021
@sesheta
Member

sesheta commented Sep 19, 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@sesheta sesheta closed this as completed Sep 19, 2021
@sesheta
Member

sesheta commented Sep 19, 2021

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pacospace pacospace reopened this Sep 20, 2021
@goern
Member

goern commented Oct 4, 2021

@pacospace could this be another data-driven development topic?

@pacospace
Contributor

@pacospace could this be another data-driven development topic?

Sure, sounds good!

@goern
Member

goern commented Oct 27, 2021

/project observability

@goern
Member

goern commented Oct 27, 2021

/sig observability

@pacospace
Contributor

pacospace commented Nov 8, 2021

Isn't workflow task latency something that gives an estimate already? We know on average the percentage of workflows successful in a certain duration bucket, from the best case (< 5s) to the worst case (> 900s).

If I understand this metric correctly, it is more about putting tasks in a workflow into buckets, so we have information about tasks and their duration.

Could this issue be made more detailed depending on the recommendation type, number of packages, etc.? wdyt?

It might be worth keeping this simple - even a request with one direct dependency can result in a huge resolved software stack. For all the recommendation types we allocate a maximum amount of CPU time per request in the cluster - this upper boundary applies to all the recommendation types; only the latest recommendation type can (but does not necessarily have to) finish sooner. This upper boundary can be used to estimate how much time will be required to serve user requests, based on the quota and resource allocation in the adviser workflow.

Example:

We know we can serve 5 requests in parallel in the backend namespace. Users scheduled 10 advisers.

If 15 minutes are allocated per advise request in the cluster, the first 5 advisers finish in 15 minutes and the other 5 advisers finish in 30 minutes (15+15). With the system in this state, a possible 11th request coming to the system will be satisfied in 45 minutes (but it can be satisfied sooner, or also later - see below) - this is the $SUBJ metric.

As we also run Kebechet in the backend namespace, things might get complicated if the namespace is polluted with Kebechet pods. But having that estimate (and possibly improving it) can still be valuable, so we see how the system behaves and what resource allocation we need to sanely satisfy the userbase we have (estimating SLA).

What about:

n_p = number of parallel workflows that can run in the namespace (backend)

Workflows running in the backend namespace:

n_a = number of adviser workflows running
n_k = number of Kebechet workflows running
n_pc = number of provenance-checker workflows running

n_p = n_a + n_k + n_pc

tav_a = average time an adviser workflow runs
tav_k = average time a Kebechet workflow runs
tav_pc = average time a provenance-checker workflow runs

t_wait_time_advise = tav_a x n_a + tav_k x n_k + tav_pc x n_pc
                   = tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)

All those metrics are already available in Prometheus, so we can estimate that.
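
A minimal sketch of this estimate, assuming the running counts and average durations have already been queried from Prometheus; the function and parameter names below are illustrative placeholders, not actual metric names:

```python
def estimate_advise_wait_seconds(
    n_a: int,       # running adviser workflows
    n_k: int,       # running Kebechet workflows
    n_pc: int,      # running provenance-checker workflows
    tav_a: float,   # average adviser run time (seconds)
    tav_k: float,   # average Kebechet run time (seconds)
    tav_pc: float,  # average provenance-checker run time (seconds)
) -> float:
    # t_wait_time_advise = tav_a * n_a + tav_k * n_k + tav_pc * n_pc
    return tav_a * n_a + tav_k * n_k + tav_pc * n_pc

# Example: 3 advisers, 1 Kebechet, 1 provenance checker currently running.
print(estimate_advise_wait_seconds(3, 1, 1, 900.0, 300.0, 120.0))  # -> 3120.0
```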

@fridex
Contributor Author

fridex commented Nov 8, 2021

Sounds good. What about also counting requests that are queued?

@pacospace
Contributor

Sounds good. What about also counting requests that are queued?

I have to check how to get that number from Kafka, but in theory we can do that, yes! And do we want to provide this information at the user-API level?

@pacospace
Contributor

pacospace commented Nov 8, 2021

t_wait_time_advise = tav_a x n_a + tav_k x n_k + tav_pc x n_pc = tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)

All those metrics are already available in Prometheus, so we can estimate that.

Based on @fridex's suggestion:

t_wait_time_advise = tav_a x kafka_adviser_requests_queued + tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)

kafka_adviser_requests_queued = number of adviser message requests queued in Kafka.
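
A sketch extending the previous one with the queued Kafka messages; kafka_adviser_requests_queued is assumed to be obtained elsewhere (e.g., from Strimzi metrics), and the function is illustrative only:

```python
def estimate_advise_wait_with_queue_seconds(
    kafka_adviser_requests_queued: int,  # adviser messages waiting in Kafka
    n_a: int, n_k: int, n_pc: int,       # running workflow counts, as before
    tav_a: float, tav_k: float, tav_pc: float,  # average run times (seconds)
) -> float:
    # Time attributed to the workflows already running, plus each queued
    # adviser message expected to take an average adviser run.
    running = tav_a * n_a + tav_k * n_k + tav_pc * n_pc
    queued = tav_a * kafka_adviser_requests_queued
    return queued + running
```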

@pacospace pacospace self-assigned this Nov 8, 2021
@pacospace
Contributor

pacospace commented Nov 9, 2021

t_wait_time_advise = tav_a x n_a + tav_k x n_k + tav_pc x n_pc = tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)
All those metrics are already available in Prometheus, so we can estimate that.

Based on @fridex's suggestion:

t_wait_time_advise = tav_a x kafka_adviser_requests_queued + tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)

kafka_adviser_requests_queued = number of adviser message requests queued in Kafka.

Based on a conversation with @KPostOffice, we should consider:
kafka_adviser_requests_queued = _get_current_offset_from_strimzi_metrics() - _get_investigator_consumer_offset(), using metrics from Strimzi. Moreover, @KPostOffice pointed out that it is important to take partitions into account:

(current_offset_p1 - consumer_offset_p1) + (current_offset_p2 - consumer_offset_p2) + ... + (current_offset_pN - consumer_offset_pN)
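
A minimal sketch of the per-partition lag sum above; the two offset maps are hypothetical stand-ins for whatever exposes the Strimzi log-end offsets and the investigator consumer-group offsets:

```python
from typing import Dict

def kafka_adviser_requests_queued(
    current_offsets: Dict[int, int],   # partition -> latest (log-end) offset
    consumer_offsets: Dict[int, int],  # partition -> investigator consumer offset
) -> int:
    """Total lag: sum over partitions of (current offset - consumer offset)."""
    return sum(
        current - consumer_offsets.get(partition, 0)
        for partition, current in current_offsets.items()
    )

# Example with two partitions: (120 - 115) + (98 - 90) = 13 queued messages.
print(kafka_adviser_requests_queued({0: 120, 1: 98}, {0: 115, 1: 90}))
```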

@fridex
Contributor Author

fridex commented Nov 9, 2021

Sounds interesting 👍🏻 It might be a good idea to discuss this at the tech talk.

@pacospace
Contributor

@harshad16, are Strimzi metrics collected by Prometheus in smaug and aws?

@pacospace pacospace removed their assignment Nov 22, 2021
@harshad16
Member

@pacospace sorry for missing your question here.
I would have to check on this; maybe we need to create a service monitor for it.

@goern goern added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Apr 26, 2022
@harshad16
Member

One way to solve this is to calculate the wait from the Kafka queue and the adviser workflows scheduled per hour.

Acceptance criteria

  • Calculate the wait time from the Kafka queue and scheduled adviser runs.
  • Place the metric in the metrics exporter, or suggest an alternative to persist this new metric (see the sketch after this list).
  • Create the panel in the [thoth user metric](https://github.com/thoth-station/thoth-application/blob/master/grafana-dashboard/base/thoth-service-metrics.json) dashboard.
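
A hedged sketch of the second acceptance criterion, exposing the estimate as a gauge; the use of prometheus_client, the metric name, and the placeholder helper below are all assumptions, not the actual exporter code:

```python
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical metric name; the real exporter may choose a different one.
advise_wait_time = Gauge(
    "thoth_estimated_advise_wait_time_seconds",
    "Estimated wait time before a new advise request is served.",
)

def estimate_wait_seconds() -> float:
    # Placeholder: a real implementation would combine the Kafka queue depth
    # and the adviser workflows scheduled per hour, per the criteria above.
    return 0.0

if __name__ == "__main__":
    start_http_server(8000)  # scrape endpoint for Prometheus
    while True:
        advise_wait_time.set(estimate_wait_seconds())
        time.sleep(60)
```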

Reference:

@harshad16 harshad16 changed the title Estimate wait time for advise requests [8pt] Estimate wait time for advise requests Sep 15, 2022
@harshad16
Copy link
Member

/triage accepted

@sesheta sesheta added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Sep 15, 2022
@harshad16 harshad16 moved this to Backlog in SIG-Observability Sep 22, 2022
@harshad16 harshad16 moved this from 📋 Backlog to 🔖 Ready in SIG-Observability Sep 22, 2022
@codificat codificat moved this to 📋 Backlog in Planning Board Sep 26, 2022