[8pt] Estimate wait time for advise requests #727
Comments
Isn't workflow task latency something that gives an estimation already? We know, on average, the percentage of workflows successful in a certain duration bucket, from the best case (< 5s) to the worst case (> 900s). This issue can be more detailed depending on the recommendation, number of packages, etc. wdyt?
If I understand this metric, it is more about putting tasks in a workflow into buckets so we have information about tasks and their duration.
It might be worth keeping this simple - even a request with one direct dependency can result in a huge resolved software stack. For all the recommendation types we assign a maximum amount of CPU time that is allocated per request in the cluster - this is an upper boundary applied for all the recommendation types.

Example: We know we can serve 5 requests in parallel in the backend namespace and users scheduled 10 advisers. If there are 15 minutes allocated per advise request in the cluster, the first 5 advisers finish in 15 minutes and the other 5 advisers finish in 30 minutes (15+15). With the system in this state, a possible 11th request coming to the system will be satisfied in 45 minutes (it can be satisfied sooner, but also later - see below) - this is the $SUBJ metric.

As we also run kebechet in the backend namespace, things might get complicated if the namespace is polluted with kebechet pods. But having that estimation (and possibly improving it) can still be something valuable so we see how the system behaves and what resource allocation we need to sanely satisfy the userbase we have (estimating SLA).
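A minimal sketch of that back-of-the-envelope estimate, assuming a fixed number of parallel slots and a fixed CPU-time allocation per request; the function name and the 5-slot/15-minute numbers below are illustrative only:

```python
def estimate_wait_minutes(requests_ahead: int, parallel_slots: int,
                          minutes_per_request: float) -> float:
    """Upper-bound time until a newly submitted request is satisfied.

    Assumes every request uses its full CPU-time allocation and that the
    backend namespace is not shared with other workloads (e.g. kebechet).
    """
    # Waves of parallel execution the new request has to wait through,
    # plus the wave in which it runs itself.
    waves = requests_ahead // parallel_slots + 1
    return waves * minutes_per_request


# 10 advisers already scheduled, 5 parallel slots, 15 minutes per request:
# the 11th request is satisfied in at most 3 waves, i.e. 45 minutes.
print(estimate_wait_minutes(10, 5, 15))  # 45.0
```

In practice the slot count and per-request allocation would come from the cluster configuration, and the result is an upper bound rather than a guarantee.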
@pacospace is this good to go or do we still need information? If you are happy feel free to change prio etc... /sig observability
Rotten issues close after 30d of inactivity. /close
@sesheta: Closing this issue.
/reopen
@fridex: Reopened this issue.
Rotten issues close after 30d of inactivity. /close
@sesheta: Closing this issue.
@pacospace could this be another data-driven development topic?
Sure, sounds good!
/project observability
/sig observability
What about:

n_p = number of parallel workflows that can run in the namespace (backend)
n_a = number of adviser workflows running in backend
n_k = number of kebechet workflows running in backend
n_pc = number of provenance-checker workflows running in backend
n_p = n_a + n_k + n_pc
tav_a = average time an adviser workflow runs (analogously tav_k and tav_pc for kebechet and provenance-checker workflows)
t_wait_time_advise = tav_a x n_a + tav_k x n_k + tav_pc x n_pc

All those metrics are already available in Prometheus, so we can estimate that.
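For illustration, a sketch of the formula above, assuming the counts and average durations have already been scraped from Prometheus; the function and variable names just mirror the comment and are not existing metric names:

```python
def wait_time_advise(n_a: int, n_k: int, n_pc: int,
                     tav_a: float, tav_k: float, tav_pc: float) -> float:
    """Estimated wait time based on workflows currently running in backend.

    n_*   -- number of adviser / kebechet / provenance-checker workflows running
    tav_* -- average runtime of each workflow type, in seconds
    """
    return tav_a * n_a + tav_k * n_k + tav_pc * n_pc


# e.g. 3 advisers averaging 300s, 1 kebechet run at 120s, 1 provenance check at 60s
print(wait_time_advise(3, 1, 1, 300.0, 120.0, 60.0))  # 1080.0 seconds
```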
Sounds good. What about also counting requests that are queued?
I have to check how to get that number from Kafka, but in theory we can do that, yes! And do we want to provide this information at the user-API level?
Based on @fridex's suggestion:

t_wait_time_advise = tav_a x kafka_adviser_requests_queued + (tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a))

where kafka_adviser_requests_queued = number of adviser message requests queued in Kafka.
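A sketch of this refined formula, under the same assumptions plus the Kafka queue depth; again, all names are placeholders rather than existing metrics:

```python
def wait_time_advise_with_queue(n_p: int, n_a: int, n_k: int, n_pc: int,
                                tav_a: float, tav_k: float, tav_pc: float,
                                kafka_adviser_requests_queued: int) -> float:
    """Estimated wait time including adviser requests still queued in Kafka."""
    running_part = (
        tav_a * (n_p - n_k - n_pc)    # adviser workflows currently running
        + tav_k * (n_p - n_a - n_pc)  # kebechet workflows currently running
        + tav_pc * (n_p - n_k - n_a)  # provenance-checker workflows currently running
    )
    return tav_a * kafka_adviser_requests_queued + running_part
```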
Based on a conversation with @KPostOffice, we should consider:
Sounds interesting 👍🏻 It might be a good idea to discuss this at the tech talk.
@harshad16, are Strimzi metrics collected by Prometheus in smaug and AWS?
@pacospace sorry for missing your question here.
One method to solve this is to calculate with:
Acceptance criteria:
/triage accepted |
Is your feature request related to a problem? Please describe.
As a Thoth user/operator, I would like to know how much time I need to wait to have a resolved software stack available from the recommender system. To support this, we could expose an estimated time for an advise request to finish. As we have information about the maximum time allocated for advisers and about the number of queued/pending/running advise requests, we can provide an estimate of the time needed to retrieve adviser results from the system.
Describe the solution you'd like
Provide a metric that shows the estimated wait time for the adviser to provide results. This can later be exposed on user-api and shown to users (e.g. in the thamos CLI).
The metric can be generalized for other jobs we run - package-extract, provenance-check, ...
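As one possible shape for the metric itself, a hedged sketch using the Python prometheus_client library; the metric name thoth_estimated_wait_time_seconds, the job_type label, and the hard-coded value are assumptions for illustration, not an existing Thoth metric:

```python
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical metric name and labels - not an existing Thoth metric.
estimated_wait_time = Gauge(
    "thoth_estimated_wait_time_seconds",
    "Estimated wait time until a newly submitted request is satisfied",
    ["job_type"],  # e.g. adviser, package-extract, provenance-check
)


def update_estimates() -> None:
    # In a real exporter this value would be computed from the queued/running
    # workflow counts and average durations discussed in the comments above.
    estimated_wait_time.labels(job_type="adviser").set(900.0)  # placeholder: 15 minutes


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    while True:
        update_estimates()
        time.sleep(30)  # refresh the estimate periodically
```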