[5pt] Make sure ps-stacks can receive recommendation from Thoth #326

pacospace · 2021-09-01T12:47:03Z

Describe the bug
As a user of Thoth PS images,

I want continuous updates to the software stacks, maintained by Thoth services.

To Reproduce
Steps to reproduce the behavior:

Run thamos advise on all ps-stacks
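
A minimal sketch of that step, assuming it is run from a checkout whose .thoth.yaml defines the listed overlays as runtime environments. The overlay names are taken from the ps-nlp status table later in this thread; the --runtime-environment and --recommendation-type flags are an assumption to verify against thamos advise --help. The helper only builds the commands and does not execute them.

```python
import shlex

# Overlay names as they appear in the ps-nlp status table in this thread.
OVERLAYS = [
    "ps-nlp",
    "ps-nlp-tensorflow",
    "ps-nlp-tensorflow-gpu",
    "ps-nlp-pytorch",
]


def advise_command(overlay, recommendation_type="latest"):
    """Build the argv for one advise request.

    The flag names are assumptions; check them against `thamos advise --help`.
    """
    return [
        "thamos",
        "advise",
        "--runtime-environment",
        overlay,
        "--recommendation-type",
        recommendation_type,
    ]


def all_commands(recommendation_type="latest"):
    """Return one shell-quoted command line per overlay."""
    return [
        shlex.join(advise_command(overlay, recommendation_type))
        for overlay in OVERLAYS
    ]


for line in all_commands():
    print(line)
```

Each printed line can be run by hand or handed to subprocess.run once the flags are confirmed.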

Expected behavior
All ps-* stacks can be advised by Thoth (all integration tests are green for ps-stacks: thoth-station/integration-tests#204)

Screenshots

Additional context
ps-*:

goern · 2021-09-15T09:57:29Z

/priority important-soon
/assign @codificat
/triage accepted

goern · 2022-01-14T13:48:13Z

any update on this?

sesheta · 2022-04-14T16:12:09Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

codificat · 2022-04-26T11:20:11Z

/remove-lifecycle stale

codificat · 2022-05-02T11:58:45Z

In the last integration test runs for aws-prod there are errors in some of the ps-* tests: ps-cv-{pytorch,tensorflow} and ps-nlp-tensorflow, due to timeouts:

2022-05-02 03:41:12,899 thoth.adviser.run           ERROR: Child exited with exit code 10
2022-05-02 03:25:01,696 thoth.adviser.run           ERROR: Resolver was killed as allocated CPU time was exceeded - https://thoth-station.ninja/j/cpu_time_exceeded

Other related integration tests succeeded.

In the last run of integration tests for smaug-prod, ps-* tests failed with HTTP 400 codes (bad request), e.g.

Then I ask for an advise for the cloned application for runtime environment ps-nlp-pytorch , without user stack supplied and without static analysis (52.758s) 
Error Message

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.8/site-packages/behave/model.py", line 1329, in run
    match.run(runner.context)
  File "/opt/app-root/lib64/python3.8/site-packages/behave/matchers.py", line 98, in run
    self.func(context, *args, **kwargs)
  File "features/steps/advise.py", line 248, in step_impl
    results = advise_using_config(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 397, in advise_using_config
    return advise(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 118, in wrapper
    result = func(api_client, *args, **kwargs)
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 583, in advise
    response = _retrieve_analysis_result(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 276, in _retrieve_analysis_result
    return retrieve_func(analysis_id)
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/thoth/advise_api.py", line 53, in get_advise_python
    (data) = self.get_advise_python_with_http_info(analysis_id, **kwargs)  # noqa: E501
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/thoth/advise_api.py", line 112, in get_advise_python_with_http_info
    return self.api_client.call_api(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 316, in call_api
    return self.__call_api(resource_path, method,
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 148, in __call_api
    response_data = self.request(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 338, in request
    return self.rest_client.GET(url,
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/rest.py", line 228, in GET
    return self.request("GET", url,
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
thamos.swagger_client.rest.ApiException: (400)
Reason: BAD REQUEST
HTTP response headers: HTTPHeaderDict({'server': 'gunicorn', 'date': 'Thu, 28 Apr 2022 01:05:55 GMT', 'content-type': 'application/json', 'content-length': '272', 'x-thoth-version': '0.34.14', 'x-user-api-service-version': '0.34.14+messaging.0.16.0.storages.0.71.1.common.0.36.0.python.0.16.9', 'x-thoth-search-ui-url': 'https://thoth-station.ninja/search/', 'access-control-allow-origin': '*', 'set-cookie': '829f3dbab311aaac0d90f580d731991c=d36e665b294c43e30415dbb1b2323809; path=/; HttpOnly; Secure; SameSite=None'})
HTTP response body: b'{\n  "error": "Analysis was not successful",\n  "parameters": {\n    "analysis_id": "adviser-220428010502-f22f7444ce59c173"\n  },\n  "status": {\n    "finished_at": "2022-04-28T01:05:48Z",\n    "reason": null,\n    "started_at": "2022-04-28T01:05:03Z",\n    "state": "error"\n  }\n}\n'
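
The response body above is JSON, so the failing analysis can be identified without reading the whole traceback. A small sketch that decodes the body exactly as it was logged; the summarize_failure helper is hypothetical and not part of thamos.

```python
import json
from datetime import datetime

# Response body copied from the ApiException above.
BODY = (
    b'{\n  "error": "Analysis was not successful",\n'
    b'  "parameters": {\n    "analysis_id": "adviser-220428010502-f22f7444ce59c173"\n  },\n'
    b'  "status": {\n    "finished_at": "2022-04-28T01:05:48Z",\n    "reason": null,\n'
    b'    "started_at": "2022-04-28T01:05:03Z",\n    "state": "error"\n  }\n}\n'
)

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def summarize_failure(body):
    """Extract the analysis id, final state and run time from an error body."""
    payload = json.loads(body)
    status = payload["status"]
    started = datetime.strptime(status["started_at"], TIMESTAMP_FORMAT)
    finished = datetime.strptime(status["finished_at"], TIMESTAMP_FORMAT)
    return {
        "analysis_id": payload["parameters"]["analysis_id"],
        "error": payload["error"],
        "state": status["state"],
        "reason": status["reason"],
        "duration_seconds": (finished - started).total_seconds(),
    }


summary = summarize_failure(BODY)
print(summary["analysis_id"], summary["state"], summary["duration_seconds"])
```

For this body the analysis ended in state "error" with no reason recorded, which suggests the 400 reports a failed adviser run rather than a malformed request.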

codificat · 2022-05-03T10:27:14Z

/milestone OKR review Q2 2022
/sig user-experience

codificat · 2022-05-03T15:11:57Z

/remove-sig user-experience
/sig stack-guidance

because there are issues resolving the stacks here

fridex · 2022-05-03T17:04:01Z

The last integration-tests report (Integration tests update for ocp4-stage (2022-05-03 version 0.11.2)) has the following scenarios failing:

ps-nlp-tensorflow
ps-nlp-pytorch
ps-cv-tensorflow

All of them use the latest recommendation type. In these cases, the predictor used by the adviser implementation uses "hops": if the latest versions alone cannot be resolved, it randomly takes some path in the resolution process. This implementation might not be ideal for these cases, and it may be better to provide one that uses backtracking (similar to pip, but offline, using the dependency information from the database - see thoth-station/adviser#2329).
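
To illustrate what backtracking over offline dependency data means, here is a toy sketch. It is not the adviser implementation and not pip's resolver; the package names, versions and the DATABASE layout are all hypothetical. Versions are tried newest first, and when a choice leads to a conflict the search returns to the previous package and tries its next version.

```python
# Hypothetical offline dependency data:
# package -> version -> {dependency: set of acceptable versions}.
# Versions are listed newest first so the search prefers the latest release.
DATABASE = {
    "app-lib": {"2.0": {"core": {"1.0"}}, "1.0": {"core": {"1.0", "2.0"}}},
    "plugin": {"3.0": {"core": {"2.0"}}},
    "core": {"2.0": {}, "1.0": {}},
}


def resolve(todo, pinned=None, allowed=None):
    """Return {package: version} satisfying every constraint, or None."""
    pinned = pinned or {}
    allowed = allowed or {}
    if not todo:
        return pinned
    package, rest = todo[0], todo[1:]
    if package in pinned:
        return resolve(rest, pinned, allowed)
    for version, deps in DATABASE[package].items():
        if package in allowed and version not in allowed[package]:
            continue  # ruled out by an earlier choice
        narrowed = dict(allowed)
        compatible = True
        for dep, versions in deps.items():
            merged = versions & narrowed.get(dep, versions)
            if not merged or (dep in pinned and pinned[dep] not in merged):
                compatible = False
                break
            narrowed[dep] = merged
        if not compatible:
            continue
        result = resolve(
            rest + list(deps), {**pinned, package: version}, narrowed
        )
        if result is not None:
            return result
        # Otherwise fall through and try the next (older) version: backtracking.
    return None


print(resolve(["app-lib", "plugin"]))
```

In this data the newest app-lib conflicts with plugin over core, so the search backs off to app-lib 1.0 instead of giving up, which is the behaviour the paragraph above contrasts with random hops.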

These issues are also consistent with the solving error described in thoth-station/integration-tests#266 (comment). Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard, so it fails to obtain dependency information (this was observed for some versions). This behaviour is not very nice, but Python packaging supports it. It fits the first paragraph: adviser might be failing to find suitable versions when the latest recommendation type is used.

To introspect what is happening here, we might:

try to remove jupyter-tensorboard from the requirements and ask for advice using the latest recommendation type
try to run adviser with a different recommendation type, such as stable, which uses a resolution algorithm based on reinforcement learning, and see if it finds a resolution
try to manually pin the version of jupyter-tensorboard (an older version that is solvable by thoth-solver) and see if the resolution process finds a solution even for the latest recommendation type

Also, we can try using user stack scoring and see how the resolver behaves with specific versions of libraries, to narrow down which package causes the issue.

fridex · 2022-05-03T18:07:27Z

Tested with stable recommendation type:

ps-nlp-tensorflow succeeded - see results
ps-nlp-pytorch failed - see results
ps-cv-tensorflow failed - see results

fridex · 2022-05-03T18:19:20Z

Tested with latest recommendation type without jupyter-tensorboard package in the stack:

ps-nlp-tensorflow succeeded - see results
ps-nlp-pytorch succeeded - see results
ps-cv-tensorflow succeeded - see results

fridex · 2022-05-03T18:30:50Z

Tested with latest recommendation type and jupyter-tensorboard==0.1.1 (solvable using our solver):

ps-nlp-tensorflow succeeded - see results
ps-nlp-pytorch succeeded - see results
ps-cv-tensorflow succeeded - see results

fridex · 2022-05-03T18:35:11Z

Possible fixes:

use jupyter-tensorboard==0.1.1 in all the stacks that use it
remove jupyter-tensorboard (if it is not used)
contact jupyter-tensorboard upstream for a possible fix - so that it does not have hard requirements on packages to be present in the environment during installation
patch jupyter-tensorboard ourselves and host a patched version on our Pulp Python Package Index
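
The first two options amount to a one-line change in each stack's direct requirements. A hedged sketch, assuming the requirements are available as requirements.txt-style lines (the actual file format used by the ps-* repositories is not shown in this thread, and the example lines are hypothetical); pin_requirement is an illustrative helper only.

```python
import re


def pin_requirement(lines, package, version=None):
    """Pin `package` to `version`, or drop it entirely when version is None."""
    # Match the bare package name, optionally followed by a version
    # specifier, extras or an environment marker.
    pattern = re.compile(
        rf"^\s*{re.escape(package)}\s*(?:[=<>!~;\[].*)?$", re.IGNORECASE
    )
    result = []
    for line in lines:
        if pattern.match(line):
            if version is not None:
                result.append(f"{package}=={version}")
            continue
        result.append(line)
    return result


# Hypothetical direct requirements of one stack.
requirements = ["jupyterlab", "jupyter-tensorboard", "tensorflow>=2.0"]

print(pin_requirement(requirements, "jupyter-tensorboard", "0.1.1"))
print(pin_requirement(requirements, "jupyter-tensorboard"))
```

The first call corresponds to pinning jupyter-tensorboard==0.1.1, the second to removing the package.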

harshad16 · 2022-05-03T19:07:04Z

Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard so it fails to obtain dependency information (it was observed for some versions). This behaviour is not very nice, but Python packaging supports it. This can support the first paragraph stated as adviser might be failing to find suitable versions when latest recommendation type is used.

This means our solvers are not able to solve jupyter-tensorboard or other packages with such requirements, right?
Is that the reason we are pinning the jupyter-tensorboard to 0.1.1, or we are pinning it because thoth advice suggested it?

fridex · 2022-05-03T19:21:12Z

Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard so it fails to obtain dependency information (it was observed for some versions). This behaviour is not very nice, but Python packaging supports it. This can support the first paragraph stated as adviser might be failing to find suitable versions when latest recommendation type is used.

This means our solvers are not able to solve jupyter-tensorboard or other packages with such requirements, right?

Generally, no - we are not able to solve libraries that have hard requirements on environment that are not met in our solvers. Ideally, jupyter-tensorboard should not depend on the environment and execute code during the installation process - at least not make it a hard requirement (if it fails, the installed package can still be present).

This might get better over time as python packaging evolves (and provides static wheel metadata).

Is that the reason we are pinning the jupyter-tensorboard to 0.1.1, or we are pinning it because thoth advice suggested it?

The stack info provided to the user lists the versions that were removed:

"The following versions of 'jupyter-tensorboard' from 'https://pypi.org/simple' were removed due to installation issues in the target environment: 0.2.0, 0.1.10, 0.1.9, 0.1.8, 0.1.7, 0.1.6, 0.1.5, 0.1.4, 0.1.4.dev0, 0.1.3, 0.1.3.dev0, 0.1.2, 0.1.2.dev1, 0.1.2.dev0"

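The removed versions can be pulled out of that message programmatically, for example to check which versions remain as candidates. A small sketch: the regular expression is derived only from the single message quoted above, so treating it as the general message format is an assumption.

```python
import re

# Stack info message copied from the adviser output quoted above.
MESSAGE = (
    "The following versions of 'jupyter-tensorboard' from "
    "'https://pypi.org/simple' were removed due to installation issues in "
    "the target environment: 0.2.0, 0.1.10, 0.1.9, 0.1.8, 0.1.7, 0.1.6, "
    "0.1.5, 0.1.4, 0.1.4.dev0, 0.1.3, 0.1.3.dev0, 0.1.2, 0.1.2.dev1, "
    "0.1.2.dev0"
)

PATTERN = re.compile(
    r"The following versions of '(?P<package>[^']+)' from '(?P<index>[^']+)' "
    r"were removed .*?: (?P<versions>.+)$"
)


def parse_removed(message):
    """Return (package, index_url, [versions]) or None if the text differs."""
    match = PATTERN.search(message)
    if match is None:
        return None
    versions = [item.strip() for item in match.group("versions").split(",")]
    return match.group("package"), match.group("index"), versions


package, index, removed = parse_removed(MESSAGE)
print(package, index, len(removed), "0.1.1" in removed)
```

Version 0.1.1 is not in the removed list, which matches the pin used in the earlier tests.
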
Thoth also suggested to use it, for example in the first successful resolution with stable recommendation type:

ps-nlp-tensorflow succeeded - see results

fridex · 2022-05-03T19:23:18Z

Thoth also suggested to use it, for example in the first successful resolution with stable recommendation type:

ps-nlp-tensorflow succeeded - see results

And for others, it looks like it failed as it did not find any resolution in the allocated time.

harshad16 · 2022-05-03T20:14:17Z

ack, thanks for the explanation.

codificat · 2022-05-06T10:24:39Z

/remove-label human_intervention_required

sesheta · 2022-05-06T10:24:40Z

@codificat: The label(s) /remove-label human_intervention_required cannot be applied. These labels are supported: community/discussion, community/group-programming, community/maintenance, community/question, deployment_name/ocp4-stage, deployment_name/ocp4-test, deployment_name/moc-prod, hacktoberfest, hacktoberfest-accepted, kind/cleanup, kind/demo, kind/deprecation, kind/documentation, kind/question, sig/advisor, sig/build, sig/cyborgs, sig/devops, sig/documentation, sig/indicators, sig/investigator, sig/knowledge-graph, sig/slo, sig/solvers, thoth/group-programming, thoth/human-intervention-required, thoth/potential-observation, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, triage/accepted, triage/duplicate, triage/needs-information, triage/not-reproducible, triage/unresolved, lifecycle/submission-accepted, lifecycle/submission-rejected


codificat · 2022-06-13T15:53:47Z

The integration tests in stage are suffering from cluster issues that have been going on for a while and are expected to take some more time to fix.

Meanwhile, though, a current test of all the overlays using the production environment provided successful advice with the latest recommendation type (what is currently configured in .thoth.yaml). Other recommendation types had a few failures.

The recommendations that are failing fail with the following message:

Resolver did not find any stack that would satisfy requirements and stack characteristics given the time allocated - see https://thoth-station.ninja/j/no_stack

Below is the current status with each stack.

ps-nlp

overlay	type	result	advise ID	time
ps-nlp	latest	success	adviser-220613143911-3483a20bdb243903	49s
ps-nlp-tensorflow	latest	success	adviser-220613144121-b52e7ffa560ce6c6	1m 47s
ps-nlp-tensorflow-gpu	latest	success	adviser-220613144345-97b1d4046d403a80	2m 4s
ps-nlp-pytorch	latest	success	adviser-220613115118-b406e285a0ae8618	2m 2s
ps-nlp	stable	success	adviser-220614064546-327778099d10c008	34m 35s
ps-nlp-tensorflow	stable	failure	adviser-220614075651-ef920679375c9a8f	26m 7s
ps-nlp-tensorflow-gpu	stable	success	adviser-220614160559-2f1a80945ee22db5	26m 30s
ps-nlp-pytorch	stable	success	adviser-220614165324-519f4729bc0fbb77	26m 30s
ps-nlp	security	success	adviser-220614110005-7b9c92d2284d37fc	16m 40s
ps-nlp-tensorflow	security	success	adviser-220614084155-3ca5965c25ece6d8	19m 33s
ps-nlp-tensorflow-gpu	security	success	adviser-220614172749-ea72fedac3595022	17m 55s
ps-nlp-pytorch	security	success	adviser-220614120043-6e7ea342826ad597	25m 23s
ps-nlp	performance	success	adviser-220614072310-4fe6535e416419a8	26m 22s
ps-nlp-tensorflow	performance	success	adviser-220614134333-d76be662fa33319a	26m 29s
ps-nlp-tensorflow-gpu	performance	failure	adviser-220614153715-1fc19007f995c727	26m 9s
ps-nlp-pytorch	performance	failure	adviser-220614150618-eba5b4e3183fd0f2	26m 14s

ps-cv

overlay	type	result	advise ID	time
ps-cv-ocr	latest	success	adviser-220613144932-7c569b4d4585fd54	22s
ps-cv-tensorflow	latest	success	adviser-220613145241-6d5d7bdc27ac3a3a	1m 18s
ps-cv-pytorch	latest	success	adviser-220613145031-fcb0a951d8adb577	1m 44s
ps-cv-ocr	stable	success	adviser-220613184700-1f90f53ccf4159b1	2m 24s
ps-cv-tensorflow	stable	failure	adviser-220613162944-6d3bf6b86a373e6d	22m 49s
ps-cv-pytorch	stable	failure	adviser-220613180701-d348b4ef9c9b3e87	26m 17s
ps-cv-ocr	performance	success	adviser-220613185358-fb8309ed55dd32d9	2m 7s
ps-cv-tensorflow	performance	failure	adviser-220613210546-7078358b1944fd96	26m 8s
ps-cv-pytorch	performance	failure	adviser-220613192843-85146a8c21afc8ee	27m 3s
ps-cv-ocr	security	success	adviser-220614183752-2991f946845d4af	27s
ps-cv-tensorflow	security	failure	adviser-220614180129-becf1431eab85efe	27m 46s
ps-cv-pytorch	security	failure	adviser-220614183855-da15ed8545b4868	26m 13s

ps-ip

overlay	type	result	advise ID	time
ps-ip-ifd	latest	success	adviser-220613145447-b8f73428af85d2bc	31s
ps-ip-ifd	stable	success	adviser-220613160047-b5ae72918b1150b7	20m 56s
ps-ip-ifd	performance	success	adviser-220613185732-2c9fc59a216df36a	23m 17s
ps-ip-ifd	security	success	adviser-220614174631-d062f02ad3a261bf	54s
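
To compare recommendation types at a glance, the rows can be tallied per type. The sketch below uses the ps-cv rows from the table above (advise IDs omitted); the helper names are illustrative.

```python
import re
from collections import defaultdict

# (overlay, recommendation type, result, time) copied from the ps-cv table.
RESULTS = [
    ("ps-cv-ocr", "latest", "success", "22s"),
    ("ps-cv-tensorflow", "latest", "success", "1m 18s"),
    ("ps-cv-pytorch", "latest", "success", "1m 44s"),
    ("ps-cv-ocr", "stable", "success", "2m 24s"),
    ("ps-cv-tensorflow", "stable", "failure", "22m 49s"),
    ("ps-cv-pytorch", "stable", "failure", "26m 17s"),
    ("ps-cv-ocr", "performance", "success", "2m 7s"),
    ("ps-cv-tensorflow", "performance", "failure", "26m 8s"),
    ("ps-cv-pytorch", "performance", "failure", "27m 3s"),
    ("ps-cv-ocr", "security", "success", "27s"),
    ("ps-cv-tensorflow", "security", "failure", "27m 46s"),
    ("ps-cv-pytorch", "security", "failure", "26m 13s"),
]


def to_seconds(text):
    """Convert a duration such as '22m 49s' or '27s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m\s*)?(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


def summarize(results):
    """Count successes and failures and track the slowest run per type."""
    summary = defaultdict(lambda: {"success": 0, "failure": 0, "slowest": 0})
    for _overlay, rec_type, result, duration in results:
        entry = summary[rec_type]
        entry[result] += 1
        entry["slowest"] = max(entry["slowest"], to_seconds(duration))
    return dict(summary)


for rec_type, entry in summarize(RESULTS).items():
    print(rec_type, entry)
```

For ps-cv, every latest run succeeds, while each of the other three types has one success and two failures.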

codificat · 2022-06-21T13:52:09Z

Meanwhile, though, a current test of all the overlays using the production environment provided successful advice with the latest recommendation type (what is currently configured in .thoth.yaml).

Based on this, I believe we can
/close
this one as complete.

We still need to ensure that integration tests, that include checks for successful advices on the predictable stacks, run successfully (e.g. thoth-station/integration-tests#324), and possibly review the justification related to the failures on some combination of stack/type.

These are tracked in separate issues as appropriate.

sesheta · 2022-06-21T13:52:11Z

@codificat: Closing this issue.


pacospace added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/...` label and requires one. human_intervention_required and removed needs-triage Indicates an issue or PR lacks a `triage/...` label and requires one. labels Sep 1, 2021

sesheta assigned codificat Sep 15, 2021

sesheta added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Sep 15, 2021

goern changed the title ~~Make sure ps-stacks can receive reccomendation from Thoth~~ Make sure ps-stacks can receive recommendation from Thoth Oct 1, 2021

goern mentioned this issue Oct 1, 2021

Automate update of dependencies + release of image for ps-* repos. #325

Open

2 tasks

sesheta added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2022

sesheta removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2022

sesheta added this to the OKR review Q2 2022 milestone May 3, 2022

sesheta added the sig/user-experience Issues or PRs related to the User Experience of our Services, Tools, and Libraries. label May 3, 2022

sesheta added sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance. and removed sig/user-experience Issues or PRs related to the User Experience of our Services, Tools, and Libraries. labels May 3, 2022

fridex changed the title ~~Make sure ps-stacks can receive recommendation from Thoth~~ [5pt] Make sure ps-stacks can receive recommendation from Thoth May 3, 2022

This was referenced May 3, 2022

Use jupyter-tensorboard==0.1.1 thoth-station/ps-nlp#152

Merged

Use jupyter-tensorboard==0.1.1 thoth-station/ps-cv#29

Merged

goern added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels May 5, 2022

codificat mentioned this issue May 5, 2022

[3pt][SPIKE] verify kebechet's advise manager is working on all overlays thoth-station/ps-nlp#151

Closed

2 tasks

sesheta closed this as completed Jun 21, 2022
