[5pt] Make sure ps-stacks can receive recommendation from Thoth #326

Closed
pacospace opened this issue Sep 1, 2021 · 21 comments
Assignees
Labels
human_intervention_required kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@pacospace
Contributor

Describe the bug
As a user of Thoth PS images,

I want continuous updates of the software stacks, maintained by Thoth services.

To Reproduce
Steps to reproduce the behavior:

  1. Run thamos advise on all ps-stacks
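A minimal sketch of this step, assuming each ps-* overlay is registered as a runtime environment in `.thoth.yaml` and that the installed `thamos advise` accepts `--runtime-environment` and `--recommendation-type` (check `thamos advise --help`). The overlay names are the ones reported in the status tables of this issue; `build_advise_command` is a hypothetical helper that only builds the command line and does not run it.

```python
import shlex

# Overlay names as they appear in the status tables of this issue.
OVERLAYS = [
    "ps-nlp",
    "ps-nlp-tensorflow",
    "ps-nlp-tensorflow-gpu",
    "ps-nlp-pytorch",
    "ps-cv-ocr",
    "ps-cv-tensorflow",
    "ps-cv-pytorch",
    "ps-ip-ifd",
]


def build_advise_command(overlay, recommendation_type="latest"):
    """Return the argv for one advise request (the flag names are assumptions)."""
    return [
        "thamos",
        "advise",
        "--runtime-environment",
        overlay,
        "--recommendation-type",
        recommendation_type,
    ]


if __name__ == "__main__":
    # Print the commands; pass each argv to subprocess.run() to execute them.
    for overlay in OVERLAYS:
        print(shlex.join(build_advise_command(overlay)))
```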

Expected behavior
All ps-* stacks can be advised by Thoth (all integration tests are green for ps-stacks: thoth-station/integration-tests#204)

Screenshots

Additional context
ps-*:

@pacospace pacospace added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/...` label and requires one. human_intervention_required and removed needs-triage Indicates an issue or PR lacks a `triage/...` label and requires one. labels Sep 1, 2021
@goern
Member

goern commented Sep 15, 2021

/priority important-soon
/assign @codificat
/triage accepted

@sesheta sesheta added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Sep 15, 2021
@goern goern changed the title Make sure ps-stacks can receive reccomendation from Thoth Make sure ps-stacks can receive recommendation from Thoth Oct 1, 2021
@goern
Member

goern commented Jan 14, 2022

any update on this?

@sesheta
Member

sesheta commented Apr 14, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2022
@codificat
Member

/remove-lifecycle stale

@sesheta sesheta removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2022
@codificat
Member

codificat commented May 2, 2022

In the last integration test runs for aws-prod there are errors in some of the ps-* tests: ps-cv-{pytorch,tensorflow} and ps-nlp-tensorflow, due to timeouts:

2022-05-02 03:41:12,899 thoth.adviser.run           ERROR: Child exited with exit code 10
2022-05-02 03:25:01,696 thoth.adviser.run           ERROR: Resolver was killed as allocated CPU time was exceeded - https://thoth-station.ninja/j/cpu_time_exceeded

Other related integration tests succeeded.
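The two adviser log lines quoted above can be told apart mechanically when scanning many test runs. The sketch below assumes the log format is exactly what those two lines show (timestamp, logger name, level, message); the category names are hypothetical labels, not Thoth terminology.

```python
import re

# Matches lines shaped like the two adviser log lines quoted above.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<logger>\S+)\s+(?P<level>[A-Z]+): (?P<message>.*)$"
)


def classify(line):
    """Return a coarse failure category for one adviser log line."""
    match = LOG_LINE.match(line)
    if match is None:
        return "unparsed"
    message = match.group("message")
    if "CPU time was exceeded" in message:
        return "cpu_time_exceeded"
    exit_code = re.search(r"exit code (\d+)", message)
    if exit_code:
        return f"child_exit_{exit_code.group(1)}"
    return "other"
```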

In the last run of integration tests for smaug-prod, ps-* tests failed with HTTP 400 codes (bad request), e.g.

Then I ask for an advise for the cloned application for runtime environment ps-nlp-pytorch , without user stack supplied and without static analysis (52.758s) 
Error Message

Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.8/site-packages/behave/model.py", line 1329, in run
    match.run(runner.context)
  File "/opt/app-root/lib64/python3.8/site-packages/behave/matchers.py", line 98, in run
    self.func(context, *args, **kwargs)
  File "features/steps/advise.py", line 248, in step_impl
    results = advise_using_config(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 397, in advise_using_config
    return advise(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 118, in wrapper
    result = func(api_client, *args, **kwargs)
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 583, in advise
    response = _retrieve_analysis_result(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/lib.py", line 276, in _retrieve_analysis_result
    return retrieve_func(analysis_id)
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/thoth/advise_api.py", line 53, in get_advise_python
    (data) = self.get_advise_python_with_http_info(analysis_id, **kwargs)  # noqa: E501
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/thoth/advise_api.py", line 112, in get_advise_python_with_http_info
    return self.api_client.call_api(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 316, in call_api
    return self.__call_api(resource_path, method,
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 148, in __call_api
    response_data = self.request(
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/api_client.py", line 338, in request
    return self.rest_client.GET(url,
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/rest.py", line 228, in GET
    return self.request("GET", url,
  File "/opt/app-root/lib64/python3.8/site-packages/thamos/swagger_client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
thamos.swagger_client.rest.ApiException: (400)
Reason: BAD REQUEST
HTTP response headers: HTTPHeaderDict({'server': 'gunicorn', 'date': 'Thu, 28 Apr 2022 01:05:55 GMT', 'content-type': 'application/json', 'content-length': '272', 'x-thoth-version': '0.34.14', 'x-user-api-service-version': '0.34.14+messaging.0.16.0.storages.0.71.1.common.0.36.0.python.0.16.9', 'x-thoth-search-ui-url': 'https://thoth-station.ninja/search/', 'access-control-allow-origin': '*', 'set-cookie': '829f3dbab311aaac0d90f580d731991c=d36e665b294c43e30415dbb1b2323809; path=/; HttpOnly; Secure; SameSite=None'})
HTTP response body: b'{\n  "error": "Analysis was not successful",\n  "parameters": {\n    "analysis_id": "adviser-220428010502-f22f7444ce59c173"\n  },\n  "status": {\n    "finished_at": "2022-04-28T01:05:48Z",\n    "reason": null,\n    "started_at": "2022-04-28T01:05:03Z",\n    "state": "error"\n  }\n}\n'
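The HTTP 400 body carries the analysis id and the run's start and finish times, which is enough to look the failed run up later. A small sketch that extracts them from the body shown above; `summarize_failure` is a hypothetical helper, not part of thamos.

```python
import json
from datetime import datetime

# Response body copied from the traceback above.
RESPONSE_BODY = b'{\n  "error": "Analysis was not successful",\n  "parameters": {\n    "analysis_id": "adviser-220428010502-f22f7444ce59c173"\n  },\n  "status": {\n    "finished_at": "2022-04-28T01:05:48Z",\n    "reason": null,\n    "started_at": "2022-04-28T01:05:03Z",\n    "state": "error"\n  }\n}\n'

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def summarize_failure(body):
    """Pull the analysis id, final state and run duration out of an error body."""
    payload = json.loads(body)
    status = payload["status"]
    started = datetime.strptime(status["started_at"], TIME_FORMAT)
    finished = datetime.strptime(status["finished_at"], TIME_FORMAT)
    return {
        "analysis_id": payload["parameters"]["analysis_id"],
        "state": status["state"],
        "reason": status["reason"],
        "duration_seconds": (finished - started).total_seconds(),
    }


if __name__ == "__main__":
    print(summarize_failure(RESPONSE_BODY))
```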

@codificat
Member

/milestone OKR review Q2 2022
/sig user-experience

@sesheta sesheta added this to the OKR review Q2 2022 milestone May 3, 2022
@sesheta sesheta added the sig/user-experience Issues or PRs related to the User Experience of our Services, Tools, and Libraries. label May 3, 2022
@codificat
Member

/remove-sig user-experience
/sig stack-guidance

because there are issues resolving the stacks here

@sesheta sesheta added sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance. and removed sig/user-experience Issues or PRs related to the User Experience of our Services, Tools, and Libraries. labels May 3, 2022
@fridex fridex changed the title Make sure ps-stacks can receive recommendation from Thoth [5pt] Make sure ps-stacks can receive recommendation from Thoth May 3, 2022
@fridex
Contributor

fridex commented May 3, 2022

The last integration-tests report (Integration tests update for ocp4-stage (2022-05-03 version 0.11.2)) has the following scenarios failing:

  • ps-nlp-tensorflow
  • ps-nlp-pytorch
  • ps-cv-tensorflow

All of them use the latest recommendation type. The predictor used by the adviser in those cases uses "hops": it randomly takes some path in the resolution process if the latest versions alone cannot be resolved. This implementation may not work well in these cases, and it would be better to provide one that uses backtracking (similar to pip, but offline, using the dependency information from the database; see thoth-station/adviser#2329).
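To illustrate what backtracking over offline dependency data means, here is a toy sketch. It is not the adviser's predictor and not pip's resolver; the `INDEX` contents and package names are hypothetical. Versions are tried newest first, and a choice is undone when it leads to a conflict further down.

```python
# Hypothetical offline dependency data:
# package -> version -> {dependency: allowed versions}, newest version first.
INDEX = {
    "lib-a": {"2.0": {"lib-b": ["1.0"]}, "1.0": {}},
    "lib-b": {"2.0": {}, "1.0": {}},
}


def resolve(requirements, pinned=None):
    """Return {package: version} satisfying all requirements, or None.

    `requirements` is a list of (package, allowed_versions) pairs.
    """
    pinned = dict(pinned or {})
    if not requirements:
        return pinned
    (package, allowed), rest = requirements[0], requirements[1:]
    if package in pinned:
        # Already chosen: the earlier choice must satisfy this constraint too.
        return resolve(rest, pinned) if pinned[package] in allowed else None
    for version in INDEX[package]:  # insertion order: newest first
        if version not in allowed:
            continue
        # Queue the dependencies of this candidate and keep searching.
        queue = rest + list(INDEX[package][version].items())
        result = resolve(queue, {**pinned, package: version})
        if result is not None:
            return result
    return None  # every candidate failed, so the caller backtracks


if __name__ == "__main__":
    # lib-a 2.0 needs lib-b 1.0, which conflicts with the lib-b 2.0 requirement,
    # so the search falls back to lib-a 1.0.
    print(resolve([("lib-a", ["2.0", "1.0"]), ("lib-b", ["2.0"])]))
```

Taking only the latest lib-a here would dead-end; the search succeeds because it revisits that choice.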

These issues may also be related to the solving error described in thoth-station/integration-tests#266 (comment). Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard, so it fails to obtain dependency information (this was observed for some versions). This behaviour is not very nice, but Python packaging supports it. It is consistent with the first paragraph: the adviser might be failing to find suitable versions when the latest recommendation type is used.

To introspect what is happening here, we might:

  1. remove jupyter-tensorboard from the requirements and ask for advice using the latest recommendation type
  2. run the adviser with a different recommendation type, such as stable, which uses a resolution algorithm based on reinforcement learning, and see if it finds a resolution
  3. manually pin jupyter-tensorboard to an older version that is solvable by thoth-solver and see if the resolution process finds a solution even for the latest recommendation type

We can also try user stack scoring and see how the resolver behaves with specific versions of libraries, to narrow down the likely culprit.
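The three experiments listed above boil down to three (recommendation type, requirements) pairs. A sketch of how they could be generated, assuming requirements shaped like the `[packages]` section of a Pipfile; the `BASELINE` contents and the `experiments` helper are hypothetical, and the `==0.1.1` pin is the version found to be solvable later in this thread.

```python
# Hypothetical requirements, shaped like the [packages] section of a Pipfile.
BASELINE = {"jupyterlab": "*", "jupyter-tensorboard": "*", "tensorflow": "*"}


def experiments(packages, suspect="jupyter-tensorboard", solvable_pin="==0.1.1"):
    """Return (recommendation_type, requirements) pairs for the three steps."""
    without = {name: spec for name, spec in packages.items() if name != suspect}
    pinned = {**packages, suspect: solvable_pin}
    return [
        ("latest", without),         # step 1: drop the suspect package
        ("stable", dict(packages)),  # step 2: same stack, different type
        ("latest", pinned),          # step 3: pin to a solvable version
    ]


if __name__ == "__main__":
    for recommendation_type, requirements in experiments(BASELINE):
        print(recommendation_type, requirements)
```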

@fridex
Contributor

fridex commented May 3, 2022

Tested with stable recommendation type:

@fridex
Contributor

fridex commented May 3, 2022

Tested with latest recommendation type without jupyter-tensorboard package in the stack:

@fridex
Contributor

fridex commented May 3, 2022

Tested with latest recommendation type and jupyter-tensorboard==0.1.1 (solvable using our solver):

@fridex
Contributor

fridex commented May 3, 2022

Possible fixes:

  1. use jupyter-tensorboard==0.1.1 in all the stacks that use it
  2. remove jupyter-tensorboard (if it is not used)
  3. contact jupyter-tensorboard upstream for a possible fix - so that it does not have hard requirements on packages to be present in the environment during installation
  4. patch jupyter-tensorboard ourselves and host a patched version on our Pulp Python Package Index

@harshad16
Member

Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard so it fails to obtain dependency information (it was observed for some versions). This behaviour is not very nice, but Python packaging supports it. This can support the first paragraph stated as adviser might be failing to find suitable versions when latest recommendation type is used.

This means our solvers are not able to solve jupyter-tensorboard or other packages with such requirements, right?
Is that the reason we are pinning jupyter-tensorboard to 0.1.1, or are we pinning it because Thoth advice suggested it?

@fridex
Contributor

fridex commented May 3, 2022

Basically, jupyter-tensorboard expects jupyterlab to be already installed during the installation process (it registers itself). Our solver has no jupyterlab installed when it tries to install jupyter-tensorboard so it fails to obtain dependency information (it was observed for some versions). This behaviour is not very nice, but Python packaging supports it. This can support the first paragraph stated as adviser might be failing to find suitable versions when latest recommendation type is used.

This means our solvers are not able to solve jupyter-tensorboard or other packages with such requirements, right?

Generally, no: we are not able to solve libraries that have hard requirements on the environment that are not met in our solvers. Ideally, jupyter-tensorboard should not depend on the environment or execute code during the installation process, or at least should not make it a hard requirement (if that step fails, the package can still be installed).

This might get better over time as Python packaging evolves (and provides static wheel metadata).

Is that the reason we are pinning jupyter-tensorboard to 0.1.1, or are we pinning it because Thoth advice suggested it?

The stack info provided to the user lists the versions that were removed:

"The following versions of 'jupyter-tensorboard' from 'https://pypi.org/simple' were removed due to installation issues in the target environment: 0.2.0, 0.1.10, 0.1.9, 0.1.8, 0.1.7, 0.1.6, 0.1.5, 0.1.4, 0.1.4.dev0, 0.1.3, 0.1.3.dev0, 0.1.2, 0.1.2.dev1, 0.1.2.dev0"

Thoth also suggested using this version (0.1.1), for example in the first successful resolution with the stable recommendation type:

@fridex
Contributor

fridex commented May 3, 2022

Thoth also suggested to use it, for example in the first successful resolution with stable recommendation type:

And for others, it looks like it failed as it did not find any resolution in the allocated time.

@harshad16
Member

ack, thanks for the explanation.

@goern goern added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels May 5, 2022
@codificat
Member

/remove-label human_intervention_required

@sesheta
Member

sesheta commented May 6, 2022

@codificat: The label(s) /remove-label human_intervention_required cannot be applied. These labels are supported: community/discussion, community/group-programming, community/maintenance, community/question, deployment_name/ocp4-stage, deployment_name/ocp4-test, deployment_name/moc-prod, hacktoberfest, hacktoberfest-accepted, kind/cleanup, kind/demo, kind/deprecation, kind/documentation, kind/question, sig/advisor, sig/build, sig/cyborgs, sig/devops, sig/documentation, sig/indicators, sig/investigator, sig/knowledge-graph, sig/slo, sig/solvers, thoth/group-programming, thoth/human-intervention-required, thoth/potential-observation, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, triage/accepted, triage/duplicate, triage/needs-information, triage/not-reproducible, triage/unresolved, lifecycle/submission-accepted, lifecycle/submission-rejected

In response to this:

/remove-label human_intervention_required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@codificat
Member

codificat commented Jun 13, 2022

The integration tests in stage are suffering from cluster issues that have been going on for a while and are expected to take some more time to fix.

Meanwhile, though, a current test of all the overlays using the production environment provided successful advice with the latest recommendation type (what is currently configured in .thoth.yaml). Other recommendation types had a few failures.

The recommendations that are failing fail with the following message:

Resolver did not find any stack that would satisfy requirements and stack characteristics given the time allocated - see https://thoth-station.ninja/j/no_stack

Below is the current status with each stack.

ps-nlp

| overlay | type | result | advise ID | time |
|---|---|---|---|---|
| ps-nlp | latest | success | adviser-220613143911-3483a20bdb243903 | 49s |
| ps-nlp-tensorflow | latest | success | adviser-220613144121-b52e7ffa560ce6c6 | 1m 47s |
| ps-nlp-tensorflow-gpu | latest | success | adviser-220613144345-97b1d4046d403a80 | 2m 4s |
| ps-nlp-pytorch | latest | success | adviser-220613115118-b406e285a0ae8618 | 2m 2s |
| ps-nlp | stable | success | adviser-220614064546-327778099d10c008 | 34m 35s |
| ps-nlp-tensorflow | stable | failure | adviser-220614075651-ef920679375c9a8f | 26m 7s |
| ps-nlp-tensorflow-gpu | stable | success | adviser-220614160559-2f1a80945ee22db5 | 26m 30s |
| ps-nlp-pytorch | stable | success | adviser-220614165324-519f4729bc0fbb77 | 26m 30s |
| ps-nlp | security | success | adviser-220614110005-7b9c92d2284d37fc | 16m 40s |
| ps-nlp-tensorflow | security | success | adviser-220614084155-3ca5965c25ece6d8 | 19m 33s |
| ps-nlp-tensorflow-gpu | security | success | adviser-220614172749-ea72fedac3595022 | 17m 55s |
| ps-nlp-pytorch | security | success | adviser-220614120043-6e7ea342826ad597 | 25m 23s |
| ps-nlp | performance | success | adviser-220614072310-4fe6535e416419a8 | 26m 22s |
| ps-nlp-tensorflow | performance | success | adviser-220614134333-d76be662fa33319a | 26m 29s |
| ps-nlp-tensorflow-gpu | performance | failure | adviser-220614153715-1fc19007f995c727 | 26m 9s |
| ps-nlp-pytorch | performance | failure | adviser-220614150618-eba5b4e3183fd0f2 | 26m 14s |

ps-cv

| overlay | type | result | advise ID | time |
|---|---|---|---|---|
| ps-cv-ocr | latest | success | adviser-220613144932-7c569b4d4585fd54 | 22s |
| ps-cv-tensorflow | latest | success | adviser-220613145241-6d5d7bdc27ac3a3a | 1m 18s |
| ps-cv-pytorch | latest | success | adviser-220613145031-fcb0a951d8adb577 | 1m 44s |
| ps-cv-ocr | stable | success | adviser-220613184700-1f90f53ccf4159b1 | 2m 24s |
| ps-cv-tensorflow | stable | failure | adviser-220613162944-6d3bf6b86a373e6d | 22m 49s |
| ps-cv-pytorch | stable | failure | adviser-220613180701-d348b4ef9c9b3e87 | 26m 17s |
| ps-cv-ocr | performance | success | adviser-220613185358-fb8309ed55dd32d9 | 2m 7s |
| ps-cv-tensorflow | performance | failure | adviser-220613210546-7078358b1944fd96 | 26m 8s |
| ps-cv-pytorch | performance | failure | adviser-220613192843-85146a8c21afc8ee | 27m 3s |
| ps-cv-ocr | security | success | adviser-220614183752-2991f946845d4af | 27s |
| ps-cv-tensorflow | security | failure | adviser-220614180129-becf1431eab85efe | 27m 46s |
| ps-cv-pytorch | security | failure | adviser-220614183855-da15ed8545b4868 | 26m 13s |

ps-ip

| overlay | type | result | advise ID | time |
|---|---|---|---|---|
| ps-ip-ifd | latest | success | adviser-220613145447-b8f73428af85d2bc | 31s |
| ps-ip-ifd | stable | success | adviser-220613160047-b5ae72918b1150b7 | 20m 56s |
| ps-ip-ifd | performance | success | adviser-220613185732-2c9fc59a216df36a | 23m 17s |
| ps-ip-ifd | security | success | adviser-220614174631-d062f02ad3a261bf | 54s |
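For a quick per-type summary of the three tables above, the results can be tallied with a short script. The sketch relies on the tables covering every overlay and recommendation type combination (8 overlays, 4 types), so only the failing rows and their run times are copied in; the helper names are hypothetical.

```python
import re
from collections import Counter

OVERLAYS = [
    "ps-nlp", "ps-nlp-tensorflow", "ps-nlp-tensorflow-gpu", "ps-nlp-pytorch",
    "ps-cv-ocr", "ps-cv-tensorflow", "ps-cv-pytorch", "ps-ip-ifd",
]
TYPES = ["latest", "stable", "security", "performance"]

# Failing (overlay, type) rows and their run times, copied from the tables.
FAILURES = {
    ("ps-nlp-tensorflow", "stable"): "26m 7s",
    ("ps-nlp-tensorflow-gpu", "performance"): "26m 9s",
    ("ps-nlp-pytorch", "performance"): "26m 14s",
    ("ps-cv-tensorflow", "stable"): "22m 49s",
    ("ps-cv-pytorch", "stable"): "26m 17s",
    ("ps-cv-tensorflow", "performance"): "26m 8s",
    ("ps-cv-pytorch", "performance"): "27m 3s",
    ("ps-cv-tensorflow", "security"): "27m 46s",
    ("ps-cv-pytorch", "security"): "26m 13s",
}


def successes_by_type():
    """Return {type: (successful runs, total runs)}."""
    totals, successes = Counter(), Counter()
    for overlay in OVERLAYS:
        for rec_type in TYPES:
            totals[rec_type] += 1
            successes[rec_type] += (overlay, rec_type) not in FAILURES
    return {t: (successes[t], totals[t]) for t in TYPES}


def to_seconds(duration):
    """Convert a duration such as '26m 7s' or '49s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?\s*(?:(\d+)s)?", duration.strip())
    minutes, seconds = (int(part or 0) for part in match.groups())
    return 60 * minutes + seconds


if __name__ == "__main__":
    print(successes_by_type())
    print(sorted(to_seconds(d) for d in FAILURES.values()))
```

Every failing run took between roughly 23 and 28 minutes, which fits the "given the time allocated" wording of the error message.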

@codificat
Member

Meanwhile, though, a current test of all the overlays using the production environment provided successful advice with the latest recommendation type (what is currently configured in .thoth.yaml).

Based on this, I believe we can
/close
this one as complete.

We still need to ensure that the integration tests that check for successful advice on the predictable stacks run successfully (e.g. thoth-station/integration-tests#324), and possibly review the justifications for the failures on some stack/type combinations.

These are tracked in separate issues as appropriate.

@sesheta
Member

sesheta commented Jun 21, 2022

@codificat: Closing this issue.


@sesheta sesheta closed this as completed Jun 21, 2022

6 participants