
Errors when starting to perform the first collection #539

Closed
fulwang opened this issue Jul 25, 2024 · 22 comments

fulwang commented Jul 25, 2024

Describe the bug
Cannot collect any data after starting the container.

Expected behavior
Data can be collected and queried in the browser.

To Reproduce

start the container with command as below:

podman run -itd -v /opt/zhmcexporter:/root/myconfig -p 9291:9291 --name zhmcexporter zhmcexporter:latest -c /root/myconfig/hmccreds.yaml -v

Environment information
zhmc_prometheus_exporter version: 1.7.0.dev1
zhmcclient version: 1.17.0
Verbosity level: 1

  • HMC version:
    HMC certificate validation: False
    HMC version: 2.15.0
    HMC API version: 3.13
    HMC features: None

Command output

Log file
zhmcexporter.log


fulwang commented Jul 25, 2024

I checked out version 1.5.2 and built another container to try, but still cannot get any data collected, as before. Attached is the console log of running the new container.
zhmcexporter-1.5.2.log

@andy-maier andy-maier self-assigned this Jul 25, 2024
@andy-maier

@fulwang If you use version 1.5.2 of the exporter, you also need to use the metric definition file for that version. The warning in your 1.5.2 log:

/usr/local/lib/python3.9/site-packages/zhmc_prometheus_exporter/zhmc_prometheus_exporter.py:540: UserWarning: Ignoring item because its condition "'storage-group-uris' in resource_obj.properties" does not properly evaluate: NameError: name 'resource_obj' is not defined
  warnings.warn("Ignoring item because its condition {!r} does not "

is caused by using a metric definition file that uses the resource object in its conditions, with an exporter version that does not yet have that support.
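To illustrate that failure mode, here is a minimal sketch (not the exporter's actual code; `evaluate_condition` and the namespace handling are hypothetical): the condition string is evaluated in a namespace, and an older exporter whose namespace does not provide the `resource_obj` name gets a NameError, so the item is ignored with exactly this kind of warning:

```python
import warnings

def evaluate_condition(condition, namespace):
    # Hypothetical helper for illustration; the real exporter code differs.
    # Evaluate a metric-item condition string; on any error, warn and
    # treat the item as not applicable, so it gets ignored.
    try:
        return bool(eval(condition, namespace))
    except Exception as exc:
        warnings.warn(
            "Ignoring item because its condition {!r} does not properly "
            "evaluate: {}: {}".format(condition, type(exc).__name__, exc))
        return False

# A pre-1.6-style evaluation namespace has no 'resource_obj' name, so a
# newer condition that references it raises NameError and is ignored:
condition = "'storage-group-uris' in resource_obj.properties"
result = evaluate_condition(condition, {"__builtins__": {}})
print(result)  # → False (with a UserWarning)
```

Using the metrics.yaml that matches the installed exporter version avoids the mismatch.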

On your original error with 1.7.0.dev1:

There are two main errors there:

HTTPError: 503,3: Too many concurrent threads per user [GET /api/cpcs/348762ef-90df-36c2-ae18-8dd2abf730b4]

I have never seen this before and have started a dialogue with the Z development team on that.

HTTPError: 400,14: 'absolute-ifl-capping'' is not a valid value for the corresponding query parm [GET /api/logical-partitions/0a92d550-d75c-35d2-bc51-2bd88fc01b3f]

That is an error in the exporter code, but to find that it would be very helpful to get an exporter log file.

-> Could you please run this version of the exporter again and add the following options to its command line: --log-comp all=debug --log exporter.log ?


fulwang commented Jul 29, 2024

@andy-maier Thanks for the analysis!
Could you tell me where I can get the metric definition file for version 1.5.2 and how to replace it before I rebuild the container image?


fulwang commented Jul 29, 2024

@andy-maier For rerunning v1.7.0, do I need to rebuild the container to add the log options you mentioned, or is adding them to the podman command line enough?


fulwang commented Jul 29, 2024

I just scheduled a run by adding the options on command line.

[root@lpar27 ~]# podman run -itd -v /opt/zhmcexporter:/root/myconfig -p 9291:9291 --name zhmcexporter zhmcexporter:v1.7.0 -c /root/myconfig/hmccreds.yaml -v --log-comp all=debug --log exporter.log
90f58391668f89d1ded5c3d4ebbdb23bb0ffde7bdf8db7f57fb6b0294c55e334
[root@lpar27 ~]#
[root@lpar27 ~]#
[root@lpar27 ~]# podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ccdb73d06b25 localhost/grafana:v10.3.4 4 days ago Up 4 days 0.0.0.0:3000->3000/tcp grafana
82ed76756548 localhost/prometheus:v2.53.0 --config.file=/et... 4 days ago Up 4 days 0.0.0.0:9090->9090/tcp prometheus
a219e9f8fcf6 localhost/nginx:v1.23.3 nginx -g daemon o... 4 days ago Up 4 days 0.0.0.0:1443->1443/tcp rproxy
8ee926934022 localhost/zhmcexporter:v1.5.2 -c /root/myconfig... 4 days ago Up 4 days 0.0.0.0:9292->9291/tcp zhmcexporter_1
c1087def4b23 localhost/s390x/mariadb:10.5.13 mysqld 4 hours ago Up 4 hours 0.0.0.0:3306->3306/tcp ecs_db
05ead85a9219 localhost/nginx:1.17.9 nginx -g daemon o... 4 hours ago Up 4 hours 0.0.0.0:443->443/tcp ecs_nginx
159b9ef35fd6 localhost/ecs_api:test python app.py 4 hours ago Up 4 hours 0.0.0.0:18443->18443/tcp ecs_api
90f58391668f localhost/zhmcexporter:v1.7.0 -c /root/myconfig... 8 seconds ago Up 8 seconds 0.0.0.0:9291->9291/tcp zhmcexporter
[root@lpar27 ~]#

@andy-maier

podman passes the command line after the container name through to the invoked container, so your podman command line looks good to me.


andy-maier commented Jul 29, 2024

The metric definition file for a specific exporter version can be downloaded from the repo, when selecting the tag for that version. For example, for version 1.5.2, this is the repo at that version: https://github.com/zhmcclient/zhmc-prometheus-exporter/tree/1.5.2, and the sample metric file for that version is https://github.com/zhmcclient/zhmc-prometheus-exporter/blob/1.5.2/examples/metrics.yaml

I don't know how you build your container image, and whether you have the metric definition file in the image (vs. mounting its directory). If you have it in the image (which I think is the case given your podman command line), then you need to rebuild your image, and then you probably already have a COPY directive in the Dockerfile that pulls it in from the local directory.


fulwang commented Jul 29, 2024

zhmcexporter_1.log
@andy-maier I chose to have the metric definition file in the image, so I saved the metrics.yaml of version 1.5.2 to the "/root" directory, rebuilt the image as shown below, and ran it on the testing environment again, but no luck so far.


cd /root
git clone https://github.com/zhmcclient/zhmc-prometheus-exporter
cd zhmc-prometheus-exporter/
git checkout 1.5.2
rm -fr examples/metrics.yaml
cp /root/metrics.yaml examples/
make docker


fulwang commented Jul 30, 2024

@andy-maier Could something be wrong on the HMC side? The physical server was shut down for several days due to a malfunction of the cooling system and was powered on again last week. I can see many errors, including "HTTPError: 409,272: Unable to obtain STP configuration data, rc=[0x1000] [GET /api/cpcs/348762ef-90df-36c2-ae18-8dd2abf730b4]".

I have checked the user for HMC access, and the "Web Services API" option is still checked, as before.

@andy-maier

@fulwang The errors "Unable to obtain STP configuration data" are not severe; they only cause the "cpc" label not to be added to metrics for some types of resources.

Having said that, I suggest configuring STP on that HMC so that this error goes away.

Let's walk through the errors in the zhmcexporter_1.log file you attached above:

  • "Ignoring resource-based metrics for CPC BZ17, because enabling auto-update for it failed with HTTPError: 409,272: Unable to obtain STP configuration data, rc=[0x1000] [GET /api/cpcs/348762ef-90df-36c2-ae18-8dd2abf730b4]"

    This causes the metrics from the resource-based metric group "cpc-resource" not to be available. See https://github.com/zhmcclient/zhmc-prometheus-exporter/blob/master/examples/metrics.yaml#L1488 for the metrics you are not getting due to this. Caused by STP not being configured on the HMC.

  • "UserWarning: Ignoring item because its condition "'processor-usage' in resource_obj.properties" does not properly evaluate: NameError: name 'resource_obj' is not defined"

    This causes the LPAR metric "processor_mode_int" not to be available. Probably caused by a version mismatch between the metrics.yaml file you are using and the exporter version.

  • "UserWarning: Skipping metric with exporter name 'ifl_processor_count' in resource metric group 'logical-partition-resource' in metric definition file /etc/zhmc-prometheus-exporter/metrics.yaml, because its resource property 'number-ifl-processors' is not returned by the HMC for CPC 'BZ09'"

    This should probably not be reported at all. It simply is a reminder that the z14 BZ09 does not yet have the corresponding property on its LPAR objects.

  • "Ignoring label 'cpc' on metric group 'storagegroup-resource' due to error in rendering label value Jinja2 expression: HTTPError: 409,272: Unable to obtain STP configuration data, rc=[0x1000] [GET /api/cpcs/348762ef-90df-36c2-ae18-8dd2abf730b4]"

    This causes some metrics to not have the "cpc" label. Caused by STP not being configured on the HMC.


andy-maier commented Jul 30, 2024

@fulwang
On your Docker build:

If you use the "make build" command, then it uses the Dockerfile in the repo. That Dockerfile gets the metrics.yaml file from examples/metrics.yaml.

Your commands shown above first check out version 1.5.2, and then replace the examples/metrics.yaml file with /root/metrics.yaml. That step is not necessary, because when you check out version 1.5.2, the examples/metrics.yaml file already has the correct version for 1.5.2. Depending on the version of /root/metrics.yaml, that might have introduced the version mismatch.

So your commands should be (after removing /root/zhmc-prometheus-exporter):

cd /root
git clone https://github.com/zhmcclient/zhmc-prometheus-exporter
cd zhmc-prometheus-exporter/
git checkout 1.5.2
make docker


andy-maier commented Jul 30, 2024

@fulwang
The messages have been improved in commit 511f7fe and in PR #559 (merged).

You may want to try out the latest version from the master branch (including its matching metrics.yaml file) to see if there are any issues remaining. I'll keep this issue open for a while.


fulwang commented Jul 30, 2024

@andy-maier I built with the latest code and ran it on the testing env a moment ago; here is the log for your review.
zhmcexporter_new.log


fulwang commented Jul 31, 2024

> @fulwang On your Docker build:
>
> If you use the "make build" command, then it uses the Dockerfile in the repo. That Dockerfile gets the metrics.yaml file from examples/metrics.yaml.
>
> Your commands shown above first check out version 1.5.2, and then replace the examples/metrics.yaml file with /root/metrics.yaml. That step is not necessary, because when you check out version 1.5.2, the examples/metrics.yaml file already has the correct version for 1.5.2. Depending on the version of /root/metrics.yaml, that might have introduced the version mismatch.
>
> So your commands should be (after removing /root/zhmc-prometheus-exporter):
>
> cd /root
> git clone https://github.com/zhmcclient/zhmc-prometheus-exporter
> cd zhmc-prometheus-exporter/
> git checkout 1.5.2
> make docker

@andy-maier I realized this later and built the container image yesterday using the source code (tar.gz downloaded from your repo).


fulwang commented Jul 31, 2024

@andy-maier How can we customize metrics.yaml to exclude data collection from CPC BZ17? We just need to ignore it.


Enabling auto-update for CPC BZ17
Ignoring resource-based metrics for CPC BZ17, because enabling auto-update for it failed with ConnectionError: HTTPSConnectionPool(host='172.16.27.231', port=6794): Max retries exceeded with url: /api/cpcs/348762ef-90df-36c2-ae18-8dd2abf730b4 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='172.16.27.231', port=6794): Read timed out. (read timeout=300)")), reason: HTTPSConnectionPool(host='172.16.27.231', port=6794): Read timed out. (read timeout=300)
Enabling auto-update for CPC BZ12
Enabling auto-update for CPC BZ09
Enabling auto-update for CPC BZ15
Enabling auto-update for CPC BZ16

@andy-maier

@fulwang Excluding the metrics for specific CPCs is not possible at the moment. There is an issue #323 open for that, targeted for the upcoming 2.0 version.


andy-maier commented Jul 31, 2024

I created issue #564 for the one traceback error in the new log file.

Update: PR #564 solved that issue and has been merged for the upcoming version 1.7.0.


andy-maier commented Jul 31, 2024

I think we should release version 1.7.0 now - the remaining two issues (STP config, and too many threads) cannot be solved by the exporter.

To avoid the "too many threads" error, I suggest disabling the following metric groups in the metric definition file (set fetch: false):

  • cpc-resource
  • logical-partition-resource
  • partition-resource
  • network-physical-adapter-port
  • partition-attached-network-interface
  • storagegroup-resource
  • storagevolume-resource

If that causes the error to go away, you can gradually enable the metric groups again, starting from the top of the list.
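For reference, a sketch of what disabling a group in the metric definition file could look like (the exact surrounding keys depend on the version of your metrics.yaml and are illustrative here; fetch: false on the metric group entry is the relevant setting):

```yaml
metric_groups:
  cpc-resource:
    prefix: cpc
    fetch: false          # disable collection of this metric group
  logical-partition-resource:
    prefix: lpar
    fetch: false
  # ... repeat for the other metric groups listed above, then
  # re-enable them one by one by setting fetch: true again
```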


andy-maier commented Jul 31, 2024

@fulwang The "too many threads" error happens when the HMC user has more than 25 requests open at the WS-API that are being processed (i.e. request sent, but not yet complete). I think that also applies to asynchronous operations whose jobs are not yet complete.

The exporter can have a maximum of 2 concurrent HMC requests open (the main thread and a background fetch thread, each of which waits for its operation to complete before starting the next one).

Are you using the HMC userid for other tasks that run at the same time?

Could you please post a log file (with --log-comp all=debug --log exporter.log) so I can see the interactions with the HMC?


fulwang commented Aug 1, 2024

@andy-maier I have built an image with your latest code, and it is now running on the testing env for debugging purposes. Please let me know when to send you the logs or any other information needed.

@andy-maier

@fulwang
So you currently do not experience the "too many threads" error anymore?
If so, I don't need any additional logs, and will release version 1.7.0.

@andy-maier

FuLong confirmed that the "too many threads" error did not show up anymore. I am closing this ticket now. Please open a new one if there are other issues.
