
Errors when starting to perform the first collection #539

Closed
fulwang opened this issue Jul 25, 2024 · 22 comments

fulwang commented Jul 25, 2024

Describe the bug
Cannot collect any data after starting the container.

Expected behavior
Data can be collected and queried in the browser.

To Reproduce

start the container with command as below:

podman run -itd -v /opt/zhmcexporter:/root/myconfig -p 9291:9291 --name zhmcexporter zhmcexporter:latest -c /root/myconfig/hmccreds.yaml -v

Environment information
zhmc_prometheus_exporter version: 1.7.0.dev1
zhmcclient version: 1.17.0
Verbosity level: 1

  • HMC version:
    HMC certificate validation: False
    HMC version: 2.15.0
    HMC API version: 3.13
    HMC features: None

Command output

Log file
zhmcexporter.log


fulwang commented Jul 25, 2024

I checked out version 1.5.2 and built another container to try, but still cannot get any data collected, as before. Attached is the console log of running the new container.
zhmcexporter-1.5.2.log

@andy-maier andy-maier self-assigned this Jul 25, 2024
@andy-maier

@fulwang If you use version 1.5.2 of the exporter, you also need to use the metric definition file for that version. The warning in your 1.5.2 log:

/usr/local/lib/python3.9/site-packages/zhmc_prometheus_exporter/zhmc_prometheus_exporter.py:540: UserWarning: Ignoring item because its condition "'storage-group-uris' in resource_obj.properties" does not properly evaluate: NameError: name 'resource_obj' is not defined
  warnings.warn("Ignoring item because its condition {!r} does not "

is caused by using a metric definition file that uses the resource object in its conditions, with an exporter version that does not yet have that support.
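To illustrate that failure mode, here is a minimal sketch (not the exporter's actual code; `evaluate_condition` and the namespace handling are hypothetical): the condition string is evaluated in a namespace, and an older exporter whose namespace does not provide the `resource_obj` name gets a NameError, so the item is ignored with exactly this kind of warning:

```python
import warnings

def evaluate_condition(condition, namespace):
    # Hypothetical helper for illustration; the real exporter code differs.
    # Evaluate a metric-item condition string; on any error, warn and
    # treat the item as not applicable, so it gets ignored.
    try:
        return bool(eval(condition, namespace))
    except Exception as exc:
        warnings.warn(
            "Ignoring item because its condition {!r} does not properly "
            "evaluate: {}: {}".format(condition, type(exc).__name__, exc))
        return False

# A pre-1.6-style evaluation namespace has no 'resource_obj' name, so a
# newer condition that references it raises NameError and is ignored:
condition = "'storage-group-uris' in resource_obj.properties"
result = evaluate_condition(condition, {"__builtins__": {}})
print(result)  # → False (with a UserWarning)
```

Using the metrics.yaml that matches the installed exporter version avoids the mismatch.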

On your original error with 1.7.0.dev1:

There are two main errors there:

HTTPError: 503,3: Too many concurrent threads per user [GET /api/cpcs/348762ef-90df-36c2-ae18-8dd2abf730b4]

I have never seen this before and have started a dialogue with the Z development team on that.

HTTPError: 400,14: 'absolute-ifl-capping'' is not a valid value for the corresponding query parm [GET /api/logical-partitions/0a92d550-d75c-35d2-bc51-2bd88fc01b3f]

That is an error in the exporter code, but to find that it would be very helpful to get an exporter log file.

-> Could you please run this version of the exporter again and add the following options to its command line: --log-comp all=debug --log exporter.log ?


fulwang commented Jul 29, 2024

@andy-maier Thanks for the analysis!
Could you tell me where I can get the metric definition file for version 1.5.2 and how to replace it before I rebuild the container image?


fulwang commented Jul 29, 2024

@andy-maier For rerunning v1.7.0, do I need to rebuild the container to add the log options you mentioned, or is adding them to the podman command line enough?


fulwang commented Jul 29, 2024

I just scheduled a run by adding the options on command line.

[root@lpar27 ~]# podman run -itd -v /opt/zhmcexporter:/root/myconfig -p 9291:9291 --name zhmcexporter zhmcexporter:v1.7.0 -c /root/myconfig/hmccreds.yaml -v --log-comp all=debug --log exporter.log
90f58391668f89d1ded5c3d4ebbdb23bb0ffde7bdf8db7f57fb6b0294c55e334
[root@lpar27 ~]#
[root@lpar27 ~]#
[root@lpar27 ~]# podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ccdb73d06b25 localhost/grafana:v10.3.4 4 days ago Up 4 days 0.0.0.0:3000->3000/tcp grafana
82ed76756548 localhost/prometheus:v2.53.0 --config.file=/et... 4 days ago Up 4 days 0.0.0.0:9090->9090/tcp prometheus
a219e9f8fcf6 localhost/nginx:v1.23.3 nginx -g daemon o... 4 days ago Up 4 days 0.0.0.0:1443->1443/tcp rproxy
8ee926934022 localhost/zhmcexporter:v1.5.2 -c /root/myconfig... 4 days ago Up 4 days 0.0.0.0:9292->9291/tcp zhmcexporter_1
c1087def4b23 localhost/s390x/mariadb:10.5.13 mysqld 4 hours ago Up 4 hours 0.0.0.0:3306->3306/tcp ecs_db
05ead85a9219 localhost/nginx:1.17.9 nginx -g daemon o... 4 hours ago Up 4 hours 0.0.0.0:443->443/tcp ecs_nginx
159b9ef35fd6 localhost/ecs_api:test python app.py 4 hours ago Up 4 hours 0.0.0.0:18443->18443/tcp ecs_api
90f58391668f localhost/zhmcexporter:v1.7.0 -c /root/myconfig... 8 seconds ago Up 8 seconds 0.0.0.0:9291->9291/tcp zhmcexporter
[root@lpar27 ~]#

@andy-maier

podman passes the command line after the container name through to the invoked container, so your podman command line looks good to me.


andy-maier commented Jul 29, 2024

The metric definition file for a specific exporter version can be downloaded from the repo, when selecting the tag for that version. For example, for version 1.5.2, this is the repo at that version: https://github.com/zhmcclient/zhmc-prometheus-exporter/tree/1.5.2, and the sample metric file for that version is https://github.com/zhmcclient/zhmc-prometheus-exporter/blob/1.5.2/examples/metrics.yaml

I don't know how you build your container image, and whether you have the metric definition file in the image (vs. mounting its directory). If you have it in the image (which I think is the case given your podman command line), then you need to rebuild your image, and then you probably already have a COPY directive in the Dockerfile that pulls it in from the local directory.


fulwang commented Jul 29, 2024

zhmcexporter_1.log
@andy-maier I chose to have the metric definition file in the image, so I saved the metrics.yaml of version 1.5.2 to the "/root" directory, rebuilt the image as shown below, and ran it on the testing environment again, but no luck so far.


cd /root
git clone https://github.com/zhmcclient/zhmc-prometheus-exporter
cd zhmc-prometheus-exporter/
git checkout 1.5.2
rm -fr examples/metrics.yaml
cp /root/metrics.yaml examples/
make docker


fulwang commented Jul 30, 2024

@andy-maier Could something be wrong on the HMC side? The physical server was shut down for several days due to a malfunction of the cooling system and was powered on again last week. I can see many errors, including "HTTPError: 409,272: Unable to obtain STP configuration data, rc=[0x1000] [GET /api/cpcs/348762ef-90df-36c2-ae18-8dd2abf730b4]".

I have checked the user for HMC access, and the "Web Services API" option is still checked, as before.

@andy-maier

@fulwang The errors "Unable to obtain STP configuration data" are not severe; they only cause the "cpc" label not to be added to metrics for some types of resources.

Having said that, I suggest configuring STP on that HMC so that this error goes away.

Let's walk through the errors in the zhmcexporter_1.log file you attached above:

  • "Ignoring resource-based metrics for CPC BZ17, because enabling auto-update for it failed with HTTPError: 409,272: Unable to obtain STP configuration data, rc=[0x1000] [GET /api/cpcs/348762ef-90df-36c2-ae18-8dd2abf730b4]"

    This causes the metrics from the resource-based metric group "cpc-resource" not to be available. See https://github.com/zhmcclient/zhmc-prometheus-exporter/blob/master/examples/metrics.yaml#L1488 for the metrics you are not getting due to this. Caused by STP not being configured on the HMC.

  • "UserWarning: Ignoring item because its condition "'processor-usage' in resource_obj.properties" does not properly evaluate: NameError: name 'resource_obj' is not defined"

    This causes the LPAR metric "processor_mode_int" not to be available. Probably caused by a version mismatch between the metrics.yaml file you are using and the exporter version.

  • "UserWarning: Skipping metric with exporter name 'ifl_processor_count' in resource metric group 'logical-partition-resource' in metric definition file /etc/zhmc-prometheus-exporter/metrics.yaml, because its resource property 'number-ifl-processors' is not returned by the HMC for CPC 'BZ09'"

    This should probably not be reported at all. It simply is a reminder that the z14 BZ09 does not yet have the corresponding property on its LPAR objects.

  • "Ignoring label 'cpc' on metric group 'storagegroup-resource' due to error in rendering label value Jinja2 expression: HTTPError: 409,272: Unable to obtain STP configuration data, rc=[0x1000] [GET /api/cpcs/348762ef-90df-36c2-ae18-8dd2abf730b4]"

    This causes some metrics to not have the "cpc" label. Caused by STP not being configured on the HMC.


andy-maier commented Jul 30, 2024

@fulwang
On your Docker build:

If you use the "make build" command, then it uses the Dockerfile in the repo. That Dockerfile gets the metrics.yaml file from examples/metrics.yaml.

Your commands shown above first check out version 1.5.2, and then replace the examples/metrics.yaml file with /root/metrics.yaml. That step is not necessary, because when you check out version 1.5.2, the examples/metrics.yaml file already has the correct version for 1.5.2. Depending on the version of /root/metrics.yaml, that might have introduced the version mismatch.

So your commands should be (after removing /root/zhmc-prometheus-exporter):

cd /root
git clone https://github.com/zhmcclient/zhmc-prometheus-exporter
cd zhmc-prometheus-exporter/
git checkout 1.5.2
make docker


andy-maier commented Jul 30, 2024

@fulwang
The messages have been improved in commit 511f7fe and in PR #559 (merged).

You may want to try out the latest version from the master branch (including its matching metrics.yaml file) to see if there are any issues remaining. I'll keep this issue open for a while.


fulwang commented Jul 30, 2024

@andy-maier I built with the latest code and ran it on the testing env a moment ago; here is the log for your review.
zhmcexporter_new.log


fulwang commented Jul 31, 2024

> @fulwang On your Docker build:
>
> If you use the "make build" command, then it uses the Dockerfile in the repo. That Dockerfile gets the metrics.yaml file from examples/metrics.yaml.
>
> Your commands shown above first check out version 1.5.2, and then replace the examples/metrics.yaml file with /root/metrics.yaml. That step is not necessary, because when you check out version 1.5.2, the examples/metrics.yaml file already has the correct version for 1.5.2. Depending on the version of /root/metrics.yaml, that might have introduced the version mismatch.
>
> So your commands should be (after removing /root/zhmc-prometheus-exporter):
>
> cd /root
> git clone https://github.com/zhmcclient/zhmc-prometheus-exporter
> cd zhmc-prometheus-exporter/
> git checkout 1.5.2
> make docker

@andy-maier I realized this later and built the container image yesterday using the source code (tar.gz downloaded from your repo).


fulwang commented Jul 31, 2024

@andy-maier How can we customize metrics.yaml to exclude data collection from CPC BZ17? We just need to ignore it.


Enabling auto-update for CPC BZ17
Ignoring resource-based metrics for CPC BZ17, because enabling auto-update for it failed with ConnectionError: HTTPSConnectionPool(host='172.16.27.231', port=6794): Max retries exceeded with url: /api/cpcs/348762ef-90df-36c2-ae18-8dd2abf730b4 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='172.16.27.231', port=6794): Read timed out. (read timeout=300)")), reason: HTTPSConnectionPool(host='172.16.27.231', port=6794): Read timed out. (read timeout=300)
Enabling auto-update for CPC BZ12
Enabling auto-update for CPC BZ09
Enabling auto-update for CPC BZ15
Enabling auto-update for CPC BZ16

@andy-maier

@fulwang Excluding the metrics for specific CPCs is not possible at the moment. There is an issue #323 open for that, targeted for the upcoming 2.0 version.


andy-maier commented Jul 31, 2024

I created issue #564 for the one traceback error in the new log file.

Update: PR #564 solved that issue and has been merged for the upcoming version 1.7.0.


andy-maier commented Jul 31, 2024

I think we should release version 1.7.0 now - the remaining two issues (STP config, and too many threads) cannot be solved by the exporter.

To avoid the "too many threads" error, I suggest disabling the following metric groups in the metric definition file (set fetch: false):

  • cpc-resource
  • logical-partition-resource
  • partition-resource
  • network-physical-adapter-port
  • partition-attached-network-interface
  • storagegroup-resource
  • storagevolume-resource

If that causes the error to go away, you can gradually enable the metric groups again, starting from the top of the list.
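For reference, a sketch of what disabling a group in the metric definition file could look like (the exact surrounding keys depend on the version of your metrics.yaml and are illustrative here; fetch: false on the metric group entry is the relevant setting):

```yaml
metric_groups:
  cpc-resource:
    prefix: cpc
    fetch: false          # disable collection of this metric group
  logical-partition-resource:
    prefix: lpar
    fetch: false
  # ... repeat for the other metric groups listed above, then
  # re-enable them one by one by setting fetch: true again
```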


andy-maier commented Jul 31, 2024

@fulwang The "too many threads" error happens when the HMC user has more than 25 requests open at the WS-API that are being processed (i.e. request sent, but not yet complete). I think that also applies to asynchronous operations whose jobs are not yet complete.

The exporter can have a maximum of 2 concurrent HMC requests open (the main thread and a background fetch thread, each of which waits for its operation to complete before starting the next one).

Are you using the HMC userid for other tasks that run at the same time?

Could you please post a log file (with --log-comp all=debug --log exporter.log) so I can see the interactions with the HMC?


fulwang commented Aug 1, 2024

@andy-maier I have built an image with your latest code, and it is now running on the testing env for debugging purposes. Please let me know when to send you the logs or any other information needed.

@andy-maier

@fulwang
So you currently do not experience the "too many threads" error anymore?
If so, I don't need any additional logs, and will release version 1.7.0.

@andy-maier

FuLong confirmed that the "too many threads" error did not show up anymore. I am closing this ticket now. Please open a new one if there are other issues.
