evm.log
This is the main log file of CloudForms and the log to follow for most basic problems.
First, check which roles are running on which appliances in the environment:
# grep local= evm.log
Example:
[----] I, [2019-12-05T13:44:53.486372 #25740:bf0f68] INFO -- : MiqServer: local=N, master=N, status= started, id=00004, pid=08944, guid=f1b6476f-b3e4-4ef0-a9e3-8f605b39d738, name=EVM, zone=RHV, hostname=cfme-rhv.example.com, ipaddress=10.0.0.1, version=5.10.12.3, build=20191029180220_f883bde, active roles=automate:database_operations:ems_inventory:ems_operations:event:reporting:smartstate:user_interface:web_services:websocket
[----] I, [2019-12-05T13:44:53.490625 #25740:bf0f68] INFO -- : MiqServer: local=Y, master=N, status= started, id=00001, pid=25740, guid=d8658f16-0ba7-44b8-8f97-69501e6ae999, name=EVM, zone=default, hostname=cfme.example.com, ipaddress=10.0.0.2, version=5.10.12.3, build=20191029180220_f883bde, active roles=automate:database_operations:database_owner:embedded_ansible:ems_inventory:ems_operations:event:notifier:reporting:user_interface:web_services:websocket
[----] I, [2019-12-05T13:44:53.494736 #25740:bf0f68] INFO -- : MiqServer: local=N, master=Y, status= started, id=00003, pid=32399, guid=a9cec71e-eda5-4b0d-90f7-24dd03e6b208, name=EVM, zone=vCenter Zone, hostname=cfme-vcenter1.example.com, ipaddress=10.0.0.3, version=5.10.12.3, build=20191029180220_f883bde, active roles=automate:database_operations:ems_inventory:ems_metrics_collector:ems_metrics_coordinator:ems_metrics_processor:ems_operations:event:reporting:scheduler:smartproxy:smartstate:web_services:websocket
[----] I, [2019-12-05T13:44:53.498199 #25740:bf0f68] INFO -- : MiqServer: local=N, master=N, status= started, id=00002, pid=31673, guid=0613c826-1240-4327-bf94-f452e55acc93, name=EVM, zone=vCenter Zone, hostname=cfme-vcenter2.example.com, ipaddress=10.0.0.4, version=5.10.12.3, build=20191029180220_f883bde, active roles=automate:database_operations:ems_metrics_collector:ems_metrics_processor:ems_operations:reporting:smartproxy:smartstate:web_services:websocket
How to read these lines:
- local = Whether this status line refers to the appliance that wrote the log (Y = yes, N = no)
- master = Whether this appliance is the master server for the region (one per region). The master monitors heartbeats from the other servers and, when one stops responding, marks it as not responding so its roles can fail over to a server that is responding. To quickly find the current master, see the grep sketch after this list.
- name = name of the appliance
- zone = zone of the appliance
- roles = Roles enabled on the appliance
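For example, to find which appliance currently holds the master role, the same local= status lines can be narrowed further; a minimal sketch that simply reuses the master=Y field shown in the sample output above:
# grep 'local=' evm.log | grep 'master=Y'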
Next check to see if the MiqQueue is healthy:
The MiqQueue is the queueing system that all workers pull their work from. The counts are logged every 5 minutes; you want to ensure that none of them keep growing, as a growing backlog will cause slowness in the environment. Each log line shows, per zone, the worker roles (e.g. smartstate, ems_metrics_processor, etc.) and the number of tasks queued for each.
# grep "MiqQueue count for state=\[\"ready\"\]" evm.log
Example:
[----] I, [2019-12-02T12:25:24.775076 #2444:e16f54] INFO -- : Q-task_id([log_status]) MIQ(MiqServer.log_system_status) [EVM Server (2236)] MiqQueue count for state=["ready"] by zone and role: {"Worker Appliance Zone"=>{"ems_metrics_processor"=>985, nil=>6}, "Database Appliance Zone"=>{"smartstate"=>1, "ems_metrics_processor"=>6, "event"=>182, nil=>1}, "UI Appliance Zone"=>{"ems_operations"=>1071, "event"=>127, "ems_metrics_processor"=>6, "smartstate"=>1}, "default"=>{nil=>17, "ems_inventory"=>1, "smartstate"=>1, "ems_metrics_processor"=>3}}
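If one queue looks suspicious, its count can be tracked over time by extracting just that field from each status line. A rough sketch, using ems_metrics_processor purely as an example role name (substitute whichever role you are watching):
# grep 'MiqQueue count for state=\["ready"\]' evm.log | grep -o '"ems_metrics_processor"=>[0-9]*'
If the numbers printed keep climbing from one 5-minute interval to the next, that queue is backing up.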
Next, if these appear to be the relevant logs, grep out all the errors and warnings. In general, once a relevant error line is found, search to see whether a bug is already open for it:
# grep -E ' E, | W, ' evm.log
Then I take the timestamp of the relevant error/warning and do a less on the log to get the full stack trace.
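A minimal example of that workflow (the timestamp here is just a placeholder copied from a grep hit):
# less evm.log
Then type /2019-12-05T13:44:53 and press Enter to jump to that entry, and page down to read the full stack trace.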
automation.log
This is the main log for provisioning issues and issues with automate domains, so it is the log to search when a customer reports a provisioning problem. Sometimes the problem is caused by custom code, and Red Hat's official support policy is that Red Hat does not troubleshoot custom code the customer has written.
Determine whether the automate role is enabled on the appliance whose logs you have, then run a grep similar to the one used for evm.log:
# grep -E ' E, | W, ' automation.log
Again, once you find a relevant error, copy the timestamp, do a less, and search for that timestamp to get the full stack trace.
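To confirm that the automate role is actually enabled on the appliance, the local= status line in evm.log can be reused; a quick sketch based on the sample output earlier in this article:
# grep 'local=Y' evm.log | grep -o 'active roles=.*' | grep automate
If this prints a line containing automate, the role is enabled on the local appliance.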
production.log
This is the log to check for errors that happen in the UI. This time, I like to do an additional grep for FATAL:
# grep -Ei 'fatal|error|warn' production.log
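Once a FATAL or ERROR entry is found, it often helps to pull a few lines of surrounding context; a minimal sketch using standard grep options:
# grep -i -B 2 -A 5 'fatal' production.log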
If you see any memory exceeded errors or refresh problems, it is good to check this log to see whether the appliance is overburdened. This can happen when too many roles are enabled on a single appliance. Possible remedies are to provision more appliances, adjust the memory settings for the workers, or distribute roles more evenly across appliances.
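A quick way to spot workers hitting memory limits is to grep evm.log for memory-related warnings and errors; a rough sketch, assuming the worker messages mention the word memory (the exact wording can vary between versions):
# grep -i 'memory' evm.log | grep -E ' E, | W, '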
If it's a provider-specific issue, these are the logs to look into. The more common providers are listed below, with an example grep after the list:
- aws.log - AWS
- azure.log - Azure
- fog.log - OpenStack
- kubernetes.log - OpenShift
- rhevm.log - Red Hat Enterprise Virtualization Manager
- vim.log - vCenter
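The same error/warning grep used for evm.log generally applies to these provider logs as well, since they share the same log format; for example, for an OpenStack issue:
# grep -E ' E, | W, ' fog.log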