More explicit failure indication in cbt run. #97
The last thing CBT does is copy the logs and output files from the nodes/clients back to the head node. This is all raw data and fio summary output, so you need to create a parser if you want to visualize the data as cluster performance.
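For what it's worth, here is a minimal sketch of such a parser, assuming the fio jobs were run with JSON output enabled (`--output-format=json`); if your output.* files are fio's default human-readable text, you would need a text parser instead. The path pattern mirrors the directory layout shown later in this thread but is only illustrative:

```python
import glob
import json

def summarize(results_glob):
    """Sum read/write IOPS across the per-client fio JSON output files."""
    total_read = total_write = 0.0
    for path in glob.glob(results_glob, recursive=True):
        with open(path) as f:
            data = json.load(f)
        # fio's JSON output holds one entry per job, each with read/write stats.
        for job in data.get("jobs", []):
            total_read += job["read"]["iops"]
            total_write += job["write"]["iops"]
    return total_read, total_write

if __name__ == "__main__":
    reads, writes = summarize("/tmp/00000000/LibrbdFio/**/output.*")
    print(f"aggregate IOPS - read: {reads:.0f}, write: {writes:.0f}")
```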
Thanks for confirming/clarifying @ommoreno.
I'm running CBT on an existing cluster, but I'm not getting any output in the "output.0" file. All I'm getting is some output in "historic_ops.out". I tried running both the librbdfio and rados benchmarks.
Try running the fio or rados bench command standalone and see if you get an error, then walk backwards through the command list until you find the first command that failed.

I added code to CBT to check for failures while constructing the cluster and throw an exception if one occurs, but I did not enable failure checking everywhere: there are cases where some users may find it useful to ignore a single failure, such as a test that constructs a 1000-OSD cluster and encounters a single bad disk. You can turn it on anywhere you like by adding ", continue_if_error=False" as the last parameter of the common.pdsh calls in the CBT code (see the sketch below).

It sounds like your cluster built if you are seeing historic_ops.out results. What happens when you run the rados bench command that CBT runs by itself? Also look in benchmark/radosbench.py and enable error checking there, so that CBT will tell you what is going wrong.
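To make that concrete, here is a rough sketch of what an error-checked call looks like. `common` is CBT's common.py, but the node list and command string are made up for illustration, and the exact `common.pdsh` signature may differ between CBT versions:

```python
import common  # CBT's common.py, which wraps pdsh

# Hypothetical node list and command, for illustration only.
nodes = "[email protected],[email protected]"

# With continue_if_error=False, a non-zero exit status on any node raises
# an exception instead of being silently ignored, so the first failing
# step becomes visible in the cbt run.
common.pdsh(nodes, "sudo rados -p cbt-bench bench 60 write",
            continue_if_error=False)
```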
Thanks for the steps; I got CBT running after a lot of troubleshooting. I really need to document the steps so I won't hit these issues when running it on another cluster.
When executing the cbt.py test suite, it is very hard to figure out which steps failed/passed.
My experience with this tool is very limited as I just started using it, but I see that the pdsh commands fail without reporting any error, so it is hard to decipher why.
Also, the use_existing flag in the cluster: section of the YAML file should be highlighted when running against an existing cluster (see the sketch below). Once I get through a successful execution I will create a pull request for any doc changes, if that makes sense, and file other issues as I see them.
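For reference, a minimal sketch of that setting; the hostnames are placeholders, and the surrounding keys follow CBT's example YAML, which may differ between versions:

```yaml
cluster:
  user: 'cbt'                       # ssh user on every node
  head: 'head01.example.com'
  clients: ['client01.example.com']
  osds: ['osd01.example.com']
  mons: ['mon01.example.com']
  use_existing: True                # benchmark the running cluster; skip build/teardown
```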
Another issue I see is that the username and group name are assumed to be the same, which is not always the case. It might be useful to add a groups field as well.
Lastly: I think I have gotten past some of my initial hurdles and am able to execute an fio benchmark, but I am not sure what comes next.
The last step I see is:
21:30:37 - DEBUG - cbt - pdsh -R ssh -w [email protected],[email protected],[email protected] sudo chown -R behzad_dastur.behzad_dastur /tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/randwrite/*
21:30:37 - DEBUG - cbt - rpdcp -f 1 -R ssh -w [email protected],[email protected],[email protected] -r /tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/randwrite/* /tmp/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/randwrite
I can see logs created at:
[root@cbtvm001-d658 cbt]# ls /tmp/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/read/
collectl.b-stageosd001-r19f29-prod.acme.symcpe.net
collectl.v-stagemon-001-prod.abc.acme.net
collectl.v-stagemon-002-prod.abc.acme.net
historic_ops.out.b-stageosd001-r19f29-prod.abc.acme.net
output.0.v-stagemon-001-prod.abc.acme.net
Are there ways to visualize this data now?