decision in the matter of cloudmon/stackmon approach #576
Changes from all commits
bcc4b2f
6f1520c
82ade46
de1a29f
7fb6258
7382ac6
1cb5529
68d3d0e
1df0f6e
3501430
33b28f3
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
--- | ||
title: Evaluating the deployment of CloudMon in SCS infrastructure | ||
type: Decision Record | ||
--- | ||
|
||
## Introduction | ||
|
||
In the fast-paced environment of modern cloud computing, ensuring the reliability and performance | ||
of cloud infrastructure is paramount for organizations. Effective monitoring and testing frameworks | ||
are essential tools in this endeavor, enabling proactive identification and resolution of issues before | ||
Review comment: e.g. most monitoring doesn't really allow proactive resolution of issues. Monitoring is just analysing live data after the fact. e.g. you can see an API is slow in monitoring, or you can see a disk filling up. But all the activities monitoring allows a DevOps Team to do are reactive. You notice a problem in monitoring, then you react. A testing framework, on the other hand, can of course be used to detect problems e.g. before deployment. |
||
they impact critical operations. However, selecting the right monitoring solution can be a daunting task, | ||
particularly when faced with options that are overly complex and lack essential documentation and working | ||
scenarios. | ||
|
||
## Motivation | ||
|
||
One such solution, the [Cloudmon](https://stackmon.org/) project, faces challenges that might limit its effectiveness for organizations looking | ||
for efficient and reliable monitoring capabilities. This section outlines the reasons why our organization opts | ||
against utilizing the Cloudmon project and instead embraces a more streamlined and effective approach involving Gherkin | ||
test scenarios and Python code that maps them to API calls interacting with OpenStack resources. By addressing the complexities and | ||
shortcomings of the Cloudmon project, the SCS organization aims to adopt a monitoring solution that not only meets but | ||
exceeds our requirements for simplicity, reliability, and ease of use. | ||
Review comment: Honestly this phrasing is more than subjective. After dozens of hours spent explaining the documentation, setup scenarios, testing capabilities, and workshops, this statement is very offensive. SCS organization entered the evaluation and asked for multiple workshops to understand the tooling and offered help in improving the documentation. There was absolutely no output. |
||
|
||
## Design Considerations | ||
|
||
Our approach is based on a technical concept. This document serves as a proposal, with the final decision | ||
subject to discussion with SCS team members. We propose a behavior-driven system based on solutions using the Java framework | ||
Review comment on lines +26 to +27: There are many different testing frameworks and testing strategies, which are not really differentiated in this text. You propose a behavior-driven test concept using cucumber, but somehow the document fails to mention any up- or downsides of behavior driven testing or of this specific framework. There is no reasoning I have read in this text why behavior driven testing is superior to the cloudmon approach. And I don't doubt that there might be advantages to it, but those need to be clearly spelled out somewhere. This text also talks about a "technical concept", where can I see that? It isn't linked here, or included. If it is the basis for the decision, it surely should be included? I only know both technologies superficially, for the record, that is cloudmon and cucumber. |
||
Cucumber, utilizing the Gherkin domain-specific language for defining executable specifications. This approach ensures | ||
Review comment: So what you offer is clearly contrary to the objective you stated above: simplicity. Taking Java and a widely unknown framework with an even less known DSL is going to harm the simplicity and understandability of the solution. It requires an enormous amount of effort from operators to learn those things and be able to extend it.
Review comment: The main idea of stackmon is to use Ansible for test scenarios that literally anybody is able to read/understand/replay locally/extend (endlessly). |
||
clear, human-readable test behavior definitions, facilitating participation from both developers and non-technical | ||
contributors. Considering the team's proficiency in Python, the language's simplicity and clarity, alignment with | ||
OpenStack's ecosystem, and the robust support from the Python community, it's evident that Python presents a superior | ||
Review comment: I do not get how you get from the Java mentioned above to Python. |
||
choice for implementing Gherkin-based testing over Java. By harnessing Python's strengths, we can maximize efficiency, | ||
accelerate development, and ensure seamless integration with OpenStack, ultimately enhancing the effectiveness of our | ||
testing processes. | ||
|
||
## Challenges | ||
|
||
During our assessment of the Cloudmon/Stackmon project, we encountered significant challenges related to documentation, | ||
particularly the lack of examples for configuration setups and usage guidelines. The lack of comprehensive | ||
Review comment: This doesn't sound fair. You were offered a dedicated session, pointed to the relevant documentation (and as mentioned above it was communicated towards the CloudMon team that by taking those workshops you will help improve the documentation), pointed to a detailed explanation of the configuration, pointed to the live production configuration. If a production configuration counts as a "lack of examples" I do not know what more to say. If production testing scenarios of a very big OpenStack based public cloud with many more services than vanilla SCS comes with are not sufficient, it shows me the evaluation was not performed in much detail. |
||
documentation impeded our understanding of the project and hindered effective utilization. We understood that some | ||
features implemented in Cloudmon/Stackmon are not necessary for the health monitor, such as running the tests from | ||
different virtual locations, because every network in the project is responsible for a different AZ. With our test | ||
approach, on the other hand, we can simply clone [scs-health-monitor](https://github.com/SovereignCloudStack/scs-health-monitor/tree/main) | ||
to another physical location and run the same tests from a physically different place on earth. Examples like that | ||
confirmed our belief that an atomic approach saves us a lot of effort, time, and cost. | ||
|
||
## Decision | ||
|
||
By opting for [Gherkin](https://cucumber.io/docs/gherkin/) test scenarios with Python mapping API calls to OpenStack, | ||
Review comment: OpenTelekomCloud is using an alternative framework (Robot Framework) for doing system acceptance testing. This comes with a huge effort (which you do not explicitly mention here, but can be read between the lines) of developing this "python mapping API calls to OpenStack". This is a huge and complex piece of work that can be avoided by relying on existing tooling for OpenStack. And btw, Robot Framework (at OTC) and behavior driven testing are great for testing, but absolutely unsuitable for health and performance monitoring. |
||
we aim to address the complexities and shortcomings of the Cloudmon project while ensuring our monitoring and testing | ||
Review comment: See comment above - what do these complexities and shortcomings consist of? |
||
processes remain efficient, reliable, and aligned with our organizational goals. To connect both technologies, we also | ||
use the [Behave](https://behave.readthedocs.io/en/latest/) framework. This decision represents a strategic move towards | ||
enhancing our cloud infrastructure management capabilities and maximizing operational effectiveness. |
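
As a rough illustration of the idea described above (Gherkin steps mapped to Python functions that call a cloud API), the following standard-library-only sketch may help. It is not Behave's API and not code from scs-health-monitor: `FakeCloud`, `step`, `run_scenario`, the scenario text, and the network name are all hypothetical stand-ins. In a real setup, Behave's `given`/`when`/`then` decorators would take the role of `step`, and a real OpenStack client would replace `FakeCloud`.

```python
import re


class FakeCloud:
    """Hypothetical in-memory stand-in for an OpenStack connection."""

    def __init__(self):
        self.networks = []

    def create_network(self, name):
        self.networks.append(name)
        return name

    def list_networks(self):
        return list(self.networks)


STEPS = []  # (compiled pattern, function) pairs, filled by the decorator below


def step(pattern):
    """Register a function as the implementation of one Gherkin step."""
    def decorator(func):
        STEPS.append((re.compile(pattern + r"$"), func))
        return func
    return decorator


@step(r"Given a connection to the cloud")
def connect(ctx):
    ctx["cloud"] = FakeCloud()


@step(r'When I create a network named "(?P<name>[^"]+)"')
def create_network(ctx, name):
    ctx["cloud"].create_network(name)


@step(r'Then the network "(?P<name>[^"]+)" exists')
def network_exists(ctx, name):
    assert name in ctx["cloud"].list_networks(), f"network {name!r} not found"


def run_scenario(text):
    """Run each step line against the registry; return the names of the functions called."""
    ctx = {}
    executed = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith(("Feature:", "Scenario:")):
            continue
        for pattern, func in STEPS:
            match = pattern.match(line)
            if match:
                func(ctx, **match.groupdict())
                executed.append(func.__name__)
                break
        else:
            raise LookupError(f"no step matches: {line}")
    return executed


SCENARIO = """
Feature: Network lifecycle
  Scenario: Create a network
    Given a connection to the cloud
    When I create a network named "scs-hm-net"
    Then the network "scs-hm-net" exists
"""

if __name__ == "__main__":
    print(run_scenario(SCENARIO))
```

The scenario text stays readable for non-developers, while each step resolves to an ordinary Python function. A step without a matching function fails loudly, which is the behavior one would also expect from a real BDD runner.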
Review comment:
Unfortunately I have many problems with this.
The complete text lacks a clearer definition of how e.g. Cloudmon "monitors" "reliability and performance", and of what that even means in this context, because the context is unclear.
First, it only talks about "Cloud Computing". From guessing and from what I know about Cloudmon, I think this is about the IaaS layer of a cloud, but that isn't even mentioned anywhere.
I think this should be spelled out.
Second, none of "reliability", "performance", or "monitoring" is really defined here, and they are used interchangeably, which poses some problems, as you can see when looking at my next questions.