Project4: Testing Strategy for Custos

Testing strategy

First of all, we would like to thank Professor Marru, Professor Pierce, all the TAs (Suresh, Shubham, Abhinav and Sreesha), and especially Isuru, as well as our classmates. The discussions on Slack, Git and Classrooms taught us a great deal and helped us debug, deploy and test the software.

Because no Dev instances of Custos were available, we deployed Custos in parallel on Kubernetes, bootstrapped by Rancher on Jetstream1 instances. We got as far as the Vault service, and we also built a Custos development branch and published Docker images to a Docker Hub repository. In the end, we were not able to deploy all Custos services, because pods kept restarting or entering crash loops. We would like to explore this project further and would love to keep working on it after the course. Our Kubernetes cluster is set up at the Rancher URL https://js-169-26.jetstream-cloud.org:30433/g/clusters, and the deployment section there shows that we have deployed up to the Vault service. Below are screenshots of what we have achieved so far:

Kubernetes Cluster Dashboard:

Kubernetes Cluster Dashboard

Rancher UI:

Rancher UI

Services being deployed:

Services being deployed

Deployment up to the Vault service:

Deployment up to the Vault service

KeyCloak:

KeyCloak

Vault:

Vault

Load Test Plan:

Due to time constraints and difficulties in deploying Custos on Jetstream, we first implemented our test plan using the Python SDK notebook provided by Isuru. We then decided to load test a few APIs through the Custos REST API endpoints (REF: https://cwiki.apache.org/confluence/display/CUSTOS/Use+Custos+REST+Endpoints).

Hardware resource usage estimation

For all resources combined, we decided to use no more than 20 GB of memory and 8 CPU cores.

Number of users

We decided to experiment with different numbers of users through JMeter thread groups, with values ranging from 10 to 100 users.

We chose JMeter for load testing because the Custos REST API endpoints made it straightforward to drive the tests over HTTP.

However, in the last 2 days we ran into a separate problem (the system was frequently down), so we were only able to test 3 APIs.

Load and Spike Test:

We decided to run these tests on different kinds of APIs:

a. User Management APIs
b. Group Management APIs
c. Secret Management APIs
d. Sharing Management APIs
e. Agent Management (Community Accounts) APIs

We generated the HTTP request bodies for the JMeter tests with a Python script built around a random-generator function. We then converted them into CSV files and fed those files to the JMeter tests.
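The generation step described above could look roughly like the following. This is only a minimal sketch, not the script we actually ran: the column names (username, first_name, last_name, email, password) and the helper names are assumptions for illustration, and the real Register User body may need different fields. An index prefix keeps every username unique, which the registration test requires.

```python
import csv
import random
import string

# Hypothetical column names; adjust to the fields the target API expects.
FIELDS = ["username", "first_name", "last_name", "email", "password"]


def random_word(rng, length=8):
    """Return a random lowercase string of the given length."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def generate_users(count, seed=None):
    """Build `count` user rows with unique usernames."""
    rng = random.Random(seed)
    rows = []
    for i in range(count):
        # The index prefix guarantees uniqueness even if random parts collide.
        username = f"user{i:04d}_{random_word(rng, 6)}"
        rows.append({
            "username": username,
            "first_name": random_word(rng).capitalize(),
            "last_name": random_word(rng).capitalize(),
            "email": f"{username}@example.org",
            "password": random_word(rng, 12),
        })
    return rows


def write_csv(rows, handle):
    """Write the rows, with a header line, to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Example usage (writes a file that a JMeter CSV Data Set Config can read):
# with open("register_users.csv", "w", newline="") as f:
#     write_csv(generate_users(100, seed=42), f)
```

Passing a fixed seed makes a run reproducible, which helps when comparing break points across test runs.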

A. User Management APIs: Register User API

We applied 100 users with 5 thread groups to the Register User API, which requires a unique user for every registration. The system crashed and responses stopped after 47 users.

We also tested the Enable User API, the Disable User API and the Delete a User API. On our Dev instance, the break points were at around 38 users for the Enable User API, 41 for the Disable User API and 62 for the Delete a User API.
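One way to read a break point like the ones above out of a JMeter results file is to count the successful samples that precede the first failure. The sketch below assumes the results were saved in JMeter's CSV format with a header row containing the `label` and `success` columns; the function name is hypothetical and this is not necessarily how the numbers above were obtained.

```python
import csv


def samples_before_first_failure(handle, label=None):
    """Count successful samples before the first failed one.

    `handle` is an open text handle on a JMeter CSV results file.
    If `label` is given, only samples with that label are considered.
    Returns None when no failure is found.
    """
    ok = 0
    for row in csv.DictReader(handle):
        if label is not None and row["label"] != label:
            continue
        if row["success"].strip().lower() != "true":
            return ok
        ok += 1
    return None


# Example usage:
# with open("results.jtl", newline="") as f:
#     print(samples_before_first_failure(f, label="Register User"))
```

This treats the first failed sample as the break point; a stricter definition (for example, a sustained error rate) would need a different check.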

B. Group Management APIs: Create Groups API

The problem with a few Group Management APIs was that they returned a 404 Bad Request error without any text or context. The Create Groups API, which we tried, showed this kind of error. As part of the test plan, we intended to gradually increase the number of groups from 10 to at least 100 to find the break point.

C. Secret Management APIs: Generate SSH credential API

This is another POST request with several fields. The RESTful endpoint accepted requests but returned nothing, probably because this particular API was down at that moment, so unfortunately we could not test it. As part of our plan, we had again decided to use CSV config files with the body metadata client_id, description and owner_id, for 100 users with 5 thread groups. If the remaining steps work out in the future, we should be able to find the break point for this API as well.

D. Sharing Management APIs: Create an Entity type

This is probably the most important feature of Custos for us, especially because good load balancing can help the Sharing Management APIs. However, we could not test this one either. As a plan, we would follow an approach similar to the one used above. For example, for the Create an Entity type API, we would create a CSV file with client_id and an entity_type with details such as id, name and description, and test with 50 to 100 such users.
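As a rough illustration of that plan, the sketch below turns flat CSV rows into nested JSON request bodies. The flat column layout (client_id, id, name, description) and the function names are assumptions for illustration; the nesting under `entity_type` follows the description above, but the exact body the API expects should be checked against the Custos REST documentation.

```python
import csv
import json


def entity_type_body(row):
    """Map one flat CSV row to a nested request body (assumed shape)."""
    return {
        "client_id": row["client_id"],
        "entity_type": {
            "id": row["id"],
            "name": row["name"],
            "description": row["description"],
        },
    }


def bodies_from_csv(handle):
    """Return one JSON string per CSV row read from an open text handle."""
    return [json.dumps(entity_type_body(row)) for row in csv.DictReader(handle)]


# Example usage:
# with open("entity_types.csv", newline="") as f:
#     for body in bodies_from_csv(f):
#         print(body)
```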

E. Agent Management (Community Accounts) APIs: Register agent

Here too we would use a similar approach: our Python script would create a CSV file of 50 to 100 different values with the columns id, realm_roles, client_roles, attributes, key and values. The main identifier here would be the id column.

Discussions:

We believe the biggest constraint we faced was the series of issues with deploying on Jetstream. With a proper and timely deployment, we could have run more tests and obtained more reliable values. Because the APIs we tested were not running on our own deployment, the results may not reflect how the system would actually behave.
More time for debugging might also have helped us resolve these issues.
Another issue, possibly a critical one, was the responses of some of the APIs. For instance, the Create Groups API under Group Management APIs returned a 404 Bad Request error with no response text at all, so we could not tell whether the fault lay in our own process.
The system being down for the last couple of days prevented us from testing more rigorously.

Recommended Improvements:

Documentation: More detail and more deployment examples in the documentation would help users who are completely new to Custos get comfortable with it.
RESTful API Calls: As pointed out above, several APIs did not return useful information. They simply returned 404 Bad Request errors without any explanation, so we could not debug the problem from our end.
Reliability: We wanted to perform more rigorous tests on this system, but the servers have been down for the last 2 days and only work occasionally, probably because of load from other teams' testing or for other reasons. The fault tolerance, scalability and load balancing of the system could therefore be improved.
Consistency: The Python SDK notebook and the RESTful endpoints for Custos used different body and header contents for many APIs, which left us unsure which format to follow. We believe the keys of the HTTP request bodies and of the Python SDK should be more consistent. Working this out cost us a considerable amount of time.
