Add some TDC guides
bryanfriedman committed Feb 14, 2024
1 parent 5468fc5 commit de054a6
Showing 18 changed files with 5,667 additions and 0 deletions.
3,226 changes: 3,226 additions & 0 deletions metadata/lms/service-routing-istio-refarch/content.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions metadata/lms/service-routing-istio-refarch/description.md
@@ -0,0 +1 @@
A reference architecture for implementing the Istio Service Mesh
10 changes: 10 additions & 0 deletions metadata/lms/service-routing-istio-refarch/guide.json
@@ -0,0 +1,10 @@
{
"id": "tanzu-holly-tdc-guides-kubernetes-service-routing-istio-refarch",
"slug": "service-routing-istio-refarch",
"title": "Istio Reference Architecture",
"type": "article",
"status": "unlisted",
"topics": [
"TDC"
]
}
781 changes: 781 additions & 0 deletions metadata/lms/workload-tenancy-autoscaling-refarch/content.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions metadata/lms/workload-tenancy-autoscaling-refarch/description.md
@@ -0,0 +1 @@
Guidance for autoscaling application workloads and cluster compute resources
10 changes: 10 additions & 0 deletions metadata/lms/workload-tenancy-autoscaling-refarch/guide.json
@@ -0,0 +1,10 @@
{
"id": "tanzu-holly-tdc-guides-kubernetes-workload-tenancy-autoscaling-refarch",
"slug": "workload-tenancy-autoscaling-refarch",
"title": "Autoscaling Reference Architecture",
"type": "article",
"status": "unlisted",
"topics": [
"TDC"
]
}
878 changes: 878 additions & 0 deletions metadata/lms/workload-tenancy-cluster-tuning/content.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions metadata/lms/workload-tenancy-cluster-tuning/description.md
@@ -0,0 +1 @@
A workflow for tuning Kubernetes clusters
10 changes: 10 additions & 0 deletions metadata/lms/workload-tenancy-cluster-tuning/guide.json
@@ -0,0 +1,10 @@
{
"id": "tanzu-holly-tdc-guides-kubernetes-workload-tenancy-cluster-tuning",
"slug": "workload-tenancy-cluster-tuning",
"title": "Cluster Tuning Guide",
"type": "article",
"status": "unlisted",
"topics": [
"TDC"
]
}
283 changes: 283 additions & 0 deletions metadata/lms/workload-tenancy-platform-checklist/content.md
@@ -0,0 +1,283 @@
<!--
date: '2021-02-24'
description: A checklist for Kubernetes platform considerations
keywords:
- Kubernetes
lastmod: '2021-02-24'
linkTitle: Platform Readiness Checklist
parent: Workload Tenancy
title: Platform Readiness Checklist
weight: 100
featured: true
oldPath: "/content/guides/kubernetes/workload-tenancy-platform-checklist.md"
aliases:
- "/guides/kubernetes/workload-tenancy-platform-checklist"
level1: Building Kubernetes Runtime
level2: Building Your Kubernetes Platform
tags: []
-->


This list is a starting point for thinking through the Kubernetes platform
running your applications. It is not exhaustive and should be expanded based on
your requirements.

### Required


- etcd is highly available

It is important to configure etcd with high availability to minimize the risk of
data loss. Ensure a minimum of three nodes are running and are placed in
different fault domains.





- etcd cluster is healthy

Ensure the members of your etcd cluster are healthy. The commands below give
you a quick view of the cluster's status.

Substitute your node IPs for x, y, and z:

```
# List the current etcd cluster members
ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list

# Check the health of each endpoint
ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key --endpoints=https://x:2379,https://y:2379,https://z:2379 endpoint health
https://x:2379 is healthy: successfully committed proposal: took = 9.969618ms
https://y:2379 is healthy: successfully committed proposal: took = 10.675474ms
https://z:2379 is healthy: successfully committed proposal: took = 13.338815ms
```






- A backup/restore strategy is outlined

Each cluster is different. Think through and understand what elements of your
cluster need to be backed up and restored in the event of single- or multi-node
failure, database corruption, or other problems. Consider the applications
running on the platform, the roles and role bindings, persistent storage
volumes, ingress configuration, security and network policies, and so on.
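
As one starting point, here is a minimal sketch using Velero (an assumption; any
tool that covers both cluster resources and persistent volumes works). The
schedule, backup names, and namespace selection are illustrative:

```
# Back up all namespaces and snapshot persistent volumes every night at 03:00
velero schedule create nightly-cluster-backup \
  --schedule "0 3 * * *" \
  --include-namespaces "*" \
  --snapshot-volumes

# Take an ad hoc backup and confirm it completes before risky changes
velero backup create pre-upgrade-backup --wait
velero backup describe pre-upgrade-backup
```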





- A certificate renewal process is in place

All certificates in the cluster have an expiration date. Although the cluster
will likely be upgraded or replaced before then, it is still recommended to have
a documented process for refreshing/renewing them.
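
For kubeadm-provisioned clusters (an assumption; managed offerings and other
installers handle this differently), the documented process can be as simple as:

```
# Check when each control plane certificate expires
kubeadm certs check-expiration

# Renew all kubeadm-managed certificates, then restart the control plane static pods
kubeadm certs renew all
```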





- Failure domain/availability zones have been considered

In both cloud and on-premises installations, spreading control plane nodes
across different availability zones/failure domains is fundamental to cluster
resiliency. Unless your topology already provides architectural redundancy
(mirror clusters per AZ, or globally load-balanced clusters), give careful
thought to control plane node placement.
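
Assuming your nodes carry the standard topology labels, a quick way to confirm
how they are spread across failure domains is:

```
# Show the zone and region for every node
kubectl get nodes -L topology.kubernetes.io/zone -L topology.kubernetes.io/region
```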




### Ingress


- Load balancer is redundant

Your ingress controller should be configured to run on several predefined nodes
for high availability. Configure the load balancer to route traffic to these
nodes accordingly, and test to make sure all defined nodes are receiving
traffic.
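
For example, assuming an ingress-nginx deployment and an illustrative hostname,
you can confirm where the controller pods are scheduled and that traffic reaches
them through the load balancer:

```
# Confirm the ingress controller pods run on the intended nodes
kubectl get pods -n ingress-nginx -o wide

# Send repeated requests through the load balancer and check they all succeed
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code}\n" https://ingress.example.com/healthz
done
```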





- Load balancer throughput meets requirements

Testing your cluster for network bottlenecks before going live is a must. Test
your load balancer and ingress throughput capacity so you can set realistic
expectations based on the results.
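
One way to establish a baseline, assuming the `hey` load generator and an
illustrative test route behind your load balancer:

```
# 60-second load test with 100 concurrent connections; review latency and error rates
hey -z 60s -c 100 https://app.example.com/
```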




### Network / CNI


- Pod to Pod communication works

Validate that Pods can communicate with other Pods and that the Services
exposing those Pods correctly route traffic.
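
A minimal check, assuming an existing Service named `my-service` in namespace
`my-namespace` (both illustrative):

```
# Run a throwaway pod and request the Service by its cluster DNS name
kubectl run pod-to-pod-check --rm -it --restart=Never --image=busybox:1.36 -- \
  wget -qO- -T 5 http://my-service.my-namespace.svc.cluster.local
```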





- Egress communication has been tested

Validate that Pods can reach endpoints outside of the cluster.
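
For example, from a throwaway pod, confirm an external endpoint responds (the
image and URL are illustrative):

```
# Verify outbound connectivity from inside the cluster
kubectl run egress-check --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -sS -o /dev/null -w "%{http_code}\n" --max-time 5 https://www.example.com
```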





- DNS functionality validated

Test that containers in the platform can resolve external and internal domains.
Test internal domains with and without the `.cluster.local` suffix.
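
For instance, resolve the API server's Service name in its short and fully
qualified forms, plus an external name:

```
# All three lookups should succeed
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- sh -c \
  "nslookup kubernetes.default && nslookup kubernetes.default.svc.cluster.local && nslookup example.com"
```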





- Network Policy configuration validated

When applicable, validate that network policies are being enforced as expected.
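
A simple sanity check is to apply a default-deny policy in a test namespace
(`netpol-test` here is illustrative) and confirm that traffic which previously
worked is now blocked:

```
kubectl create namespace netpol-test

# Deny all ingress traffic to every pod in the namespace
kubectl apply -n netpol-test -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF

# Requests to pods in netpol-test should now time out if the CNI enforces policies
```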




### Capacity


- Nodes can be added seamlessly

The worker node deployment and bootstrap process should be defined and
automated. Ensure new nodes can be added to the cluster seamlessly, keeping the
platform prepared for growth.





- Cluster autoscaling enabled

Cluster autoscaling automatically adjusts the size of the cluster when
insufficient resources are available to run new Pods, or when nodes have been
underutilized for an extended period of time. When applicable, verify it is
enabled and that the expected actions are taken under both conditions.
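
One way to exercise scale-up is to request more replicas than the current nodes
can schedule and watch new nodes join (the deployment name and replica count are
illustrative):

```
# Force unschedulable pods, then watch for additional nodes
kubectl scale deployment my-app --replicas=50
kubectl get pods --field-selector=status.phase=Pending
kubectl get nodes -w
```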





- ResourceQuotas are defined

[ResourceQuotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/)
provide constraints that limit aggregate consumption of resources per namespace.
They limit the quantity of resources of a given type that can be created. Define
and implement them in the relevant namespaces.
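
A minimal example, assuming a namespace named `team-a` already exists and using
illustrative limits:

```
kubectl apply -n team-a -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
EOF
```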





- LimitRanges are defined

[LimitRanges](https://kubernetes.io/docs/concepts/policy/limit-range/#enabling-limit-range)
define default, minimum and maximum memory, CPU and storage utilization per Pod
in a namespace. These should be defined to avoid running unbounded containers.

You can find more information in the
[resource limits](../workload-tenancy-cluster-tuning#resource-limits)
section of the cluster tuning guide.
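
A minimal example that sets container defaults and ceilings in the same
illustrative namespace:

```
kubectl apply -n team-a -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-limits
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 250m
        memory: 256Mi
      default:
        cpu: 500m
        memory: 512Mi
      max:
        cpu: "2"
        memory: 2Gi
EOF
```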




### Monitoring


- Platform is monitored

The API, controllers, etcd, and worker node health status should be monitored.
Ensure notifications are correctly delivered, and a dead man's switch is
configured and working.
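
In addition to your monitoring stack, the API server exposes health endpoints
that are useful for spot checks and for wiring up external probes:

```
# Per-check readiness detail from the API server
kubectl get --raw='/readyz?verbose'

# Overall liveness
kubectl get --raw='/livez'
```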





- Containers are monitored

The containers running on the platform should be monitored for performance and
availability. Kubernetes provides liveness and readiness checks, and Prometheus
can give you more in-depth information about application performance.
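
A minimal sketch of liveness and readiness probes on a container (the image,
paths, and timings are illustrative):

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: app
      image: nginx:1.25
      ports:
        - containerPort: 80
      readinessProbe:
        httpGet:
          path: /
          port: 80
        periodSeconds: 5
      livenessProbe:
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 10
        periodSeconds: 10
EOF
```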




### Storage


- Test storage classes

Ensure defined storage classes are working as expected. Test them by creating
persistent volume claims to ensure they work as defined. Validate the claims
bind properly and have the expected write permissions.
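
For each class, a quick test might look like this (the class name is
illustrative; claims against WaitForFirstConsumer classes bind only once a pod
uses them):

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: storage-class-test
spec:
  storageClassName: standard
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF

# The claim should reach the Bound phase
kubectl get pvc storage-class-test
```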




### Upgrades


- Define and test the upgrade process

Document, automate and test the upgrade process to ensure it is consistent and
repeatable.

This will help you determine an upgrade's expected downtime and availability
impact so you can properly set expectations with application owners.
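
For kubeadm-managed clusters (an assumption; other distributions ship their own
tooling), the rehearsal typically starts with:

```
# Review available versions and the component upgrade plan
kubeadm upgrade plan

# Apply the control plane upgrade, then drain and upgrade each node in turn
kubeadm upgrade apply <target-version>
```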




### Identity


- Configure an identity provider

Verify that the configured identity provider is functional. Test the platform
login process using existing and new users.



- Define user groups and roles

Ensure roles and role bindings have been correctly applied to the groups
created.
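
`kubectl auth can-i` with impersonation is a quick way to verify the bindings
behave as intended (the user, group, and namespace are illustrative):

```
# Should be allowed for members of the dev-team group in the dev namespace
kubectl auth can-i get pods --as=jane --as-group=dev-team -n dev

# Should be denied for the same group at cluster scope
kubectl auth can-i delete nodes --as=jane --as-group=dev-team
```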




### Metrics


- Validate metric collection

Verify that the metrics aggregation pipeline for the workloads and the platform
is functional. This is a requirement for cluster and pod autoscalers to work.
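
Assuming metrics-server (or an equivalent resource metrics API) is installed, a
quick check is that these commands return data without errors:

```
kubectl top nodes
kubectl top pods -A
```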



### Logging


- Validate log aggregation and forwarding

Verify the container and platform logs are reaching their destination. These can
be stored within the cluster or forwarded to an external system.



1 change: 1 addition & 0 deletions metadata/lms/workload-tenancy-platform-checklist/description.md
@@ -0,0 +1 @@
A checklist for Kubernetes platform considerations
10 changes: 10 additions & 0 deletions metadata/lms/workload-tenancy-platform-checklist/guide.json
@@ -0,0 +1,10 @@
{
"id": "tanzu-holly-tdc-guides-kubernetes-workload-tenancy-platform-checklist",
"slug": "workload-tenancy-platform-checklist",
"title": "Platform Readiness Checklist",
"type": "article",
"status": "unlisted",
"topics": [
"TDC"
]
}