Add support for Container insights on Windows #180

@KlwntSingh KlwntSingh commented Mar 2, 2024

Description:
This PR adds the Container Insights (CI) on Windows feature to awscontainerinsightreceiver. It merges the dev branch aws-ci-windows into aws-ci-dev.


Testing:

  1. Unit tests pass for Windows
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscontainerinsightreceiver/internal/k8swindows      (cached)
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscontainerinsightreceiver/internal/k8swindows/extractors   (cached)
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscontainerinsightreceiver/internal/k8swindows/hcsshim      (cached)
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscontainerinsightreceiver/internal/k8swindows/kubelet      (cached)
  2. PR checks can confirm unit tests passing on Linux
  3. Complete e2e testing passed.
[Screenshots: 10 images, 2024-03-06, 4:01 AM to 4:06 AM]
  4. Sample of output in performance log group for Windows node
{
    "AutoScalingGroupName": "eks-windows-2022-mng-42c6d6f1-7ce3-0ec3-XXXXXXXXXXXXXXX",
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName"
                ],
                [
                    "ClusterName"
                ]
            ],
            "Metrics": [
                {
                    "Name": "node_status_condition_pid_pressure",
                    "Unit": "Count"
                },
                {
                    "Name": "node_status_condition_disk_pressure",
                    "Unit": "Count"
                },
                {
                    "Name": "node_status_condition_memory_pressure",
                    "Unit": "Count"
                },
                {
                    "Name": "node_cpu_reserved_capacity",
                    "Unit": "Percent"
                },
                {
                    "Name": "node_status_allocatable_pods",
                    "Unit": "Count"
                },
                {
                    "Name": "node_status_condition_ready",
                    "Unit": "Count"
                },
                {
                    "Name": "node_status_condition_unknown",
                    "Unit": "Count"
                },
                {
                    "Name": "node_number_of_running_pods",
                    "Unit": "Count"
                },
                {
                    "Name": "node_cpu_limit"
                },
                {
                    "Name": "node_status_capacity_pods",
                    "Unit": "Count"
                },
                {
                    "Name": "node_memory_limit",
                    "Unit": "Bytes"
                },
                {
                    "Name": "node_cpu_usage_total"
                },
                {
                    "Name": "node_memory_working_set",
                    "Unit": "Bytes"
                },
                {
                    "Name": "node_memory_reserved_capacity",
                    "Unit": "Percent"
                },
                {
                    "Name": "node_network_total_bytes",
                    "Unit": "Bytes/Second"
                },
                {
                    "Name": "node_cpu_utilization",
                    "Unit": "Percent"
                },
                {
                    "Name": "node_memory_utilization",
                    "Unit": "Percent"
                },
                {
                    "Name": "node_number_of_running_containers",
                    "Unit": "Count"
                }
            ]
        }
    ],
    "ClusterName": "XXXXX-CI-2022",
    "InstanceId": "i-XXXXXXXXXXXX",
    "InstanceType": "m5.large",
    "NodeName": "ip-XX-XX-XX-XX.us-west-2.compute.internal",
    "OperatingSystem": "windows",
    "Sources": [
        "kubelet",
        "pod",
        "calculated"
    ],
    "Timestamp": "1709726796214",
    "Type": "Node",
    "Version": "0",
    "kubernetes": {
        "host": "ip-XX-XX-XX-XX.us-west-2.compute.internal"
    },
    "node_cpu_request": 550,
    "node_cpu_usage_system": 0,
    "node_cpu_usage_user": 0,
    "node_memory_pgfault": 0,
    "node_memory_pgmajfault": 0,
    "node_memory_request": 448790528,
    "node_memory_rss": 0,
    "node_memory_usage": 2205806592,
    "node_network_rx_bytes": 324.75,
    "node_network_rx_dropped": 0,
    "node_network_rx_errors": 0,
    "node_network_tx_bytes": 207.86666666666667,
    "node_network_tx_dropped": 0,
    "node_network_tx_errors": 0,
    "node_cpu_limit": 2000,
    "node_cpu_reserved_capacity": 27.500000000000004,
    "node_cpu_usage_total": 2,
    "node_cpu_utilization": 0.1,
    "node_memory_limit": 8274866176,
    "node_memory_reserved_capacity": 5.423538199344531,
    "node_memory_utilization": 7.9183519112418335,
    "node_memory_working_set": 655233024,
    "node_network_total_bytes": 532.6166666666666,
    "node_number_of_running_containers": 2,
    "node_number_of_running_pods": 2,
    "node_status_allocatable_pods": 110,
    "node_status_capacity_pods": 110,
    "node_status_condition_disk_pressure": 0,
    "node_status_condition_memory_pressure": 0,
    "node_status_condition_pid_pressure": 0,
    "node_status_condition_ready": 1,
    "node_status_condition_unknown": 0
}
  5. Link to integ test results for both Linux and Windows. The CW agent runs successfully on both Linux and Windows. Fluent Bit fails with a timeout when destroying the cluster; Fluent Bit itself works correctly.
    https://github.com/aws/private-amazon-cloudwatch-agent-staging/actions/runs/8175470234/job/22352686273
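
The derived fields in the sample node record above are consistent with simple ratios of the raw fields. The sketch below recomputes them from the sample values; the formulas are inferred from the numbers in the record, not taken from the receiver's code, and the helper names `percent` and `approxEqual` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// percent returns part as a percentage of whole, or 0 when whole is 0.
func percent(part, whole float64) float64 {
	if whole == 0 {
		return 0
	}
	return part / whole * 100
}

// approxEqual compares two floats with a small absolute tolerance.
func approxEqual(a, b float64) bool {
	return math.Abs(a-b) < 1e-6
}

func main() {
	// Raw values copied from the sample record.
	cpuRequest, cpuUsage, cpuLimit := 550.0, 2.0, 2000.0
	memRequest, memWorkingSet, memLimit := 448790528.0, 655233024.0, 8274866176.0
	rxBytes, txBytes := 324.75, 207.86666666666667

	// Assumed relationships (inferred from the sample, not from the source code).
	fmt.Println("node_cpu_reserved_capacity:", percent(cpuRequest, cpuLimit))
	fmt.Println("node_cpu_utilization:", percent(cpuUsage, cpuLimit))
	fmt.Println("node_memory_reserved_capacity:", percent(memRequest, memLimit))
	fmt.Println("node_memory_utilization:", percent(memWorkingSet, memLimit))
	fmt.Println("node_network_total_bytes:", rxBytes+txBytes)

	// Compare one computed value against the value reported in the record.
	fmt.Println("cpu utilization matches sample:", approxEqual(percent(cpuUsage, cpuLimit), 0.1))
}
```

Under these assumptions, memory utilization is based on `node_memory_working_set` rather than `node_memory_usage`, which is what the sample values suggest.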

Documentation:
This commit adds documentation - #166

* Add pod level metric collection for Windows

This PR defines code structure for metric provider which works on Windows.

1. Changed receiver.go in awscontainerinsights to run for Windows with metric provider.
2. Added summary API in kubeletclient
3. Add kubeletProvider to return metrics at different levels, i.e. pod, container, and node.
4. Updated hostInfo providers to run for Windows.
5. Updated ebsVolume Info provider to run for Windows.

1. Define correct ebsVolume Info provider for Windows
2. Change logic around k8s leader election to run for Windows
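
As a rough illustration of the "metric provider" structure this commit describes, the receiver can pick a provider per operating system. This is a minimal sketch, not the PR's code: `MetricProvider`, `cadvisorProvider`, `kubeletSummaryProvider`, and `providerFor` are hypothetical names, and the actual change may rely on Go build tags rather than a runtime check.

```go
package main

import (
	"fmt"
	"runtime"
)

// MetricProvider is a hypothetical abstraction over a metric source.
type MetricProvider interface {
	Name() string
}

// cadvisorProvider stands in for the existing Linux collection path.
type cadvisorProvider struct{}

func (cadvisorProvider) Name() string { return "cadvisor" }

// kubeletSummaryProvider stands in for the Windows path backed by the
// kubelet summary API.
type kubeletSummaryProvider struct{}

func (kubeletSummaryProvider) Name() string { return "kubelet-summary" }

// providerFor selects a provider for the given GOOS value. Taking the OS
// as a parameter keeps the selection logic testable on any platform.
func providerFor(goos string) MetricProvider {
	if goos == "windows" {
		return kubeletSummaryProvider{}
	}
	return cadvisorProvider{}
}

func main() {
	fmt.Println("selected provider:", providerFor(runtime.GOOS).Name())
}
```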

# Conflicts:
#	receiver/awscontainerinsightreceiver/internal/cadvisor/extractors/cpu_extractor.go
#	receiver/awscontainerinsightreceiver/internal/cadvisor/extractors/diskio_extractor.go
#	receiver/awscontainerinsightreceiver/internal/cadvisor/extractors/extractor.go
#	receiver/awscontainerinsightreceiver/internal/cadvisor/extractors/fs_extractor.go
#	receiver/awscontainerinsightreceiver/internal/cadvisor/extractors/mem_extractor.go
#	receiver/awscontainerinsightreceiver/internal/cadvisor/extractors/net_extractor.go
#	receiver/awscontainerinsightreceiver/receiver.go
* Add CPU extractors from kubelet summary API

1. Make cadvisor helper func's public to be used in k8swindows extractor
2. Add CPU extractor and add utilization fields
3. Add unit test for CPU extractor.
4. Add unit test data for kubelet summary API
5. Add helper func to convert Pod and Node summary stats to RawMetric

* Refactor code

1. Changed cExtractor to cextractors
2. Add nil checks to avoid panic during pointer dereferences

* Refactored code

1. Added missing HasValue func in extractors
2. Replaced cextractor with cExtractor
3. Corrected extractorhelper name with missing characters
# Conflicts:
#	receiver/awscontainerinsightreceiver/internal/cadvisor/extractors/cpu_extractor.go
#	receiver/awscontainerinsightreceiver/internal/cadvisor/extractors/diskio_extractor.go
#	receiver/awscontainerinsightreceiver/internal/cadvisor/extractors/extractor.go
#	receiver/awscontainerinsightreceiver/internal/cadvisor/extractors/mem_extractor.go
#	receiver/awscontainerinsightreceiver/internal/k8swindows/kubelet.go
* Add memory extractor at pod and node level

1. Added memory extractor at pod and node level
2. Add unit tests for memory extractor
3. use cpu and memory extractor for k8s windows

* Fix adding tags to collected metrics from extractors
# Conflicts:
#	receiver/awscontainerinsightreceiver/internal/k8swindows/extractors/cpu_extractor_test.go
#	receiver/awscontainerinsightreceiver/internal/k8swindows/k8swindows.go
#	receiver/awscontainerinsightreceiver/internal/k8swindows/kubelet.go
* Define structs for CPU and Memory and stats

1. Create new structs to represent CPU and memory stats in RawMetric. This
removes RawMetric dependency on Kubelet CPU and memory stats.
2. Refactor existing RawMetric struct to use new CPU and memory stats.
3. Add more unit tests for extractorhelpers
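
One way to picture the decoupling described in items 1 and 2: the extractors consume receiver-owned value types, and a single conversion step copies data out of the kubelet summary types, guarding against nil pointers. This is a sketch under assumptions; `kubeletCPUStats` and `kubeletMemoryStats` are simplified stand-ins for the kubelet types, and `CPUStat`, `MemoryStat`, the fields of `RawMetric`, and `toRawMetric` are hypothetical.

```go
package main

import "fmt"

// Simplified stand-ins for kubelet summary API stats, whose numeric
// fields are pointers and may be nil.
type kubeletCPUStats struct {
	UsageNanoCores       *uint64
	UsageCoreNanoSeconds *uint64
}

type kubeletMemoryStats struct {
	UsageBytes      *uint64
	WorkingSetBytes *uint64
}

// Hypothetical receiver-owned value types with no kubelet dependency.
type CPUStat struct {
	UsageNanoCores       uint64
	UsageCoreNanoSeconds uint64
}

type MemoryStat struct {
	UsageBytes      uint64
	WorkingSetBytes uint64
}

// RawMetric carries only receiver-owned types.
type RawMetric struct {
	Name   string
	CPU    CPUStat
	Memory MemoryStat
}

// deref returns the pointed-to value, or 0 for a nil pointer.
func deref(p *uint64) uint64 {
	if p == nil {
		return 0
	}
	return *p
}

// toRawMetric copies kubelet stats into a RawMetric, tolerating nil inputs.
func toRawMetric(name string, cpu *kubeletCPUStats, mem *kubeletMemoryStats) RawMetric {
	m := RawMetric{Name: name}
	if cpu != nil {
		m.CPU = CPUStat{
			UsageNanoCores:       deref(cpu.UsageNanoCores),
			UsageCoreNanoSeconds: deref(cpu.UsageCoreNanoSeconds),
		}
	}
	if mem != nil {
		m.Memory = MemoryStat{
			UsageBytes:      deref(mem.UsageBytes),
			WorkingSetBytes: deref(mem.WorkingSetBytes),
		}
	}
	return m
}

func main() {
	workingSet := uint64(655233024)
	m := toRawMetric("node", nil, &kubeletMemoryStats{WorkingSetBytes: &workingSet})
	fmt.Printf("%+v\n", m)
}
```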

* Refactor: Remove parameters passing by reference in extractors

* Remove extra comments in error
# Conflicts:
#	receiver/awscontainerinsightreceiver/internal/k8swindows/extractors/cpu_extractor.go
#	receiver/awscontainerinsightreceiver/internal/k8swindows/extractors/extractor.go
#	receiver/awscontainerinsightreceiver/internal/k8swindows/kubelet.go
…tributing#150)

* Add container level CPU and memory metrics collection

    1. Add metric collection at container level
    2. Refactor existing kubelet code to make it unit testable.
    3. Add unit tests for kubelet to test pod, node and container level metric collection.

* Refactor: rename port -> hostPort and summaryProvider -> kubeletSummaryProvider

* Refactor: Change naming in kubelet

2. Remove pass by reference to extractors

* Refactored: address naming inconsistencies

* Refactor: remove extra GetClient

* Fix merge conflict
# Conflicts:
#	receiver/awscontainerinsightreceiver/internal/k8swindows/extractors/extractor.go
#	receiver/awscontainerinsightreceiver/internal/k8swindows/kubelet.go
…#151)

* Add storage metrics for container and node level

1. Add storage extractors for container and node level
2. Add metric source for Windows metric collection
3. Refactor metric source for cadvisor
4. Add os label for windows

* Address chad's and pooja's comments

* Refactor: Address review comments

* Refactor: remove extra add source func
# Conflicts:
#	receiver/awscontainerinsightreceiver/internal/k8swindows/k8swindows.go
…iner (amazon-contributing#153)

* Enable awscontainerinsights receiver to run inside Host Process container

1. Added workaround to fix ServiceAccount token and cert path for kubelet account inside HPC.
2. Added workaround to fix above issue in k8s clientset.

* Addressed chad's comments

* Addressed pooja's comments

* Fix go.mod
…ntributing#154)

* Add storage metrics for container and node level

1. Add storage extractors for container and node level
2. Add metric source for Windows metric collection
3. Refactor metric source for cadvisor
4. Add os label for windows

# Conflicts:
#	internal/aws/containerinsight/const.go
#	receiver/awscontainerinsightreceiver/internal/k8swindows/extractors/extractor.go
#	receiver/awscontainerinsightreceiver/internal/k8swindows/extractors/extractorhelpers.go
#	receiver/awscontainerinsightreceiver/internal/k8swindows/extractors/fs_extractor.go
#	receiver/awscontainerinsightreceiver/internal/k8swindows/k8swindows.go
#	receiver/awscontainerinsightreceiver/internal/k8swindows/kubelet/kubelet.go
#	receiver/awscontainerinsightreceiver/internal/k8swindows/kubelet/kubelet_test.go
#	receiver/awscontainerinsightreceiver/internal/stores/utils.go
#	receiver/awscontainerinsightreceiver/internal/stores/utils_test.go

* Add hcsshim API as source

1. Added hcsshim API as alternative to kubelet for networking stats
2. Added unit tests for hcsshim provider
3. Add new fields to network extractor

* Fix e2e test for docker build

1. Ran `go mod tidy` inside cmd/otelcontribcol
# Conflicts:
#	receiver/awscontainerinsightreceiver/go.mod
#	receiver/awscontainerinsightreceiver/go.sum

# Conflicts:
#	receiver/awscontainerinsightreceiver/internal/k8swindows/extractors/extractor.go
#	receiver/awscontainerinsightreceiver/internal/k8swindows/kubelet/kubelet.go
…on-contributing#156)

# Conflicts:
#	receiver/awscontainerinsightreceiver/internal/k8swindows/hcsshim/hcsshim.go
…#161)

* 1. Fix CPU cores in Windows

2. Sum fs usage field

* fix fs type

* fix fs issue

* fix fs issue
* Added readme for awscontainerinsights for Windows

* Removed Linux metrics from Windows metrics

* Added back documentation removed by mistake
Removed completed todos

# Conflicts:
#	receiver/awscontainerinsightreceiver/internal/k8swindows/kubelet/kubelet.go
@KlwntSingh KlwntSingh requested a review from mxiamxia as a code owner March 2, 2024 07:30
1. Add windows build tag to fix building cw agent on Windows
2. Downgrade internal/aws/containerinsight from 0.92 to 0.89
3. Separate unit tests in util.go specific for Windows
4. Fix util unit tests applicable for Windows
5. Fix goporto issue
6. Fix lint issue
7. Fix regression in unit tests caused by rebasing on mainline
@KlwntSingh KlwntSingh force-pushed the aws-cwa-ciwindows-cherry-picked-2 branch from 839747f to c1bda62 Compare March 2, 2024 08:17

var metricsExtractors = []extractors.MetricExtractor{}

func New(logger *zap.Logger, decorator *stores.K8sDecorator, hostInfo host.Info) (*K8sWindows, error) {

There are no tests for the methods in this file.

Author

I added tests for this file. Unit test coverage is limited to initializing the k8sWindows object.
Adding a unit test for GetMetrics will require some refactoring. I can follow up with that unit test after the release.

1. Add unit test for kubelet client on Windows
2. Run DCGM scraper only for CW agent on Linux
@codecov-commenter

Codecov Report

Attention: Patch coverage is 44.32432%, with 103 lines in your changes missing coverage. Please review.

❗ No coverage uploaded for pull request base (aws-cwa-dev@9cb314e).

| Files | Patch % | Lines |
|---|---|---|
| receiver/awscontainerinsightreceiver/receiver.go | 11.11% | 31 Missing and 1 partial ⚠️ |
| internal/aws/k8s/k8sclient/clientset.go | 16.66% | 25 Missing ⚠️ |
| ...ainerinsightreceiver/internal/host/nodeCapacity.go | 31.81% | 14 Missing and 1 partial ⚠️ |
| ...eiver/internal/stores/kubeletutil/kubeletclient.go | 0.00% | 14 Missing ⚠️ |
| internal/aws/containerinsight/utils.go | 0.00% | 7 Missing ⚠️ |
| internal/kubelet/client.go | 60.00% | 3 Missing and 1 partial ⚠️ |
| ...ontainerinsightreceiver/internal/host/ebsvolume.go | 50.00% | 2 Missing and 1 partial ⚠️ |
| ...scontainerinsightreceiver/internal/stores/utils.go | 91.89% | 2 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@              Coverage Diff               @@
##             aws-cwa-dev     #180   +/-   ##
==============================================
  Coverage               ?   82.44%           
==============================================
  Files                  ?     1772           
  Lines                  ?   165884           
  Branches               ?        0           
==============================================
  Hits                   ?   136767           
  Misses                 ?    25131           
  Partials               ?     3986           

1. Fix k8s windows unit tests
2. Disable ebs volume unit test for Windows
3. Separate out node Volume unit tests for Windows
@nathalapooja nathalapooja left a comment

{
"Name": "node_cpu_limit"
},
and
{
"Name": "node_cpu_usage_total"
},
metrics are missing units. Why didn't we include the statsd and EMF integ tests for Windows EKS? Have you added the additional unit tests for the extractor helpers that we discussed yesterday?

@KlwntSingh KlwntSingh merged commit a7e0c68 into amazon-contributing:aws-cwa-dev Mar 7, 2024
47 of 67 checks passed
@KlwntSingh
Author

{ "Name": "node_cpu_limit" }, and { "Name": "node_cpu_usage_total" }, metrics are missing units. Why didn't we include the statsd and EMF integ tests for Windows EKS? Have you added the additional unit tests for the extractor helpers that we discussed yesterday?

These metrics are reported in millicores, which CloudWatch does not support as a unit.
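
A metric definition can leave out its unit when the value has no CloudWatch-supported unit, as with the two millicore metrics discussed here. The sketch below shows one way to express that in Go with `omitempty`; the `metricDefinition` type and `encode` helper are hypothetical and are not the exporter's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricDefinition mirrors one entry of the "Metrics" array in the sample
// record. Unit is dropped from the JSON when it is empty.
type metricDefinition struct {
	Name string `json:"Name"`
	Unit string `json:"Unit,omitempty"`
}

// encode serializes a definition, returning an empty string on error.
func encode(m metricDefinition) string {
	b, err := json.Marshal(m)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Millicores have no matching CloudWatch unit, so Unit stays empty.
	fmt.Println(encode(metricDefinition{Name: "node_cpu_limit"}))
	// prints {"Name":"node_cpu_limit"}

	fmt.Println(encode(metricDefinition{Name: "node_memory_limit", Unit: "Bytes"}))
	// prints {"Name":"node_memory_limit","Unit":"Bytes"}
}
```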
