
cgroups v1: failed to set the cpuset cgroup for container: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: invalid argument #19418

Open

juananinca opened this issue Dec 11, 2023 · 5 comments

@juananinca
Nomad version

Nomad v1.5.6

Operating system and Environment details

OracleLinux 8.8

Issue

One specific job sometimes shows the following error when Nomad tries to restart its container:
failed to set the cpuset cgroup for container: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: invalid argument

Reproduction steps

I cannot provide reliable reproduction steps since the error only appears intermittently, but here are the Nomad config and the job definition:

Nomad config:

region = "my-region"
name = "my-hostname"
log_level = "WARN"
leave_on_interrupt = true
leave_on_terminate = true
data_dir = "/var/nomad/data"
bind_addr = "0.0.0.0"
disable_update_check = true
limits {
        https_handshake_timeout   = "10s"
        http_max_conns_per_client = 400
        rpc_handshake_timeout     = "10s"
        rpc_max_conns_per_client  = 400
}
advertise {
    http = "my-ip:4646"
    rpc = "my-ip:4647"
    serf = "my-ip:4648"
}
tls {
  http = true
  rpc  = true
  cert_file = "/opt/nomad/ssl/server.pem"
  key_file = "/opt/nomad/ssl/server-key.pem"
  ca_file = "/opt/nomad/ssl/nomad-ca.pem"
  verify_server_hostname = true
  verify_https_client    = true

}
log_file = "/var/log/nomad/"
log_json = true
log_rotate_max_files = 7
consul {
    address = "127.0.0.1:8500"
    server_service_name = "nomad-server"
    client_service_name = "nomad-client"
    auto_advertise = true
    server_auto_join = true
    client_auto_join = true

    ssl = true
    ca_file = "/opt/consul/ssl/consul-ca.pem"
    cert_file = "/opt/consul/ssl/server.pem"
    key_file = "/opt/consul/ssl/server-key.pem"
    token = "my-token"

}
acl {
  enabled = true
}

vault {
    enabled = true
    address = "https://vault.my-org.com:8200/"
    ca_file = "/opt/vault/ssl/vault-ca.pem"
    cert_file = "/opt/vault/ssl/client-vault.pem"
    key_file = "/opt/vault/ssl/client-vault-key.pem"
}
telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
}

datacenter = "my-dc"

client {
    enabled = true
    network_interface = "ens192"
    cni_path = "/opt/cni/bin"
    cni_config_dir = "/etc/cni/net.d/"
}

plugin "docker" {
  config {
    auth {
      config = "/etc/docker/config.json"
    }
    allow_privileged = true
    volumes {
      enabled = true
    }
  }
}

Expected Result

The job is able to restart successfully.

Actual Result

The job fails with "failed to set the cpuset cgroup for container: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: invalid argument" when Nomad restarts it.

Job file (if appropriate)

job "my-job" {
  region      = "my-region"
  datacenters = ["my-dc"]
  type        = "service"
  priority    = 50

  update {
    stagger      = "10s"
    max_parallel = 1
  }

  group "my-job" {
    count = 1
    network {
      port "http" {
        to = 3000
      }
    }

    update {
      max_parallel     = 1
      min_healthy_time = "30s"
      healthy_deadline = "10m"
      progress_deadline = "11m"
      auto_revert      = true
    }

    restart {
      attempts = 10
      interval = "5m"
      delay    = "25s"
      mode     = "delay"
    }

    task "my-job" {
      driver = "docker"

      vault{
        policies = ["policy-ro-team", "policy-ro-commons"]
      }

      env {
        ENV = "PRO-V2"
        DEVELOPER_TEAM = "team"
        CONSUL_URL = "https://${attr.unique.hostname}:8500"   
        CONSUL_KV = "/v1/kv/team/my-job.json"
        CONSUL_CA = "/opt/consul/ssl/consul-ca.pem"
        CONSUL_CLI = "/opt/consul/ssl/cli.pem"
        CONSUL_KEY = "/opt/consul/ssl/cli-key.pem"
        VAULT_URL = "https://vault.my-org.com:8200"
        VAULT_CA = "/opt/vault/ssl/vault-ca.pem"
        VAULT_CLI = "/opt/vault/ssl/client-vault.pem"
        VAULT_KEY = "/opt/vault/ssl/client-vault-key.pem"
        VAULT_KV = "/v1/secret/team/my-job.json"
      }

      config {
        extra_hosts = []
        
        security_opt = [
          "no-new-privileges"
        ]
        pids_limit = 200


        image =  "registry.my-org.com/my-image:1.1.0-RELEASE"
        ports = ["http"]

        force_pull = false


        volumes = [
          "/var/log/my-org:/tmp/my-org/logs",
          "/opt/consul/ssl:/opt/consul/ssl:ro", 
          "/opt/vault/ssl:/opt/vault/ssl:ro"
        ]

        labels {
          image = "my-job:1.1.0-RELEASE"
          service = "my-job"
          dc = "my-dc"
        }
      }

      service {
        name = "my-job"
        port = "http"

        check {
          name     = "my-job alive"
          type     = "http"
          port     = "http"
          path     = "/health"
          interval = "10s"
          timeout  = "2s"
          
          check_restart {
            limit = 10
            grace = "120s"
          }

        }

      }

      resources {
        memory = 256
      }

      logs {
        max_files     = 1
        max_file_size = 15
      }

      kill_timeout = "20s"
    }
  }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

{"@level":"error","@message":"running driver failed","@module":"client.alloc_runner.task_runner","@timestamp":"2023-12-10T07:31:36.442363+01:00","alloc_id":"18f4cbc7-b698-0e7f-cb7f-d8cbfb1246c5","error":"failed to set the cpuset cgroup for container: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: invalid argument","task":"my-job"}
{"@level":"warn","@message":"timed out waiting for read-side of process output pipe to close","@module":"logmon","@timestamp":"2023-12-10T07:31:40.447551+01:00","alloc_id":"18f4cbc7-b698-0e7f-cb7f-d8cbfb1246c5","task":"my-job","timestamp":"2023-12-10T07:31:40.447+0100"}
{"@level":"warn","@message":"timed out waiting for read-side of process output pipe to close","@module":"logmon","@timestamp":"2023-12-10T07:31:40.447683+01:00","alloc_id":"18f4cbc7-b698-0e7f-cb7f-d8cbfb1246c5","task":"my-job","timestamp":"2023-12-10T07:31:40.447+0100"}

There is an open related issue where a user reported something similar (#17890 (comment)), but I'm not sure if it is the same case.

@tgross tgross added the theme/cgroups cgroups issues label Dec 11, 2023
@tgross
Member

tgross commented Dec 14, 2023

Hi @juananinca! The "invalid argument" error suggests that we're trying to write to a now-invalid PID. While restarting, there's the existing task that's been stopped, and the new task being started (both in the same allocation). Can you share the task events from nomad alloc status? That might help narrow down what's happening here.

Also, can you provide the client logs during the fingerprint process of startup? Specifically looking for the logs around fingerprint.cpu and/or fingerprint.cgroup.

And can you verify which cgroups version you've got here?
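
As an aside (not part of the original thread), one quick way to answer that question is to inspect the filesystem type mounted at /sys/fs/cgroup:

```shell
# The filesystem type at the cgroup mount point reveals the hierarchy in use:
#   "cgroup2fs" -> unified hierarchy (cgroups v2)
#   "tmpfs"     -> legacy split hierarchy (cgroups v1)
stat -fc %T /sys/fs/cgroup
```

On a cgroups v1 host the per-controller trees (cpuset, memory, ...) are mounted as separate directories under /sys/fs/cgroup, which is why the path in the error message contains /sys/fs/cgroup/cpuset/.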

@tgross tgross self-assigned this Dec 14, 2023
@the-nando
Contributor

the-nando commented Jan 10, 2024

Hi @tgross 😄

I'm running into the same error on one of our clusters, with a specific Docker container, and I'm having a hard time narrowing down the cause of the problem.
It doesn't happen all the time, and it also happens on the first alloc start of a brand-new job, i.e. not on a restart.
I seem to be able to reproduce it only with cgroups v1, not with v2.

Fingerprint on startup:

2024-01-10T07:24:10.564Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
2024-01-10T07:24:10.564Z [DEBUG] client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config
2024-01-10T07:24:10.564Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup initial_period=15s
2024-01-10T07:24:10.613Z [TRACE] consul.sync: Consul supports TLSSkipVerify
2024-01-10T07:24:10.613Z [TRACE] consul.sync: able to contact Consul
2024-01-10T07:24:10.613Z [TRACE] consul.sync: execute sync: reason=periodic
2024-01-10T07:24:10.632Z [INFO]  client.fingerprint_mgr.consul: consul agent is available
2024-01-10T07:24:10.632Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul initial_period=15s
2024-01-10T07:24:10.633Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU model: name="Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz"
2024-01-10T07:24:10.633Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU frequency: mhz=2300
2024-01-10T07:24:10.633Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU core count: EXTRA_VALUE_AT_END=4
2024-01-10T07:24:10.633Z [DEBUG] client.fingerprint_mgr.cpu: client configuration reserves these cores for node: cores=[]
2024-01-10T07:24:10.633Z [DEBUG] client.fingerprint_mgr.cpu: set of reservable cores available for tasks: cores=[0, 1, 2, 3]

Logs:

2024-01-10T07:50:34.332Z [TRACE] client.driver_mgr.docker: binding volumes: driver=docker task_name=nando-test volumes=["/data/nomad/alloc/87b70e3e-b9ff-5083-0dd1-a138fda76b80/alloc:/alloc", "/data/nomad/alloc/87b70e3e-b9ff-5083-0dd1-a138fda76b80/nando-test/local:/local", "/data/nomad/alloc/87b70e3e-b9ff-5083-0dd1-a138fda76b80/nando-test/secrets:/secrets"]
2024-01-10T07:50:34.332Z [TRACE] client.driver_mgr.docker: no docker log driver provided, defaulting to plugin config: driver=docker task_name=nando-test
2024-01-10T07:50:34.332Z [DEBUG] client.driver_mgr.docker: configured resources: driver=docker task_name=nando-test memory=1073741824 memory_reservation=0 cpu_shares=100 cpu_quota=0 cpu_period=0
2024-01-10T07:50:34.332Z [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=nando-test binds="[]string{\"/data/nomad/alloc/87b70e3e-b9ff-5083-0dd1-a138fda76b80/alloc:/alloc\", \"/data/nomad/alloc/87b70e3e-b9ff-5083-0dd1-a138fda76b80/nando-test/local:/local\", \"/data/nomad/alloc/87b70e3e-b9ff-5083-0dd1-a138fda76b80/nando-test/secrets:/secrets\"}"
2024-01-10T07:50:34.332Z [DEBUG] client.driver_mgr.docker: networking mode not specified; using default: driver=docker task_name=nando-test
2024-01-10T07:50:34.332Z [DEBUG] client.driver_mgr.docker: applied labels on the container: driver=docker task_name=nando-test labels="map[com.hashicorp.nomad.alloc_id:87b70e3e-b9ff-5083-0dd1-a138fda76b80 com.hashicorp.nomad.job_name:nando-test com.hashicorp.nomad.namespace:cie-orch com.hashicorp.nomad.node_name:client-default-10-98-8-77 com.hashicorp.nomad.task_group_name:nando-test com.hashicorp.nomad.task_name:nando-test]"
2024-01-10T07:50:34.332Z [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=nando-test container_name=nando-test-87b70e3e-b9ff-5083-0dd1-a138fda76b80
2024-01-10T07:50:34.355Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=42230c249da70e03b3c546fa2f996a464acc1ed22417caffa1554f57967cab20
2024-01-10T07:50:34.878Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=42230c249da70e03b3c546fa2f996a464acc1ed22417caffa1554f57967cab20
2024-01-10T07:50:34.974Z [DEBUG] client: updated allocations: index=928139137 total=11 pulled=0 filtered=11
2024-01-10T07:50:34.974Z [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=11
2024-01-10T07:50:34.974Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=11 errors=0
2024-01-10T07:50:34.975Z [TRACE] client: next heartbeat: period=19.669962852s
2024-01-10T07:50:35.101Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=87b70e3e-b9ff-5083-0dd1-a138fda76b80 task=nando-test type="Driver Failure" msg="failed to set the cpuset cgroup for container: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: invalid argument" failed=false
2024-01-10T07:50:35.102Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=87b70e3e-b9ff-5083-0dd1-a138fda76b80 task=nando-test error="failed to set the cpuset cgroup for container: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: invalid argument"
2024-01-10T07:50:35.102Z [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=87b70e3e-b9ff-5083-0dd1-a138fda76b80 task=nando-test reason="Error was unrecoverable"
2024-01-10T07:50:35.102Z [TRACE] client.alloc_runner.task_runner: setting task state: alloc_id=87b70e3e-b9ff-5083-0dd1-a138fda76b80 task=nando-test state=dead

Alloc events from one of the tests:

Recent Events:
Time                       Type            Description
2024-01-10T09:06:49+01:00  Not Restarting  Error was unrecoverable
2024-01-10T09:06:49+01:00  Driver Failure  failed to set the cpuset cgroup for container: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: invalid argument
2024-01-10T09:06:48+01:00  Task Setup      Building Task Directory
2024-01-10T09:06:48+01:00  Received        Task received by client

I'm not sure why, but if I wrap the container's command (yarn in this case) in a script and call that instead, I'm unable to reproduce the issue. It looks like a timing issue, but that's strange because, to my understanding, the PID that is added to cgroup.procs should only be known to Nomad after its creation...

config {
  image = "<img>:<ver>"
  command = "/local/run.sh"
}

template {
  destination = "/local/run.sh"
  perms       = "555"
  data        = <<-EOF
    #!/usr/bin/env sh
    yarn run next start
    EOF
}

@tgross
Member

tgross commented Jan 11, 2024

Thanks @the-nando. And you're running on a 1.5.x version of Nomad as well? The cgroups code got a lot of reworking as part of 1.7, so I want to make sure we're chasing the same bug here. If so, can you reproduce the problem on 1.7.x?

Not sure why it is the case but if I wrap the container's command (yarn in this case) into a script and call that, I'm unable to reproduce the issue. It looks like a timing issue but it's strange as, to my understanding, the PID that is added to cgroup.procs should only be known by Nomad after its creation...

What's especially strange about that is that your script is still running because it doesn't run exec yarn, so that PID still exists!
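
To illustrate that point (a hedged aside, not from the thread): exec replaces the shell's process image in place, keeping the same PID, whereas without exec the wrapper shell stays alive as a separate process and the command runs as its child:

```shell
# Demo: a wrapper that execs its command keeps the same PID.
# Both lines printed by the script show the same number.
cat > /tmp/exec_demo.sh <<'EOF'
#!/bin/sh
echo "before exec: $$"
exec sh -c 'echo "after exec:  $$"'
EOF
sh /tmp/exec_demo.sh
```

In the wrapper above, yarn runs as a child of the shell, so the shell's PID (the one written into cgroup.procs) keeps existing either way, which is what makes the workaround's effect surprising.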

@the-nando
Contributor

Sorry, I forgot to mention that this cluster is on 1.6.2-ent. I'm afraid I won't be able to test with 1.7 in the short term, as I don't have an already-upgraded cluster with the same setup at hand.
I can confirm, though, that so far switching to cgroups v2 fixes the issue.
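
For anyone wanting to try the same workaround on an OracleLinux/RHEL 8 host, the machine can be switched to the unified cgroups v2 hierarchy via a kernel boot argument (a sketch using grubby and the standard systemd flag; verify against your distro's documentation before applying):

```shell
# Switch the host to the unified cgroup v2 hierarchy (requires a reboot):
sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
sudo reboot
```

After the reboot, /sys/fs/cgroup should be mounted as cgroup2fs instead of the per-controller v1 mounts.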

@tgross tgross changed the title failed to set the cpuset cgroup for container: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: invalid argument cgroups v1: failed to set the cpuset cgroup for container: write /sys/fs/cgroup/cpuset/nomad/shared/cgroup.procs: invalid argument Oct 3, 2024
@tgross tgross added the hcc/jira label Oct 3, 2024
@tgross tgross removed their assignment Oct 3, 2024
@tgross
Copy link
Member

tgross commented Oct 3, 2024

Doing a bit of issue triage cleanup. I'm going to move this onto our internal roadmapping board for follow-up, but:

  • This is for cgroups v1, and
  • This is only reported against a version of Nomad that will fall out of support in a few weeks, and we have at least some reason to believe the 1.7.x codebase has solved this set of problems.

If we get updated info that this is reproduced on 1.7.x+, we'll be sure to re-prioritize.

Projects: Needs Roadmapping

3 participants