
Deploy taking incredibly long suddenly. #102

Open
digitlninja opened this issue Aug 21, 2020 · 39 comments
Labels
guidance Further information is requested

Comments

@digitlninja

How can I troubleshoot why the deploy is taking incredibly long all of a sudden?

My task definition:


{
  "ipcMode": null,
  "executionRoleArn": "arn:aws:iam::185944984862:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/ecs/identity-backend",
          "awslogs-region": "eu-west-2",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "entryPoint": null,
      "portMappings": [
        {
          "hostPort": 3001,
          "protocol": "tcp",
          "containerPort": 3001
        }
      ],
      "command": null,
      "linuxParameters": null,
      "cpu": 1024,
      "environment": [],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": [
        {
          "valueFrom": "xxx",
          "name": "IoTBackend-Staging"
        }
      ],
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "185944984862.dkr.ecr.eu-west-2.amazonaws.com/identity:4f177c4240adda5b3bf8f5f83f7b766e490e2775",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "identity-backend"
    }
  ],
  "placementConstraints": [],
  "memory": "2048",
  "taskRoleArn": "arn:aws:iam::185944984862:role/ecsTaskExecutionRole",
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "taskDefinitionArn": "arn:aws:ecs:eu-west-2:185944984862:task-definition/identity-backend:3",
  "family": "identity-backend",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.ecr-auth"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.secrets.asm.environment-variables"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.task-iam-role"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-ecr-pull"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.task-eni"
    }
  ],
  "pidMode": null,
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "revision": 3,
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}

@allisaurus added the guidance label Aug 28, 2020
@allisaurus
Contributor

@digitlninja as a first step, you can check the ECS service events in the AWS ECS web console to see whether your tasks are flapping (coming up/down) or whether a specific step (like becoming healthy in the load balancer) is taking a while to stabilize. Or, if you have CloudWatch Logs enabled, you can check the task logs in the ECS or CloudWatch Logs web console to see if specific containers are hanging on any particular start-up step. Also, if your container image recently increased in size or changed location (like a different region), it might be taking longer to pull.

Sorry this is rather generic advice, but I can't think of a specific reason this action would cause deployments to take longer. If it just recently started happening without any changes on your end, we can continue to investigate. LMK if any of the above is helpful.

@allisaurus
Contributor

Closing due to lack of response, but feel free to reopen if you find reason to believe this action is affecting your deployment times.

@JefferyHus

JefferyHus commented Apr 12, 2021

I can confirm this. The ECS task is running & the new updates have been applied, but the GitHub action is still loading. It takes up to 10 minutes to finish and sometimes more.

@SunnGHubX

SunnGHubX commented Apr 19, 2021

Same for me while deploying task definitions to ECS using GitHub Actions: it hung all the way to 25 minutes and I had to stop it. I investigated this issue and verified from the Events tab that the old tasks are still not shutting down while the new ones are trying to deploy, meaning this is a safety feature of green --> blue deployment in case the new deployment is not good.
But in my case the deployment didn't error out, it just hung.
Is there a fix for this?

  • Also verified whether it's because of the image, but my image is a *-alpine one so it's not even that big. I am running a frontend + backend React service.
    Is there a workaround? I can't seem to find answers...

@allisaurus reopened this Apr 23, 2021
@allisaurus
Contributor

@JefferyHus can you tell me more about the status of your service when the GH action is hanging? Has a new revision successfully been deployed and stabilized, or has a rollback occurred? Any output you have from the service's events (tab visible in the ECS console, or via ecs:describe-services) or errors from the GH action run itself would be helpful.

@SunnGHubX what you describe sounds similar to #113 (comment) . Can you let me know if adjusting the deployment preferences fixes your issue?

@JefferyHus

@allisaurus No errors whatsoever coming from either the ECS logs or GH Actions. The GH action just hangs there loading for minutes before moving on to the next & final step; the build, however, is successful. The only thing I can think of is that the GH plugin is waiting for a status & ECS doesn't return one until Fargate switches the old container with the new one.

@mcsrk

mcsrk commented May 26, 2021

I'm facing the same situation using GH Actions. In my case, it happened like this:

I used ECS to deploy a Django image from ECR: a simple Fargate cluster and a simple task definition, NOT using a load balancer. When I was messing around to test my deployment I triggered my actions several times and the average time was 4 min (90% of it consumed by the "Deploy Amazon ECS task definition" step).

Due to my requirements, my implementation must have a static IP, so I did my research and restructured the whole thing, adding a Network Load Balancer that uses an Elastic IP to the ECS service. I was then making multiple commits to test the result, and this time the "Deploy Amazon ECS task definition" step was taking 12 minutes each time I ran the actions. I don't fully understand why adding a Load Balancer would make it take more time.

PS: Apart from the LB, the thing I did differently the second time was creating the ECS service from the "task def." tab in the ECS console, instead of going into a cluster and clicking "Create Service" or "Run task", which was the way I did it on my 1st try.

@damusix

damusix commented Feb 14, 2022

I get timed out at 30 minutes. Everything deploys, but the GH Action runs until failure.

@baranberkay96

Is there any update?

@samlachance

I am experiencing this as well. The container appears to be deployed and functioning but github actions just spins. I also use a load balancer for what it's worth.

@chihiros

My deploys also take a long time.
How can I shorten them?

@amalsgit

It takes around 12 mins for me to update my Fargate service 😢

@aencalado

Same problem, any update?

@grommir

grommir commented Jun 15, 2022

I think it's not an issue with the action, but with ECS.
I took a look at the ECS service console and found that the status "2/1 Tasks running" persists for a long time.

@grommir

grommir commented Jun 16, 2022

The problem is the deregistration_delay parameter, whose default value is 300 seconds.

I set it to 5 seconds and now deploying the task definition takes about 2 minutes.
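
For anyone managing the load balancer as code rather than through the console, a minimal CloudFormation sketch of the same change (the logical name, port, protocol, and VPC id below are placeholders, not values from this thread):

  AppTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 3001
      Protocol: HTTP                      # HTTP for an ALB; use TCP for an NLB
      TargetType: ip                      # Fargate tasks with awsvpc networking register by IP
      VpcId: vpc-0123456789abcdef0        # placeholder VPC id
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds   # default is 300
          Value: "5"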

@aencalado

I found that if you disable the task stability check then it takes only a few seconds/minutes to deploy

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@de0132cf8cdedb79975c6d42b77eb7ea193cf28e
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: false # <--- default is true

@ghost

ghost commented Jun 22, 2022

The same task, unchanged, will suddenly take 20-30 minutes, or fail altogether. Seems like an issue worth checking out.

@Deep1144

Deep1144 commented Jul 1, 2022

Facing the same issue, is there any update?

@naarkhoo

Same issue - the same task from two months ago now takes forever. I thought it was about RAM/CPU, but I am using Elastic Container Service...

@0Lucifer0

0Lucifer0 commented Oct 4, 2022

Same for me. What surprised me is that a lot of ECS tasks fail to start and are killed until one finally starts successfully. It is likely due to AWS ECS and not this action.

STOPPED (ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post "https://api.ecr.us-west-2.amazonaws.com/": dial tcp 52.119.174.83:443: i/o timeout)

@willisplummer

I had the same experience as @0Lucifer0 — would be nice if the github action could fail loudly or provide more insight into why the task isn't getting deployed correctly

@nosachamos

Also hitting the same problem. It seems random - sometimes it finishes in 2 mins, sometimes in 20. Deploying using aws-actions/amazon-ecs-deploy-task-definition@v1 like everyone else here.

This is a major, major pain and eats up billable minutes in github actions. Please take a look, AWS.

@DLoBoston

I can confirm that the hanging is during the stability check. Not a fan of turning this off, but in order to save on billable minutes I am. You can use other automated health checks to check on the service and keep this action limited to build and deploy.

Excerpt of debug info: [screenshot]

@aencalado's solution of turning off wait-for-service-stability worked for me.

@ddaniel27

The solution given by @grommir worked for me.

@0Lucifer0

@ddaniel27 can you give more details on how to achieve this? How/where did you change the deregistration_delay property?

@ddaniel27

@0Lucifer0 If you are in the AWS console, go to EC2 > Target Groups > your target group > Attributes > Edit. There you just need to change the 300 to something like 5 seconds and that's all. I don't know how this might affect other ECS behavior, but for this issue, it works.

@0Lucifer0

I guess that won't work for me, as for some reason there are no target groups 😢

@RazGvili

I'm having the same issue with a deregistration_delay of 10 sec.

@SmashingQuasar

I can confirm this is still an issue. When a deployment is failing, it seems AWS does not respond with anything, which leads to an extremely long deployment that ultimately ends in a timeout.

@nosachamos

nosachamos commented May 9, 2023 via email

@sombriks

Yes, the issue is still here: deployment failed, action hung.

@SebastianDix

Guys, it is not just deploying the task definition, it is also waiting for service stability. Service stability means the number of required tasks == the number of running tasks, with health checks passing. If it takes 30 minutes (the default) then it's because it was waiting 30 minutes for health checks to pass and they didn't. You can set the timeout to a value lower than 30 minutes.
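
If you want to keep the stability check but cap how long it can block the job, the action also has a wait-for-minutes input (as far as I know; check the action's README for the current input name). A minimal sketch, reusing the step shown earlier in this thread:

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 10   # give up after 10 minutes instead of the 30-minute default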

@erwan-joly

Guys, it is not just deploying the task definition, it is also waiting for service stability. Service stability means the number of required tasks == the number of running tasks, with health checks passing. If it takes 30 minutes (the default) then it's because it was waiting 30 minutes for health checks to pass and they didn't. You can set the timeout to a value lower than 30 minutes.

The issue is not really that it times out after 30 min; it is that it takes a long time (25 min is quite long even if it does not time out) to successfully start in some cases. Sadly the error seems more like an underlying issue with AWS than with this specific GitHub action.

#102 (comment)

@rahulbhanushali

Anybody figure out a solution for this? We hit this every once in a while and it is very annoying.

We have a staging environment where we have set desired count to 1. When deployment gets stuck, we face service downtime.

I can see the task logs and see the service has started, but for some reason the ECS deployment still shows deploying and the ELB won't route to the new task instance.

@carolzbnbr

Same here :(

@JGSweets

JGSweets commented Jun 3, 2024

I looked into the configuration of the wait timer and my thoughts are as follows:

  • We can only set wait-for-minutes, which equates to maxWaitTime
  • In the code itself, minDelay is forcibly set to 15 seconds via const WAIT_DEFAULT_DELAY_SEC = 15;
  • Another option, maxDelay, is not exposed by this code even though it is available in the waiter settings
  • The waiter itself has an exponential backoff (i.e. the delay grows with each failed check until it reaches maxDelay)
  • maxDelay defaults to 120 seconds

As a result, stability is checked early and the delay will almost always climb to the 120-second maximum, since early attempts will fail.

That means that once stability is reached (depending on your AWS settings for the target group / ECS health check delays, etc.), an additional delay of up to 120 seconds seems to be tacked on top of the stabilization time.


Solutions to fix:

  • Update the gh-action to make maxDelay and minDelay available as options for stability checking (see the sketch below)
    • e.g. wait-for-min-delay-seconds
    • e.g. wait-for-max-delay-seconds
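
To make that proposal concrete, here is a sketch of how such inputs might look in a workflow if they were ever added (note that wait-for-min-delay-seconds and wait-for-max-delay-seconds do not exist in the action today):

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-min-delay-seconds: 5    # proposed: start polling sooner than the fixed 15 s
          wait-for-max-delay-seconds: 30   # proposed: cap the exponential backoff below 120 s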

@safwanshamsir99


I faced the same issue. Usually, it takes about 4-5 mins.

@Yangeok

Yangeok commented Aug 17, 2024

Same issue with a Python 3.11 Poetry environment.

@Jepkosgei3

I found that if you disable the task stability check then it takes only a few seconds/minutes to deploy

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@de0132cf8cdedb79975c6d42b77eb7ea193cf28e
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: false # <--- default is true

this solved mine
