
Deploy taking incredibly long suddenly. #102

Open
digitlninja opened this issue Aug 21, 2020 · 39 comments
Labels
guidance Further information is requested

Comments

@digitlninja

How can I troubleshoot why the deploy is taking incredibly long all of a sudden?

My task definition:


{
  "ipcMode": null,
  "executionRoleArn": "arn:aws:iam::185944984862:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/ecs/identity-backend",
          "awslogs-region": "eu-west-2",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "entryPoint": null,
      "portMappings": [
        {
          "hostPort": 3001,
          "protocol": "tcp",
          "containerPort": 3001
        }
      ],
      "command": null,
      "linuxParameters": null,
      "cpu": 1024,
      "environment": [],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": [
        {
          "valueFrom": "xxx",
          "name": "IoTBackend-Staging"
        }
      ],
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "185944984862.dkr.ecr.eu-west-2.amazonaws.com/identity:4f177c4240adda5b3bf8f5f83f7b766e490e2775",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "identity-backend"
    }
  ],
  "placementConstraints": [],
  "memory": "2048",
  "taskRoleArn": "arn:aws:iam::185944984862:role/ecsTaskExecutionRole",
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "taskDefinitionArn": "arn:aws:ecs:eu-west-2:185944984862:task-definition/identity-backend:3",
  "family": "identity-backend",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.ecr-auth"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.secrets.asm.environment-variables"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.task-iam-role"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-ecr-pull"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.task-eni"
    }
  ],
  "pidMode": null,
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "revision": 3,
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}

@allisaurus added the guidance label Aug 28, 2020
@allisaurus
Contributor

@digitlninja as a first step, you can check the ECS service events in the AWS ECS web console to see whether your tasks are flapping (coming up/down) or whether a specific step (like becoming healthy in the load balancer) is taking a while to stabilize. Or, if you have CloudWatch Logs enabled, you can check the task logs in the ECS or CloudWatch Logs web console to see if specific containers are hanging on any particular start-up step. Also, if your container image recently increased in size or changed location (like a different region), it might be taking longer to pull.

Sorry this is rather generic advice, but I can't think of a specific reason this action would cause deployments to take longer. If it just recently started happening without any changes on your end, we can continue to investigate. LMK if any of the above is helpful.

@allisaurus
Contributor

Closing due to lack of response, but feel free to reopen if you find reason to believe this action is affecting your deployment times.

@JefferyHus

JefferyHus commented Apr 12, 2021

I can confirm this. The ECS task is running & the new updates have been applied, but the GitHub action is still loading. It takes up to 10 minutes to finish and sometimes more.

@SunnGHubX

SunnGHubX commented Apr 19, 2021

Same for me while deploying task definitions to ECS using GitHub Actions: it hung all the way to 25 minutes and I had to stop it. I investigated this issue and verified from the Events tab that the old tasks are still not shutting down while the new ones are trying to deploy, meaning this is a safety feature of green --> blue deployment in case the new deployment is not good.
But in my case the deployment didn't error out, it just hung.
Is there a fix for this?

  • Also verified whether it's because of the image, but my image is a *-alpine one so it's not even that big. I am running a frontend + backend React service.
    Is there a workaround? I can't seem to find answers...

@allisaurus reopened this Apr 23, 2021
@allisaurus
Contributor

@JefferyHus can you tell me more about the status of your service when the GH action is hanging? Has a new revision successfully been deployed and stabilized, or has a rollback occurred? Any output you have from the service's events (tab visible in the ECS console, or via ecs:describe-services) or errors from the GH action run itself would be helpful.

@SunnGHubX what you describe sounds similar to #113 (comment) . Can you let me know if adjusting the deployment preferences fixes your issue?

@JefferyHus

@allisaurus No errors whatsoever coming from either the ECS logs or GH Actions. The GH action just hangs there loading for minutes before moving on to the next & final step; the build, however, is successful. The only thing I can think of is that the GH plugin is waiting for a status & ECS doesn't return one until Fargate switches the old container with the new one.

@mcsrk

mcsrk commented May 26, 2021

I'm facing the same situation using GH Actions. In my case, it happened like this:

I used ECS to deploy a Django image from ECR: a simple Fargate cluster and a simple task definition, NOT using a load balancer. When I was messing around to test my deployment I triggered my actions several times and the average time was 4 min (90% of it consumed by the "Deploy Amazon ECS task definition" step).

Due to my requirements, my implementation must have a static IP, so I did my research and restructured the whole thing, adding a Network Load Balancer that uses an Elastic IP to the ECS service. I was then making multiple commits to test the result, and this time the "Deploy Amazon ECS task definition" step was taking 12 minutes each time I ran the actions. I don't fully understand why adding a Load Balancer would make it take more time.

PS: Apart from the LB, the thing I did differently the second time was creating the ECS service from the "task def." tab in the ECS console, instead of going into a cluster and clicking "Create Service" or "Run task", which was the way I did it on my 1st try.

@damusix

damusix commented Feb 14, 2022

I get timed out at 30 minutes. Everything deploys, but the GH Action runs until failure.

@baranberkay96

Is there any update?

@samlachance

I am experiencing this as well. The container appears to be deployed and functioning but github actions just spins. I also use a load balancer for what it's worth.

@chihiros

My deploys also take a long time.
How can I shorten them?

@amalsgit

It takes around 12 mins for me to update my Fargate service 😢

@aencalado

Same problem, any update?

@grommir

grommir commented Jun 15, 2022

I think it's not an issue with the action, but with ECS.
I took a look at the ECS service console and found that the status "2/1 Tasks running" persists for a long time.

@grommir

grommir commented Jun 16, 2022

The problem is the deregistration_delay parameter, whose default value is 300 seconds.

I set it to 5 seconds and now deploying the task definition takes about 2 minutes.
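
For anyone managing the load balancer as code rather than through the console, a minimal CloudFormation sketch of the same change (the logical name, port, protocol, and VPC id below are placeholders, not values from this thread):

  AppTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 3001
      Protocol: HTTP                      # HTTP for an ALB; use TCP for an NLB
      TargetType: ip                      # Fargate tasks with awsvpc networking register by IP
      VpcId: vpc-0123456789abcdef0        # placeholder VPC id
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds   # default is 300
          Value: "5"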

@aencalado

I found that if you disable the task stability check then it takes only a few seconds/minutes to deploy

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@de0132cf8cdedb79975c6d42b77eb7ea193cf28e
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: false # <--- default is true

@ghost

ghost commented Jun 22, 2022

The same task, unchanged, will suddenly take 20-30 minutes, or fail altogether. Seems like an issue worth checking out.

@Deep1144

Deep1144 commented Jul 1, 2022

Facing the same issue, is there any update?

@naarkhoo

Same issue - the same task from two months ago now takes forever. I thought it was about RAM/CPU, but I am using Elastic Container Service...

@0Lucifer0

0Lucifer0 commented Oct 4, 2022

Same for me. What surprised me is that a lot of ECS tasks fail to start and are killed until one finally starts successfully. It is likely due to AWS ECS and not this action.

STOPPED (ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post "https://api.ecr.us-west-2.amazonaws.com/": dial tcp 52.119.174.83:443: i/o timeout)

@willisplummer

I had the same experience as @0Lucifer0 — would be nice if the github action could fail loudly or provide more insight into why the task isn't getting deployed correctly

@nosachamos

Also hitting the same problem. It seems random - sometimes it finishes in 2 mins, sometimes in 20. Deploying using aws-actions/amazon-ecs-deploy-task-definition@v1 like everyone else here.

This is a major, major pain and eats up billable minutes in github actions. Please take a look, AWS.

@DLoBoston

I can confirm that the hanging is during the stability check. Not a fan of turning this off, but in order to save on billable minutes I am. You can use other automated health checks to check on the service and keep this action limited to build and deploy.

Excerpt of debug info: [screenshot]

@aencalado's solution of turning off wait-for-service-stability worked for me.

@ddaniel27

The solution given by @grommir worked for me.

@0Lucifer0

@ddaniel27 can you give more details on how to achieve this? How/where did you change the deregistration_delay property?

@ddaniel27

@0Lucifer0 If you are in the AWS console, go to EC2 > Target Groups > your target group > Attributes > Edit. There you just need to change the 300 to something like 5 seconds and that's all. I don't know how this might affect other ECS behavior, but for this issue, it works.

@0Lucifer0

I guess that won't work for me, as for some reason there are no target groups 😢

@RazGvili

I'm having the same issue with a deregistration_delay of 10 sec.

@SmashingQuasar

I can confirm this is still an issue. When a deployment is failing, it seems AWS does not respond with anything, which leads to an extremely long deployment that ultimately ends in a timeout.

@nosachamos

nosachamos commented May 9, 2023 via email

@sombriks

Yes, the issue is still here: deployment failed, action hung.

@SebastianDix

Guys, it is not just deploying the task definition, it is also waiting for service stability. Service stability means the number of required tasks == the number of running tasks, with health checks passing. If it takes 30 minutes (the default) then it's because it was waiting 30 minutes for health checks to pass and they didn't. You can set the timeout to a value lower than 30 minutes.
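
If you want to keep the stability check but cap how long it can block the job, the action also has a wait-for-minutes input (as far as I know; check the action's README for the current input name). A minimal sketch, reusing the step shown earlier in this thread:

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 10   # give up after 10 minutes instead of the 30-minute default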

@erwan-joly

Guys, it is not just deploying the task definition, it is also waiting for service stability. Service stability means the number of required tasks == the number of running tasks, with health checks passing. If it takes 30 minutes (the default) then it's because it was waiting 30 minutes for health checks to pass and they didn't. You can set the timeout to a value lower than 30 minutes.

The issue is not really that it times out after 30 min; it is that it takes a long time (25 min is quite long even if it does not time out) to successfully start in some cases. Sadly the error seems more like an underlying issue with AWS than with this specific GitHub action.

#102 (comment)

@rahulbhanushali

Anybody figure out a solution for this? We hit this every once in a while and it is very annoying.

We have a staging environment where we have set desired count to 1. When deployment gets stuck, we face service downtime.

I can see the task logs and see the service has started, but for some reason the ECS deployment still shows deploying and the ELB won't route to the new task instance.

@carolzbnbr

Same here :(

@JGSweets

JGSweets commented Jun 3, 2024

I looked into the configuration of the wait timer and my thoughts are as follows:

  • We can only set wait-for-minutes, which equates to maxWaitTime
  • In the code itself, minDelay is forcibly set to 15 seconds via const WAIT_DEFAULT_DELAY_SEC = 15;
  • Another option, maxDelay, is not exposed by this code even though it is available in the waiter settings
  • The waiter itself has an exponential backoff (i.e. the delay grows with each failed check until it reaches maxDelay)
  • maxDelay defaults to 120 seconds

As a result, stability is checked early and the delay will almost always climb to the 120-second maximum, since early attempts will fail.

That means that once stability is reached (depending on your AWS settings for the target group / ECS health check delays, etc.), an additional delay of up to 120 seconds seems to be tacked on top of the stabilization time.


Solutions to fix:

  • Update the gh-action to make maxDelay and minDelay available as options for stability checking (see the sketch below)
    • e.g. wait-for-min-delay-seconds
    • e.g. wait-for-max-delay-seconds
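
To make that proposal concrete, here is a sketch of how such inputs might look in a workflow if they were ever added (note that wait-for-min-delay-seconds and wait-for-max-delay-seconds do not exist in the action today):

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-min-delay-seconds: 5    # proposed: start polling sooner than the fixed 15 s
          wait-for-max-delay-seconds: 30   # proposed: cap the exponential backoff below 120 s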

@safwanshamsir99


I faced the same issue. Usually, it takes about 4-5 mins.

@Yangeok

Yangeok commented Aug 17, 2024

Same issue with a Python 3.11 Poetry environment.

@Jepkosgei3

I found that if you disable the task stability check then it takes only a few seconds/minutes to deploy

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@de0132cf8cdedb79975c6d42b77eb7ea193cf28e
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: false # <--- default is true

this solved mine
