Replies: 15 comments
-
cc: @karakanb
-
I am unable to edit the original submission, therefore here's my current list of provider versions after upgrading to Airflow v2.3.2:
-
I keep experiencing the same issue, although I have limited logs since I don't have a log collector yet; still, I see no harm in sharing whatever I can find here in case someone can help. This will be a bit long, so apologies in advance.
Task 1 - Example
Here's a task that has experienced the same issue recently:
Unfortunately, I don't have scheduler logs from the start of the task execution because the pods have restarted and those logs are gone, but my logs start somewhere in the middle and I see this:
As you can see in the first few lines, the same task is one of the tasks that is already scheduled, and that is the only reason it is not being scheduled again.
As for my worker logs, I don't have them from the beginning either, but I do have them from when the first attempt ends, at 02:53, and check this out:
Right at the time the first attempt ended, there is a log in the worker showing that an instance of the same task is being picked up. The curious thing is that, according to my task logs for the 2nd attempt, the task didn't start until 04:00 UTC:
Facts
Hypothesis
Here's my hypothesis about what might have happened:
I hope these are helpful; I'll try to collect more logs.
-
Huh, here's something weird I have observed: I have 3 tasks that failed with the same issue, and they all failed within the same 2 seconds that day. Here are the final logs from all 3 tasks:
task1 - which is the task I have shared the logs for above:
task2:
task3:
Just checked my cluster resources, and I don't see anything abnormal. The timezone in the image is BST, so it corresponds to when the tasks were killed.
-
Strange. Any other logs around? I guess some more evidence needs to be gathered - maybe there are some logs elsewhere in your cluster/deployment configuration that can be correlated - otherwise it's really difficult to guess where it came from.
-
Unfortunately, I had to bring the big pipeline back to run on Kubernetes by setting its queue to the kubernetes queue.
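For anyone reading along, here's a minimal sketch of how that queue-based routing works with CeleryKubernetesExecutor (hypothetical DAG and task names, not from this thread): tasks whose queue matches the configured kubernetes_queue (by default "kubernetes", under [celery_kubernetes_executor] in airflow.cfg) run via the embedded KubernetesExecutor, and everything else goes to the Celery workers.

```python
# Minimal sketch (hypothetical DAG and task names) of queue-based routing
# with CeleryKubernetesExecutor: a task whose queue matches the configured
# kubernetes_queue runs via the KubernetesExecutor, the rest go to Celery.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="queue_routing_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Routed to the KubernetesExecutor, assuming the default
    # [celery_kubernetes_executor] kubernetes_queue = "kubernetes".
    long_running_on_kubernetes = BashOperator(
        task_id="long_running_on_kubernetes",
        bash_command="sleep 3600",
        queue="kubernetes",
    )

    # Default queue, so this one is picked up by a Celery worker.
    long_running_on_celery = BashOperator(
        task_id="long_running_on_celery",
        bash_command="sleep 3600",
    )
```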
-
Seen this issue with CeleryKubernetesExecutor when restarting the Celery worker pod. After a restart, some long-running task instances can be set to the wrong state. This should be reproducible and I'm investigating it.
-
Would be great to get more evidence of it - especially to see if it is reproducible in the latest released Airflow!
-
Maybe the cause is that you are running two schedulers at a time? I've read somewhere that tasks could be triggered twice.
-
I am running two schedulers, that is correct. If there's an issue with the HA setup of the scheduler, then it needs to be fixed in the scheduler, I believe, rather than by introducing an additional controller. The weird thing that I still cannot explain is why this is happening only with the tasks that run on Celery, and not with those that run on Kubernetes. In both cases I am using CeleryKubernetesExecutor, and only the queue differs.
Unfortunately, I had to revert things back to the kubernetes queue for now.
-
Also seeing similar issues with CeleryKubernetesExecutor on 2.3.1. Tasks on the Celery worker randomly just abort halfway through with the following message in the log:
Only running one scheduler.
-
Is there any news on this issue? I am also seeing similar behavior with a long-running task. My DAG includes an SSHOperator that executes a long-running process on a Google Cloud Compute Engine instance. The task keeps on running, even though the underlying process gets killed for some reason along the way. The weird thing is that I do not see any log message that states why the process was killed - it just stops logging.
General Information
Task Logs
Here are the logs of the latest execution:
Scheduler Logs
The CPU utilization of the instance shows that the task stopped somewhere around 02:40:00 - 02:45:00 UTC, but the scheduler logs do not show anything around that time, only the usual periodic entries.
-
I do not think it is a question of Airflow at all @SchernHe - this looks like a very different problem. Most likely you should set a different keepalive on your SSH connection. It's likely GCP is much more aggressive about killing long-running outgoing connections without activity. Also, I believe that in GCP - regardless of keepalive - long-running connections will be killed anyway. I strongly suspect this is the reason. You can try it yourself by replacing your SSHOperator with an SSH command run in bash without more frequent keepalive (see SSHHook for details and the SSH command line help on how you can do it) and see if you observe the same behaviour.
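In case it helps anyone landing here, a minimal sketch of passing a more aggressive keepalive through the SSH provider's SSHHook (the connection id, DAG and command below are hypothetical, not from this thread):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.hooks.ssh import SSHHook
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="ssh_keepalive_example",  # hypothetical DAG
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # An SSHHook with a more aggressive keepalive, so the session keeps
    # sending traffic even while the remote command runs for hours.
    ssh_hook = SSHHook(
        ssh_conn_id="gcp_compute_ssh",  # hypothetical Airflow connection id
        keepalive_interval=10,          # send an SSH keepalive every 10 seconds
    )

    run_long_process = SSHOperator(
        task_id="run_long_process",
        ssh_hook=ssh_hook,
        command="./run_long_process.sh",  # hypothetical long-running command
    )
```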
-
@potiuk Thanks for your response. The weird thing about the described issue was that the Airflow task (and DAG) did not terminate once the process was killed on the GCP machine. Shouldn't the task also be terminated in case the SSH connection gets closed? Either way, I just re-structured my code, running the job on GCP in a background process and using sensors to check for the termination criteria - it felt bad to keep the SSH connection open the whole time in the first place.
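That restructuring could look roughly like the following minimal sketch, under my own assumptions (hypothetical connection id, script name and marker file, not the actual code from this comment): start the remote job detached, then poll for a completion marker with a reschedule-mode sensor instead of holding the SSH connection open.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.hooks.ssh import SSHHook
from airflow.providers.ssh.operators.ssh import SSHOperator
from airflow.sensors.python import PythonSensor


def _remote_job_finished() -> bool:
    # Check for a marker file the remote job writes when it completes
    # (hypothetical path); the SSH connection is opened only for this check.
    hook = SSHHook(ssh_conn_id="gcp_compute_ssh")
    client = hook.get_conn()
    try:
        _, stdout, _ = client.exec_command("test -f /tmp/long_job.done && echo ok")
        return stdout.read().strip() == b"ok"
    finally:
        client.close()


with DAG(
    dag_id="background_job_example",  # hypothetical DAG
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Start the process detached so the SSH session can close right away.
    start_remote_job = SSHOperator(
        task_id="start_remote_job",
        ssh_conn_id="gcp_compute_ssh",
        command="nohup ./long_job.sh > /tmp/long_job.log 2>&1 & echo started",
    )

    # Reschedule mode frees the worker slot between pokes.
    wait_for_remote_job = PythonSensor(
        task_id="wait_for_remote_job",
        python_callable=_remote_job_finished,
        mode="reschedule",
        poke_interval=300,
    )

    start_remote_job >> wait_for_remote_job
```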
-
It depends. TCP connections work in such a way that if the client does not send anything on the connection, closing the connection at a firewall might not make the client realise that the connection has been broken. This is one reason why keep-alive is needed: to make sure such connections are closed. The state machine for a TCP connection and the packets sent to close/shutdown connections are pretty complex, and there are mechanisms in place which eventually shut down such open connections with kernel-configured timeouts, but if you are not sending data over TCP to "ping" the other side, there are scenarios where either of the sides might not realise that the connection has been closed.

The thing is, a TCP connection is not a physical "link" to be broken as you might imagine it. It's just an agreement between client and server that if a packet is sent over the network and the destination/source addresses and port numbers agree, then such a packet gets routed by the kernel to the right client that "keeps" the right socket open. But if - suddenly - someone in between starts dropping all the packets, and there is no keep-alive, neither of the parties might realise that the link has been broken. So it is really a question of "how" the firewall breaks the connection. If it signals both ends that the connection has been broken (by sending a TCP shutdown/close packet sequence to either party), then they will get a "broken pipe" error. But if the firewall simply stops forwarding packets, then you get a "hanging connection".
Good idea.
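To make the keep-alive mechanism described above a bit more concrete, here is a minimal sketch (plain Python sockets, not Airflow- or SSH-specific; the endpoint is hypothetical) of enabling TCP keepalive so the kernel periodically probes the peer and notices a silently dropped connection instead of hanging on it forever:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask the kernel to send keepalive probes on an otherwise idle connection.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning knobs; guarded because they do not exist everywhere.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # probes before declaring the peer dead

sock.connect(("example.com", 22))  # hypothetical endpoint
```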
-
Discussed in #24462
UPDATED with logs from @karakanb after 2.3.2 migration
Originally posted by karakanb May 26, 2022
Apache Airflow version
2.3.2 (originally 2.2.5)
What happened
Hi there, this one is a bit of a weird one to reproduce, but I'll try my best to give as much information as possible.
General Information
First of all, here's some general information:
These are my relevant env variables:
The issue that I will be describing here started happening a week ago, after I moved from KubernetesExecutor to CeleryKubernetesExecutor, so it must have something to do with that.
Problem Statement
I have some DAGs that have long-running tasks: be it sensors that take hours to complete, or large SQL queries that take a very long time. Given that the sensors are waiting for hours in many cases, we use reschedule mode for the sensors; however, the long-running SQL queries cannot be executed that way, unfortunately, so those tasks stay open. Here's a sample log to show how the logs look when a query is executed successfully:

Here's a sample log for a task that started at 2022-05-26, 05:25:37, that actually demonstrates the problem where the task runs for a longer time:

Apparently, when the task runs for a longer time, it is being killed. It is not just happening with a single task instance, but with many others, therefore it is not an operator-specific issue. There are no timeouts, and no additional configuration defined on the individual tasks.
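For context, the setup looks roughly like the following minimal sketch (hypothetical DAG, file path and query function, not my actual DAGs): the sensor can give its worker slot back between pokes, while the long SQL query has to hold its slot for the whole run.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def run_long_query():
    """Placeholder for a SQL query that can run for hours."""
    ...


with DAG(
    dag_id="long_running_example",  # hypothetical DAG
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Reschedule mode: the sensor releases its worker slot between pokes.
    wait_for_upstream = FileSensor(
        task_id="wait_for_upstream_file",
        filepath="/data/upstream.done",  # hypothetical marker file
        mode="reschedule",
        poke_interval=600,
        timeout=timedelta(hours=12).total_seconds(),
    )

    # No reschedule equivalent here: the task instance must stay running
    # (occupying a Celery worker slot) until the query finishes.
    long_query = PythonOperator(
        task_id="run_long_query",
        python_callable=run_long_query,
    )

    wait_for_upstream >> long_query
```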
Some additional interesting observations:
The task fails with the message Task is not able to be run, despite retries being set for the DAG as well.
Unfortunately, I don't have the scheduler logs, but I am on the lookout for them.
As I have mentioned, this has only started happening after I switched to CeleryKubernetesExecutor. I'd love to investigate this further, and it is causing a lot of pain now, so I might need to go back to KubernetesExecutor, but I really don't want to, given that KubernetesExecutor is much slower than CeleryKubernetesExecutor due to git clone happening on every task.
Let me know if I can provide additional information. I am trying to find more patterns and details around this so that we can fix this issue, so any leads on what should be looked at are much appreciated.
More info from the discussion:
@pingzh I don't have the zombies_killed metric in my /metrics endpoint, not sure.
@MattiaGallegati thanks a lot for the information. I haven't observed the issue for the past 3 days after the upgrade, I'll keep observing and report here.
I am seeing the issue much more rarely than before, but it still happens after the upgrade. Here's one that has failed:
What you think should happen instead
The tasks should keep running until they are finished.
How to reproduce
I really don't know, sorry. I have tried my best to explain the situation above.
Operating System
Debian GNU/Linux 10 (buster)
Versions of Apache Airflow Providers
(Updated from the original post)
Deployment
Other 3rd-party Helm chart
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct