Replies: 15 comments
-
cc: @karakanb
-
I am unable to edit the original submission, therefore here's my current list of provider versions after upgrading to Airflow v2.3.2:
-
I keep experiencing the same issue, although I have limited logs since I don't have a log collector yet; still, I see no harm in sharing whatever I can find here in case someone can help. This will be a bit long, so apologies in advance.
Task 1 - Example
Here's a task that has experienced the same issue recently:
Unfortunately, I don't have scheduler logs from the start of the task execution because the pods have restarted and those logs are gone, but my logs start somewhere in the middle and I see this:
As you can see in the first few lines, the same task is one of the tasks that is already scheduled, and that is the only reason it is not being scheduled again.
As for my worker logs, I don't have them from the beginning either, but I do have them from when the first attempt ends, at 02:53, and check this out:
Right at the time the first attempt ended, there is a log in the worker showing that an instance of the same task is being picked up. The curious thing is that, according to my task logs for the 2nd attempt, the task didn't start until 04:00 UTC:
Facts
Hypothesis
Here's my hypothesis about what might have happened:
I hope these are helpful; I'll try to collect more logs.
-
Huh, here's something weird I have observed: I have 3 tasks that failed with the same issue, and they all failed within the same 2 seconds that day. Here are the final logs from all 3 tasks:
task1 - which is the task I have shared the logs for above:
task2:
task3:
Just checked my cluster resources, and I don't see anything abnormal. The timezone in the image is BST, so it corresponds to when the tasks were killed.
-
Strange. Any other logs around? I guess some more evidence needs to be gathered - maybe there are some logs elsewhere in your cluster/deployment configuration that can be correlated - otherwise it's really difficult to guess where it came from.
-
Unfortunately, I had to bring the big pipeline back to run on Kubernetes by setting its queue to the kubernetes queue.
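For anyone reading along, here's a minimal sketch of how that queue-based routing works with CeleryKubernetesExecutor (hypothetical DAG and task names, not from this thread): tasks whose queue matches the configured kubernetes_queue (by default "kubernetes", under [celery_kubernetes_executor] in airflow.cfg) run via the embedded KubernetesExecutor, and everything else goes to the Celery workers.

```python
# Minimal sketch (hypothetical DAG and task names) of queue-based routing
# with CeleryKubernetesExecutor: a task whose queue matches the configured
# kubernetes_queue runs via the KubernetesExecutor, the rest go to Celery.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="queue_routing_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Routed to the KubernetesExecutor, assuming the default
    # [celery_kubernetes_executor] kubernetes_queue = "kubernetes".
    long_running_on_kubernetes = BashOperator(
        task_id="long_running_on_kubernetes",
        bash_command="sleep 3600",
        queue="kubernetes",
    )

    # Default queue, so this one is picked up by a Celery worker.
    long_running_on_celery = BashOperator(
        task_id="long_running_on_celery",
        bash_command="sleep 3600",
    )
```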
-
Seen this issue with CeleryKubernetesExecutor when restarting the Celery worker pod. After a restart, some long-running task instances can be set to the wrong state. This should be reproducible and I'm investigating it.
-
Would be great to get more evidence of it - especially to see if it is reproducible in the latest released Airflow!
-
Maybe the cause is that you are running two schedulers at a time? I've read somewhere that tasks could be triggered twice.
-
I am running two schedulers, that is correct. If there's an issue with the HA setup of the scheduler, then it needs to be fixed in the scheduler, I believe, rather than by introducing an additional controller. The weird thing that I still cannot explain is why this is happening only with the tasks that run on Celery, and not with those that run on Kubernetes. In both cases I am using CeleryKubernetesExecutor, and only the queue differs.
Unfortunately, I had to revert things back to the kubernetes queue for now.
-
Also seeing similar issues with CeleryKubernetesExecutor on 2.3.1. Tasks on the Celery worker randomly just abort halfway through with the following message in the log:
Only running one scheduler.
-
Is there any news on this issue? I am also seeing similar behavior with a long-running task. My DAG includes an SSHOperator that executes a long-running process on a Google Cloud Compute Engine instance. The task keeps on running, even though the underlying process gets killed for some reason along the way. The weird thing is that I do not see any log message that states why the process was killed - it just stops logging.
General Information
Task Logs
Here are the logs of the latest execution:
Scheduler Logs
The CPU utilization of the instance shows that the task stopped somewhere around 02:40:00 - 02:45:00 UTC, but the scheduler logs do not show anything around that time, only the usual periodic entries.
-
I do not think it is a question of Airflow at all @SchernHe - this looks like a very different problem. Most likely you should set a different keepalive on your SSH connection. It's likely GCP is much more aggressive about killing long-running outgoing connections without activity. Also, I believe that in GCP - regardless of keepalive - long-running connections will be killed anyway. I strongly suspect this is the reason. You can try it yourself by replacing your SSHOperator with an SSH command run in bash without more frequent keepalive (see SSHHook for details and the SSH command line help on how you can do it) and see if you observe the same behaviour.
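In case it helps anyone landing here, a minimal sketch of passing a more aggressive keepalive through the SSH provider's SSHHook (the connection id, DAG and command below are hypothetical, not from this thread):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.hooks.ssh import SSHHook
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="ssh_keepalive_example",  # hypothetical DAG
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # An SSHHook with a more aggressive keepalive, so the session keeps
    # sending traffic even while the remote command runs for hours.
    ssh_hook = SSHHook(
        ssh_conn_id="gcp_compute_ssh",  # hypothetical Airflow connection id
        keepalive_interval=10,          # send an SSH keepalive every 10 seconds
    )

    run_long_process = SSHOperator(
        task_id="run_long_process",
        ssh_hook=ssh_hook,
        command="./run_long_process.sh",  # hypothetical long-running command
    )
```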
-
@potiuk Thanks for your response. The weird thing about the described issue was that the Airflow task (and DAG) did not terminate once the process was killed on the GCP machine. Shouldn't the task also be terminated in case the SSH connection gets closed? Either way, I just re-structured my code, running the job on GCP in a background process and using sensors to check for the termination criteria - it felt bad to keep the SSH connection open the whole time in the first place.
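That restructuring could look roughly like the following minimal sketch, under my own assumptions (hypothetical connection id, script name and marker file, not the actual code from this comment): start the remote job detached, then poll for a completion marker with a reschedule-mode sensor instead of holding the SSH connection open.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.hooks.ssh import SSHHook
from airflow.providers.ssh.operators.ssh import SSHOperator
from airflow.sensors.python import PythonSensor


def _remote_job_finished() -> bool:
    # Check for a marker file the remote job writes when it completes
    # (hypothetical path); the SSH connection is opened only for this check.
    hook = SSHHook(ssh_conn_id="gcp_compute_ssh")
    client = hook.get_conn()
    try:
        _, stdout, _ = client.exec_command("test -f /tmp/long_job.done && echo ok")
        return stdout.read().strip() == b"ok"
    finally:
        client.close()


with DAG(
    dag_id="background_job_example",  # hypothetical DAG
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Start the process detached so the SSH session can close right away.
    start_remote_job = SSHOperator(
        task_id="start_remote_job",
        ssh_conn_id="gcp_compute_ssh",
        command="nohup ./long_job.sh > /tmp/long_job.log 2>&1 & echo started",
    )

    # Reschedule mode frees the worker slot between pokes.
    wait_for_remote_job = PythonSensor(
        task_id="wait_for_remote_job",
        python_callable=_remote_job_finished,
        mode="reschedule",
        poke_interval=300,
    )

    start_remote_job >> wait_for_remote_job
```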
-
It depends. TCP connections work in such a way that if the client does not send anything on the connection, closing the connection at a firewall might not make the client realise that the connection has been broken. This is one reason why keep-alive is needed: to make sure such connections are closed. The state machine for a TCP connection and the packets sent to close/shutdown connections are pretty complex, and there are mechanisms in place which eventually shut down such open connections with kernel-configured timeouts, but if you are not sending data over TCP to "ping" the other side, there are scenarios where either of the sides might not realise that the connection has been closed.

The thing is, a TCP connection is not a physical "link" to be broken as you might imagine it. It's just an agreement between client and server that if a packet is sent over the network and the destination/source addresses and port numbers agree, then such a packet gets routed by the kernel to the right client that "keeps" the right socket open. But if - suddenly - someone in between starts dropping all the packets, and there is no keep-alive, neither of the parties might realise that the link has been broken. So it is really a question of "how" the firewall breaks the connection. If it signals both ends that the connection has been broken (by sending a TCP shutdown/close packet sequence to either party), then they will get a "broken pipe" error. But if the firewall simply stops forwarding packets, then you get a "hanging connection".
Good idea.
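To make the keep-alive mechanism described above a bit more concrete, here is a minimal sketch (plain Python sockets, not Airflow- or SSH-specific; the endpoint is hypothetical) of enabling TCP keepalive so the kernel periodically probes the peer and notices a silently dropped connection instead of hanging on it forever:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask the kernel to send keepalive probes on an otherwise idle connection.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning knobs; guarded because they do not exist everywhere.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # probes before declaring the peer dead

sock.connect(("example.com", 22))  # hypothetical endpoint
```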
-
Discussed in #24462
UPDATED with logs from @karakanb after 2.3.2 migration
Originally posted by karakanb May 26, 2022
Apache Airflow version
2.3.2 (originally 2.2.5)
What happened
Hi there, this one is a bit of a weird one to reproduce, but I'll try my best to give as much information as possible.
General Information
First of all, here's some general information:
These are my relevant env variables:
The issue that I will be describing here started happening a week ago, after I moved from KubernetesExecutor to CeleryKubernetesExecutor, so it must have something to do with that.
Problem Statement
I have some DAGs that have long-running tasks: be it sensors that take hours to complete, or large SQL queries that take a very long time. Given that the sensors are waiting for hours in many cases, we use reschedule mode for the sensors; however, the long-running SQL queries cannot be executed that way, unfortunately, so those tasks stay open. Here's a sample log to show how the logs look when a query is executed successfully:

Here's a sample log for a task that started at 2022-05-26, 05:25:37, that actually demonstrates the problem where the task runs for a longer time:

Apparently, when the task runs for a longer time, it is being killed. It is not just happening with a single task instance, but with many others, therefore it is not an operator-specific issue. There are no timeouts, and no additional configuration defined on the individual tasks.
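For context, the setup looks roughly like the following minimal sketch (hypothetical DAG, file path and query function, not my actual DAGs): the sensor can give its worker slot back between pokes, while the long SQL query has to hold its slot for the whole run.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def run_long_query():
    """Placeholder for a SQL query that can run for hours."""
    ...


with DAG(
    dag_id="long_running_example",  # hypothetical DAG
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Reschedule mode: the sensor releases its worker slot between pokes.
    wait_for_upstream = FileSensor(
        task_id="wait_for_upstream_file",
        filepath="/data/upstream.done",  # hypothetical marker file
        mode="reschedule",
        poke_interval=600,
        timeout=timedelta(hours=12).total_seconds(),
    )

    # No reschedule equivalent here: the task instance must stay running
    # (occupying a Celery worker slot) until the query finishes.
    long_query = PythonOperator(
        task_id="run_long_query",
        python_callable=run_long_query,
    )

    wait_for_upstream >> long_query
```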
Some additional interesting observations:
The task fails with the message Task is not able to be run, despite retries being set for the DAG as well.
Unfortunately, I don't have the scheduler logs, but I am on the lookout for them.
As I have mentioned, this has only started happening after I switched to CeleryKubernetesExecutor. I'd love to investigate this further, and it is causing a lot of pain now, so I might need to go back to KubernetesExecutor, but I really don't want to, given that KubernetesExecutor is much slower than CeleryKubernetesExecutor due to git clone happening on every task.
Let me know if I can provide additional information. I am trying to find more patterns and details around this so that we can fix this issue, so any leads on what should be looked at are much appreciated.
More info from the discussion:
@pingzh I don't have the zombies_killed metric in my /metrics endpoint, not sure.
@MattiaGallegati thanks a lot for the information. I haven't observed the issue for the past 3 days after the upgrade, I'll keep observing and report here.
I am seeing the issue much more rarely than before, but it still happens after the upgrade. Here's one that has failed:
What you think should happen instead
The tasks should keep running until they are finished.
How to reproduce
I really don't know, sorry. I have tried my best to explain the situation above.
Operating System
Debian GNU/Linux 10 (buster)
Versions of Apache Airflow Providers
(Updated from the original post)
Deployment
Other 3rd-party Helm chart
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct