scheduler gets stuck without a trace #7935
Comments
I'm running Airflow 1.10.4 as Celery in k8s. The scheduler pod is getting stuck while starting up at the step 'Resetting orphaned tasks'.
This causes the UI to say
The same thing happens even after restarting the scheduler pod, regardless of the CPU usage. Any leads to solve this? |
What database are you using? |
@mik-laj PostgreSQL. That's running as a pod too. |
We are also facing the scheduler-stuck issue, which sometimes gets resolved by restarting the scheduler pod. There is no log trace in the scheduler process. We are using Airflow 1.10.9 with Postgres and Redis. |
We're also seeing this same issue... no idea how to debug. airflow 1.10.9 with postgres / rabbitmq |
I see a similar issue on 1.10.9 where the scheduler runs fine on start but typically after 10 to 15 days the CPU utilization actually drops to near 0%. The scheduler health check in the webserver does still pass, but no jobs get scheduled. A restart fixes this. Seeing as I observe a CPU drop instead of a CPU spike, I'm not sure if these are the same issues, but they share symptoms. |
I see a similar issue on 1.10.10... there are no logs to indicate the problem. Airflow with mysql, redis and celery executor. PS: we still run the scheduler with the arguments |
I've anecdotally noticed that once I've dropped argument |
Could someone try to run py-spy when this incident occurs? This may bring us to a solution. Thanks to it, we will be able to check what code is currently being executed without restarting the application. |
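For anyone who wants to try this the next time the scheduler hangs, here is a small sketch (not from this thread) of collecting a py-spy dump of the scheduler process. It assumes py-spy is installed and that the process shows up in pgrep as "airflow scheduler"; adjust for your deployment.

```python
# Hedged sketch: grab a py-spy stack dump of the (possibly stuck) scheduler.
# Assumes py-spy is installed and the scheduler's command line matches
# "airflow scheduler".
import subprocess

def dump_scheduler_stacks() -> str:
    # Oldest matching process is usually the main scheduler (-o), matched
    # against the full command line (-f).
    pid = subprocess.run(
        ["pgrep", "-o", "-f", "airflow scheduler"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # "py-spy dump" prints the current Python stack of every thread without
    # restarting the process.
    return subprocess.run(
        ["py-spy", "dump", "--pid", pid],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    print(dump_scheduler_stacks())
```

Note that in containers (Docker/Kubernetes) py-spy generally needs permission to ptrace the target, e.g. the SYS_PTRACE capability.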
|
Happened again today
@mik-laj does it help? |
OK, so I have more info. Here is the situation when the scheduler gets stuck:
I managed to revive the scheduler by killing both PIDs, 5977 & 5978.
|
We also have this issue.
Kubernetes version (if you are using kubernetes) (use kubectl version): v1.14.10-gke.42
Environment:
Cloud provider or hardware configuration: Google Cloud Kubernetes |
This is happening to us also. No errors appear in the logs but the scheduler will not create new pods, pipelines stall with tasks in 'queued' state, and the scheduler pod must be deleted in order to get things running again. |
Any fix for this issue yet? Our scheduler has no heartbeat, CPU spikes then drops, and scheduler is back up after 15 minutes. This is slowing our team down a lot. |
Hi, this is happening at Slack too. We are using celery executor. The scheduler just gets stuck, no trace in the logs. Seeing a lot of defunct processes. Restart fixes it. @turbaszek @kaxil @potiuk any ideas what is going on? |
We are also facing the same issue with the |
@msumit I see the exact same symptom. Please let us know if you find something. |
We've experienced this issue twice now, with the CPU spiking to 100% and no tasks being scheduled afterwards. Our config is
Which would point to the scheduler running out of memory, likely due to log buildup (I added log cleanup tasks retroactively). I'm not sure if this is related to the scheduler getting stuck though. |
Is disk space everyone's issue? I recall either v1.10.5 or v1.10.6 had some not-fit-for-production-use issue that was fixed in the next version. 1.10.9 has been working okay for us, and importantly I'm curious if you could work around it with
In the meantime we have a systemd timer service (or you could use cron) that runs, basically, this (GNU) find:
find <base_log_dir> -mindepth 2 -type f -mtime +6 -delete -or -type d -empty -delete
E.g.:
$ tree -D dir/
dir/
└── [Sep 6 23:10] dir
├── [Sep 6 23:10] dir
│ └── [Jan 1 2020] file.txt
├── [Sep 6 23:09] diry
└── [Sep 6 23:10] dirz
└── [Sep 6 23:10] file.txt
4 directories, 2 files
$ find dir -mindepth 2 -type f -mtime +6 -delete -or -type d -empty -delete
$ tree -D dir/
dir/
└── [Sep 6 23:13] dir
└── [Sep 6 23:10] dirz
└── [Sep 6 23:10] file.txt
2 directories, 1 file |
All system vitals like disk, CPU, and memory are absolutely fine whenever the hang happens for us. Whenever the process is stuck, it doesn't respond to any kill signals except 9 & 11. I did an strace on the stuck process, and it shows the following
Then I killed the process with
|
If it helps, the last time this happened, with debug logging on, the scheduler logs this: |
#11306 |
We are also experiencing a similar issue at Nextdoor with 1.10.12 / Postgres / Celery / AWS ECS. Ours looks much like @sylr's post #7935 (comment), where many extra processes are spawned that, judging by their program args, appear identical to the scheduler main process, and everything is stuck. However, for us CPU goes to 0 and RAM spikes up quite high. |
We have a change that correlates (causation is not yet verified) with fixing the issue @sylr mentioned here, where many scheduler main processes spawn at the same time and then disappear (which caused an OOM error for us). The change was the following:
And we run MAX_THREADS=10. Is it possible that reaching pool_size or pool_size + max_overflow caused processes to back up or spawn oddly? Before this change, the scheduler was getting stuck 1-2 times per day; we have not seen the issue since the change 6 days ago. We no longer see many processes spawning at once like this:
Can anyone else verify this change helps or not? |
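For context on the settings being discussed (not a verified fix): assuming the change refers to Airflow's sql_alchemy_pool_size / sql_alchemy_max_overflow options, they map onto SQLAlchemy's engine-level connection pool roughly as in the sketch below. The DSN and numbers are placeholders.

```python
# Illustration of what pool_size / max_overflow mean at the SQLAlchemy level.
# Placeholder DSN and values; not a recommendation.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://airflow:airflow@postgres/airflow",
    pool_size=5,      # connections kept open in the pool
    max_overflow=10,  # extra connections allowed beyond pool_size under load
    pool_timeout=30,  # seconds a caller waits for a free connection
)
# If every scheduler thread/process holds a connection and
# pool_size + max_overflow is exhausted, further requests block for
# pool_timeout seconds, which from the outside can look like a hang.
```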
Same issue here with 1.10.12 + rabbitmq + celery + k8s. The scheduler keeps logging |
Seeing this on 1.10.9 |
Seeing this on 1.10.8 with Celery executor.
|
@ashb perhaps there is a race condition somewhere in the scheduler loop? It would be interesting to see this same thread trace on 2.0. |
Airflow doesn't use threads - so I'm not sure why there are two threads in the above trace. Oh, multiprocessing uses threads internally.
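A quick standalone way to see those internal threads (not Airflow code):

```python
# multiprocessing.Pool starts helper threads (task/result/worker handlers)
# inside the parent process, so a stack dump of a "single-threaded" program
# that uses a Pool will show several threads.
import multiprocessing
import threading

if __name__ == "__main__":
    print("threads before:", [t.name for t in threading.enumerate()])
    pool = multiprocessing.Pool(processes=2)
    print("threads after: ", [t.name for t in threading.enumerate()])
    pool.close()
    pool.join()
```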
Started seeing this for the first time ever after we upgraded from 1.10.5 to 1.10.14. |
We just saw this on 2.0.1 when we added a largish number of new DAGs (we're adding around 6000 DAGs total, but this seems to lock up when about 200 try to be scheduled at once). Here are py-spy stack traces from our scheduler:
What I think is happening is that the pipe between the
From what I can see, the (airflow/airflow/utils/dag_processing.py, Line 374 in beb8af5)
and that the SchedulerJob is responsible for calling its (airflow/airflow/jobs/scheduler_job.py, Line 1388 in beb8af5).
However, the SchedulerJob is blocked from calling |
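A minimal standalone sketch of that failure mode (not Airflow code): a multiprocessing pipe has a finite OS buffer, so the writing side blocks once the buffer fills unless the other side keeps draining it.

```python
# Minimal sketch: a child that writes a large payload blocks in send() until
# the parent drains the pipe. If the parent is busy elsewhere, both sides stall.
import multiprocessing
import time

def child(conn):
    payload = b"x" * (16 * 1024 * 1024)  # comfortably larger than the pipe buffer
    print("child: sending...")
    conn.send(payload)                   # blocks while the parent ignores the pipe
    print("child: sent")

if __name__ == "__main__":
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=child, args=(child_conn,))
    p.start()
    time.sleep(5)                        # parent "busy", not reading
    print("child still blocked in send():", p.is_alive())
    parent_conn.recv()                   # draining the pipe unblocks the child
    p.join()
```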
Nice debugging @MatthewRBruce - and your diagnosis seems sound. We'll start on a fix next week. |
I have a theory of why the Airflow scheduler may get stuck at CeleryExecutor._send_tasks_to_celery (our scheduler got stuck in a different place 😃). The size of the return value from
For example, the following code easily deadlocks on Python 3.6.3:
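The original snippet isn't reproduced in this thread; the following is a standard illustration of the same class of deadlock (a child pushing a large result through a multiprocessing queue/pipe while the parent waits on the child instead of draining it first), the pattern the Python multiprocessing docs warn about:

```python
# Illustrative only, not the commenter's original snippet. The child puts a
# large object on a multiprocessing.Queue; the parent join()s the child before
# draining the queue, so the child blocks flushing its buffered data to the
# underlying pipe and join() never returns.
from multiprocessing import Process, Queue

def worker(q):
    q.put("x" * 10_000_000)  # large result, far bigger than the pipe buffer

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    p.join()       # deadlocks: q.get() should be called before joining
    print(q.get())
```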
|
@milton0825 Sounds plausible from what I know of your use case 😁 You're still on 1.10.x, right? The scheduler on 2.0 sends a lot less data over the MP pipes (it doesn't send the DAG; that gets written to the DB), so that particular issue won't exist for 2.0+.
Right we are still on 1.10.8 |
Seeing this on 1.10.14 + CeleryExecutor + Python 3.8. Will this be fixed on 1.10.x? For some reason our company has to use MySQL 5.6.
The [airflow schedul] defunct process keeps restarting all the time. |
@DreamyWen unlikely I'm afraid, at least not by me. I'll happily review a PR if anyone has time to submit it, but I can't put any time into fixing this on the 1.10 release branch, sorry. |
+1 on this issue. Airflow 2.0.1, CeleryExecutor, ~7000 DAGs. Seems to happen under load (when we have a bunch of DAGs all kicking off at midnight).
py-spy dump --pid 132 --locals
py-spy dump --pid 134 --locals
|
We had the same issue with Airflow on Google Cloud until we increased the setting AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW |
@ashb considering what @oleksandr-yatsuk found, maybe this is a database issue? |
No freezes since bumping |
I've got a fix for the case reported by @MatthewRBruce (for 2.0.1) coming in 2.0.2 |
Hi @ashb, I would like to report that we've been seeing something similar to this issue in Airflow 2.0.2 recently. We are using Airflow 2.0.2 with a single airflow-scheduler plus a few airflow-workers using CeleryExecutor and a Postgres backend, running dozens of DAGs, each with hundreds to a few thousand tasks. The Python version is 3.8.7. Here's what we saw:
When the scheduler was in this state, there was also a child
When I manually SIGTERMed the child airflow scheduler process, it died. And immediately the main
One other observation was that when the airflow scheduler was in the stuck state, the |
@yuqian90 I have almost the exact same environment as you, and I have the same problem.
The problem happens roughly twice per day. I get the same last log message you do:
As a last resort, I plan to watch for a hung subprocess of the scheduler and kill it in a cron job... just like you, when I kill the subprocess manually, the main scheduler process continues as if nothing happened. |
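For what it's worth, such a watchdog could look roughly like the sketch below (hypothetical, using psutil; the "airflow scheduler" match and the 10-minute threshold are assumptions to tune and test before trusting this in cron):

```python
# Hypothetical watchdog: SIGTERM child processes of the main Airflow scheduler
# that have been alive longer than a threshold. A last-resort mitigation, not a fix.
import time
import psutil

STUCK_AFTER_SECONDS = 600  # assumed threshold

def kill_stuck_scheduler_children() -> None:
    for proc in psutil.process_iter(["pid", "ppid", "cmdline", "create_time"]):
        try:
            cmdline = " ".join(proc.info["cmdline"] or [])
            if "airflow scheduler" not in cmdline:
                continue
            parent = psutil.Process(proc.info["ppid"])
            # Only target children whose parent is itself the scheduler,
            # i.e. forked scheduler subprocesses, never the main process.
            if "airflow scheduler" not in " ".join(parent.cmdline()):
                continue
            age = time.time() - proc.info["create_time"]
            if age > STUCK_AFTER_SECONDS:
                print(f"terminating stuck child pid={proc.info['pid']} age={age:.0f}s")
                proc.terminate()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue

if __name__ == "__main__":
    kill_stuck_scheduler_children()
```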
The same behaviour in my previous comment happened again, so I took a
The child
This is the
This is the
|
I have been struggling with this since we migrated our lower environments to 2.0. The scheduler works for a couple of days, then stops scheduling, but doesn't trigger any heartbeat errors. Not sure it's helpful, but our PROD instance is running smoothly with Airflow 1.10.9 and Python 3.7.8. Restarting the scheduler brings it back to life after Docker restarts the service.
|
@sterling-jackson Your use case might be fixed by 2.1.0 (currently in RC stage) |
Hi @ashb @davidcaron I managed to reproduce this issue consistently with a small reproducing example and traced the problem down to |
I just wanted to share that the User-Community Airflow Helm Chart now has a mitigation for this issue that will automatically restart the scheduler if no tasks are created within some threshold time. It's called the scheduler "Task Creation Check", but it's not enabled by default, as the "threshold" must be longer than your shortest DAG
Apache Airflow version:
Kubernetes version (if you are using kubernetes) (use kubectl version):
Environment:
Kernel (e.g. uname -a):
What happened:
The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the scheduler service is at 100%. No jobs get submitted and everything comes to a halt. It looks like it goes into some kind of infinite loop.
The only way I could make it run again is by manually restarting the scheduler service. But again, after running some tasks it gets stuck. I've tried with both the Celery and Local executors, but the same issue occurs. I am using the -n 3 parameter while starting the scheduler.
Scheduler configs,
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
executor = LocalExecutor
parallelism = 32
Please help. I would be happy to provide any other information needed.
What you expected to happen:
How to reproduce it:
Anything else we need to know:
Moved here from https://issues.apache.org/jira/browse/AIRFLOW-401