airflow worker restarts every few hours, no jobs get done #42

Open
7yl4r opened this issue Apr 7, 2023 · 7 comments

7yl4r commented Apr 7, 2023

I saw this issue in the docker logs.

I brought down all airflow-related containers (but left grafana and influx up so the existing data isn't affected).

Then brought them back up w/ docker compose up --build -d.

Jobs appear to be completing now. Will check on the data tomorrow.

7yl4r added the bug and client-fgbnms labels Apr 7, 2023
7yl4r self-assigned this Apr 7, 2023

7yl4r commented May 5, 2023

Seeing jobs stuck as "Not Yet started" in the airflow web GUI, plus a weird error when trying to fetch the task logfile.

reset command:

docker container restart mbon-dashboard-server-airflow-worker-1 mbon-dashboard-server-airflow-webserver-1 mbon-dashboard-server-airflow-scheduler-1 mbon-dashboard-server-flower-1 mbon-dashboard-server-redis-1 mbon-dashboard-server-postgres-1

After doing this, the jobs are working again.

7yl4r commented May 21, 2023

This is an ongoing issue. When trying to view a job log in the airflow web GUI:

*** Log file does not exist: /opt/airflow//logs/ts_ingest/ingest_sat_roi_fgb_MODA_chlor_a_SS1/2023-05-20T00:00:00+00:00/1.log
*** Fetching from: http://:8793/log/ts_ingest/ingest_sat_roi_fgb_MODA_chlor_a_SS1/2023-05-20T00:00:00+00:00/1.log
*** Failed to fetch log file from worker. The request to ':///' is missing either an 'http://'                         or 'https://' protocol.
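Note that the "Fetching from:" URL above has nothing between http:// and :8793, so the worker host is blank. One possible reading (an assumption, not confirmed in this thread) is that airflow never recorded a hostname for the task instance because no worker picked it up. A minimal sketch for spotting this pattern in saved log text; url_host and check_fetch_urls are hypothetical helper names, not part of airflow:

```shell
# Print the host portion of an http(s) URL; prints an empty line when the host is missing.
url_host() {
  # Strip the scheme, then drop everything from the first ':' or '/' onward.
  printf '%s\n' "$1" | sed -e 's|^[a-zA-Z]*://||' -e 's|[:/].*$||'
}

# Read log text on stdin and flag every "Fetching from:" URL that has no host.
check_fetch_urls() {
  grep 'Fetching from: ' | while IFS= read -r line; do
    url=${line#*Fetching from: }
    if [ -z "$(url_host "$url")" ]; then
      echo "empty host: $url"
    else
      echo "ok: $url"
    fi
  done
}
```

Usage: paste the task log text from the web GUI into a file and run check_fetch_urls < that_file.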

7yl4r commented Jun 13, 2023

Seeing the same issue on the fknms board now.

7yl4r commented Jun 13, 2023

Trying to restart one container at a time to narrow down where the issue might be.
After restarting each container, I wait ~15 min, then clear a DAG and observe the tasks.

| container name | time waited (hh:mm) | status |
| --- | --- | --- |
| mbon-dashboard-server-airflow-worker-1 | 00:15 | no change |
| mbon-dashboard-server-airflow-scheduler-1 | 04:00 | no change |
| mbon-dashboard-server-airflow-webserver-1 | 00:10 | no change |
| mbon-dashboard-server-redis-1 | 00:15 | working again |
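The one-at-a-time procedure above could be scripted roughly as follows. This is only a sketch: restart_one_at_a_time, DOCKER, and WAIT_SECONDS are names made up here, and checking the tasks in the web GUI is still a manual step.

```shell
# Assumed knobs for this sketch (not from the thread):
#   DOCKER        command used to talk to docker
#   WAIT_SECONDS  pause after each restart, default ~15 min
DOCKER=${DOCKER:-docker}
WAIT_SECONDS=${WAIT_SECONDS:-900}

# Restart the named containers one by one, stopping at the first restart
# after which the operator reports that tasks are running again.
restart_one_at_a_time() {
  for name in "$@"; do
    $DOCKER container restart "$name" >/dev/null || return 1
    sleep "$WAIT_SECONDS"
    printf 'restarted %s; clear a DAG and check the tasks. Working again? [y/N] ' "$name"
    read -r answer || return 1
    case $answer in
      y|Y) printf '\nsuspect: %s\n' "$name"; return 0 ;;
    esac
  done
  printf '\nno single restart helped\n'
}
```

Usage, with the container names from this thread: restart_one_at_a_time mbon-dashboard-server-airflow-worker-1 mbon-dashboard-server-airflow-scheduler-1 mbon-dashboard-server-airflow-webserver-1 mbon-dashboard-server-redis-1. Because earlier restarts are not undone, a "y" points at the last restarted container only as a suspect.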

7yl4r commented Jun 15, 2023

From docker logs on the redis container:

* Connecting to MASTER 194.38.20.196:8886
* MASTER <-> REPLICA sync started
# Error condition on socket for SYNC: Connection refused

related SO Q
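The log lines above show this redis instance trying to sync as a replica of 194.38.20.196:8886. Assuming the broker is meant to be a standalone instance (the thread does not say so explicitly), its replication role would be expected to read master. A rough sketch for checking that; redis_role is a hypothetical helper that summarizes the output of redis-cli info replication, and it assumes redis-cli is available inside the container and that the master host contains no colons:

```shell
# Summarize `redis-cli info replication` output read on stdin.
# Prints "master", or "replica of HOST:PORT" when the instance is a replica.
redis_role() {
  # INFO output uses CRLF line endings, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "role"        { role = $2 }
    $1 == "master_host" { host = $2 }
    $1 == "master_port" { port = $2 }
    END {
      if (role == "slave") print "replica of " host ":" port
      else print role
    }'
}
```

Usage: docker exec mbon-dashboard-server-redis-1 redis-cli info replication | redis_role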

7yl4r closed this as completed in 9c8910b Jun 15, 2023

7yl4r commented Jun 15, 2023

restarting the fknms board to see if 9c8910b actually fixed it:

tylarmurray@fknms-dashboard-04:~/mbon-dashboard-server$ docker compose down --volumes --rmi all && docker compose up airflow-init && sudo chmod -R 777 airflow/ influxdb/ grafana/ postgres/ && docker compose up airflow-init && docker compose up --build -d

7yl4r reopened this Jun 15, 2023

7yl4r commented Jun 23, 2023

doing the same for fgbnms:

tylarmurray@fgbnms-dashboard-02:~/mbon-dashboard-server$ docker compose down --volumes --rmi all && docker compose up airflow-init && sudo chmod -R 777 airflow/ influxdb/ grafana/ postgres/ && docker compose up airflow-init && docker compose up --build -d
