Frequent 504s and Poor Uptime on Docker Compose deployments #821

TheOnlyWayUp · 2024-12-06T04:42:19Z

To Reproduce

Create multiple Docker Compose services in the same project
Monitor their uptime

This isn't an issue with Uptime Kuma, because my statistics show long periods of inactivity as well.

Uptime stats with large empty blocks:

Before moving composes to Dokploy

During a "downtime",

Nothing on service logs
Traefik request logs show 504s and timeouts
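To narrow down which routers return the 504s, the Traefik access log can be tallied per router. The following is a minimal sketch, not something from this report: it assumes Traefik's access log is written in JSON format (it is not by default) and relies on the `DownstreamStatus` and `RouterName` fields of that format. The function name is hypothetical.

```python
import json
from collections import Counter


def count_gateway_timeouts(lines, status=504):
    """Count access-log entries with the given status, grouped by router.

    Assumes one JSON object per line (Traefik access log in JSON format).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (e.g. startup messages)
        if not isinstance(entry, dict):
            continue
        if entry.get("DownstreamStatus") == status:
            counts[entry.get("RouterName", "unknown")] += 1
    return counts


# Usage (hypothetical file name):
# with open("access.log") as fh:
#     print(count_gateway_timeouts(fh).most_common())
```

If only the routers of the affected compose projects show up in the result, that points at their routing or network rather than at Traefik as a whole.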

Current vs. Expected behavior

Expected: Services stay online until they are turned off
Current: Services are shown as online in Dokploy's console but are intermittently unreachable over the network

Provide environment information

CPU: AMD Ryzen 7 3700X (16) @ 3.600
GPU: 2b:00.0 ASPEED Technology, Inc
Memory: 13894MiB / 64221MiB
OS: Ubuntu 24.04 LTS x86_64
Host: 1.0
Kernel: 6.8.0-49-generic
Dokploy Version: v0.12.0

Which area(s) are affected? (Select all that apply)

Docker Compose

Are you deploying the applications where Dokploy is installed or on a remote server?

Same server where Dokploy is installed

Additional context

This doesn't happen when deploying on the host system without Dokploy, which bypasses Traefik.

Will you send a PR to fix it?

Maybe, need help

TheOnlyWayUp · 2024-12-06T04:45:09Z

Services go down and come back up within a few minutes, all throughout the day. It has tanked uptime to 30%.

Forgejo Docker compose:

version: "3"

services:
  server:
    image: codeberg.org/forgejo/forgejo:8
    container_name: forgejo
    environment:
      - USER_UID=1000
      - USER_GID=1000
    restart: always
    networks:
      - default
    volumes:
      - /root/Projects/Forge/forgejo:/data
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - 5005:5005
      - 222:22
    expose:
      - 5005
networks:
  default:

Ghost Docker compose

version: '3.1'

services:
  ghost:
    image: ghost:5-alpine
    restart: always
    expose:
      - 2368
    networks:
      - default
    environment:
      # see https://ghost.org/docs/config/#configuration-options
      database__client: mysql
      database__connection__host: db
      database__connection__user: root
      database__connection__password: 
      database__connection__database: ghost
      # this url value is just an example, and is likely wrong for your environment!
      url: https://blog.rambhat.la
      # contrary to the default mentioned in the linked documentation, this image defaults to NODE_ENV=production (so development mode needs to be explicitly specified if desired)
      #NODE_ENV: development
    labels:
    - "traefik.enable=true"

    # Middleware for replacing content in the body
    - "traefik.http.middlewares.inject-script.plugin.rewrite.rewrites[0].regex=</head>"
    - "traefik.http.middlewares.inject-script.plugin.rewrite.rewrites[0].replacement=<script defer src='https://stats.towu.dev/script.js' data-website-id='4d72a7bf-3049-4c82-8ff4-05c0bc4f8edf'></script></head>"

    # Link the middleware to the router
    - "traefik.http.routers.blog.middlewares=inject-script"

    volumes:
      - ghost_ghost:/var/lib/ghost/content
    depends_on:
      - db

  db:
    image: mysql:8.0
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: 
    expose:
      - 3306
    volumes:
      - ghost_db:/var/lib/mysql
        #
volumes:
  ghost_ghost:
    external: true
  ghost_db:
    external: true

networks:
  default:

Umami:

version: '3'
services:
  umami:
    image: ghcr.io/umami-software/umami:postgresql-latest
    environment:
      DATABASE_URL: postgresql://umami:umami@db:5432/umami
      DATABASE_TYPE: postgresql
      APP_SECRET: 
    depends_on:
      db:
        condition: service_healthy
    restart: always
    healthcheck:
      test: ['CMD-SHELL', 'curl http://localhost:3000/api/heartbeat']
      interval: 5s
      timeout: 5s
      retries: 5
    expose:
      - 3000
    ports:
      - 4999:3000
    networks:
      - default
      
  db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: umami
      POSTGRES_USER: umami
      POSTGRES_PASSWORD: umami
    expose:
      - 5432
    volumes:
      - /root/Projects/miami/var/lib/postgresql/data:/var/lib/postgresql/data
    restart: always
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}']
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - default

networks:
  default:
    driver: bridge

These are the Docker Compose files for the affected services.

TheOnlyWayUp · 2024-12-06T04:52:27Z

I believe it's an issue with Traefik, because I can still access the port-forwarded services directly (for example, Umami is forwarded to port 4999 on the host and served at stats.towu.dev via Traefik).

When stats.towu.dev is down, I can still access host:4999 to see Umami, so I'm pretty confident it's a proxy issue.
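The comparison above (proxied domain vs. direct host port) can be automated so each outage is classified as it happens. This is a rough sketch using only the Python standard library; the function names are hypothetical, and `http://host:4999` stands in for the real host address, which is not given here.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=5.0):
    """Return True if the URL answers with a non-5xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # 4xx still proves the path works; 502/504 typically come from the proxy.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, TLS error, ...
        return False


def classify(proxied_ok, direct_ok):
    """Turn the two probe results into a short diagnosis."""
    if proxied_ok and direct_ok:
        return "healthy"
    if direct_ok:
        return "proxy path failing"
    if proxied_ok:
        return "direct port failing"
    return "service down"


# Usage (performs real network requests):
# print(classify(is_reachable("https://stats.towu.dev"),
#                is_reachable("http://host:4999")))
```

A run of "proxy path failing" results during a downtime window would support the theory that the container itself stays up while the route through Traefik breaks.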

Something peculiar: all the affected compose services (Ghost, Umami, and Forgejo) go down at the same time, while other compose projects, like Immich, don't go down at all. Immich is a photo-management app whose Docker Compose file includes a website, like the other services.

Immich (no downtime) Docker Compose:

version: "3"
name: immich

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    networks:
      - default
    extends:
      file: ../../hwaccel.transcoding.yml
      service: cpu # set to one of [nvenc, quicksync, rkmpp, vaapi, vaapi-wsl] for accelerated transcoding
    volumes:
      # Do not edit the next line. If you want to change the media storage location on your system, edit the value of UPLOAD_LOCATION in the .env file
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
      - /root/Projects/Immich/external:/mnt/media:ro
    env_file:
      - .env
    ports:
      - xxxx:2283
    expose:
      - 2283
    depends_on:
      - redis
      - database
    restart: always
    healthcheck:
      disable: false

  immich-machine-learning:
    container_name: immich_machine_learning
    networks:
      - default
    # For hardware acceleration, add one of -[armnn, cuda, openvino] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    # extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/ml-hardware-acceleration
    #   file: hwaccel.ml.yml
    #   service: cpu # set to one of [armnn, cuda, openvino, openvino-wsl] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: always
    healthcheck:
      disable: false

  redis:
    container_name: immich_redis
    networks:
      - default
    image: docker.io/redis:6.2-alpine@sha256:e3b17ba9479deec4b7d1eeec1548a253acc5374d68d3b27937fcfe4df8d18c7e
    healthcheck:
      test: redis-cli ping || exit 1
    restart: always

  database:
    container_name: immich_postgres
    networks:
      - default
    image: docker.io/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      POSTGRES_INITDB_ARGS: '--data-checksums'
    volumes:
      # Do not edit the next line. If you want to change the database storage location on your system, edit the value of DB_DATA_LOCATION in the .env file
      - ${DB_DATA_LOCATION}:/var/lib/postgresql/data
    command: ["postgres", "-c", "shared_preload_libraries=vectors.so", "-c", 'search_path="$$user", public, vectors', "-c", "logging_collector=on", "-c", "max_wal_size=2GB", "-c", "shared_buffers=512MB", "-c", "wal_compression=on"]
    restart: always

volumes:
  model-cache:

networks:
  default:

TheOnlyWayUp · 2024-12-08T03:58:22Z

Immich has downtime as well.

Related: #656 #734 #752

Related documentation, https://docs.dokploy.com/docs/core/troubleshooting#docker-compose-domain-not-working

version: '3'
services:
  umami:
    image: ghcr.io/umami-software/umami:postgresql-latest
    ...
    expose:
      - 3000
    ports:
-      - 4999:3000
+      - 3000
    networks:
      - default
      
  db:
    image: postgres:15-alpine
    ...
    networks:
      - default

networks:
  default:
    driver: bridge

I'm trying this just to check. I need the ports forwarded because, for example, I can't upload large files to Immich through the Cloudflare-proxied domain.

TheOnlyWayUp · 2024-12-08T04:06:39Z

The Ghost service (not from Dokploy's template) goes down often and has no ports forwarded.

version: '3.1'

services:

  ghost:
    image: ghost:5-alpine
    expose:
      - 2368
    networks:
      - default
    ...
    labels:
    - "traefik.enable=true"

    # Middleware for replacing content in the body
    - "traefik.http.middlewares.inject-script.plugin.rewrite.rewrites[0].regex=</head>"
    - "traefik.http.middlewares.inject-script.plugin.rewrite.rewrites[0].replacement=<script defer src='https://stats.towu.dev/script.js' data-website-id='4d72a7bf-3049-4c82-8ff4-05c0bc4f8edf'></script></head>"

    # Link the middleware to the router
    - "traefik.http.routers.blog.middlewares=inject-script"

    depends_on:
      - db

  db:
    image: mysql:8.0
    ...
    expose:
      - 3306

volumes:
  ghost_ghost:
    external: true
  ghost_db:
    external: true

networks:
  default:

I'll keep an eye on the uptime.

Siumauricio · 2024-12-09T04:13:57Z

I think I know what the error could be. There is currently a very rare bug related to Docker Compose: if the same service name is used in several places, the information can get mixed up somehow. I have not found a solution to this problem yet. My suggestion would be to change the name of the service:

services:
      db:
          .....

to something like this

services:
      ghost-db:
          .....
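To see which service names would need renaming, the compose files can be compared for names that appear in more than one project. A minimal sketch, assuming each compose file has already been parsed into a dictionary (for example with a YAML library, which is not shown here); the function name and the sample data are illustrative only.

```python
from collections import defaultdict


def find_shared_service_names(projects):
    """Map each service name used by more than one project to those projects.

    `projects` maps a project name to its parsed compose file (a dict).
    """
    owners = defaultdict(list)
    for project, compose in projects.items():
        for service in (compose.get("services") or {}):
            owners[service].append(project)
    return {
        name: sorted(used_by)
        for name, used_by in owners.items()
        if len(used_by) > 1
    }


# Illustrative data modelled on the compose files in this thread.
projects = {
    "ghost": {"services": {"ghost": {}, "db": {}}},
    "umami": {"services": {"umami": {}, "db": {}}},
    "forgejo": {"services": {"server": {}}},
}
print(find_shared_service_names(projects))  # {'db': ['ghost', 'umami']}
```

Here `db` is flagged because both the Ghost and Umami projects define it.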

TheOnlyWayUp · 2024-12-10T14:04:47Z

I've updated my services to use prefixed names. I guess that's what the randomize compose option is for.

Is there anything I can do to provide more insight? I can share Traefik logs if you let me know how to get them (would docker logs be enough?).

Likely related, umami-software/umami#3080 (reply in thread) - I believe another service was attempting to access Umami's database, leading to that error.

TheOnlyWayUp · 2024-12-10T14:08:13Z

Oh, is it because all the containers are part of the dokploy-network network, and names are resolved over this network instead of default? Dokploy also removes the default network unless it's explicitly included in the compose.
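One way to test this hypothesis is to check whether two running containers carry the same DNS alias on the shared network. The sketch below is an assumption-laden helper, not part of Dokploy: it expects the parsed JSON array printed by `docker inspect` for the running containers and reads the `NetworkSettings.Networks.<network>.Aliases` field of each entry. The function name and sample container names are hypothetical.

```python
from collections import defaultdict


def colliding_aliases(containers, network="dokploy-network"):
    """Return aliases on `network` that are claimed by more than one container.

    `containers` is the parsed JSON array from `docker inspect <ids...>`.
    """
    seen = defaultdict(set)
    for container in containers:
        name = container.get("Name", "").lstrip("/")
        networks = container.get("NetworkSettings", {}).get("Networks") or {}
        endpoint = networks.get(network)
        if not endpoint:
            continue  # container is not attached to this network
        for alias in endpoint.get("Aliases") or []:  # Aliases may be null
            seen[alias].add(name)
    return {alias: sorted(names) for alias, names in seen.items() if len(names) > 1}


# Usage (hypothetical script name):
#   docker inspect $(docker ps -q) > containers.json
#   then: colliding_aliases(json.load(open("containers.json")))
```

If an alias such as `db` maps to containers from two different projects, a lookup of `db` on that network could resolve to either one, which would match the mixed-up behavior described earlier.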

TheOnlyWayUp · 2024-12-10T15:00:46Z

@Siumauricio I updated the services to have unique names and rebuilt the project.

I'm still having uptime issues. This is my updated compose:

version: '3'
services:
  umami:
    image: ghcr.io/umami-software/umami:postgresql-latest
    environment:
      DATABASE_URL: postgresql://umami:xxx@umami_db:5432/umami
    depends_on:
      umami_db:
        condition: service_healthy
    restart: always
    healthcheck:
      test: ['CMD-SHELL', 'curl http://localhost:3000/api/heartbeat']
      interval: 5s
      timeout: 5s
      retries: 5
    expose:
      - 3000
    ports:
      - 3000
    networks:
      - default
      
  umami_db:
    image: postgres:15-alpine
    expose:
      - 5432
    volumes:
      - ...:/var/lib/postgresql/data
    restart: always
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}']
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - default

networks:
  default:
    driver: bridge

Siumauricio · 2024-12-14T18:24:23Z

Does the problem still persist?

TheOnlyWayUp · 2024-12-14T20:21:55Z

Yep.

kamellperry · 2024-12-18T19:28:09Z

I'm experiencing very similar issues. I'm also using the cloud-hosted version of Dokploy instead of self-hosting, because I thought self-hosting might have been the cause. After doing some digging, it's definitely the reverse proxy.

2shrestha22 · 2024-12-20T07:32:56Z

My server was down 5 minutes ago. I am not monitoring, but I assume this is still an issue.

TheOnlyWayUp · 2024-12-22T12:20:31Z

@Siumauricio This issue is causing me a lot of trouble. Is there anything I can do to help?

Siumauricio · 2024-12-23T06:43:36Z

Yes, I definitely think it is a bug in Docker at the network level. We must find a solution, because currently we cannot run 2 instances of the same template: sometimes the information gets mixed, which is very strange behavior. I will investigate in more detail how to solve this; the idea would be to isolate each Docker Compose project in a separate network.

TheOnlyWayUp · 2025-01-03T18:48:31Z

@Siumauricio I tried the fix in #1004 (randomize compose names) and the uptime hasn't improved at all.

This issue is urgent and affecting my users. Broken networking is a dealbreaker. Is there anything else I can try?

A last-ditch effort would be disabling Traefik and using a reverse proxy on host networking, or moving to another platform, which is a huge effort.

Are there any blockers for this issue? Any logs or information you need? Anything?

dreiekk · 2025-01-04T00:48:13Z

I'm having similar problems; randomizing compose names didn't fix it for me either.

I also suspect it has something to do with the same internal port being published by similar services/containers on the same dokploy-network, or with the Traefik config getting broken by that shared internal port even though the containers belong to different services.

Feel free to ping me as well if I can provide any logs, information or test something helpful to this issue.
