
Job stuck in queued state and never picked up #1590

Open
luizkowalski opened this issue Jan 29, 2025 · 5 comments

Comments

@luizkowalski
Contributor

Hey 👋🏻

Every now and then, I see a weird issue, just in development: a job gets scheduled but never picked up.

I have an endpoint that schedules a job when it gets a callback:

def show
  client.tap { it.code = params[:code] }

  raw_token = client.fetch_access_token!

  account = Account.from_token(raw_token, current_user: Current.user)
  account.sync # Here is where a job gets enqueued

  store_and_redirect(account.user)
rescue ActiveRecord::RecordInvalid => e
  redirect_to root_path, flash: { alert: e.message }
end

Sometimes, without any specific trigger or event, the job gets queued but doesn't get picked up:

[screenshot: the job sitting in the queued state]

This one has been sitting in the queue for 5 minutes.

I also noticed that when this happens, it's like the process enters a deadlock state or something. I say this because it stops pinging:

[screenshot: the process has stopped pinging]

and when I stop the process, it actually has to be killed:

worker    | Interrupting...
...
worker    | Killing...

whereas when everything is fine, it just exits cleanly:

worker    | [GoodJob] Notifier unsubscribed with UNLISTEN
worker    | [GoodJob] [98819] GoodJob shutting down scheduler...
worker    | [GoodJob] [98819] GoodJob scheduler is shut down.
worker    | [GoodJob] [98819] GoodJob shutting down scheduler...
worker    | [GoodJob] [98819] GoodJob scheduler is shut down.
worker    | [GoodJob] [98819] GoodJob shutting down scheduler...
worker    | [GoodJob] [98819] GoodJob scheduler is shut down.
worker    | [GoodJob] [98819] GoodJob shutting down scheduler...
worker    | [GoodJob] [98819] GoodJob scheduler is shut down.
worker    | Exited with code 0

I'm not sure if this rings any bells, but I'm seeing this on two different machines, in dev only.

Here is my config:

# frozen_string_literal: true

Rails.application.configure do
  config.good_job = Rails.application.config_for(:good_job).merge(
    on_thread_error: ->(exception) { Rails.error.report(exception) }
  )
end

and the config/good_job.yml it loads:

default: &default
  queues: urgent:10;default:5;low_priority:3;*
  max_threads: <%= ENV.fetch("GOOD_JOB_MAX_THREADS", 5).to_i %>
  execution_mode: external
  enable_cron: <%= ENV["CRON_ENABLED"] == "1" %>

production:
  <<: *default
  queues: urgent:5;default:3;low_priority:2;*
  cron:
    query_stats:
      cron: "*/30 * * * *" # every 30 minutes
      class: "CaptureQueryStatsJob"
    refresh_tokens:
      cron: "0 */2 * * *" # every 2 hours
      class: "Accounts::RefreshTokensJob"
    renew_expired_channels:
      cron: "0 3 * * *" # every day at 3am
      class: "Calendars::RenewExpiredChannelsJob"

development:
  <<: *default

test:
  <<: *default

Running Rails from the main branch and the latest GoodJob (4.8.2 as of now). Both machines are Macs, one an M1 and the other an M2 Pro. I start my dev env with Overmind:

#!/usr/bin/env sh

if ! overmind --version; then
  echo "Installing overmind..."
  brew install overmind
fi

# Default to port 3000 if not specified
export PORT="${PORT:-3000}"

echo "Cleaning logs"

rm -rf log/*.log

OVERMIND_SKIP_ENV=1 exec overmind start -f Procfile.dev "$@"

and my Procfile.dev looks like this:

web: bin/rails server
js: bun run build --watch
css: bun run build:css --watch
worker: bin/good_job start
stripe_fw: stripe listen --forward-to http://127.0.0.1:3000/pay/webhooks/stripe

I also don't know exactly what triggers this; I'm trying to reproduce it and find something, but I just want to put this out there in case you know something.

@bensheldon
Owner

That's really weird! It sounds like it's getting into some kind of deadlock state (implied by your having to kill it). That it's happening in development only makes me think that constant autoloading/reloading is involved.

Would you be able to share what the Account#sync method does, as well as what the job is doing?

And do you have any initializers/configuration of GoodJob (or anything else in your code paths) that might be touching an autoloaded constant (a model or job class, for example)?

@luizkowalski
Contributor Author

Would you be able to share what the Account#sync method does, as well as what the job is doing?

Sure, not a lot actually:

def sync
  Accounts::SyncCalendarsJob.set(wait: 2.seconds).perform_later(self)
end

The job then pulls the user's Google calendars and either removes them locally if they were removed upstream or just updates them:

class Accounts::SyncCalendarsJob < UrgentJob
  def perform(account)
    upstream_calendars = account.calendar_service.list_calendar_lists

    # Delete calendars that are no longer in the list
    account.calendars.where.not(google_id: upstream_calendars.items.map(&:id)).map(&:enqueue_for_removal)

    upstream_calendars.items.map do |item|
      Calendar.from_google(item:, account:)
    ensure
      account.user.synced!
    end
  end
end

And do you have any initializers/configuration of GoodJob (or anything else in your code paths) that might be touching an autoloader constant (a model or job class, for example)

Not a lot going on in the initializers or config folder, just configuring some gems like pay-rails, high-voltage, and rails_icons. I do have Sentry, but it is behind an if Rails.env.production? check.

However, there is something I'm thinking about now: a calendar is linked to a webhook (called a watch channel), and if the user decides to stop listening to the webhook, I need to delete the job that is enqueued to renew it:

class WatchChannel < ApplicationRecord
  before_destroy :dequeue_renewal_job, if: -> { !expired? }
  ...

  def dequeue_renewal_job
    GoodJob::Job.scheduled.find_by(id: renewal_job_id)&.discard_job(
      "Watch channel stopped earlier"
    )
  end

@bensheldon
Owner

That deletion is not my favorite (it's generally not safe to reach directly into the Jobs), but I don't think it would be the cause of this (though maybe!).

I think you could remove it by having the renewal job do discard_on ActiveJob::DeserializationError, assuming the deleted WatchChannel record is one of that renewal job's arguments: Active Job would raise that error when deserializing the job's arguments because it can't load the deleted record from the database.
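
A minimal sketch of that approach, assuming the renewal job takes the WatchChannel as its argument (the class and method bodies here are illustrative, not from your app):

class Calendars::RenewWatchChannelJob < ApplicationJob
  # If the WatchChannel has been destroyed by the time this job runs, Active Job
  # raises DeserializationError while loading the argument; discarding the job
  # here would make the before_destroy dequeue callback unnecessary.
  discard_on ActiveJob::DeserializationError

  def perform(watch_channel)
    # re-create the Google watch channel for the calendar...
  end
end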

I still don't think that it would lock up the entire process though; the process heartbeats happen on their own thread/scheduler and should be independent of jobs. It still looks to me like a Ruby VM deadlock. 🤔

@bensheldon
Owner

Here's the best way to debug this:

  1. Add rbtrace to your gemfile
  2. When the deadlock happens, get the PID of the process that's hanging, and run this:
    bundle exec rbtrace --pid THE_PID -e "puts output = ActionDispatch::DebugLocks.new(nil).send(:render_details, nil); output"
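
For step 1, a minimal Gemfile entry could look like this (putting it in the development group is just a suggestion):

# Gemfile
group :development do
  gem "rbtrace" # attach to and inspect a running Ruby process
end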

I have a script here with some helpers if you want to see what else you might do, but it's that DebugLocks one that is useful:

https://github.com/bensheldon/good_job/blob/fdf9a01fc8c5f93a5394a2a59cc59ba1ab82af03/demo/config/initializers/rbtrace.rb

@luizkowalski
Contributor Author

assuming that the deleted WatchChannel record is one of the arguments of that renewal job

Unfortunately it is not; it is triggered by the calendar, so the calendar is the argument. Unrelated, but I now see that "renew" is the wrong term here: there is no renewal, once the webhook expires you create a new one, you don't renew it.

I will add rbtrace and wait for it to happen again. It doesn't look like it's GoodJob's fault here, so I think this issue can be closed (or moved to a discussion?). I'll update if I have more data, and if it is indeed GJ's fault somehow, we can reopen this.

Thanks a lot once more, Ben!
