Skip to content

Commit

Permalink
(PE-36939) query.monitor: don't restart pdb for monitor failures
Browse files Browse the repository at this point in the history
If the monitor thread fails, make sure to report it as an error in the
log, but don't allow it to restart pdb.  This does assume that even on
failures, the monitor doesn't entangle other code in ways that could
break queries (or anything else critical).

The current expectation is that for this particular subsystem, since
it is fairly detached, it's better to allow pdb to continue in a
degraded (query monitorless) state until the problem can be fixed with
a new release, than to risk catastrophic reboot loops.
  • Loading branch information
rbrw committed Oct 27, 2023
1 parent a8bd876 commit f4dbee8
Show file tree
Hide file tree
Showing 3 changed files with 12 additions and 13 deletions.
3 changes: 1 addition & 2 deletions src/puppetlabs/puppetdb/cli/services.clj
Original file line number Diff line number Diff line change
Expand Up @@ -1032,8 +1032,7 @@
% request-shutdown "gc")

query-monitor (when (and env-monitor-queries? monitor-queries?)
(-> (qmon/monitor :on-fatal-error request-shutdown)
qmon/start))
(qmon/start (qmon/monitor)))
:error #(when %
(or (qmon/stop % (stop-query-monitor-wait-ms))
(log/error
Expand Down
19 changes: 10 additions & 9 deletions src/puppetlabs/puppetdb/query/monitor.clj
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,10 @@
set the info's :forget to true.
No operations should block forever; they should all eventually (in
some cases customizably) time out.
some cases customizably) time out, and the current implementation is
intended, overall, to try to let pdb keep running, even if the
monitor (thread) dies somehow. The precipitating errors should
still be reported to the log.
Every monitored query will have a SelectionKey associated with it,
The key is cancelled during forget, but won't be removed from the
Expand Down Expand Up @@ -82,7 +85,7 @@
[puppetlabs.puppetdb.jdbc :as jdbc]
[puppetlabs.puppetdb.scf.storage-utils :refer [db-metadata]]
[puppetlabs.puppetdb.time :refer [ephemeral-now-ns]]
[puppetlabs.puppetdb.utils :refer [with-monitored-execution]]
[puppetlabs.puppetdb.utils :refer [with-noisy-failure]]
[schema.core :as s])
(:import
(java.lang AutoCloseable)
Expand Down Expand Up @@ -261,11 +264,10 @@
ns-per-ms)))))

(defn- monitor-queries [{:keys [exit queries ^Selector selector] :as _monitor}
terminate-query
on-fatal-error]
terminate-query]
;; We depend on the fact that any new queries will wake us
;; up (i.e. wrt possible deadline advancements).
(with-monitored-execution on-fatal-error
(with-noisy-failure
;; Create a non-trivial buffer so we'll clear out any noise from
;; the client (shouldn't be any). On an X release, suppose we
;; could make that an error.
Expand Down Expand Up @@ -304,9 +306,8 @@
(compare (.hashCode skey-1) (.hashCode skey-2)))))

(defn monitor
[& {:keys [terminate-query on-fatal-error]
:or {terminate-query puppetlabs.puppetdb.query.monitor/terminate-query
on-fatal-error identity}}]
[& {:keys [terminate-query]
:or {terminate-query puppetlabs.puppetdb.query.monitor/terminate-query}}]

;; With the current implementation, there may not always be an entry
;; in :deadlines corresponding to an entry in :selector-keys. In
Expand All @@ -324,7 +325,7 @@
:deadlines (sorted-map-by compare-deadline-keys)})
:terminate-query terminate-query})]
(assoc m :thread
(Thread. #(monitor-queries m terminate-query on-fatal-error)
(Thread. #(monitor-queries m terminate-query)
"pdb query monitor"))))

(defn start [{:keys [^Thread thread] :as monitor}]
Expand Down
3 changes: 1 addition & 2 deletions test/puppetlabs/puppetdb/query/monitor_test.clj
Original file line number Diff line number Diff line change
Expand Up @@ -98,8 +98,7 @@
termination-latch (CountDownLatch. expected-terminations)]
(with-log-suppressed-unless-notable notable?
(with-open [m (qmon/monitor :terminate-query (recording-terminator terminations
termination-latch)
:on-fatal-error identity)]
termination-latch))]
(qmon/start m)

;; Test a query that finishes before its deadline (ok), a
Expand Down

0 comments on commit f4dbee8

Please sign in to comment.