Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Meet java.util.concurrent.RejectedExecutionException after running some days #913

Closed
keyboardfann opened this issue Feb 27, 2018 · 11 comments

Comments

@keyboardfann
Copy link

keyboardfann commented Feb 27, 2018

When I run the riemann after some days, it will start to throw the error: java.util.concurrent.RejectedExecutionException , I see another resolved issue #655 change the queue-size to 10000, I have to change the queue size to 10000 and max-pool-size to 1024 but after some days the alert will show again and the alert will run out of our disk space.

riemann version: 0.2.14

riemann.conf

; -*- mode: clojure; -*-
; vim: filetype=clojure


; non-streams related configuration is located here
(include "/etc/riemann/include")

(instrumentation {:interval 60 :enabled? true})

(let [index (index)]
  ; Inbound events will be passed to these streams:
  (streams
    (default :ttl 60
      ; Index all events immediately.
      index

      ; Log expired events.
      (expired
        (fn [event] (info "expired" event)))))

  (streams
        ;Output riemann internal metrics to log
	(where  (service #"^riemann ")
	  #(info %)
	)

    (by [:host :service]
      (where (not (expired? event))
        (smap (fn [events]
          (dissoc
             (assoc events :site (get (re-find (re-pattern "([A-Za-z0-9]*)[.](ops)[.]([a-z]*)") hostname) 1) :ttl 360 )
           :okthreshold :criticalthreshold :departmentkey :alertformat :alerter :alertstate
           :alertdescription :failure_message :step_message :sqlinstance_owner :is_job_running :original_host :original_service
               :description :job_category :job_id :job_owner :failure_time :next_runtime :frequency :step_id :sqlinstanceowner
            ))
     send_to_central_riemann))))
)

include/centralriemann.clj

; -*- mode: clojure; -*-
; vim: filetype=clojure

; Load the Riemann TCP client. Were going to use this client to send events downstream.
(require 'riemann.client)

; Forward events to the central riemann server
; 1. We define the downstream symbol ('send_to_central_riemann'). When that symbol is referenced events are passed into it.
; 2. Events first go into the batch stream. Every 100 events or 1/10th of a second events are batched and sent on.
; 3. The batched events are passed to the async-queue! stream.
; 4. The async-queue! stream passes the events to the forward stream which sends them to the central riemann server over 5555
(def send_to_central_riemann (batch 1000 1/10 
				(async-queue! :agg { :queue-size 10000
					:core-pool-size 32 
					:max-pool-size 1024}
						(forward
							(riemann.client/tcp-client :host "10.x.x.x")))) ; Need to set it to loadbalancer IP

riemann.log error log

102497 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server udp 0.0.0.0:5555 in latency 0.95, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 379907009233/250, :ttl 120}
 102498 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server udp 0.0.0.0:5555 in latency 0.99, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 379907009233/250, :ttl 120}
 102499 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server udp 0.0.0.0:5555 in latency 0.999, :state ok, :description nil, :metric nil, :tags [riemann], :        time 379907009233/250, :ttl 120}
 102500 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 conns, :state ok, :description nil, :metric 0, :tags [riemann], :time 379907009        233/250, :ttl 120}
 102501 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out rate, :state ok, :description nil, :metric 0.0, :tags [riemann], :time 3799        07009233/250, :ttl 120}
 102502 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in rate, :state ok, :description nil, :metric 0.0, :tags [riemann], :time 37990        7009233/250, :ttl 120}
 102503 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.0, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 379907009233/250, :ttl 120}
 102504 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.5, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 379907009233/250, :ttl 120}
 102505 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.95, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 379907009233/250, :ttl 120}
 102506 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.99, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 379907009233/250, :ttl 120}
 102507 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.999, :state ok, :description nil, :metric nil, :tags [riemann], :        time 379907009233/250, :ttl 120}
 102508 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.0, :state ok, :description nil, :metric nil, :tags [riemann], :tim        e 379907009233/250, :ttl 120}
 102509 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.5, :state ok, :description nil, :metric nil, :tags [riemann], :tim        e 379907009233/250, :ttl 120}
 102510 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.95, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 379907009233/250, :ttl 120}
 102511 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.99, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 379907009233/250, :ttl 120}
 102512 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.999, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 379907009233/250, :ttl 120}
 102513 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg accepted rate, :state ok, :description nil, :metric 223500/30001, :tags [riemann], :time         379907009233/250, :ttl 120}
 102514 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg completed rate, :state ok, :description nil, :metric 0N, :tags [riemann], :time 379907009        233/250, :ttl 120}
 102515 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg rejected rate, :state ok, :description nil, :metric 0N, :tags [riemann], :time 3799070092        33/250, :ttl 120}
 102516 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg queue capacity, :state ok, :description nil, :metric 10000, :tags [riemann], :time 379907        009233/250, :ttl 120}
 102517 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg queue size, :state critical, :description nil, :metric 9991, :tags [riemann], :time 37990        7009233/250, :ttl 120}
 102518 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg queue used, :state critical, :description nil, :metric 9991/10000, :tags [riemann], :time         379907009233/250, :ttl 120}
 102519 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg threads active, :state ok, :description nil, :metric 246, :tags [riemann], :time 37990700        9233/250, :ttl 120}
 102520 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg threads alive, :state ok, :description nil, :metric 246, :tags [riemann], :time 379907009        233/250, :ttl 120}
...
...
102557 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server udp 0.0.0.0:5555 in latency 0.99, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 759814048467/500, :ttl 120}
 102558 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server udp 0.0.0.0:5555 in latency 0.999, :state ok, :description nil, :metric nil, :tags [riemann], :        time 759814048467/500, :ttl 120}
 102559 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 conns, :state ok, :description nil, :metric 0, :tags [riemann], :time 759814048        467/500, :ttl 120}
 102560 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out rate, :state ok, :description nil, :metric 0.0, :tags [riemann], :time 7598        14048467/500, :ttl 120}
 102561 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in rate, :state ok, :description nil, :metric 0.0, :tags [riemann], :time 75981        4048467/500, :ttl 120}
 102562 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.0, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 759814048467/500, :ttl 120}
 102563 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.5, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 759814048467/500, :ttl 120}
 102564 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.95, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 759814048467/500, :ttl 120}
 102565 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.99, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 759814048467/500, :ttl 120}
 102566 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.999, :state ok, :description nil, :metric nil, :tags [riemann], :        time 759814048467/500, :ttl 120}
 102567 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.0, :state ok, :description nil, :metric nil, :tags [riemann], :tim        e 759814048467/500, :ttl 120}
 102568 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.5, :state ok, :description nil, :metric nil, :tags [riemann], :tim        e 759814048467/500, :ttl 120}
 102569 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.95, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 759814048467/500, :ttl 120}
 102570 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.99, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 759814048467/500, :ttl 120}
 102571 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.999, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 759814048467/500, :ttl 120}
 102572 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg accepted rate, :state ok, :description nil, :metric 152000/20001, :tags [riemann], :time         303925619387/200, :ttl 120}
 102573 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg completed rate, :state ok, :description nil, :metric 0N, :tags [riemann], :time 303925619        387/200, :ttl 120}
 102574 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg rejected rate, :state ok, :description nil, :metric 0N, :tags [riemann], :time 3039256193        87/200, :ttl 120}
 102575 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg queue capacity, :state ok, :description nil, :metric 10000, :tags [riemann], :time 303925        619387/200, :ttl 120}
 102576 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg queue size, :state critical, :description nil, :metric 10000, :tags [riemann], :time 3039        25619387/200, :ttl 120}
 102577 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg queue used, :state critical, :description nil, :metric 1, :tags [riemann], :time 30392561        9387/200, :ttl 120}
 102578 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg threads active, :state ok, :description nil, :metric 693, :tags [riemann], :time 30392561        9387/200, :ttl 120}
 102579 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg threads alive, :state ok, :description nil, :metric 693, :tags [riemann], :time 303925619        387/200, :ttl 120}
...
...
102599 java.util.concurrent.RejectedExecutionException: Task clojure.core$bound_fn_STAR_$fn__4671@c1fdfcf rejected from java.util.concurrent.ThreadPoolExecutor@4b1f67c8[Running, pool size = 1024, active threads = 1024, queued tasks = 10        000, completed tasks = 8356055]
 102600         at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
 102601         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
 102602         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
 102603         at riemann.service.ExecutorServiceService.execute(service.clj:173)
 102604         at riemann.streams$execute_on$stream__7093.invoke(streams.clj:266)
 102605         at riemann.streams$batch$flush__7957$fn__7963.invoke(streams.clj:1154)
 102606         at riemann.streams$batch$flush__7957.invoke(streams.clj:1154)
 102607         at riemann.streams$part_time_simple$tick__7301.invoke(streams.clj:610)
 102608         at riemann.time.Once.run(time.clj:42)
 102609         at riemann.time$run_tasks_BANG_$fn__5370$fn__5371.invoke(time.clj:154)
 102610         at riemann.time$run_tasks_BANG_$fn__5370.invoke(time.clj:153)
 102611         at riemann.time$run_tasks_BANG_.invokeStatic(time.clj:147)
 102612         at riemann.time$run_tasks_BANG_.invoke(time.clj:142)
 102613         at riemann.time$start_BANG_$fn__5386$fn__5387.invoke(time.clj:189)
 102614         at clojure.lang.AFn.applyToHelper(AFn.java:152)
 102615         at clojure.lang.AFn.applyTo(AFn.java:144)
 102616         at clojure.core$apply.invokeStatic(core.clj:646)
 102617         at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1881)
 102618         at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1881)
 102619         at clojure.lang.RestFn.invoke(RestFn.java:425)
 102620         at clojure.lang.AFn.applyToHelper(AFn.java:156)
 102621         at clojure.lang.RestFn.applyTo(RestFn.java:132)
 102622         at clojure.core$apply.invokeStatic(core.clj:650)
 102623         at clojure.core$bound_fn_STAR_$fn__4671.doInvoke(core.clj:1911)
 102624         at clojure.lang.RestFn.invoke(RestFn.java:397)
 102625         at clojure.lang.AFn.run(AFn.java:22)
 102626         at java.lang.Thread.run(Thread.java:748)
 102627 WARN [2018-02-26 14:55:39,096] riemann task 0 - riemann.streams - riemann.streams$execute_on$stream__7093@194d221c threw
 102628 java.util.concurrent.RejectedExecutionException: Task clojure.core$bound_fn_STAR_$fn__4671@69dab334 rejected from java.util.concurrent.ThreadPoolExecutor@4b1f67c8[Running, pool size = 1024, active threads = 1024, queued tasks = 1        0000, completed tasks = 8356055]
 102629         at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
 102630         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
 102631         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
 102632         at riemann.service.ExecutorServiceService.execute(service.clj:173)
 102633         at riemann.streams$execute_on$stream__7093.invoke(streams.clj:266)
 102634         at riemann.streams$batch$flush__7957$fn__7963.invoke(streams.clj:1154)
 102635         at riemann.streams$batch$flush__7957.invoke(streams.clj:1154)
 102636         at riemann.streams$part_time_simple$tick__7301.invoke(streams.clj:610)
 102637         at riemann.time.Once.run(time.clj:42)
 102638         at riemann.time$run_tasks_BANG_$fn__5370$fn__5371.invoke(time.clj:154)
 102639         at riemann.time$run_tasks_BANG_$fn__5370.invoke(time.clj:153)
 102640         at riemann.time$run_tasks_BANG_.invokeStatic(time.clj:147)
 102641         at riemann.time$run_tasks_BANG_.invoke(time.clj:142)
@mcorbin
Copy link
Contributor

mcorbin commented Feb 27, 2018 via email

@keyboardfann
Copy link
Author

Hi @mcorbin,
This is the status at the moment. The complele rate from 7 drop to 0. But I don't know it's network issue or what happened.
https://imgur.com/a/el3Nh

@mcorbin
Copy link
Contributor

mcorbin commented Mar 1, 2018

I dont know what happened. You had any other exceptions in the logs ?

Btw, in your example code:

(def send_to_central_riemann (batch 1000 1/10 
				(async-queue! :agg { :queue-size 10000
					:core-pool-size 32 
					:max-pool-size 1024}
						(forward
							(riemann.client/tcp-client :host "10.x.x.x")))) ; Need to set it to loadbalancer IP

a parenthesis is missing at the end.

@keyboardfann
Copy link
Author

HI @mcorbin
No error log in the log file, the riemann I miss copy a parenthesis, it's ok on our system. Thank you for notification. It happens sometimes, is there any way to check when it happens next time?

@mcorbin
Copy link
Contributor

mcorbin commented Mar 2, 2018

So when it happens, the Riemann on 10.x.x.x stop receiving events completely ?
What happen to the Riemann forwarding the events ? does it stop emitting statistics ? Your previous graph was the stats of the Riemann forwarding the events or the Riemann receiving the events on "10.x.x.x " ?

@keyboardfann
Copy link
Author

Hi @mcorbin,
We have improved our central riemann code to reduce CPU usage, and monitor if it happens again. If nothing happens again this month, I will close the ticket. Thank you for the help.

@jamtur01
Copy link
Member

Excellent. Please re-open if your issue reappears. Thanks.

@keyboardfann
Copy link
Author

keyboardfann commented Mar 23, 2018

Hi @jamtur01, it happen again, could you help to reopen it?

Hi @mcorbin ,
Here is the workflow tcb0124 --> haproxy --> central riemann1,2,3 --> influxdb or logstash.
When the riemann forwarder shows the error, the central riemanns seems works fine.
image

And the riemann forwarder stats and other data seems can't send to center riemann and influxdb, so there is no metrics on the graph after the error happen. I think the forwarder stats on the graph is the Riemann forwarding the events, not receiver stats.

image
image

@jamtur01 jamtur01 reopened this Mar 23, 2018
@mcorbin
Copy link
Contributor

mcorbin commented Mar 23, 2018

I think you need to use a tool like visualvm and perform thread and heap dump and analyze the threads states on the frozen Riemann process.

You can do it by adding on /etc/default/riemann (or /etc/sysconfig on RHEL) these options:

EXTRA_JAVA_OPTS="-Dcom.sun.management.jmxremote.port=4444 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"

Be careful, this example configure jmx without authentication/security (but you can also configure jmx with password auth/tls/...).

You can then analyze your Riemann process using for example https://visualvm.github.io/download.html, to see the thread states and perform thread dumps.

You can also do heap dump using jmap, with this command for example:

sudo -u riemann bash -c 'jmap -dump:format=b,file=/tmp/heapdump.bin <RIEMANN_PID>'

and then open it in visualvm.

@keyboardfann
Copy link
Author

Let me try and see if I can find any issue.

@keyboardfann
Copy link
Author

seems don't happen again, I close it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants