
Performance issue with producer - low throughput #3328

Closed
4 tasks done
sendaun opened this issue Apr 5, 2021 · 21 comments

@sendaun

sendaun commented Apr 5, 2021

Description

My application has a throughput requirement of at least 5000 records per second (with records ranging from 200 to 300 bytes each). To that end, I have been testing librdkafka throughput for the past week by tweaking various configuration parameters. So far the maximum throughput I have seen is ~1560 records per second (nowhere near the performance numbers I read here).

A few observations to share:

(1) As suggested in various forums, I tried various batch sizes and queue.buffering.max.ms values (between 1 ms and 1000 ms), but saw hardly any difference in throughput.

(2) The statistics show significantly high internal latency. [Attached]
librd_stat.txt

(3) Adding another producer instance almost doubles the throughput (~2k to 2.5k records per second).

(4) As observed in issue #2356, I also notice that reducing the record size from 300 bytes to 30 bytes gives a drastic increase in throughput (I don't have the exact numbers, but it is significantly higher).

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): v1.6.1
  • Apache Kafka version: 2.7.0
  • Operating system: z/OS (Unix System Services)
  • librdkafka client configuration:
    client.id = rdkafka
    client.software.name = librdkafka
    metadata.broker.list = 192.0.1.2:9092
    message.max.bytes = 600000
    message.copy.max.bytes = 65535
    receive.message.max.bytes = 100000000
    max.in.flight.requests.per.connection = 1000000
    metadata.request.timeout.ms = 60000
    topic.metadata.refresh.interval.ms = 5000
    metadata.max.age.ms = 900000
    topic.metadata.refresh.fast.interval.ms = 250
    topic.metadata.refresh.fast.cnt = 10
    topic.metadata.refresh.sparse = true
    topic.metadata.propagation.max.ms = 30000
    debug =
    socket.timeout.ms = 60000
    socket.blocking.max.ms = 1
    socket.send.buffer.bytes = 0
    socket.receive.buffer.bytes = 0
    socket.keepalive.enable = false
    socket.nagle.disable = false
    socket.max.fails = 1
    broker.address.ttl = 1000
    broker.address.family = any
    enable.sparse.connections = true
    reconnect.backoff.jitter.ms = 0
    reconnect.backoff.ms = 100
    reconnect.backoff.max.ms = 10000
    statistics.interval.ms = 10000
    enabled_events = 0
    stats_cb = 0x1e60a4e0
    log_cb = 0x1e643cd0
    log_level = 6
    log.queue = false
    log.thread.name = true
    enable.random.seed = true
    log.connection.close = true
    socket_cb = 0x1e6afa58
    open_cb = 0x1e7db490
    default_topic_conf = 0x5000102b60
    internal.termination.signal = 0
    api.version.request = true
    api.version.request.timeout.ms = 10000
    api.version.fallback.ms = 0
    broker.version.fallback = 0.10.0
    security.protocol = plaintext
    ssl.ca.certificate.stores = Root
    enable.ssl.certificate.verification = true
    ssl.endpoint.identification.algorithm = none
    sasl.mechanisms = GSSAPI
    sasl.kerberos.service.name = kafka
    sasl.kerberos.principal = kafkaclient
    sasl.kerberos.kinit.cmd = kinit -R -t "%{sasl.kerberos.keytab}" -k %{sasl.kerberos.principal} || kinit -t "%{sasl.kerberos.keytab}" -k %{sasl.kerberos.principal}
    sasl.kerberos.min.time.before.relogin = 60000
    enable.sasl.oauthbearer.unsecure.jwt = false
    test.mock.num.brokers = 0
    partition.assignment.strategy = range,roundrobin
    session.timeout.ms = 10000
    heartbeat.interval.ms = 3000
    group.protocol.type = consumer
    coordinator.query.interval.ms = 600000
    max.poll.interval.ms = 300000
    enable.auto.commit = true
    auto.commit.interval.ms = 5000
    enable.auto.offset.store = true
    queued.min.messages = 100000
    queued.max.messages.kbytes = 65536
    fetch.wait.max.ms = 500
    fetch.message.max.bytes = 1048576
    fetch.max.bytes = 52428800
    fetch.min.bytes = 1
    fetch.error.backoff.ms = 500
    offset.store.method = broker
    isolation.level = read_committed
    enable.partition.eof = false
    check.crcs = false
    allow.auto.create.topics = false
    client.rack =
    transaction.timeout.ms = 60000
    enable.idempotence = false
    enable.gapless.guarantee = false
    queue.buffering.max.messages = 100000
    queue.buffering.max.kbytes = 1048576
    queue.buffering.max.ms = 1000
    message.send.max.retries = 2147483647
    retry.backoff.ms = 100
    queue.buffering.backpressure.threshold = 1
    compression.codec = none
    batch.num.messages = 1000
    batch.size = 1000000
    delivery.report.only.error = true
    sticky.partitioning.linger.ms = 10

Broker configuration: 3 brokers, 1 topic with 3 partitions, replication factor 3

@edenhill
Contributor

edenhill commented Apr 5, 2021

Are you perhaps doing synchronous producing, that is, you wait for each produce() to complete by calling flush() or poll(long_timeout)?
https://github.com/edenhill/librdkafka/wiki/FAQ#why-is-there-no-sync-produce-interface
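(For readers landing here from search: a minimal sketch of the anti-pattern being asked about, assuming a hypothetical rd_kafka_t *rk producer handle and the topic "kpis" seen later in this thread. Waiting for every message defeats batching.)

#include <librdkafka/rdkafka.h>

/* Sketch only: "synchronous" producing - one message per round trip.
 * rk, payload and len are assumed to exist in the caller. */
rd_kafka_producev(rk,
                  RD_KAFKA_V_TOPIC("kpis"),
                  RD_KAFKA_V_VALUE(payload, len),
                  RD_KAFKA_V_MSGFLAGS(RD_KAFKA_MSG_F_COPY),
                  RD_KAFKA_V_END);
rd_kafka_flush(rk, 10000);   /* blocks until this single message is delivered,
                              * so batches never fill and throughput collapses */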

@sendaun
Author

sendaun commented Apr 5, 2021

Thanks for the quick response @edenhill. I am not doing synchronous produce. My producer application has a separate polling thread that polls indefinitely:
while (1) {
    rd_kafka_poll(rk, 1000);
}

Apart from this, no poll() (or flush()) is done in the main thread that produce()s the records.
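(A hedged sketch of the asynchronous produce loop described above; rk, records and num_records are hypothetical names, and the real application code is not shown in this thread.)

#include <librdkafka/rdkafka.h>

/* Sketch only: the main thread enqueues messages and never blocks;
 * the dedicated thread running rd_kafka_poll(rk, 1000) serves the
 * delivery reports. */
for (size_t i = 0; i < num_records; i++) {
    rd_kafka_resp_err_t err = rd_kafka_producev(
            rk,
            RD_KAFKA_V_TOPIC("kpis"),
            RD_KAFKA_V_VALUE(records[i].data, records[i].len),
            RD_KAFKA_V_MSGFLAGS(RD_KAFKA_MSG_F_COPY),
            RD_KAFKA_V_END);
    if (err == RD_KAFKA_RESP_ERR__QUEUE_FULL)
        rd_kafka_poll(rk, 10);   /* local queue full: back off briefly */
}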

@edenhill
Contributor

edenhill commented Apr 6, 2021

What configuration properties are you setting? (it is quite a bit of work to see which ones are set to non-default values in the config dump above)
Alternatively use debug=conf to get them printed on startup.
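(For reference, a sketch of setting debug=conf programmatically while the configuration object is still being built; rd_kafka_conf_set() is the standard way to set any of these properties.)

#include <librdkafka/rdkafka.h>
#include <stdio.h>

/* Sketch only: enable the "conf" debug context so librdkafka logs the
 * user-supplied configuration at startup. */
char errstr[512];
rd_kafka_conf_t *conf = rd_kafka_conf_new();
if (rd_kafka_conf_set(conf, "debug", "conf",
                      errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
    fprintf(stderr, "config error: %s\n", errstr);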

Are there any process limits in play, perhaps number of concurrent threads running?

@sendaun
Author

sendaun commented Apr 7, 2021

These are the only configurations set:

%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: Client configuration:
%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: client.software.version = 1.6.1
%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: metadata.broker.list = 192.0.1.2:9092
%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: debug = conf
%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: queue.buffering.max.ms = 1000
%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: batch.num.messages = 1000
%7|1617777172.751|CONF|rdkafka#producer-1| [thrd::0/internal]: delivery.report.only.error = true

I checked my system configuration for process limits; the current configuration allows up to 1500 concurrent threads:

CURRENT UNIX CONFIGURATION SETTINGS:
MAXPROCSYS = 1500 MAXPROCUSER = 250
MAXFILEPROC = 256000 MAXFILESIZE = NOLIMIT
MAXCPUTIME = 1000 MAXUIDS = 200
MAXPTYS = 256 MAXIOBUFUSER = 2048
MAXMMAPAREA = 40960 MAXASSIZE = 2G
MAXTHREADS = 1500 MAXTHREADTASKS = 1500
MAXCORESIZE = 4194304 MAXSHAREPAGES = 131072
IPCMSGQBYTES = 2147483647 IPCMSGQMNUM = 10000
IPCMSGNIDS = 500 IPCSEMNIDS = 500
IPCSEMNOPS = 25 IPCSEMNSEMS = 50
IPCSHMMPAGES = 2048 IPCSHMNIDS = 500
IPCSHMNSEGS = 500 IPCSHMSPAGES = 262144
SUPERUSER = BPXROOT FORKCOPY = COPY

@sendaun
Author

sendaun commented Apr 7, 2021

  • Turned on protocol/msg logging. Attached for your reference.
    librd_log.txt

@edenhill
Contributor

edenhill commented Apr 7, 2021

Oh, you are setting acks=0:

%7|1617778132.515|CONF|rdkafka#producer-1| [thrd:192.0.1.2:9092/bootstrap]: Topic "kpis" configuration (user-supplied):
%7|1617778132.515|CONF|rdkafka#producer-1| [thrd:192.0.1.2:9092/bootstrap]:   request.required.acks = 0

That will not work well, since the client will not receive much back pressure or pacing from the broker but will just blast away messages as quickly as it can, presumably faster than the broker can handle.
I highly recommend never using acks=0. Use acks=1 if you don't care about messaging guarantees, or acks=all if you do.
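(A sketch of the suggested change, reusing the conf object and errstr buffer from the earlier sketch; "acks" is librdkafka's alias for request.required.acks.)

/* Sketch only: request broker acknowledgements so the producer gets
 * pacing/back pressure instead of blasting unacked messages.
 * Use "all" rather than "1" if delivery guarantees matter. */
if (rd_kafka_conf_set(conf, "acks", "1",
                      errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
    fprintf(stderr, "config error: %s\n", errstr);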

@sendaun
Author

sendaun commented Apr 7, 2021

acks=1 did not yield much throughput gain either; it is still around 1724 records per second.
Log (attempt to produce 100K records, ~300 bytes each):
librd_log_acks1.txt

@edenhill
Contributor

edenhill commented Apr 7, 2021

I recommend removing "batch.num.messages"; it is better to let the client fill batches as large as possible.

@edenhill
Contributor

edenhill commented Apr 7, 2021

You can also reduce linger.ms to 100 ms
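(Combining the last two suggestions in the same sketch: drop the explicit batch.num.messages setting and lower the linger time; linger.ms is librdkafka's alias for queue.buffering.max.ms.)

/* Sketch only: let batches grow toward batch.size (leave batch.num.messages
 * at its default) and wait at most 100 ms for a batch to fill. */
if (rd_kafka_conf_set(conf, "linger.ms", "100",
                      errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
    fprintf(stderr, "config error: %s\n", errstr);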

@sendaun
Author

sendaun commented Apr 7, 2021

Still no luck; we came down to ~1100 rps.

librd_log_100ms.txt

@edenhill
Contributor

edenhill commented Apr 7, 2021

When the socket buffers become full there's this 1s delay between sends:

%7|1617783327.850|SEND|rdkafka#producer-1| [thrd:192.0.1.2:9094/3]: 192.0.1.2:9093/2: Sent partial ProduceRequest (v7, 704037+65536/999872 bytes, CorrId 7)
%7|1617783327.850|SEND|rdkafka#producer-1| [thrd:192.0.1.2:9094/3]: 192.0.1.2:9093/2: Sent partial ProduceRequest (v7, 769573+65536/999872 bytes, CorrId 7)
%7|1617783328.851|SEND|rdkafka#producer-1| [thrd:192.0.1.2:9094/3]: 192.0.1.2:9093/2: Sent partial ProduceRequest (v7, 835109+94419/999872 bytes, CorrId 7)
%7|1617783329.851|SEND|rdkafka#producer-1| [thrd:192.0.1.2:9094/3]: 192.0.1.2:9093/2: Sent partial ProduceRequest (v7, 929528+65536/999872 bytes, CorrId 7)
%7|1617783329.852|SEND|rdkafka#producer-1| [thrd:192.0.1.2:9094/3]: 192.0.1.2:9093/2: Sent ProduceRequest (v7, 999872 bytes @ 995064, CorrId 7)

Are you comfortable with rebuilding librdkafka yourself?
If so I'd like you to change this line from 1000 to 10 and then try to reproduce:
https://github.com/edenhill/librdkafka/blob/master/src/rdkafka_broker.c#L78
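(For anyone reproducing this experiment: the suggested edit boils down to something like the following; the exact declaration and line number in rdkafka_broker.c may differ between versions, so treat this as a sketch rather than the upstream source verbatim.)

/* src/rdkafka_broker.c - experiment only, not a recommended permanent change:
 * lower the broker thread's maximum blocking time from 1000 ms to 10 ms. */
const int rd_kafka_max_block_ms = 10;   /* was 1000 */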

@sendaun
Author

sendaun commented Apr 7, 2021

That did the trick! I see the throughput went up to ~25K to 33K rps.

librd_log_blockms_10.txt

Now, would it make sense to make the max block time (rd_kafka_max_block_ms) configurable by the client?

Thanks for your precious time on this @edenhill

@edenhill
Contributor

edenhill commented Apr 7, 2021

Awesome!
Are those throughput numbers more in line with your requirements?

The max_block_ms is a safety harness that shouldn't really come into play; it is just there to make sure we don't over-poll or under-poll, and the code should wake up automatically when there are more messages.
I think this might be related to the queue wakeup issues discussed here: #2912

@sendaun
Author

sendaun commented Apr 7, 2021

Indeed!
The numbers I see exceed my application's current throughput requirement. I will settle on a nominal max_block_ms value so I don't over- or under-poll.

The issue can be closed. Thanks again!

@edenhill
Contributor

edenhill commented Apr 7, 2021

Dup of #2912

@edenhill edenhill closed this as completed Apr 7, 2021
@edenhill
Contributor

edenhill commented Apr 8, 2021

@sendaun Could you try the https://github.com/edenhill/librdkafka/tree/nokinitinnew branch? (without reducing max_block_ms)
I think it might fix this issue.

@sendaun
Author

sendaun commented Apr 8, 2021

Sure, I will test and keep you posted on the result.

@sendaun
Author

sendaun commented Apr 8, 2021

Hi @edenhill, it appears this does not fix the problem. I tested with multiple runs but was unable to cross 1500 records per second.

librd_log_nokinitinnew.txt

@edenhill
Contributor

Hey, can you try to reproduce on this branch? https://github.com/edenhill/librdkafka/tree/qwakeupfix

@sendaun
Author

sendaun commented Apr 14, 2021

Unable to locate the branch. Should I try master?

@edenhill
Contributor

edenhill commented Apr 14, 2021 via email
