fluent-bit: modify hot reload handler #8041

Merged
merged 1 commit into from
Oct 19, 2023

@nokute78 nokute78 commented Oct 14, 2023

This patch modifies the signal handler used for hot reloading.

Sending multiple SIGHUP signals in quick succession can cause a SIGSEGV, as shown in the following log.
This patch prevents that crash.

$ bin/fluent-bit -c a.conf 
Fluent Bit v2.2.0
* Copyright (C) 2015-2023 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/10/14 10:26:43] [ info] [fluent bit] version=2.2.0, commit=8401076f11, pid=33322
[2023/10/14 10:26:43] [ info] [storage] ver=1.2.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2023/10/14 10:26:43] [ info] [cmetrics] version=0.6.3
[2023/10/14 10:26:43] [ info] [ctraces ] version=0.3.1
[2023/10/14 10:26:43] [ info] [input:cpu:cpu.0] initializing
[2023/10/14 10:26:43] [ info] [input:cpu:cpu.0] storage_strategy='memory' (memory only)
[2023/10/14 10:26:43] [ info] [output:stdout:stdout.0] worker #0 started
[2023/10/14 10:26:43] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2023/10/14 10:26:43] [ info] [sp] stream processor started
[2023/10/14 10:26:48] [engine] caught signal (SIGHUP)
[2023/10/14 10:26:48] [ info] reloading instance pid=33322 tid=0x7fde55899fc0
[2023/10/14 10:26:48] [ info] [reload] stop everything of the old context
[2023/10/14 10:26:48] [ warn] [engine] service will shutdown when all remaining tasks are flushed
(snip)
[2023/10/14 10:26:51] [ info] [reload] start everything
[2023/10/14 10:26:51] [  Error] epoll_wait: Bad file descriptor, errno=9 at /home/taka/git/fluent-bit/lib/monkey/mk_core/mk_event_epoll.c:449
[2023/10/14 10:26:51] [engine] caught signal (SIGSEGV)
[2023/10/14 10:26:51] [engine] caught signal (SIGSEGV)
[2023/10/14 10:26:51] [engine] caught signal (SIGHUP)
[2023/10/14 10:26:51] [ warn] [reload] hot reloading is not enabled
[2023/10/14 10:26:51] [engine] caught signal (SIGHUP)
[2023/10/14 10:26:51] [ warn] [reload] hot reloading is not enabled
#0  0x557d4c1114cd      in  mk_rconf_free() at lib/monkey/mk_core/mk_rconf.c:673
#1  0x557d4b499dd6      in  flb_stop() at src/flb_lib.c:772
#2  0x557d4b48354b      in  flb_main() at src/fluent-bit.c:1426
#3  0x557d4b48359b      in  main() at src/fluent-bit.c:1437
#4  0x7fde55629d8f      in  __libc_start_call_main() at _call_main.h:58
#5  0x7fde55629e3f      in  __libc_start_main_impl() at sr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h:392
#6  0x557d4b47c374      in  ???() at ???:0
#7  0xffffffffffffffff  in  ???() at ???:0
[2]+  Killed                  bin/fluent-bit -c a.conf
Aborted (core dumped)
taka@taka-VirtualBox:~/git/fluent-bit/build$ 

Note: see CERT C rule SIG31-C, "Do not access shared objects in signal handlers":
https://wiki.sei.cmu.edu/confluence/x/VdYxBQ
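
For illustration only, here is a minimal sketch of the general pattern that SIG31-C points toward: the handler records the request in a `volatile sig_atomic_t` flag, and the main loop does the actual work outside signal context. The names `reload_requested`, `on_sighup`, `install_sighup_handler` and `reload_if_requested` are hypothetical and are not taken from the Fluent Bit source. The counter stands in for the real reload work.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag: written by the handler, consumed by the main loop. */
static volatile sig_atomic_t reload_requested = 0;

/* The handler only records the request. It does not touch any other
 * shared state and does not call the reload routine itself. */
static void on_sighup(int signo)
{
    (void) signo;
    reload_requested = 1;
}

/* Install the SIGHUP handler. Returns 0 on success, -1 on failure. */
static int install_sighup_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sighup;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGHUP, &sa, NULL);
}

/* Called from the main loop, outside signal context.
 * Returns 1 if a reload was performed, 0 otherwise.
 * Incrementing *reload_count is a placeholder for the real reload. */
static int reload_if_requested(int *reload_count)
{
    assert(reload_count != NULL);

    if (!reload_requested) {
        return 0;
    }
    reload_requested = 0;
    (*reload_count)++;
    return 1;
}
```

With this shape, several SIGHUPs that arrive before the main loop next checks the flag collapse into a single reload, and the handler can never interrupt a reload that is already running.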


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [N/A] Documentation required for this feature

Backporting

  • [N/A] Backport to latest stable release.

Configuration

[SERVICE]
    Http_Server on
    Hot_reload on

[INPUT]
    Name cpu
    Interval_sec 10

[OUTPUT]
    Name stdout
    Match *

Debug/Valgrind log

Valgrind reported one error, which is not related to this PR.
The same error also occurs on the current master.

$ valgrind --leak-check=full bin/fluent-bit -c a.conf 
==35224== Memcheck, a memory error detector
==35224== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==35224== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==35224== Command: bin/fluent-bit -c a.conf
==35224== 
Fluent Bit v2.2.0
* Copyright (C) 2015-2023 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/10/14 10:32:09] [ info] [fluent bit] version=2.2.0, commit=8401076f11, pid=35224
[2023/10/14 10:32:09] [ info] [storage] ver=1.2.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2023/10/14 10:32:09] [ info] [cmetrics] version=0.6.3
[2023/10/14 10:32:09] [ info] [ctraces ] version=0.3.1
[2023/10/14 10:32:09] [ info] [input:cpu:cpu.0] initializing
[2023/10/14 10:32:09] [ info] [input:cpu:cpu.0] storage_strategy='memory' (memory only)
[2023/10/14 10:32:09] [ info] [output:stdout:stdout.0] worker #0 started
[2023/10/14 10:32:10] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2023/10/14 10:32:10] [ info] [sp] stream processor started
[2023/10/14 10:32:11] [engine] caught signal (SIGHUP)
[2023/10/14 10:32:11] [ info] reloading instance pid=35224 tid=0x541c940
[2023/10/14 10:32:11] [ info] [reload] stop everything of the old context
[2023/10/14 10:32:11] [ warn] [engine] service will shutdown when all remaining tasks are flushed
[2023/10/14 10:32:11] [ info] [input] pausing cpu.0
[2023/10/14 10:32:11] [ info] [engine] service has stopped (0 pending tasks)
[2023/10/14 10:32:11] [ info] [input] pausing cpu.0
[2023/10/14 10:32:12] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2023/10/14 10:32:12] [ info] [output:stdout:stdout.0] thread worker #0 stopped
[2023/10/14 10:32:12] [ info] [reload] start everything
[2023/10/14 10:32:12] [ info] [fluent bit] version=2.2.0, commit=8401076f11, pid=35224
[2023/10/14 10:32:12] [ info] [storage] ver=1.2.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2023/10/14 10:32:12] [ info] [cmetrics] version=0.6.3
[2023/10/14 10:32:12] [ info] [ctraces ] version=0.3.1
[2023/10/14 10:32:12] [ info] [input:cpu:cpu.0] initializing
[2023/10/14 10:32:12] [ info] [input:cpu:cpu.0] storage_strategy='memory' (memory only)
[2023/10/14 10:32:12] [ info] [output:stdout:stdout.0] worker #0 started
[2023/10/14 10:32:12] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2023/10/14 10:32:12] [ info] [sp] stream processor started
^C[2023/10/14 10:32:13] [engine] caught signal (SIGINT)
[2023/10/14 10:32:13] [ warn] [engine] service will shutdown in max 5 seconds
[2023/10/14 10:32:13] [ info] [input] pausing cpu.0
[2023/10/14 10:32:13] [ info] [engine] service has stopped (0 pending tasks)
[2023/10/14 10:32:13] [ info] [input] pausing cpu.0
[2023/10/14 10:32:13] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2023/10/14 10:32:13] [ info] [output:stdout:stdout.0] thread worker #0 stopped
==35224== 
==35224== HEAP SUMMARY:
==35224==     in use at exit: 56 bytes in 1 blocks
==35224==   total heap usage: 7,050 allocs, 7,049 frees, 1,210,602 bytes allocated
==35224== 
==35224== 56 bytes in 1 blocks are definitely lost in loss record 1 of 1
==35224==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==35224==    by 0xE4D375: mk_mem_alloc_z (mk_memory.h:70)
==35224==    by 0xE4D766: thread_get_libco_params (mk_http_thread.c:60)
==35224==    by 0xE4D956: thread_params_set (mk_http_thread.c:168)
==35224==    by 0xE4DC15: mk_http_thread_create (mk_http_thread.c:226)
==35224==    by 0xE49A71: mk_http_init (mk_http.c:748)
==35224==    by 0xE4889D: mk_http_request_prepare (mk_http.c:232)
==35224==    by 0xE4B8DC: mk_http_sched_read (mk_http.c:1576)
==35224==    by 0xE47215: mk_sched_event_read (mk_scheduler.c:695)
==35224==    by 0xE50959: mk_server_worker_loop (mk_server.c:523)
==35224==    by 0xE46B38: mk_sched_launch_worker_loop (mk_scheduler.c:417)
==35224==    by 0x4FF3AC2: start_thread (pthread_create.c:442)
==35224== 
==35224== LEAK SUMMARY:
==35224==    definitely lost: 56 bytes in 1 blocks
==35224==    indirectly lost: 0 bytes in 0 blocks
==35224==      possibly lost: 0 bytes in 0 blocks
==35224==    still reachable: 0 bytes in 0 blocks
==35224==         suppressed: 0 bytes in 0 blocks
==35224== 
==35224== For lists of detected and suppressed errors, rerun with: -s
==35224== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Signed-off-by: Takahiro Yamashita <[email protected]>
@nokute78 nokute78 temporarily deployed to pr October 14, 2023 01:34 — with GitHub Actions Inactive
}

     if (exit_signal) {
         flb_signal_exit(exit_signal);
     }
-    ret = config->exit_status_code;
+    ret = ctx->config->exit_status_code;

This change avoids reading the old config, which has already been freed by the reload, and prevents the following Valgrind warning.

^C[2023/10/14 10:28:56] [engine] caught signal (SIGINT)
==33368== Invalid read of size 4
==33368==    at 0x1C84B2: flb_main (fluent-bit.c:1404)
==33368==    by 0x1C859B: main (fluent-bit.c:1437)
==33368==  Address 0x541f6d8 is 808 bytes inside a block of size 18,592 free'd
==33368==    at 0x484B27F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==33368==    by 0x22418D: flb_free (flb_mem.h:127)
==33368==    by 0x22C28B: flb_config_exit (flb_config.c:545)
==33368==    by 0x1DDB0A: flb_destroy (flb_lib.c:242)
==33368==    by 0x28ECFB: flb_reload (flb_reload.c:504)
==33368==    by 0x1C6E86: flb_signal_handler (fluent-bit.c:613)
==33368==    by 0x4FA151F: ??? (in /usr/lib/x86_64-linux-gnu/libc.so.6)
==33368==    by 0x50447F7: clock_nanosleep@@GLIBC_2.17 (clock_nanosleep.c:78)
==33368==    by 0x5049676: nanosleep (nanosleep.c:25)
==33368==    by 0x50495AD: sleep (sleep.c:55)
==33368==    by 0x1C846A: flb_main (fluent-bit.c:1380)
==33368==    by 0x1C859B: main (fluent-bit.c:1437)
==33368==  Block was alloc'd at
==33368==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==33368==    by 0x224173: flb_calloc (flb_mem.h:95)
==33368==    by 0x22B647: flb_config_init (flb_config.c:201)
==33368==    by 0x1DD88C: flb_create (flb_lib.c:162)
==33368==    by 0x1C784D: flb_main (fluent-bit.c:1050)
==33368==    by 0x1C859B: main (fluent-bit.c:1437)
==33368== 
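
To make the failure mode concrete, here is a small self-contained sketch of why a pointer saved before a reload goes stale and why reading through the context at the point of use avoids that. All names (`demo_config`, `demo_ctx`, `demo_reload`, `demo_exit_code`) are hypothetical stand-ins, not Fluent Bit's real types or functions. The sketch swaps only the config object, which is a simplification; the real reload path may replace more than that.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the context and its config. */
struct demo_config {
    int exit_status_code;
};

struct demo_ctx {
    struct demo_config *config;
};

/* Allocate a context with a config holding the given exit code. */
static struct demo_ctx *demo_create(int code)
{
    struct demo_ctx *ctx = calloc(1, sizeof(*ctx));

    if (ctx == NULL) {
        return NULL;
    }
    ctx->config = calloc(1, sizeof(*ctx->config));
    if (ctx->config == NULL) {
        free(ctx);
        return NULL;
    }
    ctx->config->exit_status_code = code;
    return ctx;
}

static void demo_destroy(struct demo_ctx *ctx)
{
    if (ctx == NULL) {
        return;
    }
    free(ctx->config);
    free(ctx);
}

/* Simulated reload: frees the old config and installs a new one.
 * Any config pointer saved before this call is dangling afterwards. */
static int demo_reload(struct demo_ctx *ctx, int new_code)
{
    struct demo_config *fresh;

    assert(ctx != NULL);

    fresh = calloc(1, sizeof(*fresh));
    if (fresh == NULL) {
        return -1;
    }
    fresh->exit_status_code = new_code;
    free(ctx->config);
    ctx->config = fresh;
    return 0;
}

/* Read the exit code through the context at the point of use,
 * so the lookup always reaches the live config. */
static int demo_exit_code(const struct demo_ctx *ctx)
{
    return ctx->config->exit_status_code;
}
```

A caller that kept `struct demo_config *saved = ctx->config;` across `demo_reload()` and then read `saved->exit_status_code` would perform the same kind of invalid read that Valgrind reports above; `demo_exit_code(ctx)` does not.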

@nokute78 nokute78 temporarily deployed to pr October 14, 2023 02:00 — with GitHub Actions Inactive

@cosmo0920 cosmo0920 left a comment

Looks great. 😍 I confirmed that this PR fixes most of the SEGV cases under more severe conditions, namely when running with a Golang plugin built on top of our Golang infrastructure: https://github.com/calyptia/plugin

Thanks for plugging this SEGV case! 👍

@niedbalski niedbalski added this to the Fluent Bit v2.1.11 milestone Oct 19, 2023
@edsiper edsiper merged commit 0955615 into fluent:master Oct 19, 2023
40 of 43 checks passed
edsiper pushed a commit that referenced this pull request Oct 20, 2023
@nokute78 nokute78 deleted the reload_sighandler branch October 20, 2023 21:41
leonardo-albertovich pushed a commit that referenced this pull request Nov 3, 2023