[WIP] Try to fix FreeBSD ccall failure #28201

Closed
wants to merge 1 commit into from

Conversation

iblislin
Member

@iblislin iblislin commented Jul 20, 2018

  Got exception LoadError("/.../build/test/ccall.jl", 994, ErrorException("failed process: Process(`/.../build/usr/bin/julia -Cnative -J/.../build/usr/lib/julia/sys.so --compile=yes --depwarn=error --startup-file=no -e 'A = Ref{Cint}(42); finalizer(cglobal((:c_exit_finalizer, \"libccalltest\"), Cvoid), A)'`, ProcessSignaled(6)) [0]")) outside of a @test
  LoadError: failed process: Process(`/.../build/usr/bin/julia -Cnative -J/.../build/usr/lib/julia/sys.so --compile=yes --depwarn=error --startup-file=no -e 'A = Ref{Cint}(42); finalizer(cglobal((:c_exit_finalizer, "libccalltest"), Cvoid), A)'`, ProcessSignaled(6)) [0]

The process got aborted (ProcessSignaled(6)).

I also found a core file on my CI worker.

[venv] julia@abeing:~/julia-fbsd-buildbot/worker/11rel-amd64/build % lldb40 -c test/julia.core ./julia
(lldb) target create "./julia" --core "test/julia.core"
Core file '/home/julia/julia-fbsd-buildbot/worker/11rel-amd64/build/test/julia.core' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'julia', stop reason = signal SIGABRT
  * frame #0: 0x00000008012b779a libc.so.7`_thr_kill + 10
    frame #1: 0x00000008012b7764 libc.so.7`_raise + 52
    frame #2: 0x00000008012b76d9 libc.so.7`abort + 73
    frame #3: libjulia.so.0.7`uv__signal_global_init at signal.c:67
    frame #4: 0x00000008007b0bc8 libthr.so.3`_pthread_once + 216
    frame #5: libjulia.so.0.7`uv_once(guard=0x0000000800f6ce98, callback=(libjulia.so.0.7`uv__signal_global_init at signal.c:62)) at thread.c:286
    frame #6: libjulia.so.0.7`uv__signal_global_once_init at signal.c:72
    frame #7: libjulia.so.0.7`uv_loop_init(loop=0x0000000800f6cbe0) at loop.c:34
    frame #8: libjulia.so.0.7`uv_default_loop at uv-common.c:602
    frame #9: libjulia.so.0.7`_julia_init(rel=JL_IMAGE_CWD) at init.c:633
    frame #10: libjulia.so.0.7`julia_init__threading(rel=<unavailable>) at task.c:302
    frame #11: julia`main(argc=0, argv=0x00007fffffffe310) at repl.c:237
    frame #12: 0x0000000000401625 julia`_start + 149
(lldb) f 3
frame #3: libjulia.so.0.7`uv__signal_global_init at signal.c:67
   64       abort();
   65
   66     if (uv__signal_unlock())
-> 67       abort();
   68   }
   69
   70

So my guess is that the core dump happened under resource pressure: libuv's signal-lock initialization failed and hit the abort() at signal.c:67.

@iblislin
Member Author

ref #28191

@@ -24,7 +24,13 @@ else
UV_FLAGS := --disable-shared $(UV_MFLAGS)
endif

$(BUILDDIR)/$(LIBUV_SRC_DIR)/build-configured: $(SRCCACHE)/$(LIBUV_SRC_DIR)/source-extracted
$(SRCCACHE)/$(LIBUV_SRC_DIR)/libuv-unix-signal.patch-applied: $(SRCCACHE)/$(LIBUV_SRC_DIR)/source-extracted
Member


we do not accept patch files for libuv; please commit to the upstream repo (which we control, and thus can merge easily)

Member Author


Oh, this PR is just an experiment.
I can send the patch to the upstream repo if it works.

@iblislin iblislin changed the title Try to fix FreeBSD ccal failure [WIP] Try to fix FreeBSD ccal failure Jul 20, 2018
do {
r = write(uv__signal_lock_pipefd[1], &data, sizeof data);
- } while (r < 0 && errno == EINTR);
+ } while (r < 0 && (errno == EINTR || errno == EAGAIN));
Member


on EAGAIN, this is a likely deadlock about to happen :/

I should someday file a bug report against libuv saying that this code needs to be redesigned, since I've run into this failure mode before (it's not actually particularly unlikely to trigger).

@vtjnash
Member

vtjnash commented Jul 20, 2018

Do you have access to errno in the core file? I just realized this is during startup, so actually there's no (documented) reason this call is permitted to fail here (https://www.freebsd.org/cgi/man.cgi?write(2))

@iblislin
Member Author

Do you have access to errno in the core file?

Well, sorry, the new build wiped the core file...
I'm waiting for another build to dump a core file.

@iblislin
Member Author

Ha, I got the core file now, and one of the workers has detached from the buildbot master.

@iblislin
Member Author

(lldb) target create "./julia" --core "test/julia.core"
Core file '/home/julia/julia-fbsd-buildbot/worker/11rel-amd64/build/test/julia.core' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'julia', stop reason = signal SIGABRT
  * frame #0: 0x00000008012b779a libc.so.7`_thr_kill + 10
    frame #1: 0x00000008012b7764 libc.so.7`_raise + 52
    frame #2: 0x00000008012b76d9 libc.so.7`abort + 73
    frame #3: libjulia.so.0.7`uv__signal_global_init at signal.c:67
    frame #4: 0x00000008007b0bc8 libthr.so.3`_pthread_once + 216
    frame #5: libjulia.so.0.7`uv_once(guard=0x0000000800f6d088, callback=(libjulia.so.0.7`uv__signal_global_init at signal.c:62)) at thread.c:286
    frame #6: libjulia.so.0.7`uv__signal_global_once_init at signal.c:72
    frame #7: libjulia.so.0.7`uv_loop_init(loop=0x0000000800f6cdd0) at loop.c:34
    frame #8: libjulia.so.0.7`uv_default_loop at uv-common.c:602
    frame #9: libjulia.so.0.7`_julia_init(rel=JL_IMAGE_CWD) at init.c:633
    frame #10: libjulia.so.0.7`julia_init__threading(rel=<unavailable>) at task.c:302
    frame #11: julia`main(argc=0, argv=0x00007fffffffe310) at repl.c:237
    frame #12: 0x0000000000401625 julia`_start + 149
(lldb) p errno
(void *) $0 = 0x000000290000000c

@iblislin
Member Author

GDB showed me ENOMEM:

(gdb) p errno
$1 = 12

@iblislin
Member Author

└─[iblis@abeing ]%  dmesg | tail
kern.ipc.maxpipekva exceeded; see tuning(7)
kern.ipc.maxpipekva exceeded; see tuning(7)
kern.ipc.maxpipekva exceeded; see tuning(7)
kern.ipc.maxpipekva exceeded; see tuning(7)
kern.ipc.maxpipekva exceeded; see tuning(7)
kern.ipc.maxpipekva exceeded; see tuning(7)
kern.ipc.maxpipekva exceeded; see tuning(7)
kern.ipc.maxpipekva exceeded; see tuning(7)
pid 32939 (julia), uid 1003: exited on signal 6 (core dumped)
kern.ipc.maxpipekva exceeded; see tuning(7)

I grepped the whole /usr/src tree and found that the message comes from here:
https://github.com/freebsd/freebsd/blob/98274e3f11ad5ebefe16b45787a648fc2f3626e2/sys/kern/sys_pipe.c#L515-L529


Quoting tuning(7), the exhaustion is not fatal:

    The kern.ipc.maxpipekva loader tunable is used to set a hard limit on the
    amount of kernel address space allocated to mapping of pipe buffers.  Use
    of the mapping allows the kernel to eliminate a copy of the data from
    writer address space into the kernel, directly copying the content of
    mapped buffer to the reader.  Increasing this value to a higher setting,
    such as `25165824' might improve performance on systems where space for
    mapping pipe buffers is quickly exhausted.  This exhaustion is not fatal;
    however, and it will only cause pipes to fall back to using double-copy.

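For anyone hitting this on their own FreeBSD box, a quick way to confirm the diagnosis is to compare the tunable against current usage. This is an untested sketch; the exact counter names are assumptions based on sys_pipe.c, not verified here.

```shell
# Inspect pipe-KVA usage on FreeBSD (counter names assumed from sys_pipe.c):
sysctl kern.ipc.maxpipekva   # the hard limit described in tuning(7)
sysctl kern.ipc.pipekva      # kernel address space currently mapped for pipes

# kern.ipc.maxpipekva is a loader tunable, so raising it (value suggested
# by tuning(7)) goes in /boot/loader.conf and takes effect after a reboot:
#   kern.ipc.maxpipekva=25165824
```

Note that per the tuning(7) quote above, exceeding the limit should only degrade pipes to double-copy mode, so a raised limit is a workaround rather than a fix for the abort itself.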
@ararslan ararslan added the system:freebsd Affects only FreeBSD label Jul 20, 2018
@iblislin iblislin changed the title [WIP] Try to fix FreeBSD ccal failure [WIP] Try to fix FreeBSD ccall failure Jul 24, 2018
@iblislin
Member Author

So... can I just make it retry on ENOMEM?

@iblislin
Member Author

I found that the spawn test suite will eat up all my kernel pipe memory (about 250 MB).
Any idea which test cases cause this?

Please note that the pipe resize failure counter is increasing in this gif:
[animated gif omitted]

@vtjnash
Member

vtjnash commented Jul 24, 2018

I found the test suit spawn will eat up all my kernel pipe memory. Any idea about which test cases caused this?

Very likely the test for kernel misbehaviors when eating up all kernel pipe memory ("# test for proper handling of FD exhaustion") 😜. We should probably move these tests (and any others like them) into a separate file ("stress.jl") and run them at the end on worker 1, so they don't interfere with other tests.

So.. can I just make it retry if ENOMEM?

This is a kernel bug. This function is documented to never return ENOMEM.

@ararslan
Member

This is a kernel bug. This function is documented to never return ENOMEM.

If we've uncovered a FreeBSD kernel bug, would someone (@iblis17?) be willing to submit a FreeBSD bug report with steps to reproduce?

@iblislin iblislin deleted the ib/libuv-unix-signal branch July 26, 2018 16:48
@iblislin
Member Author

I'm trying to construct an MWE in C... I will send it to the freebsd-hackers@ mailing list later.

@ararslan
Member

Awesome. You rock!

Labels
system:freebsd Affects only FreeBSD
3 participants