Seg fault in rkfs_ kills python kernel without error message or traceback #1

mountaindust · 2013-07-05T23:16:28Z

Backtrace from gdb:
#0 0x69144891 in rkfs_ () from C:\Python27\lib\site-packages\odespy_rkf45.pyd
#1 0x69146562 in rkf45_ () from C:\Python27\lib\site-packages\odespy_rkf45.pyd
#2 0x691444d8 in advance_ () from C:\Python27\lib\site-packages\odespy_rkf45.pyd
#3 0x69142291 in f2py_rout__rkf45_advance ()

from C:\Python27\lib\site-packages\odespy_rkf45.pyd
#4 0x69142c83 in fortran_call () from C:\Python27\lib\site-packages\odespy_rkf45.pyd
#5 0x1e0650fe in python27!PyObject_Call () from C:\Windows\system32\python27.dll
#6 0x00000000 in ?? ()

I am working on making this reproducible, but so far, it only wants to occur in the middle of a ton of cython code after running for 2 hours. I've had "unexpected kernel death" in a number of solvers, all of them adaptive methods (I haven't tried with non-adaptive methods).

Update: I've been able to reproduce the kernel crash on the first step when using Lsoda. As a result, I was able to step in with pdb while using some pure python code, and (accidentally, actually) get an error message: "Illegal input was detected, before taking any integration steps." This message repeats a couple of times with a suggested factor for scaling tolerance before you get "SystemExit: 1." Maybe this is what then propagates up and crashes the kernel? In normal running, you cannot catch whatever is crashing it, even with a bare "except:" clause. The error message never appears either - probably because stdout doesn't get flushed - and all you see is the message "The kernel has died unexpectedly..." which prints to the screen over and over again every few seconds.

Something needs to be fixed here, so that such errors from the solver raise an exception with some information rather than a SystemExit. At minimum, it would be nice if the kernel didn't crash and a readable error message was displayed.

Further examination of Lsoda with pdb always hangs the Spyder IDE on line 1086 in solvers.py. value.dtype is determined to be dtype('float64') on line 1077, and requesting this information from the pdb prompt does not raise an error. My U0 array is rather long, (282568 in length), but I don't see how that would make a difference for this assignment. Windows can recover from the crash, and requesting information about self.dtype from pdb gives the following, somewhat cryptic error:

ipdb> self.dtype
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Python27\lib\threading.py", line 552, in __bootstrap_inner
self.run()
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\monitor.py", line 569, in run
self.update_remote_view()
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\monitor.py", line 450, in update_remote_view
remote_view = make_remote_view(ns, settings, more_excluded_names)
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\monitor.py", line 70, in make_remote_view
more_excluded_names=more_excluded_names)
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\monitor.py", line 59, in get_remote_data
excluded_names=excluded_names)
File "C:\Python27\lib\site-packages\spyderlib\widgets\dicteditorutils.py", line 237, in globalsfilter
filters=filters))
File "C:\Python27\lib\site-packages\spyderlib\widgets\dicteditorutils.py", line 202, in is_supported
if not is_editable_type(value):
File "C:\Python27\lib\site-packages\spyderlib\widgets\dicteditorutils.py", line 115, in is_editable_type
return get_color_name(value) not in (UNSUPPORTED_COLOR, CUSTOM_TYPE_COLOR)
File "C:\Python27\lib\site-packages\spyderlib\widgets\dicteditorutils.py", line 107, in get_color_name
elif value.size == 1:
AttributeError: Lsoda instance has no attribute 'size'

dtype('float64')

Typing self.dtype again, or maybe anything else, hangs pdb indefinitely (thus the kernel death?).

hplgit · 2013-07-12T15:00:18Z

It would be nice to catch problems in the Fortran code and raise a Python exception, but it is difficult to find what the problem in Fortran is unless Fortran actually returns successfully with some error code (most of the Python wrappers of Fortran solvers hold the error code as attribute such that the user can examine it).

It is hard to debug such problems without having a small example and provide files that can be run on other systems than Windows.

mountaindust · 2013-07-12T17:26:31Z

I finally figured out while using (what I think is) a pure python solver from the scipy ode package that there was a memory leak. The cause of the leak is actually of some interest for solver construction, so I will describe it.

Suppose you need to solve a large ODE system with a considerable size initial condition, but the parameters will need updating every so often because the system is somehow coupled to something else - say a discrete-time dynamical system. You use a for loop to go through each time period on which you will solve and update the parameters.

Now I'm a mathematician not a programmer, and I have started using python from matlab, so I made a couple of mistakes here. The first was that I changed parameters in my ODE system by creating a new lambda function, odes = lambda x, t: my_ode_system(x,t,params) each time the loop came around to where I needed to solve... instead of by the method supplied. This is a problem for some solvers because they take this function handle and try to make a lambda function of this lambda function, just in case the x and t are switched. For some reason, this causes problems, but they are somewhat minor and traceable. The second mistake was the killer - I had everything about the solver inside the loop. Something like this (leaving out the parameter mistake):

for t in xrange(0,10,h)
solver = odespy.rkf45(my_ode_system,prm1,prm2,prm3)
solver.set_initial_condition(bigarray)
u, tpoints = solver.solve([t,t+h])
prm1, prm2, prm3 = update_params(u,t)
bigarray = u
ii += 1
solution_record[ii] = u

To me, it was not at all clear why this might be bad, but I think every time solver is redefined in the first line of the for loop, there is a memory leak due to the two scopes it operates in (again, not a programmer, so this is only a guess). Anyway, while doing this exact same thing with a scipy ode object, my leak was fixed by putting the lines "del solver; gc.collect()" before/after the solver object was used (oddly enough, there was still a longer term leak that didn't go away until I found a way to majorly decrease the size of my initial conditions using a different data structure). I have not yet tried it with scipy to see if the leak is fixed there as well, though I imagine it would be. Maybe this observation would be worth posting on the scipy dev page as well, since there is no mention of this problem anywhere that I can find, and since I can easily see other math people trying the same sort of thing, this weakness in the object oriented approach taken here should be pointed out and avoided.

I discovered the leak because the scipy dopri5 method eventually returns a memory error. If the wrapper could check for this sort of thing, that would be a big step forward.

hplgit · 2013-07-20T08:42:21Z

You are certainly on the level of a professional programmer with this ability to dive into seg faults and memory leaks!

Normally, in Python, a loop like

for i in range(1000000):
solver = SomeBigObject()

works well because the garbage collector detects that previous SomeBigObject objects are not referenced anymore by a variable and hence their destructors are called. Experiments I have done with big arrays show that this is true. Doing an explicit del solver is what I though implicitly was done, along with gc.collect() (now and then). Did you have to say gc.collect(), and would gc.collect() only solve the problem? I'm guessing too about what can be the cause...

You did the del solver with a scipy ode object - if solver was the odespy wrapper of the scipy ode object, would the del and collect work?

There is probably no memory problem in the underlying Fortran code since Fortran can only work with static chunks of memory, but it could of course be some problem in the wrapper code. I think wrappers now are created with Cython while they used f2py in the past. It could be interesting to see if an old f2py-based scipy has different behavior. Which version did you use?

The best way to solve your problem is to reuse the same solver object throughout the simulation. You connect one right-hand side object to the solver once, before the time loop, and then you update the parameters in the right-hand side function. Something like

class MyRHS:
def _init_(self, data_struct):
self.data = data_struct

def __call__(self, u, t):
    # compute rhs

rhs = MyRHS(...)
solver = odespy.Dopri5(rhs)
for i in range(N): # time loop
# update rhs.data
solver.set_initial_condition(bigarray)
u, t = solver.solve(...)

With this set-up you should be able to avoid allocation of new data, just update the data structures in-place.

mountaindust · 2013-09-14T03:56:35Z

Ha... I'm just a mathematics graduate student who will dive into whatever is necessary where my dissertation is involved (defense in mid October!). This is my first foray into seg faults and memory leaks, and it's been pretty daunting to understand enough of it to find a fix for my code.

I did not try without gc.collect(), but given what was going on, I figured the safest thing was to del and then force an immediate collect. I've been writing in Cython as well (everything goes WAY too slow for my problem otherwise), so it's been a pain to debug. I thought about using something like your set-up after the fact - that's MUCH more elegant - but since it ain't broke now, I'm not going to fix it. I was using scipy version 0.12.0.

yoavram · 2015-02-09T18:24:18Z

Great back and forth, thanks!

hplgit closed this as completed Feb 9, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Seg fault in rkfs_ kills python kernel without error message or traceback #1

Seg fault in rkfs_ kills python kernel without error message or traceback #1

mountaindust commented Jul 5, 2013

hplgit commented Jul 12, 2013

mountaindust commented Jul 12, 2013

hplgit commented Jul 20, 2013

mountaindust commented Sep 14, 2013

yoavram commented Feb 9, 2015

Seg fault in rkfs_ kills python kernel without error message or traceback #1

Seg fault in rkfs_ kills python kernel without error message or traceback #1

Comments

mountaindust commented Jul 5, 2013

hplgit commented Jul 12, 2013

mountaindust commented Jul 12, 2013

hplgit commented Jul 20, 2013

mountaindust commented Sep 14, 2013

yoavram commented Feb 9, 2015