segmentation fault with open_mfdataset #444

Closed
razcore-rad opened this issue Jun 26, 2015 · 26 comments

@razcore-rad

This is super strange. Does anyone have any idea why a segmentation fault might be happening here?

Python 3.4.3 (default, Jun 26 2015, 00:02:21) 
[GCC 4.3.4 [gcc-4_3-branch revision 152973]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xray
>>> xray.open_mfdataset('2*.nc', concat_dim='time')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/backends/api.py", line 205, in open_mfdataset
Segmentation fault (core dumped)

I say it's strange because I ended up tracking down the bug to xray.core.ops.array_equiv. I have no idea what's going on, but by accident I found out that if I introduce isnull(arr1 & arr2) just before the return statement, I don't get the error any more... So my xray.core.ops.array_equiv is now:

def array_equiv(arr1, arr2):
    """Like np.array_equal, but also allows values to be NaN in both arrays
    """
    arr1, arr2 = as_like_arrays(arr1, arr2)
    if arr1.shape != arr2.shape:
        return False
    # segmentation fault if we don't call this here...
    isnull(arr1 & arr2)
    return bool(((arr1 == arr2) | (isnull(arr1) & isnull(arr2))).all())

Thanks...

@shoyer
Member

shoyer commented Jun 26, 2015

Oh my, that's bad!

Can you experiment with the engine argument to open_mfdataset and see if that changes things? For example, try engine='scipy' (if these are netCDF3 files) and engine='netcdf4'.

It would also be helpful to report the dtypes of the arrays that trigger the failure in array_equiv.

@shoyer shoyer added the bug label Jun 26, 2015
@razcore-rad
Author

Unfortunately I can't use engine='scipy' cause they're not netcdf3 files so it defaults to 'netcdf4'. On the other hand here you can find the back trace from gdb... if that helps in any way...

print(arr1.dtype, arr2.dtype)
print((arr1 == arr2))
print((arr1 == arr2) | (isnull(arr1) & isnull(arr2)))

# gives:
float64 float64
dask.array<x_1, shape=(50, 39, 59), chunks=((50,), (39,), (59,)), dtype=bool>
dask.array<x_6, shape=(50, 39, 59), chunks=((50,), (39,), (59,)), dtype=bool>

Funny thing is, when I add these print statements and so on, I sometimes get a traceback from Python. Without them I only get a segmentation fault with no additional information. For example, just now, after introducing these prints I got this traceback. This doesn't seem to be an xray bug, I mean it can't be, since it's just Python code... but any help is appreciated. Thanks!

edit: oh yeah... this is a funny thing. If I do print(((arr1 == arr2) | (isnull(arr1) & isnull(arr2))).all()), I get dask.array<x_13, shape=(), chunks=(), dtype=bool>, which I guess is a problem... so calling that all method kind of screws things up, or at least calls other stuff that screws it up, but I have no idea why calling isnull(arr1 & arr2) before all this makes it run without a segfault.

@shoyer
Member

shoyer commented Jun 26, 2015

Another backend to try would be engine='h5netcdf': https://github.com/shoyer/h5netcdf

That might help us identify if this is a netCDF4-python bug.

I am also baffled by how inserting isnull(arr1 & arr2) avoids the seg fault. This is a lazy computation created with dask that is immediately thrown away without accessing any of the values.

@razcore-rad
Author

Just tried engine='h5netcdf'. Still get the segfault. It looks to me like something doesn't properly initialize the hdf5 library, and calling that isnull function like this somehow triggers some initialization for both arrays. It might also be the & operator... because if I do isnull(arr1) & isnull(arr2) I still get the segmentation fault. Only when using isnull(arr1 & arr2) does it seem to work... strange things.

edit: I was right... it's actually the & operator, I just need to call arr1 & arr2 before the return statement and I don't get the segmentation fault...

@shoyer
Member

shoyer commented Jun 27, 2015

Do you have an example file? This might also be your HDF5 install...

@mrocklin
Contributor

@shoyer asked me to chime in in case this is an issue with dask. One thing to try would be to remove multi-threading from the equation. I'm not sure how this would affect things but it's worth a shot.

>>> import dask
>>> from dask.async import get_sync
>>> dask.set_options(get=get_sync)  # use single-threaded scheduler by default
>>> ... do work as normal
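
For readers on a recent dask release, the snippet above will not run as written: `async` has been a reserved word since Python 3.7, and, to my knowledge, `dask.set_options` was later replaced by `dask.config.set`. A minimal sketch of the same single-threaded experiment with the current API, assuming dask and NumPy are installed (the small array is only a stand-in for the real data, not part of the original report):

```python
import dask
import dask.array as da

# Stand-in for the real workload: a small chunked array (assumption for illustration).
x = da.ones((4, 4), chunks=(2, 2))

# Run every task in the calling thread, so no two chunks are processed concurrently.
with dask.config.set(scheduler="synchronous"):
    total = x.sum().compute()

print(total)
```

If the crash disappears under the synchronous scheduler, that points at concurrent access rather than at the data itself.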

@mrocklin
Contributor

Alternatively can we try doing the operations that xray would do manually and see if one of them triggers something?

One could also try

$ gdb python

@razcore-rad
Author

So I just tried @mrocklin's idea with using single-threaded stuff. This seems to fix the segmentation fault, but I am very curious as to why there's a problem with working in parallel. I tried two different hdf5 libraries (I think version 1.8.13 and 1.8.14) but I got the same segmentation fault. Anyway, working on a single thread is not a big deal, I'll just do that for the time being... I already tried gdb on python but I'm not experienced enough to make heads or tails of it... I have the gdb backtrace here (https://gist.github.com/razvanc87/0986c4f7a591772e1778) but I don't know what to do with it...

@shoyer, the files are not the issue here, they're the same ones I provided in #443.

Question: does the hdf5 library need to be built with parallel support (mpi or something) maybe?... thanks guys

@mrocklin
Contributor

There was a similar problem with PyTables, which didn't support concurrency well. This resulted in the from-hdf5 function in dask array, which uses explicit locks to avoid concurrent access.

We could repeat this treatment more generally without much trouble, to force single-threaded access when reading data while still allowing parallelism otherwise.
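
The idea of locking only the reads can be sketched with the standard library alone. This is an illustration of the pattern, not dask's implementation: `LockedReader` and `chunk_sum` are hypothetical names, and a plain list stands in for an on-disk array.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedReader:
    """Wrap an indexable source so that every read holds a lock."""

    def __init__(self, source, lock):
        self._source = source
        self._lock = lock

    def __getitem__(self, key):
        # Only one thread at a time may touch the underlying source.
        with self._lock:
            return self._source[key]


def chunk_sum(reader, start, stop):
    block = reader[start:stop]  # serialized by the lock
    return sum(block)           # runs outside the lock


data = list(range(100))  # stand-in for an on-disk array
reader = LockedReader(data, threading.Lock())

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(chunk_sum, reader, i, i + 25) for i in range(0, 100, 25)]
    partial_sums = [f.result() for f in futures]

total = sum(partial_sums)
print(total)
```

The lock is released as soon as a block has been read, so whatever happens to the block afterwards is not serialized by it.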

@shoyer
Member

shoyer commented Jun 27, 2015

Of course, concurrent access to HDF5 files works fine on my laptop, using Anaconda's build of HDF5 (version 1.8.14). I have no idea what special flags they invoked when building it :).

That said, I have been unable to produce any benchmarks that show improved performance when simply doing multithreaded reads without doing any computation (e.g., %time xray.open_dataset(..., chunks=...).load()). Even when I'm reading multiple independent chunks compressed on disk, CPU seems to be pegged at 100%, when using either netCDF4-python or h5py (via h5netcdf) to read the data. For non-compressed data, reads seem to be limited by disk speed, so CPU is also not relevant.

Given these considerations, it seems like we should use a lock when reading data into xray with dask. @mrocklin we could just use lock=True with da.from_array, right? If we can find use cases for multi-threaded reads, we could also add an optional lock argument to open_dataset/open_mfdataset.

@mrocklin
Contributor

Oh, I didn't realize that that was built in already. Sounds like you could handle this easily on the xray side.

@shoyer
Member

shoyer commented Jun 28, 2015

I have a tentative fix (adding the threading lock) in #446

Still wondering why multi-threading can't use more than one CPU -- hopefully my h5py issue (referenced above) will get us some answers.

@shoyer
Member

shoyer commented Jun 29, 2015

Just merged the fix to master.

@razvanc87 if you could try installing the development version, I would love to hear if this resolves your issues.

@shoyer
Member

shoyer commented Jun 29, 2015

@razvanc87 What version of h5py were you using with h5netcdf? @andrewcollette suggests (h5py/h5py#591 (comment)) that h5py should already have the lock that fixes this issue if you were using h5py 2.4.0 or later.

@razcore-rad
Author

Well... I have a couple of remarks to make. After some more thought, this might have been my fault all along. Let me explain. I have this machine at work where I don't have administrative privileges, so I decided to give linuxbrew a try. There are some system hdf5 libraries (in custom locations), and there is a module command to load different versions of packages and set up the proper environment variables. Before I had this issue, I had xray installed with dask and everything compiled against the system libraries, and I had no problems with it. Then, with linuxbrew, I started getting this weird behavior using the latest version of hdf5 (1.8.14), and I had the same issue when I tried version 1.8.13. Then I read somewhere on the net that this mixture of local and system installs with linuxbrew can cause issues when compiling, that is, the compiler may pick up header files that don't match the locally installed libraries. I can't confirm this any more though, cause I reconfigured everything and removed linuxbrew since it was causing more problems than it solved... but I'll be happy to give the current installation a try and see if I can reproduce the error... can't do more than this though... sorry.

@razcore-rad
Author

OK... as a follow-up, I did some tests and with netcdf4 I got this error again, but using open_mfdataset with the latest versions of h5py & h5netcdf I don't. But there are some decodings that aren't happening now... for whatever reason (maybe h5netcdf?). Anyway, my netcdf files store the attributes in 'ascii', that is, bytes in python so when trying to check for the time I get:

Traceback (most recent call last):
  File "segfault.py", line 62, in <module>
    concat_dim='time', engine='h5netcdf'))
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/backends/api.py", line 202, in open_mfdataset
    datasets = [open_dataset(p, **kwargs) for p in paths]
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/backends/api.py", line 202, in <listcomp>
    datasets = [open_dataset(p, **kwargs) for p in paths]
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/backends/api.py", line 145, in open_dataset
    return maybe_decode_store(store)
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/backends/api.py", line 101, in maybe_decode_store
    concat_characters=concat_characters, decode_coords=decode_coords)
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/conventions.py", line 850, in decode_cf
    decode_coords)
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/conventions.py", line 791, in decode_cf_variables
    decode_times=decode_times)
  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/xray/conventions.py", line 735, in decode_cf_variable
    if 'since' in attributes['units']:
TypeError: Type str doesn't support the buffer API

This is simple to solve... just decode every bytes attribute as 'utf8' when first reading in the variables... I'll have some more time to look at this later today.
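
A minimal sketch of that decoding step, assuming the attributes arrive as a plain dict. `decode_attrs` is a hypothetical helper for illustration, not an xray or h5netcdf function, and the attribute values are made up:

```python
def decode_attrs(attrs, encoding="utf-8"):
    """Return a copy of attrs with every bytes value decoded to str."""
    return {
        name: value.decode(encoding) if isinstance(value, bytes) else value
        for name, value in attrs.items()
    }


# Made-up attributes: one bytes value, one non-string value.
raw = {"units": b"days since 2000-01-01", "scale_factor": 0.5}

try:
    "since" in raw["units"]  # str-in-bytes membership test fails on Python 3
except TypeError as err:
    print("without decoding:", err)

attrs = decode_attrs(raw)
print("since" in attrs["units"])
```

Non-bytes values pass through untouched, so numeric attributes are not affected.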

edit: boy... there are some differences between these packages (netcdf4 & h5netcdf)... so, when trying open_mfdataset with netcdf4 I get the segmentation fault... when I open it with h5netcdf I don't, but the attributes are in bytes, so xray gives some errors when trying to get the date/time... netcdf4 doesn't produce this error, it probably converts the bytes to strings internally... so I went in and tried to patch in some .decode('utf8') here and there in xray and it works... when using h5netcdf, but then I get another error from h5netcdf:

  File "/ichec/home/users/razvan/.local/lib/python3.4/site-packages/h5py/_hl/attrs.py", line 55, in __getitem__
    raise IOError("Empty attributes cannot be read")
OSError: Empty attributes cannot be read

I didn't put the full error cause I don't think it's relevant. Anyway, needless to say... netcdf4 doesn't give this error... so these things need to be put in accordance somehow :)

edit2: so I was going through the posts here and now I see you addressed this issue using that lock thing, which is set to True by default in open_dataset, right? Well, I don't know exactly what this is supposed to do, but I'm still getting a segmentation fault. As stated before, it only happens when using netcdf4, not h5netcdf, but then I run into that inconsistency with the ascii vs utf8 issue if I use h5netcdf... maybe I should open a new issue about this string issue? I don't know if this is an upstream issue or not. I mean, I guess h5netcdf just decides not to convert the ascii to utf8, whereas netcdf4 goes with the more contemporary approach of returning utf8... or is this handled internally by xray?

@shoyer
Member

shoyer commented Jul 2, 2015

Thanks for your help debugging!

I made a new issue for ascii attributes handling: #451

This is one case where Python 3's insistence that bytes and strings are different is annoying. I'll probably have to decode all bytes type attributes read from h5netcdf.

How do you trigger the seg-fault with netcdf4-python? Just using open_mfdataset as before? I'm a little surprised that still happens with the thread lock.

@shoyer shoyer reopened this Jul 2, 2015
@razcore-rad
Author

Yes, I'm using the same files that I once uploaded on Dropbox for you to play with for #443. I'm not doing anything special, just passing in the glob pattern to open_mfdataset with no option for engine (which I guess goes for netcdf4 by default).

@shoyer
Member

shoyer commented Jul 2, 2015

Ah, I think I know why the seg faults are still occurring. By default, dask.array.from_array uses a thread lock that is specific to each array variable. We need a global thread lock, because the HDF5 library is not thread safe.

@mrocklin maybe da.from_array should use a global thread lock if lock=True? Alternatively, I could just change this in xray -- but I suspect that other dask users who want a lock also probably want a global lock.
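
The difference between one lock per variable and a single shared lock can be shown with a small standard-library experiment. This is only a sketch: `Library` is a hypothetical stand-in for a non-thread-safe C library, and the sleep stands in for I/O.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Library:
    """Stand-in for a non-thread-safe library; records peak concurrent use."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the counters only
        self.active = 0
        self.peak = 0

    def read(self):
        with self._guard:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.05)  # pretend to do I/O
        with self._guard:
            self.active -= 1


def peak_concurrency(locks):
    """Read one 'variable' per lock in parallel; return the peak concurrency seen."""
    lib = Library()

    def read_variable(lock):
        with lock:
            lib.read()

    with ThreadPoolExecutor(max_workers=len(locks)) as pool:
        list(pool.map(read_variable, locks))
    return lib.peak


per_array = [threading.Lock() for _ in range(4)]  # one lock per variable
shared = threading.Lock()
global_lock = [shared] * 4                        # the same lock for every variable

print("per-array locks:", peak_concurrency(per_array))
print("global lock:    ", peak_concurrency(global_lock))
```

With the shared lock the peak is always one call at a time. With per-variable locks, several threads can be inside the library at once, which is exactly the situation a non-thread-safe library cannot tolerate.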

@mrocklin
Contributor

mrocklin commented Jul 3, 2015

The library itself is not threadsafe? What about on a per-file basis?

@razcore-rad
Author

Per file basis (open_dataset) there's no problem... but again, if I try the h5netcdf engine, open_mfdataset doesn't throw a segmentation fault, but then I run into the string unicode/ascii problem. So I guess h5netcdf and netcdf4 use the same netcdf/hdf5 libraries, don't they? So if it works for h5netcdf then it should work for netcdf4 as well...

@shoyer
Member

shoyer commented Jul 3, 2015

The library itself is not threadsafe? What about on a per-file basis?

@andrewcollette could you comment on this for h5py/hdf5?

@mrocklin based on my reading of Andrew's comment in the h5py issue, this is indeed the case.

@shoyer
Member

shoyer commented Jul 3, 2015

@razvanc87 netcdf4 and h5py use the same HDF5 libraries, but have different bindings from Python. H5py likely does a more careful job of using its own locks to ensure thread safety, which likely explains the difference you are seeing (the attribute encoding is a separate issue).

@andrewcollette

@shoyer, there are basically two levels of thread safety for HDF5/h5py. First, the HDF5 library has an optional compile-time "threadsafe" build option that wraps all API access in a lock. This is all-or-nothing; I'm not aware of any per-file effects.

Second, h5py uses its own global lock on the Python side to serialize access, which is only disabled in MPI mode. For added protection, h5py also does not presently release the GIL around reads/writes.
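
The second level, a single lock on the Python side of the bindings, can be sketched generically as a decorator. This is not h5py's actual code: `_LIBRARY_LOCK`, `serialized`, `read_attribute` and `read_all` are hypothetical names, and a dict stands in for a file.

```python
import functools
import threading

# One re-entrant lock shared by every wrapped function (hypothetical name).
_LIBRARY_LOCK = threading.RLock()


def serialized(func):
    """Run func while holding the module-wide lock."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _LIBRARY_LOCK:
            return func(*args, **kwargs)
    return wrapper


@serialized
def read_attribute(store, name):
    return store[name]


@serialized
def read_all(store):
    # The lock is re-entrant, so calling another serialized function here is safe.
    return {name: read_attribute(store, name) for name in store}


result = read_all({"units": "K", "long_name": "temperature"})
print(result)
```

A re-entrant lock is used so that wrapped functions can call each other without deadlocking; a plain `threading.Lock` would block on the nested call.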

@razcore-rad
Author

I think this issue can be closed. After some digging and playing with different netcdf4 modules, I'm pretty certain that it was a linkage and compilation issue between the system hdf5 and netcdf libraries. You see, the computer I got this error on is one of those "module load" managed supercomputers... and somewhere along the way things got messed up while compiling python modules...

@shoyer
Member

shoyer commented Jul 10, 2015

@razvanc87 I've gotten a few other reports of issues with multithreading (not just you), so I think we do definitely need to add our own lock when accessing these files. Misconfigured hdf5 installs may not be so uncommon.
