ERR: fail fast with non-supported dtypes on construction #14349

Closed
jzwinck opened this issue Oct 5, 2016 · 14 comments
Labels: Dtype Conversions, Error Reporting

Comments

jzwinck commented Oct 5, 2016

Example Code

import numpy as np
import pandas as pd

print(pd.DataFrame({'a': np.zeros(1000, 'V4')}))

Results

Non-deterministic behavior. Sometimes you get all zeros, sometimes you get garbage like this:

                        a
0            [1, 0, 0, 0]
1            [0, 0, 0, 0]
2            [8, 0, 0, 0]
3            [0, 0, 0, 0]
4      [48, -92, 71, -60]
5        [-27, 127, 0, 0]
6            [5, 0, 0, 0]
7            [0, 0, 0, 0]
...

That is despite the fact that the bytes are actually all zero, and NumPy prints all rows as [0, 0, 0, 0].

Sometimes when printing a wider DataFrame containing such a column, it segfaults with this stack trace:

#0  BYTE_copyswap (dst=0x7fffebdbe868, src=0x7fffe9f0d054, __NPY_UNUSED_TAGGEDswap=0, __NPY_UNUSED_TAGGEDarr=0x7fffea15f990)
    at numpy/core/src/multiarray/arraytypes.c.src:1911
#1  0x00007ffff5d82075 in PyArray_Scalar (data=0x7fffe9f0d054, descr=0x7fffea1612d0, base=<optimized out>) at numpy/core/src/multiarray/scalarapi.c:835
#2  0x00007ffff5d52df0 in array_item (self=0x7fffea15f990, i=<optimized out>) at numpy/core/src/multiarray/mapping.c:1371

Expected Output

All rows [0, 0, 0, 0] - just as NumPy prints it.

Output of pd.show_versions()

## INSTALLED VERSIONS

commit: None
python: 3.5.1
python-bits: 64
OS: Linux
OS-release: 3.13.0
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
pip: 8.1.2
setuptools: 27.2.0
numpy: 1.11.1

jzwinck commented Oct 5, 2016

I have observed that this does not happen if a reference to the NumPy array is held:

arr = np.zeros(1000, 'V4')
print(pd.DataFrame({'a': arr}))

That code behaves as expected, simply printing zeros. But if you do this:

arr = np.zeros(1000, 'V4')
df = pd.DataFrame({'a': arr})
del arr
print(df)

You're back to crashing or non-determinism. So the problem seems to be that pandas is holding a dangling (refcount 0) reference to data from NumPy.
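
One way to probe this diagnosis from the outside (a sketch; exact counts vary by Python and pandas version):

import sys
import numpy as np
import pandas as pd

arr = np.zeros(1000, 'V4')
before = sys.getrefcount(arr)
df = pd.DataFrame({'a': arr})
after = sys.getrefcount(arr)
# If pandas kept only raw pointers into arr's buffer, before == after
# even though df's object column now depends on that buffer staying alive.
print(before, after)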

chris-b1 commented Oct 5, 2016

Maybe you're already aware, but void types aren't really supported; what you're actually getting is that array cast to an object dtype, which won't be very performant and won't (I think?) preserve the memory layout either, if you were planning on casting it back later.
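
A quick way to see that cast (a sketch of the behavior described above, as observed on the pandas 0.18.x line this thread is about):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.zeros(1000, 'V4')})
print(df['a'].dtype)  # object -- the V4 records were boxed; the 'V4' dtype is not kept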

jreback commented Oct 5, 2016

this is a duplicate of #13447. void is not a supported dtype. As I said in that issue, you can cast if you want. Further, this could be checked more aggressively in construction; I'll mark it for that enhancement.
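
For reference, a sketch of the kind of cast being suggested: reinterpret the void bytes as a supported dtype of the same width before construction ('i4' below is just one illustrative target):

import numpy as np
import pandas as pd

raw = np.zeros(1000, 'V4')
# View the 4-byte void records as int32 (same itemsize, no copy):
df = pd.DataFrame({'a': raw.view('i4')})
print(df['a'].dtype)  # int32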

jreback added the Difficulty Novice, Dtype Conversions, and Error Reporting labels on Oct 5, 2016
jreback added this to the Next Major Release milestone on Oct 5, 2016
jreback changed the title from "BUG: Non-determinism (sometimes segfault) when printing DataFrame with NumPy void column" to "ERR: fail fast with non-supported dtypes on construction" on Oct 5, 2016
jzwinck commented Oct 5, 2016

@chris-b1 Yes, I am aware that the void dtype is said to be unsupported. However, it should not be nondeterministic, and it should not segfault. The problem we have here is that pandas is holding a reference it does not own, and that should be fixable.

@jreback I do not want the solution to be simply refusing to construct a DataFrame. At a minimum, I would want a way to tell pandas to construct the DataFrame with the columns it can use and omit the ones it cannot. If it simply raises an exception every time it sees a void dtype, that will not be usable; having it accept the data it can and omit the data it cannot would be OK with me.

Or, even better, just fix the reference counting bug. I'm not asking Pandas to "support" void any more or less than it does now, just that it should not crash the Python runtime.

A patch to make Pandas fail to construct DataFrames that it was previously able to construct would be a step backward.

jreback commented Oct 5, 2016

If you want to submit a pull request, great. void is completely unsupported; it may happen to work sometimes, but that is not support.

jzwinck commented Oct 6, 2016

Here is another example, which has the same problem despite not explicitly using void:

print(pd.DataFrame({'a': np.zeros(1000, [('x', 'i1', 4), ('y', 'i4')])}))

The fundamental error is the creation of an ObjectBlock containing 1000 buffer readers that hold raw pointers into the passed array, with no reference-count increment on that array.

A simple solution is the following patch to internals.py:

if len(object_items) > 0:
    object_blocks = _simple_blockify(object_items, np.object_)
+    for i in range(len(object_items)):
+        object_blocks[i].reference_holder = object_items[i]

Then, each ObjectBlock does hold a reference to the input data. This prevents the undefined behavior, and actually prints the correct values.

I also tried changing _simple_blockify() there to _multi_blockify(), which makes construction work properly and preserves the V4 dtype. But other parts of the code, like Series, don't like this, because V4 is similar to (i1, 4) rather than i4, and "compound" dtypes are not supported either.

This also works, in _stack_arrays():

for i, arr in enumerate(arrays):
    if issubclass(arr.dtype.type, np.void):
        stacked[i] = None
    else:
        stacked[i] = _asarray_compat(arr)

What that does is to simply store None instead of the void data.

Other possible solutions include:

  • Adding a VoidBlock type (I expect @jreback will not like this because it looks like "support" for void).
  • Making the DataFrame constructor skip void columns automatically (a user-side version of this is sketched below). This would be OK with me, and perhaps smarter than storing a column full of None as above.

Do any of these solutions appeal to you, @chris-b1 and @jreback ?
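
A user-side sketch of the second option above, filtering void fields out before construction (the record dtype here is illustrative):

import numpy as np
import pandas as pd

rec = np.zeros(1000, [('x', 'i4'), ('pad', 'V3'), ('y', 'f8')])

# Keep only the fields pandas can represent; drop the void columns:
usable = [name for name in rec.dtype.names if rec.dtype[name].kind != 'V']
df = pd.DataFrame({name: rec[name] for name in usable})
print(df.columns.tolist())  # ['x', 'y']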

chris-b1 commented Oct 6, 2016

I think it would be reasonable to make sure the right reference is held. I guess the issue is that when you assign into an object array, the refcount of the assigned values isn't incremented (is that a NumPy bug?):

v = np.zeros(1000, 'V4')
stacked = np.empty((1, 1000), dtype='O')
stacked[0] = v   # each slot gets a scalar view into v's buffer
del v            # those views now point at memory that can be freed

In [8]: stacked
Out[8]: 
array([array([ -80,  -45, -128,   32], dtype=int8),
       array([-31,   1,   0,   0], dtype=int8),
       array([  0, -18, -46,  27], dtype=int8),
       array([-31,   1,   0,   0], dtype=int8),
       array([-36,  18,   0,   0], dtype=int8)], dtype=object)

Not completely opposed to having a bytes type; it just would be a lot of work for a fairly specialized use, especially when there are plans to revamp the type system. See also wesm/pandas2#24, and feel free to comment there.

jreback commented Oct 6, 2016

@jzwinck you are just showing a variation on #13296. We are not supporting invalid dtypes. Sure they can be added, but this is quite a bit of work, in this case for almost no gain.

jzwinck commented Oct 6, 2016

@jreback how about this idea?

Making the DataFrame constructor skip void columns automatically. This would be OK with me, and perhaps smarter than storing a column full of None as above.

jreback commented Oct 6, 2016

just raising is simpler, better, and more logical

jzwinck commented Oct 6, 2016

@jreback I do not like that at all, because it means I can no longer rely on being able to construct a DataFrame from an ndarray; sometimes it would raise an exception even when the dimensions make sense. If pandas can't support void, I would much prefer that the column either be filled with None (which I gave the code for above) or be omitted automatically, so I can move on with my life and not always have to loop over the dtype first looking for columns pandas can't use.

jzwinck commented Oct 6, 2016

I should also mention that one of the common ways void columns appear is via align=True dtypes in NumPy. That is, if I create an "aligned" structured array or recarray in NumPy, it will often have some void columns (sometimes with no name, sometimes with a name like f1 or f2, depending on how it was born). It's not that I'm deliberately creating void columns just to mess up pandas.
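
For illustration, a hand-built dtype of the shape being described, with a 3-byte void field padding an int8 out to int32 alignment (the field name 'f1' is illustrative):

import numpy as np

dt = np.dtype([('x', 'i1'), ('f1', 'V3'), ('y', 'i4')])
arr = np.zeros(10, dt)
print(arr.dtype['f1'].kind)  # 'V' -- a void padding column pandas can't hold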

jreback commented Oct 6, 2016

@jzwinck then you are in charge of this in your code. pandas doesn't support void, full stop.

jzwinck commented Oct 10, 2016

@chris-b1 I have created numpy/numpy#8129 for what does look like a bug in NumPy. Thank you for raising the possibility that the root cause is a NumPy bug; I didn't see that initially. I'd appreciate your thoughts on that ticket if you have any.

As for this ticket, I am closing it, because the very last thing I would ever want is for DataFrame construction to fail just because a void column exists. I now believe the best way to resolve this issue is in NumPy.

jzwinck closed this as completed on Oct 10, 2016