ERR: fail fast with non-supported dtypes on construction #14349

Closed
jzwinck opened this issue Oct 5, 2016 · 14 comments
Labels: Dtype Conversions, Error Reporting

Comments

jzwinck commented Oct 5, 2016

Example Code

import numpy as np
import pandas as pd

print(pd.DataFrame({'a': np.zeros(1000, 'V4')}))

Results

Non-deterministic behavior. Sometimes you get all zeros, sometimes you get garbage like this:

                        a
0            [1, 0, 0, 0]
1            [0, 0, 0, 0]
2            [8, 0, 0, 0]
3            [0, 0, 0, 0]
4      [48, -92, 71, -60]
5        [-27, 127, 0, 0]
6            [5, 0, 0, 0]
7            [0, 0, 0, 0]
...

That is despite the fact that the bytes are actually all zero, and NumPy prints all rows as [0, 0, 0, 0].

Sometimes when printing a wider DataFrame containing such a column, it segfaults with this stack trace:

#0  BYTE_copyswap (dst=0x7fffebdbe868, src=0x7fffe9f0d054, __NPY_UNUSED_TAGGEDswap=0, __NPY_UNUSED_TAGGEDarr=0x7fffea15f990)
    at numpy/core/src/multiarray/arraytypes.c.src:1911
#1  0x00007ffff5d82075 in PyArray_Scalar (data=0x7fffe9f0d054, descr=0x7fffea1612d0, base=<optimized out>) at numpy/core/src/multiarray/scalarapi.c:835
#2  0x00007ffff5d52df0 in array_item (self=0x7fffea15f990, i=<optimized out>) at numpy/core/src/multiarray/mapping.c:1371

Expected Output

All rows [0, 0, 0, 0] - just as NumPy prints it.

Output of pd.show_versions()

## INSTALLED VERSIONS

commit: None
python: 3.5.1
python-bits: 64
OS: Linux
OS-release: 3.13.0
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
pip: 8.1.2
setuptools: 27.2.0
numpy: 1.11.1

jzwinck commented Oct 5, 2016

I have observed that this does not happen if a reference to the NumPy array is held:

arr = np.zeros(1000, 'V4')
print(pd.DataFrame({'a': arr}))

That code behaves as expected, simply printing zeros. But if you do this:

arr = np.zeros(1000, 'V4')
df = pd.DataFrame({'a': arr})
del arr
print(df)

You're back to crashing or non-determinism. So the problem seems to be that pandas is holding a dangling (refcount 0) reference to data from NumPy.
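
One way to probe this diagnosis from the outside (a sketch; exact counts vary by Python and pandas version):

import sys
import numpy as np
import pandas as pd

arr = np.zeros(1000, 'V4')
before = sys.getrefcount(arr)
df = pd.DataFrame({'a': arr})
after = sys.getrefcount(arr)
# If pandas kept only raw pointers into arr's buffer, before == after
# even though df's object column now depends on that buffer staying alive.
print(before, after)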

chris-b1 commented Oct 5, 2016

Maybe you're already aware, but void types aren't really supported; what you're actually getting is that array cast to an object dtype, which won't be very performant and won't (I think?) preserve the memory layout either, if you were planning on casting it back later.
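
A quick way to see that cast (a sketch of the behavior described above, as observed on the pandas 0.18.x line this thread is about):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.zeros(1000, 'V4')})
print(df['a'].dtype)  # object -- the V4 records were boxed; the 'V4' dtype is not kept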

jreback commented Oct 5, 2016

this is a duplicate of #13447. void is not a supported dtype. As I said in that issue, you can cast if you want. Further, this could be checked more aggressively in construction; I'll mark it for that enhancement.
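
For reference, a sketch of the kind of cast being suggested: reinterpret the void bytes as a supported dtype of the same width before construction ('i4' below is just one illustrative target):

import numpy as np
import pandas as pd

raw = np.zeros(1000, 'V4')
# View the 4-byte void records as int32 (same itemsize, no copy):
df = pd.DataFrame({'a': raw.view('i4')})
print(df['a'].dtype)  # int32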

jreback added the Difficulty Novice, Dtype Conversions, and Error Reporting labels on Oct 5, 2016
jreback added this to the Next Major Release milestone on Oct 5, 2016
jreback changed the title from "BUG: Non-determinism (sometimes segfault) when printing DataFrame with NumPy void column" to "ERR: fail fast with non-supported dtypes on construction" on Oct 5, 2016
jzwinck commented Oct 5, 2016

@chris-b1 Yes, I am aware that the void dtype is said to be unsupported. However, it should not be nondeterministic, and it should not segfault. The problem we have here is that pandas is holding a reference it does not own, and that should be fixable.

@jreback I do not want the solution to be simply refusing to construct a DataFrame. At a minimum, I would want a way to tell pandas to construct the DataFrame with the columns it can use and omit the ones it cannot. If it simply raises an exception every time it sees a void dtype, that will not be usable; having it accept the data it can and omit the data it cannot would be OK with me.

Or, even better, just fix the reference counting bug. I'm not asking Pandas to "support" void any more or less than it does now, just that it should not crash the Python runtime.

A patch to make Pandas fail to construct DataFrames that it was previously able to construct would be a step backward.

jreback commented Oct 5, 2016

If you want to submit a pull request, great. void is completely unsupported; it may happen to work sometimes, but that is not support.

jzwinck commented Oct 6, 2016

Here is another example, which has the same problem despite not explicitly using void:

print(pd.DataFrame({'a': np.zeros(1000, [('x', 'i1', 4), ('y', 'i4')])}))

The fundamental error is the creation of an ObjectBlock containing 1000 buffer readers that hold raw pointers into the passed array, with no reference-count increment on that array.

A simple solution is the following patch to internals.py:

if len(object_items) > 0:
    object_blocks = _simple_blockify(object_items, np.object_)
+    for i in range(len(object_items)):
+        object_blocks[i].reference_holder = object_items[i]

Then, each ObjectBlock does hold a reference to the input data. This prevents the undefined behavior, and actually prints the correct values.

I also tried changing _simple_blockify() there to _multi_blockify(), which makes construction work properly and preserves the V4 dtype. But other parts of the code, like Series, don't like this, because V4 is similar to (i1, 4) rather than i4, and "compound" dtypes are not supported either.

This also works, in _stack_arrays():

for i, arr in enumerate(arrays):
    if issubclass(arr.dtype.type, np.void):
        stacked[i] = None
    else:
        stacked[i] = _asarray_compat(arr)

What that does is to simply store None instead of the void data.

Other possible solutions include:

  • Adding a VoidBlock type (I expect @jreback will not like this because it looks like "support" for void).
  • Making the DataFrame constructor skip void columns automatically (a user-side version of this is sketched below). This would be OK with me, and perhaps smarter than storing a column full of None as above.

Do any of these solutions appeal to you, @chris-b1 and @jreback ?
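
A user-side sketch of the second option above, filtering void fields out before construction (the record dtype here is illustrative):

import numpy as np
import pandas as pd

rec = np.zeros(1000, [('x', 'i4'), ('pad', 'V3'), ('y', 'f8')])

# Keep only the fields pandas can represent; drop the void columns:
usable = [name for name in rec.dtype.names if rec.dtype[name].kind != 'V']
df = pd.DataFrame({name: rec[name] for name in usable})
print(df.columns.tolist())  # ['x', 'y']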

chris-b1 commented Oct 6, 2016

I think it would be reasonable to make sure the right reference is held. I guess the issue is that when you assign into an object array, the refcount of the assigned values isn't incremented (is that a NumPy bug?):

v = np.zeros(1000, 'V4')
stacked = np.empty((1, 1000), dtype='O')
stacked[0] = v   # each slot gets a scalar view into v's buffer
del v            # those views now point at memory that can be freed

In [8]: stacked
Out[8]: 
array([array([ -80,  -45, -128,   32], dtype=int8),
       array([-31,   1,   0,   0], dtype=int8),
       array([  0, -18, -46,  27], dtype=int8),
       array([-31,   1,   0,   0], dtype=int8),
       array([-36,  18,   0,   0], dtype=int8)], dtype=object)

Not completely opposed to having a bytes type; it just would be a lot of work for a fairly specialized use, especially when there are plans to revamp the type system. See also wesm/pandas2#24, and feel free to comment there.

jreback commented Oct 6, 2016

@jzwinck you are just showing a variation on #13296. We are not supporting invalid dtypes. Sure they can be added, but this is quite a bit of work, in this case for almost no gain.

jzwinck commented Oct 6, 2016

@jreback how about this idea?

Making the DataFrame constructor skip void columns automatically. This would be OK with me, and perhaps smarter than storing a column full of None as above.

jreback commented Oct 6, 2016

just raising is simpler, better, and more logical

jzwinck commented Oct 6, 2016

@jreback I do not like that at all, because it means I can no longer rely on being able to construct a DataFrame from an ndarray; sometimes it would raise an exception even when the dimensions make sense. If pandas can't support void, I would much prefer that the column either be filled with None (which I gave the code for above) or be omitted automatically, so I can move on with my life and not always have to loop over the dtype first looking for columns pandas can't use.

jzwinck commented Oct 6, 2016

I should also mention that one of the common ways void columns appear is via align=True dtypes in NumPy. That is, if I create an "aligned" structured array or recarray in NumPy, it will often have some void columns (sometimes with no name, sometimes with a name like f1 or f2, depending on how it was born). It's not that I'm deliberately creating void columns just to mess up pandas.
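
For illustration, a hand-built dtype of the shape being described, with a 3-byte void field padding an int8 out to int32 alignment (the field name 'f1' is illustrative):

import numpy as np

dt = np.dtype([('x', 'i1'), ('f1', 'V3'), ('y', 'i4')])
arr = np.zeros(10, dt)
print(arr.dtype['f1'].kind)  # 'V' -- a void padding column pandas can't hold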

jreback commented Oct 6, 2016

@jzwinck then you are in charge of this in your code. pandas doesn't support void, full stop.

jzwinck commented Oct 10, 2016

@chris-b1 I have created numpy/numpy#8129 for what does look like a bug in NumPy. Thank you for raising the possibility that the root cause is a NumPy bug; I didn't see that initially. I'd appreciate your thoughts on that ticket if you have any.

As for this ticket, I am closing it, because the very last thing I would ever want is for DataFrame construction to fail just because a void column exists. I now believe the best way to resolve this issue is in NumPy.

jzwinck closed this as completed on Oct 10, 2016