ERR: fail fast with non-supported dtypes on construction #14349
Comments
I have observed that this does not happen if a reference to the NumPy array is held:
That code behaves as expected, simply printing zeros. But if you do this:
You're back to crashing or non-determinism. So the problem seems to be that Pandas is holding a dangling (count=0) reference to data from NumPy.
Maybe you're already aware, but void types aren't really supported; what you're actually getting is that array cast to an
This is a duplicate of #13447. void is not a supported dtype. As I said in that issue, you can cast if you want. Further, this could be checked more aggressively in construction. I'll mark it for that enhancement.
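The cast suggested above might look like the following minimal sketch. It assumes the void records are 4 bytes wide, as in the issue's example, and that reinterpreting them as `uint32` is acceptable; the target dtype is an illustrative choice, not something stated in this thread.

```python
import numpy as np
import pandas as pd

# Raw 4-byte void records, as in the issue's example.
raw = np.zeros(1000, dtype="V4")

# Reinterpret the same bytes as a supported dtype of equal item size.
# uint32 is an assumption made for illustration; use whatever matches
# the real layout of the data.
as_uint32 = raw.view(np.uint32)

df = pd.DataFrame({"a": as_uint32})
print(df["a"].dtype)
```

Because the view has a supported dtype, the void path discussed in this issue is never exercised.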
@chris-b1 Yes I am aware that the void dtype is said to be unsupported. However it should not be nondeterministic, and it should not segfault. The problem we have here is that Pandas is holding a reference it does not own, and this should be fixable. @jreback I do not want the solution to be simply refusing to construct a DataFrame. Instead, at a minimum I would want a way to tell Pandas to construct the DataFrame with the columns it can use, and omit the ones it cannot use. If it simply raises an exception every time it sees a void dtype, this will not be usable, but having it accept the data it can accept and omit the data it cannot accept would be OK with me. Or, even better, just fix the reference counting bug. I'm not asking Pandas to "support" void any more or less than it does now, just that it should not crash the Python runtime. A patch to make Pandas fail to construct DataFrames that it was previously able to construct would be a step backward.
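The "keep the usable columns, omit the rest" behavior requested above can be approximated in user code. The sketch below is not a pandas feature: `frame_without_void` is a hypothetical helper, and the field names `x`, `pad`, and `y` are made up for illustration.

```python
import numpy as np
import pandas as pd

def frame_without_void(rec):
    """Hypothetical helper: build a DataFrame from a structured array,
    skipping every field whose dtype kind is 'V' (void or nested struct)."""
    keep = [name for name in rec.dtype.names if rec.dtype[name].kind != "V"]
    return pd.DataFrame({name: rec[name] for name in keep})

# Illustrative layout: two ordinary fields around a 4-byte void field.
layout = np.dtype([("x", "i4"), ("pad", "V4"), ("y", "f8")])
rec = np.zeros(3, dtype=layout)

df = frame_without_void(rec)
print(list(df.columns))
```

Filtering on `dtype.kind` also drops nested structured fields, which report kind `'V'` as well.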
If you want to submit a pull request, great.
Here is another example, which has the same problem despite not explicitly using void:
The fundamental error is creation of an ObjectBlock which contains 1000 buffer readers which hold raw pointers into the passed array, with no reference count increment upon that array. A simple solution is the following patch to
Then, each ObjectBlock does hold a reference to the input data. This prevents the undefined behavior, and actually prints the correct values. I also tried changing This also works, in
What that does is to simply store
Other possible solutions include:
Do any of these solutions appeal to you, @chris-b1 and @jreback?
I think it would be reasonable to make sure the right reference is held - I guess the issue is that when you assign into an object array the refcount of the assigned values isn't incremented (is that a numpy bug?)
Not completely opposed to having a
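On the refcount question raised above, a quick check shows what happens for an ordinary Python object stored in an object array. This is only a sketch of the general case; it does not reproduce the void-array path that this issue is about.

```python
import sys
import numpy as np

obj = object()
holder = np.empty(1, dtype=object)

before = sys.getrefcount(obj)
holder[0] = obj  # store the object in an object-dtype array
after = sys.getrefcount(obj)

# The array now holds its own reference to obj.
print(after - before)
```

So for plain objects the array does take a reference; the open question in this thread is whether the items produced from a void array keep the original buffer alive in the same way.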
@jreback how about this idea?
Just raising is simpler, better and more logical.
@jreback I do not like that at all, because it means that I can no longer rely on being able to construct a DataFrame from an ndarray--sometimes it will raise an exception even if the dimensions make sense. If Pandas can't support void, I would very much prefer that the column either be filled with
I should also mention that one of the common ways that void columns appear is via
@jzwinck then you are in charge of this in your code. pandas doesn't support void, full stop.
@chris-b1 I have created the above issue, numpy/numpy#8129, due to what does look like a bug in NumPy. Thank you for raising the possibility that the root cause is a NumPy bug--I didn't see that initially. I'd appreciate your thoughts on that ticket if you have any. As for this ticket, I am closing it because the very, very last thing I would ever want to happen would be for DataFrame construction to fail just because a void column exists. I now believe the best way to resolve this issue is in NumPy.
Example Code
print(pd.DataFrame({'a': np.zeros(1000, 'V4')}))
Results
Non-deterministic behavior. Sometimes you get all zeros, sometimes you get garbage like this:
That is despite the fact that the bytes are actually all zero, and NumPy prints all rows as [0, 0, 0, 0].
Sometimes when printing a wider DataFrame containing such a column, it segfaults with this stack trace:
Expected Output
All rows [0, 0, 0, 0], just as NumPy prints it.
Output of pd.show_versions()
commit: None
python: 3.5.1
python-bits: 64
OS: Linux
OS-release: 3.13.0
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.1
pip: 8.1.2
setuptools: 27.2.0
numpy: 1.11.1