Revisiting representation of missing values #133
This is the takeaway that I had from https://groups.google.com/d/msg/julia-dev/yIv6ZNbwHxk/WkUl8B1GQWoJ. In my ideal world, we'd define a set of appropriate abstractions such that we're not locked into any particular implementation.
Unfortunately, as my benchmarks show, it's possible to extract reasonable performance out of the BitArray in this benchmark, but abstracting that would basically require using a macro for looping so that we can unroll it 64x. Maybe we could get LLVM to optimize this on its own by specifying loop unrolling metadata, but we don't currently have a facility for that in Julia. The pain involved in getting good performance out of BitArrays is definitely a point against them, though, and maybe we should switch to NaNs for FloatingPoint and Array{Bool} for other types if we want it to be easy to get good performance in user code.
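For a sense of what that 64x unrolling means, here is a rough sketch (not the package's actual code) that consumes a BitVector mask through its internal UInt64 chunks; chunks is an internal field of Base.BitArray, and the set-bit-means-missing convention is just an assumption for this example:

# Count the non-missing entries by scanning the mask 64 elements at a time
# instead of testing one bit per iteration.
function count_present(na::BitVector)
    total = 0
    @inbounds for c in na.chunks        # one UInt64 covers 64 mask entries
        total += count_ones(~c)         # zero bits = present values
    end
    # ~c also flips the unused padding bits of the last chunk (stored as zeros),
    # so subtract them back out.
    return total - (64 * length(na.chunks) - length(na))
end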
I'm not really fond of switching to NaN for FloatingPoint since it's totally reasonable for a database to store both NaN and NULL. How much performance do we really gain from using NaN? If NaNs aren't more than twice as fast, it's hard for me to see why we should special-case floating point numbers.
We can use a different bit pattern for NULL NaN, as @tshort commented in DataArrays.jl#5. I think 25% slower is about the cutoff where there'd be reason to have DataArrays that encode missingness as NaNs. It's no big deal if your code runs in 200 ms instead of 100 ms, but if it takes 12 hours instead of 6 then you might begin to care :). OTOH we don't necessarily need to use them by default, if we can make our code abstract enough, which doesn't seem too hard if NaNs and Array{Bool} are the only things we're optimizing for. In any case, I think your idea of benchmarking a representative list of operations is a good one. I propose:
I'm pretty uncomfortable assuming that data sources will never contain the specific NaN bit pattern we'd reserve for NULL. It's also not clear to me that people should ever be doing "complex" calculations on DataArrays with missing values.
There are 22 bits of payload for Float32 NaNs and 51 bits of payload for Float64 NaNs, and I don't think the CPU ever sets them to anything but zero (except for maybe the signalling bit?). I think the probability that someone uses a specific NaN pattern the hardware doesn't generate, doesn't want that NaN to represent a missing value, and happens to pick the same NaN that we do is pretty low. I do some computations with multiple-gigabyte arrays with some missing values, although generally the first thing I do is pick a subset and drop chunks that have missing values. I also occasionally compute the mean or variance across a dimension of ~100 MB arrays with missing values. It would be nice to use DataArrays for these cases without a substantial performance cost over NaNs.
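For concreteness, here is a sketch of what reserving a distinct NaN payload for NULL could look like; the particular bit pattern is made up for illustration and is not anything DataArrays actually uses:

# A quiet Float64 NaN with payload 1. Hardware-generated NaNs (e.g. 0.0/0.0)
# carry an all-zero payload, so this pattern should not collide with them.
const NULL_NAN = reinterpret(Float64, 0x7ff8000000000001)

# NaN never compares equal with ==, so test for the NULL pattern bitwise.
isnull_nan(x::Float64) = reinterpret(UInt64, x) == reinterpret(UInt64, NULL_NAN)

isnan(NULL_NAN)       # true: still an ordinary NaN as far as the FPU is concerned
isnull_nan(0.0 / 0.0) # false: the default quiet NaN has a zero payload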
I don't want to be the only one arguing against NaN sentinels.
To chime in on this topic: if I understand the current design correctly, there is a BitArray that signals missingness in a data array, and the current discussion is about improving the performance of this approach? I can see the need to special-case the operations over floats, but would it maybe be possible to just store the sentinel value when setting a value to be missing and otherwise use the more general approach? Then numerically expensive calculations could work on the underlying data, and everybody else (for me especially DB interfaces) would work via the general interface and would not have to know about the special-casing of floats and yet another sentinel value. How would the Nullable approach factor into this debate?
I was going to suggest something like @vchuravy's idea. That would simplify the design, since fallback functions would be guaranteed to work; specialized functions for floats would then only be optimizations to be added when needed. Also, zero-copy interoperability with R would be ensured that way. That would come at the cost of setting missing values to a sentinel on initialization and when marking a value as missing.
Honestly, the fact that any other person in the world but me dislikes using NaN sentinels comes as a relief. @vchuravy, here's an overview of how I see things moving:
Does that clear things up? Basically, you just replace DataArray with NullableArray.
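For readers following along, a rough sketch of the kind of container being discussed; the field names (data, bitmap) are taken from the code further down this thread and are not necessarily NullableArrays.jl's actual layout:

# Values live in `data`; `bitmap[i]` records whether entry i is present.
# For a missing entry, data[i] holds garbage or a sentinel.
struct NullableArray{T}
    data::Vector{T}
    bitmap::Vector{Bool}   # true = value present, false = missing
end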
@johnmyleswhite Thanks for the explanation. I imagine that, instead of allocating the memory region corresponding to a missing value to zero or a random value, we could use a sentinel value. All operations can then decide to deal with missingness in any way they want. This could be made general to user-defined types by defining a method that returns the sentinel value for a given type.

Setting values to null in a NullableArray would be slightly more expensive, and creation of a NullableArray would certainly be more expensive than before. I am quite sure that this simple approach has problems of its own, but it might be worthwhile to explore.
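As an illustration of the kind of method being described (the names match the hassentinel/sentinel functions that appear in the code later in this thread; the specific choices are only examples):

# Per-type sentinel values; only types with a natural sentinel opt in,
# everything else keeps relying on the bitmap alone.
sentinel(::Type{Float64}) = NaN
sentinel(::Type{Float32}) = NaN32
# a hypothetical user-defined type could opt in with its own method, e.g.
# sentinel(::Type{MyType}) = MY_MISSING_MARKER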
That could work. Once upon a time, we did something very much like that by defining such a method. That said, I'm not totally opposed to the idea, but it's not clear to me that it's worth doing for any types other than floating-point types.
And, yes, we do allocate the full data vector, including the entries that are missing (see DataArrays.jl/src/dataarray.jl, line 21 at e539194).
Slight detour... What about
@johnmyleswhite Yes, providing a sentinel only for the types that actually have one would work.* (*I also noticed that x should be a Nullable{T}.) Something along these lines:

hassentinel(::Type) = false
hassentinel(::Type{Float64}) = true
# ...
function Base.setindex!(a::NullableArray{T}, x::Nullable{T}, idx...) where {T}
    if isnull(x)
        # only write the sentinel into the data slot for types that have one
        hassentinel(T) && (a.data[idx...] = sentinel(T))
        a.bitmap[idx...] = false
    else
        a.data[idx...] = get(x)
        a.bitmap[idx...] = true
    end
end

@tshort Any
@tshort Without any tweaks I think that would be a
For compatibility with R, a sentinel could be used for integers, PDAs and strings. Though Julia wouldn't rely on the sentinel to check for missingness, since the sentinels would be valid integer values (not sure how NA strings are handled). That said, this could also be done when passing data to R by the package responsible for that, to avoid imposing an overhead on all Julia users.
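For reference, a sketch of what lining up with R's integer sentinel would mean on the Julia side; the NA_integer_ value is R's, while the constant and helper names here are made up:

# R stores NA_integer_ as the smallest 32-bit integer.
const R_NA_INTEGER = typemin(Int32)   # -2147483648

# A zero-copy Int32 column handed to R would need missing entries to hold this
# value; when reading an R vector back, it is what marks an NA.
is_r_na(x::Int32) = x == R_NA_INTEGER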
Eh. I'm not excited about making everything on all primitive numeric types a bunch slower for a form of hypothetical compatibility with R.
I don't think sentinels should be slower. I timed an Int sentinel using @simonster's code, and it was about the same speed as using Vector{Bool}. I tend to prefer Vector{Bool} as the default and NaN sentinels for floats. As John said, the API is the most important part.
Moving discussion about how to represent missing values in DataArrays from JuliaLang/julia#9363 and JuliaData/DataFrames.jl#745 here. I've done some benchmarks here that show that, with the best code I can generate at the moment, NaN sentinels > Vector{Bool} > BitVector > Vector{Float64}.
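A rough sketch of the kind of micro-benchmark being compared (this is not the linked benchmark script; BenchmarkTools and the simple sum reduction below are stand-ins):

using BenchmarkTools

n = 10^7
x = rand(n)
mask_bit  = rand(n) .< 0.1               # BitVector mask, ~10% missing
mask_bool = Vector{Bool}(mask_bit)       # same mask as a Vector{Bool}
x_nan = copy(x); x_nan[mask_bit] .= NaN  # same mask encoded as NaN sentinels

# Sum the non-missing values under a separate-mask representation.
function sum_masked(x, m)
    s = 0.0
    @inbounds for i in eachindex(x)
        m[i] || (s += x[i])
    end
    return s
end

# Sum the non-missing values under the NaN-sentinel representation.
function sum_nan(x)
    s = 0.0
    @inbounds for v in x
        isnan(v) || (s += v)
    end
    return s
end

@btime sum_masked($x, $mask_bool)
@btime sum_masked($x, $mask_bit)
@btime sum_nan($x_nan)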