-
Notifications
You must be signed in to change notification settings - Fork 50
vcat of PooledDataVector might need to expand the reftype #213
Conversation
Unfortunately, I don't think we can choose the reftype dynamically, as it would introduce type instability. What we could do is use Anyway, I wouldn't encourage you to spend too much time on DataArrays, are we're going to stop using them in DataFrames (JuliaData/DataFrames.jl#1008). You can have a look at CategoricalArrays.jl instead, were similar features are needed. |
2f0b09d
to
6a7fa47
Compare
I also made some other changes to make the return type stable. Thanks for reminding me that DataArrays is being deprecated. |
|
||
idx = map(pa) do p | ||
idx = [ begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@code_warntype said it couldn't deduce the return type if idx was Any[] as returned by map.
|
Right, we could write a special vcat but on the other hand I think it's fair to have to call compact again after vcat. What do you think? |
Given that the package is going to be deprecated soon, any correct solution is fine. But for CategoricalArrays, an optimization like this would be interesting. |
Are you sure this code works for arrays of any dimension? You can take inspiration from (It would be cool if we could avoid rewriting |
Since you create copies of the input refs already, a simple and robust strategy would be to EDIT: Oops, I now see that's what you're doing in the end. So maybe you don't need to check the dimensions of the inputs, since |
I added some tests, and thanks for mentioning |
ca1 = compact(pa1); | ||
ca2 = compact(pa2); | ||
@test vcat(ca1, ca2) == vcat(a1, a2) | ||
@test vcat(ca1, ca2) |> DataArrays.reftype == DataArrays.DEFAULT_POOLED_REF_TYPE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding a reftype
function just for this, better use isa(vcat(ca1, ca2), DataArray{Int, DataArrays.DEFAULT_POOLED_REF_TYPE}}
. Also compute vcat(ca1, ca2)
only once.
pa = PooledDataArray[p1, p2...] | ||
|
||
pools = Vector{T}[p.pool for p in pa] | ||
pool = levels(T[pools...;]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just pool = unique(vcat([p.pool for p in pa]...))
?
@assert size(p)[2:end] == size(p1)[2:end] | ||
end | ||
|
||
pa = PooledDataArray[p1, p2...] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A tuple should be more efficient and simpler: pa = (p1, p2...)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious, how is a tuple more efficient than a typed array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A tuple is immutable, so the compiler can get rid of it entirely, while the array needs to be allocated. Unlikely to be significant here, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked with code_warntype, and I had to keep some type hints to make the return type stable. But I also noticed that the code was shorter with pa
as tuple than as an array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK the type annotation in PooledDataArray[p1, p2...]
doesn't improve type stability, since the type is abstract (lacks type parameters). On the contrary, it will make the array abstract even if type inference is table to identify a common concrete type for all elements.
Tuples don't have this problem as their type includes information regarding each element separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks I learnt something new today!
The type annotation I was referring to that I wasn't able to get rid of was the T
in thispool = unique(T[[p.pool for p in pa]...;])
.
@code_warntype tells me that even if it knows the type of pa to be pa::Tuple{DataArrays.PooledDataArray{Int64,UInt8,1},DataArrays.PooledDataArray{Int64,UInt8,1}}
it will call regular vcat (instead of typed_vcat) and get a pool of type pool::Union{Array{Any,1},Array{Int64,1}}
pool = levels(T[pools...;]) | ||
|
||
idx = [ begin | ||
m = findat(pool, p.pool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use indexin(p.pool, pool)
instead of findat
, since the former is in Base.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's another gem in Base
that I had missed!
a1 = zeros(2,3,4,5) | ||
a2 = zeros(3,3,4,5) | ||
a1[1:end] = 1:length(a1) | ||
a2[1:end] = (1:length(a2)) + length(a1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to keep common levels between the arrays to test that, e.g. by replacing + length(a1)
with + 10
. AFAICT this wouldn't have worked with the current code in the PR. Also, using length(a2):1
instead of 1:length(a2)
would be a harder test than having levels ordered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. It did actually work with +10 instead, but we didn't know that until that test was added. I agree that it was a poor test to only have unique ordered values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I hadn't realized that the calls to levels
was indeed equivalent to unique
.
a2 = zeros(3,3,4,5) | ||
a1[1:end] = 1:length(a1) | ||
a2[1:end] = (1:length(a2)) + length(a1) | ||
ca1 = PooledDataArray(a1) |> compact; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compact(PooledDataArray(a1))
is a more common style.
a2[1:end] = (1:length(a2)) + length(a1) | ||
ca1 = PooledDataArray(a1) |> compact; | ||
ca2 = PooledDataArray(a2) |> compact; | ||
@test vcat(ca1, ca2) == vcat(a1, a2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also test the result type using isa
.
@test vcat(ca1, ca2) |> DataArrays.reftype == DataArrays.DEFAULT_POOLED_REF_TYPE | ||
@test vcat(ca1, pa2) |> DataArrays.reftype == DataArrays.DEFAULT_POOLED_REF_TYPE | ||
|
||
a1 = zeros(2,3,4,5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Array(2,3,4,5)
, since you fill the array right below.
Thanks for the feedback! |
@@ -829,3 +829,21 @@ function dropna{T}(pdv::PooledDataVector{T}) | |||
resize!(res, total) | |||
return res | |||
end | |||
|
|||
function Base.vcat{T,R,N}(p1::PooledDataArray{T,R,N}, p2::PooledDataArray...) | |||
for p in p2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, forgot to ask again: is this check really needed, i.e. won't the vcat
call below catch it? Anyway, @assert
shouldn't be used for caller errors: if you keep these, throw an ArgumentError
with a message instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will, but it would be harder for a user to figure out what was wrong in their arguments. Do you have a generic guideline to not validate arguments if it is guaranteed to crash anyways? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the error you get from vcat of the refs below is really clear. I removed this check.
@@ -143,6 +143,9 @@ for (f, basef) in ((:pdatazeros, :zeros), (:pdataones, :ones)) | |||
end | |||
end | |||
|
|||
# Pooled reference type | |||
reftype{T,R}(pa::PooledDataArray{T,R}) = R |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer needed.
dd00892
to
5e9fddf
Compare
f0a9190
to
ece7e37
Compare
pa = (p1, p2...) | ||
pool = unique(T[[p.pool for p in pa]...;]) | ||
|
||
idx = [begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would find it nicer as a single line idx = [indexin(p.pool, pool)[p.refs] for p in pa]
.
Thanks, looks good to me (apart from the minor style remark). I'll leave the PR open a bit more to let others comment. BTW, I would really appreciate a PR adding this feature to CategoricalArrays. We really need the more efficient concatenation strategy implemented here. |
This feature could then be used to fix vcat of pooled columns in DataFrames as well, JuliaData/DataFrames.jl#990