Skip to content
This repository has been archived by the owner on May 4, 2019. It is now read-only.

vcat of PooledDataVector might need to expand the reftype #213

Merged
merged 8 commits into from
Sep 27, 2016

Conversation

gustafsson
Copy link
Contributor

This feature could then be used to fix vcat of pooled columns in DataFrames as well, JuliaData/DataFrames.jl#990

@nalimilan
Copy link
Member

Unfortunately, I don't think we can choose the reftype dynamically, as it would introduce type instability. What we could do is use widen to return a reftype just as large as what could possibly be needed to store all levels, but not more.

Anyway, I wouldn't encourage you to spend too much time on DataArrays, are we're going to stop using them in DataFrames (JuliaData/DataFrames.jl#1008). You can have a look at CategoricalArrays.jl instead, were similar features are needed.

@gustafsson
Copy link
Contributor Author

widen(UInt8)==UInt64. So I think a hardcoded DEFAULT_POOLED_REF_TYPE makes more sense.

I also made some other changes to make the return type stable.

Thanks for reminding me that DataArrays is being deprecated.


idx = map(pa) do p
idx = [ begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@code_warntype said it couldn't deduce the return type if idx was Any[] as returned by map.

@nalimilan
Copy link
Member

widen(UInt8) gives UInt32 here, but yeah, we would have to design a custom function for it to return UInt16 instead. The point is just that the type should only depend on types, not on dynamic properties like the number of levels.

@gustafsson
Copy link
Contributor Author

Right, we could write a special vcat but on the other hand I think it's fair to have to call compact again after vcat. What do you think?

@nalimilan
Copy link
Member

Given that the package is going to be deprecated soon, any correct solution is fine. But for CategoricalArrays, an optimization like this would be interesting.

@nalimilan
Copy link
Member

Are you sure this code works for arrays of any dimension? You can take inspiration from typed_vcat in Base to ensure it does. Also, tests should be included for matrices (at least).

(It would be cool if we could avoid rewriting vcat from scratch, but for now that doesn't seem possible.)

@nalimilan
Copy link
Member

nalimilan commented Sep 13, 2016

Since you create copies of the input refs already, a simple and robust strategy would be to convert them to the wanted reftype, and then call vcat on them: most of the logic would be implemented in Base, and you wouldn't have to rewrite it.

EDIT: Oops, I now see that's what you're doing in the end. So maybe you don't need to check the dimensions of the inputs, since Base.vcat will do it for you? Anyway, dim is broken as it is since N isn't defined (also, isn't this equivalent to ndims?).

@gustafsson
Copy link
Contributor Author

I added some tests, and thanks for mentioning ndims

ca1 = compact(pa1);
ca2 = compact(pa2);
@test vcat(ca1, ca2) == vcat(a1, a2)
@test vcat(ca1, ca2) |> DataArrays.reftype == DataArrays.DEFAULT_POOLED_REF_TYPE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding a reftype function just for this, better use isa(vcat(ca1, ca2), DataArray{Int, DataArrays.DEFAULT_POOLED_REF_TYPE}}. Also compute vcat(ca1, ca2) only once.

pa = PooledDataArray[p1, p2...]

pools = Vector{T}[p.pool for p in pa]
pool = levels(T[pools...;])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just pool = unique(vcat([p.pool for p in pa]...))?

@assert size(p)[2:end] == size(p1)[2:end]
end

pa = PooledDataArray[p1, p2...]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A tuple should be more efficient and simpler: pa = (p1, p2...).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, how is a tuple more efficient than a typed array?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A tuple is immutable, so the compiler can get rid of it entirely, while the array needs to be allocated. Unlikely to be significant here, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked with code_warntype, and I had to keep some type hints to make the return type stable. But I also noticed that the code was shorter with pa as tuple than as an array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK the type annotation in PooledDataArray[p1, p2...] doesn't improve type stability, since the type is abstract (lacks type parameters). On the contrary, it will make the array abstract even if type inference is table to identify a common concrete type for all elements.

Tuples don't have this problem as their type includes information regarding each element separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks I learnt something new today!

The type annotation I was referring to that I wasn't able to get rid of was the T in thispool = unique(T[[p.pool for p in pa]...;]).

@code_warntype tells me that even if it knows the type of pa to be pa::Tuple{DataArrays.PooledDataArray{Int64,UInt8,1},DataArrays.PooledDataArray{Int64,UInt8,1}} it will call regular vcat (instead of typed_vcat) and get a pool of type pool::Union{Array{Any,1},Array{Int64,1}}

pool = levels(T[pools...;])

idx = [ begin
m = findat(pool, p.pool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use indexin(p.pool, pool) instead of findat, since the former is in Base.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's another gem in Base that I had missed!

a1 = zeros(2,3,4,5)
a2 = zeros(3,3,4,5)
a1[1:end] = 1:length(a1)
a2[1:end] = (1:length(a2)) + length(a1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to keep common levels between the arrays to test that, e.g. by replacing + length(a1) with + 10. AFAICT this wouldn't have worked with the current code in the PR. Also, using length(a2):1 instead of 1:length(a2) would be a harder test than having levels ordered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. It did actually work with +10 instead, but we didn't know that until that test was added. I agree that it was a poor test to only have unique ordered values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I hadn't realized that the calls to levels was indeed equivalent to unique.

a2 = zeros(3,3,4,5)
a1[1:end] = 1:length(a1)
a2[1:end] = (1:length(a2)) + length(a1)
ca1 = PooledDataArray(a1) |> compact;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compact(PooledDataArray(a1)) is a more common style.

a2[1:end] = (1:length(a2)) + length(a1)
ca1 = PooledDataArray(a1) |> compact;
ca2 = PooledDataArray(a2) |> compact;
@test vcat(ca1, ca2) == vcat(a1, a2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also test the result type using isa.

@test vcat(ca1, ca2) |> DataArrays.reftype == DataArrays.DEFAULT_POOLED_REF_TYPE
@test vcat(ca1, pa2) |> DataArrays.reftype == DataArrays.DEFAULT_POOLED_REF_TYPE

a1 = zeros(2,3,4,5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array(2,3,4,5), since you fill the array right below.

@gustafsson
Copy link
Contributor Author

Thanks for the feedback!

@@ -829,3 +829,21 @@ function dropna{T}(pdv::PooledDataVector{T})
resize!(res, total)
return res
end

function Base.vcat{T,R,N}(p1::PooledDataArray{T,R,N}, p2::PooledDataArray...)
for p in p2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, forgot to ask again: is this check really needed, i.e. won't the vcat call below catch it? Anyway, @assert shouldn't be used for caller errors: if you keep these, throw an ArgumentError with a message instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will, but it would be harder for a user to figure out what was wrong in their arguments. Do you have a generic guideline to not validate arguments if it is guaranteed to crash anyways? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the error you get from vcat of the refs below is really clear. I removed this check.

@@ -143,6 +143,9 @@ for (f, basef) in ((:pdatazeros, :zeros), (:pdataones, :ones))
end
end

# Pooled reference type
reftype{T,R}(pa::PooledDataArray{T,R}) = R
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer needed.

pa = (p1, p2...)
pool = unique(T[[p.pool for p in pa]...;])

idx = [begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would find it nicer as a single line idx = [indexin(p.pool, pool)[p.refs] for p in pa].

@nalimilan
Copy link
Member

Thanks, looks good to me (apart from the minor style remark). I'll leave the PR open a bit more to let others comment.

BTW, I would really appreciate a PR adding this feature to CategoricalArrays. We really need the more efficient concatenation strategy implemented here.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants