vcat of PooledDataVector might need to expand the reftype #213

gustafsson · 2016-09-12T11:28:31Z

This feature could then be used to fix vcat of pooled columns in DataFrames as well, JuliaData/DataFrames.jl#990

nalimilan · 2016-09-12T18:42:08Z

Unfortunately, I don't think we can choose the reftype dynamically, as it would introduce type instability. What we could do is use widen to return a reftype just as large as what could possibly be needed to store all levels, but not more.

Anyway, I wouldn't encourage you to spend too much time on DataArrays, as we're going to stop using them in DataFrames (JuliaData/DataFrames.jl#1008). You can have a look at CategoricalArrays.jl instead, where similar features are needed.

gustafsson · 2016-09-12T20:02:01Z

widen(UInt8)==UInt64. So I think a hardcoded DEFAULT_POOLED_REF_TYPE makes more sense.

I also made some other changes to make the return type stable.

Thanks for reminding me that DataArrays is being deprecated.

nalimilan · 2016-09-12T21:36:22Z

src/pooleddataarray.jl


-    idx = map(pa) do p
+    idx = [ begin


Why change this?

@code_warntype said it couldn't deduce the return type if idx was Any[] as returned by map.

nalimilan · 2016-09-12T21:38:57Z

widen(UInt8) gives UInt32 here, but yeah, we would have to design a custom function for it to return UInt16 instead. The point is just that the type should only depend on types, not on dynamic properties like the number of levels.
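To make the type-stability point concrete, here is a minimal sketch in current Julia syntax (not the 0.5-era syntax used in this PR). The names `next_reftype` and `combined_reftype` are hypothetical and are not part of DataArrays or Base. The only idea shown is that the result reference type is computed from the input types, never from the number of levels actually present.

```julia
# Hypothetical helper, not part of DataArrays: step up to the next
# unsigned integer type, based purely on the input type.
next_reftype(::Type{UInt8})  = UInt16
next_reftype(::Type{UInt16}) = UInt32
next_reftype(::Type{UInt32}) = UInt64
next_reftype(::Type{UInt64}) = UInt64  # nothing larger to step up to

# For two inputs, the combined pool cannot hold more levels than the sum
# of what each reference type can address, so one step above the promoted
# type is enough. The result depends on R1 and R2 only.
combined_reftype(::Type{R1}, ::Type{R2}) where {R1<:Unsigned,R2<:Unsigned} =
    next_reftype(promote_type(R1, R2))

println(combined_reftype(UInt8, UInt8))
println(combined_reftype(UInt8, UInt16))
```

Because the answer is a function of types alone, the compiler can infer the return type of a vcat built on it. The trade-off is that the result may be wider than needed, which is where calling compact afterwards comes in.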

gustafsson · 2016-09-12T21:47:17Z

Right, we could write a special vcat but on the other hand I think it's fair to have to call compact again after vcat. What do you think?

nalimilan · 2016-09-13T07:34:46Z

Given that the package is going to be deprecated soon, any correct solution is fine. But for CategoricalArrays, an optimization like this would be interesting.

nalimilan · 2016-09-13T07:48:30Z

Are you sure this code works for arrays of any dimension? You can take inspiration from typed_vcat in Base to ensure it does. Also, tests should be included for matrices (at least).

(It would be cool if we could avoid rewriting vcat from scratch, but for now that doesn't seem possible.)

nalimilan · 2016-09-13T08:04:20Z

Since you create copies of the input refs already, a simple and robust strategy would be to convert them to the wanted reftype, and then call vcat on them: most of the logic would be implemented in Base, and you wouldn't have to rewrite it.

EDIT: Oops, I now see that's what you're doing in the end. So maybe you don't need to check the dimensions of the inputs, since Base.vcat will do it for you? Anyway, dim is broken as it is since N isn't defined (also, isn't this equivalent to ndims?).

gustafsson · 2016-09-15T21:40:20Z

I added some tests, and thanks for mentioning ndims.

nalimilan · 2016-09-16T08:14:46Z

test/pooleddataarray.jl

+    ca1 = compact(pa1);
+    ca2 = compact(pa2);
+    @test vcat(ca1, ca2) == vcat(a1, a2)
+    @test vcat(ca1, ca2) |> DataArrays.reftype == DataArrays.DEFAULT_POOLED_REF_TYPE


Instead of adding a reftype function just for this, better use isa(vcat(ca1, ca2), DataArray{Int, DataArrays.DEFAULT_POOLED_REF_TYPE}). Also compute vcat(ca1, ca2) only once.
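As an illustration of that testing pattern (compute the result once, then check both value and type), here is a small sketch using plain Base arrays rather than PooledDataArray, in current Julia syntax. The inputs are made up for the example.

```julia
using Test

# Compute the concatenation once and reuse it in both checks.
result = vcat(UInt8[1, 2], UInt32[3])

@test result == [1, 2, 3]             # values are preserved
@test isa(result, Vector{UInt32})     # element type was promoted as expected
```

The same shape would apply to the pooled case: bind the vcat result to a variable, compare it to the expected values, and check the full parametric type with isa.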

nalimilan · 2016-09-16T08:25:44Z

src/pooleddataarray.jl

+    pa = PooledDataArray[p1, p2...]
+
+    pools = Vector{T}[p.pool for p in pa]
+    pool = levels(T[pools...;])


Why not just pool = unique(vcat([p.pool for p in pa]...))?

nalimilan · 2016-09-16T08:26:41Z

src/pooleddataarray.jl

+        @assert size(p)[2:end] == size(p1)[2:end]
+    end
+
+    pa = PooledDataArray[p1, p2...]


A tuple should be more efficient and simpler: pa = (p1, p2...).

I'm curious, how is a tuple more efficient than a typed array?

A tuple is immutable, so the compiler can get rid of it entirely, while the array needs to be allocated. Unlikely to be significant here, though.

I checked with code_warntype, and I had to keep some type hints to make the return type stable. But I also noticed that the code was shorter with pa as tuple than as an array.

AFAIK the type annotation in PooledDataArray[p1, p2...] doesn't improve type stability, since the type is abstract (lacks type parameters). On the contrary, it will make the array abstract even if type inference is able to identify a common concrete type for all elements.

Tuples don't have this problem as their type includes information regarding each element separately.
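A small sketch of the difference, using plain vectors in current Julia syntax. AbstractVector stands in here for the parameterless PooledDataArray annotation; the variable names are just for illustration.

```julia
a = [1, 2, 3]
b = [4, 5]

annotated = AbstractVector[a, b]  # annotation forces an abstract element type
inferred  = [a, b]                # element type is the concrete Vector{Int}
as_tuple  = (a, b)                # the tuple type records each element separately
mixed     = (a, [1.0, 2.0])       # still fully concrete, even with different types

println(isconcretetype(eltype(annotated)))  # false
println(isconcretetype(eltype(inferred)))   # true
println(isconcretetype(typeof(as_tuple)))   # true
println(isconcretetype(typeof(mixed)))      # true
```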

Thanks, I learnt something new today!

The type annotation I was referring to, the one I wasn't able to get rid of, was the T in this: pool = unique(T[[p.pool for p in pa]...;]).

@code_warntype tells me that even if it knows the type of pa to be pa::Tuple{DataArrays.PooledDataArray{Int64,UInt8,1},DataArrays.PooledDataArray{Int64,UInt8,1}}, it will call regular vcat (instead of typed_vcat) and get a pool of type pool::Union{Array{Any,1},Array{Int64,1}}.
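One plausible reading of that Union, offered as an assumption rather than a diagnosis of the PR code: when a collection of unknown length is splatted into vcat, the zero-argument case is possible, and untyped vcat with no arguments has to fall back to Any. The sketch below, in current Julia syntax, shows the two behaviours side by side. Base.typed_vcat is the function behind the T[a; b] syntax.

```julia
# An empty collection of pools is a valid runtime case when splatting.
no_pools = Vector{Int}[]

untyped = vcat(no_pools...)                  # no arguments: element type is Any
typed   = Base.typed_vcat(Int, no_pools...)  # element type is fixed up front

println(eltype(untyped))  # Any
println(eltype(typed))    # Int64 on a 64-bit system

# With at least one argument, both forms agree on the values.
pools = [[1, 2], [2, 3]]
println(vcat(pools...) == Base.typed_vcat(Int, pools...))  # true
```

Keeping the explicit T therefore pins the element type regardless of how many arrays end up being concatenated.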

nalimilan · 2016-09-16T08:28:29Z

src/pooleddataarray.jl

+    pool = levels(T[pools...;])
+
+    idx = [ begin
+        m = findat(pool, p.pool)


Use indexin(p.pool, pool) instead of findat, since the former is in Base.

That's another gem in Base that I had missed!

nalimilan · 2016-09-16T08:31:42Z

test/pooleddataarray.jl

+    a1 = zeros(2,3,4,5)
+    a2 = zeros(3,3,4,5)
+    a1[1:end] = 1:length(a1)
+    a2[1:end] = (1:length(a2)) + length(a1)


Would be good to keep common levels between the arrays to test that, e.g. by replacing + length(a1) with + 10. AFAICT this wouldn't have worked with the current code in the PR. Also, using length(a2):-1:1 instead of 1:length(a2) would be a harder test than having levels ordered.

Makes sense. It did actually work with +10 instead, but we didn't know that until that test was added. I agree that it was a poor test to only have unique ordered values.

OK, I hadn't realized that the call to levels was indeed equivalent to unique.
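For reference, here is a sketch of test data along the lines suggested in this sub-thread, written in current Julia syntax with plain arrays (the PooledDataArray wrapping is left out). The descending range is written as length(a2):-1:1, since a range needs a negative step to count down.

```julia
# Uninitialised arrays are fine here because every element is assigned below.
a1 = Array{Float64}(undef, 2, 3, 4, 5)
a2 = Array{Float64}(undef, 3, 3, 4, 5)

a1[:] = 1:length(a1)                  # ordered values 1..120
a2[:] = (length(a2):-1:1) .+ 10       # descending values 190..11

shared = intersect(a1, a2)            # levels present in both arrays
println(length(shared))               # 110 shared values (11..120)
println(size(vcat(a1, a2)))           # (5, 3, 4, 5)
```

With overlapping and unordered values, a pooled vcat has to merge pools and remap references rather than just append them, which is the case the original test did not cover.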

nalimilan · 2016-09-16T08:32:17Z

test/pooleddataarray.jl

+    a2 = zeros(3,3,4,5)
+    a1[1:end] = 1:length(a1)
+    a2[1:end] = (1:length(a2)) + length(a1)
+    ca1 = PooledDataArray(a1) |> compact;


compact(PooledDataArray(a1)) is a more common style.

nalimilan · 2016-09-16T08:32:31Z

test/pooleddataarray.jl

+    a2[1:end] = (1:length(a2)) + length(a1)
+    ca1 = PooledDataArray(a1) |> compact;
+    ca2 = PooledDataArray(a2) |> compact;
+    @test vcat(ca1, ca2) == vcat(a1, a2)


Also test the result type using isa.

nalimilan · 2016-09-16T08:33:05Z

test/pooleddataarray.jl

+    @test vcat(ca1, ca2) |> DataArrays.reftype == DataArrays.DEFAULT_POOLED_REF_TYPE
+    @test vcat(ca1, pa2) |> DataArrays.reftype == DataArrays.DEFAULT_POOLED_REF_TYPE
+
+    a1 = zeros(2,3,4,5)


Array(2,3,4,5), since you fill the array right below.

gustafsson · 2016-09-16T12:38:04Z

Thanks for the feedback!

nalimilan · 2016-09-16T12:47:18Z

src/pooleddataarray.jl

@@ -829,3 +829,21 @@ function dropna{T}(pdv::PooledDataVector{T})
    resize!(res, total)
    return res
 end
+
+function Base.vcat{T,R,N}(p1::PooledDataArray{T,R,N}, p2::PooledDataArray...)
+    for p in p2


Sorry, forgot to ask again: is this check really needed, i.e. won't the vcat call below catch it? Anyway, @assert shouldn't be used for caller errors: if you keep these, throw an ArgumentError with a message instead.

It will, but it would be harder for a user to figure out what was wrong in their arguments. Do you have a generic guideline to not validate arguments if it is guaranteed to crash anyways? :)

Actually the error you get from vcat of the refs below is really clear. I removed this check.
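For completeness, if such a check were kept, a version that throws ArgumentError instead of using @assert might look like the sketch below (current Julia syntax). The function name `check_trailing_dims` and the message wording are hypothetical; the PR itself ended up dropping the check.

```julia
# Hypothetical validation helper: all inputs must agree in every
# dimension except the first, which is the one vcat concatenates along.
function check_trailing_dims(p1::AbstractArray, ps::AbstractArray...)
    for p in ps
        if size(p)[2:end] != size(p1)[2:end]
            throw(ArgumentError(
                "all inputs must match in every dimension except the first; " *
                "got sizes $(size(p1)) and $(size(p))"))
        end
    end
    return nothing
end

check_trailing_dims(zeros(2, 3), zeros(4, 3))    # passes silently

try
    check_trailing_dims(zeros(2, 3), zeros(2, 4))
catch err
    println(err isa ArgumentError)               # true
end
```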

nalimilan · 2016-09-16T12:47:39Z

src/pooleddataarray.jl

@@ -143,6 +143,9 @@ for (f, basef) in ((:pdatazeros, :zeros), (:pdataones, :ones))
    end
 end

+# Pooled reference type
+reftype{T,R}(pa::PooledDataArray{T,R}) = R


No longer needed.

nalimilan · 2016-09-16T13:25:03Z

src/pooleddataarray.jl

+    pa = (p1, p2...)
+    pool = unique(T[[p.pool for p in pa]...;])
+
+    idx = [begin


I would find it nicer as a single line: idx = [indexin(p.pool, pool)[p.refs] for p in pa].
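The overall strategy discussed in this review (merge the pools, remap each array's refs into the merged pool with indexin, then concatenate the refs) can be sketched as follows. This is a simplified stand-in in current Julia syntax, not the PR's code: the `Pooled` struct, `values_of`, and `pooled_vcat` are hypothetical names, only vectors are handled, and missing values are ignored.

```julia
# Stand-in for a pooled vector: `pool` holds the unique values and
# `refs` holds 1-based indices into `pool`.
struct Pooled{T,R<:Unsigned}
    pool::Vector{T}
    refs::Vector{R}
end

# Expand back to the plain values.
values_of(p::Pooled) = p.pool[p.refs]

# The output reference type R is passed in as a type, so the return type
# depends only on argument types, not on how many levels there are.
function pooled_vcat(::Type{R}, p1::Pooled{T}, ps::Pooled{T}...) where {R<:Unsigned,T}
    pa = (p1, ps...)
    # Merged pool: unique values in order of first appearance.
    pool = unique(reduce(vcat, map(p -> p.pool, pa)))
    # For each input, indexin gives the position of its levels in the
    # merged pool; indexing that by the old refs yields the new refs.
    idx = map(pa) do p
        m = indexin(p.pool, pool)
        R[m[r] for r in p.refs]
    end
    return Pooled{T,R}(pool, reduce(vcat, idx))
end

a = Pooled([10, 20], UInt8[1, 2, 1])
b = Pooled([20, 30], UInt8[2, 1])
c = pooled_vcat(UInt32, a, b)

println(c.pool)                                          # [10, 20, 30]
println(values_of(c) == vcat(values_of(a), values_of(b)))  # true
```

Leaving the concatenation of refs to Base's vcat means dimension checks and error messages come for free, as noted earlier in the thread.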

nalimilan · 2016-09-16T13:27:45Z

Thanks, looks good to me (apart from the minor style remark). I'll leave the PR open a bit more to let others comment.

BTW, I would really appreciate a PR adding this feature to CategoricalArrays. We really need the more efficient concatenation strategy implemented here.

gustafsson mentioned this pull request Sep 12, 2016

vcat should expand pooled columns when needed JuliaData/DataFrames.jl#990

Closed

Johan Gustafsson added 2 commits September 12, 2016 21:37

vcat of PooledDataVector might need to expand the reftype

cffe69f

type-stable vcat of pooled arrays

6a7fa47

gustafsson force-pushed the expand_pool_on_vcat branch from 2f0b09d to 6a7fa47 Compare September 12, 2016 19:59

nalimilan reviewed Sep 12, 2016

nalimilan mentioned this pull request Sep 13, 2016

Add a promotion mechanism to choose array concatenation return type JuliaLang/julia#18472

Closed

Johan Gustafsson added 2 commits September 15, 2016 23:36

vcat: test multi-dimensional arrays

41f1040

vcat: replace dim with ndims

76f749b

nalimilan reviewed Sep 16, 2016


improved tests

cd07b87

gustafsson force-pushed the expand_pool_on_vcat branch from dd00892 to 5e9fddf Compare September 16, 2016 13:00

using simpler functions

ece7e37

gustafsson force-pushed the expand_pool_on_vcat branch from f0a9190 to ece7e37 Compare September 16, 2016 13:06

nalimilan reviewed Sep 16, 2016


Johan Gustafsson added 2 commits September 17, 2016 11:32

Use platform-default Int instead of Int64

5f43d98

compact code

39fef7e

gustafsson mentioned this pull request Sep 17, 2016

pooled vcat JuliaData/CategoricalArrays.jl#18

Merged

nalimilan merged commit 8504c80 into JuliaStats:master Sep 27, 2016

nalimilan mentioned this pull request Feb 24, 2017

address vcat return inconsistency JuliaStats/NullableArrays.jl#187

Closed
