Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for RGB colors #67

Open
timholy opened this issue Mar 6, 2020 · 9 comments
Open

Support for RGB colors #67

timholy opened this issue Mar 6, 2020 · 9 comments

Comments

@timholy
Copy link
Contributor

timholy commented Mar 6, 2020

Aside from #66, I think one of the last obstacles to using this in ImageFiltering is the handling of color array elements, for example RGB(0.8f0, 0.3f0, 0.5f0) (which is just a struct of 3 Float32s). Through TiledIteration it's possible to convert the data into a suitable format, for example RGB{N0f8} can be converted to RGB{Float32} because TileBuffer takes an element type as an argument. (I note that's really old syntax, with modern Julia I should change it to TileBuffer{T}(undef, tileaxs).) Consequently it will be converted before it is used; because convolution/correlation uses each element multiple times, this conversion is well worth doing if it gets us better performance and allows us to use this package. (Another benefit of copying is that we can guarantee stridedness.)

So the question is, what kind of format should I convert it to? Should we define explict support for RGB, or would it be better to convert to an SVector{3} or similar? I kind of favor the latter since it would presumably be more generic.

I suspect this will alter the loop unrolling/SIMD packing, once I understand what representation you'd like here.

@chriselrod
Copy link
Member

In terms of data layout, I think it is likely that a "struct of arrays" representation would work best. Preferably, rather than creating three separate arrays, the last axis of the array would be of dim 3, where the dims correspond to R, G, and B.

That'd make it cheap to load separate vectors of R, G, and B without requiring a gather or any shuffling. That is, if we can't remove the abstractions surrounding raw operations on integers or floating point numbers, I think we should use "structs of vectors", where one vector will hold all the R, another all the G, and the last all the B.

This would require:

  1. Define such an RGB struct vector.
  2. Define the needed operations.
  3. Update the cost modeling.

"2." could be as easy as making the existing operations on RGB values generic enough so that they'd expect the struct we define in "1.", and then defining appropriate vload! and vstore! operations to work on the StructArrays (or a StructTiles object so that we can convert and fill this object just before we begin operating, and thus begin operating while they're still hot in the cache).

"3." is the most difficult. The most important aspect would be to allow it to model a vector as occupying multiple registers. A vector of RGB would consume one register for each the R, the G, and the B vector. If it were to decide on a kernel that spills registers, it would likely get very bad performance.
Less important is that the costs of functions should reasonably accurately reflect their relative cost. I don't think that it is likely to change the decisions it makes, so this is much less important than making sure it's correctly modeling register pressure.

Something else to consider would be supporting the N0f8 representation. With AVX2, each RGB{N0f8} vector object would contain vectors of length 32, versus vectors of merely length 8 for RGB{Float32}.
Nonetheless, on a mostly compute-bound task (using AVX2), the Float32 seems to be faster than N0f8:

julia> using LoopVectorization, BenchmarkTools

julia> Af32 = rand(Float32, 256, 256);

julia> Bf32 = rand(Float32, 256, 256);

julia> Au8 = rand(UInt8, 256, 256);

julia> Bu8 = rand(UInt8, 256, 256);

julia> Cf32 = similar(Af32); Cu8 = similar(Au8);

julia> function AmulBavx2!(C, A, B)
           @avx for m  1:size(A,1), n  1:size(B,2)
               C[m,n] = zero(eltype(C))
               for k  1:size(A,2)
                   C[m,n] += A[m,k] * B[k,n]
               end
           end
       end
AmulBavx2! (generic function with 1 method)

julia> @benchmark AmulBavx2!($Cf32, $Af32, $Bf32)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     741.575 μs (0.00% GC)
  median time:      748.385 μs (0.00% GC)
  mean time:        758.588 μs (0.00% GC)
  maximum time:     2.121 ms (0.00% GC)
  --------------
  samples:          6572
  evals/sample:     1

julia> @benchmark AmulBavx2!($Cu8, $Au8, $Bu8)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.366 ms (0.00% GC)
  median time:      1.400 ms (0.00% GC)
  mean time:        1.404 ms (0.00% GC)
  maximum time:     1.735 ms (0.00% GC)
  --------------
  samples:          3556
  evals/sample:     1

Simple elementwise multiplication and especially addition were faster with the UInt8s. With AVX2, one load instruction can load 32 8-bit integers vs 8 32-bit numbers.

The case for elementwise multiplication was despite the fact that there is no assembly instruction for multiplying vectors of 8-bit integers (not with AVX2 or AVX512BW):

julia> u = ntuple(Val(32)) do _ Core.VecElement(rand(UInt8)) end
(VecElement{UInt8}(0x43), VecElement{UInt8}(0xf1), VecElement{UInt8}(0x25), VecElement{UInt8}(0x21), VecElement{UInt8}(0xf6), VecElement{UInt8}(0x8e), VecElement{UInt8}(0xc1), VecElement{UInt8}(0x0d), VecElement{UInt8}(0xa4), VecElement{UInt8}(0x5f), VecElement{UInt8}(0xc0), VecElement{UInt8}(0x49), VecElement{UInt8}(0x6c), VecElement{UInt8}(0x98), VecElement{UInt8}(0xe6), VecElement{UInt8}(0xcb), VecElement{UInt8}(0x7a), VecElement{UInt8}(0xdb), VecElement{UInt8}(0x83), VecElement{UInt8}(0x83), VecElement{UInt8}(0x0f), VecElement{UInt8}(0xed), VecElement{UInt8}(0xe3), VecElement{UInt8}(0x92), VecElement{UInt8}(0xef), VecElement{UInt8}(0xed), VecElement{UInt8}(0x62), VecElement{UInt8}(0x42), VecElement{UInt8}(0x6d), VecElement{UInt8}(0x04), VecElement{UInt8}(0xf9), VecElement{UInt8}(0x93))

julia> u2 = ntuple(Val(32)) do _ Core.VecElement(rand(UInt8)) end
(VecElement{UInt8}(0xd9), VecElement{UInt8}(0x12), VecElement{UInt8}(0x69), VecElement{UInt8}(0x0c), VecElement{UInt8}(0xb4), VecElement{UInt8}(0x73), VecElement{UInt8}(0xa9), VecElement{UInt8}(0x3d), VecElement{UInt8}(0x8e), VecElement{UInt8}(0x64), VecElement{UInt8}(0x83), VecElement{UInt8}(0xa6), VecElement{UInt8}(0x49), VecElement{UInt8}(0xde), VecElement{UInt8}(0x95), VecElement{UInt8}(0x28), VecElement{UInt8}(0x27), VecElement{UInt8}(0x78), VecElement{UInt8}(0xc0), VecElement{UInt8}(0xd1), VecElement{UInt8}(0xfe), VecElement{UInt8}(0xbb), VecElement{UInt8}(0xbf), VecElement{UInt8}(0x4b), VecElement{UInt8}(0x21), VecElement{UInt8}(0x72), VecElement{UInt8}(0xf6), VecElement{UInt8}(0xa8), VecElement{UInt8}(0x58), VecElement{UInt8}(0x13), VecElement{UInt8}(0x09), VecElement{UInt8}(0xe2))

julia> using SIMDPirates

julia> vmul(u, u2)
(VecElement{UInt8}(0xcb), VecElement{UInt8}(0xf2), VecElement{UInt8}(0x2d), VecElement{UInt8}(0x8c), VecElement{UInt8}(0xf8), VecElement{UInt8}(0xca), VecElement{UInt8}(0x69), VecElement{UInt8}(0x19), VecElement{UInt8}(0xf8), VecElement{UInt8}(0x1c), VecElement{UInt8}(0x40), VecElement{UInt8}(0x56), VecElement{UInt8}(0xcc), VecElement{UInt8}(0xd0), VecElement{UInt8}(0xde), VecElement{UInt8}(0xb8), VecElement{UInt8}(0x96), VecElement{UInt8}(0xa8), VecElement{UInt8}(0x40), VecElement{UInt8}(0xf3), VecElement{UInt8}(0xe2), VecElement{UInt8}(0x1f), VecElement{UInt8}(0x5d), VecElement{UInt8}(0xc6), VecElement{UInt8}(0xcf), VecElement{UInt8}(0x8a), VecElement{UInt8}(0x2c), VecElement{UInt8}(0x50), VecElement{UInt8}(0x78), VecElement{UInt8}(0x4c), VecElement{UInt8}(0xc1), VecElement{UInt8}(0xc6))

yields

#julia> @code_native debuginfo=:none vmul(u, u2)
        .text
        vpunpckhbw      %ymm0, %ymm0, %ymm2 # ymm2 = ymm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,24,24,25,25,26,26,27,27,28,28,29,29,30,30,31,31]
        vpunpckhbw      %ymm0, %ymm1, %ymm3 # ymm3 = ymm1[8],ymm0[8],ymm1[9],ymm0[9],ymm1[10],ymm0[10],ymm1[11],ymm0[11],ymm1[12],ymm0[12],ymm1[13],ymm0[13],ymm1[14],ymm0[14],ymm1[15],ymm0[15],ymm1[24],ymm0[24],ymm1[25],ymm0[25],ymm1[26],ymm0[26],ymm1[27],ymm0[27],ymm1[28],ymm0[28],ymm1[29],ymm0[29],ymm1[30],ymm0[30],ymm1[31],ymm0[31]
        vpmullw %ymm2, %ymm3, %ymm2
        movabsq $.rodata.cst32, %rax
        vmovdqa (%rax), %ymm3
        vpand   %ymm3, %ymm2, %ymm2
        vpunpcklbw      %ymm0, %ymm0, %ymm0 # ymm0 = ymm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,16,16,17,17,18,18,19,19,20,20,21,21,22,22,23,23]
        vpunpcklbw      %ymm0, %ymm1, %ymm1 # ymm1 = ymm1[0],ymm0[0],ymm1[1],ymm0[1],ymm1[2],ymm0[2],ymm1[3],ymm0[3],ymm1[4],ymm0[4],ymm1[5],ymm0[5],ymm1[6],ymm0[6],ymm1[7],ymm0[7],ymm1[16],ymm0[16],ymm1[17],ymm0[17],ymm1[18],ymm0[18],ymm1[19],ymm0[19],ymm1[20],ymm0[20],ymm1[21],ymm0[21],ymm1[22],ymm0[22],ymm1[23],ymm0[23]
        vpmullw %ymm0, %ymm1, %ymm0
        vpand   %ymm3, %ymm0, %ymm0
        vpackuswb       %ymm2, %ymm0, %ymm0
        retq
        nopw    %cs:(%rax,%rax)
        nopl    (%rax)

Versus Float32, where multiplying 32 of them could be accomplished with 4 multiply instructions.

Which does better is likely task dependent. If the 8-bit integers require a lot of multiplications and fixed-point related re-normalizing, the floating point numbers will probably be faster. If there are more additions/subtractions and memory moves, the 8-bits will have an advantage.

If CPUs with the AVX512VNNI instruction set become more common, it may be worth supporting them as well if some kernels can deal with the fact that they horizontally sum some elements. That is, when performing a fused multiply-add on 8 bit integers, it would compute

for i in 1:16
    dest[i] = a[4i-3]*b[4i-3] + a[4i-2]*b[4i-2] + a[4i-1]*b[4i-1] + a[4i]*b[4i]
end

I have such a chip and could run tests.

@timholy
Copy link
Contributor Author

timholy commented Mar 7, 2020

Interesting. We got away from this layout because I had the impression it would be better to have all the values localized, but it would be easy to get back to. I presume the core problem is that having the fastest dimension be of size 3 is not ideal from a vectorization standpoint?

Currently you can do this by combining views:

julia> rgbchannels = rand(128, 128, 3);

julia> using ImageCore

julia> img = colorview(RGB, PermutedDimsArray(rgbchannels, (3, 1, 2)));

julia> println(summary(img))
128×128 reshape(reinterpret(RGB{Float64}, PermutedDimsArray(::Array{Float64,3}, (3, 1, 2))), 128, 128) with eltype RGB{Float64}

julia> img[1, 1]
RGB{Float64}(0.5340850453425747,0.1268892532095427,0.4547422783495081)

EDIT: I should clarify that the original rgbchannels array can be extracted from img:

julia> println(summary(img.parent.parent.parent))
128×128×3 Array{Float64,3}

Do you need something more "dedicated" or does this solve the problem? Note it's 3 "orthogonal" manipulations: an index permutation via PermutedDimsArray, an eltype-change via reinterpret (a ReinterpretArray), and a shape change via reshape/ReshapedArray. It's a lot of nesting of types, on the other hand if each operation can be supported by LoopVectorization then composability might provide a lot of power without people needing to write LoopVectorization methods to handle their custom types.

But if the nesting is trouble, we used to have a ColorView type (that was kind of a 3-in-1) and we could (re)introduce it.

@timholy
Copy link
Contributor Author

timholy commented Mar 7, 2020

This prompts me to realize the above should probably be 4 layers of nesting: we probably need something to say to the compiler that dimension 3 of rgbchannels has size 3. ReinterpretArray (a nice, generic solution to element-type changes) has had a large negative impact on performance compared to our old custom ColorView type. I bet we could fix that if the type system knew the size.

@timholy
Copy link
Contributor Author

timholy commented Mar 7, 2020

Sorry to keep rethinking this, but for ImageFiltering's purposes we could just craft a TileBuffer that's a plain Array{Float32} with color in the last dimension, and define a custom copy operation to and from this type. That would take all the need for smarts out of LoopVectorization. The only disadvantage I can see is that you might need to allocate a similar temporary buffer for the result, rather than using the actual output buffer.

If we go that way (which seems pretty attractive), I suspect this issue could be closed now.

To finally address one of your other points,

Something else to consider would be supporting the N0f8 representation.

For most mathematics you really need to convert to Float32 anyway because of overflow; UInt8 isn't a very useful computational type, I view it as more of a storage type. @kimikage has done some excellent work optimizing our conversions, e.g., https://github.com/JuliaMath/FixedPointNumbers.jl/blob/0c486f385e5fa00b66573e544a2923e73cdfa279/src/normed.jl#L147-L157. (N0f8 is the f=0 case.)

@chriselrod
Copy link
Member

Interesting. We got away from this layout because I had the impression it would be better to have all the values localized, but it would be easy to get back to. I presume the core problem is that having the fastest dimension be of size 3 is not ideal from a vectorization standpoint?

That might be the biggest problem.
Looking here, it looks like the color elements are all treated in the same way, and different colors don't seem to interact.

If this is ever not the case, the struct-of-arrays layout has an advantage.
Using complex numbers as a simple example of vectorization getting complicated:

c = a::Complex * b::Complex
c.re == a.re * b.re - a.im * b.im
c.im == a.re * b.im + a.im * b.re

If our inputs are two SIMD vectors <a.re, a.im> and <b.re, b.im> and we want the output <c.re, c.im>, we'd have to do a lot of shuffling. For example, in pseudocode, we may vectorize it via:

are = a[1]
arevec = vbroadcast(are)
aim = a[2]
aimvec = vbroadcast(aim)
bimre = shuffle(b, [2,1])
c = arevec * b + <-1,1> * aimvec * bimre # elementwise ops

Here, we spent a lot of effort reorganizing the data so that we could put it in a format amenable for vectorization. We'd also need some more shuffling to extend this to vectors of length 4, 8, or 16. Instead of broadcasts, we'd probably either use gathers to load the correct pattern from where they're stored in memory, or we'd use a lot of shuffles.

If, on the other hand, we had a struct-of-arrays layout, and had vectors of width W, we could compute W answers in parallel via

c.re == a.re * b.re - a.im * b.im
c.im == a.re * b.im + a.im * b.re

where each c.re, b.im, etc are vectors of W different elements. We don't need any re-organization, and it scales naturally for arbitrary W.

When/if arrays of structs are supported, my plan is actually to use shuffles after loading them to put them in the above format, as I think that is much simpler than the first example (although it is obviously worse what not needing the shuffles in the fist place).

As an example more directly applicable to colors, if we were multiplying different colors by different constants

(c::Number, a::Linear3) = base_color_type(a)(c*comp1(a), c*comp2(a), c*comp3(a))

if we wanted to multiply multiple colors by multiple constants c, under the first approach we'd need to assemble vectors of [c1,c1,c1,c2,c2,c2,c3...], versus loading contiguous vectors [c1,c2,c3,c4...] and just performing three multiplications.

I don't yet know what sorts of algorithms and operations are done with these colors, but as long as we're capable of doing W or more in parallel (i.e., different iterations on images can be computed independently, so that we're allowed to perform them in parallel), that'll be the easiest route to efficient parallelization.

Do you need something more "dedicated" or does this solve the problem?

How are these operations expressed?
If someone writes kernels without abstractions directly on these H x W x 3 arrays, it solves the problem. I may have to double check that stridedpointer still works on these objects (strides and pointer would need to be defined), and if not define a few specialized methods for reshaped and reinterpreted arrays so that it works as expected.
This also means that there should not be any overhead, like the problems reinterpret caused post 1.0 (LoopVectorization accesses data through the stridedpointers, not the array objects).

However, if we want to support using the color abstractions, it's more complicated. That is if we want to support writing operations using colors, we'd need to

  1. Support holding SVec. I've been considering making the SVec objects a subtype of Number or Real, so I may be able to get things to work without changing ColorTypes. Otherwise, we may have to make it more general, or define a new type and ensure that multiple dispatch takes care of all the existing operations working correctly.
  2. Define a StructOfArrays pointer, that loads things correctly.
  3. Teach the cost modeling that vectors can occupy multiple registers.

If we're just working directly on Float32 arrays, this isn't necessary.

This prompts me to realize the above should probably be 4 layers of nesting: we probably need something to say to the compiler that dimension 3 of rgbchannels has size 3.

Manually unrolling loops across those 3 dimensions would likely be best if there's any interactions across them like there is in the complex number case.
Although if things can be expressed as loops of independent iterations across that axis, it would definitely help LoopVectorization to know that the loop is only 3 iterations (meaning maybestaticsize should be defined if that loop uses 1:size(A,3)), otherwise it might try unrolling it excessively.

Sorry to keep rethinking this, but for ImageFiltering's purposes we could just craft a TileBuffer that's a plain Array{Float32} with color in the last dimension, and define a custom copy operation to and from this type. That would take all the need for smarts out of LoopVectorization. The only disadvantage I can see is that you might need to allocate a similar temporary buffer for the result, rather than using the actual output buffer.

If I understand you correctly, that may be the most cache-friendly approach. Optimized matrix multiplication libraries do that sort of copying even when they don't need to apply any transformations, as it prevents TLB and cache misses.

If the allocation is a concern, you could define a global buffer (per thread) that is reused.

For most mathematics you really need to convert to Float32 anyway because of overflow; UInt8 isn't a very useful computational type,

Hmm. As a comment, the product of the largest 16-bit and 8-bit unsigned integers is still well within the range of integers exactly representable by Float32

julia> typemax(UInt16) * typemax(UInt8)
0xff01

julia> Float32(ans)
65281.0f0

julia> 1 << 23
8388608

julia> Float32(ans) - big(1) << 23
0.0

Floating point multiplication is much faster than integer multiplication, so I suspect it would be worth promoting to Float32 before multiplying by (Float32-versions of) the 16-bit integers, and the calculation will still be exact.

@stillyslalom
Copy link

Instead of specializing for RGB colors, how about a general interface for any isbits type convertible to NTuple{N, Union{Float32, Float64}? That covers a lot of use cases (complex numbers, color images, static vectors/arrays, fluid/solid mechanics tensors). Provide a tiled StructArray-ish type with elements stored along the last axis of the underlying array, ask users to provide convert methods between their types and the LoopVectorization tile type, allow destructuring in the iterator, and you get a nice syntax like

@avx for (i, (r, g, b)) in enumerate(image)
    image[i].r = min(g, b) / r
end

@kimikage
Copy link

I don't know if this is directly related, but I'm currently experimenting with new color types.
(cf. JuliaGraphics/ColorTypes.jl#114 (comment), https://github.com/kimikage/StorageColorTypes.jl)
The significant difference between those types and the types defined in ColorType.jl is that the encoding in memory and the eltype are different.
This can be a performance bottleneck, but the abstraction technique which decouples the data format on registers from the data format in memory might be useful (and annoying, though).

@chriselrod
Copy link
Member

This prompts me to realize the above should probably be 4 layers of nesting: we probably need something to say to the compiler that dimension 3 of rgbchannels has size 3. ReinterpretArray (a nice, generic solution to element-type changes) has had a large negative impact on performance compared to our old custom ColorView type. I bet we could fix that if the type system knew the size.

LoopVectorization should now support the reinterpret(reshape, ...) you added fairly well, gleaning the static size information for example (so that it'd know the leading dimension is 3), and use this to generate shuffle instructions for relatively efficient loading/storing.
This still isn't as efficient as not needing shuffles at all (e.g., with struct-of-arrays type layout, or making the 3 one of the later dimensions), but it's far better than gather/scatter.

@chriselrod
Copy link
Member

Instead of specializing for RGB colors, how about a general interface for any isbits type convertible to NTuple{N, Union{Float32, Float64}? That covers a lot of use cases (complex numbers, color images, static vectors/arrays, fluid/solid mechanics tensors). Provide a tiled StructArray-ish type with elements stored along the last axis of the underlying array, ask users to provide convert methods between their types and the LoopVectorization tile type, allow destructuring in the iterator, and you get a nice syntax like

@avx for (i, (r, g, b)) in enumerate(image)
    image[i].r = min(g, b) / r
end

That'd be a nice workaround until the switch to abstract interpreter (which is now #3 on the list of major changes I have planned for LoopVectorization), but we'd have to come up with a way to implement this that avoids world-age errors.

That is, the operation mappings have to be in the form of type information so that they can be passed into _avx_! and used, with all rules for how to interpret these mappings already being defined.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants