
Commit

Add Intel ICL (Classic 19.2) and ICX (NextGen 2021) configuration to solution

ICX is Intel's new LLVM-based compiler; SSE 4.2 is set as the minimum.
ICL is Intel's classic 19.2 C++ compiler.
pinterf committed Nov 8, 2021
1 parent 734c857 commit d8bdff7
Showing 4 changed files with 1,232 additions and 0 deletions.

21 comments on commit d8bdff7

@DTL2020 (Contributor) commented on d8bdff7, Nov 26, 2021

Issue found in the sources at this commit point: x264_pixel_sad_16x16_sse2() and x264_pixel_sad_16x16_sse3() crash if ALIGN_SOURCEBLOCK = 1 in MVInterface.h (aligned copy disabled), blocksize = 16x16, and the default padding of 8. If the padding is increased to 16 (in MSuper), the crash does not happen. It looks like they are not compatible with a non-aligned source block? But the non-aligned path runs faster on newer CPUs (roughly AVX and later), so maybe the default padding should be increased when blocksize > 8?

Test script:

```
LoadPlugin("mvtools2.dll")

ColorBars(1920,1080)
ConvertToYV12()

tr=12
super=MSuper(last, chroma=false, mt=false, pel=1)
multi_vec=MAnalyse(super, multi=true, blksize=16, delta=tr, chroma=false)
MDegrainN(last, super, multi_vec, tr, thSAD=175, thSAD2=160)
```

@pinterf (Owner, Author):

Strange, they contain movu instructions, not mova (if I'm understanding the asm source right - it is hard to read, hand-crafted with lots of asm macros, and you have to visually identify where e.g. x264_pixel_sad_16x16_sse2 ends up). What does the disassembly show?

@pinterf (Owner, Author) commented on d8bdff7, Nov 26, 2021

Checked. This is the problem:
psadbw xmm0, xmmword ptr [rcx]
rcx was only 8-byte aligned.
This macro element is used in many places, probably in the 32, 48 and 64 byte blocksizes as well.
EDIT:
All blocksizes of 16 and up are affected. Building blocks:
PROCESS_SAD_12x4, PROCESS_SAD_16x4, PROCESS_SAD_24x4, PROCESS_SAD_24x2, PROCESS_SAD_32x4, PROCESS_SAD_48x4, PROCESS_SAD_64x4, plus the macro elements in the specialized SAD_W16 (width=16).
So nearly everything. I haven't even looked into the 10 and 16 bit SAD routines.
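For reference, a minimal illustration of the unaligned-safe pattern in intrinsics (not the actual x264 asm fix; `pref` and `src16` are placeholder names). The legacy-SSE encoding of psadbw with a 128-bit memory operand requires 16-byte alignment, which is exactly what faults here; an explicit unaligned load avoids it:

```cpp
// Illustration only. The legacy-SSE form
//   psadbw xmm0, xmmword ptr [rcx]
// faults when rcx is not 16-byte aligned; loading through movdqu first does not:
__m128i ref16 = _mm_loadu_si128((const __m128i*)pref); // movdqu: no alignment requirement
__m128i sad   = _mm_sad_epu8(src16, ref16);            // psadbw xmm, xmm: register operands only
// (the VEX encoding, vpsadbw xmm0, xmm0, [rcx], also tolerates an unaligned memory operand)
```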

@DTL2020 (Contributor) commented on d8bdff7, Nov 26, 2021

It is strange, because psadbw should not have alignment requirements. I think it reads outside the buffer? The Sad_C() function works OK, but it does not read outside the buffers. The intrinsic _mm_sad_epu8 also does not show any alignment requirement: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2&ig_expand=481,6894,5517,5517,5406,5994&text=sadbw

@pinterf (Owner, Author):


Probably it needs alignment.
Btw. have you tried mpsadbw? _mm_mpsadbw_epu8
https://www.felixcloutier.com/x86/mpsadbw

@DTL2020 (Contributor) commented on d8bdff7, Nov 26, 2021

About mpsadbw - I still have not looked at it. It may be interesting to try, but first I need to understand better how it works, in a debugger/simulator. Maybe it can make the H-scan in Exa searches faster.

With the addition of optPredictorType=4, MAnalyse finally runs faster than MDegrainN. It is not the best-quality option, but it helps to test MDegrainN performance better. So I tried adding a 64-byte non-temporal store to the 8x8 and 16x16 block processing in MDegrainN. It works a bit faster, but it looks like it would be good to redo the ref block reading in some way as well - maybe that can increase speed too.

@DTL2020 (Contributor) commented on d8bdff7, Nov 27, 2021

Yes - mpsadbw should be faster compared with sad+shifts. Unfortunately it only operates on 4-byte src blocks and there is no way to switch to 8 bytes, so it doubles the number of instructions required to process 8x8 and 16x16 blocks. But it allows many search block sizes with 4-byte granularity. I will try to make 8x8_sp1 and 8x8_sp2 functions for PlaneOfBlocks_avx2 to test for speed. If that is good, all functions in PlaneOfBlocks_avx2 would need to be redesigned around mpsadbw (and maybe more added, up to 16x16 blocks). As I see it, the mostly used sp is 1 and 2 (the default values for all levels except level 0 with sp1), and the need for sp3 and sp4 may be very low. mpsadbw also allows sp3 (and the sp3.5 of -4..+3 that is currently labelled sp4). A sketch of the two-call combination for an 8-byte source row is below.
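A minimal sketch of that combination (placeholder names, assuming SSE4.1): each _mm_mpsadbw_epu8 handles only a 4-byte src group, so two calls with different immediates are summed to get the SAD of an 8-byte src row at 8 consecutive horizontal positions.

```cpp
#include <smmintrin.h> // SSE4.1
#include <stdint.h>

// imm 0 -> src bytes 0..3 vs ref bytes j..j+3, imm 5 -> src bytes 4..7 vs ref bytes j+4..j+7,
// so summing the two gives the 8-byte SAD at candidate offsets j = 0..7.
static inline __m128i sad8_row_8pos(const uint8_t* src_row, const uint8_t* ref_row) {
  __m128i src = _mm_loadu_si128((const __m128i*)src_row); // only bytes 0..7 are used
  __m128i ref = _mm_loadu_si128((const __m128i*)ref_row); // bytes 0..14 are used
  return _mm_add_epi16(_mm_mpsadbw_epu8(ref, src, 0),
                       _mm_mpsadbw_epu8(ref, src, 5));    // eight 16-bit partial SADs
}
```

Accumulating this over the 8 rows of a block gives the 8x8 SADs of 8 horizontal candidates in one __m128i, which is the kind of H-scan speedup mentioned above.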

AVX512 has more SAD instructions, like dbsad_epu8, which may allow processing more 8-byte sequences in one instruction.

@DTL2020 (Contributor) commented on d8bdff7, Nov 27, 2021

Added an sp1 Exa search version with mpsadbw in this commit - DTL2020@03252b3. It really looks shorter. Also, using a define and a macro makes the program more compact. I cannot test it for speed until the middle of next week when I am at work; under SDE it looks like it is working. It also does the vertical shift inside the register file - I hope the compiler understood the old versions and did the same thing, though I have not looked at the asm much. So this is also a point to test for speed: does the compiler treat a shifted load from memory as shifting the ref rows, or is shifting inside the register file manually (with permute2x128) faster or slower?

We got a new CPU, roughly an i5-11500, near my workplace, and CPU-Z shows it has AVX512F - I am thinking of trying an MDegrainN block processing function with gather/scatter instructions for the full block load/store. AVX2 only has gather instructions; AVX512 has both gather and scatter.

@DTL2020 (Contributor) commented on d8bdff7, Dec 2, 2021

Found a simple and useful logic optimization: DTL2020@fec4e66

Remember already-checked vectors (x,y coordinates) and do not check them again. It saves lots of calls to SAD() (which is hard to optimize with SIMD), because many predictors may be equal (zero, global, median, the 4 additional ones).
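A minimal sketch of the idea (hypothetical names, not the actual commit code):

```cpp
// Keep the (x, y) of predictors already scored and skip repeated SAD() calls.
struct CheckedMV { int x, y; };

static bool AlreadyChecked(const CheckedMV* checked, int count, int x, int y) {
  for (int i = 0; i < count; ++i)
    if (checked[i].x == x && checked[i].y == y) return true;
  return false;
}

// usage inside the predictor loop (sketch):
//   if (!AlreadyChecked(checked, nChecked, pred.x, pred.y)) {
//     sad = SAD(pSrc, nSrcPitch, GetRefBlock(pred.x, pred.y), nRefPitch);
//     checked[nChecked++] = { pred.x, pred.y };
//   }
```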

@pinterf (Owner, Author) commented on d8bdff7, Dec 3, 2021

That is a new and interesting idea.
I see that in if (0 != optSearchOption && nPel == 1 & avx2) you made it again only for avx2. So is no C version maintained for them any more? Or does a C reference exist but is better not called for speed reasons?

@DTL2020 (Contributor) commented on d8bdff7, Dec 3, 2021

"I see in if (0 != optSearchOption && nPel == 1 & avx2) you made it again only for avx2. So no C version is maintained them any more? Or C reference exists but is better not called because of speed reasons?"

C versions exist and work as a reference. But on non-AVX2 CPUs, ExpandingSearch is possibly much faster compared with the ExhaustiveSearch C-ref (on some sources like a motionless ColorBars()), because ExpandingSearch uses the MotionDistortion check to skip some of the SAD() calls. If the ExhaustiveSearch C-ref were rewritten to use the MotionDistortion check, it would no longer be equal to the AVX2 functions. If the MotionDistortion check were added to the AVX2 functions (it is possible), they would be slower.

If the user has an AVX2 CPU, they may limit CPUMax to non-AVX2 and check the C-ref output. It also turned out that the C-ref functions work for any block size, so there are not many of them - mostly for sp1, sp2 (and the maybe rarely used sp3 and sp4).

@pinterf (Owner, Author) commented on d8bdff7, Dec 3, 2021

Great. Just to be sure. (I'm still following the changes and like them, but I'm not pulling them yet until they reach a point where some parts have been untouched for a few weeks :) )

@pinterf (Owner, Author) commented on d8bdff7, Dec 3, 2021

I'm looking forward to the avx512 possibilities though. I have had only one project with it so far:
https://github.com/pinterf/FluxSmooth

@DTL2020 (Contributor) commented on d8bdff7, Dec 3, 2021

It is planned to add multi-block processing (up to 4/8 8x8 blocks for AVX2 and up to 16/32 8x8 blocks for AVX512) - SO3 and SO4 (to MAnalyse and to MDegrainN) - so it will take more time, maybe several weeks. Currently the single-block AVX512 processing shows about equal performance to AVX2, even for the 16x16 block size. Maybe the overhead of the strided loading of 512-bit vectors from 64/128-bit lines is too high. So I hope multi-block processing, loading several lines of neighbouring blocks into 512-bit vectors, will be faster.

@pinterf (Owner, Author) commented on d8bdff7, Dec 4, 2021

I expect no gain in speed. The memory must be fetched one way or another, and the basic SAD operations on already-loaded items can usually be overlapped with the next memory loads, so they are basically free.

A zmm is too big to handle small 8-byte blocks.
So the compiler's tricky decision is how the fragmented ref is loaded (in this example I exchanged src and ref; the latter is the variable element, the former can be loaded once, because I set its stride to exactly 8).

Example #1
MSVC uses _mm_insert into XMM registers, then the xmms are inserted into a ymm, then they form a zmm.
Then comes the SAD.

https://godbolt.org/z/j3jscc1f1

The Intel compilers, both with -mavx512f -mavx512bw -O3 (-Ot) optimization.
I recommend playing with the options as well.

Intel ICX (new Clang based):
https://godbolt.org/z/xY5baad7s
This approach loads both SAD operands into zmm registers.

Intel ICC (Classic):
https://godbolt.org/z/EPa75qxKd
Interesting: it re-stores the ref into a consecutive 64-byte array, then does the SAD against that reordered memory buffer.
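For context, a minimal sketch of the kind of kernel behind these godbolt links (not the exact source; names are placeholders): contiguous src with stride forced to 8, strided ref rows gathered somehow, one vpsadbw over a zmm pair, then a horizontal sum.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// How the eight strided 8-byte ref rows get materialized into a zmm is exactly
// where MSVC, ICX and ICC differ; the memcpy loop just expresses the intent.
static inline int sad_8x8_avx512(const uint8_t* src, const uint8_t* ref, int ref_stride) {
  __m512i zmm_src = _mm512_loadu_si512((const void*)src);   // 64 contiguous src bytes
  uint64_t rows[8];
  for (int y = 0; y < 8; ++y)
    std::memcpy(&rows[y], ref + y * ref_stride, 8);          // gather strided ref rows
  __m512i zmm_ref = _mm512_loadu_si512((const void*)rows);
  // vpsadbw zmm needs AVX512BW; the reduce below is a helper sequence, not one instruction
  return (int)_mm512_reduce_add_epi64(_mm512_sad_epu8(zmm_src, zmm_ref));
}
```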

@pinterf (Owner, Author) commented on d8bdff7, Dec 4, 2021

(_mm512_reduce_add_epi64 is a sequence of instructions, not a single base instruction.)

@DTL2020 (Contributor) commented on d8bdff7, Dec 4, 2021

Yes - reduce is a macro. It is easy for the programmer to use but may not be the best in speed, though it leaves more room for future compilers to optimize it on current and future chips. Also, if it becomes popular, it may be implemented as a real hardware-accelerated instruction in an update of the AVX512 instruction sets in future chips. It is already a good sign that it exists in the intrinsics list.

I am thinking of testing a handcrafted version of the reduction for the 8x8 SAD, something like this:

```cpp
// zmm_src / zmm_ref are assumed to already hold the 8x8 src and ref blocks, one row per qword.

// Variant 1: library reduce helper (a sequence of instructions, not a single one).
int isad1 = (int)_mm512_reduce_add_epi64(_mm512_sad_epu8(zmm_src, zmm_ref));

// Variant 2: handcrafted reduction (the byte permute needs AVX512VBMI).
__m512i zmm_zero = _mm512_setzero_si512();
__m256i ymm_shuf = _mm256_set_epi8(128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128,
	128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 9, 8, 128);
__m512i zmm_idx_perm01 = _mm512_set_epi8(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	57,49,41,33,25,17,9,1,56,48,40,32,24,16,8,0);

__m512i zmm_sad8 = _mm512_sad_epu8(zmm_src, zmm_ref);                  // eight per-row SADs, one per qword
__m512i zmm_hilo8 = _mm512_permutexvar_epi8(zmm_idx_perm01, zmm_sad8); // low bytes of the 8 SADs -> qword 0, second bytes -> qword 1
zmm_hilo8 = _mm512_sad_epu8(zmm_hilo8, zmm_zero);                      // sum those byte groups
__m256i ymm_shift_hi16 = _mm256_shuffle_epi8(_mm512_castsi512_si256(zmm_hilo8), ymm_shuf); // high-byte sum shifted up by 8 bits (x256)
int isad2 = _mm_cvtsi128_si32(_mm256_castsi256_si128(_mm256_adds_epu16(ymm_shift_hi16, _mm512_castsi512_si256(zmm_hilo8))));
```

@DTL2020 (Contributor) commented on d8bdff7, Dec 4, 2021

About the MDegrainN memory scan order: I ran a test of the current version with rectangular blocks, 64x16 and 16x64. The fps:
64x16 - 13.36
16x64 - 8.29

Unfortunately no thinner rects like 2x64 and 64x2 are available now.

A block 64 tall uses much more SDRAM row switching, so I think it is slower with memory. So I plan to make the MDegrainN function (at first for selected block sizes like the typically used 8x8 and 16x16) as follows:

  1. For each row of blocks: CreateBlocksRowMDegrainN_Program, which prepares the vector shifts and block weights and marks sequences of coherent vectors (for >1 block processing).
  2. ExecuteBlocksRowMDegrain_Program, which processes each line of source to each line of destination (gathering the lines of the refs as needed), so that it uses as few activated SDRAM rows as possible. With whole-block-height processing, and assuming almost every line of every block of an HD/UHD frame now sits in a separate SDRAM row, it requires gathering and storing data across BlockHeight*(src + 2*tr + dst) SDRAM rows; processing line by line should decrease that to only (src + 2*tr + dst) rows (e.g. with BlockHeight = 16 and tr = 12, roughly 16*(1 + 24 + 1) = 416 rows versus 26).

Today's approach of processing a single whole-height block does not look memory-friendly. Of course a CPU with a large cache tries to prefetch some data from lots of activated SDRAM rows for future work, but maybe not everything that is needed.

@DTL2020 (Contributor) commented on d8bdff7, Dec 4, 2021

"zmm is too big to handle small 8 byte blocks."

As CPUs typically can execute simple enough instructions on > 1 ports at once (I hope _sad() instruction too) - it may be faster to load halfs of 8x8 block to 2 ymm and add result at the end of processing. It is a thing to test sometime too.
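A rough sketch of that half-split (assuming the two 256-bit halves of the src and ref blocks are already loaded; the names are placeholders):

```cpp
// Rows 0..3 and 4..7 of the 8x8 block handled by two independent vpsadbw ymm ops,
// which the CPU can potentially issue on different ports.
__m256i sad_lo = _mm256_sad_epu8(ymm_src_lo, ymm_ref_lo); // 4 qword partial sums
__m256i sad_hi = _mm256_sad_epu8(ymm_src_hi, ymm_ref_hi);
__m256i sad    = _mm256_add_epi64(sad_lo, sad_hi);        // still needs a final horizontal add
```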

Regarding

sad(uint8_t* src, uint8_t* ref, int stride) {
const int stride_src = 8;

It is not quite fair to show the compiler const stride_src = 8 - it means all the src data can be loaded from memory with a single 512-bit load and no gathering work is needed, so a good compiler will collapse it into one 512-bit load instruction. Unfortunately, in real life the lines of the src block are scattered, even across different SDRAM rows, with a large stride, so the CPU will probably gather them from different cache lines.

The only reason it gathers instead of doing a 512-bit load would be if the indexes of the 8-byte blocks were set in reverse order (which does not matter for the 8x8 SAD operation) but the compiler thought the order in the zmm register was important. Even then it could do one 512-bit load plus one permute instruction to get that non-memory order of the 8-byte blocks.

"makes the SAD against that reordered memory buffer"

I think about using ref data directly as memory operand because it is typically not reused. But not sure if compiler of intrinsics will understand it. It typically no difference between variable in register file or memory in C. And intrinsics do not have method of selecting memory operand where applicable ? Need to look about this question over internet.

@DTL2020 (Contributor) commented on d8bdff7, Dec 4, 2021

A possible optimization for 8x8 (and some other block sizes): in the PseudoEPZ search, simply load the src block once into a zmm register (a const variable) and use it many times (in all the predictor SAD() checks). Or load the halves of the block into 2 ymm registers. If the compiler is smart and optimizing, it will reuse this register and not gather the source block each time. That would save about half of the gather instructions on each SAD() check. I will try to make it for testing. It should also be possible for AVX2, I think. We only need separate SAD() functions for the PseudoEPZ() function that take the source block as an argument in a register and do not load it at all.

Currently, since SAD() is a non-inlined selectable function, it looks like the compiler cannot reuse the once-loaded src block across many SAD() executions.
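A hedged sketch of what such a separate SAD() variant could look like (hypothetical signature, not existing mvtools code):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// SAD of an 8x8 block where the caller (PseudoEPZ) loaded the src block into a
// register once and reuses it for every predictor check; only ref is gathered per call.
static inline int Sad8x8_PreloadedSrc(__m512i zmm_src, const uint8_t* pRef, int nRefPitch) {
  uint64_t rows[8];
  for (int y = 0; y < 8; ++y)
    std::memcpy(&rows[y], pRef + y * nRefPitch, 8);
  __m512i zmm_ref = _mm512_loadu_si512((const void*)rows);
  return (int)_mm512_reduce_add_epi64(_mm512_sad_epu8(zmm_src, zmm_ref));
}
```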

@DTL2020 (Contributor) commented on d8bdff7, Dec 8, 2021

Heh - one more nice idea: the zero and global predictors are completely equal (coherent) over the whole frame. So the zero and global predictors can be checked with multi-block SAD-check AVX2/AVX512 functions that do not require gathering each block's rows: they can fill a full 256/512-bit register with the rows of several blocks in a single load instruction and check SAD against the memory/cache row as a memory operand. And this processing may be performed for a whole row of blocks, storing the result in a temporary (or output) structure.
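A minimal sketch of the multi-block zero-predictor check (placeholder names, assuming AVX2 and the usual intrinsics headers), exploiting that src and ref rows are both contiguous when the MV is (0,0):

```cpp
// Zero-MV SAD for 4 horizontally adjacent 8x8 blocks in one pass: every 256-bit
// load covers one row of all 4 blocks, and vpsadbw produces one qword sum per block.
__m256i acc = _mm256_setzero_si256();
for (int y = 0; y < 8; ++y) {
  __m256i s = _mm256_loadu_si256((const __m256i*)(pSrc + y * nSrcPitch));
  __m256i r = _mm256_loadu_si256((const __m256i*)(pRef + y * nRefPitch));
  acc = _mm256_add_epi64(acc, _mm256_sad_epu8(s, r));
}
// the 4 qwords of acc are the zero-predictor SADs of the 4 blocks
```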

Unfortunately there is a possible issue with intrinsics - there is no way to force the compiler to use a memory operand, so checking the executable's disassembly is required - https://stackoverflow.com/questions/70261138/force-compiler-to-use-memory-operand-from-intrinsics/70261394#70261394 . So different compilers may give more or less speedy results, and it is good to check which is best.

Also, as I see in that godbolt tool, the MSVC compiler does not optimize lazy-programmer addressing like
https://github.com/DTL2020/mvtools/blob/e1e9ff5c3f25a43d10d4b8e5ac0a517221c38ef1/Sources/PlaneOfBlocks_avx2.cpp#L1560
It looks like it is better to calculate the pointer separately and only advance it by the Pitch value at each row; the Intel compiler appears to do this by itself. A sketch of the difference is below.
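A small sketch of the difference (generic names, not the linked line verbatim):

```cpp
// "lazy" addressing: a multiply hidden in every row iteration
// for (int y = 0; y < 8; ++y)
//   use(pRef + (by + y) * nRefPitch + bx);

// pointer advanced by the pitch instead: the base address is computed once per block
const uint8_t* p = pRef + by * nRefPitch + bx;
for (int y = 0; y < 8; ++y) {
  // use(p);
  p += nRefPitch;  // advance one row, no per-row multiply
}
```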
