-
Notifications
You must be signed in to change notification settings - Fork 466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
T1 & DWT multithreading decoding optimizations #786
T1 & DWT multithreading decoding optimizations #786
Conversation
Allow these hot functions to be inlined. This boosts decode performance by ~10%.
We can avoid using a loop-up table with some shift arithmetics.
Add a opj_t1_dec_clnpass_step_only_if_flag_not_sig_visit() method that does the job of opj_t1_dec_clnpass_step_only() assuming the conditions are met. And use it in opj_t1_dec_clnpass(). The compiler generates more efficient code.
This is essentially used to shift inside the lut_ctxno_zc, which we can precompute at the beginning of opj_t1_decode_cblk() / opj_t1_encode_cblk()
Addition flag array such that colflags[1+0] is for state of col=0,row=0..3, colflags[1+1] for col=1, row=0..3, colflags[1+flags_stride] for col=0,row=4..7, ... This array avoids too much cache trashing when processing by 4 vertical samples as done in the various decoding steps.
… (of the non VSC case)
…qc_vsc() with loop unrolling
…ead pool to tcd level By default, only the main thread is used. If opj_codec_set_threads() is not used, but the OPJ_NUM_THREADS environment variable is set, its value will be used to initialize the number of threads. The value can be either an integer number, or "ALL_CPUS". If OPJ_NUM_THREADS is set and this function is called, this function will override the behaviour of the environment variable.
…et the time spent, and not to the total CPU time
{ | ||
opj_dwd_decode_h_job_t* job; | ||
|
||
job = (opj_dwd_decode_h_job_t*) opj_malloc(sizeof(opj_dwd_decode_h_job_t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allocation shall be checked for failure.
Lots of good things, unfortunately, I don't have much time to review this completely. |
opj_thread_pool_wait_completion(tp, 0); | ||
opj_free(job); | ||
opj_aligned_free(h.mem); | ||
return OPJ_FALSE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the jobs are submitted outside of this loop (i.e. in another one following), then it would be possible to fallback to single-thread in case of allocation error (& changing single-thread condition below the loop). It's very likely that if hitting an out of memory condition here, one will be raised later so it's quite arguable wether to do this or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, I now just got what you meant. Well, in modern OS, if malloc failures for such small structures happen you are in big trouble (swap trashing, etc...), so a clean error exit is probably good enough than a smarter fallback strategy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said, it was quite arguable. A clean error is probably good enough.
thread->thread_fn = thread_fn; | ||
thread->user_data = user_data; | ||
|
||
thread->hThread = CreateThread( NULL, 0, opj_thread_callback_adapter, thread, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When building with MSVC, _beginthreadex
shall be used instead of CreateThread. c.f. https://msdn.microsoft.com/en-us/library/windows/desktop/ms682453(v=vs.85).aspx remarks section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thread.c is a C port of the cpl_multiproc.cpp code used in GDAL. We have used CreateThread() for years without problem. That said we could switch to _beginthreadex() if you think it is worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be easy to check for _beginthreadex
availability with cmake and given the remark from MSDN, I think it should be used so that the library returns properly rather than crashing the whole process. I don't know if the threads are calling any function from the CRT as of now (probably not) but future evolutions might and we'll have forgotten this.
I think all review comments have now been addressed |
…ier1_optimizations_multithreading_2 Conflicts: src/lib/openjp2/t1.c
@detonin, is it possible to release v2.1.2 before merging this in something that would become v2.2.0 ? |
@mayeut I was about to say something like this ! However the root issue is still to have some kind of release branch where all CVE are fixed, which would simplify my life as a Debian package maintainer. And at the same time have git/master be the -next branch... |
@malaterre, what could be done is to merge current master branch in openjpeg-2.1 branch (openjpeg-2.1...master) while waiting for a release then merge this PR in master. @detonin, is that ok ? |
@mayeut Yes, this is the purpose of having created a specific branch for each minor version. Releases are actually never tagged in master but rather in their specific X.y branch. I was wondering if we should keep this way of doing given that in any case we do not have enough resources to maintain several versions of OpenJPEG and backport systematically bugfixes from Master to release branches. Unless @malaterre you could take care of this backport for potential security fixes after #786 merging ? |
@mayeut great suggestion ! I'll trim the thirdparty changes out of the merge and update this bug once I am done. People will go nuts if I incorporate such large change in PNG/LCMS in a minor release. |
@mayeut I am now happy with openjpeg-2.1 branch. Please go ahead and merge anything you want in git/master. Thanks everyone ! |
@malaterre thanks Mathieu for the merge, we'll proceed with the PR then |
@rouault Many thanks Even for these great optimizations that are now merged in master and will be released soon (as v2.2.0). I guess some of the still open PR will need to be updated before being able to merge them. |
Great ! Thanks. |
Fix warnings introduced by uclouvain#786
Fix warnings introduced by #786
This PR builds upon #783 and adds multithreading decoding optimizations. (The multithreading decoding optimizations are independant from the improvements of #783 but as they touch both t1.c , it was more convenient to build upon)
The main components of this PR are :
Benchmarking:
This has been tested with the following files :
C1: issue135.j2k (fom openjpeg-data, code blocks 32x32)
C2: Bretagne2.j2k (fom openjpeg-data, code blocks 32x32)
C3: 20160307_125117_0c74.jp2 (non public test file, 3 bands, 12 bits, 6600x2200 for band 1, 3300x2200 for bands 2 and 3, code blocks 64x64)
C4: issue135_vsc.jp2 ( issue135.j2k recoded by opj_compress -M 8, code blocks 64x64)
C5: issue135_raw.jp2 ( issue135.j2k recoded by opj_compress -M 1, code blocks 64x64)
C6: S2A_OPER_MSI_L1C_TL_MTI__20150819T171650_A000763_T30SWE_B05.jp2 (Sentinel 2 tile, 5490x5490, 1 band, 12 bits, code blocks 64x64)
Builds done with -DCMAKE_BUILD_TYPE=Release. Times measured are the smallest time of 2 consecutive runs reported by "OPJ_NUM_THREADS=4 opj_decompress -i $(INPUT_FILE) -o /tmp/out.ppm" in the "decode time: XXX ms" line
Machine & OS spec: Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz, Linux 64 bit
Before = state of PR #783
After = state of this PR
This work has been funded by Planet Labs