Subtile decoding: memory use reduction and perf improvements #1010

Merged (29 commits) on Sep 5, 2017

@rouault rouault commented Sep 1, 2017

The gist of this PR is commit f9e9942:

Only allocate a tile component buffer of the needed dimensions, instead of the full tile size.
* Use a sparse array mechanism to store code-blocks and intermediate stages of
  IDWT.
* IDWT, DC level shift and MCT stages are done just on that smaller array.
* Improve copy of tile component array to final image, by avoiding an intermediate
  buffer.
* For full-tile decoding at reduced resolution, only allocate the tile buffer to
  the reduced size, instead of the full-resolution size.
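
To illustrate the sparse array idea in the list above, here is a minimal sketch in C. It is not the code from this PR: the type sparse_i32 and the functions sparse_create, sparse_set, sparse_get, sparse_allocated_blocks and sparse_free are hypothetical names chosen for this example. The sketch assumes fixed-size blocks that are allocated lazily on first write, with never-written blocks reading back as zero.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical block-sparse 2D array of int32 samples (illustration only).
 * A block is allocated on first write; unwritten blocks read back as 0. */
typedef struct {
    uint32_t width, height;      /* logical size, in samples */
    uint32_t block_w, block_h;   /* block size, in samples */
    uint32_t blocks_x, blocks_y; /* number of blocks per row and per column */
    int32_t **blocks;            /* blocks_x * blocks_y entries, NULL = empty */
} sparse_i32;

sparse_i32 *sparse_create(uint32_t width, uint32_t height,
                          uint32_t block_w, uint32_t block_h)
{
    sparse_i32 *sa;
    assert(width > 0 && height > 0 && block_w > 0 && block_h > 0);
    sa = calloc(1, sizeof(*sa));
    if (sa == NULL) {
        return NULL;
    }
    sa->width = width;
    sa->height = height;
    sa->block_w = block_w;
    sa->block_h = block_h;
    /* Ceiling division written so that it cannot overflow uint32_t. */
    sa->blocks_x = width / block_w + (width % block_w != 0);
    sa->blocks_y = height / block_h + (height % block_h != 0);
    /* Only the table of block pointers is allocated up front. */
    sa->blocks = calloc((size_t)sa->blocks_x * sa->blocks_y, sizeof(int32_t *));
    if (sa->blocks == NULL) {
        free(sa);
        return NULL;
    }
    return sa;
}

void sparse_free(sparse_i32 *sa)
{
    size_t i, n;
    if (sa == NULL) {
        return;
    }
    n = (size_t)sa->blocks_x * sa->blocks_y;
    for (i = 0; i < n; i++) {
        free(sa->blocks[i]);
    }
    free(sa->blocks);
    free(sa);
}

/* Store one sample, allocating its block on demand. Returns 0 on failure. */
int sparse_set(sparse_i32 *sa, uint32_t x, uint32_t y, int32_t value)
{
    size_t idx;
    assert(sa != NULL && x < sa->width && y < sa->height);
    idx = (size_t)(y / sa->block_h) * sa->blocks_x + x / sa->block_w;
    if (sa->blocks[idx] == NULL) {
        sa->blocks[idx] = calloc((size_t)sa->block_w * sa->block_h,
                                 sizeof(int32_t));
        if (sa->blocks[idx] == NULL) {
            return 0;
        }
    }
    sa->blocks[idx][(size_t)(y % sa->block_h) * sa->block_w
                    + x % sa->block_w] = value;
    return 1;
}

/* Read one sample; samples in unallocated blocks are reported as 0. */
int32_t sparse_get(const sparse_i32 *sa, uint32_t x, uint32_t y)
{
    const int32_t *blk;
    assert(sa != NULL && x < sa->width && y < sa->height);
    blk = sa->blocks[(size_t)(y / sa->block_h) * sa->blocks_x + x / sa->block_w];
    return blk ? blk[(size_t)(y % sa->block_h) * sa->block_w
                     + x % sa->block_w] : 0;
}

/* Count the blocks that are actually backed by memory. */
size_t sparse_allocated_blocks(const sparse_i32 *sa)
{
    size_t i, n = (size_t)sa->blocks_x * sa->blocks_y, count = 0;
    for (i = 0; i < n; i++) {
        count += (sa->blocks[i] != NULL);
    }
    return count;
}
```

With such a layout, writing samples only inside a small window touches only the blocks that overlap that window, so memory use follows the window size rather than the tile size.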

The effect is that the decoding time of "opj_decompress -i MAPA.jp2 -o out.tif -d 0,0,256,256" drops from 900 ms to 190 ms, and RAM allocation drops from 2.27 GB to 265 MB (220 MB of which is the ingestion of the codestream).

master:

n5: 2270897801 (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
 n1: 1610689344 0x4E7B847: opj_aligned_malloc (opj_malloc.c:61)
  n1: 1610689344 0x4E7523B: opj_alloc_tile_component_data (tcd.c:681)
   n1: 1610689344 0x4E75C5F: opj_tcd_init_decode_tile (tcd.c:822)
    n1: 1610689344 0x4E4E13A: opj_j2k_read_tile_header (j2k.c:8748)
     n1: 1610689344 0x4E4EB46: opj_j2k_decode_tiles (j2k.c:10573)
      n1: 1610689344 0x4E50726: opj_j2k_decode (j2k.c:7979)
       n1: 1610689344 0x4E555D2: opj_jp2_decode (jp2.c:1606)
        n0: 1610689344 0x403912: main (opj_decompress.c:1496)
 n1: 402672336 0x4E4EA3D: opj_j2k_decode_tiles (j2k.c:10591)
  n1: 402672336 0x4E50726: opj_j2k_decode (j2k.c:7979)
   n1: 402672336 0x4E555D2: opj_jp2_decode (jp2.c:1606)
    n0: 402672336 0x403912: main (opj_decompress.c:1496)
 n1: 219758393 0x4E4E628: opj_j2k_read_tile_header (j2k.c:4750)
  n1: 219758393 0x4E4EB46: opj_j2k_decode_tiles (j2k.c:10573)
   n1: 219758393 0x4E50726: opj_j2k_decode (j2k.c:7979)
    n1: 219758393 0x4E555D2: opj_jp2_decode (jp2.c:1606)
     n0: 219758393 0x403912: main (opj_decompress.c:1496)
 n1: 23893200 0x4E75CC5: opj_tcd_init_decode_tile (tcd.c:1235)
  n1: 23893200 0x4E4E13A: opj_j2k_read_tile_header (j2k.c:8748)
   n1: 23893200 0x4E4EB46: opj_j2k_decode_tiles (j2k.c:10573)
    n1: 23893200 0x4E50726: opj_j2k_decode (j2k.c:7979)
     n1: 23893200 0x4E555D2: opj_jp2_decode (jp2.c:1606)
      n0: 23893200 0x403912: main (opj_decompress.c:1496)
 n0: 13884528 in 51 places, all below massif's threshold (1.00%)

With this PR:

n6: 265552513 (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
 n1: 219758393 0x4E4F23D: opj_j2k_read_tile_header (j2k.c:4741)
  n1: 219758393 0x4E4F79E: opj_j2k_decode_tiles (j2k.c:10529)
   n1: 219758393 0x4E512C6: opj_j2k_decode (j2k.c:8003)
    n1: 219758393 0x4E56172: opj_jp2_decode (jp2.c:1606)
     n0: 219758393 0x403912: main (opj_decompress.c:1496)
 n1: 23893200 0x4E768DD: opj_tcd_init_decode_tile (tcd.c:1248)
  n1: 23893200 0x4E4ED6A: opj_j2k_read_tile_header (j2k.c:8772)
   n1: 23893200 0x4E4F79E: opj_j2k_decode_tiles (j2k.c:10529)
    n1: 23893200 0x4E512C6: opj_j2k_decode (j2k.c:8003)
     n1: 23893200 0x4E56172: opj_jp2_decode (jp2.c:1606)
      n0: 23893200 0x403912: main (opj_decompress.c:1496)
 n1: 7167960 0x4E76AFC: opj_tcd_init_decode_tile (tcd.c:1064)
  n1: 7167960 0x4E4ED6A: opj_j2k_read_tile_header (j2k.c:8772)
   n1: 7167960 0x4E4F79E: opj_j2k_decode_tiles (j2k.c:10529)
    n1: 7167960 0x4E512C6: opj_j2k_decode (j2k.c:8003)
     n1: 7167960 0x4E56172: opj_jp2_decode (jp2.c:1606)
      n0: 7167960 0x403912: main (opj_decompress.c:1496)
 n2: 6419232 0x4E7C2AD: opj_tgt_create (tgt.c:89)
  n1: 3209616 0x4E76A78: opj_tcd_init_decode_tile (tcd.c:1096)
   n1: 3209616 0x4E4ED6A: opj_j2k_read_tile_header (j2k.c:8772)
    n1: 3209616 0x4E4F79E: opj_j2k_decode_tiles (j2k.c:10529)
     n1: 3209616 0x4E512C6: opj_j2k_decode (j2k.c:8003)
      n1: 3209616 0x4E56172: opj_jp2_decode (jp2.c:1606)
       n0: 3209616 0x403912: main (opj_decompress.c:1496)
  n1: 3209616 0x4E76A93: opj_tcd_init_decode_tile (tcd.c:1104)
   n1: 3209616 0x4E4ED6A: opj_j2k_read_tile_header (j2k.c:8772)
    n1: 3209616 0x4E4F79E: opj_j2k_decode_tiles (j2k.c:10529)
     n1: 3209616 0x4E512C6: opj_j2k_decode (j2k.c:8003)
      n1: 3209616 0x4E56172: opj_jp2_decode (jp2.c:1606)
       n0: 3209616 0x403912: main (opj_decompress.c:1496)
 n0: 5097712 in 59 places, all below massif's threshold (1.00%)
 n1: 3216016 0x4E7CA37: opj_aligned_malloc (opj_malloc.c:61)
  n0: 3216016 in 4 places, all below massif's threshold (1.00%)

Similarly, with an image compressed with the 9x7 wavelet, "opj_decompress -i MAPA_97.jp2 -o out.tif -d 0,0,256,256" goes from 1500 ms to 180 ms.

Another significant commit of this PR is 0ae3cba:

Allow several repeated calls to opj_set_decode_area() and opj_decode() for single-tiled images

* Only works for single-tiled images; other cases error out cleanly, as they
  do currently.
* Save re-reading the codestream for the tile, and re-use code-blocks of the
  previous decoding pass.
* Future improvements might involve improving opj_decompress, and the image writing logic,
  to use this strategy.

The test_decode_area utility can now decode images of more than 4 gigapixels by proceeding in strips. For example, the following decodes the first 3072 lines of a 66000x66000 image in chunks of (at most) 1200 lines at a time:

$ time bin/test_decode_area ../66000x66000_lossless.j2k  -strip_height 1200 0 0 66000 3072
Decoding 0...1200
Decoding 1200...2400
Decoding 2400...3072

real	0m3.488s
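
The strip-by-strip pattern used above can be sketched as a small loop in C. This is an illustration, not the source of test_decode_area: decode_by_strips, decode_window_fn and log_strip are hypothetical names, and the callback here only logs each strip. In a real decoder the callback would presumably call opj_set_decode_area() followed by opj_decode() on the same codec for each strip.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical callback: decode columns [x0, x1) of rows [y0, y1).
 * Returns nonzero on success. */
typedef int (*decode_window_fn)(uint32_t x0, uint32_t y0,
                                uint32_t x1, uint32_t y1, void *user);

/* Walk the window [x0, x1) x [y0, y1) in strips of at most strip_height rows.
 * Returns the number of strips decoded, or -1 on the first failure. */
int decode_by_strips(uint32_t x0, uint32_t y0, uint32_t x1, uint32_t y1,
                     uint32_t strip_height, decode_window_fn decode,
                     void *user)
{
    int count = 0;
    uint32_t y = y0;
    assert(x0 < x1 && y0 < y1 && strip_height > 0 && decode != NULL);
    while (y < y1) {
        /* Clamp the last strip to the end of the window. */
        uint32_t remaining = y1 - y;
        uint32_t y_end = y + (remaining < strip_height ? remaining
                                                       : strip_height);
        if (!decode(x0, y, x1, y_end, user)) {
            return -1;
        }
        y = y_end;
        count++;
    }
    return count;
}

/* Example callback: logs the strip and accumulates the number of rows seen
 * in the uint64_t pointed to by user. It performs no actual decoding. */
int log_strip(uint32_t x0, uint32_t y0, uint32_t x1, uint32_t y1, void *user)
{
    uint64_t *rows = user;
    (void)x0;
    (void)x1;
    printf("Decoding %u...%u\n", (unsigned)y0, (unsigned)y1);
    *rows += y1 - y0;
    return 1;
}
```

Calling decode_by_strips(0, 0, 66000, 3072, 1200, log_strip, &rows) visits the same three row ranges as the run shown above.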

The memory consumption reported by valgrind --tool=massif is 2.2 GB (1 GB with strips of 256 lines, 770 MB with strips of 64 lines). That still seems a bit high, so there is probably room for improvement in that area. By comparison, opj_decompress -d 0,0,66000,3072 requires 3.5 GB.

…peration by properly initializing working buffer
…lion pixels

However, the intermediate buffer for decoding must still be smaller than 4
billion pixels, so this is useful for decoding at a lower resolution level,
or for subtile decoding.
Untested though, since that would mean a tile buffer of at least 16 GB. So
there might still be places where uint32 overflow on multiplication occurs...
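
The overflow concern mentioned above can be illustrated with a short C sketch. It is not taken from the PR: tile_buffer_bytes is a hypothetical helper, and it assumes 32-bit integer samples (4 bytes per pixel), which matches the "4 billion pixels, at least 16 GB" figure. The point is to widen to 64 bits before multiplying and to check the result against SIZE_MAX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: compute width * height * sizeof(int32_t) in bytes.
 * Returns 1 and stores the size in *out_bytes on success, or returns 0 if
 * the size does not fit in size_t. */
int tile_buffer_bytes(uint32_t width, uint32_t height, size_t *out_bytes)
{
    /* Widen first: the product of two uint32_t always fits in uint64_t,
     * whereas a plain uint32_t multiplication would silently wrap. */
    uint64_t pixels = (uint64_t)width * height;
    assert(out_bytes != NULL);
    if (pixels > SIZE_MAX / sizeof(int32_t)) {
        return 0;
    }
    *out_bytes = (size_t)pixels * sizeof(int32_t);
    return 1;
}
```

For a 66000x66000 tile, the pixel count is 4,356,000,000, which already exceeds the uint32 range; a 32-bit multiplication would wrap to 61,032,704 and lead to a badly undersized allocation.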
…zation to reading at reduced resolution as well