Error 'called readLinearizationData for file that is not linearized' #680

pianoslum · 2020-11-15T11:47:47Z

Describe the bug
When OCRing the attached file, I get an error at the end of the run. The resulting PDF looks fine, however.

To Reproduce

ocrmypdf 6.jpg 6a.pdf --image-dpi 5 --output-type=pdf

Run with verbosity -v1 or higher to see more detailed logging. This information may be helpful.

ocrmypdf 11.3.2
Running: ['tesseract', '--list-langs']
No language specified; assuming --language eng
Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['tesseract', '-l', 'eng', '--print-parameters', 'pdf']
Running: ['gs', '--version']
Found gs 9.53.3
pikepdf mmap disabled
os.symlink(6.jpg, /tmp/com.github.ocrmypdf.an15j778/origin)
Input file is not a PDF, checking if it is an image...
Input file is an image
Input image has no ICC profile, assuming sRGB
Image seems valid. Try converting to PDF...
imgformat = JPEG
input dpi = 96 x 96
rotation = 0°
input colorspace = RGB
width x height = 1240px x 1754px
read_images() embeds a JPEG
Successfully converted to PDF, processing...
pikepdf mmap disabled                                                                                                                                                                           
Scanning contents: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 199.89page/s]
Using Tesseract OpenMP thread limit 3
pikepdf mmap disabled                                                                                                                                                                           
    1 Rasterize with png16m, rotation 0                                                                                                                                                         
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r5.000000x5.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/com.github.ocrmypdf.an15j778/origin.pdf']
    1 Rotating output by 0                                                                                                                                                                      
    1 resolution (50, 50)                                                                                                                                                                       
    1 Running: ['tesseract', '-l', 'eng', '-c', 'textonly_pdf=1', PosixPath('/tmp/com.github.ocrmypdf.an15j778/000001_ocr.png'), '/tmp/com.github.ocrmypdf.an15j778/000001_ocr_tess', 'pdf', 'txt']
    1 [tesseract] Warning: Invalid resolution 50 dpi. Using 70 instead.                                                                                                                         
    1 [tesseract] Estimating resolution as 154                                                                                                                                                  
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                                          
    1 Grafting                                                                                                                                                                                  
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                                      
OCR: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:05<00:00,  5.15s/page]
Postprocessing...
Treating 13 as an optimization candidate
XrefExt(xref=13, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1
JPEGs: 0image [00:00, ?image/s]
Treating 13 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
os.symlink(/tmp/com.github.ocrmypdf.an15j778/optimize.opt.pdf, /tmp/com.github.ocrmypdf.an15j778/optimize.pdf)
/tmp/com.github.ocrmypdf.an15j778/optimize.pdf -> 6a.pdf
An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 397, in run_pipeline
    if not check_pdf(options.output_file):
  File "/usr/lib/python3.8/site-packages/ocrmypdf/helpers.py", line 191, in check_pdf
    pdf.check_linearization(sio)
pikepdf._qpdf.ForeignObjectError: called readLinearizationData for file that is not linearized

Example file

Expected behavior
No error, or a clearer message.

System
ArchLinux Linux 5.9.8-zen1-1-zen #1 ZEN SMP PREEMPT Tue, 10 Nov 2020 22:44:06 +0000 x86_64 GNU/Linux
ocrmypdf 11.3.2


irgendwr · 2020-11-18T17:52:41Z

I can also reproduce this.

ocrmypdf linear.pdf ocr.pdf

results in:

Output file is a PDF/A-2B (as expected)
An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 397, in run_pipeline
    if not check_pdf(options.output_file):
  File "/usr/lib/python3.8/site-packages/ocrmypdf/helpers.py", line 191, in check_pdf
    pdf.check_linearization(sio)
pikepdf._qpdf.ForeignObjectError: called readLinearizationData for file that is not linearized

And indeed, the resulting file is not linearized (but valid nonetheless!):

qpdf --check-linearization ocr.pdf
called readLinearizationData for file that is not linearized
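
As a rough cross-check that does not need qpdf: the PDF specification requires a linearized file to carry its linearization parameter dictionary, which contains a /Linearized key, within the first 1024 bytes. The sketch below relies only on that rule. The helper names `looks_linearized` and `file_looks_linearized` are hypothetical, and this is a heuristic, not a replacement for `qpdf --check-linearization`.

```python
def looks_linearized(data: bytes) -> bool:
    """Heuristic: a linearized PDF has a /Linearized key in its first 1024 bytes."""
    return b"/Linearized" in data[:1024]


def file_looks_linearized(path: str) -> bool:
    """Apply the same heuristic to a file on disk (hypothetical helper)."""
    # Only the head of the file matters for this check.
    with open(path, "rb") as f:
        return looks_linearized(f.read(1024))
```

For the output above, `file_looks_linearized("ocr.pdf")` would be expected to return False, which matches a file that is valid but simply not linearized.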

Versions:

ocrmypdf --version
11.3.3

tesseract --version
tesseract 4.1.1
 leptonica-1.80.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.0.5) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4

qpdf --version
qpdf version 10.0.3

uname -srmo
Linux 5.9.8-1-MANJARO x86_64 GNU/Linux

I can also provide a PDF file to reproduce this if needed.

irgendwr · 2020-11-18T18:46:06Z

@jbarlow83 I have submitted two pull requests with different approaches to fixing this. Feedback is welcome, of course.

Either this error should be ignored (like RuntimeError already is) or linearization should not be checked, since it is not required in order for the PDF to be valid. ~~Currently this verification does not seem to make sense since the errors are ignored anyway and the validity is checked by pdf.check().~~

Edit: I closed one PR because I was wrong: linearization is being checked even though the errors are ignored. Then again, this might already be covered by pdf.check().
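
To make the "ignore the error" approach concrete, here is a minimal sketch of the pattern. It is not the code from either pull request or from the commit that closed this issue. `FakePdf`, the local `ForeignObjectError` class, and `linearization_problems` are hypothetical stand-ins so the example runs without pikepdf; only the exception name, the method name `check_linearization`, and the error message are taken from the traceback above.

```python
import io


class ForeignObjectError(Exception):
    """Stand-in for pikepdf._qpdf.ForeignObjectError seen in the traceback."""


class FakePdf:
    """Hypothetical stand-in for an opened PDF; this is not the pikepdf API."""

    def __init__(self, linearized: bool, damaged: bool = False):
        self.linearized = linearized
        self.damaged = damaged

    def check_linearization(self, stream) -> None:
        # Mimic the reported behaviour: raise if the file is not linearized.
        if not self.linearized:
            raise ForeignObjectError(
                "called readLinearizationData for file that is not linearized"
            )
        # Otherwise, write any complaints about the linearization data.
        if self.damaged:
            stream.write("WARNING: linearization data is inconsistent\n")


def linearization_problems(pdf) -> str:
    """Return linearization complaints, or '' if there are none
    or the file is simply not linearized."""
    sio = io.StringIO()
    try:
        pdf.check_linearization(sio)
    except (RuntimeError, ForeignObjectError):
        # Missing linearization does not make a PDF invalid, so ignore it.
        return ""
    return sio.getvalue()
```

With this shape, a non-linearized file yields an empty string instead of aborting the pipeline, while a file whose linearization is present but damaged can still be reported.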

irgendwr · 2020-11-18T20:18:07Z

Thanks 😃

This was referenced Nov 18, 2020

Fix: except pikepdf._qpdf.ForeignObjectError #681

Closed

Fix: do not check linearization #682

Closed

jbarlow83 closed this as completed in d71e50e Nov 18, 2020
