Error 'called readLinearizationData for file that is not linearized' #680

Closed
pianoslum opened this issue Nov 15, 2020 · 3 comments

@pianoslum

Describe the bug
When OCRing the attached file, I get an error at the end of the run. The resulting PDF looks fine, however.

To Reproduce

ocrmypdf 6.jpg 6a.pdf --image-dpi 5 --output-type=pdf

Run with verbosity -v1 or higher to see more detailed logging. This information may be helpful.

ocrmypdf 11.3.2
Running: ['tesseract', '--list-langs']
No language specified; assuming --language eng
Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['tesseract', '-l', 'eng', '--print-parameters', 'pdf']
Running: ['gs', '--version']
Found gs 9.53.3
pikepdf mmap disabled
os.symlink(6.jpg, /tmp/com.github.ocrmypdf.an15j778/origin)
Input file is not a PDF, checking if it is an image...
Input file is an image
Input image has no ICC profile, assuming sRGB
Image seems valid. Try converting to PDF...
imgformat = JPEG
input dpi = 96 x 96
rotation = 0°
input colorspace = RGB
width x height = 1240px x 1754px
read_images() embeds a JPEG
Successfully converted to PDF, processing...
pikepdf mmap disabled                                                                                                                                                                           
Scanning contents: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 199.89page/s]
Using Tesseract OpenMP thread limit 3
pikepdf mmap disabled                                                                                                                                                                           
    1 Rasterize with png16m, rotation 0                                                                                                                                                         
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r5.000000x5.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/com.github.ocrmypdf.an15j778/origin.pdf']
    1 Rotating output by 0                                                                                                                                                                      
    1 resolution (50, 50)                                                                                                                                                                       
    1 Running: ['tesseract', '-l', 'eng', '-c', 'textonly_pdf=1', PosixPath('/tmp/com.github.ocrmypdf.an15j778/000001_ocr.png'), '/tmp/com.github.ocrmypdf.an15j778/000001_ocr_tess', 'pdf', 'txt']
    1 [tesseract] Warning: Invalid resolution 50 dpi. Using 70 instead.                                                                                                                         
    1 [tesseract] Estimating resolution as 154                                                                                                                                                  
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                                          
    1 Grafting                                                                                                                                                                                  
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                                      
OCR: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:05<00:00,  5.15s/page]
Postprocessing...
Treating 13 as an optimization candidate
XrefExt(xref=13, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1
JPEGs: 0image [00:00, ?image/s]
Treating 13 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
os.symlink(/tmp/com.github.ocrmypdf.an15j778/optimize.opt.pdf, /tmp/com.github.ocrmypdf.an15j778/optimize.pdf)
/tmp/com.github.ocrmypdf.an15j778/optimize.pdf -> 6a.pdf
An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 397, in run_pipeline
    if not check_pdf(options.output_file):
  File "/usr/lib/python3.8/site-packages/ocrmypdf/helpers.py", line 191, in check_pdf
    pdf.check_linearization(sio)
pikepdf._qpdf.ForeignObjectError: called readLinearizationData for file that is not linearized

Example file

6 (attached image)

Expected behavior
No error, or a clearer error message.

System
ArchLinux Linux 5.9.8-zen1-1-zen #1 ZEN SMP PREEMPT Tue, 10 Nov 2020 22:44:06 +0000 x86_64 GNU/Linux
ocrmypdf 11.3.2

@irgendwr

I can also reproduce this.

ocrmypdf linear.pdf ocr.pdf

results in:

Output file is a PDF/A-2B (as expected)
An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 397, in run_pipeline
    if not check_pdf(options.output_file):
  File "/usr/lib/python3.8/site-packages/ocrmypdf/helpers.py", line 191, in check_pdf
    pdf.check_linearization(sio)
pikepdf._qpdf.ForeignObjectError: called readLinearizationData for file that is not linearized

And indeed, the resulting file is not linearized (but valid nonetheless!):

qpdf --check-linearization ocr.pdf
called readLinearizationData for file that is not linearized
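
For a quick check without qpdf, here is a minimal Python sketch of a heuristic. It assumes the PDF convention that a linearized file places its linearization parameter dictionary (containing the `/Linearized` key) within the first 1024 bytes. The function names are hypothetical, and this is far less thorough than `qpdf --check-linearization`: it only reports whether the file claims to be linearized.

```python
def looks_linearized(data: bytes) -> bool:
    """Heuristic: report whether the /Linearized key appears near the start."""
    # Only the first 1024 bytes are inspected, which is where a linearized
    # file is expected to keep its linearization parameter dictionary.
    return b"/Linearized" in data[:1024]


def file_looks_linearized(path: str) -> bool:
    """Apply the same heuristic to a file on disk (hypothetical helper)."""
    with open(path, "rb") as f:
        return looks_linearized(f.read(1024))
```

Calling `file_looks_linearized("ocr.pdf")` on an output like the one above would be expected to return `False`, though a `True` result would not prove that the linearization data is actually correct.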

Versions:

ocrmypdf --version
11.3.3

tesseract --version
tesseract 4.1.1
 leptonica-1.80.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.0.5) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4

qpdf --version
qpdf version 10.0.3

uname -srmo
Linux 5.9.8-1-MANJARO x86_64 GNU/Linux

I can also provide a PDF file to reproduce this if needed.

@irgendwr

irgendwr commented Nov 18, 2020

@jbarlow83 I have submitted two pull requests with different approaches to fixing this; feedback is welcome, of course.

Either this error should be ignored (as RuntimeError already is), or linearization should not be checked at all, since a PDF does not need to be linearized to be valid. As it stands, the check does not seem to serve a purpose: its errors are ignored anyway, and validity is already checked by pdf.check().

Edit: I closed one PR because I was wrong: linearization is being checked even though the errors are ignored. Then again, this might already be covered by pdf.check().
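
To make the "ignore the error" approach concrete, here is a minimal sketch in Python. It is not the ocrmypdf implementation and not the code from either pull request. So that it runs without pikepdf installed, `ForeignObjectError` and `FakePdf` are hypothetical stand-ins that mimic the behavior seen in the traceback; only the method names `check()` and `check_linearization()` are taken from the discussion above.

```python
import io
import logging

log = logging.getLogger(__name__)


class ForeignObjectError(Exception):
    """Stand-in for the pikepdf exception named in the traceback."""


class FakePdf:
    """Hypothetical stand-in for an opened PDF: valid but not linearized."""

    def check(self):
        # An empty list means no structural problems were found.
        return []

    def check_linearization(self, stream):
        # Mimic the failure reported in this issue.
        raise ForeignObjectError(
            "called readLinearizationData for file that is not linearized"
        )


def check_pdf(pdf) -> bool:
    """Return True if pdf.check() reports no problems.

    Linearization findings are logged but never change the result,
    because a PDF does not have to be linearized to be valid.
    """
    problems = pdf.check()
    for msg in problems:
        log.warning(msg)

    sio = io.StringIO()
    try:
        pdf.check_linearization(sio)
    except (RuntimeError, ForeignObjectError) as e:
        # Treat both exception types the same way: advisory only.
        log.debug("Linearization check skipped: %s", e)
    else:
        if sio.getvalue():
            log.debug(sio.getvalue())

    return not problems
```

With this shape, `check_pdf(FakePdf())` returns `True` for a valid, non-linearized file instead of aborting the pipeline. The alternative approach would drop the `check_linearization()` call entirely and rely on `check()` alone.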

@irgendwr

Thanks 😃
