Table extraction - vertical text not handled correctly #3148

Closed
kyliemsauter opened this issue Feb 12, 2024 · 3 comments

@kyliemsauter

Description of the bug

Vertical text in table columns is no longer handled correctly. It worked before the recent switch to the "word extractor" methods.

How to reproduce the bug

Call extract() on any table that has vertical text
Example file: the two cells that contain the vertical text "Text" and "Numbers" are returned with their characters in reverse order, as "txeT" and "srebmuN"
table_examples_vertical_text.pdf
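
A minimal sketch of the reproduction steps, assuming the attached file is saved locally under its own name and that the table sits on the first page (both are assumptions). `find_tables()` and `extract()` are the PyMuPDF table calls; `extract_first_table` and `reversed_cells` are hypothetical helper names used only for illustration.

```python
def reversed_cells(rows, expected):
    """Return the expected strings that show up character-reversed in a cell."""
    cells = {cell for row in rows for cell in row if cell}
    return [word for word in expected
            if word[::-1] in cells and word not in cells]


def extract_first_table(path):
    """Extract the first table found on the first page of the PDF at `path`."""
    import fitz  # PyMuPDF, imported lazily so reversed_cells has no dependency

    doc = fitz.open(path)
    try:
        tables = doc[0].find_tables().tables
        return tables[0].extract() if tables else []
    finally:
        doc.close()


# Usage (needs PyMuPDF and the example file, so it is left commented out):
# rows = extract_first_table("table_examples_vertical_text.pdf")
# print(reversed_cells(rows, ["Text", "Numbers"]))
```

On an affected version the helper should report both "Text" and "Numbers"; after the fix it should report nothing.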

PyMuPDF version

1.23.21

Operating system

Windows

Python version

3.11

JorjMcKie self-assigned this on Feb 15, 2024
JorjMcKie added the "bug", "fix developed" and "release schedule to be determined" labels on Feb 15, 2024
@JorjMcKie
Collaborator

JorjMcKie commented Feb 15, 2024

The fix includes support for all rotations by multiples of 90°.

You may be aware that line breaks in a cell are preserved. This makes some sense as a last resort to identify additional table rows, and in cases with complex cell content ... but only for rotation 0 (horizontal) text. In all other cases (rotations by 90°, 180°, 270°) I am replacing line breaks with spaces.
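
The line-break rule described above can be sketched as follows. This is an illustration only, not the code used in PyMuPDF; the function name `normalize_cell_text` and its `rotation` parameter (in degrees) are hypothetical.

```python
def normalize_cell_text(text, rotation):
    """Keep line breaks for horizontal text; otherwise join lines with spaces."""
    if rotation % 360 == 0:
        # Rotation 0: line breaks are preserved as they are.
        return text
    # Rotations 90, 180, 270: collapse the lines into one space-separated string.
    parts = (part.strip() for part in text.splitlines())
    return " ".join(part for part in parts if part)
```

For example, a cell rotated by 90° whose text arrives as two lines, "First Set" and "of Numbers", would come out as the single string "First Set of Numbers".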

I forgot to mention that the rotation of all of the cell content is determined by its first character / word. In other words, there cannot be different rotations of content within a single cell. But the following works perfectly:
image

A slightly modified version of your nice example
image

will be extracted as

['', 'Column 1', 'Column 2', 'Column 3']
['Text', 'Text A', 'Text B', 'Text C']
[None, 'Text D', 'Text E', 'Text F']
[None, 'Text G', 'Text H', 'Text I']
['First Set of Numbers', '1', '1', '1']
[None, '2', '2', '2']
[None, '3', '3', '3']
[None, '4', '4', '4']
[None, '5', '5', '5']
['Second Set of Numbers', '6', '6', '6']
[None, '7', '7', '7']
[None, '8', '8', '8']
[None, '9', '9', '9']
[None, '10', '10', '10']
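
In the rows above, the first column holds a label once and `None` in the rows that follow it. A minimal sketch for carrying each label down, assuming `None` marks the continuation of a vertically spanning cell as this output suggests; `fill_spanned_cells` is a hypothetical helper and not part of PyMuPDF.

```python
def fill_spanned_cells(rows, column=0):
    """Copy the last non-None value of `column` into the None cells below it."""
    filled = []
    last = None
    for row in rows:
        row = list(row)  # work on a copy so the input rows stay unchanged
        if column < len(row):
            if row[column] is None:
                row[column] = last
            else:
                last = row[column]
        filled.append(row)
    return filled


rows = [
    ['', 'Column 1', 'Column 2', 'Column 3'],
    ['Text', 'Text A', 'Text B', 'Text C'],
    [None, 'Text D', 'Text E', 'Text F'],
    ['First Set of Numbers', '1', '1', '1'],
    [None, '2', '2', '2'],
]
for row in fill_spanned_cells(rows):
    print(row)
```

Each row then carries its own label, which is convenient when loading the result into a flat structure such as a CSV file.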

@kyliemsauter
Author

great thank you!!!

@julian-smith-artifex-com
Collaborator

Fixed in 1.23.24.
