Table extraction - vertical text not handled correctly #3148

kyliemsauter · 2024-02-12T18:07:10Z

Description of the bug

Vertical text in table columns is no longer handled correctly. It worked before the recent switch to the "word extractor" methods.

How to reproduce the bug

Call extract() on any table that contains vertical text.
Example file: the two cells with the vertical text "Text" and "Numbers" are returned with their characters reversed, as "txeT" and "srebmuN".
table_examples_vertical_text.pdf
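
A minimal reproduction sketch, assuming the attached file is saved locally and that the affected table is the first one on the first page (both are assumptions). It uses `page.find_tables()` and `Table.extract()`; the helper names `extract_first_table` and `reversed_cells` are made up for this example. The second helper flags cells whose text is an expected word spelled backwards:

```python
def extract_first_table(path):
    """Return the rows of the first table on the first page of `path`."""
    import fitz  # PyMuPDF; imported lazily so reversed_cells() works without it

    doc = fitz.open(path)
    try:
        tables = doc[0].find_tables().tables
        return tables[0].extract() if tables else []
    finally:
        doc.close()


def reversed_cells(rows, expected_words):
    """List (row, col, text) for cells holding an expected word spelled backwards."""
    backwards = {word[::-1] for word in expected_words}
    return [
        (r, c, cell)
        for r, row in enumerate(rows)
        for c, cell in enumerate(row)
        if cell in backwards  # None cells never match
    ]


# Usage (needs PyMuPDF and the attached PDF):
# rows = extract_first_table("table_examples_vertical_text.pdf")
# print(reversed_cells(rows, ["Text", "Numbers"]))
```

On an affected version the list should contain the two reversed cells; on a fixed version it should be empty.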

PyMuPDF version

1.23.21

Operating system

Windows

Python version

3.11

JorjMcKie · 2024-02-15T15:08:22Z

The fix includes support for all rotations by multiples of 90°.

You may be aware that line breaks in a cell are preserved. This makes some sense as a last resort for identifying additional table rows, and in cases with complex cell content, but only for rotation 0 (horizontal) text. In the other cases (rotations by 90°, 180°, 270°) I am replacing line breaks with spaces.
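
The line-break rule can be sketched roughly like this. It only illustrates the behaviour described above and is not the actual PyMuPDF code; the function name `normalize_cell_text` and its parameters are invented for the example:

```python
def normalize_cell_text(text, rotation):
    """Keep line breaks for horizontal text, join lines with spaces otherwise.

    `rotation` is the rotation of the cell's text in degrees
    (a multiple of 90; 0 means horizontal).
    """
    rotation %= 360
    if rotation not in (0, 90, 180, 270):
        raise ValueError("rotation must be a multiple of 90 degrees")
    if rotation == 0:
        return text  # horizontal text: line breaks are preserved
    return " ".join(text.splitlines())  # rotated text: breaks become spaces
```

For instance, a 90° cell whose lines are "First Set" and "of Numbers" would come out as the single string "First Set of Numbers".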

Forgot to mention that the rotation of all of a cell's content is determined by its first character / word. In other words, a single cell cannot contain content with different rotations. But the following works perfectly:

A slightly modified version of your nice example will be extracted as:

['', 'Column 1', 'Column 2', 'Column 3']
['Text', 'Text A', 'Text B', 'Text C']
[None, 'Text D', 'Text E', 'Text F']
[None, 'Text G', 'Text H', 'Text I']
['First Set of Numbers', '1', '1', '1']
[None, '2', '2', '2']
[None, '3', '3', '3']
[None, '4', '4', '4']
[None, '5', '5', '5']
['Second Set of Numbers', '6', '6', '6']
[None, '7', '7', '7']
[None, '8', '8', '8']
[None, '9', '9', '9']
[None, '10', '10', '10']
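
In this output, `None` appears to mark the rows covered by a vertical cell that spans several rows (this is a reading of the listing above, not a documented guarantee). If a label is wanted on every row, a small post-processing sketch could carry the last label downward; `fill_down` is a hypothetical helper, not part of PyMuPDF:

```python
def fill_down(rows, col=0):
    """Return a copy of `rows` where None in column `col` takes the last non-None value above it."""
    filled = []
    last = None
    for row in rows:
        row = list(row)  # copy, so the extracted rows stay untouched
        if row[col] is None:
            row[col] = last
        else:
            last = row[col]
        filled.append(row)
    return filled


rows = [
    ["", "Column 1", "Column 2", "Column 3"],
    ["Text", "Text A", "Text B", "Text C"],
    [None, "Text D", "Text E", "Text F"],
    ["First Set of Numbers", "1", "1", "1"],
    [None, "2", "2", "2"],
]
for row in fill_down(rows):
    print(row)
```

Only the chosen column is filled, so a genuinely empty cell elsewhere in the table is left alone.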

kyliemsauter · 2024-02-16T19:34:42Z

great thank you!!!

julian-smith-artifex-com · 2024-02-19T22:16:39Z

Fixed in 1.23.24.

JorjMcKie self-assigned this Feb 15, 2024

JorjMcKie added the "bug", "fix developed", and "release schedule to be determined" labels Feb 15, 2024

julian-smith-artifex-com closed this as completed Feb 19, 2024
