find_tables on landscape page generates reversed text #2812

Closed
dsanchezseco-ibm opened this issue Nov 16, 2023 · 11 comments · Fixed by #2818
Labels
bug, fix developed, release schedule to be determined

Comments

@dsanchezseco-ibm

Describe the bug (mandatory)

When find_tables is run on a table in a landscape page, the text extracted with either extract or to_pandas comes out reversed: both the characters within each cell and the order of the cells within each list.

To Reproduce (mandatory)

Run this snippet on a document that has a table on a landscape page:

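def extract_tables(pdf_document, start, end):
    # Collect the cell text of every table found on pages start..end (inclusive).
    tables = []
    for page in range(start, end + 1):
        local_tables = pdf_document[page].find_tables()
        for table in local_tables:
            # extract() returns the table as a list of rows, each a list of cell strings
            tables.append(table.extract())
    print(tables)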

It generates something like this:

[[['aicnerefeR\n)CRM(', '100', '200', '300', '400', '500', '600', '700', '800', '900'], ['abeurp\n,)sotseuper(\ned\nsopiuqe\nselbimusnoc\n,]sodireuqer\ny\nsatneimarreH[', 'acirtémomanid acirtémomanid\nevalL evalL\n]1DL[ ]2DL[', 'ó\notcatnoc\naselgnI\nnis evalL\nresál litátrop\n]IL[\nortemómreT ó\nrapomreT ajiF\nevalL\n]LT[ ]PT[ ]FL[', 'nóicarbiv\ned\nrodideM\n]VM[', 'esargne\n)K2K\nalotsiP\nasarg(\n]EP[', 'serotom\narap\nlitátrop\naicnetop\nortemóucavonaM\nortemónaM\ned\nrodideM\nsocisáfirt\n]50MM[ ]50VM[\n]WM[', 'aselgnI\n)rodaicnatsid )rotom\n)abmob )otneimalpoca\nevalL roiretxe )ocinácem )otelpmoc )otelpmoc otneimador\nroiretni\n)etsagsed\nocitsálp otneimador\n]IL[ó latigid\nnellA ortemórciM ortemórciM )roslupmi )rotcelfed olliuqsac\nollitraM ajiF erbilaC ereic ollina abmob )satnuj socat otneimalpoca( 3C/6136\nevalL evalL\n0222( 0452( 1.0024( 0051( ed 1.0642( 1103x2( ogeuj( ogeuj(\n]PM[ ]FL[ ]AL[ ]IM[ ]EM[ ]DC[ eje( x2(', 'aselgnI\nevalL\n)ocinácem\n]IL[\nlatigid\nó nellA\najiF erreic )satnuj\nerbilaC\nevalL evalL\n1.0024(\nogeuj(\n]FL[ ]AL[ ]DC[', 'aselgnI\n)rodaicnatsid\n)abmob\nevalL\nocitsálp otneimador\n]IL[\nlatigid\nolliuqsac\nó nellA\najiF )satnuj\nerbilaC ollitraM\nevalL evalL 1103\n1.0642(\nogeuj(\n]FL[ ]AL[ ]DC[ ]PM[ x\n2(', 'aselgnI\n)rotom\nevalL\notneimador\n]IL[\nlatigid\nó nellA\najiF\nerbilaC 3C/6136\nevalL evalL\n]FL[ ]AL[ ]DC[ x\n2('], ['aírogetaC lanosrep', 'ocináceM', 'ocináceM', 'ocináceM', 'ocináceM', '&\natsicirtcelE ocináceM', 'ocináceM', 'ocináceM', 'ocináceM', 'ocináceM'], ['odirrucsnart\nopmeiT )saroh(', '5,0', '80,0', '1', '1,0', '1', '4', '2', '3', '1'], ['daditnaC erbmoh', '1', '1', '1', '1', '2', '2', '2', '2', '1'], [')saroh(\nlatot\nopmeiT odamitse', '5,0', '80,0', '1', '1,0', '2', '8', '4', '6', '1'], ['dadicidoireP', 'lausneM', 'lartsemirT', 'lartsemirT', 'lartsemirT', 'launA', 'launaiB', 'ovitcerroC\n)aguf\nis(', ')arutor\novitcerroC\nyah\nis(', ')arutor\novitcerroC\nyah\nis('], ['le', 'odr', 'odr', 'odr', 'odr', '', '', '', 
'', '']]]

Expected behavior (optional)

The expected output for this case is:

[[['009', '008', '007', '006', '005', '004', '003', '002', '001', '(MRC)\nReferencia'], ['(2\nx [CD] [LA] [LF]\nLlave Llave\n6316/C3 Calibre\nFija\nAllen ó\ndigital\n[LI]\nrodamiento\nLlave\nmotor)\nInglesa', '(2\nx [MP] [CD] [LA] [LF]\n(juego\n(2460.1\n3011 Llave Llave\nMartillo Calibre\njuntas) Fija\nAllen ó\ncasquillo\ndigital\n[LI]\nrodamiento plástico\nLlave\nbomba)\ndistanciador)\nInglesa', '[CD] [LA] [LF]\n(juego\n(4200.1\nLlave Llave\nCalibre\njuntas) cierre Fija\nAllen ó\ndigital\n[LI]\nmecánico)\nLlave\nInglesa', '(2x (eje [CD] [ME] [MI] [LA] [LF] [MP]\n(juego (juego (2x3011 (2460.1 de (1500 (4200.1 (2540 (2220\nLlave Llave\n6316/C3 (acoplamiento tacos juntas) bomba anillo ciere Calibre Fija Martillo\ncasquillo deflector) impulsor) Micrómetro Micrómetro Allen\ndigital ó[LI]\nrodamiento plástico\ndesgaste)\ninterior\nrodamiento completo) completo) mecánico) exterior Llave\nacoplamiento) bomba)\nmotor) distanciador)\nInglesa', '[MW]\n[MV05] [MM05]\ntrifásicos\nMedidor\nde\nManómetro\nManovacuómetro\npotencia\nportátil\npara\nmotores', '[PE]\n(grasa\nPistola\nK2K)\nengrase', '[MV]\nMedidor\nde\nvibración', '[LF] [TP] [TL]\nLlave\nFija Termopar\nó Termómetro\n[LI]\nportátil láser\nLlave sin\nInglesa\ncontacto\nó', '[LD2] [LD1]\nLlave Llave\ndinamométrica dinamométrica', '[Herramientas\ny\nrequeridos],\nconsumibles\nequipos\nde\n(repuestos),\nprueba'], ['Mecánico', 'Mecánico', 'Mecánico', 'Mecánico', 'Mecánico Electricista\n&', 'Mecánico', 'Mecánico', 'Mecánico', 'Mecánico', 'personal Categoría'], ['1', '3', '2', '4', '1', '0,1', '1', '0,08', '0,5', '(horas) Tiempo\ntranscurrido'], ['1', '2', '2', '2', '2', '1', '1', '1', '1', 'hombre Cantidad'], ['1', '6', '4', '8', '2', '0,1', '1', '0,08', '0,5', 'estimado Tiempo\ntotal\n(horas)'], ['(si\nhay\nCorrectivo\nrotura)', '(si\nhay\nCorrectivo\nrotura)', '(si\nfuga)\nCorrectivo', 'Bianual', 'Anual', 'Trimestral', 'Trimestral', 'Trimestral', 'Mensual', 'Periodicidad'], ['', '', '', '', '', 'rdo', 'rdo', 'rdo', 
'rdo', 'el']]]
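
Comparing the two lists above, the expected output appears to be the observed output with every cell string reversed character by character and every row reversed. A minimal sketch of that relationship in plain Python follows. The helper name `unreverse_table` is hypothetical and not part of PyMuPDF, and the `None` handling is only a defensive assumption about empty cells. It mirrors the expected output exactly as posted (digits get flipped too), so it illustrates the symptom rather than standing in for a proper fix.

```python
def unreverse_table(rows):
    """Reverse the characters of each cell and the order of the cells in each row."""
    fixed = []
    for row in rows:
        # reversed(row) flips the cell order; cell[::-1] flips the characters
        fixed.append([cell[::-1] if cell is not None else None for cell in reversed(row)])
    return fixed


# Two shortened rows taken from the observed output above
observed = [["aicnerefeR\n)CRM(", "100", "200"], ["daditnaC erbmoh", "1", "1"]]
restored = unreverse_table(observed)
print(restored)
```

Applying the helper twice returns the original rows, which is a quick way to check that nothing else was altered.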

Your configuration (mandatory)

3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)]
 darwin

PyMuPDF 1.23.6: Python bindings for the MuPDF 1.23.5 library.
Version date: 2023-11-06 00:00:01.
Built for Python 3.11 on darwin (64-bit).

PyMuPDF installed with pip

@JorjMcKie
Collaborator

I would like to reproduce this. Can you please provide an example page?
Thanks.

@JorjMcKie JorjMcKie added the "bug", "fix developed" and "release schedule to be determined" labels Nov 17, 2023
JorjMcKie added a commit that referenced this issue Nov 18, 2023
Tables on pages with other than rotation 0 were not detected and extracted correctly.
This was due to incorrectly setting the clip parameter and to pdfplumber's issues dealing with characters extracted by PyMuPDF.
JorjMcKie added a commit that referenced this issue Nov 19, 2023
We did not properly support tables on rotated pages for a number of causes.
This fix now correctly handles the clip area to look for tables and replaces cell text extraction with original PyMuPDF code.
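
The commit messages above tie the problem to pages whose rotation is not 0. As a hedged sketch, pages that may be affected can be listed by checking each page's rotation. The code relies only on every page object exposing a `rotation` attribute in degrees, as PyMuPDF's `Page.rotation` does. The helper name `rotated_page_numbers` is hypothetical, and the stand-in page objects are there only so that the example runs without a PDF file.

```python
from types import SimpleNamespace


def rotated_page_numbers(document):
    """Return (page_number, rotation) pairs for pages whose rotation is not 0."""
    rotated = []
    for number, page in enumerate(document):
        rotation = page.rotation % 360
        if rotation != 0:
            rotated.append((number, rotation))
    return rotated


# Stand-ins for pages; with PyMuPDF one would pass an opened document instead.
fake_document = [
    SimpleNamespace(rotation=0),
    SimpleNamespace(rotation=90),
    SimpleNamespace(rotation=270),
]
print(rotated_page_numbers(fake_document))
```

With a real file, the same function could be given the document object returned by `fitz.open(...)`, since iterating over it yields its pages.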
@JorjMcKie JorjMcKie mentioned this issue Nov 19, 2023
@JorjMcKie JorjMcKie reopened this Nov 19, 2023
@JorjMcKie
Collaborator

Wait for official publication before closing this.

@dsanchezseco-ibm
Author

Cool, thank you! I'll keep an eye out. Sorry for not providing an example, this was the work account and I was on weekend mode

@JorjMcKie
Collaborator

> Cool, thank you! I'll keep an eye out. Sorry for not providing an example, this was the work account and I was on weekend mode

No problem. The necessary changes are confined to one file of the package, table.py.
I don't know what update privileges you have on your work account, but as a hot fix you could simply replace that file in your installation folder and confirm that the problem is indeed solved.
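
For the hot fix, the file to replace first has to be located in the installation folder. A small sketch using only the standard library is shown here. The helper name `find_package_file` is hypothetical, and the assumption that the file lives directly inside the `fitz` package directory (the import name used by PyMuPDF 1.23.x) should be checked against the actual installation.

```python
import importlib.util
from pathlib import Path


def find_package_file(package, filename):
    """Return the path of `filename` inside an installed package, or None if not found."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / filename
        if candidate.is_file():
            return candidate
    return None


# Assumed package and file names for PyMuPDF 1.23.x; prints None if not installed.
print(find_package_file("fitz", "table.py"))
```

Keeping a backup copy of the original file before overwriting it makes the change easy to revert once the official release is installed.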

@dsanchezseco-ibm
Author

I'll check it out and let you know

@dsanchezseco-ibm
Author

dsanchezseco-ibm commented Nov 20, 2023

https://github.com/pymupdf/PyMuPDF/pull/2818/files#diff-93d282fd72c8edf2245fbab4a988f37011fea2c8db71a8b7441116e867ddeabaR1168

Shouldn't it be self.cells?

I installed the current main and it complains on that line about the cells variable.

@JorjMcKie
Collaborator

Absolutely! Thank you!

@dsanchezseco-ibm
Author

OK, the table extraction now works correctly (trying the hot fix), but there is another typo: you are assigning r instead of rect.

@JorjMcKie
Collaborator

incredible - it's Monday morning 😒
Thanks again!

@dsanchezseco-ibm
Author

All good now! Thanks and enjoy the coffee!

@julian-smith-artifex-com
Collaborator

Fixed in 1.23.7.
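
To check whether an installation already includes the fix, the installed version can be compared against 1.23.7. The sketch below works on a plain version string. The helper name `has_rotation_fix` is hypothetical, and where the string comes from is an assumption (in the 1.23 series the bindings version should be available as `fitz.VersionBind`, but verify this for your installation).

```python
import re

FIXED_IN = (1, 23, 7)  # release named in the comment above


def has_rotation_fix(version_string):
    """Return True if the version string is at least 1.23.7."""
    # Take the first three numeric components, e.g. "1.23.6" -> (1, 23, 6)
    parts = tuple(int(p) for p in re.findall(r"\d+", version_string)[:3])
    return parts >= FIXED_IN


print(has_rotation_fix("1.23.6"))
print(has_rotation_fix("1.23.7"))
```

The version reported in the configuration section of this issue, 1.23.6, predates the fix.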
