Support for PDF 1.3 logical structure #963

dhdaines · 2023-08-10T04:32:10Z

As promised, here is the other PR supporting the structure tree using pdfminer.six, so there is no overhead and no typing weirdness. In the end the implementation is ~~rather nice and simple.~~ somewhat complex once we take into account the many optional features in the structure tree specification.

There is one caveat, which is mentioned in the docstring: whereas other PDF engines include empty structure elements in the structure tree, this implementation does not, for much the same reason that #961 doesn't do anything for marked structure points. Since pdfplumber is built around extracting objects from the PDF, structure that can't be associated with any objects isn't very useful, at least in my opinion.

Also, in the case where there are unparsed pages in a PDF, it isn't quite clear what to do about structure elements with no explicit page ID, unless we assume that elements with no marked content are always excluded.

But if you like, we can (optionally?) add these structure elements. It isn't too hard to do.
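To make the caveat concrete, here is a minimal sketch of dropping structure elements that have no marked content anywhere beneath them, with a flag for optionally keeping them. The dict shape (`type`, `mcids`, `children`) and the names `prune_empty` and `keep_empty` are assumptions chosen for illustration, not the code in this PR.

```python
def prune_empty(element, keep_empty=False):
    """Return a copy of `element` without content-free descendants.

    Returns None when the element itself has no marked content and no
    surviving children, unless `keep_empty` is set.
    """
    children = [
        pruned
        for child in element.get("children", [])
        if (pruned := prune_empty(child, keep_empty)) is not None
    ]
    has_content = bool(element.get("mcids")) or bool(children)
    if not has_content and not keep_empty:
        return None
    # Copy every key except the children, then attach the surviving ones
    result = {k: v for k, v in element.items() if k != "children"}
    if children:
        result["children"] = children
    return result


# Hypothetical tree: the Sect subtree references no marked content at all
tree = {
    "type": "Document",
    "children": [
        {"type": "H1", "mcids": [0]},
        {"type": "Sect", "children": [{"type": "P"}]},
        {"type": "P", "mcids": [1, 2]},
    ],
}

pruned = prune_empty(tree)
kept_types = [child["type"] for child in pruned["children"]]
all_types = [child["type"] for child in prune_empty(tree, keep_empty=True)["children"]]
```

With the default, `kept_types` holds only the elements that can be tied to page content; with `keep_empty=True` the empty `Sect` subtree is preserved as well.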

codecov · 2023-08-10T04:35:39Z

Codecov Report

Merging #963 (036044d) into develop (336f83f) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           develop      #963    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           18        19     +1     
  Lines         1613      1897   +284     
==========================================
+ Hits          1613      1897   +284

| Files Changed | Coverage Δ |
|---|---|
| pdfplumber/cli.py | `100.00% <100.00%> (ø)` |
| pdfplumber/page.py | `100.00% <100.00%> (ø)` |
| pdfplumber/pdf.py | `100.00% <100.00%> (ø)` |
| pdfplumber/structure.py | `100.00% <100.00%> (ø)` |

dhdaines · 2023-08-10T14:59:29Z

This should be complete now; I'll let you review it at your leisure! It works for me (tm).

dhdaines · 2023-09-06T03:07:20Z

Hi! If you get a chance could you review this soon? The test suite is now pretty extensive since I learned how to create "synthetic" PDFs with a text editor, and I've removed all but one of the pragma: nocover comments (the remaining one is a "shouldn't happen" case).

I think it is really a good implementation of PDF logical structure, though obviously there will be weird PDFs out there that exhibit undefined behaviour!
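For readers curious what a "synthetic" PDF involves, here is a rough sketch that assembles a tiny tagged PDF programmatically rather than in a text editor. The helper name `build_pdf` and the object layout are assumptions for illustration; they are not the test fixtures from this PR, and the result has not been checked against any particular PDF reader.

```python
def build_pdf(objects):
    """Serialize PDF object bodies (bytes) into a minimal PDF file.

    The body at list index i becomes indirect object i + 1, so the first
    entry must be the document catalog referenced by the trailer.
    """
    out = bytearray(b"%PDF-1.3\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # object 0 is always the free-list head
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


# One paragraph of text wrapped in a marked-content section with MCID 0
content = b"/P << /MCID 0 >> BDC BT /F1 12 Tf 72 720 Td (Hello) Tj ET EMC"
font = b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"

objects = [
    b"<< /Type /Catalog /Pages 2 0 R /StructTreeRoot 5 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792]"
    b" /Resources << /Font << /F1 " + font + b" >> >> /Contents 4 0 R >>",
    b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    b"<< /Type /StructTreeRoot /K [6 0 R] >>",
    # A single P structure element pointing at MCID 0 on the page (object 3)
    b"<< /Type /StructElem /S /P /P 5 0 R /Pg 3 0 R /K 0 >>",
]

pdf_bytes = build_pdf(objects)
```

Writing `pdf_bytes` to a file in binary mode gives something to open in a parser. The fiddly part of hand-writing such files is the cross-reference table, since every offset must match the byte position of its object, which is why the sketch computes them while serializing.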

jsvine · 2023-11-09T20:10:15Z

Thanks for this, @dhdaines! My apologies for not getting to it sooner; it took me a little while to wrap my head around it. Now merged. One quick follow-up: Want to note the method in the README.md, summarized however you best see fit (or just linking to your docs/structure.md file)?

dhdaines · 2023-11-09T20:27:30Z

> Thanks for this, @dhdaines! My apologies for not getting to it sooner; it took me a little while to wrap my head around it. Now merged. One quick follow-up: Want to note the method in the README.md, summarized however you best see fit (or just linking to your docs/structure.md file)?

Ah! You're right, it ought to be in README.md; I thought that I had put it there. I can submit another PR for this.

dhdaines · 2023-11-09T20:32:31Z

Thanks for merging as well! No problem about the delay, it is a large and complex feature. There is one quirk to the implementation that might require a follow-on: structure elements are allowed to span multiple pages, which is complicated to handle properly because PDF is otherwise extremely page-oriented (marked content sections notably can't do this). This means that, in some situations, objects that are in the structure tree might not appear to be in it. I will file this as an issue once I find a good test case for it.
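The multi-page quirk can be illustrated with a small sketch. In the PDF structure model, an element may name a default page, and an individual marked-content reference may name a different one. The dict shape below (`page_number`, `content`, `children`) and the function `mcids_by_page` are hypothetical stand-ins for illustration, not this implementation's data model.

```python
def mcids_by_page(element, inherited_page=None):
    """Map page number -> MCIDs referenced anywhere under `element`."""
    page = element.get("page_number", inherited_page)
    found = {}
    for ref in element.get("content", []):
        # A bare int lives on the element's own page; a (page, mcid) pair
        # stands for a content reference that names its own page.
        ref_page, mcid = ref if isinstance(ref, tuple) else (page, ref)
        found.setdefault(ref_page, []).append(mcid)
    for child in element.get("children", []):
        for child_page, mcids in mcids_by_page(child, page).items():
            found.setdefault(child_page, []).extend(mcids)
    return found


# Hypothetical paragraph that starts on page 1 and continues on page 2
paragraph = {"type": "P", "page_number": 1, "content": [3, (2, 0)]}
spans = mcids_by_page(paragraph)

# A lookup that trusts only the element's own page never sees page 2
naive_pages = {paragraph["page_number"]}
missed_pages = set(spans) - naive_pages
```

Here `spans` associates MCID 3 with page 1 and MCID 0 with page 2, while a lookup keyed only on the element's own page misses the content on page 2. That is the sort of situation in which an object belongs to the structure tree but does not appear to.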

jsvine · 2023-11-10T22:20:23Z

Ah, interesting. I think I understand in theory, but not quite sure in practice — so looking forward to that test case. Thanks!

dhdaines changed the base branch from stable to develop August 10, 2023 04:32

dhdaines force-pushed the structure_tree branch from 3fd0834 to e02857c Compare August 10, 2023 04:48

dhdaines force-pushed the structure_tree branch from c1bacc5 to 5a18a82 Compare August 10, 2023 17:32

dhdaines mentioned this pull request Aug 10, 2023

Add --structure-text flag to CLI (like pdfinfo -struct-text but better) #967

Closed

feat: extract structure tree from pages or documents

3c83366

dhdaines force-pushed the structure_tree branch from 5264d2a to 3c83366 Compare August 19, 2023 15:51

dhdaines added 3 commits August 19, 2023 10:00

feat: add --structure-text, like pdfinfo -struct-text (but better)

183d5a8

test: trivial synthetic pdf for completing structure coverage

8d485c3

fix: complete coverage and fix handling of OBJR/MCR

14f9a67

dhdaines force-pushed the structure_tree branch from 08da3e0 to 14f9a67 Compare September 6, 2023 03:03

fix: handle more MCR/OBJR madness

036044d

dhdaines mentioned this pull request Oct 13, 2023

Future Road Map pdfminer/pdfminer.six#154

Open

jsvine merged commit 35ed9e0 into jsvine:develop Nov 9, 2023

jsvine mentioned this pull request Nov 9, 2023

Accessibility tagging #909

Open

dhdaines deleted the structure_tree branch February 5, 2024 12:45
