Support for PDF 1.3 logical structure #963
Codecov Report
@@            Coverage Diff            @@
##           develop     #963    +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           18       19     +1
  Lines         1613     1897   +284
=========================================
+ Hits          1613     1897   +284
This should be complete now; I'll let you review it at your leisure! It works for me (tm)
Hi! If you get a chance, could you review this soon? The test suite is now pretty extensive since I learned how to create "synthetic" PDFs with a text editor, and I've removed all but one of the [...] I think it is really a good implementation of PDF logical structure, though obviously there will be weird PDFs out there that rely on undefined behaviour!
Thanks for this, @dhdaines! My apologies for not getting to it sooner; it took me a little while to wrap my head around it. Now merged. One quick follow-up: Want to note the method in the README.md, summarized however you best see fit (or just linking to your
Ah! You're right, it ought to be in README.md; I thought that I had put it there. I can submit another PR for this.
Thanks for merging as well! No problem about the delay, it is a large and complex feature. There is one quirk to the implementation that might require a follow-on: structure elements are allowed to span multiple pages, which is complicated to handle properly because PDF is otherwise extremely page-oriented (marked content sections notably can't do this). This means that objects that are in the structure tree might not appear to be in it in some situations. I will file this as an issue once I find a good test case for it.
Ah, interesting. I think I understand in theory, but I'm not quite sure in practice, so I'm looking forward to that test case. Thanks!
As promised, here is the other PR supporting the structure tree using pdfminer.six, so no overhead and no typing weirdness. In the end the implementation is somewhat complex once we take into account the multiplicity of optional features in the structure tree specification.
There is one caveat, which is mentioned in the docstring: whereas other PDF engines will include empty structure elements in the structure tree, this implementation does not, for kind of the same reason that #961 doesn't do anything for marked structure points. Since pdfplumber is based around extracting objects from the PDF, it isn't very useful to have structure that can't be associated to any objects, at least in my opinion.
Also, in the case where there are unparsed pages in a PDF, it isn't quite clear what to do about structure elements with no explicit page ID, unless we assume that elements with no marked content are always excluded.
But, if you like, we can (optionally?) add these structure elements, it isn't too hard to do.
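To make the caveat about empty structure elements concrete, here is a minimal sketch of the pruning idea in plain Python. It is not the code from this PR: the node shape (a dict with "type", "mcids", and "children" keys) and the function name prune_empty are assumptions made up for illustration.

```python
from typing import Any, Dict, Optional

# Hypothetical node shape, for illustration only (not the PR's actual API):
# {"type": str, "mcids": [int, ...], "children": [node, ...]}
Node = Dict[str, Any]


def prune_empty(node: Node) -> Optional[Node]:
    """Return a copy of node without empty descendants.

    A node is "empty" when neither it nor anything beneath it refers to
    marked content. Returns None if the node itself is empty.
    """
    children = []
    for child in node.get("children", []):
        kept = prune_empty(child)
        if kept is not None:
            children.append(kept)
    mcids = list(node.get("mcids", []))
    if not mcids and not children:
        return None
    pruned: Node = {"type": node["type"]}
    if mcids:
        pruned["mcids"] = mcids
    if children:
        pruned["children"] = children
    return pruned


tree = {
    "type": "Document",
    "children": [
        {"type": "H1", "mcids": [0]},
        # This section has no marked content anywhere, so it is dropped.
        {"type": "Sect", "children": [{"type": "P"}]},
        {"type": "Sect", "children": [{"type": "P", "mcids": [1, 2]}]},
    ],
}
pruned_tree = prune_empty(tree)
```

Making the empty elements optional, as offered above, would roughly amount to skipping this pruning step behind a flag.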