Accessibility tagging #909
Comments
Hi @NathanTech7713 — thanks for your interest in this library, and for this suggestion. For my own notes and for others who may be less familiar:
And some general questions: What should the output of this extraction look like? A nested tree of tags? Something else? @NathanTech7713: Do you have any examples of other PDF extraction libraries that have a feature like this, and which you think would provide a useful model? |
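To make the "nested tree of tags" option concrete, here is a minimal sketch of one possible output shape. It is purely illustrative: `StructElement`, its field names, and `to_dict` are hypothetical and are not part of pdfplumber or any other library mentioned in this thread.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StructElement:
    """Hypothetical node type: one tag in a PDF structure tree."""

    type: str  # tag name, e.g. "Document", "H1", "P"
    mcids: List[int] = field(default_factory=list)  # marked-content IDs under this tag
    children: List["StructElement"] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        """Serialize to plain dicts and lists so the tree can be dumped as JSON."""
        out: Dict[str, Any] = {"type": self.type}
        if self.mcids:
            out["mcids"] = list(self.mcids)
        if self.children:
            out["children"] = [child.to_dict() for child in self.children]
        return out


# A toy tree: a document containing one heading and one paragraph.
tree = StructElement(
    "Document",
    children=[
        StructElement("H1", mcids=[0]),
        StructElement("P", mcids=[1, 2]),
    ],
)
nested = tree.to_dict()
```

Empty `mcids` and `children` are omitted from the serialized form to keep the output small; whether that is desirable is a design choice, not something settled in the discussion.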
Hi! I was about to make this same feature request. I've done a bit of exploration here as I am working on extracting the structure from PDFs and, obviously, it makes sense to use explicit structure if it's there... well, sort of. Most of the libraries that support tagged PDF are closed-source, but some functionality to extract it exists in Poppler and pdf.js, and you can see the tags by running Basically there are a couple of moving parts, which you can find starting in section 10.5 of the PDF 1.7 spec (or maybe section 14, if you have the Adobe/ISO document?):
See https://github.com/dhdaines/alexi/blob/main/scripts/pdfstructure.py for a quick-and-dirty script (based on |
What I would find minimally useful (but I can't speak for the original author of this issue) would be:
|
Whoops! Got to be honest, thought I replied and then didn't! @dhdaines sums up quite well what I am also hoping for. I think I mentioned quite a while ago that I eventually want to put together an accessible PDF reader for screen reader (totally blind) users of Windows, so accessibility tagging would be a solid way of identifying structure. |
Thank you both, these are very helpful notes/context. I can't promise I'll get to this soon, but it does seem worth trying to add. |
If it helps I can make a preliminary PR with something like what I mentioned above (extraction of marked content sections + structure tree parsing) |
@dhdaines Thanks for the offer! Is there a particular subset of this functionality that would be easiest to start trying to integrate into |
At first glance: extracting the structure tree is relatively easy and can be done on-demand, as it's all in the document catalog. Linking it to the MCIDs might have more of a performance impact, at least, with |
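The on-demand idea above can be sketched in plain Python, assuming the structure tree has already been reduced to nested dicts. `LazyStructure`, `parse_tree`, and `tag_by_mcid` are hypothetical names for illustration only; nothing here reflects how pdfplumber actually implements this.

```python
from functools import cached_property
from typing import Any, Callable, Dict, List


class LazyStructure:
    """Hypothetical wrapper: parse the structure tree only when first requested."""

    def __init__(self, parse_tree: Callable[[], List[Dict[str, Any]]]):
        self._parse_tree = parse_tree  # assumed to be the expensive step
        self.parse_count = 0

    @cached_property
    def tree(self) -> List[Dict[str, Any]]:
        # Runs once; later accesses reuse the cached result.
        self.parse_count += 1
        return self._parse_tree()

    @cached_property
    def tag_by_mcid(self) -> Dict[int, str]:
        # MCIDs are only unique within a page, so a real index would key on
        # (page, mcid). This toy version assumes a single page.
        index: Dict[int, str] = {}
        stack = list(self.tree)
        while stack:
            node = stack.pop()
            for mcid in node.get("mcids", []):
                index[mcid] = node["type"]
            stack.extend(node.get("children", []))
        return index


def fake_parse() -> List[Dict[str, Any]]:
    # Stand-in for reading the tree out of the document catalog.
    return [
        {
            "type": "Document",
            "children": [
                {"type": "H1", "mcids": [0]},
                {"type": "P", "mcids": [1, 2]},
            ],
        }
    ]


doc = LazyStructure(fake_parse)
```

Nothing is parsed when `doc` is created; the tree and the MCID index are each built on first access, so users who never ask for structure pay nothing for it.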
Thanks! That sounds like a reasonable place to start. I suppose we could expose that similarly to how we do with |
The |
Actually this is quite easy. I should have a PR for you tonight or tomorrow, I hope. |
Ready for review, see PR above. I'll test it more on my PDFs of interest, but it is functional and somewhat documented, see |
Many thanks, @dhdaines, and a particular thanks for the documentation. It might take me a little while to review the PR, due to other workload and me being relatively new to the topic/feature, but on first glance, it seems like a helpful contribution. |
Thanks! There is at least one small add-on to consider - #961 doesn't give access to the tag attributes, only the tag name. These allow you to distinguish between different types of artifacts (header, footer, etc). I'm not sure if we want to add them as a dictionary-valued attribute for each object in a marked content section, as this could produce large outputs (it shouldn't be a huge problem for memory consumption since it's the same dictionary...) "Tagged PDF" is a fairly vaguely defined standard (or perhaps I just don't fully understand it yet) so there may be other things too. |
Thanks, @dhdaines. A couple of follow-up questions:
Could you share an example of what this would look like?
I agree with the general inclination here. Could we have it both ways and allow users to opt-in to this additional output? |
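One way the opt-in could look, as a rough sketch over plain dicts. The `annotate` function, the `include_attrs` flag, and the `"tag"`/`"attrs"` keys are all hypothetical, and the attribute values are sample data rather than output from a real PDF.

```python
from typing import Any, Dict, List, Optional, Tuple

Section = Tuple[str, Dict[str, Any]]  # (tag name, tag attributes)


def annotate(
    objects: List[Dict[str, Any]],
    sections: Dict[Optional[int], Section],
    include_attrs: bool = False,
) -> List[Dict[str, Any]]:
    """Copy each object, adding its tag name and, only on request, its attributes."""
    out = []
    for obj in objects:
        new = dict(obj)
        section = sections.get(obj.get("mcid"))
        if section is not None:
            tag, attrs = section
            new["tag"] = tag
            if include_attrs:
                # The same dict is shared by every object in the section,
                # not copied, so memory use stays small.
                new["attrs"] = attrs
        out.append(new)
    return out


# Sample data: two characters inside one artifact section, one untagged.
sections = {0: ("Artifact", {"Type": "Pagination", "Subtype": "Header"})}
objects = [
    {"text": "A", "mcid": 0},
    {"text": "B", "mcid": 0},
    {"text": "C", "mcid": None},
]

plain = annotate(objects, sections)
rich = annotate(objects, sections, include_attrs=True)
```

With the flag off, the output carries only the tag name; with it on, serialized output (for example JSON or CSV) would repeat the attributes per object even though they share one dictionary in memory.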
Hi there,
Was wondering if, when the dev is particularly bored, would you mind considering implementing extraction of accessibility tagging?
Thank you
Please describe, in as much detail as possible, your proposal and how it would improve your experience with pdfplumber.