Feature proposition: list text objects, their size and location found in a pdf file #92

Open
d-ph opened this issue Aug 12, 2024 · 2 comments

d-ph commented Aug 12, 2024

Hello,

Similar to how cpdf can list images with the -image-resolution operation, would it be possible to add a cpdf operation that lists the text objects found in a PDF (most importantly, their size and location)?

One caveat: text that has been converted to vector outlines would not be detected by the new operation, which is understandable.

Regards.

@johnwhitington
Contributor

There are two tasks here:

  1. Parse PDF page content to locate objects on the page; and
  2. Do PDF text extraction.

The first will be coming soon. The second will happen, but only for well-behaved modern PDFs. I don't want to get into the full field of PDF text extraction - it's a complex thing.
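
To illustrate the first task only, here is a minimal Python sketch of locating text-showing operations in a page content stream. It is not how cpdf does or will do this; the function name `list_text_objects` and the sample stream are hypothetical. It assumes the stream is already decompressed, ignores the graphics-state CTM and any scale or rotation in `Tm`, handles only `BT`, `Tf`, `Td`, `Tm` and `Tj`, and reads only simple literal strings. The string operands are returned as raw font-encoded bytes, which is exactly where the second task (real text extraction) would begin.

```python
import re

NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)"
# Tokens: literal strings "(...)", names "/F1", numbers, operator keywords.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)|/[^\s/()\[\]<>]+|" + NUMBER + r"|[A-Za-z'\"*]+"
)

def list_text_objects(content):
    """Return (x, y, font_size, raw_string) for each Tj operator.

    Hypothetical helper. Assumes an uncompressed content stream, an
    identity CTM, and no scaling or rotation in the text matrix.
    """
    results = []
    operands = []
    font_size = 0.0
    x = y = 0.0  # translation part of the text line matrix
    for tok in TOKEN.findall(content):
        if tok[0] in "(/" or re.fullmatch(NUMBER, tok):
            operands.append(tok)  # collect operands until an operator arrives
            continue
        if tok == "BT":
            x = y = 0.0  # a new text object resets the text matrix
        elif tok == "Tf" and len(operands) >= 2:
            font_size = float(operands[-1])
        elif tok == "Td" and len(operands) >= 2:
            x += float(operands[-2])  # Td moves relative to the line start
            y += float(operands[-1])
        elif tok == "Tm" and len(operands) >= 6:
            # Only the translation (e, f) is kept; a, b, c, d are ignored.
            x, y = float(operands[-2]), float(operands[-1])
        elif tok == "Tj" and operands:
            results.append((x, y, font_size, operands[-1][1:-1]))
        operands = []  # every operator consumes its operands
    return results

sample = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (World) Tj ET"
for item in list_text_objects(sample):
    print(item)
```

Even this toy version shows why the two tasks are separate: finding where a string is drawn needs only operator tracking, while turning the string into Unicode needs the font's encoding and is a much larger problem.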

d-ph commented Aug 23, 2024

Understood and fair. Thanks for the information and explanation 👍
