Filter out invisible text rendered with Tr(3) #1230
Comments
Hi @svaaraniemi, and great question. It seems that doing so would require

    def do_Tr(self, render: PDFStackT) -> None:
        """Set the text rendering mode"""
        self.textstate.render = cast(int, render)

... to
Thanks for confirming what I suspected. Perhaps this way of placing hidden text in a PDF is not common enough to have caught the pdfminer.six team's attention.
Thanks regardless, @svaaraniemi. Wondering if @dhdaines has any thoughts on this re. PLAYA?
Hi! The rendering mode is parsed by This is why people say that implementation inheritance is considered harmful ;-) In the PLAYA branch it would be simple, just add
I should mention while I'm here that there are some other common ways of hiding text that aren't supported by
(you can try the PR linked above as I have added
Some PDFs contain text rendered in the invisible text rendering mode, Tr(3). Note that the stroke_color of such text is usually the same as that of normal text, so color alone does not distinguish it. Can such text be filtered out of the extracted text?
The attached PDF is one page from a Texas regulation (public domain) that contains examples of such text. For example, the pdfplumber page.extract_text_lines() method extracts the 6th line like so:
"text": "Sec. 545.001. AA DEFINITIONS. AA In this chapter:",
Here the two instances of 'AA' are invisible text.
Chapter_545-p1.pdf
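To make the request concrete, here is a minimal sketch of the idea in plain Python, with no pdfminer.six or pdfplumber involved. It tracks the text rendering mode while walking a simplified content stream and drops any string shown while the mode is 3. The sample stream is invented for illustration and is not taken from the attached PDF, and the names visible_text and SAMPLE are hypothetical. Real content streams also use TJ arrays, escaped and hex strings, compression, and q/Q state saving, none of which this toy handles.

```python
import re

# Toy tokenizer for a simplified content stream: either a literal string in
# parentheses (no nesting, no escapes) or a whitespace-delimited token.
TOKEN = re.compile(r"\((?P<string>[^()]*)\)|(?P<word>[^\s()]+)")
NUMBER = re.compile(r"[+-]?\d*\.?\d+")

INVISIBLE = 3  # text rendering mode 3: neither fill nor stroke the glyphs

# Invented sample stream (an assumption, not extracted from the attached PDF).
SAMPLE = (
    "BT /F1 12 Tf 0 Tr (Sec. 545.001.) Tj 3 Tr (AA) Tj "
    "0 Tr (DEFINITIONS.) Tj 3 Tr (AA) Tj 0 Tr (In this chapter:) Tj ET"
)


def visible_text(stream: str) -> list[str]:
    """Return the strings shown by Tj while the rendering mode is not 3."""
    render_mode = 0  # the default text rendering mode is 0 (fill)
    operands: list[str] = []
    shown: list[str] = []
    for match in TOKEN.finditer(stream):
        if match.group("string") is not None:
            operands.append(match.group("string"))
            continue
        word = match.group("word")
        if word.startswith("/") or NUMBER.fullmatch(word):
            operands.append(word)  # names and numbers are operands
            continue
        # Anything else is treated as an operator that consumes the operands.
        if word == "Tr" and operands:
            render_mode = int(operands[-1])
        elif word == "Tj" and operands and render_mode != INVISIBLE:
            shown.append(operands[-1])
        operands.clear()
    return shown


print(" ".join(visible_text(SAMPLE)))
```

The point of the sketch is only that the rendering mode is state set by Tr and consulted when text is shown, so a filter needs that state to be recorded per character at extraction time.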