Accessing Watermark XObject #1249
Replies: 2 comments 1 reply
-
Interesting example, thanks for sharing. It seems that this XObject draws the text on the page, which makes sense and explains why it can be retrieved via print(page.extract_text())
Or, if you prefer:
Unfortunately, there does not appear to be a way to tell that this text was created via an XObject rather than standard page content.
-
I tried using XObjects with resolve and pdfminer, but that was an endless rabbit hole. Not sure if it's the most efficient approach, but this is what I used:

    import re

    import pdfplumber
    from pdfminer.layout import LTChar, LTFigure, LTPage


    def check_draft(pdf: pdfplumber.pdf.PDF) -> bool:
        def figure_check(layout: LTPage) -> bool:
            # pdfminer exposes a Form XObject's contents as an LTFigure container.
            for obj in layout:
                if isinstance(obj, LTFigure):
                    str_check = ''
                    for inner_obj in obj:
                        if isinstance(inner_obj, LTChar):
                            str_check += inner_obj.get_text()
                    if str_check.lower() in ("draft", "estimate"):
                        return True
            print("Draft not found")
            return False

        def text_check(text: str) -> bool:
            # string_processor is not defined in this snippet (see the edit note below).
            text = string_processor(text, keep_spaces=False).replace("\n", "")
            if re.search(r'draft|estimate', text, re.IGNORECASE):
                print("Draft found")
                return True
            return False

        def annot_check(page: pdfplumber.page.Page) -> bool:
            for annot in page.annots:
                # Reset per annotation so the check below never sees an unbound or stale value.
                title_bytes = None
                if isinstance(annot.get("contents"), bytes):
                    try:
                        title_bytes = annot['contents'].decode('utf-8')
                    except UnicodeDecodeError:
                        try:
                            title_bytes = annot['contents'].decode('utf-16')
                        except UnicodeDecodeError:
                            title_bytes = None
                elif isinstance(annot['data'], dict):
                    annot_data = annot["data"]
                    if 'Watermark' in str(annot_data.get('Subtype', '')):
                        if isinstance(annot_data.get('V'), bytes):
                            try:
                                title_bytes = annot_data['V'].decode('utf-8')
                            except UnicodeDecodeError:
                                try:
                                    title_bytes = annot_data['V'].decode('utf-16')
                                except UnicodeDecodeError:
                                    title_bytes = None
                if title_bytes is not None:
                    if re.search(r'draft|estimate', title_bytes, re.IGNORECASE):
                        print("Draft found")
                        return True
            return False

        # Try plain text first, then figures (XObjects), then annotations.
        for page in pdf.pages:
            if text_check(page.extract_text(use_text_flow=True)):
                return True
            if text_check(page.extract_text()):
                return True
            layout = page.layout
            if figure_check(layout):
                return True
            if annot_check(page):
                return True
        return False

Edit: I chopped off half of the method; I did this a while ago.
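For anyone adapting this, here is a minimal stdlib-only sketch of the two small pieces the function leans on: the byte-decoding fallback and the keyword match. string_processor is not shown in the post, so squash below is only a guess at what keep_spaces=False might do; decode_pdf_bytes, squash, and has_watermark_word are hypothetical names, not part of pdfplumber or pdfminer.

```python
import re
from typing import Optional

# Same two keywords the post searches for, as a single case-insensitive pattern.
WATERMARK_WORDS = re.compile(r"draft|estimate", re.IGNORECASE)


def decode_pdf_bytes(raw: bytes) -> Optional[str]:
    """Try UTF-8 first, then UTF-16, mirroring the fallback order in the post."""
    for encoding in ("utf-8", "utf-16"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


def squash(text: str) -> str:
    """Assumed stand-in for string_processor: drop all whitespace."""
    return "".join(text.split())


def has_watermark_word(text: Optional[str]) -> bool:
    """Return True if the whitespace-stripped text contains a watermark keyword."""
    if not text:
        return False
    return WATERMARK_WORDS.search(squash(text)) is not None


if __name__ == "__main__":
    raw = b"\xfe\xff\x00D\x00R\x00A\x00F\x00T"  # UTF-16 with a byte-order mark
    print(has_watermark_word(decode_pdf_bytes(raw)))
    print(has_watermark_word("D R A F T"))
```

Stripping whitespace lets spaced-out watermark letters match, but it can also join neighbouring words into a false hit, so it may be worth keeping a stricter check alongside it.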
-
Thank you for your amazing work with pdfplumber! It has made it a breeze to work with PDFs. I am having trouble accessing the watermark of the attached PDF. It is technically a Form XObject, and I do not see it in any of the extracted text/lines/words functions. I've tried to access it through doc.catalog and page.getobj() with resolve, but with no success.
test_file.pdf
Here is some of the code I tried but I used the debug console and debug tools to explore:
I am able to access it through pdfminer's layout through the LTFigure object:
I can import the necessary pdfminer modules but I was wondering if there is any way to do this natively with pdfplumber?