Accessing Watermark XObject #1249

webdjoe · 2025-01-13T14:40:58Z

Thank you for your amazing work with pdfplumber! It has made it a breeze to work with PDFs. I am having trouble accessing the watermark of the attached PDF. It is technically a Form XObject and I do not see it in any of the extracted text/lines/words functions. I've tried to access through the doc.catalog and page.getobj() with resolve but no success.

test_file.pdf

Here is some of the code I tried but I used the debug console and debug tools to explore:

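from pathlib import Path

import pdfplumber
from pdfplumber.utils import resolve  # assumed source of `resolve`; the post does not show its import

test_file = Path('test_dir/test_file.pdf')
pdf = pdfplumber.open(test_file)
doc = pdf.doc  # the underlying pdfminer document
xref = doc.xrefs[0]
xref_obj = resolve(xref)

obj = doc.getobj(18)  # Found object ID in PDF editor
resolve(obj)  # PDF Stream
resolve(obj.data)  # Prints strange string that I'm assuming is part of the PDF spec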

I am able to access it through pdfminer's layout analysis, via the LTFigure object:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTFigure

# PDF path
pdf_path = "test_file.pdf"

for page_layout in extract_pages(pdf_path):
    for element in page_layout:
        if isinstance(element, LTFigure):
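            print("Found LTFigure:")
            print(f"  Coordinates: {element.bbox}")
            print(f"  Type: {type(element)}")
            print("  Elements inside the figure:")
            for sub_element in element:
                # Not every layout object has get_text() (e.g. LTCurve, LTImage)
                text = sub_element.get_text() if hasattr(sub_element, "get_text") else ""
                print(f"    - {type(sub_element)}: {text}")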

I can import the necessary pdfminer modules but I was wondering if there is any way to do this natively with pdfplumber?

jsvine · 2025-02-11T03:41:34Z

Maintainer

Interesting example, thanks for sharing. It seems that this XObject draws the text on the page:

... which makes sense, and explains why it is gettable via pdfminer.six's text extraction. Similarly, you can get this text via pdfplumber:

print(page.extract_text())

K
R
A
M
R
E
T
A
W

Or, if you prefer:

print(page.extract_text(line_dir="btt", line_dir_render="ttb"))

W
A
T
E
R
M
A
R
K

There does not, unfortunately, appear to be a way to distinguish that this text was created via an XObject rather than standard content.
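That said, a rough heuristic may help in cases like this one. The sketch below does not detect XObjects at all; it only assumes that the watermark is rotated, so that its entries in `page.chars` carry `upright == False`, and that sorting by `top` recovers the reading order. The function name `non_upright_text` and its `bottom_to_top` parameter are made up for illustration.

```python
from typing import Any, Dict, List


def non_upright_text(chars: List[Dict[str, Any]], bottom_to_top: bool = True) -> str:
    """Join the text of characters whose `upright` flag is false.

    `chars` is expected to look like pdfplumber's `page.chars`: a list of
    dicts with at least "text", "upright" and "top" keys.
    """
    rotated = [c for c in chars if not c.get("upright", True)]
    # `top` is measured from the top of the page, so a larger value sits
    # lower on the page; reverse=True therefore reads bottom to top.
    rotated.sort(key=lambda c: c["top"], reverse=bottom_to_top)
    return "".join(c["text"] for c in rotated)


# Possible usage with pdfplumber (not executed here):
#   with pdfplumber.open("test_file.pdf") as pdf:
#       print(non_upright_text(pdf.pages[0].chars))
```

If the watermark is upright, or the page contains other rotated text, this filter will miss it or over-match, so it is best treated as a starting point rather than a reliable detector.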

webdjoe · 2025-02-11T04:28:55Z

Author

I tried using XObjects with resolve and pdfminer, but that was an endless rabbit hole. I'm not sure it's the most efficient approach, but I used the layout._objs property:

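import re

import pdfplumber
from pdfminer.layout import LTChar, LTFigure, LTPage

# Note: string_processor (used in text_check) is not defined in this snippet.

def check_draft(pdf: pdfplumber.pdf.PDF):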
    def figure_check(layout: LTPage):
        for obj in layout._objs:
            if isinstance(obj, LTFigure):
                str_check = ''
                for inner_obj in obj._objs:
                    if isinstance(inner_obj, LTChar):
                        str_check += inner_obj.get_text()
                if str_check.lower() == "draft" or str_check.lower() == "estimate":
                    return True
        print("Draft not found")
        return False

    def text_check(text: str) -> bool:
        text = string_processor(text, keep_spaces=False).replace("\n", "")
        if re.search(r'draft', text, re.IGNORECASE) or re.search(r'estimate', text, re.IGNORECASE):
            print("Draft found")
            return True
        return False

    def annot_check(page: pdfplumber.page.Page):
        annots = page.annots
        for annot in annots:
            # Reset for every annotation so a value from an earlier one is never reused
            title_bytes = None
            if isinstance(annot.get("contents"), bytes):
                try:
                    title_bytes = annot['contents'].decode('utf-8')
                except UnicodeDecodeError:
                    try:
                        title_bytes = annot['contents'].decode('utf-16')
                    except UnicodeDecodeError:
                        title_bytes = None
            elif isinstance(annot.get('data'), dict):
                annot_data = annot["data"]
                if 'Watermark' in str(annot_data.get('Subtype', '')):
                    if isinstance(annot_data.get('V'), bytes):
                        try:
                            title_bytes = annot_data['V'].decode('utf-8')
                        except UnicodeDecodeError:
                            try:
                                title_bytes = annot_data['V'].decode('utf-16')
                            except UnicodeDecodeError:
                                title_bytes = None
            # Check the decoded text from either branch
            if title_bytes is not None:
                if re.search(r'draft|estimate', title_bytes, re.IGNORECASE):
                    print("Draft found")
                    return True
        return False

    for page in pdf.pages:
        if text_check(page.extract_text(use_text_flow=True)):
            return True
        if text_check(page.extract_text()):
            return True
        layout = page.layout
        if figure_check(layout):
            return True
        if annot_check(page):
            return True
    return False

edit: I chopped off half of the method, I did this a while ago
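The UTF-8-then-UTF-16 fallback appears twice in annot_check above, so it could be factored into a small helper. The following is only a sketch of that idea: the names `decode_annot_bytes`, `mentions_draft`, and `KEYWORDS` are hypothetical and not part of pdfplumber or of the original method.

```python
import re
from typing import Optional

# Same keywords the thread's check looks for, matched case-insensitively.
KEYWORDS = re.compile(r"draft|estimate", re.IGNORECASE)


def decode_annot_bytes(raw: bytes) -> Optional[str]:
    """Try UTF-8 first, then UTF-16; return None if neither decodes."""
    for encoding in ("utf-8", "utf-16"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


def mentions_draft(raw: bytes) -> bool:
    """True if the decoded bytes contain one of the keywords."""
    text = decode_annot_bytes(raw)
    return text is not None and KEYWORDS.search(text) is not None
```

One caveat: any byte string that happens to be valid UTF-8 is returned from the first attempt, so UTF-16 data without a byte-order mark may decode to unexpected text rather than fall through to the second attempt.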

1 reply

jsvine Feb 12, 2025
Maintainer

Thanks for sharing. FWIW, the LTChars within the LTFigure layout objects should be fundamentally the same objects as you get through pdfplumber's page.chars. The only distinction is that LTFigure groups those characters via pdfminer.six's own layout analysis.
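Following from that, the keyword check could in principle run on `page.chars` directly, without going through the layout tree. The sketch below assumes the characters arrive in roughly reading order (or its reverse, as with the bottom-to-top watermark earlier in the thread) and that each entry has a "text" key. The helper name `chars_mention` is invented for this example.

```python
from typing import Any, Dict, Iterable, Sequence


def chars_mention(
    chars: Iterable[Dict[str, Any]],
    keywords: Sequence[str] = ("draft", "estimate"),
) -> bool:
    """Check whether the concatenated character text contains a keyword.

    Whitespace is dropped and case is ignored; the reversed string is also
    checked, to cover text whose characters are stored back to front.
    """
    joined = "".join(c["text"] for c in chars)
    compact = "".join(joined.split()).lower()
    backwards = compact[::-1]
    return any(k.lower() in compact or k.lower() in backwards for k in keywords)


# Possible usage with pdfplumber (not executed here):
#   with pdfplumber.open("test_file.pdf") as pdf:
#       print(any(chars_mention(page.chars) for page in pdf.pages))
```

Because this ignores position entirely, it can match a keyword that only appears by accident across unrelated neighbouring characters, so it trades precision for simplicity.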
