ENH: With extract-annotated-pages command #98

6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,11 @@
# CHANGELOG

## Version 0.5.0, not released yet

### New Features (ENH)
- New `extract-annotated-pages` command to extract only the user-annotated pages ([PR #98](https://github.com/py-pdf/pdfly/pull/98))


## Version 0.4.0, 2024-12-08

### New Features (ENH)
49 changes: 25 additions & 24 deletions README.md
@@ -33,23 +33,24 @@ $ pdfly --help

pdfly is a pure-python cli application for manipulating PDF files.

-╭─ Options ───────────────────────────────────────────────────────────────────╮
-│ --version │
-│ --help Show this message and exit. │
-╰─────────────────────────────────────────────────────────────────────────────╯
-╭─ Commands ──────────────────────────────────────────────────────────────────╮
-│ 2-up Create a booklet-style PDF from a single input. │
-│ cat Concatenate pages from PDF files into a single PDF file. │
-│ compress Compress a PDF. │
-| uncompress Uncompresses a PDF. │
-│ extract-images Extract images from PDF without resampling or altering. │
-│ extract-text Extract text from a PDF file. │
-│ meta Show metadata of a PDF file │
-│ pagemeta Give details about a single page. │
-│ rm Remove pages from PDF files. │
-│ update-offsets Updates offsets and lengths in a simple PDF file. │
-│ x2pdf Convert one or more files to PDF. Each file is a page. │
-╰─────────────────────────────────────────────────────────────────────────────╯
+╭─ Options ────────────────────────────────────────────────────────────────────────────╮
+│ --version │
+│ --help Show this message and exit. │
+╰──────────────────────────────────────────────────────────────────────────────────────╯
+╭─ Commands ───────────────────────────────────────────────────────────────────────────╮
+│ 2-up Create a booklet-style PDF from a single input. │
+│ cat Concatenate pages from PDF files into a single PDF file. │
+│ compress Compress a PDF. │
+│ uncompress Uncompresses a PDF. │
+│ extract-annotated-pages Extract only the annotated pages from a PDF. │
+│ extract-images Extract images from PDF without resampling or altering. │
+│ extract-text Extract text from a PDF file. │
+│ meta Show metadata of a PDF file │
+│ pagemeta Give details about a single page. │
+│ rm Remove pages from PDF files. │
+│ update-offsets Updates offsets and lengths in a simple PDF file. │
+│ x2pdf Convert one or more files to PDF. Each file is a page. │
+╰──────────────────────────────────────────────────────────────────────────────────────╯
```

You can see the help of every subcommand by typing `--help`:
@@ -63,13 +64,13 @@ $ pdfly 2-up --help
Pairs of two pages will be put on one page (left and right)
usage: python 2-up.py input_file output_file

-╭─ Arguments ─────────────────────────────────────────────────────────────────╮
-│ * pdf PATH [default: None] [required] │
-│ * out PATH [default: None] [required] │
-╰─────────────────────────────────────────────────────────────────────────────╯
-╭─ Options ───────────────────────────────────────────────────────────────────╮
-│ --help Show this message and exit. │
-╰─────────────────────────────────────────────────────────────────────────────╯
+╭─ Arguments ──────────────────────────────────────────────────────────────────────────╮
+│ * pdf PATH [default: None] [required] │
+│ * out PATH [default: None] [required] │
+╰──────────────────────────────────────────────────────────────────────────────────────╯
+╭─ Options ────────────────────────────────────────────────────────────────────────────╮
+│ --help Show this message and exit. │
+╰──────────────────────────────────────────────────────────────────────────────────────╯
```

## Contributors ✨
25 changes: 25 additions & 0 deletions pdfly/cli.py
@@ -13,6 +13,7 @@
import pdfly.booklet
import pdfly.cat
import pdfly.compress
import pdfly.extract_annotated_pages
import pdfly.extract_images
import pdfly.metadata
import pdfly.pagemeta
@@ -321,3 +322,27 @@ def x2pdf(
    exit_code = pdfly.x2pdf.main(x, output)
    if exit_code:
        raise typer.Exit(code=exit_code)


@entry_point.command(name="extract-annotated-pages", help=pdfly.extract_annotated_pages.__doc__)  # type: ignore[misc]
def extract_annotated_pages(
    input_pdf: Annotated[
        Path,
        typer.Argument(
            dir_okay=False,
            exists=True,
            resolve_path=True,
            help="Input PDF file.",
        ),
    ],
    output_pdf: Annotated[
        Optional[Path],
        typer.Option(
            "--output",
            "-o",
            writable=True,
            help="Output PDF file. Defaults to '<input name>_annotated.pdf' next to the input file.",
        ),
    ] = None,
) -> None:
    pdfly.extract_annotated_pages.main(input_pdf, output_pdf)
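
When `--output` is omitted, the new module derives the output name from the input path. The following is a minimal standalone sketch of that fallback using only `pathlib`; the helper name `resolve_output_path` is hypothetical and is not part of this PR.

```python
from pathlib import Path
from typing import Optional


def resolve_output_path(input_pdf: Path, output_pdf: Optional[Path] = None) -> Path:
    """Return the explicit output path, or '<stem>_annotated.pdf' beside the input.

    Hypothetical helper for illustration; the PR performs this inline in main().
    """
    if output_pdf is not None:
        return output_pdf
    # with_name() replaces only the final path component, so the directory is kept.
    return input_pdf.with_name(input_pdf.stem + "_annotated.pdf")


default_path = resolve_output_path(Path("docs/report.pdf"))
explicit_path = resolve_output_path(Path("docs/report.pdf"), Path("out.pdf"))
print(default_path.as_posix())   # docs/report_annotated.pdf
print(explicit_path.as_posix())  # out.pdf
```

Because the CLI argument is declared with `resolve_path=True`, the default output lands in the directory of the input file rather than in the current working directory.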
35 changes: 35 additions & 0 deletions pdfly/extract_annotated_pages.py
@@ -0,0 +1,35 @@
"""
Extract only the annotated pages from a PDF.

Q: Why does this help?
A: https://github.com/py-pdf/pdfly/issues/97
"""

from pathlib import Path
from typing import Any, Optional

from pypdf import PdfReader, PdfWriter


def is_manipulable(annot: Any) -> bool:
    """Return True for any annotation that is not a hyperlink (/Link)."""
    # /Annots entries are usually indirect references; resolve them first.
    return annot.get_object().get("/Subtype") not in ["/Link"]


def main(input_pdf: Path, output_pdf: Optional[Path]) -> None:
    """Copy the pages that carry non-link annotations into a new PDF."""
    if not output_pdf:
        output_pdf = input_pdf.with_name(input_pdf.stem + "_annotated.pdf")
    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    output_pages = 0
    # Copy only the pages with annotations
    for page in reader.pages:
        if "/Annots" not in page:
            continue
        if not any(is_manipulable(annot) for annot in page["/Annots"]):
            continue
        writer.add_page(page)
        output_pages += 1
    # Save the output PDF
    writer.write(output_pdf)
    print(f"Extracted {output_pages} pages with annotations to {output_pdf}")
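
The page filter above can be illustrated without pypdf. This is a rough sketch under stated assumptions: plain dictionaries stand in for PDF page and annotation objects, and the names `has_user_annotation`, `pages` and `kept` are hypothetical. It only mirrors the selection rule (keep a page if at least one annotation is not a `/Link`), not the PDF handling.

```python
from typing import Any, Dict, List

# Assumption: a page is modelled as a plain dict, not a pypdf PageObject.
Page = Dict[str, Any]


def has_user_annotation(page: Page) -> bool:
    """True if the page holds at least one annotation whose subtype is not /Link."""
    annots: List[Dict[str, str]] = page.get("/Annots", [])
    return any(annot.get("/Subtype") != "/Link" for annot in annots)


pages: List[Page] = [
    {"name": "cover"},                                   # no /Annots entry at all
    {"name": "toc", "/Annots": [{"/Subtype": "/Link"}]},  # hyperlinks only
    {"name": "notes", "/Annots": [{"/Subtype": "/Link"}, {"/Subtype": "/Highlight"}]},
    {"name": "empty", "/Annots": []},                    # empty annotation array
]

kept = [page["name"] for page in pages if has_user_annotation(page)]
print(kept)  # ['notes']
```

Pages with only hyperlinks are skipped, which matches the intent of excluding `/Link` annotations that a PDF generator adds automatically.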
Binary file modified resources/input8.pdf
Binary file not shown.
14 changes: 14 additions & 0 deletions tests/test_extract_annotated_pages.py
@@ -0,0 +1,14 @@
from .conftest import RESOURCES_ROOT, chdir, run_cli


def test_extract_annotated_pages_input8(capsys, tmp_path):
    with chdir(tmp_path):
        run_cli(
            [
                "extract-annotated-pages",
                str(RESOURCES_ROOT / "input8.pdf"),
            ]
        )
    captured = capsys.readouterr()
    assert not captured.err
    assert "Extracted 1 pages with annotations" in captured.out