Add strict=True/False to .crop/within_bbox(...)
See #421
jsvine committed Jul 20, 2022
1 parent 28330da commit 71ad60f
Showing 4 changed files with 27 additions and 8 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file. The format

 ## [0.7.4] - [unreleased]

+### Added
+
+- Add `strict=True/False` parameter to `Page.crop(...)` and `Page.within_bbox(...)`; default is `True`, while `False` bypasses the `test_proposed_bbox(...)` check. ([#421](https://github.com/jsvine/pdfplumber/issues/421))
+
 ### Fixed

 - Fix `PageImage` conversions for PDFs with `cmyk` colorspaces; convert them to `rgb` earlier in the process.
4 changes: 2 additions & 2 deletions README.md
@@ -105,8 +105,8 @@ The `pdfplumber.Page` class is at the core of `pdfplumber`. Most things you'll d

 | Method | Description |
 |--------|-------------|
-|`.crop(bounding_box, relative=False)`| Returns a version of the page cropped to the bounding box, which should be expressed as 4-tuple with the values `(x0, top, x1, bottom)`. Cropped pages retain objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box. If `relative=True`, the bounding box is calculated as an offset from the top-left of the page's bounding box, rather than an absolute positioning. (See [Issue #245](https://github.com/jsvine/pdfplumber/issues/245) for a visual example and explanation.)|
-|`.within_bbox(bounding_box, relative=False)`| Similar to `.crop`, but only retains objects that fall *entirely* within the bounding box.|
+|`.crop(bounding_box, relative=False, strict=True)`| Returns a version of the page cropped to the bounding box, which should be expressed as 4-tuple with the values `(x0, top, x1, bottom)`. Cropped pages retain objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box. If `relative=True`, the bounding box is calculated as an offset from the top-left of the page's bounding box, rather than an absolute positioning. (See [Issue #245](https://github.com/jsvine/pdfplumber/issues/245) for a visual example and explanation.) When `strict=True` (the default), the crop's bounding box must fall entirely within the page's bounding box.|
+|`.within_bbox(bounding_box, relative=False, strict=True)`| Similar to `.crop`, but only retains objects that fall *entirely* within the bounding box.|
 |`.filter(test_function)`| Returns a version of the page with only the `.objects` for which `test_function(obj)` returns `True`.|
 |`.dedupe_chars(tolerance=1)`| Returns a version of the page with duplicate chars — those sharing the same text, fontname, size, and positioning (within `tolerance` x/y) as other characters — removed. (See [Issue #71](https://github.com/jsvine/pdfplumber/issues/71) to understand the motivation.)|
 |`.extract_text(x_tolerance=3, y_tolerance=3, layout=False, x_density=7.25, y_density=13, **kwargs)`| Collates all of the page's character objects into a single string.<ul><li><p>When `layout=False`: Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.</p></li><li><p>When `layout=True` (*experimental feature*): Attempts to mimic the structural layout of the text on the page(s), using `x_density` and `y_density` to determine the minimum number of characters/newlines per "point," the PDF unit of measurement. All remaining `**kwargs` are passed to `.extract_words(...)` (see below), the first step in calculating the layout.</p></li></ul>|
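Reviewer note: the README rows above describe three behaviours (relative offsets, slicing in `.crop`, whole-object retention in `.within_bbox`). The sketch below illustrates them on bare `(x0, top, x1, bottom)` tuples. It is a simplified illustration, not pdfplumber's code: `to_absolute`, `clip`, and `fully_within` are hypothetical names, and the handling of zero-size objects is an assumption.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def to_absolute(bbox: BBox, page_bbox: BBox) -> BBox:
    """Interpret bbox as an offset from the page's top-left corner (relative=True)."""
    o_x0, o_top, _, _ = page_bbox
    x0, top, x1, bottom = bbox
    return (x0 + o_x0, top + o_top, x1 + o_x0, bottom + o_top)


def clip(obj: BBox, bbox: BBox) -> Optional[BBox]:
    """.crop-style: return the part of obj inside bbox, or None if they do not overlap."""
    x0, top = max(obj[0], bbox[0]), max(obj[1], bbox[1])
    x1, bottom = min(obj[2], bbox[2]), min(obj[3], bbox[3])
    if x1 < x0 or bottom < top:
        return None
    return (x0, top, x1, bottom)


def fully_within(obj: BBox, bbox: BBox) -> bool:
    """.within_bbox-style: True only if obj lies entirely inside bbox."""
    return (
        obj[0] >= bbox[0]
        and obj[1] >= bbox[1]
        and obj[2] <= bbox[2]
        and obj[3] <= bbox[3]
    )


# A page whose own bounding box does not start at the origin.
page_bbox = (10, 20, 110, 220)
crop_bbox = to_absolute((0, 0, 50, 50), page_bbox)

straddling = (40, 40, 80, 60)  # pokes out of the right edge of crop_bbox
inside = (15, 25, 30, 40)      # entirely inside crop_bbox

sliced = clip(straddling, crop_bbox)
```

Under these assumptions, the straddling object is kept but sliced by the `.crop`-style function, while the `.within_bbox`-style predicate rejects it outright.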
21 changes: 15 additions & 6 deletions pdfplumber/page.py
@@ -326,14 +326,20 @@ def extract_text(self, **kwargs: Any) -> str:
     def extract_words(self, **kwargs: Any) -> T_obj_list:
         return utils.extract_words(self.chars, **kwargs)

-    def crop(self, bbox: T_bbox, relative: bool = False) -> "CroppedPage":
-        return CroppedPage(self, bbox, relative=relative)
-
-    def within_bbox(self, bbox: T_bbox, relative: bool = False) -> "CroppedPage":
+    def crop(
+        self, bbox: T_bbox, relative: bool = False, strict: bool = True
+    ) -> "CroppedPage":
+        return CroppedPage(self, bbox, relative=relative, strict=strict)
+
+    def within_bbox(
+        self, bbox: T_bbox, relative: bool = False, strict: bool = True
+    ) -> "CroppedPage":
         """
         Same as .crop, except only includes objects fully within the bbox
         """
-        return CroppedPage(self, bbox, relative=relative, crop_fn=utils.within_bbox)
+        return CroppedPage(
+            self, bbox, relative=relative, strict=strict, crop_fn=utils.within_bbox
+        )

     def filter(self, test_function: Callable[[T_obj], bool]) -> "FilteredPage":
         return FilteredPage(self, test_function)
@@ -422,6 +428,7 @@ def __init__(
         bbox: T_bbox,
         crop_fn: Callable[[T_obj_list, T_bbox], T_obj_list] = utils.crop_to_bbox,
         relative: bool = False,
+        strict: bool = True,
     ):
         if relative:
             o_x0, o_top, _, _ = parent_page.bbox
@@ -430,7 +437,9 @@ def __init__(
         else:
             self.bbox = bbox

-        test_proposed_bbox(self.bbox, parent_page.bbox)
+        if strict:
+            test_proposed_bbox(self.bbox, parent_page.bbox)
+
         self.crop_fn = crop_fn
         super().__init__(parent_page)

6 changes: 6 additions & 0 deletions tests/test_basics.py
@@ -116,6 +116,12 @@ def test_invalid_crops(self):
         with pytest.raises(ValueError):
             bottom.crop((0.5 * float(bottom.width), 0, bottom.width, bottom.height))

+        # via issue #421, testing strict=True/False
+        with pytest.raises(ValueError):
+            page.crop((0, 0, page.width + 10, page.height + 10))
+
+        page.crop((0, 0, page.width + 10, page.height + 10), strict=False)
+
     def test_rotation(self):
         assert self.pdf.pages[0].width == 1008
         assert self.pdf.pages[0].height == 612
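Reviewer note: the behaviour exercised by the new test can be sketched without a PDF. The snippet below is a minimal stand-in, assuming the check amounts to "the proposed box must lie entirely inside the parent box", as the README change states. `check_bbox_within` and `resolve_crop_bbox` are hypothetical names, not pdfplumber functions, and the real `test_proposed_bbox(...)` may perform further checks that this diff does not show.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def check_bbox_within(bbox: BBox, parent_bbox: BBox) -> None:
    """Hypothetical stand-in for the containment check; raises ValueError on failure."""
    x0, top, x1, bottom = bbox
    p_x0, p_top, p_x1, p_bottom = parent_bbox
    if x0 < p_x0 or top < p_top or x1 > p_x1 or bottom > p_bottom:
        raise ValueError(
            f"Bounding box {bbox} is not fully within parent bounding box {parent_bbox}"
        )


def resolve_crop_bbox(bbox: BBox, parent_bbox: BBox, strict: bool = True) -> BBox:
    """Mirror the gate added in this commit: validate only when strict is True."""
    if strict:
        check_bbox_within(bbox, parent_bbox)
    return bbox


page_bbox = (0, 0, 612, 792)
oversized = (0, 0, 612 + 10, 792 + 10)  # same shape as the box used in the new test

# strict=True (the default) rejects a box that extends past the page.
try:
    resolve_crop_bbox(oversized, page_bbox)
    strict_raised = False
except ValueError:
    strict_raised = True

# strict=False skips the check and accepts the box unchanged.
lenient_result = resolve_crop_bbox(oversized, page_bbox, strict=False)
```

With the library itself, the equivalent calls are the ones in the test above: `page.crop(bbox)` raises `ValueError` for an oversized box, while `page.crop(bbox, strict=False)` does not.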
