Normalize color representation (#917)
This commit normalizes the type representation of `stroking_color` and
`non_stroking_color` values. Thanks to @dhdaines for pointing out this
inconsistency.

Previously, `pdfplumber` passed along `pdfminer.six`'s colors without
normalization. Due to quirks in `pdfminer.six`'s color handling, this
meant that those values could be floats, ints, lists, or tuples. This
commit normalizes all color values (when non-None) into n-tuples, where
(val,) represents grayscale colors, (val, val, val) represents RGB, and
(val, val, val, val) represents CMYK colors.
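
As a rough illustration of the rule described above, the following standalone sketch mirrors the `normalize_color` helper that this commit adds to `pdfplumber/page.py`. The `Color` alias and the sample inputs are assumptions made for illustration; they are not taken from the library or from a real PDF.

```python
from typing import Any, Optional, Tuple, Union

# Hypothetical alias for readability; not part of pdfplumber.
Color = Tuple[Union[float, int], ...]


def normalize_color(color: Any) -> Optional[Color]:
    """Leave None untouched; coerce every other value to a tuple."""
    if color is None:
        return None
    # Lists and tuples become tuples; a bare number becomes a 1-tuple.
    return tuple(color) if isinstance(color, (tuple, list)) else (color,)


# Illustrative inputs covering the shapes named in the commit message.
samples = [None, 0, 0.5, [1, 0, 0], (0, 0, 0, 1)]
normalized = [normalize_color(c) for c in samples]
print(normalized)
```

Under this rule the tuple length is the only hint about the color space, so downstream code can branch on `len(color)` once it has checked for `None`.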

This should solve the consistency issue, although it might break code
that filters for non-tuple values (e.g., `[c for c in page.chars if
c["non_stroking_color"] == [1, 0, 0]]`). Although breaking changes are
unpleasant, I think the tradeoff for longer-term consistency is worth it.
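
Because a Python list never compares equal to a tuple, a filter written against list values silently stops matching after this change. The sketch below shows one possible way to write such a filter so that it accepts both shapes. The `chars` entries are hand-written stand-ins for `page.chars` items, and `is_red` is a hypothetical helper name, not part of `pdfplumber`.

```python
# Hand-written stand-ins for entries of page.chars (assumed data).
chars = [
    {"text": "a", "non_stroking_color": (1, 0, 0)},  # shape after this commit
    {"text": "b", "non_stroking_color": [1, 0, 0]},  # shape before this commit
    {"text": "c", "non_stroking_color": (0,)},       # grayscale
    {"text": "d", "non_stroking_color": None},       # no color set
]

RED = (1, 0, 0)


def is_red(char: dict) -> bool:
    """Match RGB red whether the color arrived as a list or a tuple."""
    color = char.get("non_stroking_color")
    if isinstance(color, (list, tuple)):
        return tuple(color) == RED
    # None, or a bare number from older versions, is never RGB red.
    return False


# The old-style comparison only matches list values.
old_style = [c["text"] for c in chars if c["non_stroking_color"] == [1, 0, 0]]
# The tolerant comparison matches both shapes.
red_chars = [c["text"] for c in chars if is_red(c)]
print(old_style, red_chars)
```

Code that only needs to support versions that include this commit can compare directly against `(1, 0, 0)`.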
jsvine committed Jul 4, 2023
1 parent 0de6da9 commit 57d51bb
Showing 3 changed files with 22 additions and 8 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -158,7 +158,7 @@ Each object is represented as a simple Python `dict`, with the following propert
 |`bottom`| Distance of bottom of the character from top of page.|
 |`doctop`| Distance of top of character from top of document.|
 |`matrix`| The "current transformation matrix" for this character. (See below for details.)|
-|`stroking_color`|The color of the character's outline (i.e., stroke), expressed as a tuple or integer, depending on the “color space” used.|
+|`stroking_color`|The color of the character's outline (i.e., stroke), expressed as a tuple, with length determined by the “color space” used (1 for grayscale, 3 for RGB, 4 for CMYK).|
 |`non_stroking_color`|The character's interior color.|
 |`object_type`| "char"|

@@ -186,7 +186,7 @@ my_char_rotation = my_char_ctm.skew_x
 |`bottom`| Distance of bottom of the line from top of page.|
 |`doctop`| Distance of top of line from top of document.|
 |`linewidth`| Thickness of line.|
-|`stroking_color`|The color of the line, expressed as a tuple or integer, depending on the “color space” used.|
+|`stroking_color`|The color of the line, expressed as a tuple, with length determined by the “color space” used (1 for grayscale, 3 for RGB, 4 for CMYK).|
 |`non_stroking_color`|The non-stroking color specified for the line’s path.|
 |`object_type`| "line"|

@@ -205,7 +205,7 @@ my_char_rotation = my_char_ctm.skew_x
 |`bottom`| Distance of bottom of the rectangle from top of page.|
 |`doctop`| Distance of top of rectangle from top of document.|
 |`linewidth`| Thickness of line.|
-|`stroking_color`|The color of the rectangle's outline, expressed as a tuple or integer, depending on the “color space” used.|
+|`stroking_color`|The color of the rectangle's outline, expressed as a tuple, with length determined by the “color space” used (1 for grayscale, 3 for RGB, 4 for CMYK).|
 |`non_stroking_color`|The rectangle’s fill color.|
 |`object_type`| "rect"|

@@ -226,7 +226,7 @@ my_char_rotation = my_char_ctm.skew_x
 |`doctop`| Distance of curve's highest point from top of document.|
 |`linewidth`| Thickness of line.|
 |`fill`| Whether the shape defined by the curve's path is filled.|
-|`stroking_color`|The color of the curve's outline, expressed as a tuple or integer, depending on the “color space” used.|
+|`stroking_color`|The color of the curve's outline, expressed as a tuple, with length determined by the “color space” used (1 for grayscale, 3 for RGB, 4 for CMYK).|
 |`non_stroking_color`|The curve’s fill color.|
 |`object_type`| "curve"|

18 changes: 16 additions & 2 deletions pdfplumber/page.py
@@ -93,6 +93,13 @@ def fix_fontname_bytes(fontname: bytes) -> str:
     return str(prefix)[2:-1] + suffix_new


+def normalize_color(color: Any) -> Optional[tuple[Union[float, int], ...]]:
+    if color is None:
+        return None
+    else:
+        return tuple(color) if isinstance(color, (tuple, list)) else (color,)
+
+
 class Page(Container):
     cached_properties: List[str] = Container.cached_properties + ["_layout"]
     is_original: bool = True
@@ -234,13 +241,20 @@ def process_attr(item: Tuple[str, Any]) -> Optional[Tuple[str, Any]]:
         attr["object_type"] = kind
         attr["page_number"] = self.page_number

+        for color_attr in ["stroking_color", "non_stroking_color"]:
+            if color_attr in attr:
+                attr[color_attr] = normalize_color(attr[color_attr])
+
         if isinstance(obj, (LTChar, LTTextContainer)):
             attr["text"] = obj.get_text()

         if isinstance(obj, LTChar):
+            # pdfminer.six (at least as of v20221105) does not
+            # directly expose .stroking_color and .non_stroking_color
+            # for LTChar objects (unlike, e.g., LTRect objects).
             gs = obj.graphicstate
-            attr["stroking_color"] = gs.scolor
-            attr["non_stroking_color"] = gs.ncolor
+            attr["stroking_color"] = normalize_color(gs.scolor)
+            attr["non_stroking_color"] = normalize_color(gs.ncolor)

             # Handle (rare) byte-encoded fontnames
             if isinstance(attr["fontname"], bytes):
4 changes: 2 additions & 2 deletions tests/test_basics.py
@@ -158,11 +158,11 @@ def test_password(self):

     def test_colors(self):
         rect = self.pdf.pages[0].rects[0]
-        assert rect["non_stroking_color"] == [0.8, 1, 1]
+        assert rect["non_stroking_color"] == (0.8, 1, 1)

     def test_text_colors(self):
         char = self.pdf.pages[0].chars[3358]
-        assert char["non_stroking_color"] == [1, 0, 0]
+        assert char["non_stroking_color"] == (1, 0, 0)

     def test_load_with_custom_laparams(self):
         # See https://github.com/jsvine/pdfplumber/issues/168