Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

colours have inconsistent types #917

Closed
dhdaines opened this issue Jun 29, 2023 · 14 comments
Closed

colours have inconsistent types #917

dhdaines opened this issue Jun 29, 2023 · 14 comments
Assignees
Labels

Comments

@dhdaines
Copy link
Contributor

Describe the bug

The documentation claims that the "stroking_color" ( and presumably also the "non_stroking_color") keys are:

expressed as a tuple or integer, depending on the “color space” used

This is quite false. Depending on the document and the phase of the moon, they could either be float, an int, a tuple, or a list.

This makes it needlessly complicated to handle them since one cannot simply check isinstance(c["stroking_color"], tuple) to know if you have a grayscale or colour value... eventually your program will encounter one that is a list (why?) and, maybe, crash, or worse, just give a bogus result.

Environment

  • pdfplumber version: 0.9.0
  • Python version: 3.10.6
  • OS: Ubuntu 22.04
@dhdaines dhdaines added the bug label Jun 29, 2023
@jsvine
Copy link
Owner

jsvine commented Jun 30, 2023

Thanks for flagging this, @dhdaines. You're right: That part of the documentation needs updating. There's a limit to what pdfplumber can do to here, since we're dependent on pdfminer.six's parsing, but I agree that there's room for improvement. Here are my observations, based on what I see in pdfminer.six's codebase:

pdfminer.six declares Color to have the following type signature:

Color = Union[
    float,  # Greyscale
    Tuple[float, float, float],  # R, G, B
    Tuple[float, float, float, float],
]  # C, M, Y, K

For the core stroking operations, pdfminer.six at first blush appears to follow that definition:

Method Color space Code snippet Link
do_G Grayscale cast(float, G) 🔗
do_RG RGB (cast(float, r), cast(float, g), cast(float, b)) 🔗
do_K CMYK (cast(float, c), cast(float, m), cast(float, y), cast(float, k)) 🔗

... except that cast(...) is the typing module's cast, which explicitly does nothing. ("This returns the value unchanged. To the type checker this signals that the return value has the designated type, but at runtime we intentionally don’t check anything (we want this to be as fast as possible).")

Moreover, there's another color-setting operation, do_SCN, that just pops tokens off the stack without converting that list of tokens to a tuple.

Together, I think this explains the variation you're seeing.

As for fixes:

  • I think it'd be a non-controversial improvement to convert all lists to tuples
  • I could also see the argument for converting ints to floats, although that might lead to some information loss (not in value, but of the original type in the PDF)
  • I think we might also want to convert the results of do_G into 1-item tuples (i.e., 0.25 becomes (0.25,)) for consistency's sake.

Thoughts on this? Other suggestions for improvements?

@jsvine jsvine self-assigned this Jun 30, 2023
@dhdaines
Copy link
Contributor Author

dhdaines commented Jul 3, 2023

Ah! Thanks for the information! Sorry it took a while to reply, long weekend...

So ultimately this is really a bug in pdfminer.six - I think casting everything to (possibly single-element) tuples of floats is the right way to go. This means that nobody has to ever call isinstance, which is generally something to be avoided.

To the extent that there are ints in there, they're probably bool values in disguise, since at least in theory, pdfminer.six only deals in floats: https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/psparser.py#L539

@dhdaines
Copy link
Contributor Author

dhdaines commented Jul 3, 2023

(so, there's no possible information loss from forcing them to float)

@jsvine
Copy link
Owner

jsvine commented Jul 3, 2023

Thanks for following up. A few notes in response:

To the extent that there are ints in there, they're probably bool values in disguise, since at least in theory, pdfminer.six only deals in floats: https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/psparser.py#L539

Hmm, that doesn't seem quite right to me, for a few reasons:

Or perhaps I misunderstand what you're suggesting?

(so, there's no possible information loss from forcing them to float)

I ought to have been clearer on this point. The information loss isn't so much that there's a quantitative difference between 1 and 1.0, but rather that this difference can be helpful for distinguishing between text/graphics/etc. that were produced by different software, or different subroutines of the same software, etc.

@dhdaines
Copy link
Contributor Author

dhdaines commented Jul 4, 2023

Right, the PDF spec (like PostScript) obviously supports both ints and floats! (unlike ahem certain popular programming languages ahem). And despite the type annotations pdfminer.six will clearly put an int on the stack: https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/psparser.py#L610

I think we're in agreement that the problem here is the misunderstanding of Python typing by the pdfminer.six authors, and we should expect that both int and float could be there, and could be meaningful. I'm obviously thinking about things in a machine learning perspective where everything is going to end up as a bool or a float32 ;-)

It would definitely be easier from the user's point of view if everything were a tuple and you could just call len on it to determine what colour space you're dealing with, but I don't know if this strays too far from the PDF spec (which I don't claim to understand even remotely). But, minimally, it would make sense for the colour values to be either a tuple of ints, a tuple of floats, an int, or a float, and not to be a list!

@dhdaines
Copy link
Contributor Author

dhdaines commented Jul 4, 2023

It would definitely be easier from the user's point of view if everything were a tuple and you could just call len on it to determine what colour space you're dealing with

Oh no, it definitely isn't as simple as that! Because the various colour spaces have a variety of different ranges and components, actually it probably isn't a great idea to type them as tuple since they might contain a mixture of ints (sometimes in a range of [-100,100], sometimes [-128,127], etc, etc) and floats. So really they would be something like Union[int, float, Sequence[Union[int, float]]], unless we provide specific types for all of the possible colour spaces.

In this case it's really just a matter of updating the documentation for the moment, but really, it would be helpful to have access to the colour space from pdfplumber.

@jsvine
Copy link
Owner

jsvine commented Jul 5, 2023

Hm, I'm not quite sure I understand your latest note re. "actually it probably isn't a great idea to type them as tuple, since they might contain a mixture [...]". Is the argument that tuples should only contain homogenous types? Or that tuples are insufficient to represent the full set of color possibilities? Or perhaps something else?

In this case it's really just a matter of updating the documentation for the moment, but really, it would be helpful to have access to the colour space from pdfplumber.

Agreed. I've been working on some tweaks based on this thread. Right now, the only information pdfplumber can expose is the non-stroking color space for char objects. The rest (stroking color space for char objects and either color space for the other types of objects) will require a PR to pdfminer.six — something I plan to do eventually.


Also, in case you're curious, there's a further wrinkle, which is that colors can also be specified as "patterns" (see section 4.6 here). So I'm also working on separating out stroking/non-stroking patterns from numerically-specified colors.

@dhdaines
Copy link
Contributor Author

dhdaines commented Jul 5, 2023

Hm, I'm not quite sure I understand your latest note re. "actually it probably isn't a great idea to type them as tuple, since they might contain a mixture [...]". Is the argument that tuples should only contain homogenous types?

Ah, sorry, yes that wasn't clear. Yes, it's because a tuple needs to define the number and type of each of its elements (well, it doesn't need to do this, at least in Python, but it seems to be the general best practice). So given that any element of a colour might be an int or a float, and there might be 1, 3, 4, or ??? of them, the type definitions would get pretty ugly...

@dhdaines
Copy link
Contributor Author

dhdaines commented Jul 5, 2023

Ah. Actually, much like JavaScript, PDF considers integers and floats to be the same thing (and doesn't even specify what the range of integers is...). See https://ghostscript.com/~robin/pdf_reference17.pdf#page=52:

An integer is written as one or more decimal digits optionally preceded by a sign:
123 43445 +17 −98 0
The value is interpreted as a signed decimal integer and is converted to an integer
object. If it exceeds the implementation limit for integers, it is converted to a real
object.

Wherever a real number is expected, an integer may be used
instead and is automatically converted to an equivalent real value. For example, it
is not necessary to write the number 1.0 in real format; the integer 1 is sufficient.

In light of this I think it's reasonable to convert all numbers to Python float.

@jsvine
Copy link
Owner

jsvine commented Jul 5, 2023

Thanks for the quick clarification. This is testing my knowledge of Python best practices, but my general impression is that tuples would be a decent fit for this use case. Per the Python docs:

Tuples are immutable, and usually contain a heterogeneous sequence of elements that are accessed via unpacking (see later in this section) or indexing (or even by attribute in the case of namedtuples).

Colors seem to check that list. And the type definition, seems manageable (or at least acceptable by mypy):

color_type = Tuple[Union[int, float], ...]

I'm less concerned about the fact that these tuples can be 1, 3, or 4-length, since every individual tuple's length is fixed. Ideally, getting color space information into pdfplumber will resolve the need to check the length of these tuples, and I'm not sure other types (list, etc.) would provide any advantage in this respect. But perhaps I'm overlooking a particular wrinkle?


In light of this I think it's reasonable to convert all numbers to Python float.

Perhaps I could have been clearer before: The reason for not converting isn't about how ints/floats are handled internally by PDF or Python, or what their specs say, but rather because whether the actual tokens in the PDF use int/float representation can be useful information, particularly when trying to discriminate between various objects in a PDF or between PDFs.

@dhdaines
Copy link
Contributor Author

dhdaines commented Jul 5, 2023

Colors seem to check that list. And the type definition, seems manageable (or at least acceptable by mypy):

color_type = Tuple[Union[int, float], ...]

Yes, this is fine, and acceptable to mypy, of course... It's just not a particularly useful type definition, since it won't prevent runtime errors in code that assumes a specific configuration of the tuple in question, and the general use case of tuples is for objects that can be expected to have a well-defined shape.

But it is better than the incorrect type definition in pdfminer.six :-) and the important thing is just that the type definitions match the documentation, which matches reality!

@jsvine
Copy link
Owner

jsvine commented Jul 7, 2023

Roger that. And thanks for the conversation here. I have some fixes in progress, and hope to push in the next version. If additional thoughts on the topic occur to you, feel to keep posting them here.

jsvine added a commit that referenced this issue Jul 16, 2023
This commit normalizes the type representation of `stroking_color` and
`non_stroking_color` values. Thanks to @dhdaines for pointing out this
inconsistency.

Previously, `pdfplumber` passed along `pdfminer.six`'s colors without
normalization. Due to quirks in `pdfminer.six`'s color handling, this
meant that those values could be floats, ints, lists, or tuples. This
commit normalizes all color values (when non-None) into n-tuples, where
(val,) represents grayscale colors, (val, val, val) represents RBG, and
(val, val, val, val) represents CMYK colors.

This should solve the consistency issue, although might cause breaking
changes to code that filters for non-tuple values — e.g., `[c for c in
page.chars if c == [1, 0 0]]`. Although breaking changes are unpleasant,
I think the tradeoff for longer-term consistency is worth it.
@jsvine jsvine mentioned this issue Jul 16, 2023
@jsvine
Copy link
Owner

jsvine commented Jul 17, 2023

These changes have now landed in v0.10.0. I've also added a new docs page specifically re. colors: https://github.com/jsvine/pdfplumber/blob/stable/docs/colors.md

Closing this issue, but feel free to reopen or continue the discussion if you feel like the changes missed the mark. Thanks again for bringing this up, @dhdaines 👍

@jsvine jsvine closed this as completed Jul 17, 2023
@dhdaines
Copy link
Contributor Author

Thank you! This looks very good, nice documentation as well!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants