colours have inconsistent types #917
Comments
Thanks for flagging this, @dhdaines. You're right: That part of the documentation needs updating. There's a limit to what
Color = Union[
    float,  # Greyscale
    Tuple[float, float, float],  # R, G, B
    Tuple[float, float, float, float],  # C, M, Y, K
]
For the core stroking operations,
... except that Moreover, there's another color-setting operation, Together, I think this explains the variation you're seeing. As for fixes:
Thoughts on this? Other suggestions for improvements?
Ah! Thanks for the information! Sorry it took a while to reply, long weekend... So ultimately this is really a bug in To the extent that there are
(so, there's no possible information loss from forcing them to
Thanks for following up. A few notes in response:
Hmm, that doesn't seem quite right to me, for a few reasons:
Or perhaps I misunderstand what you're suggesting?
I ought to have been clearer on this point. The information loss isn't so much that there's a quantitative difference between
Right, the PDF spec (like PostScript) obviously supports both ints and floats! (unlike ahem certain popular programming languages ahem). And despite the type annotations I think we're in agreement that the problem here is the misunderstanding of Python typing by the It would definitely be easier from the user's point of view if everything were a
Oh no, it definitely isn't as simple as that! Because the various colour spaces have a variety of different ranges and components, actually it probably isn't a great idea to type them as In this case it's really just a matter of updating the documentation for the moment, but really, it would be helpful to have access to the colour space from
Hm, I'm not quite sure I understand your latest note re. "actually it probably isn't a great idea to type them as tuple, since they might contain a mixture [...]". Is the argument that
Agreed. I've been working on some tweaks based on this thread. Right now, the only information Also, in case you're curious, there's a further wrinkle, which is that colors can also be specified as "patterns" (see section 4.6 here). So I'm also working on separating out stroking/non-stroking patterns from numerically-specified colors.
Ah, sorry, yes that wasn't clear. Yes, it's because a
Ah. Actually, much like JavaScript, PDF considers integers and floats to be the same thing (and doesn't even specify what the range of integers is...). See https://ghostscript.com/~robin/pdf_reference17.pdf#page=52:
In light of this I think it's reasonable to convert all numbers to Python
Thanks for the quick clarification. This is testing my knowledge of Python best practices, but my general impression is that
Colors seem to check that list. And the type definition, seems manageable (or at least acceptable by color_type = Tuple[Union[int, float], ...] I'm less concerned about the fact that these tuples can be 1, 3, or 4-length, since every individual tuple's length is fixed. Ideally, getting color space information into
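As a rough illustration of the variable-length tuple type mentioned above, here is a minimal sketch. The names `ColorTuple` and `guess_color_model` are hypothetical and not part of `pdfplumber` or `pdfminer.six`. Inferring the colour model from tuple length alone is an assumption; it ignores the actual colour space recorded in the PDF, which the thread notes is not currently exposed.

```python
from typing import Tuple, Union

Number = Union[int, float]

# Homogeneous tuple of any length; each individual colour has a fixed length.
ColorTuple = Tuple[Number, ...]


def guess_color_model(color: ColorTuple) -> str:
    """Guess a colour model from the number of components (an assumption)."""
    names = {1: "gray", 3: "rgb", 4: "cmyk"}
    return names.get(len(color), "unknown")
```

For example, `guess_color_model((1, 0, 0))` would be treated as RGB under this assumption, while a two-component tuple would fall through to "unknown".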
Perhaps I could have been clearer before: The reason for not converting isn't about how
Yes, this is fine, and acceptable to But it is better than the incorrect type definition in
Roger that. And thanks for the conversation here. I have some fixes in progress, and hope to push in the next version. If additional thoughts on the topic occur to you, feel free to keep posting them here.
This commit normalizes the type representation of `stroking_color` and `non_stroking_color` values. Thanks to @dhdaines for pointing out this inconsistency. Previously, `pdfplumber` passed along `pdfminer.six`'s colors without normalization. Due to quirks in `pdfminer.six`'s color handling, this meant that those values could be floats, ints, lists, or tuples. This commit normalizes all color values (when non-None) into n-tuples, where (val,) represents grayscale colors, (val, val, val) represents RGB, and (val, val, val, val) represents CMYK colors. This should solve the consistency issue, although it might cause breaking changes to code that filters for non-tuple values, e.g., `[c for c in page.chars if c["stroking_color"] == [1, 0, 0]]`. Although breaking changes are unpleasant, I think the tradeoff for longer-term consistency is worth it.
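The normalization described in this commit message can be sketched roughly as follows. This is an illustrative sketch only, not the actual `pdfplumber` implementation; the function name `normalize_color` and the `RawColor` alias are hypothetical. It assumes raw values arrive as a bare number, a list, a tuple, or None, and it leaves the individual numbers unconverted.

```python
from typing import Optional, Sequence, Tuple, Union

Number = Union[int, float]
RawColor = Union[Number, Sequence[Number], None]


def normalize_color(value: RawColor) -> Optional[Tuple[Number, ...]]:
    """Return the colour as an n-tuple, or None if no colour is set."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        # A bare number becomes a 1-tuple (grayscale).
        return (value,)
    # Lists and tuples both become tuples; component values are untouched.
    return tuple(value)
```

With values normalized this way, a filter written against lists, such as comparing to `[1, 0, 0]`, would need to compare to `(1, 0, 0)` instead, since a Python list never compares equal to a tuple.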
These changes have now landed in v0.10.0. I've also added a new docs page specifically re. colors: https://github.com/jsvine/pdfplumber/blob/stable/docs/colors.md Closing this issue, but feel free to reopen or continue the discussion if you feel like the changes missed the mark. Thanks again for bringing this up, @dhdaines 👍
Thank you! This looks very good, nice documentation as well!
Describe the bug
The documentation claims that the "stroking_color" (and presumably also the "non_stroking_color") keys are:
This is quite false. Depending on the document and the phase of the moon, they could be a float, an int, a tuple, or a list.
This makes it needlessly complicated to handle them since one cannot simply check isinstance(c["stroking_color"], tuple) to know if you have a grayscale or colour value... eventually your program will encounter one that is a list (why?) and, maybe, crash, or worse, just give a bogus result.
Environment