Invalid CMap table generated #1571

Closed
janvogt opened this issue Feb 15, 2022 · 7 comments
Labels
bug Existing features not working as expected
janvogt commented Feb 15, 2022

Observation

Some printable UTF-8 characters could not be recovered from the generated PDF. A poppler-cpp based pdftotext library warned:

poppler/error: Invalid entry in bfchar block in ToUnicode CMap

Research

Following this trail, I figured out that the following line in the CMap of the PDF generated by WeasyPrint triggered this error, since its value is larger than 4096 (see the check in the poppler source code):

<10006c5b> <6c5b>

Using a hex editor to change the initial 1 to a 0, i.e.

<00006c5b> <6c5b>

fixed the PDF and made the character recoverable and also visible in a PDF Viewer (Apple Preview).
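
A check of this kind can be sketched in Python. This is only an illustration: it assumes a 2-byte code space with an upper bound of 0xFFFF, which may not be the exact limit poppler enforces, and the names BFCHAR_ENTRY, MAX_CODE and find_invalid_entries are hypothetical, not part of WeasyPrint or poppler.

```python
import re

# Matches one "<source> <destination>" hex pair as written in a bfchar block.
BFCHAR_ENTRY = re.compile(r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>")

# Assumed upper bound for a 2-byte code space (an assumption of this sketch).
MAX_CODE = 0xFFFF


def find_invalid_entries(cmap_text, max_code=MAX_CODE):
    """Return (source, destination) pairs whose source code exceeds max_code."""
    invalid = []
    for source, destination in BFCHAR_ENTRY.findall(cmap_text):
        if int(source, 16) > max_code:
            invalid.append((source, destination))
    return invalid


sample = "<10006c5b> <6c5b>\n<00006c5b> <6c5b>\n<0041> <0041>"
print(find_invalid_entries(sample))  # [('10006c5b', '6c5b')]
```

Only the first entry is reported; the hand-edited entry <00006c5b> has the numeric value 0x6c5b and passes the assumed limit.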

I attempted to figure out how the faulty value is generated, but had to give up eventually. The relevant line just renders the glyph, but it is unclear to me where the glyph originates.

Reproducible example

A reproducible example is available at https://github.com/janvogt/repro-weasyprint-cmap-utf8. I also attached the invalid PDF example.pdf and the fixed PDF example_fixed.pdf, generated from the following HTML:

<!DOCTYPE html>
<html>

<head>
  <meta charset="utf-8">
</head>

<body></body>

</html>
liZe added the bug (Existing features not working as expected) label Feb 16, 2022
Member

liZe commented Feb 16, 2022

Thanks for the bug report.

The problem comes from the glyph number: PDF requires this number to be lower than ffff, and we have 10006c5b. That’s a very high number, and we can assume that your font doesn’t include more than 200 million characters :D.

Actually, DejaVu Sans doesn’t include your character. I suppose that no font installed on your system includes this character, and that Pango falls back to an empty glyph. So, I suppose that your bug is actually #1508, already fixed in version 54.

Removing the leading 1 syntactically "fixes" the PDF, but it’s not a real fix. As glyph number 6c5b doesn’t exist in DejaVu (that’s embedded in the PDF), your PDF reader uses another font (installed on the computer where the PDF is viewed) to render Unicode character number 6c5b. It’s just a magic trick of the PDF renderer, but it doesn’t really work ;).

You can try to render your document with version 54.x, and you’ll probably get a correct but empty PDF as there’s no font including this character on your system. Or you can try to install a font that includes Chinese ideograms, and it will work, even with 53.x. That’s my assumption!

Author

janvogt commented Feb 16, 2022

Thanks for the quick response!

Unfortunately, I was already using version 54.1 except in the repro. Here are the same pdfs rendered using 54.1 (and I also updated the repro):

You are correct that the font used does not contain a glyph for Unicode character 0x6c5b. It turns out that the problem does not occur when using a font that contains such a glyph (see example_full_font.html in the repro).

However, I still think it is a bug to generate an invalid PDF when the font is missing a glyph. To me the expected behaviour would be to render the character-not-found glyph. But I could live with the current behaviour of showing nothing as well.

In any case though, when extracting the text from the pdf (e.g. using something like pdftotext) all characters should be preserved via a correct CMap. Otherwise, property testing becomes very cumbersome... But maybe there are good reasons against it, I just don't see?

After the report I also dug a little deeper and it seems that this way too large glyph value comes from pango. However, I am not 100% sure...

Member

liZe commented Feb 17, 2022

However, I still think it is a bug to generate an invalid pdf when the font is missing a glyph.

That’s true.

To me the expected behaviour would be to render the character-not-found glyph. But I could live with the current behaviour of showing nothing as well.

We’ll let Pango’s fallback mechanism handle this for us; it looks like it displays nothing rather than a fallback character.

In any case though, when extracting the text from the pdf (e.g. using something like pdftotext) all characters should be preserved via a correct CMap. Otherwise, property testing becomes very cumbersome... But maybe there are good reasons against it, I just don't see?

No, you’re right there’s a bug we should fix.

But your problem is specific; it’s not the common case. Usually, when a character is missing, Pango finds a fallback character and everything goes well. But here, Pango wants to use the glyph number 0x10006c5b (that’s 268463195 in decimal), and the PDF specification states in 9.7.6.2 that "The code length shall not be greater than": that’s the problem you have.

I doubt that a font on your system has so many characters included, so I think that Pango doesn’t give a real glyph number. This code having the Unicode character number at the end (6c5b) is another hint, as Unicode characters and glyph ids are usually unrelated.

We have to check Pango’s documentation to see if this code doesn’t mean something else. And even if we don’t find anything in the doc, we should fix this case so that we display nothing instead of including a forbidden code.
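
The arithmetic in this comment can be checked with a few lines of Python. This is only a sketch for illustration; the variable names are hypothetical, and splitting the value at 16 bits is an assumption made to show that the trailing digits match the Unicode code point.

```python
glyph = 0x10006C5B

# Decimal value of the glyph number reported by Pango.
print(glyph)  # 268463195

# The value does not fit in two bytes.
print(glyph > 0xFFFF)  # True

# Split into the low 16 bits and everything above them.
low_bits = glyph & 0xFFFF
high_bits = glyph & ~0xFFFF
print(hex(low_bits))   # 0x6c5b
print(hex(high_bits))  # 0x10000000
```

The low 16 bits are exactly the Unicode code point of the missing character, which supports the hint that this value is not an ordinary glyph id.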

Author

janvogt commented Feb 17, 2022

Turns out there is a PANGO_GLYPH_UNKNOWN_FLAG that has exactly the offset we're seeing: 0x10000000. So every Unicode character of the shape 0x0....... has a corresponding glyph 0x1....... in Pango that is used if it's not available in the current font. I think that is what we're seeing here.

This is an example of how it is used in the Pango Codebase https://gitlab.gnome.org/GNOME/pango/-/blob/main/pango/pango-layout.c#L1458
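
How such a value could be detected and decoded is sketched below in Python. The constant mirrors the value of PANGO_GLYPH_UNKNOWN_FLAG quoted above; the helper names is_unknown_glyph, unknown_glyph_codepoint and usable_glyphs are hypothetical, and this is not necessarily how WeasyPrint handles the case.

```python
# Same value as the PANGO_GLYPH_UNKNOWN_FLAG quoted above.
PANGO_GLYPH_UNKNOWN_FLAG = 0x10000000


def is_unknown_glyph(glyph):
    """Return True if Pango marked this glyph as missing from the font."""
    return bool(glyph & PANGO_GLYPH_UNKNOWN_FLAG)


def unknown_glyph_codepoint(glyph):
    """Return the Unicode code point carried by an unknown-glyph value."""
    return glyph & ~PANGO_GLYPH_UNKNOWN_FLAG


def usable_glyphs(glyphs):
    """Keep only glyph ids that refer to real glyphs in the font."""
    return [glyph for glyph in glyphs if not is_unknown_glyph(glyph)]


glyph = 0x10006C5B
print(is_unknown_glyph(glyph))              # True
print(hex(unknown_glyph_codepoint(glyph)))  # 0x6c5b
print(usable_glyphs([0x0041, glyph]))       # [65]
```

Filtering flagged values before they reach the CMap is one possible way to avoid writing a code that exceeds the allowed range.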

Member

liZe commented Feb 17, 2022

Turns out there is a PANGO_GLYPH_UNKNOWN_FLAG that has exactly the offset we're seeing: 0x10000000.

Thanks a lot, we understand what’s going on now, and we can fix this bug.

liZe added this to the 54.2 milestone Feb 17, 2022
liZe closed this as completed in 32ad7d5 Feb 17, 2022
Member

liZe commented Feb 17, 2022

The bug is fixed in the 54.x and master branches. Tests and feedback are welcome!

Author

janvogt commented Feb 19, 2022

I am happy to confirm that the fix works in the minimal repro. Thanks for the quick responses, as well as providing and maintaining this awesome tool!
