I can attach files via command line api but not the python api? #644

Closed
tjrhodes opened this issue Jun 16, 2018 · 15 comments

Comments

@tjrhodes

tjrhodes commented Jun 16, 2018

Hi,

Using -a on the command line I can attach PDFs fine, but when I try

<link rel="attachment" href="path/to/pdf.pdf">

tags in the html I get no attachments, using relative or absolute paths makes no difference.

I've also tried passing an array of attachments in the python like so...

doc.write_pdf(outputPath,1,attachments)

where doc is returned from an html.render() call. No dice that way either. I think I'm missing something basic, but I've looked through the docs and the issues on here and I can't see why it fails.

Any ideas?

@tjrhodes
Author

python3 and 0.42.3 BTW :)

@tjrhodes
Author

I can see this in the generated PDF, which I think might be coming from the link tag(s) in the HTML?

 obj
<</Names
  <</EmbeddedFiles
  <</Names
  [(attachment)
  <</Desc <> /EF <</F 3 0 R>> /F <> /Type /Filespec /UF
  (maidiremail_maggio_2018.pdf)>>]>>>>
  /Outlines <</Count 1 /First 4 0 R /Last 4 0 R /Type /Outlines>> /Pages
  5 0 R /Type /Catalog>>
endobj

using this code...

self.attachmentPath = "../work/17_6_2018-10_35_02-Test/attachments/163a32b533fd93b7/maidiremail_maggio_2018.pdf"
self.outputPath = "test.pdf"
self.htmlPath = "../work/17_6_2018-10_35_02-Test/SEMILIBERTA'_GUBBIOTTI_PAOLO_17_6_2018.html"
fontConfig = weasyprint_fonts_FontConfiguration()
cssPath = "../css/mail.css"
css = weasyprint_CSS(cssPath,None,None,None,None,None,None,None,None,None,fontConfig)
html = weasyprint_HTML(self.htmlPath)
attachment = weasyprint_Attachment(filename=self.attachmentPath)
html.write_pdf(self.outputPath, [css], 1, [attachment])

I get no errors, a pdf of the html but without the attachments included. When I use -a I get a pdf with the attachments.

I'm generating hundreds of pdfs at a time, and to get around the startup time of weasyprint I have made a zerorpc server to keep running which then gets called to make the pdfs, it runs many times faster this way but I'm having to use pdfunite to get around this attachments problem and then pdfinfo to count the pages. I'm guessing all of that should be possible with the weasyprint python API? Just I'm probably doing something daft...

@Tontyna
Contributor

Tontyna commented Jun 17, 2018

No problem here. Neither with the current master branch nor with a version from last year.

Since the attachment is in the PDF and since your PDF reader can display the attached PDF when attached via -a I have no idea what causes your trouble.

Could you provide a PDF where you cannot see the attached PDF?

@tjrhodes
Author

Hmmm, I was hoping I was being obviously stupid. So the fact it works with -a and not with python for me is an odd edge case then? Not got time right now, but I'll provide a couple of pdfs, one made on the command line with -a and one with python, using the exact same sources and see if that helps to isolate what's going on.

@liZe
Member

liZe commented Jun 23, 2018

@tjrhodes the problem may come from the relative paths you're using. When you're using '-a', the path is relative to your current folder, but when you're using a <link> tag, it's relative to the folder the HTML file is in.

You should try to use absolute path in your Python example and see if it works.
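
To illustrate the difference, here is a minimal standard-library sketch. The directory and file names are hypothetical, and `urljoin` only approximates how a relative `href` is resolved against the HTML file's location; nothing here calls WeasyPrint itself.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin

# Hypothetical locations, for illustration only.
current_dir = PurePosixPath("/home/user/project/src")
html_file = PurePosixPath("/home/user/project/work/mail/message.html")
relative_attachment = "attachments/report.pdf"

# A path passed with -a is interpreted relative to the current folder.
from_cwd = current_dir / relative_attachment

# An href inside the HTML is resolved against the HTML file's own URL,
# so it ends up next to the HTML file rather than under the current folder.
from_html = urljoin(html_file.as_uri(), relative_attachment)

print(from_cwd)
print(from_html)
```

The same relative string points at two different files, which is why an absolute path removes the ambiguity.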

@Tontyna
Contributor

Tontyna commented Jun 23, 2018

@liZe: I suspected the same, but @tjrhodes said:

I get no attachments, using relative or absolute paths makes no difference.

Also, if WeasyPrint cannot find an attachment, it raises a warning and doesn't put an EmbeddedFiles object into the PDF.
I'm really keen to see a PDF with attachments that are not attached.

@tjrhodes
Author

Hey, ok here you go, I'm now consistently getting it to fail silently through the command line and the python API. Pretty sure -a was working for me before though. I've been working around it with pdfunite for a while so maybe I got confused there. These pdfs were produced from html without the tag.

https://cloud.tjrhodes.com/index.php/s/0YBWa0GXVupFeiK
password: weasy

In there you have the attachment, the html and the two pdfs which seem identical. So I guess the problem now is that I get no attachments either way! No warnings and <<Embedded Files present in the pdfs.

Not urgent as I've got pdfunite and pdfinfo to fall back on, but your tool is awesome and I was looking to do everything with one command instead of 3 different commands. Plus using weasyprint with zerorpc to get the speed benefits of the long running process is like 10x faster than doing what I'm doing now.

@Tontyna
Contributor

Tontyna commented Jun 23, 2018

As you say: The PDFs are identical.
The attached mail_lands.pdf is present in both. But when I try to click-open them I get an ERROR. Which one depends on the PDF reader I'm using. My Adobe Reader says something along the lines of

Couldn't open "mail_lands.pdf" because either the file type isn't supported or the file is damaged (e.g. the file has been sent as an email attachment and wasn't decoded correctly)

FlateDecoding the embedded stream reproduces mail_lands.pdf, no error, no damage.

Looking at mail_lands.pdf with an editor, all I see at first glance is: it's PDF-1.4, while WeasyPrint produces PDF-1.3. Maybe that's the point? Embedding 1.4 in 1.3 upsets the PDF readers?
Just a guess.
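
The two checks described above (inflating the embedded stream and reading the version from the file header) can be reproduced with the standard library. This is only a sketch: `embedded` stands in for the raw bytes of a FlateDecode stream extracted from the PDF by hand, the payload is a made-up placeholder, and `pdf_header_version` is a hypothetical helper.

```python
import re
import zlib


def pdf_header_version(data: bytes):
    """Return the version from a '%PDF-x.y' header, or None if absent."""
    match = re.match(rb"%PDF-(\d\.\d)", data)
    return match.group(1).decode("ascii") if match else None


# Placeholder standing in for the attachment's bytes.
original = b"%PDF-1.4\n% placeholder body\n%%EOF\n"
embedded = zlib.compress(original)  # the kind of data a FlateDecode stream holds

# Inflating the stream should give back the attachment byte for byte.
inflated = zlib.decompress(embedded)
print(inflated == original)
print(pdf_header_version(inflated))
```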

@tjrhodes
Author

Right, interesting, that pdf and lots of others I'm attaching to the weasy generated ones with pdfunite, are created from libreoffice --convert-to on the command line. So I guess I need to look for a way to control the version there. I don't have adobe reader, gnome document viewer shows me the weasy generated content and nothing else.

Anyway, thanks for the hint I didn't know where to look for an answer but the 1.4 > 1.3 thing looks very promising, and thanks for the great tool, the results from html are fantastic.

@Tontyna
Contributor

Tontyna commented Jun 23, 2018

What I don't get is why a viewer capable of handling PDF 1.4 isn't able to unpack the attached PDF and detect that it should simply switch its parsing engine from 1.3 to 1.4... but that's what at least 3 viewers seemingly fail to do: Adobe Reader, Sumatra PDF viewer, gnome document viewer.

Would you mind changing the title to something like "problems when embedding PDF 1.4 files"?

@liZe
Member

liZe commented Jun 25, 2018

Version 1.3 is set by pdfrw, but Cairo creates PDF files with version 1.5. How does pdfrw transform 1.5 documents into 1.3? I don't know, and I don't want to.

After many, many bugs (#644, #639, #565 and equivalent issues), I think that we should not use pdfrw anymore. Cairo now provides an API to add metadata (including bookmarks I think), there's not much more to handle by editing the PDF file (at least bleeding areas).

I'm sad, because pdfrw is really useful and its devs are really nice. But the work needed on CairoSVG and on WeasyPrint is probably less than the work needed to understand and fix these issues using pdfrw.

@Tontyna
Contributor

Tontyna commented Jun 25, 2018

Wanted to be sure that the mixed PDF versions are the source of evil.
Attached the 1.4 PDF mail_lands.pdf to the SEMILIBERTA'_GUBBIOTTI_PAOLO_17_6_2018.html, provided by @tjrhodes, rendered that with WeasyPrint to a PDF 1.3, opened it in my PDF viewer.
And guess what?

No problem. No error opening the attached file.

Conclusion: It's not a PDF version conflict. Something must be wrong with the embedded encoded stream.

And indeed: The so called FlateDecoded stream in my working PDF looks completely different than the streams contained in @tjrhodes' PDFs.

Working embedded stream:

stream
xœt·cl.^Ðî]w׶mÛ¶ív׶mܵí]Û¶mÛ
[...]
endstream

Failing stream looks like a Pythonic binary string to me:

stream
b'x\x9ct\xb7cl.^\xd0\xee]w\xd7\xb6m\xdb\[...]'
endstream

I was unable to find the place where pdfrw writes the attachment's stream into the PDF, so I couldn't check whether it converts it into the wrong string type.
Instead I switched focus to zlib, and now I'm quite sure that it's not pdfrw but zlib that creates bytes instead of str.

The stream is actually produced by zlib.compressobj().compress().decode()
My zlib.ZLIB_VERSION is 1.2.11 and my zlib.compressobj().compress().decode() returns a str.
When I forcibly convert that str to bytes I can reproduce @tjrhodes' bug: the embedded stream is a Pythonic binary string and the attached file is not accessible.

The workaround consequently looks like this:

        if isinstance(pdf_file_object.stream, bytes):
            pdf_file_object.stream = str(pdf_file_object.stream)
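
A caution for Python 3 readers: there, calling `str()` on a `bytes` object returns its literal representation (`b'...'`), which has the same shape as the failing stream shown above. The sketch below contrasts that with decoding as Latin-1, on the assumption that the goal is a `str` holding exactly one character per byte. The payload is a placeholder, and this is not claimed to be the code from the PR or from the eventual fix.

```python
import zlib

# Placeholder standing in for an attachment's bytes.
payload = b"%PDF-1.4 placeholder attachment"
compressed = zlib.compress(payload)

# Python 3: str() on bytes gives the literal's repr, not the data itself.
as_repr = str(compressed)

# Latin-1 maps every byte value 0-255 to one code point, so this is lossless.
as_text = compressed.decode("latin-1")

print(as_repr[:2])                     # the literal prefix, e.g. b' or b"
print(len(as_text) == len(compressed))

# Only the Latin-1 text can be turned back into the original payload.
restored = zlib.decompress(as_text.encode("latin-1"))
print(restored == payload)
```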

Will create a PR asap.

@Tontyna
Contributor

Tontyna commented Jun 25, 2018

@tjrhodes -- since I can't reproduce the bug: Would you please test whether the PR fixes it?

Edit: No need to test, see the following comments.

@liZe
Member

liZe commented Jun 26, 2018

@Tontyna thanks a lot for your investigation! You're right, it's an encoding problem when using Python 3. It's not pdfrw's fault … but we wouldn't have this issue without pdfrw.

This issue is actually a duplicate of #558, fixed in ce84073.

I've backported ce84073 into the 0.x branch (and fixed Python 2 support, as ce84073 was a Python3-only fix).

@liZe liZe closed this as completed Jun 26, 2018
@liZe
Member

liZe commented Jun 26, 2018

I'll release 0.42.4 during the summer 🌞.
