[securedrop-export] Broaden printable filetypes to include more types #1725

Closed
rocodes opened this issue Nov 3, 2022 · 3 comments · Fixed by #2166

@rocodes
Contributor

rocodes commented Nov 3, 2022

We currently convert Office/LibreOffice files to PDF with unoconv in order to print them. unoconv (already installed in sd-devices) supports the following filetypes: [Edit: see comment below regarding moving away from unoconv]

The following list of document formats are currently available:

  bib      - BibTeX [.bib]
  doc      - Microsoft Word 97/2000/XP [.doc]
  doc6     - Microsoft Word 6.0 [.doc]
  doc95    - Microsoft Word 95 [.doc]
  docbook  - DocBook [.xml]
  docx     - Microsoft Office Open XML [.docx]
  docx7    - Microsoft Office Open XML [.docx]
  fodt     - OpenDocument Text (Flat XML) [.fodt]
  html     - HTML Document (OpenOffice.org Writer) [.html]
  latex    - LaTeX 2e [.ltx]
  mediawiki - MediaWiki [.txt]
  odt      - ODF Text Document [.odt]
  ooxml    - Microsoft Office Open XML [.xml]
  ott      - Open Document Text [.ott]
  pdf      - Portable Document Format [.pdf]
  rtf      - Rich Text Format [.rtf]
  stw      - Open Office.org 1.0 Text Document Template [.stw]
  sxw      - Open Office.org 1.0 Text Document [.sxw]
  text     - Text Encoded [.txt]
  txt      - Text [.txt]
  uot      - Unified Office Format text [.uot]
  xhtml    - XHTML Document [.html]
  epub     - Electronic Publication [.epub]

The following list of graphics formats are currently available:

  bmp      - Windows Bitmap [.bmp]
  emf      - Enhanced Metafile [.emf]
  eps      - Encapsulated PostScript [.eps]
  fodg     - OpenDocument Drawing (Flat XML) [.fodg]
  gif      - Graphics Interchange Format [.gif]
  html     - HTML Document (OpenOffice.org Draw) [.html]
  jpg      - Joint Photographic Experts Group [.jpg]
  met      - OS/2 Metafile [.met]
  odd      - OpenDocument Drawing [.odd]
  otg      - OpenDocument Drawing Template [.otg]
  pbm      - Portable Bitmap [.pbm]
  pct      - Mac Pict [.pct]
  pdf      - Portable Document Format [.pdf]
  pgm      - Portable Graymap [.pgm]
  png      - Portable Network Graphic [.png]
  ppm      - Portable Pixelmap [.ppm]
  ras      - Sun Raster Image [.ras]
  std      - OpenOffice.org 1.0 Drawing Template [.std]
  svg      - Scalable Vector Graphics [.svg]
  svm      - StarView Metafile [.svm]
  swf      - Macromedia Flash (SWF) [.swf]
  sxd      - OpenOffice.org 1.0 Drawing [.sxd]
  tiff     - Tagged Image File Format [.tiff]
  wmf      - Windows Metafile [.wmf]
  xhtml    - XHTML [.xhtml]
  xpm      - X PixMap [.xpm]

The following list of presentation formats are currently available:

  bmp      - Windows Bitmap [.bmp]
  emf      - Enhanced Metafile [.emf]
  eps      - Encapsulated PostScript [.eps]
  fodp     - OpenDocument Presentation (Flat XML) [.fodp]
  gif      - Graphics Interchange Format [.gif]
  html     - HTML Document (OpenOffice.org Impress) [.html]
  jpg      - Joint Photographic Experts Group [.jpg]
  met      - OS/2 Metafile [.met]
  odg      - ODF Drawing (Impress) [.odg]
  odp      - ODF Presentation [.odp]
  otp      - ODF Presentation Template [.otp]
  pbm      - Portable Bitmap [.pbm]
  pct      - Mac Pict [.pct]
  pdf      - Portable Document Format [.pdf]
  pgm      - Portable Graymap [.pgm]
  png      - Portable Network Graphic [.png]
  potm     - Microsoft PowerPoint 2007/2010 XML Template [.potm]
  pot      - Microsoft PowerPoint 97/2000/XP Template [.pot]
  ppm      - Portable Pixelmap [.ppm]
  pptx     - Microsoft PowerPoint 2007/2010 XML [.pptx]
  pps      - Microsoft PowerPoint 97/2000/XP (Autoplay) [.pps]
  ppt      - Microsoft PowerPoint 97/2000/XP [.ppt]
  pwp      - PlaceWare [.pwp]
  ras      - Sun Raster Image [.ras]
  sda      - StarDraw 5.0 (OpenOffice.org Impress) [.sda]
  sxd      - OpenOffice.org 1.0 Drawing (OpenOffice.org Impress) [.sxd]
  sti      - OpenOffice.org 1.0 Presentation Template [.sti]
  svg      - Scalable Vector Graphics [.svg]
  svm      - StarView Metafile [.svm]
  swf      - Macromedia Flash (SWF) [.swf]
  sxi      - OpenOffice.org 1.0 Presentation [.sxi]
  tiff     - Tagged Image File Format [.tiff]
  uop      - Unified Office Format presentation [.uop]
  wmf      - Windows Metafile [.wmf]
  xhtml    - XHTML [.xml]
  xpm      - X PixMap [.xpm]

The following list of spreadsheet formats are currently available:

  csv      - Text CSV [.csv]
  dbf      - dBASE [.dbf]
  dif      - Data Interchange Format [.dif]
  fods     - OpenDocument Spreadsheet (Flat XML) [.fods]
  html     - HTML Document (OpenOffice.org Calc) [.html]
  ods      - ODF Spreadsheet [.ods]
  ooxml    - Microsoft Excel 2003 XML [.xml]
  ots      - ODF Spreadsheet Template [.ots]
  pdf      - Portable Document Format [.pdf]
  slk      - SYLK [.slk]
  stc      - OpenOffice.org 1.0 Spreadsheet Template [.stc]
  sxc      - OpenOffice.org 1.0 Spreadsheet [.sxc]
  uos      - Unified Office Format spreadsheet [.uos]
  xhtml    - XHTML [.xhtml]
  xls      - Microsoft Excel 97/2000/XP [.xls]
  xls5     - Microsoft Excel 5.0 [.xls]
  xls95    - Microsoft Excel 95 [.xls]
  xlt      - Microsoft Excel 97/2000/XP Template [.xlt]
  xlt5     - Microsoft Excel 5.0 Template [.xlt]
  xlt95    - Microsoft Excel 95 Template [.xlt]
  xlsx     - Microsoft Excel 2007/2010 XML [.xlsx]

Adding support for more of these filetypes would be easy (see e.g. freedomofpress/securedrop-export#109), would help users (see e.g. #2008 or #2007; though those are not about printing, the same concept applies), and would not require any additional dependencies. Let's discuss which formats are useful to include as a stopgap while we figure out support for other formats such as webp, or larger changes such as freedomofpress/securedrop-workstation#842.
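As a rough illustration of how small such a change could be, here is a minimal sketch of an extension allowlist that decides whether a file can be printed directly, needs conversion to PDF first, or is unsupported. The names `CONVERTIBLE_EXTENSIONS`, `DIRECTLY_PRINTABLE`, and `print_strategy` are hypothetical and not taken from securedrop-export, and the extensions listed are just a sample from the unoconv output above, not a proposal.

```python
from pathlib import Path

# Hypothetical allowlist: a sample of extensions from the unoconv listing.
# Which formats actually belong here is what this issue asks to discuss.
CONVERTIBLE_EXTENSIONS = frozenset(
    {".doc", ".docx", ".odt", ".rtf", ".ppt", ".pptx", ".odp", ".xls", ".xlsx", ".ods"}
)

# Assumption for this sketch: PDFs go to the printer without conversion.
DIRECTLY_PRINTABLE = frozenset({".pdf"})


def print_strategy(path: Path) -> str:
    """Return "print", "convert", or "unsupported" based on the file extension."""
    ext = path.suffix.lower()  # compare case-insensitively: ".DOCX" == ".docx"
    if ext in DIRECTLY_PRINTABLE:
        return "print"
    if ext in CONVERTIBLE_EXTENSIONS:
        return "convert"
    return "unsupported"
```

With a structure like this, broadening the printable filetypes would mostly mean agreeing on the contents of the set.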

Note: unoconv is also installed in sd-viewer, if we are looking for a quick way to add additional viewable filetypes without additional upstream dependencies.

@rocodes rocodes added the "good first issue" (Good for newcomers) label Nov 3, 2022
@eaon
Contributor

eaon commented Nov 4, 2022

I am 💯 in favour of this, but I do want to point out that unoconv is deprecated. Its replacement, unoserver, is a fairly lightweight layer that basically just ensures that LibreOffice stays in memory for bulk conversion operations. I believe that for our use case it would probably be best not to rely on unoconv or deal with unoserver, and instead just use LibreOffice's headless conversion directly.

@rocodes
Contributor Author

rocodes commented Nov 8, 2022

I'm in favour. Just tested a little bit with soffice (libreoffice) --convert-to pdf. Here are some notes for our future selves:

  • .html files can be successfully converted to pdf. html files with separate image files can also be converted into a PDF, as long as they are in the expected location.
  • If there is an existing file with the same name as the desired name, the file will be overwritten 😬 This is fine for now but will be relevant when we start including more bulk actions where there could be multiple files with the same name (eg with different extensions). Our current naming strategy preserves the original extension and adds ".pdf" after it, so for now I think that's fine and I think we can resolve edge cases with duplicate or similar filenames as part of export all.
  • In some cases, even when errors are encountered during conversion, the return code is still 0, so relying on subprocess.check_call won't work. (An example of this is when a file is passed in without the correct file extension: a warning is printed to the console, the file is not converted, and the return code is 0.) A workaround could be to check for a pdf file with the name we expect, and ensure it is newer than the original file we were trying to convert.
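The workaround from the last note could be sketched roughly as follows. This is a minimal sketch, not the securedrop-export implementation: the helper names (`expected_pdf_path`, `pdf_is_fresh`, `convert_to_pdf`, `ConversionError`) and the 120-second timeout are hypothetical, and it assumes that soffice writes `<stem>.pdf` into the directory given with `--outdir`.

```python
import subprocess
from pathlib import Path


class ConversionError(Exception):
    """Raised when no fresh PDF is found after a conversion attempt."""


def expected_pdf_path(source: Path, outdir: Path) -> Path:
    # Assumption: soffice names the output after the input's stem,
    # so "memo.docx" becomes "memo.pdf" inside outdir.
    return outdir / (source.stem + ".pdf")


def pdf_is_fresh(source: Path, pdf: Path) -> bool:
    # Workaround from the thread: the PDF must exist, be non-empty,
    # and be at least as new as the file we tried to convert.
    if not pdf.is_file() or pdf.stat().st_size == 0:
        return False
    return pdf.stat().st_mtime_ns >= source.stat().st_mtime_ns


def convert_to_pdf(source: Path, outdir: Path, soffice: str = "soffice") -> Path:
    pdf = expected_pdf_path(source, outdir)
    # check=False on purpose: the exit code can be 0 even when nothing
    # was converted, so it is not used as the success signal.
    subprocess.run(
        [soffice, "--headless", "--convert-to", "pdf",
         "--outdir", str(outdir), str(source)],
        check=False,
        timeout=120,  # hypothetical limit for this sketch
    )
    if not pdf_is_fresh(source, pdf):
        raise ConversionError(f"no fresh PDF produced for {source.name}")
    return pdf
```

One caveat of the timestamp check: a stale PDF that was created after the source was last modified would still pass, so removing or renaming any pre-existing output before converting may be safer. That also touches on the overwrite behaviour noted above.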

@rocodes rocodes changed the title from "Broaden printable filetypes to include more types supported by unoconv" to "Broaden printable filetypes to include more types" Nov 9, 2022
@zenmonkeykstop zenmonkeykstop changed the title from "Broaden printable filetypes to include more types" to "[securedrop-export] Broaden printable filetypes to include more types" Dec 13, 2023
@zenmonkeykstop zenmonkeykstop transferred this issue from freedomofpress/securedrop-export Dec 13, 2023
@deeplow
Contributor

deeplow commented Jul 17, 2024

> I'm in favour. Just tested a little bit with soffice (libreoffice) --convert-to pdf. Here are some notes for our future selves:

Dangerzone does use this "libreoffice headless" approach. We could piggyback on that.
