This is a repro for an unrecoverable UTF-8 character in a PDF generated by WeasyPrint.
Steps for reproduction:
- weasyprint example.html example.pdf
- cp example.pdf example_fixed.pdf
- Use a hex editor to fix the cmap table of example_fixed.pdf by changing
  <10006c5b> <6c5b>
  to
  <00006c5b> <6c5b>
- ./p2t.py
- See that the character has been recovered in example_fixed.txt but not in example.txt
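For those who prefer not to use a hex editor, the copy and cmap fix can be sketched in Python. This is a minimal sketch, not part of the repro itself: the names patch_cmap and patch_file are hypothetical, and it assumes the cmap entry appears uncompressed in the file, exactly as it does when viewed in a hex editor.

```python
from pathlib import Path

# Byte sequences taken from the hex-editor step above.
BROKEN = b"<10006c5b> <6c5b>"
FIXED = b"<00006c5b> <6c5b>"


def patch_cmap(data: bytes) -> bytes:
    """Return a copy of data with the broken cmap entry replaced (hypothetical helper)."""
    if BROKEN not in data:
        # Assumption: the entry is stored uncompressed; otherwise it cannot be found.
        raise ValueError("broken cmap entry not found")
    # Both sequences have the same length, so byte offsets elsewhere
    # in the PDF (e.g. the cross-reference table) stay valid.
    return data.replace(BROKEN, FIXED)


def patch_file(src: str, dst: str) -> None:
    """Read src, patch the cmap entry, and write the result to dst (hypothetical helper)."""
    Path(dst).write_bytes(patch_cmap(Path(src).read_bytes()))
```

Calling patch_file("example.pdf", "example_fixed.pdf") would then stand in for the cp and hex-editor steps.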
Step 3 (the hex-editor fix) will no longer be necessary once Kozea/WeasyPrint#1571 (comment) is fixed.
The repro requires the Python packages weasyprint (version 54.1) and pdftotext to be installed.
If you have a working Nix setup, use the provided default.nix by running
nix-shell
If you have direnv and Nix, just run
direnv allow