Feature proposition: convert web links to internal page links #103

Gutesork · 2025-02-16T21:37:25Z

Hello.
How about a command like cpdf -convert-url-to-internal "http://site.org//index.php/Page_name" "Page or destination" in.pdf -o out.pdf?

Use case:
One might make a PDF from a bunch of web pages saved as PDFs, or in my case:
I have a private MediaWiki that can only be accessed from the local network. I can output pages as PDFs, but the links direct to http://ip/index.php/Page_name. The original PDF has a table of contents and bookmarks, which get the format "<chapter#> Page name". I have searched the web for an easy scriptable solution to convert links to internal page links, but every solution points to a graphical program.
I could make a script that gets the URL, and find a way to get the page number in the PDF that it corresponds to, but I have no way to convert the links.

johnwhitington · 2025-02-17T02:40:11Z

Links are a kind of annotation. So, you can run

cpdf -list-annotations-json in.pdf > annots.json

Now you can edit the JSON annotations, replacing the link annotations with destination annotations pointing to page numbers, and run:

cpdf -remove-annotations in.pdf AND -set-annotations-json annots.json -o out.pdf

Now, this will require some reading of the PDF standard to learn about annotations, but it shouldn't be too bad. (Cpdf will automatically rewrite the page numbers into page object numbers when the annotations are read back in - in fact, this is not documented in the manual and should be. I'll leave this bug open to note that.)
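
As a rough illustration of the first step, the sketch below reads annotation JSON of the kind written by -list-annotations-json and lists the link annotations that carry a URI action. It assumes each entry is a [page, object number, dictionary] triple with strings wrapped as { "U": ... }, which is the shape shown in the examples later in this thread. The function name uri_links and the sample data are made up for illustration.

```python
import json

def uri_links(annots):
    """Return (page, url) pairs for every annotation with a URI action.

    Assumes the layout seen in this thread: each entry is
    [page, object_number, annotation_dict].
    """
    found = []
    for entry in annots:
        # Skip anything that is not a [page, objnum, dict] triple.
        if not (isinstance(entry, list) and len(entry) == 3
                and isinstance(entry[2], dict)):
            continue
        page, _objnum, annot = entry
        action = annot.get("/A")
        if not isinstance(action, dict):
            continue
        if action.get("/S") != {"N": "/URI"}:
            continue
        uri = action.get("/URI")
        url = uri.get("U") if isinstance(uri, dict) else None
        if url is not None:
            found.append((page, url))
    return found

# Hypothetical sample in the same shape as the annotations in this thread.
sample = json.loads("""
[
  [2, 30, {
    "/Type": {"N": "/Annot"},
    "/A": {"/S": {"N": "/URI"},
           "/URI": {"U": "http://192.168.1.190/index.php/Page_name"}},
    "/Subtype": {"N": "/Link"}
  }],
  [3, 31, {"/Type": {"N": "/Annot"}, "/Subtype": {"N": "/Text"}}]
]
""")

for page, url in uri_links(sample):
    print(page, url)
```

With a real file, the sample would be replaced by the result of json.load on annots.json.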

Gutesork · 2025-02-17T22:59:17Z

This did not help me.
I figured out that I need to change the "/A" key to a "/Dest" key with an array of values looking like this:

      "/Dest": [
        { "I": 4 }, { "N": "/XYZ" }, { "I": 0 }, { "F": 0.0 }, { "I": 0 }
      ],

Where the '4' seems to be the page.

I tried to figure out how to change keys using jq, but the leading slash gives me this error: jq: error (at <stdin>:44): Cannot index array with string "/A". I don't know what to do or how to do it. I have never worked with JSON before, so this is all new to me.
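
The jq message probably comes from the file's shape rather than from the leading slash. Judging by the annotation examples in this thread, each entry is an array of [page, object number, dictionary], so "/A" has to be looked up on the third element, not on the entry itself. The same mistake in Python fails in a comparable way, which the sketch below demonstrates on a made-up entry. It is only an analogy and says nothing about how jq works internally.

```python
# A made-up entry in the [page, object_number, dictionary] shape.
entry = [
    2,
    30,
    {
        "/Subtype": {"N": "/Link"},
        "/A": {"/S": {"N": "/URI"},
               "/URI": {"U": "http://ip/index.php/Page_name"}},
    },
]

# Indexing the entry itself with a string fails, much like the jq error,
# because the entry is a list and not a dictionary.
try:
    entry["/A"]
    failed = False
except TypeError:
    failed = True

# The annotation dictionary is the third element, so index that instead.
action = entry[2]["/A"]
url = action["/URI"]["U"]
print(failed, url)
```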

johnwhitington · 2025-02-18T10:19:27Z

I've never used jq either - I just know it's the most popular JSON processor.

I think the first thing you should do is get it working manually. Edit the JSON in a text editor and get the process working. Then worry about jq.

Have another go, and if you get stuck you can attach to this bug an example PDF file.

Gutesork · 2025-02-19T00:45:48Z

Sorry for my frustration earlier. I got it to work by:

From -list-bookmarks-json:

{
  "level": 0,
  "text": "Appendix",
  "page": 3,
  "open": false,
  "target": [
    { "I": 3 }, { "N": "/XYZ" }, { "F": 0.0 }, { "F": 836.0 }, { "F": 0.0 }
  ],
  "colour": [ 0.0, 0.0, 0.0 ],
  "italic": false,
  "bold": false
}

Change annots from

[
  2,
  30,
  {
    "/Type": { "N": "/Annot" },
    "/Rect": [
      { "I": 64 }, { "F": 480.3 }, { "F": 137.7 }, { "F": 577.4 }
    ],
    "/Border": [ { "I": 0 }, { "I": 0 }, { "I": 0 } ],
    "/A": {
      "/S": { "N": "/URI" },
      "/URI": {
        "U": "http://192.168.1.190/index.php/File:Alve_Rinblad_-Norrbackaskolan-_Rektorsbrev_2025-01-24.pdf"
      }
    },
    "/Subtype": { "N": "/Link" }
  }
],

To:

[
  2,
  30,
  {
    "/Type": { "N": "/Annot" },
    "/Rect": [
      { "I": 64 }, { "F": 480.3 }, { "F": 137.7 }, { "F": 577.4 }
    ],
    "/Border": [ { "I": 0 }, { "I": 0 }, { "I": 0 } ],
    "/Dest": [ { "I": 3 }, { "N": "/XYZ" }, { "F": 0.0 }, { "F": 836.0 }, { "F": 0.0 } ],
    "/Subtype": { "N": "/Link" }
  }
],
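
The manual edit above amounts to dropping the "/A" action and putting the bookmark's "target" array in its place under "/Dest". A minimal Python sketch of that single step follows. The helper name link_to_dest is hypothetical, and the data is copied from the example above with a shortened placeholder URL.

```python
import copy

def link_to_dest(entry, target):
    """Return a copy of a [page, objnum, dict] entry with "/A" replaced by "/Dest".

    Key order is preserved, so "/Dest" ends up where "/A" used to be.
    Entries without an "/A" key come back as an unchanged copy.
    """
    page, objnum, annot = entry
    converted = {}
    for key, value in annot.items():
        if key == "/A":
            converted["/Dest"] = copy.deepcopy(target)
        else:
            converted[key] = copy.deepcopy(value)
    return [page, objnum, converted]

# The bookmark "target" and the link annotation from the example above.
target = [{"I": 3}, {"N": "/XYZ"}, {"F": 0.0}, {"F": 836.0}, {"F": 0.0}]
before = [
    2,
    30,
    {
        "/Type": {"N": "/Annot"},
        "/Rect": [{"I": 64}, {"F": 480.3}, {"F": 137.7}, {"F": 577.4}],
        "/Border": [{"I": 0}, {"I": 0}, {"I": 0}],
        "/A": {"/S": {"N": "/URI"},
               "/URI": {"U": "http://192.168.1.190/index.php/Example"}},
        "/Subtype": {"N": "/Link"},
    },
]

after = link_to_dest(before, target)
print(list(after[2].keys()))
```

Building a new dictionary instead of editing in place keeps the original entry available for comparison, which helps when checking the result by hand.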

Gutesork · 2025-02-19T02:14:35Z

Let's say I make it so that the bookmark "text": is the same as the filename. Does anyone have any tips on how to parse this in a bash shell script?

johnwhitington · 2025-02-19T08:58:02Z

I'm afraid I can't work out what you mean. Could you explain?

a-gss · 2025-02-19T09:30:04Z

Your problem looks like it could be easily solved with a little bit of regex magic and Perl commands.
I drafted a little script, but I'm really not sure about the Perl part since I can't test it. Good luck!

#!/bin/bash

input="in.pdf"

# Extract bookmarks and annotations JSON
cpdf -list-bookmarks-json "$input" > bookmarks.json
cpdf -list-annotations-json "$input" > annots.json

# 1. Parse the bookmarks/page numbers into arrays
declare -a texts  # Array to store bookmark names
declare -a pages  # Array to store page numbers

while read -r line; do
    text=$(echo "$line" | perl -nle 'print $1 if /"text":\s*"([^"]+)"/')
    if [[ -n "$text" ]]; then
        texts+=("$text")  # Store bookmark text
    fi

    page=$(echo "$line" | perl -nle 'print $1 if /"page":\s*(\d+)/')
    if [[ -n "$page" ]]; then
        pages+=("$page")  # Store page number
    fi
done < bookmarks.json

for i in "${!texts[@]}"; do
    echo "${texts[$i]} -> page ${pages[$i]}"
done

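# 2. Modify annotations based on the bookmarks
# Pass the lookup table to perl as "name<TAB>page" lines in an environment
# variable, since splicing bash arrays into perl source does not yield valid perl.
BOOKMARK_MAP=""
for i in "${!texts[@]}"; do
    BOOKMARK_MAP+="${texts[$i]}"$'\t'"${pages[$i]}"$'\n'
done
export BOOKMARK_MAP

perl -0777 -pe '
    BEGIN {
        for my $entry (split /\n/, $ENV{BOOKMARK_MAP}) {
            my ($name, $page) = split /\t/, $entry;
            $page_of{$name} = $page;
        }
    }

    # Replace a whole "/A" URI action with a "/Dest" array when the page name
    # matches a bookmark. $1 is read-only, so a new string is built instead.
    # The /Fit destination type is an assumption; adjust it as needed.
    s~"/A":\s*\{\s*"/S":\s*\{\s*"N":\s*"/URI"\s*\},\s*
      "/URI":\s*\{\s*"U":\s*"http://[^/"]+/index\.php/([^"/]+)"\s*\}\s*\}~
        exists $page_of{$1}
            ? qq("/Dest": [ { "I": $page_of{$1} }, { "N": "/Fit" } ])
            : $&
    ~gex
' annots.json > modified_annots.json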

# 3. Apply the modified annotations to the PDF
cpdf -remove-annotations "$input" AND -set-annotations-json modified_annots.json -o out.pdf

Gutesork · 2025-02-20T22:36:27Z

Wow. Thank you for all of your time.
At first I thought "oh no... regex. And Perl? I don't know any Perl". Then I found this excellent tool: https://regex101.com/r/j32mE7/1
and then I got excited. I can sprinkle some regex over a couple of sed and grep calls in some functions, and at the end I'll get what I want.

a-gss >> Thanks a lot for getting me on the right track, although you misunderstood me. My thought was to first take the basename of the file from the http link and search for it in the bookmarks to get the corresponding "target" value, and then edit the http link to be a "/Dest" instead.

Then I'll have to add some conditions, such as which file extension the link has (PDF), what the start of the URL looks like, and so on.
I played around with regex and came up with some cool stuff. I'll attach examples.
I plan on using it with

${BASH_REMATCH[0]}: The entire string that was matched.
${BASH_REMATCH[1]}: The first parenthesized subexpression match.
${BASH_REMATCH[2]}: The second parenthesized subexpression match.

And so on, as explained here:
https://medium.com/@linuxadminhacks/understanding-the-bash-rematch-in-bash-183a5b065081
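
The plan described above (take the basename from the link, look it up among the bookmarks, and use the matching "target" as the "/Dest") could be sketched roughly as follows. Python's match groups play the role of BASH_REMATCH here. The URL pattern, the .pdf condition, the helper names, and the assumption that a bookmark's "text" equals the link's basename (an idea floated earlier in this thread) are illustrative guesses, not confirmed behaviour.

```python
import re

# group(0): whole URL, group(1): host, group(2): basename after /index.php/
# (an optional MediaWiki "File:" prefix is skipped). The pattern is a guess.
LINK_RE = re.compile(r"^http://([^/]+)/index\.php/(?:File:)?([^/]+)$")

def uri_of(annot):
    """Return the URI string of an annotation dict, or None if it has none."""
    action = annot.get("/A")
    uri = action.get("/URI") if isinstance(action, dict) else None
    url = uri.get("U") if isinstance(uri, dict) else None
    return url if isinstance(url, str) else None

def rewrite_links(annots, bookmarks, extension=".pdf"):
    """Swap matching URI actions for "/Dest" arrays taken from the bookmarks.

    Returns a new list; entries that do not match are passed through as-is.
    """
    targets = {b["text"]: b["target"] for b in bookmarks}
    result = []
    for entry in annots:
        new_entry = entry
        if (isinstance(entry, list) and len(entry) == 3
                and isinstance(entry[2], dict)):
            page, objnum, annot = entry
            match = LINK_RE.match(uri_of(annot) or "")
            if match:
                basename = match.group(2)
                if basename.lower().endswith(extension) and basename in targets:
                    # Rebuild the dict so "/Dest" takes the place of "/A".
                    converted = {}
                    for key, value in annot.items():
                        if key == "/A":
                            converted["/Dest"] = targets[basename]
                        else:
                            converted[key] = value
                    new_entry = [page, objnum, converted]
        result.append(new_entry)
    return result

# Made-up data in the shapes shown earlier in this thread.
bookmarks = [{"text": "Report.pdf", "page": 3,
              "target": [{"I": 3}, {"N": "/XYZ"}, {"F": 0.0},
                         {"F": 836.0}, {"F": 0.0}]}]
annots = [
    [2, 30, {"/A": {"/S": {"N": "/URI"},
                    "/URI": {"U": "http://192.168.1.190/index.php/File:Report.pdf"}},
             "/Subtype": {"N": "/Link"}}],
    [2, 31, {"/A": {"/S": {"N": "/URI"}, "/URI": {"U": "http://example.org/"}},
             "/Subtype": {"N": "/Link"}}],
]

for entry in rewrite_links(annots, bookmarks):
    print(entry)
```

Links that do not match the pattern, or whose basename has no bookmark, are left untouched, so external links keep working.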

Cheers!

bookmarks_regex.pdf
annots_regex.pdf
