Feature proposition: convert web links to internal page links #103

Open
Gutesork opened this issue Feb 16, 2025 · 8 comments

@Gutesork

Hello.
How about a command like cpdf -convert-url-to-internal "http://site.org//index.php/Page_name" "Page or destination" in.pdf -o out.pdf?

Use case:
One might build a PDF from a bunch of web pages saved as PDFs, or in my case:
I have a private MediaWiki that can only be accessed from the local network. I can output pages as PDFs, but the links point to http://ip/index.php/Page_name. The original PDF has a table of contents and bookmarks, which get the format "<chapter#> Page name". I have searched the web for an easy scriptable solution to convert links to internal page links, but every solution points to a graphical program.
I could write a script that gets the URL and finds the page number in the PDF that it corresponds to, but I have no way to convert the links.

@johnwhitington
Contributor

Links are a kind of annotation. So, you can run

cpdf -list-annotations-json in.pdf > annots.json

Now you can edit the JSON annotations, replacing the link annotations with destination annotations pointing to page numbers, and run:

cpdf -remove-annotations in.pdf AND -set-annotations-json annots.json -o out.pdf

Now, this will require some reading of the PDF standard to learn about annotations, but it shouldn't be too bad. (Cpdf will automatically rewrite the page numbers into page object numbers when the annotations are read back in - in fact, this is not documented in the manual and should be. I'll leave this bug open to note that.)
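
The dump, edit, and write-back cycle described above could be driven from a small script. The following is only a sketch: the two cpdf command lines are the ones quoted in this comment, while the function names, the `edit` callback, and the default `annots.json` path are hypothetical.

```python
import json
import subprocess


def list_annotations_cmd(pdf):
    """Command line that dumps the annotations of `pdf` as JSON (from this thread)."""
    return ["cpdf", "-list-annotations-json", pdf]


def set_annotations_cmd(pdf, annots_json, out_pdf):
    """Command line that drops the old annotations and loads the edited ones."""
    return ["cpdf", "-remove-annotations", pdf, "AND",
            "-set-annotations-json", annots_json, "-o", out_pdf]


def rewrite_annotations(pdf, out_pdf, edit, annots_path="annots.json"):
    """Dump annotations, apply edit(annots) -> annots, and write them back.

    `edit` is a hypothetical callback that receives the parsed JSON and
    returns the modified structure.
    """
    dumped = subprocess.run(list_annotations_cmd(pdf), check=True,
                            capture_output=True, encoding="utf-8").stdout
    edited = edit(json.loads(dumped))
    with open(annots_path, "w", encoding="utf-8") as handle:
        json.dump(edited, handle, indent=2)
    subprocess.run(set_annotations_cmd(pdf, annots_path, out_pdf), check=True)
```

Calling `rewrite_annotations("in.pdf", "out.pdf", lambda annots: annots)` would round-trip the annotations unchanged, which may be a useful first check before any real edit is attempted.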

@Gutesork
Author

Gutesork commented Feb 17, 2025

This did not help me.
I figured out that I need to change the "/A" key to a "/Dest" key with an array of values looking like this:

      "/Dest": [
        { "I": 4 }, { "N": "/XYZ" }, { "I": 0 }, { "F": 0.0 }, { "I": 0 }
      ],

Where the '4' seems to be the page.

Tried to figure out how to change keys using jq, but the leading slash gives me a jq: error (at <stdin>:44): Cannot index array with string "/A". Idk what to do or how to do it. I never worked with json before, so this is all new to me.
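
The jq message suggests that the filter applied the string key "/A" to an array, so the leading slash is probably not the culprit. Assuming the dump is a top-level array whose entries are themselves arrays of the form [page, object number, annotation dictionary], the dictionary has to be reached through index 2 of each entry first (in jq, presumably something like `.[] | .[2] | .["/A"]`, untested). A minimal Python sketch of the same traversal follows; the sample entry is abbreviated, its URL is illustrative, and the function name is hypothetical.

```python
import json

# One entry in the assumed layout: [page number, object number, annotation dictionary].
SAMPLE = json.loads("""
[
  [2, 30, {
    "/Type": { "N": "/Annot" },
    "/A": { "/S": { "N": "/URI" },
            "/URI": { "U": "http://192.168.1.190/index.php/Page_name" } },
    "/Subtype": { "N": "/Link" }
  }]
]
""")


def uri_links(annots):
    """Return (page, object number, URI) for each URI link annotation."""
    found = []
    for entry in annots:
        # Skip anything that is not a [page, objnum, dict] triple.
        if not (isinstance(entry, list) and len(entry) == 3
                and isinstance(entry[2], dict)):
            continue
        page, objnum, annot = entry
        action = annot.get("/A")
        if isinstance(action, dict) and action.get("/S") == {"N": "/URI"}:
            uri = action.get("/URI")
            if isinstance(uri, dict):
                # "U" is the string wrapper seen in this thread;
                # "S" is an assumed alternative wrapper.
                found.append((page, objnum, uri.get("U", uri.get("S"))))
    return found


print(uri_links(SAMPLE))
```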

@johnwhitington
Contributor

I've never used jq either - I just know it's the most popular JSON processor.

I think the first thing you should do is get it working manually. Edit the JSON in a text editor and get the process working. Then worry about jq.

Have another go, and if you get stuck you can attach to this bug an example PDF file.

@Gutesork
Author

Gutesork commented Feb 19, 2025

Sorry for my frustration earlier. I got it to work by:

From -list-bookmarks-json:

{
  "level": 0,
  "text": "Appendix",
  "page": 3,
  "open": false,
  "target": [
    { "I": 3 }, { "N": "/XYZ" }, { "F": 0.0 }, { "F": 836.0 }, { "F": 0.0 }
  ],
  "colour": [ 0.0, 0.0, 0.0 ],
  "italic": false,
  "bold": false
}

Change annots from

[
  2,
  30,
  {
    "/Type": { "N": "/Annot" },
    "/Rect": [
      { "I": 64 }, { "F": 480.3 }, { "F": 137.7 }, { "F": 577.4 }
    ],
    "/Border": [ { "I": 0 }, { "I": 0 }, { "I": 0 } ],
    "/A": {
      "/S": { "N": "/URI" },
      "/URI": {
        "U": "http://192.168.1.190/index.php/File:Alve_Rinblad_-Norrbackaskolan-_Rektorsbrev_2025-01-24.pdf"
      }
    },
    "/Subtype": { "N": "/Link" }
  }
],

To:

[
  2,
  30,
  {
    "/Type": { "N": "/Annot" },
    "/Rect": [
      { "I": 64 }, { "F": 480.3 }, { "F": 137.7 }, { "F": 577.4 }
    ],
    "/Border": [ { "I": 0 }, { "I": 0 }, { "I": 0 } ],
    "/Dest": [ { "I": 3 }, { "N": "/XYZ" }, { "F": 0.0 }, { "F": 836.0 }, { "F": 0.0 } ],
    "/Subtype": { "N": "/Link" }
  }
],
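
The manual edit shown above (drop the "/A" action, add a "/Dest" array copied from the bookmark's "target") could be expressed as a small function. This is a sketch of that one step only, not a tested cpdf workflow; the function name is hypothetical and the data is taken from the example in this comment.

```python
import json


def uri_link_to_dest(annot, target):
    """Return a copy of a link annotation with "/A" removed and "/Dest" set."""
    converted = {key: value for key, value in annot.items() if key != "/A"}
    converted["/Dest"] = list(target)
    return converted


# Annotation dictionary and bookmark target as quoted in this thread.
BEFORE = {
    "/Type": {"N": "/Annot"},
    "/Rect": [{"I": 64}, {"F": 480.3}, {"F": 137.7}, {"F": 577.4}],
    "/Border": [{"I": 0}, {"I": 0}, {"I": 0}],
    "/A": {
        "/S": {"N": "/URI"},
        "/URI": {"U": "http://192.168.1.190/index.php/File:Alve_Rinblad_-Norrbackaskolan-_Rektorsbrev_2025-01-24.pdf"},
    },
    "/Subtype": {"N": "/Link"},
}
TARGET = [{"I": 3}, {"N": "/XYZ"}, {"F": 0.0}, {"F": 836.0}, {"F": 0.0}]

AFTER = uri_link_to_dest(BEFORE, TARGET)
print(json.dumps(AFTER, indent=2))
```

The function returns a new dictionary rather than editing in place, so the original dump stays available for comparison while the conversion is being checked.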

@Gutesork
Author

Let's say I make the bookmark "text" the same as the filename. Does anyone have any tips on how to parse this in a bash shell script?

@johnwhitington
Contributor

I'm afraid I can't work out what you mean. Could you explain?

@a-gss

a-gss commented Feb 19, 2025

Your problem looks like it could be easily solved with a little bit of regex magic and perl commands.
I drafted a little script, but I'm really not sure about the perl part since I can't test it. Good luck!

#!/bin/bash

input="in.pdf"

# Extract bookmarks and annotations JSON
cpdf -list-bookmarks-json "$input" > bookmarks.json
cpdf -list-annotations-json "$input" > annots.json

# 1. Parse the bookmarks/page numbers into arrays
declare -a texts  # Array to store bookmark names
declare -a pages  # Array to store page numbers

while read -r line; do
    text=$(echo "$line" | perl -nle 'print $1 if /"text":\s*"([^"]+)"/')
    if [[ -n "$text" ]]; then
        texts+=("$text")  # Store bookmark text
    fi

    page=$(echo "$line" | perl -nle 'print $1 if /"page":\s*(\d+)/')
    if [[ -n "$page" ]]; then
        pages+=("$page")  # Store page number
    fi
done < bookmarks.json

for i in "${!texts[@]}"; do
    echo "${texts[$i]} -> page ${pages[$i]}"
done

# 2. Modify annotations based on the bookmarks
# Hand the bookmark table to perl as "text<TAB>page" lines in an environment variable
bookmark_map=$(for i in "${!texts[@]}"; do printf '%s\t%s\n' "${texts[$i]}" "${pages[$i]}"; done)

BOOKMARK_MAP="$bookmark_map" perl -0777 -pe '
    BEGIN {
        # Build a lookup from bookmark text to page number
        for my $line (split /\n/, ($ENV{BOOKMARK_MAP} // "")) {
            my ($text, $page) = split /\t/, $line, 2;
            $page_of{$text} = $page;
        }
    }

    # Rewrite each wiki URL whose page name matches a bookmark; leave the rest untouched
    s{http://[^/"]+/index\.php/([^"/]+)}{
        exists $page_of{$1} ? "#page=$page_of{$1}" : $&
    }ge;
' annots.json > modified_annots.json

# 3. Apply the modified annotations to the PDF
cpdf -remove-annotations "$input" AND -set-annotations-json modified_annots.json -o out.pdf

@Gutesork
Author

Wow. Thank you for all of your time.
At first I thought "oh no.. regex. and perl? I don't know any perl". Then I found this excellent tool: https://regex101.com/r/j32mE7/1
and then I got excited. I can sprinkle some regex over a couple of sed and grep calls in some functions, and at the end I'll get what I want.

a-gss >> thanks a lot for getting me on the right track, although you misunderstood me. My thought was to first take the basename of the file from the http link and search for it in the bookmarks to get the corresponding "target" value, then edit the http link to be a "/Dest" instead.

Then I'll have to add some conditions, such as the file extension (pdf) and what the start of the URL looks like.
I played around with regex and came up with some cool stuff. I'll attach examples.
I plan on using it with

${BASH_REMATCH[0]}: The entire string that was matched.
${BASH_REMATCH[1]}: The first parenthesized subexpression match.
${BASH_REMATCH[2]}: The second parenthesized subexpression match.

And so on…
like explained here:
https://medium.com/@linuxadminhacks/understanding-the-bash-rematch-in-bash-183a5b065081
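
The plan described in this comment (take the last path component of the link, look it up among the bookmark texts, and reuse that bookmark's "target") might look roughly like the sketch below in Python, where `match.group(1)` plays the role of `${BASH_REMATCH[1]}`. Several things here are assumptions rather than facts from the thread: the URL shape `http://<host>/index.php/<name>`, the idea that underscores in the URL stand for spaces in the bookmark text, exact matching on the bookmark "text", and both function names.

```python
import re
from urllib.parse import unquote

# Assumed URL shape: http(s)://<host>/index.php/<page or file name>
URL_RE = re.compile(r"^https?://[^/]+/index\.php/([^/?#]+)$")


def page_name_from_url(url):
    """Return the name after index.php/, or None if the URL has another shape."""
    match = URL_RE.match(url)
    if match is None:
        return None
    # Assumption: underscores in the URL correspond to spaces in the title.
    return unquote(match.group(1)).replace("_", " ")


def find_target(url, bookmarks):
    """Return the "target" of the bookmark whose text equals the page name."""
    name = page_name_from_url(url)
    if name is None:
        return None
    for bookmark in bookmarks:
        if bookmark.get("text") == name:
            return bookmark.get("target")
    return None


# Bookmark entry abbreviated from the -list-bookmarks-json output quoted earlier.
BOOKMARKS = [
    {"text": "Appendix", "page": 3,
     "target": [{"I": 3}, {"N": "/XYZ"}, {"F": 0.0}, {"F": 836.0}, {"F": 0.0}]},
]

print(find_target("http://192.168.1.190/index.php/Appendix", BOOKMARKS))
```

If the bookmark texts keep the "<chapter#> Page name" format, the equality test would need to be relaxed, for example by stripping the chapter prefix before comparing.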

Cheers!

bookmarks_regex.pdf
annots_regex.pdf
