Feature proposition: convert web links to internal page links #103
Comments
Links are a kind of annotation. So, you can run
Now you can edit the JSON annotations, replacing the link annotations with destination annotations pointing to page numbers, and run:
Now, this will require some reading of the PDF standard to learn about annotations, but it shouldn't be too bad. (Cpdf will automatically rewrite the page numbers into page object numbers when the annotations are read back in. In fact, this is not documented in the manual and should be. I'll leave this bug open to note that.)
This did not help me.
Where the '4' seems to be the page. Tried to figure out how to change keys using jq, but the leading slash gives me a |
I've never used it. I think the first thing you should do is get it working manually: edit the JSON in a text editor and get the process working. Then worry about automating it. Have another go, and if you get stuck you can attach an example PDF file to this bug.
Sorry for my frustration earlier. I got it to work by: From --list-bookmarks-json : { Change annots from [ To: [ |
Let's say I make it so that the bookmark "text": is the same as the filename. Does anyone have any tips on how to parse this in a bash shell script?
I'm afraid I can't work out what you mean. Could you explain?
Your problem looks like it could be easily solved with a little bit of regex magic and Perl commands.
#!/bin/bash
input="in.pdf"
# Extract bookmarks and annotations JSON
cpdf -list-bookmarks-json "$input" > bookmarks.json
cpdf -list-annotations-json "$input" > annots.json
# 1. Parse the bookmarks/page numbers into arrays
declare -a texts # Array to store bookmark names
declare -a pages # Array to store page numbers
while read -r line; do
text=$(echo "$line" | perl -nle 'print $1 if /"text":\s*"([^"]+)"/')
if [[ -n "$text" ]]; then
texts+=("$text") # Store bookmark text
fi
page=$(echo "$line" | perl -nle 'print $1 if /"page":\s*(\d+)/')
if [[ -n "$page" ]]; then
pages+=("$page") # Store page number
fi
done < bookmarks.json
for i in "${!texts[@]}"; do
echo "${texts[$i]} -> page ${pages[$i]}"
done
# 2. Modify Annotations based on the bookmarks
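# Write the lookup table to a file; splicing bash arrays into the Perl source breaks quoting.
for i in "${!texts[@]}"; do
  printf '%s\t%s\n' "${texts[$i]}" "${pages[$i]}"
done > pagemap.tsv
perl -0777 -pe '
  BEGIN {
    local $/ = "\n";   # -0777 sets slurp mode; read the table line by line
    open(my $fh, "<", "pagemap.tsv") or die "pagemap.tsv: $!";
    while (my $row = <$fh>) {
      chomp $row;
      my ($text, $page) = split /\t/, $row, 2;
      $page_for{$text} = $page;
    }
    close $fh;
  }
  # Rewrite only URLs whose last path component matches a bookmark text.
  # $1 is read-only, so build the replacement string instead of editing it.
  s~("/A":\s*\{[^}]+?"U":\s*")http://[^/"]+/index\.php/([^"/]+)(")~
    exists $page_for{$2} ? "$1#page=$page_for{$2}$3" : $&
  ~ge
' annots.json > modified_annots.json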
# 3. Apply the modified annotations to the PDF
cpdf -remove-annotations "$input" AND -set-annotations-json modified_annots.json -o out.pdf
Wow. Thank you for all of your time. a-gss, thanks a lot for getting me on the right track, although you misunderstood me. My thought was to first take the basename of the file from the http link and search for it in the bookmarks to get the corresponding "target" value, and then edit the http link to be a "/Dest" instead. Then I'll have to add some conditions, such as the file extension (PDF), what the start of the URL looks like, and so on.
Cheers!
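The lookup described in this comment could be sketched in shell roughly as follows. This is only a sketch under stated assumptions: the function names `page_name_from_url` and `lookup_page`, the `bookmark_table` variable and its two sample rows are all hypothetical, and the table holds page numbers where a real script might store the bookmark's "target" value instead.

```shell
#!/bin/bash
# Hypothetical lookup table: "<page> <bookmark name>" per line.
# In a real script this would be filled from the bookmark listing.
bookmark_table='4 Page_name
9 Other_page'

# Print the last path component of a URL, without ?query or #fragment.
page_name_from_url() (
  url=$1
  url=${url%%[?#]*}   # drop query string and fragment
  url=${url%/}        # drop one trailing slash
  printf '%s\n' "${url##*/}"
)

# Print the page for a wiki link; fail if the link is not convertible.
lookup_page() (
  case $1 in
    http://*/index.php/*) ;;   # assumed shape of a wiki link
    *) return 1 ;;
  esac
  wanted=$(page_name_from_url "$1")
  while read -r page name; do
    if [ "$name" = "$wanted" ]; then
      printf '%s\n' "$page"
      return 0
    fi
  done <<EOF
$bookmark_table
EOF
  return 1
)

lookup_page "http://ip/index.php/Page_name"   # prints 4
```

Links that fail either check (wrong URL prefix, or no matching bookmark) make `lookup_page` return non-zero, so a calling script can leave those annotations untouched.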
Hello.
How about a command like
cpdf -convert-url-to-internal "http://site.org//index.php/Page_name" "Page or destination" in.pdf -o out.pdf
?
Use case:
One might make a PDF from a bunch of web pages saved as PDF, or in my case:
I have a private MediaWiki that can only be accessed from the local network. I can output pages as PDFs, but the links point to http://ip/index.php/Page_name. The original PDF has a table of contents and bookmarks, which have the format "<chapter#> Page name".
I have searched the web for an easy scriptable solution to convert links to internal page links, but every solution points to a graphical program. I could write a script that gets the URL and finds the page number in the PDF that it corresponds to, but I have no way to convert the links.
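Matching a link to a bookmark means bringing the two naming schemes together first. Here is a minimal sketch, assuming the chapter prefix is a plain dotted number such as "3" or "3.2" followed by a space (the post does not show the exact format) and relying on MediaWiki writing spaces in page titles as underscores in URLs. The function name `bookmark_to_page_name` is hypothetical.

```shell
#!/bin/bash
# Turn a bookmark title such as "3.2 Page name" into the URL form "Page_name".
bookmark_to_page_name() (
  title=$1
  # Strip a leading chapter number ("3", "3.2", "3.2.1", optional final dot)
  # together with the blanks that follow it.
  title=$(printf '%s\n' "$title" | sed -E 's/^[0-9]+(\.[0-9]+)*\.?[[:space:]]+//')
  # Replace spaces with underscores, as in the wiki URL.
  printf '%s\n' "$title" | tr ' ' '_'
)

bookmark_to_page_name "3.2 Page name"   # prints Page_name
```

Titles without a numeric prefix pass through with only the space-to-underscore change, so the same function can be applied to every bookmark.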