Menu links of the downloaded site point to the on-line pages #418

Open · j-balint opened this issue Nov 11, 2024 · 2 comments

j-balint commented Nov 11, 2024

Monolith would be an ideal tool for me to download a complete website. It downloads my WordPress-based website quickly and the home page works perfectly, but unfortunately all the menu links point to the online pages. I could not get it to work fully offline.
I used it like this on Manjaro Linux/KDE:
monolith https://site-URL/ -b /home/balint/Desktop/B4X/B4X.html -o /home/balint/Desktop/B4X/B4X.html
Is this really not possible, or did I parameterize it wrong?

RaphGL commented Nov 23, 2024

Not the developer; I came here to create this same issue.

I glanced quickly at the source code and the available flags, and there doesn't seem to be any functionality for this.
The program simply walks through the page, embeds the resources it finds, and outputs a single document.

You can see here that it only resolves the anchor tag's href against the document URL, so relative menu links become absolute links back to the live site:

monolith/src/html.rs

Lines 1014 to 1036 in 2a8d5d7

"a" | "area" => {
    if let Some(anchor_attr_href_value) = get_node_attr(node, "href") {
        if anchor_attr_href_value
            .clone()
            .trim()
            .starts_with("javascript:")
        {
            if options.no_js {
                // Replace with empty JS call to preserve original behavior
                set_node_attr(node, "href", Some("javascript:;".to_string()));
            }
        } else {
            // Don't touch mailto: links or hrefs which begin with a hash sign
            if !anchor_attr_href_value.clone().starts_with('#')
                && !is_url_and_has_protocol(&anchor_attr_href_value.clone())
            {
                let href_full_url: Url =
                    resolve_url(document_url, &anchor_attr_href_value);
                set_node_attr(node, "href", Some(href_full_url.to_string()));
            }
        }
    }
}
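
For illustration, here is a standalone sketch of what that resolution effectively does (not monolith code; it assumes resolve_url behaves like url::Url::join from the url crate, and the URL is a placeholder):

```rust
// Standalone sketch (assumption: resolve_url behaves like url::Url::join).
// A relative menu href is absolutized against the document URL, which is why
// the saved file's menu still points at the live site.
use url::Url;

fn main() {
    let document_url = Url::parse("https://site-url/").unwrap();
    let menu_href = "about/"; // relative link as it appears in the page source
    let resolved = document_url.join(menu_href).unwrap();
    assert_eq!(resolved.as_str(), "https://site-url/about/");
    println!("{menu_href} -> {resolved}"); // about/ -> https://site-url/about/
}
```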

If the program recursively walked the links and built a local document tree, it would greatly increase how useful it is, imo.

snshn (Member) commented Dec 2, 2024

Hi @j-balint,

The -b option there is meant mostly for https:// URLs, e.g. to pull more resources from the internet when converting a conventionally saved file+folder HTML page locally. I think it could in theory work for file:// links. What if you try: monolith https://site-url/ -b file:///home/balint/Desktop/B4X/ -o /home/balint/Desktop/B4X/B4X.html

And also hello Ralph, you are absolutely correct, monolith is extremely dumb, but let's look on the bright side: at least it won't take over the world and travel through time to try and kill its creator, right?
Archiving child pages is something that's been requested since day one, and there are programs that do that already, but I can see how it could be handy to have every separate linked page in its own .html file, even if it means some overhead. One of the problems would be sharing one of those documents, since it would try to link locally instead of externally.
There is work being done on making it possible to use monolith as a crate (library) rather than only a stand-alone CLI, so that scrapers can follow links; that would power browser extensions and server-side software, and promote the creation of monolith-based scrapers capable of archiving whole websites. I hope it sees the light of day soon.
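
In the meantime, a rough workaround is to run the existing CLI once per page; links between the saved files will still point at the live site, which is exactly the limitation described above. A minimal sketch (the page list and output paths are made-up placeholders; it only relies on the existing monolith binary and its -o flag):

```rust
// Rough workaround sketch: call the monolith CLI once per page so every
// linked page ends up in its own .html file. Page URLs and output paths
// below are placeholders; cross-page links are NOT rewritten to local files.
use std::process::Command;

fn main() {
    let pages = [
        "https://site-url/",
        "https://site-url/about/",
        "https://site-url/contact/",
    ];
    for (i, page) in pages.iter().enumerate() {
        let output = format!("/home/balint/Desktop/B4X/page-{i}.html");
        let status = Command::new("monolith")
            .arg(page)
            .arg("-o")
            .arg(&output)
            .status()
            .expect("failed to launch monolith");
        if !status.success() {
            eprintln!("monolith exited with an error for {page}");
        }
    }
}
```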
