-
Gutenberg blocks (and theming) are fairly opinionated about how CSS is applied, which leads me to the following questions:
-
Noting this interesting experiment, originally shared in the #data-liberation Slack channel, related to HTML importing.
-
tl;dr: HTML markup doesn't carry enough information to reasonably recreate the original site. Let's explore a browser extension to extract information from a rendered website instead.
The same HTML markup may represent every existing block and more, depending on the CSS and JS around it. You just won’t know without rendering it first. A live page can answer questions the markup alone can't: What content is actually visible? How large is it? How large relative to the rest of the page? What’s the background color? Are there two columns side by side? Do they stack when the viewport is resized? How closely does the layout match a visual pattern like “header, paragraph, author name”?
Imagine a tool that recognizes visual patterns the way a person would: this is a headline, that is author info, this looks like two columns, and so on. Even a very limited tool would go a long way. It could recover not just all the text from a site, but also a lot of information about blocks and layout, and most of the structure and hierarchy.
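For illustration, here's a minimal sketch of one such heuristic running against the live DOM, say from a browser extension's content script. The selector and the 50% threshold are illustrative assumptions, not an existing implementation:

```ts
// Hypothetical heuristic: do two sibling elements render as
// side-by-side columns in the live DOM?
function areSideBySideColumns(a: Element, b: Element): boolean {
  const ra = a.getBoundingClientRect();
  const rb = b.getBoundingClientRect();
  // How much of their vertical extent do they share?
  const verticalOverlap =
    Math.min(ra.bottom, rb.bottom) - Math.max(ra.top, rb.top);
  // Side by side: they overlap vertically for most of the shorter
  // element's height and occupy disjoint horizontal ranges.
  return (
    verticalOverlap > 0.5 * Math.min(ra.height, rb.height) &&
    (ra.right <= rb.left || rb.right <= ra.left)
  );
}

// Usage: test the children of a candidate container
// ("main > div" is a placeholder selector).
const container = document.querySelector("main > div");
if (container && container.children.length === 2) {
  const [first, second] = Array.from(container.children);
  console.log("two columns?", areSideBySideColumns(first, second));
}
```

Resizing the viewport and re-running the same check would also tell you whether the columns stack, which is exactly the kind of signal the raw markup can't give you.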
Visual web scrapers have been around for a while. HTML to Framer solves a similar problem. This approach works, and I've seen the markup-based approach fail more than once.
Web hosts change their markup faster than importer developers can keep up. Without a visual tool, we'll need a large number of host-specific importers, which would be a maintenance burden.
What about hidden content that can be read from the markup, e.g. navigation menus? You can't get that from an image.
From an image you can't, but I'm talking about using the live DOM. The DOM gives you a lot of information: you can reason about what's hidden, what reveals more data when you click it, and so on. For something like an API-based React site, you won't get the menus from either the markup or the image, but a browser extension can poke around the page, maybe even click something for you, and get the information.
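A rough sketch of that "poke around" idea, again as content-script code. The `nav a` and `.menu-toggle` selectors are hypothetical placeholders for whatever the page actually uses:

```ts
// Collect navigation links from the live DOM, noting which ones
// are present in the markup but not currently visible.
function collectNavLinks(): { href: string; text: string; hidden: boolean }[] {
  return Array.from(
    document.querySelectorAll<HTMLAnchorElement>("nav a")
  ).map((a) => {
    const style = getComputedStyle(a);
    const rect = a.getBoundingClientRect();
    const hidden =
      style.display === "none" ||
      style.visibility === "hidden" ||
      rect.width === 0 ||
      rect.height === 0;
    return { href: a.href, text: a.textContent?.trim() ?? "", hidden };
  });
}

// If every link is hidden, try opening a hamburger menu first,
// then collect again once the menu has rendered.
const links = collectNavLinks();
if (links.length > 0 && links.every((l) => l.hidden)) {
  document.querySelector<HTMLElement>(".menu-toggle")?.click();
}
```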
Users would need to be heavily engaged, committed, and never close their browser windows.
For most sites with 10–20 pages, the process should take under a minute. Open all the pages, transform the content, save a zip file or open it in Playground, done.
The importer could also be resumable. It would stream its progress to a local file and just auto-resume the next time it's opened. If there's a fresh remote site on the other end of the process, it could clearly communicate “I'm disconnected, I'll continue syncing when you're online”.
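A minimal sketch of what that resumable state could look like, assuming a Manifest V3 extension with the `storage` permission. Here `chrome.storage.local` stands in for the local file mentioned above, and the data shape is a guess:

```ts
// Hypothetical per-site import progress, persisted after every page
// so a closed window only loses the page currently in flight.
interface ImportProgress {
  siteUrl: string;
  pending: string[];   // page URLs not yet captured
  completed: string[]; // page URLs already transformed
}

async function loadProgress(siteUrl: string): Promise<ImportProgress> {
  const stored = await chrome.storage.local.get(siteUrl);
  return (
    (stored[siteUrl] as ImportProgress | undefined) ?? {
      siteUrl,
      pending: [],
      completed: [],
    }
  );
}

async function markPageDone(siteUrl: string, pageUrl: string): Promise<void> {
  const progress = await loadProgress(siteUrl);
  progress.pending = progress.pending.filter((u) => u !== pageUrl);
  if (!progress.completed.includes(pageUrl)) progress.completed.push(pageUrl);
  await chrome.storage.local.set({ [siteUrl]: progress });
}
```

On startup, the extension would call `loadProgress` and simply continue from `pending`, which is all "auto-resume" needs to mean.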
cc @jordesign @dmsnell @StevenDufresne @tellyworth