-
Gutenberg blocks (and theming) are fairly opinionated about how CSS is applied, which leads me to the following questions:
-
Noting this interesting experiment, originally shared in the #data-liberation Slack channel, related to HTML importing.
-
tl;dr: HTML markup doesn't carry enough information to reasonably recreate the original site. Let's explore a browser extension to extract information from a rendered website instead.
The same HTML markup may represent every existing block and more, depending on the CSS and JS around it. You just won’t know without rendering it first. A live page can answer questions the markup alone can't: What content is actually visible? How large is it? How large relative to the rest of the page? What’s the background color? Are there two columns side by side? Do they stack when the viewport is resized? How closely does the layout match a visual pattern like “header, paragraph, author name”?
Imagine a tool that recognizes visual patterns the way a person would: this is a headline, that is author info, this looks like two columns, and so on. Even a very limited tool would go a long way. It could recover not just all the text from a site, but also a lot of information about blocks and layout, and most of the structure and hierarchy.
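For illustration, here's a minimal sketch of one such heuristic running against the live DOM, say from a browser extension's content script. The selector and the 50% threshold are illustrative assumptions, not an existing implementation:

```ts
// Hypothetical heuristic: do two sibling elements render as
// side-by-side columns in the live DOM?
function areSideBySideColumns(a: Element, b: Element): boolean {
  const ra = a.getBoundingClientRect();
  const rb = b.getBoundingClientRect();
  // How much of their vertical extent do they share?
  const verticalOverlap =
    Math.min(ra.bottom, rb.bottom) - Math.max(ra.top, rb.top);
  // Side by side: they overlap vertically for most of the shorter
  // element's height and occupy disjoint horizontal ranges.
  return (
    verticalOverlap > 0.5 * Math.min(ra.height, rb.height) &&
    (ra.right <= rb.left || rb.right <= ra.left)
  );
}

// Usage: test the children of a candidate container
// ("main > div" is a placeholder selector).
const container = document.querySelector("main > div");
if (container && container.children.length === 2) {
  const [first, second] = Array.from(container.children);
  console.log("two columns?", areSideBySideColumns(first, second));
}
```

Resizing the viewport and re-running the same check would also tell you whether the columns stack, which is exactly the kind of signal the raw markup can't give you.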
Visual web scrapers have been around for a while. HTML to Framer solves a similar problem. This approach works, and I've seen the markup-based approach fail more than once.
Web hosts change their markup faster than importer developers can keep up. Without a visual tool, we'll need a large number of host-specific importers, which would be a maintenance burden.
What about hidden content that can be read from the markup, e.g. navigation menus? You can't get that from an image.
From an image you can't, but I'm talking about using the live DOM. The DOM gives you a lot of information: you can reason about what's hidden, what reveals more data when you click it, and so on. For something like an API-based React site, you won't get the menus from either the markup or the image, but a browser extension can poke around the page, maybe even click something for you, and get the information.
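A rough sketch of that "poke around" idea, again as content-script code. The `nav a` and `.menu-toggle` selectors are hypothetical placeholders for whatever the page actually uses:

```ts
// Collect navigation links from the live DOM, noting which ones
// are present in the markup but not currently visible.
function collectNavLinks(): { href: string; text: string; hidden: boolean }[] {
  return Array.from(
    document.querySelectorAll<HTMLAnchorElement>("nav a")
  ).map((a) => {
    const style = getComputedStyle(a);
    const rect = a.getBoundingClientRect();
    const hidden =
      style.display === "none" ||
      style.visibility === "hidden" ||
      rect.width === 0 ||
      rect.height === 0;
    return { href: a.href, text: a.textContent?.trim() ?? "", hidden };
  });
}

// If every link is hidden, try opening a hamburger menu first,
// then collect again once the menu has rendered.
const links = collectNavLinks();
if (links.length > 0 && links.every((l) => l.hidden)) {
  document.querySelector<HTMLElement>(".menu-toggle")?.click();
}
```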
Users would need to be heavily engaged, committed, and never close their browser windows.
For most sites with 10–20 pages, the process should take under a minute. Open all the pages, transform the content, save a zip file or open it in Playground, done.
The importer could also be resumable. It would stream its progress to a local file and just auto-resume the next time it's opened. If there's a fresh remote site on the other end of the process, it could clearly communicate “I'm disconnected, I'll continue syncing when you're online”.
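A minimal sketch of what that resumable state could look like, assuming a Manifest V3 extension with the `storage` permission. Here `chrome.storage.local` stands in for the local file mentioned above, and the data shape is a guess:

```ts
// Hypothetical per-site import progress, persisted after every page
// so a closed window only loses the page currently in flight.
interface ImportProgress {
  siteUrl: string;
  pending: string[];   // page URLs not yet captured
  completed: string[]; // page URLs already transformed
}

async function loadProgress(siteUrl: string): Promise<ImportProgress> {
  const stored = await chrome.storage.local.get(siteUrl);
  return (
    (stored[siteUrl] as ImportProgress | undefined) ?? {
      siteUrl,
      pending: [],
      completed: [],
    }
  );
}

async function markPageDone(siteUrl: string, pageUrl: string): Promise<void> {
  const progress = await loadProgress(siteUrl);
  progress.pending = progress.pending.filter((u) => u !== pageUrl);
  if (!progress.completed.includes(pageUrl)) progress.completed.push(pageUrl);
  await chrome.storage.local.set({ [siteUrl]: progress });
}
```

On startup, the extension would call `loadProgress` and simply continue from `pending`, which is all "auto-resume" needs to mean.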
cc @jordesign @dmsnell @StevenDufresne @tellyworth