Experimental: Automatically fetch WXR attachments into Pull Requests #52

Draft. Wants to merge 33 commits into trunk from normalize-wxr-assets.

@adamziel (Contributor) commented Jun 6, 2024

Adds an experimental workflow that, when it finds a WXR file in the pull request, downloads all the remote images the file references and rewrites their URLs to point to the Blueprints repo.

This PR illustrates it with two WXR files, one of which references ~20 Woo product images. I committed a vanilla WXR file that referenced images from a remote server, and they all got automatically downloaded and included in the PR.

Details

In particular, this script:

  • Lists all the URLs found in the XML document
  • Rewrites the domain found in each URL while considering the context in which it was found (text nodes, cdata, block attributes, HTML attributes, HTML text)
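
To illustrate why the context matters, here is a minimal Python sketch (not the PHP code from this PR). The function names, the example hosts, and the choice to handle only top-level string values in block attributes are all assumptions made for illustration. The same URL is stored differently in an HTML attribute (entity-encoded) and in block attributes (JSON, often with escaped slashes), so each context is decoded, rewritten, and re-encoded in its own way:

```python
import html
import json
from urllib.parse import urlsplit, urlunsplit


def rewrite_url(url, old_base, new_base):
    """Move `url` from old_base to new_base; unrelated URLs are returned as-is."""
    u, old, new = urlsplit(url), urlsplit(old_base), urlsplit(new_base)
    old_path = old.path.rstrip("/")
    same_host = u.netloc.lower() == old.netloc.lower()
    under_path = u.path == old_path or u.path.startswith(old_path + "/")
    if not (same_host and under_path):
        return url
    new_path = new.path.rstrip("/") + u.path[len(old_path):]
    return urlunsplit((new.scheme, new.netloc, new_path, u.query, u.fragment))


def rewrite_in_html_attribute(raw_value, old_base, new_base):
    """HTML attribute context: decode entities, rewrite, encode again."""
    decoded = html.unescape(raw_value)
    rewritten = rewrite_url(decoded, old_base, new_base)
    if rewritten == decoded:
        return raw_value  # leave untouched values byte-for-byte identical
    return html.escape(rewritten, quote=True)


def rewrite_in_block_attributes(raw_json, old_base, new_base):
    """Block attribute context: a JSON object inside a block comment.

    Assumption: only top-level string values are considered.
    """
    attrs = json.loads(raw_json)
    for key, value in attrs.items():
        if isinstance(value, str):
            attrs[key] = rewrite_url(value, old_base, new_base)
    return json.dumps(attrs, separators=(",", ":"))
```

A plain string replacement would miss the `\/`-escaped form and could corrupt `&amp;` sequences, which is the reason for handling each context separately.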

Source code for the WXR normalizer.

github-actions bot commented Jun 6, 2024

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@adamziel adamziel force-pushed the normalize-wxr-assets branch from c2bff48 to 0e54a98 Compare June 6, 2024 15:58
@adamziel adamziel force-pushed the normalize-wxr-assets branch from 4a43a1d to ed8dcdf Compare June 6, 2024 16:00
@adamziel adamziel force-pushed the normalize-wxr-assets branch from e1a9f20 to a0f31bc Compare June 6, 2024 16:04
@adamziel adamziel changed the title Experimental: CI workflow to grab assets from WXR files Experimental: Automatically download WXR attachments in Pull Requests Jun 6, 2024
@adamziel adamziel changed the title Experimental: Automatically download WXR attachments in Pull Requests Experimental: Automatically fetch WXR attachments into Pull Requests Jun 6, 2024
@brandonpayton (Member)

This is pretty cool.

Here are a couple of questions that came to mind while reviewing this:

  • What sorts of copyright concerns may come up with this? Are there situations where assets should be left remote?
  • If we are placing all extracted assets in a single directory, do we handle the possibility of naming collisions?
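
For reference, one common way to sidestep naming collisions in a flat directory is to derive each local filename from a hash of the full source URL. The Python sketch below is a hypothetical illustration, not necessarily what this PR does; `local_asset_name` and the 12-character digest length are assumptions:

```python
import hashlib
import posixpath
from urllib.parse import unquote, urlsplit


def local_asset_name(url):
    """Build a filename that stays unique even when basenames repeat.

    The prefix is a short SHA-256 digest of the whole URL, so
    `a.example/logo.png` and `b.example/logo.png` map to different files,
    while the same URL always maps to the same file.
    """
    path = unquote(urlsplit(url).path)
    basename = posixpath.basename(path) or "asset"
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    return f"{digest}-{basename}"
```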

@brandonpayton (Member)

Also, TIL about curl_multi_exec. ✨
https://github.com/adamziel/wxr-normalize/blob/d9cd270d5abf0741f8773bc01010cff1d558d79e/rewrite-wxr.php#L265

It was fun to skim rewrite-wxr.php. Cool work.
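
For readers unfamiliar with the idea, `curl_multi_exec` lets PHP drive several HTTP transfers at once instead of one after another. The Python sketch below shows the same goal with a different technique (a thread pool rather than curl's multiplexed handles); `download_all`, `fetch`, and the worker count are hypothetical names and values chosen for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Download a single URL and return its body as bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch many URLs concurrently and return {url: body}.

    Duplicate URLs are requested only once. Any exception raised by
    `fetch_one` propagates to the caller.
    """
    unique_urls = list(dict.fromkeys(urls))  # de-duplicate, keep order
    if not unique_urls:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bodies = pool.map(fetch_one, unique_urls)
        return dict(zip(unique_urls, bodies))
```

Passing `fetch_one` as a parameter keeps the function easy to exercise without network access.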

adamziel added a commit to adamziel/playground-content-converters that referenced this pull request Jun 18, 2024
This is an experiment to provide a build-less Documentation Contributor
Workflow using WordPress Playground. It builds on top of the data
conversion toolkit (markdown ⇔ blocks ⇒ wxr) also shipped in this repo.

## Option 1: Run it in the browser

Click here to try it:

[<kbd> <br>Edit the Gutenberg
Handbook<br> </kbd>](https://playground.wordpress.net/?gh-ensure-auth=yes&ghexport-repo-url=https%3A%2F%2Fgithub.com%2Fadamziel%2Fplayground-docs-workflow&ghexport-content-type=custom-paths&ghexport-path=plugins/wp-docs-plugin&ghexport-path=plugins/export-static-site&ghexport-path=themes/playground-docs&ghexport-path=html-pages&ghexport-path=uploads&ghexport-commit-message=Documentation+update&ghexport-playground-root=/wordpress/wp-content&ghexport-repo-root=/wp-content&blueprint-url=https%3A%2F%2Fraw.githubusercontent.com%2Fadamziel%2Fplayground-docs-workflow%2Ftrunk%2Fblueprint-browser.json&ghexport-pr-action=create&ghexport-allow-include-zip=no)

Or watch the video:


https://github.com/WordPress/gutenberg/assets/205419/6142a675-5e4c-41e6-9a82-d4f21bcb429a

## Option 2: Run it on the server

* Install [Bun](https://bun.sh/)
* Install dependencies via `bun install`
* Start the editor using one of the following commands:

```shell
# To convert .md -> Blocks in CLI and then start Playground:
$ bash src/run-markdown-editor-convert-markdown-in-cli.sh ./markdown

# To start Playground and convert .md -> Blocks using browser as the 
# JavaScript runtime:
$ bash src/run-markdown-editor-convert-markdown-in-browser.sh ./markdown
# And then go to http://127.0.0.1:9400/wp-admin/post-new.php to finish
# the conversion process.
```

## How does it work?

Here's what the button above does:

* Fetches the latest version of the Gutenberg handbook from the
[WordPress/gutenberg](https://github.com/WordPress/gutenberg/)
repository into the `wp-content/static-content` directory.
* Rewrites markdown as block markup and imports it as WordPress pages.
It uses a JavaScript markdown parser and the files are converted either
via a CLI command or as the first thing the web browser does before it
can interact with WordPress.
* Saves every edit from the block editor back into markup.
* Pre-configures the GitHub export modal for single-click Pull Request
creation.
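
To make the "markdown as block markup" step concrete, here is a deliberately tiny Python sketch. It is an assumption-laden stand-in for the JavaScript parser mentioned above: it understands only ATX headings and plain paragraphs, and `markdown_to_blocks` is a hypothetical name. The output uses the comment-delimited block syntax that the block editor stores in post content:

```python
import html
import json
import re


def markdown_to_blocks(markdown):
    """Convert headings and paragraphs into serialized block markup."""
    blocks = []
    for chunk in re.split(r"\n\s*\n", markdown.strip()):
        # Join the soft-wrapped lines of one markdown paragraph.
        text = " ".join(line.strip() for line in chunk.splitlines())
        if not text:
            continue
        heading = re.match(r"(#{1,6})\s+(.*)", text)
        if heading:
            level = len(heading.group(1))
            # Level 2 is the heading block's default, so it carries no attributes.
            attrs = "" if level == 2 else " " + json.dumps(
                {"level": level}, separators=(",", ":")
            )
            title = html.escape(heading.group(2), quote=False)
            blocks.append(
                f"<!-- wp:heading{attrs} -->\n"
                f'<h{level} class="wp-block-heading">{title}</h{level}>\n'
                f"<!-- /wp:heading -->"
            )
        else:
            body = html.escape(text, quote=False)
            blocks.append(
                f"<!-- wp:paragraph -->\n<p>{body}</p>\n<!-- /wp:paragraph -->"
            )
    return "\n\n".join(blocks)
```

A real converter also has to handle lists, code fences, links, and images, which is why the workflow relies on a full markdown parser.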

## Follow-up work

* Support missing features
    * Exporting attachments
    * Rewrite URLs and paths
        * Relative markdown paths as WordPress pages URLs and vice versa
          (or set up a markdown-like permalink schema)
        * Attachments URLs on export to make the resulting markdown
          document reference the correct images.
        * Ask the user to provide the base URL for links and attachments.
          We may infer it and pre-populate the form, but we just can't
          quietly use those guesses. The URLs must be explicitly provided
          either through a form or through URL parameters.
        * Related work:
          [rewrite-wxr.php](https://github.com/adamziel/wxr-normalize/blob/trunk/rewrite-wxr.php),
          WordPress/blueprints#52
    * Support renaming Markdown files in WordPress. How? Through slugs?
    * Make the PHP plugins configurable for projects other than Gutenberg
        * Accept information like "supported file extensions" via
          constants or site options
        * Support other possible directory structures, e.g. with
          `01-index.md` file denoting a root instead of `README.md` as we
          assume now.
    * Support linking directly to editing a specific markdown page.
    * Use highlighted code blocks instead of vanilla WordPress code
      blocks. Preserve the programming language name (it's deleted now)
* Provide great User Experience
    * Do not reformat lines that were not edited. Currently we
      re-serialize blocks as markdown and sometimes format whitespaces
      differently which may be confusing when reviewing the resulting PR.
    * Set up a separate domain with a dedicated UX
        * Remember GitHub credentials in the browser
        * Don't display large GitHub forms, make it as easy as "I save a
          Page -> a PR gets automatically created or updated for me"
        * Easy integration with your repository – perhaps via a dedicated
          "quick connect" tool.
    * Importing is a bit slow – let's make it snappy:
        * Cache Playground assets to cut on the download time
        * Only fetch *.md files from the Handbook repo, don't download
          media files.
        * Stream-process each markdown file as it's downloaded instead of
          downloading everything
        * Switch to either [GitHub markdown
          parser](https://github.com/github/cmark-gfm) (requires building
          it as WASM) or a PHP markdown parser.
        * Optional: Convert markdown to blocks lazily, as it's accessed.
          This might not be worth the additional complexity.
* Extend to new use-cases
    * End to end documentation toolkit – editing, collaborating,
      rendering as HTML for the readers.
        * Transplant static site rendering flow from
          [playground-docs-workflow](https://github.com/adamziel/playground-docs-workflow)
        * Explore preserving custom plugins, themes, global styles.
    * Explore importing Jekyll sites and Obsidian notes.
        * Support editing front matter (via custom meta boxes?)
        * Actually use front matter for rendering – how should we map
          these arbitrary keys to WordPress values?
    * Extend the `static file -> Playground -> static file` workflow for
      other data sources
        * WXR (load an entire site from a WXR file and save changes back
          to the same WXR file)
        * .doc, .docx
        * Trac wiki markup
        * Playground snapshot
adamziel added a commit to adamziel/wxr-normalize that referenced this pull request Jul 15, 2024
Brings together a few explorations to stream-rewrite site URLs in a WXR file coming
from a remote server. All of that with no curl, DOMDocument, or other
PHP dependencies. It's just a few small libraries built with WordPress
core in mind:

* [AsyncHttp\Client](WordPress/blueprints#52)
* [WP_XML_Processor](WordPress/wordpress-develop#6713)
* [WP_Block_Markup_Url_Processor](https://github.com/adamziel/site-transfer-protocol)
* [WP_HTML_Tag_Processor](https://developer.wordpress.org/reference/classes/wp_html_tag_processor/)

Here's what the rewriter looks like:

```php
$wxr_url = "https://raw.githubusercontent.com/WordPress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/woo-products.wxr";

// The source and target site URLs do not change between chunks,
// so they are parsed once, before the streaming loop.
$string_new_site_url     = 'https://mynew.site/';
$parsed_new_site_url     = WP_URL::parse( $string_new_site_url );

$current_site_url        = 'https://raw.githubusercontent.com/wordpress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/wxr-assets/';
$parsed_current_site_url = WP_URL::parse( $current_site_url );

$base_url = 'https://playground.internal';

$xml_processor = new WP_XML_Processor( '', [], WP_XML_Processor::IN_PROLOG_CONTEXT );
foreach ( stream_remote_file( $wxr_url ) as $chunk ) {
    // Feed the next downloaded chunk to the streaming XML processor.
    $xml_processor->stream_append_xml( $chunk );
    foreach ( xml_next_content_node_for_rewriting( $xml_processor ) as $text ) {
        $url_processor = new WP_Block_Markup_Url_Processor( $text, $base_url );

        // Rewrite every URL in this content node that belongs to the current site.
        foreach ( html_next_url( $url_processor, $current_site_url ) as $parsed_matched_url ) {
            $updated_raw_url = rewrite_url(
                $url_processor->get_raw_url(),
                $parsed_matched_url,
                $parsed_current_site_url,
                $parsed_new_site_url
            );
            $url_processor->set_raw_url( $updated_raw_url );
        }

        // Write the node back only if a URL actually changed.
        $updated_text = $url_processor->get_updated_html();
        if ( $updated_text !== $text ) {
            $xml_processor->set_modifiable_text( $updated_text );
        }
    }
    // Emit everything that has been fully processed so far.
    echo $xml_processor->get_processed_xml();
}
// Flush whatever remains after the last chunk.
echo $xml_processor->get_unprocessed_xml();
```
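
The loop above emits what has been fully processed after each chunk and flushes the remainder at the end. A minimal Python sketch of that pattern follows, under loud assumptions: it does plain substring replacement rather than XML-aware or URL-aware rewriting, and `stream_rewrite` is a hypothetical name. The point it shows is why a tail has to be held back: a URL can be split across two network chunks.

```python
def stream_rewrite(chunks, old, new):
    """Replace `old` with `new` in a chunked stream, including matches
    that straddle a chunk boundary."""
    if not old:
        raise ValueError("`old` must be a non-empty string")
    pending = ""
    for chunk in chunks:
        pending += chunk
        out = []
        pos = 0
        # Replace every complete match currently in the buffer.
        while True:
            i = pending.find(old, pos)
            if i == -1:
                break
            out.append(pending[pos:i])
            out.append(new)
            pos = i + len(old)
        # Hold back the last len(old) - 1 characters: they may be the
        # beginning of a match that completes in the next chunk.
        tail_start = max(pos, len(pending) - (len(old) - 1))
        out.append(pending[pos:tail_start])
        pending = pending[tail_start:]
        yield "".join(out)
    # Nothing more will arrive, so the held-back tail is safe to emit.
    yield pending
```

Usage: `"".join(stream_rewrite(chunks, old_site_url, new_site_url))` produces the same text as a single replacement over the whole document, without ever holding the whole document in memory.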