[Data Liberation] WP_WXR_Reader #1972
Drafts a `WP_WXR_Processor` class that can extract structured information from XML streams. This is an early version. The goal is to make it streamable and resumable.
@brandonpayton the API emerging from this work surprises me – the WXR file is a data source that emits data objects like a "site title", "site URL", "a post", or "a post comment". Blueprint handlers are functions that accept data objects and imprint them in a WordPress instance. I wonder what other connections we might draw here.
Oooh, what if we treated all these data sources as streams of objects?
In such a scenario the
@adamziel, these are interesting ideas! Will sleep on them. This might help connect ideas you've shared elsewhere when we've discussed a sort of WordPress concept language. It's not my desire to reinvent the wheel with yet another language or format, but it seemed like the problem space around site recipes might call for it. It depends on whether we want a human-writable thing or just something that can relay various entities.
I'm not sure I understand one part. All of these data sources can have similar data (content) and they need to go through a similar processor to update the content and load assets. But what would be the expected destination for this data? Would it all end up as posts?
 *
 * @param int $offset
 * @return int
 */
WXR supports site options, posts, comments, users, metadata, and a few more data types – so that's what it would end up as. Raw markdown might be just posts, but we could support post meta and site options via frontmatter. WordPress -> WordPress would support every possible data type.
I'll go ahead and merge to keep moving forward. The code isn't used anywhere yet.
/**
 * UTF-8 decoding pipeline by Dennis Snell (@dmsnell), originally
 * proposed in https://github.com/WordPress/wordpress-develop/pull/6883.
 *
 * It enables parsing XML documents with incomplete UTF-8 byte sequences
 * without crashing or depending on the mbstring extension.
 */
❤️
	return 0;
}
$name_byte_length = 0;
while(true) {
@adamziel @dmsnell @sirreal
As I'm looking into a similar problem right now, I thought I'd dump here an idea that I used in my case.
In the loop, we could 1) check for a sequence of ASCII < 128 characters, 2) check if the next character can be multibyte, and 3) only then call `utf8_codepoint_at`. In my scenario, it gives some performance gains. If we expect most attribute names to be ASCII < 128, then this could bring significant performance improvements.
(This is a simplification as I didn't consider the `$test_as_first_character` handling.)
while (true) {
	// First, let's try to parse an ASCII sequence.
	$name_byte_length += strspn(
		$this->xml,
		'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_',
		$offset + $name_byte_length
	);
	// Check if the following byte can be part of a multibyte character.
	$byte = $this->xml[ $offset + $name_byte_length ] ?? null;
	if ( null === $byte || ord( $byte ) < 128 ) {
		break;
	}
	// Check the \x{0080}-\x{ffff} Unicode character range.
	$codepoint = utf8_codepoint_at( $this->xml, $offset + $name_byte_length, $bytes_parsed );
	if (
		null === $codepoint ||
		! $this->is_valid_name_codepoint( $codepoint, $name_byte_length === 0 )
	) {
		break;
	}
	$name_byte_length += $bytes_parsed;
}
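The fast path suggested above can be tried in isolation. What follows is a minimal, standalone sketch and not the PR's code: `sketch_name_length()` and `sketch_codepoint_at()` are hypothetical names, the decoder is a simplified stand-in for `utf8_codepoint_at`, and, as a loudly stated assumption, every decoded non-ASCII code point is accepted where the real loop would consult `is_valid_name_codepoint()`. It also shows one possible way to cover the first-character case by matching a single byte against a stricter set before allowing digits, `-` and `.`.

```php
<?php
// Hypothetical sketch; none of these names come from the PR.
const SKETCH_ASCII_NAME_START = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_';
const SKETCH_ASCII_NAME_REST  = SKETCH_ASCII_NAME_START . '0123456789-.';

// Decodes one 2 to 4 byte UTF-8 sequence starting at $at.
// Returns null for truncated or malformed input. Overlong forms and
// surrogates are NOT rejected here; a real decoder should reject them.
function sketch_codepoint_at( string $text, int $at, &$bytes_parsed ) {
	$bytes_parsed = 0;
	$lead         = ord( $text[ $at ] );
	if ( $lead >= 0xC2 && $lead <= 0xDF ) {
		$length    = 2;
		$codepoint = $lead & 0x1F;
	} elseif ( $lead >= 0xE0 && $lead <= 0xEF ) {
		$length    = 3;
		$codepoint = $lead & 0x0F;
	} elseif ( $lead >= 0xF0 && $lead <= 0xF4 ) {
		$length    = 4;
		$codepoint = $lead & 0x07;
	} else {
		return null;
	}
	if ( $at + $length > strlen( $text ) ) {
		return null; // The sequence is cut off by the end of the buffer.
	}
	for ( $i = 1; $i < $length; $i++ ) {
		$byte = ord( $text[ $at + $i ] );
		if ( 0x80 !== ( $byte & 0xC0 ) ) {
			return null; // Not a continuation byte.
		}
		$codepoint = ( $codepoint << 6 ) | ( $byte & 0x3F );
	}
	$bytes_parsed = $length;
	return $codepoint;
}

// Returns the byte length of the name starting at $offset, or 0 if none.
function sketch_name_length( string $xml, int $offset ): int {
	$length = 0;
	while ( true ) {
		if ( 0 === $length ) {
			// The first character uses the stricter set; match one byte at most.
			$length += strspn( $xml, SKETCH_ASCII_NAME_START, $offset, 1 );
		}
		if ( $length > 0 ) {
			// Fast path: consume the whole ASCII run with a single call.
			$length += strspn( $xml, SKETCH_ASCII_NAME_REST, $offset + $length );
		}
		$byte = $xml[ $offset + $length ] ?? null;
		if ( null === $byte || ord( $byte ) < 128 ) {
			break; // End of input, or an ASCII byte that cannot be in a name.
		}
		// Slow path: decode only when a multibyte sequence may start here.
		$codepoint = sketch_codepoint_at( $xml, $offset + $length, $bytes_parsed );
		if ( null === $codepoint ) {
			break;
		}
		// ASSUMPTION: any non-ASCII code point counts as a name character.
		$length += $bytes_parsed;
	}
	return $length;
}
```

For mostly-ASCII names the decoder is never reached, which is where the suggested gain would come from. Whether it is measurable in `WP_XML_Processor` would need benchmarking.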
Description
This PR introduces the `WP_WXR_Reader` class for parsing WordPress eXtended RSS (WXR) files, along with supporting improvements to the XML processing infrastructure.
Note: `WP_WXR_Reader` is just a reader. It won't actually import the data into WordPress; that part is coming soon.
A part of #1894
Motivation
There is no WordPress importer that would check all these boxes:
`WP_WXR_Reader` is a step in that direction. It cannot pause and resume yet, but the next few PRs will add that feature.
Implementation
`WP_WXR_Reader` uses the `WP_XML_Processor` to find XML tags representing meaningful WordPress entities. The reader knows the WXR schema and only looks for relevant elements. For example, it knows that posts are stored in `rss > channel > item` and comments are stored in `rss > channel > item > wp:comment`.
The `$wxr->next_entity()` method stream-parses the next entity from the WXR document and exposes it to the API consumer via `$wxr->get_entity_type()` and `$wxr->get_entity_data()`. The reader remembers where parsing stopped, so the next call to `$wxr->next_entity()` parses the entity after that point.
Similarly to `WP_XML_Processor`, the `WP_WXR_Reader` enters a paused state when it doesn't have enough XML bytes to parse the entire entity.
The `next_entity() -> fread -> break` usage pattern may seem a bit tedious. This is expected. Even if the WXR parsing part of the `WP_WXR_Reader` offers a high-level API, working with byte streams requires reasoning on a much lower level. The `StreamChain` class shipped in this repository will make the API consumption easier with its transformation-oriented API for chaining data processors.
Supported WordPress entities
- `<item>` tags
- `<wp:comment>` tags
- `<wp:commentmeta>` tags
- `<wp:author>` tags
- `<wp:postmeta>` tags
- `<wp:term>` tags
- `<wp:tag>` tags
- `<wp:category>` tags
Caveats
Extensibility
`WP_WXR_Reader` ignores any XML elements it doesn't recognize. The WXR format is extensible, so the reader may later support registering custom handlers for unknown tags.
Nested entities intertwined with data
`WP_WXR_Reader` flushes the current entity whenever another entity starts. The upside is simplicity and a tiny memory footprint. The downside is that it's possible to craft a WXR document where some information would be lost. For example, when an `<item>` places a `<wp:post_id>` element after a `<wp:postmeta>` block, `WP_WXR_Reader` would accumulate post data until the `<wp:postmeta>` tag. Then it would emit a `post` entity and accumulate the meta information until the `</wp:postmeta>` closer. Then it would advance to `<wp:post_id>` and ignore it.
This was not a problem in any of the `.wxr` files I saw. Still, it is important to note this limitation. It is possible there is a `.wxr` generator somewhere out there that intertwines post fields with post meta and comments. If this ever comes up, we could:
- Emit the `post` entity first, then all the nested entities, and then emit a special `post_update` entity.
Buffering all the post meta and comments seems like a bad idea: there might be gigabytes of data.
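To make the flushing caveat concrete, here is a toy model of the rule described in this section. It is a sketch under stated assumptions, not the reader's implementation: `sketch_flush_on_new_entity()` is a hypothetical function, and the input is a hand-written list of events rather than real XML, so that the example stays self-contained.

```php
<?php
// Hypothetical toy model of "flush the current entity when another one starts".
// Each event is array( kind, name, value ), where kind is open, field or close.
function sketch_flush_on_new_entity( array $events ): array {
	$emitted = array();
	$current = null;
	foreach ( $events as $event ) {
		list( $kind, $name, $value ) = $event;
		if ( 'open' === $kind ) {
			if ( null !== $current ) {
				$emitted[] = $current; // A new entity starts, so flush the old one.
			}
			$current = array( 'type' => $name, 'data' => array() );
		} elseif ( 'field' === $kind ) {
			if ( null !== $current ) {
				$current['data'][ $name ] = $value;
			}
			// With no open entity the field has nowhere to go and is dropped.
		} elseif ( 'close' === $kind && null !== $current && $current['type'] === $name ) {
			$emitted[] = $current;
			$current   = null;
		}
	}
	return $emitted;
}

// Illustrative input: a post whose ID appears after its meta block.
$events = array(
	array( 'open', 'post', null ),
	array( 'field', 'post_title', 'Hello' ),
	array( 'open', 'postmeta', null ),
	array( 'field', 'meta_key', 'color' ),
	array( 'field', 'meta_value', 'blue' ),
	array( 'close', 'postmeta', null ),
	array( 'field', 'post_id', '10' ),
	array( 'close', 'post', null ),
);
$emitted = sketch_flush_on_new_entity( $events );
```

In this model `$emitted` holds a `post` entity with only `post_title`, followed by the `postmeta` entity. The trailing `post_id` never reaches the post, which mirrors the information loss described above.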
Future Plans
The next phase will add pause/resume functionality to handle timeout scenarios:
`n` entities to speed it up. Then also save the `n` for a quick rewind after resuming.
Testing Instructions
Read the tests and ponder whether they make sense. Confirm the PHPUnit test suite passed on CI. The test suite includes coverage for various WXR formats and streaming behaviors.
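When reading the streaming tests, it may help to keep the `next_entity() -> fread -> break` loop from the Implementation section in mind. The sketch below imitates that loop with a hypothetical stand-in class, `Sketch_Pausable_Reader`, which treats each line as one "entity". It is not `WP_WXR_Reader`, and the `append_bytes()` and `get_entity()` method names are assumptions made for illustration only.

```php
<?php
// Hypothetical stand-in for a pausable reader: one "entity" per line.
class Sketch_Pausable_Reader {
	private $buffer = '';
	private $entity = '';

	public function append_bytes( string $bytes ): void {
		$this->buffer .= $bytes;
	}

	// Returns false (paused) until a complete entity is buffered.
	public function next_entity(): bool {
		$end = strpos( $this->buffer, "\n" );
		if ( false === $end ) {
			return false;
		}
		$this->entity = substr( $this->buffer, 0, $end );
		$this->buffer = (string) substr( $this->buffer, $end + 1 );
		return true;
	}

	public function get_entity(): string {
		return $this->entity;
	}
}

// Feeds the reader in small chunks and collects every entity it yields.
function sketch_read_all( $stream, int $chunk_size ): array {
	$reader   = new Sketch_Pausable_Reader();
	$entities = array();
	while ( true ) {
		if ( false === $reader->next_entity() ) {
			if ( feof( $stream ) ) {
				break; // Paused and no more input: we are done.
			}
			$chunk = fread( $stream, $chunk_size );
			if ( false === $chunk ) {
				break; // Read failure.
			}
			$reader->append_bytes( $chunk );
			continue;
		}
		$entities[] = $reader->get_entity();
	}
	return $entities;
}

// An in-memory stream keeps the example independent of the filesystem.
$stream = fopen( 'php://memory', 'w+b' );
fwrite( $stream, "site_title\npost\ncomment\n" );
rewind( $stream );
$entities = sketch_read_all( $stream, 4 );
fclose( $stream );
```

The 4-byte chunk size is deliberately tiny so that every entity straddles at least one chunk boundary, which is the situation the paused state exists for.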