Let's use the data processing libraries maintained in this repository to improve the official WordPress WXR importer.
Goals
- Short-term: Improve the WXR importing experience in a succession of small, frequent, meaningful changes to the existing wordpress-importer plugin.
- Long-term: Ship a selective, fast, live WordPress <-> WordPress sync.
- Anti-goal: Get stuck for two months on rewriting the entire existing importer plugin from scratch before shipping the first user-facing change.
The current state of the WordPress importer
I've mapped out some features and shortcomings of the existing WordPress importer. The proposed roadmap below is based on this research:
Draft roadmap – let's discuss
1. Testability
The wordpress-importer implementation is a mixture of wp-admin UI rendering, input processing, and importing logic.
The https://github.com/WordPress/wordpress-importer repository has some unit tests. Let's investigate how we can leverage and expand that test harness to make changes without breaking existing use-cases.
2. Compatibility with all hosts
The wordpress-importer plugin relies on libxml and won't work on some hosts. That's quite restrictive – let's make it compatible with all the hosts!
3. URL rewriting
wordpress-importer does not rewrite absolute URLs in the imported posts and comments. As a result, the imported data often contains broken links that point back to the source site.
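To make the idea concrete, here's a rough sketch of what rewriting could look like. It's written in Python only for brevity (the plugin itself is PHP), and everything in it is an assumption for illustration: the `rewrite_urls` name, the regex-based URL matching, and the example domains. A real implementation would more likely parse the markup with a proper processor rather than rely on a regular expression.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def rewrite_urls(content: str, source_base: str, target_base: str) -> str:
    """Rewrite absolute URLs located under source_base so they point at target_base.

    Hypothetical helper: URLs on other hosts, or outside the source path, are left as-is.
    """
    src = urlsplit(source_base)
    dst = urlsplit(target_base)
    src_path = src.path.rstrip("/")
    dst_path = dst.path.rstrip("/")

    # Naive matcher: an http(s) URL runs until whitespace, a quote, or an angle bracket.
    url_pattern = re.compile(r"https?://[^\s\"'<>]+")

    def replace(match: re.Match) -> str:
        url = urlsplit(match.group(0))
        # Only the host is compared, so http and https variants are both rewritten.
        same_host = url.netloc.lower() == src.netloc.lower()
        under_path = url.path == src_path or url.path.startswith(src_path + "/")
        if not (same_host and under_path):
            return match.group(0)
        new_path = dst_path + url.path[len(src_path):]
        # Query string and fragment are carried over unchanged.
        return urlunsplit((dst.scheme, dst.netloc, new_path, url.query, url.fragment))

    return url_pattern.sub(replace, content)
```

For example, with `https://old.example.com/blog` as the source and `https://new.example.net` as the target, a link to `https://old.example.com/blog/2024/hello/?p=1#top` would become `https://new.example.net/2024/hello/?p=1#top`, while links to unrelated hosts stay untouched.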
4. Add a streaming data flow in the wordpress-importer plugin
Existing filters, such as wp_import_categories, assume the entire import context is stored in memory. This isn't a viable approach for processing larger datasets. To support them, we need to break backward compatibility (BC) on the existing filters without breaking the code that hooks into those filters.
One way of doing that would be forking the importer plugin and removing those filters. But forking is challenging: usage drops, changes need to be backported, and the codebase becomes fragmented. There's an easier way.
Let's create a second, experimental streaming data flow in the importer plugin. By default, WXR files would still be processed by the existing machinery. When a flag is set or a checkbox is checked, we'd switch to a streaming processor that imports one chunk at a time and can recover from timeouts and OOM errors.
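Here's a minimal sketch of the two data flows living side by side. It's in Python purely for illustration and is not the plugin's code: `iter_wxr_items`, `import_wxr`, the `use_streaming` flag, and the sample document are all made-up names, and the "legacy" branch only imitates the load-everything-first behavior.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a WXR export; real files carry many more fields.
SAMPLE_WXR = b"""<?xml version="1.0"?>
<rss><channel><title>Site</title>
<item><title>Hello</title><link>https://old.example.com/hello</link></item>
<item><title>World</title><link>https://old.example.com/world</link></item>
</channel></rss>"""


def iter_wxr_items(stream):
    """Yield one <item> at a time as a plain dict instead of building the full tree."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "item":
            yield {
                "title": elem.findtext("title", default=""),
                "link": elem.findtext("link", default=""),
            }
            elem.clear()  # drop the children of the item we just handed out


def import_wxr(stream, import_item, use_streaming=False):
    """Run the import through either flow and return the number of items handled."""
    if use_streaming:
        # Experimental flow: each item is imported as soon as it is parsed.
        count = 0
        for item in iter_wxr_items(stream):
            import_item(item)
            count += 1
        return count
    # Default flow: materialize every item first, then import them.
    items = list(iter_wxr_items(stream))
    for item in items:
        import_item(item)
    return len(items)
```

Calling `import_wxr(io.BytesIO(SAMPLE_WXR), handler, use_streaming=True)` hands items to `handler` one by one, so memory use depends on the size of a single item rather than the whole file. Filters that need the full import context would only ever run in the default flow.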
5. Naive large file support
wordpress-importer is unable to process large files due to two constraints: the PHP request timeout and the memory limit. Let's break out of those by supporting a re-entrant, multi-request importing flow. First, we wouldn't store everything in memory. Second, we'd know how to pause the import process and resume it later.
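A rough sketch of the pause-and-resume part, again in Python just to show the shape of it. All names here are hypothetical: `run_import_step`, the `cursor`/`done` state keys, and the `read_from` callable that is assumed to be able to start reading items at a given position. The state is a small dict that the caller would persist between requests.

```python
import time


def run_import_step(read_from, state, import_item, time_budget_s, clock=time.monotonic):
    """Import items starting at state["cursor"] until the input ends or time runs out.

    read_from(cursor) is assumed to return an iterator over the items from that
    position on. Returns the new state for the caller to persist.
    """
    deadline = clock() + time_budget_s
    cursor = state.get("cursor", 0)
    done = True
    for item in read_from(cursor):
        if clock() >= deadline:
            done = False  # out of time: stop here and let a later request resume
            break
        import_item(item)
        cursor += 1  # advance only after the item was actually imported
    return {"cursor": cursor, "done": done}
```

Each request would load the saved state, call `run_import_step` with a budget safely below the request timeout, and store the returned state. Because the cursor only moves after an item is imported, a request that dies mid-item would retry that item on the next run, so the per-item import should be safe to repeat.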
Cases explicitly not covered at this stage:
6. Fast, concurrent assets download
The wordpress-importer plugin downloads all the remote assets one by one. It's slow! Let's parallelize those downloads and fetch, say, up to 10 files concurrently at any given time.
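To illustrate the bounded concurrency, here's a small Python sketch. It is not how the plugin would do it in PHP; `download_all` and the injected `fetch` callable are assumptions made for the example, and the limit of 10 simply mirrors the number floated above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_concurrency=10):
    """Fetch every URL with at most max_concurrency requests in flight.

    fetch(url) is a caller-supplied function (hypothetical here) that returns
    the downloaded bytes or raises on failure.
    Returns (results, errors), both keyed by URL.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep going: one failed asset should not abort the others.
                errors[url] = exc
    return results, errors
```

Collecting failures separately instead of raising on the first one leaves room for retrying or skipping individual assets later.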
7. Future
The above points will already take some time to implement. Here are some items that would be good to look into afterwards:
- Disable filters for the duration of the import to prevent, e.g., sending emails
- More data formats, e.g. Markdown, HTML
- More data sources, e.g. WXR URL, Git repo, another WordPress site, an arbitrary URL
- UI improvements, e.g. dropzone, progress bar, detailed import log and statistics
- Error recovery, e.g. "5 media files couldn't be fetched, do you want to retry? ignore them? upload alternative files?"
- Real-time importing from another source, e.g. publishing a post on site A auto-publishes it on site B
Prior art
Here's the landscape of WordPress WXR importers out in the wild.
WXR importers:
Other WordPress migration products (selected few):
Other, related discussions: