Currently, there are two separate wikitext parsers in use within MediaWiki.
One is the original core parser (legacy parser), and the other is Parsoid.
As of early 2023, the core parser is used for all desktop and mobile web read views, while Parsoid is used to serve all editing clients (VisualEditor, Structured Discussions, Content Translation), linting tools (Extension:Linter), some gadgets, mobile apps, Kiwix offline reader, Wikimedia Enterprise, and the Google knowledge graph project.
The goal of this project is to arrive at a single parser that supports all clients and use cases.
This will make MediaWiki more reliable and consistent for editors, readers, and tools to use. Having a single code base for wikitext processing will also facilitate future wikitext features.
This project is primarily driven by the Content Transform Team (previously the Parsing Team), with participation from the MediaWiki Platform team, all the internal teams that develop Parsoid clients, the Movement Communications team (previously Community Relations Specialists), Wikimedia wiki editor communities, and third-party MediaWiki projects, since this parser unification will touch them all.
This page contains an overview of the project; there is also a roadmap, milestones, and updates page, and a list of pages with additional technical information.
Project Goals
Longer Term Goal: Parsoid is the default wikitext engine for MediaWiki and the legacy parser is removed from the codebase.
Intermediate Goal: Parsoid replaces the core parser for all wikitext use cases on the Wikimedia cluster.
Why
Why unification?
Maintaining two wikitext engines requires a lot of resources and would require duplicated effort for new features.
Why Parsoid?
Parsoid meets all the editing use cases as well as the API use cases that are unique to Parsoid (e.g. Enterprise, Kiwix), and active work is in progress for it to meet all the read use cases.
The legacy parser does not support HTML-based editing use cases (e.g. VisualEditor).
How is this change being tested?
Parser tests:
This is how Parsoid has been developed since its inception. Developers ensure that Parsoid continues to pass parser tests, and where divergence is known, it is recorded after careful review. Parser test coverage has also been vastly expanded over the years, and all patches against Parsoid need to pass tests.
Round-tripping / Integration tests:
In this mode, before every production deployment, wikitext is converted to HTML and HTML back to wikitext for about 180K pages from about 50 production wikis. While this testing mode is primarily meant to ensure that HTML -> wikitext conversion is not broken (which would impact editing client tools), it also implicitly serves to flag any breakages in the HTML output. However, these are not the most reliable tests for verifying that the HTML output is not broken.
Visual diff tests:
Here, screenshots are taken of renderings of legacy parser and Parsoid HTML output and compared, generating a numeric diff score. A typical run will involve 25k+ pages from about 20 production wikis. This has been a really reliable way to identify various breakages and bugs in Parsoid output. As Parsoid gets further deployed, testing gets expanded to a wider range of wikis, which also improves the tool's ability to distinguish real issues from false positives.
Parsoid reading and editing clients:
Parsoid's output has been used over the years by VisualEditor, Android and iOS mobile apps, Kiwix, and other clients. A number of bugs and incompatibilities have been fixed in Parsoid during that time, and various long-tail edge cases continue to be fixed as they are discovered and reported.
As the rollout progresses, other QA and testing methodologies may be added, to ensure that the change will occur as smoothly and non-disruptively as possible.
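The round-tripping check described above can be illustrated with a minimal sketch. This is not Parsoid's actual test harness: the names `round_trip_failures`, `normalize`, `toy_to_html`, and `toy_to_wikitext` are hypothetical, and the toy converters understand bold markup only. The sketch simply assumes that a page "fails" when its wikitext changes after being converted to HTML and back.

```python
import re
from typing import Callable, Dict, List

Converter = Callable[[str], str]


def normalize(wikitext: str) -> str:
    """Strip trailing whitespace so purely cosmetic differences are ignored."""
    return "\n".join(line.rstrip() for line in wikitext.strip().splitlines())


def round_trip_failures(
    pages: Dict[str, str], to_html: Converter, to_wikitext: Converter
) -> List[str]:
    """Return titles whose wikitext changes after wikitext -> HTML -> wikitext."""
    failures = []
    for title, wikitext in pages.items():
        restored = to_wikitext(to_html(wikitext))
        if normalize(restored) != normalize(wikitext):
            failures.append(title)
    return failures


# Toy converters (hypothetical stand-ins for a real parser): bold markup only.
def toy_to_html(wikitext: str) -> str:
    return re.sub(r"'''(.+?)'''", r"<b>\1</b>", wikitext)


def toy_to_wikitext(html: str) -> str:
    return re.sub(r"<b>(.+?)</b>", r"'''\1'''", html)


SAMPLE_PAGES = {
    "Clean": "'''bold''' text",
    "Dirty": "<b>bold</b> text",  # literal HTML tag written in the source
}

print(round_trip_failures(SAMPLE_PAGES, toy_to_html, toy_to_wikitext))  # ['Dirty']
```

The "Dirty" page is flagged because the toy converter cannot tell a literal `<b>` tag from bold quote markup, so the original syntax is lost on the way back. This is the kind of HTML -> wikitext breakage such a test is meant to surface.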
What is the deployment plan / strategy?
The following are the steps already taken, or planned, for parser unification.
✅ Deploy changes to core that make media structure HTML largely identical to what Parsoid emits. This has its own deployment plan. This change has been live on mediawiki.org and officewiki since September 2021, and was rolled out to all wikis in 2022.
✅ Deploy individual user opt-in tools to use Parsoid for read views as part of the ParserMigration extension.
✅ Deploy changes to Wikimedia production that let DiscussionTools use Parsoid HTML directly.
Turn on Parsoid HTML read views on additional wikis incrementally
✅ officewiki
✅ Talk pages on wikitech
✅ wikivoyage (except zhwikivoyage)
✅ Incubator and Dagbani Wikipedia
✅ Most Wiktionaries
(in progress) remaining Wiktionaries (except those using LanguageConverter)
(next) low-traffic Wikipedias
...
Continued work to ensure Parsoid is able to generate the same metadata that the legacy parser generates (categories, backlink tables, page properties, etc.). This is needed for tighter integration of Parsoid into MediaWiki core and to start replacing the legacy parser in additional wikitext use cases.
Use of Parsoid to generate user interface messages
Shipping a long-term support release (planned to be 1.47) with Parsoid as the default wikitext parser out of the box.
Confidence Framework
To validate the evolution of the roadmap and to use data-driven decision making for deployments, a Confidence Framework for Parsoid Read Views was developed.
This framework contains the guidelines for how features, bug fixes, and deployments are prioritized.
How does this impact wikis?
For the most part, the switch to Parsoid generated HTML should be transparent to most users.
Below, however, are some possible impacts on readers, editors, and developers.
Readers
Parsoid models and processes wikitext differently from the legacy parser, and this can sometimes lead to differences in rendering in some edge case scenarios.
Where a wikitext pattern is commonly used, developers have attempted to support it in Parsoid where possible; where that was not possible, they have either fixed the affected wikitext or provided support for fixing it.
At this time, the only remaining rendering differences are expected to be edge cases that can likely be adjusted by fixing wikitext either on individual pages or on templates.
Editors and bot, gadget, skin developers
Parsoid's HTML for media wikitext is different from what the legacy parser has typically generated. As part of a separate project to use semantic HTML5 output for images, the legacy parser has been updated to generate HTML that is pretty close to Parsoid's HTML. This may require some skins, gadgets, bots, and template styles to be updated (if not already).
The Cite extension that targets Parsoid relies on CSS rules to localize numbering of backlinks in the references section rather than generating localized HTML. While the known instances have been fixed, on wikis where a rendering difference is still seen, editors with appropriate permissions will need to update MediaWiki:Common.css on their wikis to add suitable CSS rules targeting this HTML. As part of T384948, WMDE might eliminate the need for CSS and adapt the rendered HTML, but until then, these CSS rules are needed.
Extension developers
Parsoid's internal processing model is different from the legacy parser.
As a result, extensions may need to be updated.
This only impacts extensions that do one or more of the following: (a) operate on wikitext, (b) provide handlers for parser hooks, or (c) call a public method of the legacy parser.
Extensions that process wikitext will definitely need to be updated to work with Parsoid.
To date, the vast majority of such extensions have been updated.
Since Parsoid continues to access the legacy parser for expanding templates and processing parser functions, any parser hooks triggered during this processing will continue to operate, and so will the extensions that rely on these hooks.
For the rest, developers are exploring strategies to minimize updates needed to extensions.
The Content Transform Team files Phabricator tasks for all impacted extensions, and will fix whatever extensions they can directly.
If you are an extension developer, any proactive work is appreciated, as is prompt code review for patches you might receive.
What kind of support will be provided to impacted editor and developer communities?
The Content Transform Team is driving this project.
Their goal is to make this switch to Parsoid as seamless as possible.
To that end, they have rolled out changes gradually over the years.
First, HTML4 Tidy was replaced with HTML5 RemexHtml between 2015 and 2018.
In 2019, in preparation to integrate Parsoid into MediaWiki core more closely, Parsoid was ported from JS to PHP.
This switch went very smoothly.
In 2020, the work started to unify the media output generated by Parsoid and by core.
This has mostly involved making changes to core, but occasionally Parsoid's output was adjusted based on feedback and other technical considerations.
In 2024, Parsoid was deployed as the default parser for page views on Wikivoyage.
Going forward, support will be provided in the following ways:
Linter rules for any wikitext that needs fixing.
The vast majority of this work was completed as part of the Tidy -> Remex migration, and not many additional linter categories are expected for this project.
Communication via this page, via tech news updates, and via updates and posts to village pump and other wiki-specific forums.
Opt-in mechanisms for early adopter users / wikis to test and report problems.
See the next section for more details!
How can you help / be involved?
Since November 2023, you can opt in to using the new Parsoid parser for reading articles on Wikipedia.
See Help:Extension:ParserMigration for more information!
Other things you can do to help:
Test your gadgets / user scripts against Parsoid HTML to identify / fix any breakages
Parsoid read views will be rolled out first on wikis whose communities have elected to be early adopters; watch this space for more details.
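For the gadget and user-script testing suggested above, it can help to have a quick way to reach the Parsoid rendering of a given page. The sketch below only builds URLs and makes two assumptions that should be checked against Help:Extension:ParserMigration and the REST API documentation for your wiki: that a `useparsoid=1` query parameter requests a Parsoid read view, and that the wiki exposes the MediaWiki REST API page HTML endpoint under `/w/rest.php`. The function names are illustrative, not part of any MediaWiki library.

```python
from urllib.parse import quote


def parsoid_read_view_url(wiki_base: str, title: str) -> str:
    """Build a read-view URL that asks for Parsoid rendering.

    Assumption: the wiki honours a `useparsoid=1` query parameter.
    """
    page = quote(title.replace(" ", "_"), safe="/:")
    return f"{wiki_base}/wiki/{page}?useparsoid=1"


def parsoid_html_api_url(wiki_base: str, title: str) -> str:
    """Build a REST API URL for the page's HTML.

    Assumption: the REST API is served from `/w/rest.php` on this wiki.
    """
    page = quote(title.replace(" ", "_"), safe="")
    return f"{wiki_base}/w/rest.php/v1/page/{page}/html"


print(parsoid_read_view_url("https://en.wikipedia.org", "Main Page"))
print(parsoid_html_api_url("https://en.wikipedia.org", "Talk:Main Page"))
```

Opening the first URL in a browser with your gadgets enabled, or fetching the second from a script, gives Parsoid-generated HTML to test selectors and DOM assumptions against.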
Historical documents