MediaWiki Content File Exports - Wikitech
The MediaWiki Content File Exports are datasets available for download that include the unparsed content of the public wikis hosted by the Wikimedia Foundation. These datasets are provided on a per-wiki basis, in a compressed XML format. This XML format is compatible with MediaWiki's Special:Export, and with the legacy XML Dumps.
Project rationale
"Dumps" of the content of Wikimedia wikis in XML format have been produced for many years and are publicly available for download. These dumps enable reuse, repurposing, and analysis both by the community and internally by the Wikimedia Foundation.
However, the infrastructure that produces those XML files can no longer reliably handle the larger wikis, and it has been unmaintained for an extended period of time. Although we will continue to attempt to generate the legacy XML files, we are now deprecating that legacy path.
The Data Engineering team has reimplemented how this data is produced, making it reliably available internally, and we now feel confident making it publicly available as well.
Content
The MediaWiki Content File Exports consist of two datasets:
mediawiki_content_history
Contains the unparsed content of all revisions, past and present, from all public Wikimedia wikis. This dataset is exported per wiki, once per month, on the 1st of the month.
mediawiki_content_current
Contains the unparsed content of the current revisions from all public Wikimedia wikis. This dataset is exported per wiki, once per month, on the 1st of the month.
How to download
1. Identify the wiki to download.
2. Identify whether you need the full history, or if the latest revision per page is sufficient.
3. Attempt to fetch. If the file is available, that particular file export is done and available; if not, retry later.
Example
We want to download the current content of the English Wikipedia, that is, enwiki. We then work out the URL to check for that wiki. If the files for a particular date are not ready, the SHA256SUMS file will not exist.
This file contains the sha256 fingerprint, as well as the relative path, of each file that composes the file export. We can then iterate over each relative path in that file to download them all. After downloading all files, you are highly encouraged to use the SHA256SUMS file again to verify each downloaded file, via a command such as sha256sum --check SHA256SUMS.
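The flow above can be sketched as a short shell script. The curl commands in the comments assume a hypothetical BASE_URL; the runnable part fabricates a tiny local "export" so the verification step works anywhere:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Against the real server you would first fetch the checksum file
# (a failed request means the export is not ready yet):
#   curl -fO "$BASE_URL/SHA256SUMS"
# then download every relative path it lists:
#   awk '{print $2}' SHA256SUMS | xargs -n1 -I{} curl -fO "$BASE_URL/{}"
# Below we fabricate a tiny export locally so the verification step runs.
workdir=$(mktemp -d)
cd "$workdir"
echo "<mediawiki/>" > enwiki-2026-02-01-p1p10.xml.bz2   # stand-in content
sha256sum enwiki-2026-02-01-p1p10.xml.bz2 > SHA256SUMS  # stand-in checksum file
sha256sum --check SHA256SUMS   # reports one line per file; OK means it matches
```

A corrupted or truncated download would make `sha256sum --check` report FAILED for that file and exit non-zero.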
FAQ
How do I know which URL corresponds to the wiki that I wish to download?
The URLs are indexed with what we call the wiki_id of each wiki. For example, for the English Wikipedia, that id is enwiki; for the Spanish Wikipedia, it is eswiki. You can find a mapping between wiki_ids and the corresponding web address, site name, and language at MediaWiki Content File Exports/WikiId Mappings.
How do I know that a file export is ready for a specific wiki?
As of today, the process generating the file export and the process that makes these files publicly available are separate. This means that, for any specific wiki, we could make files available before the whole file export is done.
However, the SHA256SUMS file will not be available until the whole process is done. Thus, users should not attempt to download files until that file is available.
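A readiness check can therefore be sketched as probing for the SHA256SUMS file. The URL below is a deliberately unresolvable placeholder, not a real endpoint:

```shell
#!/usr/bin/env bash
# Probe whether an export is ready by requesting its SHA256SUMS file.
# EXPORT_URL is a hypothetical placeholder (.invalid never resolves).
EXPORT_URL="https://example.invalid/exports/enwiki/2026-02-01"
if curl -fsI --max-time 10 "$EXPORT_URL/SHA256SUMS" > /dev/null; then
  echo "export ready: safe to download"
else
  echo "export not ready: retry later"
fi
```

`curl -f` makes a 404 (or, as here, a resolution failure) return a non-zero exit status, so the script falls through to the retry branch.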
I have downloaded all of the files for a specific wiki. What do the file names of individual files mean?
Most filenames look like this:
wikidatawiki-2026-02-01-p996p1009.xml.bz2
The first part is the wiki_id, the second part is the publication date, and the third part is the range of page_ids contained in the file. Thus, the above file will contain all revisions of pages in the page_id range [996, 1009], both ends inclusive.
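As an illustration, this simple filename layout can be parsed with a bash regular expression; the layout is taken from the example above, and the variable names are our own:

```shell
#!/usr/bin/env bash
# Parse a simple export filename of the form
#   <wiki_id>-<YYYY-MM-DD>-p<first_page_id>p<last_page_id>.xml.bz2
f="wikidatawiki-2026-02-01-p996p1009.xml.bz2"
if [[ "$f" =~ ^([a-z0-9_]+)-([0-9]{4}-[0-9]{2}-[0-9]{2})-p([0-9]+)p([0-9]+)\.xml\.bz2$ ]]; then
  echo "wiki_id=${BASH_REMATCH[1]}"        # wikidatawiki
  echo "date=${BASH_REMATCH[2]}"           # 2026-02-01
  echo "page_ids=[${BASH_REMATCH[3]}, ${BASH_REMATCH[4]}]"   # [996, 1009]
fi
```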
Some files, however, look like this:
wikidatawiki-2026-02-01-p91783883r1170242959r1888636761.xml.bz2
wikidatawiki-2026-02-01-p91783883r1888636840r2273850121.xml.bz2
In these cases, a page was found to have too many or too large revisions, so to keep the file size and corresponding computation cost manageable, we chose to export it in its own set of files. In the specific example above, page_id = 91783883 was found to be big. Thus, the algorithm exports it in two files: one includes any revisions belonging to this page in the range [1170242959, 1888636761], while the second includes any revisions in the range [1888636840, 2273850121]. Between those two files, you can find the entirety of page_id = 91783883.
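Under the same assumptions, these per-page filenames can be parsed too; the r-delimited numbers are treated here as the revision range described above (variable names are our own):

```shell
#!/usr/bin/env bash
# Parse a per-page export filename of the form
#   <wiki_id>-<date>-p<page_id>r<first_rev>r<last_rev>.xml.bz2
f="wikidatawiki-2026-02-01-p91783883r1170242959r1888636761.xml.bz2"
if [[ "$f" =~ -p([0-9]+)r([0-9]+)r([0-9]+)\.xml\.bz2$ ]]; then
  echo "page_id=${BASH_REMATCH[1]}"                            # 91783883
  echo "revisions=[${BASH_REMATCH[2]}, ${BASH_REMATCH[3]}]"    # [1170242959, 1888636761]
fi
```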