Dumps - Wikitech
See Help:Shared storage#Dumps for information on using Dumps data from Toolforge.
Please note that these dumps now run under Airflow as Kubernetes workloads, rather than on the bare-metal Dump servers. This documentation is currently being updated to reflect the new configuration.
Documentation about Dumps is stored in a few different wikis:
- These Wikitech docs are for maintainers of the various dumps. Information about the clouddumps servers serving mirrors to various clients can be found at Portal:Data Services/Admin/Dumps.
- Information for users of the dumps can be found at Meta-wiki's m:Data dumps page.
- Information for developers can be found at MediaWiki-wiki's mw:SQL/XML Dumps page.
Daily checks
These daily checks are now integrated with the Data Platform Engineering/Ops week rota, so engineers should include monitoring of Dumps with other routine data pipeline checks:
- emails to the data-engineering-alerts internal mailing list
- the xmldatadumps-l public mailing list
- the Phabricator Dumps-Generation workboard (mentions the current run, unless idle)
Dumps types
We produce several types of dumps. For information about deployment of updates, architecture of the dumps, and troubleshooting each dump type, check the appropriate entry below.
- xml/sql dumps, which contain revision metadata and content for public Wikimedia projects, along with the contents of select sql tables
- adds/changes dumps, which contain a daily xml dump of new pages, or pages with new revisions since the previous run, for public Wikimedia projects
- Wikidata entity dumps, which contain dumps of 'entities' (Qxxx) in various formats, and a dump of lexemes, run once a week
- category dumps, which contain weekly full and daily incremental category lists for public Wikimedia projects, in rdf format
- other miscellaneous dumps, including content translation dumps, cirrus search dumps, and global block information
Other datasets are also provided for download, such as page view counts; these datasets are managed by other folks and are not documented here.
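As an illustration of what a maintainer-side check of one dump run might look like, the sketch below parses a run status document and lists jobs that have not finished. The field names (`jobs`, `status`) are an assumption modeled on the dumpstatus.json files published alongside each SQL/XML run, not an official schema, and the sample data is hypothetical.

```python
import json

# Hypothetical excerpt of a run status document; the job names and the
# "jobs"/"status" layout are assumptions, not an official schema.
SAMPLE_STATUS = """
{
  "jobs": {
    "siteinfonamespaces": {"status": "done"},
    "pagetable": {"status": "done"},
    "metahistorybz2dump": {"status": "in-progress"}
  }
}
"""

def unfinished_jobs(status_doc: str) -> list[str]:
    """Return the names of dump jobs whose status is not 'done'."""
    jobs = json.loads(status_doc)["jobs"]
    return [name for name, info in jobs.items() if info.get("status") != "done"]

print(unfinished_jobs(SAMPLE_STATUS))  # ['metahistorybz2dump']
```

A real check would fetch the status document for a specific wiki and run date from the dumps web server instead of using an inline sample.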
Service
Airflow
Please see Dumps/Airflow for more information on how the dumps are now executed and published.
Hardware
- Dumps servers, which provide the dumps to the public, to our mirrors, and via nfs to Wikimedia Cloud Services and stats host users

The snapshot and dumpsdata servers below are now legacy and will be decommissioned soon, following the migration to Airflow.
- Dumps snapshot hosts, which run scripts to generate the dumps
- Dumps datastores, where the snapshot hosts write intermediate and final dump output files, which are later published to our web servers
Adding new dumps
If you are interested in adding a new dumpset, please check the guidelines (still in draft form).
If you are working with wikibase dumps of some sort, you might want to look at a code walkthrough; see Dumps/Wikibase dumps overview.
Not an SLO but...
Dumps have never had an SLO. But current dumps maintainers have a set of unofficial standards for responsiveness and reliability.
We try to reply to newly filed tasks, emails from folks interested in hosting mirrors, and requests for information within 2 business days. This window may be extended if the dumps maintainers are ill or otherwise out of the office.
When the SQL/XML dumps for one or more wikis are broken, we do our best to respond to the breakage within 24 hours; this usually includes the filing of a task in Phabricator and some investigation of the problem. If changes to MediaWiki code are required, we will coordinate that work even when we do not write the patch, also arranging for a timely backport and deployment of the patch.
We do our best to ensure that all jobs for the SQL/XML dumps for every wiki are complete before the start of the next run. So for the run starting on the 1st of the month, all jobs on all wikis must be complete before the 20th of the month, and for the run starting on the 20th, all jobs for all wikis must be complete before the end of the month. This sometimes requires work on days off, or beyond the regular workday, in which case future workdays might be shortened to compensate.
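The completion schedule above can be sketched as a small helper: the run starting on the 1st must finish before the 20th of the same month, and the run starting on the 20th must finish before the end of the month. `run_deadline` is a hypothetical name for illustration, not part of the dumps codebase.

```python
from datetime import date

def run_deadline(run_start: date) -> date:
    """Day by which all jobs for an SQL/XML dump run should be complete,
    per the informal standard described above. Assumes runs start only
    on the 1st or the 20th of a month.
    """
    if run_start.day == 1:
        # Run starting on the 1st: finish before the 20th of the same month.
        return run_start.replace(day=20)
    # Run starting on the 20th: finish before the end of the month,
    # i.e. by the 1st of the following month.
    if run_start.month == 12:
        return date(run_start.year + 1, 1, 1)
    return date(run_start.year, run_start.month + 1, 1)

print(run_deadline(date(2024, 6, 1)))   # 2024-06-20
print(run_deadline(date(2024, 6, 20)))  # 2024-07-01
```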
Testing changes to the dumps or new scripts
See Dumps/Testing for more about this.
Mirrors
If you are adding a mirror, see Dumps Mirror setup.
Source code
operations/dumps.git