This page describes the first step of the edit history reconstruction pipeline: the loading of MediaWiki data into Hadoop. This is done using a couple of scripts stored in the analytics-refinery-source repository. After that, the next steps in the pipeline process that data within the cluster and generate the desired output.
Sqooping data
The main script uses Apache sqoop to import a set of MediaWiki tables from the publicly available database replicas and from the production replicas into the Analytics' Hadoop cluster. You can find it in the analytics-refinery github. The MediaWiki tables imported from the public replicas are:
archive
category
categorylinks
change_tag
change_tag_def
collation
content
content_models
externallinks
image
imagelinks
ipblocks_restrictions
iwlinks
langlinks
logging
page
pagelinks
page_props
page_restrictions
redirect
revision
slots
slot_roles
templatelinks
user
user_groups
user_properties
wbc_entity_usage
Tables imported from the production replicas are:
actor
comment
ipblocks
watchlist
Finally, some special-case sqoop jobs are used to get the cu_changes and discussiontools_subscription tables from the production replicas.
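To give a sense of what one of these imports involves, below is a minimal sketch of how a per-wiki, per-table sqoop invocation could be assembled and launched from Python. The replica host, credentials path, HDFS target directory and table are hypothetical placeholders; the real query text and options live in the refinery sqoop script:

import subprocess

def sqoop_table(dbname, table, snapshot):
    """Build and run a sqoop import for a single wiki table (illustrative only)."""
    # Hypothetical replica host, credentials and HDFS layout; the real values
    # are configured by the refinery scripts and puppet, not hard-coded here.
    command = [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://analytics-replica.example.org/{dbname}",
        "--username", "research",
        "--password-file", "/user/analytics/mysql-password.txt",
        "--query", f"SELECT * FROM {table} WHERE $CONDITIONS",
        "--target-dir", f"/wmf/data/raw/mediawiki/tables/{table}/snapshot={snapshot}/wiki_db={dbname}",
        "--num-mappers", "1",   # a single mapper is enough for small wikis
        "--as-avrodatafile",    # store the raw import as Avro files
    ]
    subprocess.run(command, check=True)

# Example: import the page table of Simple English Wikipedia for one monthly snapshot.
sqoop_table("simplewiki", "page", "2024-05")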
In addition to that, another table is created in Hadoop: namespace_mapping. It contains the localized namespaces for every wiki (see the namespace mapping script below). This sqooping process runs at the beginning of every month, and every run imports the full data from the beginning of time. It is deliberately non-incremental because MediaWiki data (revision, archive and logging) can be altered retroactively, affecting records created in the past.
Sqooping wikis in groups
As in the other steps of the pipeline, this script has faced performance challenges related to the size and nature of the data. Sqoop would crash when trying to import all wikis at once, but a separate job for each of the ~800 wikis would be too slow and error prone. The solution we went for is to group the wikis into clusters that sqoop can process in parallel. The groups were determined by studying the sizes of all wikis. You can see a diagram of the partitions in github.
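The partitioning itself is maintained by hand in the repository, but the underlying idea is simple size-based bucketing. The sketch below shows one hypothetical way to greedily pack wikis into a fixed number of groups of roughly equal total size; the sizes and group count are made-up examples, not the values actually used:

import heapq

def group_wikis(wiki_sizes, num_groups):
    """Greedily assign wikis to groups, always adding the next-largest wiki
    to the currently smallest group, so group totals stay balanced."""
    # Min-heap of (total_size_so_far, group_index).
    heap = [(0, i) for i in range(num_groups)]
    groups = [[] for _ in range(num_groups)]
    for wiki, size in sorted(wiki_sizes.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        groups[idx].append(wiki)
        heapq.heappush(heap, (total + size, idx))
    return groups

# Hypothetical revision counts, just to illustrate the balancing.
sizes = {"enwiki": 1_000_000_000, "dewiki": 200_000_000,
         "frwiki": 180_000_000, "simplewiki": 9_000_000}
print(group_wikis(sizes, num_groups=2))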
Namespace mapping script
An important part of this process is the generation of the namespace mapping table. This table holds the relation between namespace names and namespace ids for all wikis. Note that many wikis have (or have had at some point in time) their own localized versions of the namespace names, like "Benutzer" (German) instead of "User". This table translates all versions (localized and standard) of namespace names into their namespace ids, which helps further steps in the pipeline normalize and reconstruct the editing data. You can find the namespace mapping script in github.
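The linked script is the source of truth for how the table is built; as a rough sketch of the idea, the localized and canonical namespace names of a wiki can be fetched from the standard MediaWiki siteinfo API and flattened into (namespace name, namespace id) pairs. The wiki host and the flattening shape below are illustrative assumptions:

import requests

def namespace_rows(wiki_host):
    """Fetch canonical, localized and alias namespace names for one wiki
    and return (namespace_name, namespace_id) pairs."""
    resp = requests.get(
        f"https://{wiki_host}/w/api.php",
        params={
            "action": "query",
            "meta": "siteinfo",
            "siprop": "namespaces|namespacealiases",
            "format": "json",
        },
        timeout=30,
    )
    data = resp.json()["query"]
    rows = []
    for ns in data["namespaces"].values():
        # "*" holds the localized name, "canonical" the English one (absent for the main namespace).
        rows.append((ns["*"], ns["id"]))
        if "canonical" in ns and ns["canonical"] != ns["*"]:
            rows.append((ns["canonical"], ns["id"]))
    for alias in data["namespacealiases"]:
        rows.append((alias["*"], alias["id"]))
    return rows

# Example: on German Wikipedia both "Benutzer" and "User" map to namespace 2.
print(namespace_rows("de.wikipedia.org")[:10])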
Add a new table to be imported via Sqoop
Adding a new MariaDB table to be imported via Sqoop involves modifying a number of import pipeline components:
1. Update python/refinery/sqoop.py in refinery, adding an entry for the new table to the query dictionary (see the hypothetical sketch after this list):
queries["NEW_TABLE_NAME"] = { ... }
2. Update the list of maintained tables in bin/refinery-drop-mediawiki-snapshots in refinery.
3. Add the table to the list of tables in modules/profile/templates/analytics/refinery/job/refinery-sqoop-mediawiki-not-history.sh.erb in puppet:
--tables category,categorylinks,collation,content,content_models,externallinks,file,filetypes,image,imagelinks,ipblocks_restrictions,iwlinks,langlinks,pagelinks,page_props,page_restrictions,redirect,slots,slot_roles,templatelinks,user_properties,wbc_entity_usage
4. Update main/dags/mediawiki/mediawiki_history_load_dag.py in airflow-dags.
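For step 1, the exact keys that each query dictionary entry must contain are defined in python/refinery/sqoop.py itself, so check that file before adding one. Purely as a hypothetical illustration of the pattern, and keeping in mind that a free-form Sqoop query must include a WHERE $CONDITIONS clause so the import can be split across mappers, a new entry could look roughly like:

# Hypothetical entry with made-up column names; the real required keys and
# query conventions are defined in python/refinery/sqoop.py in refinery.
queries["new_table_name"] = {
    "query": """
        SELECT nt_id, nt_page, nt_timestamp, nt_value
        FROM new_table_name
        WHERE $CONDITIONS
    """,
}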