Data Platform/Data Lake/Edits
The Analytics Data Lake contains a number of editing datasets. To access this data, see how to request and set up access. For the access guidelines that apply to this data, see the Data Access Guidelines and accessing sensitive data. For recipes that work with lots of data, see Data_Platform/Data Lake/Cookbook.
Note: Unlike the traffic datasets, edit datasets are not continuously updated. They are instead rebuilt regularly by fully re-importing the source data, creating a new snapshot. This snapshot notion is key when querying the Edits datasets, since including multiple snapshots doesn't make sense for most queries. As of 2017-04, snapshots are provided monthly. When we import, we grab all the data available from all tables except the revision table, for which we filter by rev_timestamp <= <snapshot cutoff>. If a snapshot is a little late because of processing problems, then by the time it finishes it may have more data in tables like logging, archive, etc. This should not affect history reconstruction, because we base everything on revisions, but it will affect any queries you run on those tables separately.
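As a minimal sketch of what pinning a snapshot looks like (assuming a Spark session is available, for example in a Jupyter notebook on an analytics client; the snapshot value is a placeholder), the queries below list the available snapshots of the mediawiki_history table described further down, then restrict to a single one:

```python
# Minimal sketch, assuming an existing Spark session on an analytics client.
# The snapshot value '2024-01' is a placeholder; pick one listed by SHOW PARTITIONS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each partition value (e.g. snapshot=2024-01) is one full monthly rebuild.
spark.sql("SHOW PARTITIONS wmf.mediawiki_history").show(truncate=False)

# Always restrict a query to a single snapshot: each snapshot is a full
# re-import, so mixing snapshots would count the same events several times.
spark.sql("""
    SELECT COUNT(*) AS rows_in_snapshot
    FROM wmf.mediawiki_history
    WHERE snapshot = '2024-01'
""").show()
```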
The pipeline used to generate these edits datasets is described at Data_Platform/Systems/Data Lake/Edits/Pipeline.
Datasets
Reference Data
wmf_raw.mediawiki_project_namespace_map
Raw MediaWiki data
These are unprocessed copies of the MariaDB application tables (most of them publicly available) that back our MediaWiki installations. They are stored in the wmf_raw database. The main difference from the original database tables is that the import bundles all wikis together in each table, facilitating cross-wiki queries. Every table therefore contains an additional field, wiki_db, which lets you choose the wikis to query. This field is also a Hive partition, so restricting on it makes queries much faster because the data of other wikis does not have to be read. A query sketch restricted on these partitions follows the table list below.
mediawiki_archive
mediawiki_cu_changes (from the CheckUser extension)
mediawiki_imagelinks
mediawiki_private_ipblocks
mediawiki_logging
mediawiki_page
mediawiki_pagelinks
mediawiki_redirect
mediawiki_revision
mediawiki_user
mediawiki_user_groups
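The following is a hedged sketch of the partition restriction in practice. It assumes an existing Spark session; the wiki and snapshot values are examples, and page_is_redirect is a standard MediaWiki page-table field:

```python
# Sketch: restrict a wmf_raw query on both partition fields, wiki_db and
# snapshot, so only the selected wiki's data for one import is read.
# Assumes an existing Spark session; wiki and snapshot values are examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT page_namespace, COUNT(*) AS redirect_pages
    FROM wmf_raw.mediawiki_page
    WHERE wiki_db = 'enwiki'      -- partition: only this wiki is read
      AND snapshot = '2024-01'    -- partition: one monthly import
      AND page_is_redirect = 1    -- standard MediaWiki page-table field
    GROUP BY page_namespace
    ORDER BY redirect_pages DESC
""").show()
```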
Processed data
These tables contain preprocessed data, usually stored in Parquet format and sometimes with additional fields. They can be found in the wmf database. A query sketch against mediawiki_history, the most commonly used of them, follows the list below.
mediawiki_history: fully denormalized dataset containing processed user, page, and revision data
mediawiki_history dumps: TSV dump of the fully denormalized mediawiki_history dataset, available for download from MediaWiki Dumps
mediawiki_history_reduced: a reduced version of the mediawiki_history dataset, with fewer fields and specific precomputed events so that the Druid datastore can compute by-page and by-user activity levels
mediawiki_user_history: a subset of mediawiki_history containing only user events
mediawiki_page_history: a subset of mediawiki_history containing only page events
mediawiki_metrics: dataset providing precomputed metrics over edits data (e.g. monthly new registered users or daily edits by anonymous users)
mediawiki_wikitext_current: Avro version of the current-page XML dumps (updated monthly, mid-month). It contains the text of each page's latest revision as well as some page and user information.
mediawiki_wikitext_history: Avro version of the full revision-history XML dumps (updated monthly, late in the month). It contains the text of each non-deleted revision as well as some page and user information.
edit_hourly: cube-like dataset focused on edits. Its structure resembles that of pageview_hourly. It has hourly granularity and is partitioned by snapshot (as it is computed from mediawiki_history).
Geoeditors: counts of editors by project and country at different activity levels. For reference, this is migrated from the old Data_Platform/Systems/Geowiki
mediawiki_geoeditors_daily
mediawiki_geoeditors_monthly
mediawiki_geoeditors_edits_monthly
Public bucketed version of geoeditors monthly
Wikidata entity: a Parquet version of the Wikidata JSON dumps. Updated weekly, partitioned by snapshot.
Wikidata item page link: links between Wikidata items and wiki pages (wiki_db, page_id). This is computed every week using the wikidata_entity, mediawiki_page_history, and project_namespace_map tables. Warning: the page-history table is only updated monthly, so as the month progresses, item-to-page links become less accurate.
Public dataset
Download from
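To make the catalog above concrete, here is a hedged sketch of a typical query against the denormalized mediawiki_history table mentioned above: monthly revision counts for one wiki. It assumes an existing Spark session and the event_entity, event_type, and event_timestamp fields of the mediawiki_history schema; the snapshot and wiki values are examples.

```python
# Sketch: monthly revision counts for one wiki from wmf.mediawiki_history.
# Assumes an existing Spark session; snapshot and wiki values are examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

monthly_edits = spark.sql("""
    SELECT substr(CAST(event_timestamp AS STRING), 1, 7) AS month,  -- 'YYYY-MM' prefix;
           COUNT(*)                                      AS revisions  -- explicit cast works
    FROM wmf.mediawiki_history                                         -- whether the field is a
    WHERE snapshot = '2024-01'                                         -- string or a timestamp
      AND wiki_db = 'enwiki'
      AND event_entity = 'revision'   -- revision-level events only
      AND event_type = 'create'       -- i.e. edits
    GROUP BY substr(CAST(event_timestamp AS STRING), 1, 7)
    ORDER BY month
""")
monthly_edits.show(24, truncate=False)
```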
Limitations of the historical datasets
Users of this data should be aware that the reconstruction process is not perfect. The resulting data is not 100% complete throughout all wiki-history. In some specific slices/dices of the data set, some fields may be missing (null) or approximated (inferred value).
Why?
MediaWiki databases are not meant to store history (revisions, yes, of course; but not user history or page history). They hold part of the history in the logging table, but it is incomplete and formatted in many different ways depending on the software version. This makes the reconstruction of MediaWiki history a really complex task. Sometimes the data simply is not there and cannot be reconstructed.
The data is also considerably large. The reconstruction algorithm needs to reprocess the whole database(s) from the beginning of time at every run, because MediaWiki constantly updates old records of the logging table. This presents hard performance challenges for the reconstruction job and makes the code much more complex. We need to balance the complexity of the job against data quality: at some point, adding a lot of complexity would only "maybe" improve quality for a small percentage of the data. For example, if only 0.5% of pages have field X missing and getting the information to fix the field would make reconstruction twice as complex, it will not be corrected but rather documented as not present. This is a balance of requirements, so please let us know if we are missing something there.
How much/Which data is missing?
After vetting the data for some time, we estimated that the recoverable data we did not manage to recover represents less than 1%. We also saw that this data corresponds mostly to the earlier years of reconstructed history (2007-2009), and is especially related to deleted pages. We do not yet have an in-depth analysis of the completeness of the data; it is in our backlog, see phab:T155507.
Will there be improvements in the future to correct this missing data?
Yes, if we know that the improvement will have enough benefit. The mentioned task would help in measuring that.
Examples
History of deleted pages that are (re)created:
Correctly identifying a page as deleted and recreated might be straightforward for small sets of pages. It might also be simplified if "recreated" does not mean the page was undeleted by an administrator. As mentioned above, the way MediaWiki logs data changes over time, which further complicates the identification process, particularly at the scale of all wikis. You might therefore find examples of pages that were recreated with the same page ID, namespace, and title, which can make their creation and deletion timestamps in the history table appear incorrect. If you want to run analysis on those kinds of cases, further narrowing of the dataset (e.g. by time) might allow for correct processing, as in the sketch below.
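As a hedged illustration of that narrowing approach (assuming an existing Spark session and the page-event fields of the mediawiki_history schema; the snapshot, wiki, and time bounds are example values), one can pull only page create and delete events for a limited window and look for page IDs that appear with several creations:

```python
# Sketch: narrow mediawiki_history to page create/delete events in a short
# time window, to inspect pages deleted and recreated under the same page_id.
# Assumes an existing Spark session; all literal values are examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

page_events = spark.sql("""
    SELECT page_id, page_title, page_namespace, event_type, event_timestamp
    FROM wmf.mediawiki_history
    WHERE snapshot = '2024-01'
      AND wiki_db = 'enwiki'
      AND event_entity = 'page'
      AND event_type IN ('create', 'delete')
      AND event_timestamp >= '2023-01-01'   -- full-date literals compare correctly
      AND event_timestamp <  '2023-07-01'   -- whether the field is a string or a timestamp
    ORDER BY page_id, event_timestamp
""")
page_events.show(50, truncate=False)
```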