Data Platform - Wikitech
From Wikitech
Not to be confused with Data Services within Wikimedia Cloud Services.
Wikimedia's Data Platform is a collection of systems and services that enable data producers and consumers to discover, use, and collect data to derive insights, conduct research, and build new data products. The Data Platform is primarily maintained by the Data Platform Engineering team. To contact the team, use the intake process.
Get started
The Data Platform provides access to private data and internal WMF resources, so you must have specialized data access to use it. For public, open-access Wikimedia data and tools, see meta:Research:Data.
Discover data
Find datasets and documentation for WMF private data sources.
Access and query data
Use SQL query engines, Jupyter notebooks, libraries, and compute resources to explore and analyze data.
Transform and publish data
Create and share derivative datasets, reports, and dashboards based on existing Wikimedia data sources.
Collect data
Use the Test Kitchen to configure instruments and collect analytics data.
Advanced users: use the Event Platform to configure and deploy event streams.
Data platform infrastructure
Data platform systems and infrastructure include the data lake, ingestion and processing pipelines, and production search and query services.
Data pipelines
Information about data pipelines is currently at:
/Systems/Cluster
Airflow
Hadoop Event Ingestion
Category:Data pipelines
Search data and services
Using search for new features
Search Platform
Wikidata Query Service (WDQS)
Overview of data platform systems
Data Platform Technical Overview 2023
Analytics Data Platform 2021
The following list highlights some major Data Platform systems. For more details and a full list of Data Platform system documentation pages on this wiki, see Data Platform/Systems.
System name and link | Type | Access
Airflow | Workflow Job Scheduler | Private
Archiva | Repository for Java archives | Private
AQS - Analytics Query Service | REST API for analytics data | Public
Ceph | Software-defined storage serving block and object storage | Private
Clients (stat100X) | Analytics client nodes to access Hadoop and various services | Private
Cluster (Hadoop, Gobblin, Hive, Spark...) | Hadoop | Private
Datahub | Data Catalog | Private
Dashiki | Framework for building dashboards | Public
Druid | Data storage engine optimized for exploratory analytics | Private
EventLogging | Ad-hoc streaming pipeline | Private
EventStreams | MediaWiki event streams | Public
Growthbook | Analysis of experiments and A/B tests | Private
Kafka | Data transport and streaming system | Private
MariaDB | Data storage for MediaWiki replicas and EventLogging | Private
Matomo (formerly known as Piwik) | Small-scale web analytics platform | Private
Presto | High-performance SQL query engine for big data | Private
ReportUpdater | Job Scheduler | Private
Superset | Web interface for data visualization and exploration | Private
Jupyter | Hosted notebooks for data analysis | Private
Turnilo | Web interface for exploring data stored in Druid | Private
Wikistats (1 and 2) | Community Dashboard with high-level metrics | Public
Wmfdata-Python | Python package for streamlined data access on the analytics clients | Private
Full list of Data Platform systems
Data platform operations
Find ops week and other process documentation at Data Platform Engineering on Wikitech and the project pages on MediaWiki.org.
The list of scheduled manual maintenance tasks is documented at /Systems/Manual maintenance.