wikimediastatus.net - Wikitech
When distributing the link to others, include the www. prefix, as the HTTP redirect from wikimediastatus.net is served from (offsite) WMF infrastructure.
wikimediastatus.net is a public, high-level uptime monitor. It is separated from our production infrastructure and hosted by Atlassian Statuspage. It was launched in Jan 2022 and is maintained by the SRE team. It is the spiritual successor to status.wikimedia.org, which was hosted by Watchmouse. The new site is no longer under the wikimedia.org domain, both for security reasons (T293504) and so that it remains available in the event of an outage of Wikimedia DNS and/or our networking infrastructure.
Instructions for users
Please see user instructions for how to read and interpret the page.
SRE usage instructions
Historical background
See also: phab:T202061
Our status page is primarily intended to serve the general public and the news media. We expect community members to use it as a resource too, but it is not meant to replace, for example, on-wiki technical village pumps. The focus is on very visible/widespread outages.
We selected Atlassian's statuspage.io with the following considerations:
Because we want the site to be working even in a widespread failure of Wikimedia infrastructure, any solution needs to be hosted externally
We decided we did not want to take on the non-trivial engineering effort needed to run scalable external hosting + separate CDN
It's critically important that the status site be scalable and able to serve large spikes of load, because that is exactly what will happen to it in the event of a major outage to Wikimedia infra: not only will users be checking in, but the site is sure to be linked in popular news articles
There are very few FLOSS status page projects that are more than just "toy" projects, and of those which aren't, even fewer are actively maintained
statuspage.io had some distinguishing features: not just the basic manually-posted up/down functionality, but also support for automated uploads of timeseries metrics, and SLO-like uptime history on each component
What merits posting on the status page?
We intend to post only major outages. By "major outages" we mean problems so severe that the general public or the media might notice: issues like wikis being very slow or unreachable for many users. We don't intend to post for issues that only affect niche editing features, for example if automated citation generation is malfunctioning, if mathematical formula rendering is slow, or if the Job Queue has delays.
The status page will definitely be useful for the editor community and others directly involved in the projects, but it won’t be replacing forums for in-depth discussion like Technical Village Pumps or Phabricator – rather, it will supplement them, particularly as a place to check when the wikis are unreachable for you.
Statograph (automated metrics upload)
Statograph: Automatically uploads time-series metrics to the public status page.
(Image: animated illustration of a pantograph, the namesake of Statograph)
Language: Python
Source code: operations/software/statograph
Puppet classes: Puppet module, hiera configuration
statograph is a tool that uploads timeseries metrics from sources like Prometheus and Graphite to the metrics on your statuspage.io installation.
As configured at WMF, it runs on the alerting_host puppet role (e.g. alert1001, alert2001), and scrapes timeseries from Thanos (globally-aggregated Prometheus), as well as one from Graphite.
These metrics are intentionally chosen to be high-level and broad. This means that not only do they show many kinds of possible outages, but also that they are hopefully understandable even to users with limited technical knowledge.
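As a rough illustration of what scraping a timeseries from a Prometheus-compatible source involves, the sketch below builds a range-query URL for the standard Prometheus HTTP API (`/api/v1/query_range`) and parses a response of the documented matrix shape into plain points. The base URL, the PromQL expression, and the function names are hypothetical; this is not Statograph's actual code, and no network request is made.

```python
import json
from urllib.parse import urlencode


def build_query_range_url(base_url, promql, start, end, step=60):
    """Build a URL for the Prometheus HTTP API range-query endpoint."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(response_text):
    """Turn a range-query JSON response into (timestamp, value) pairs.

    Assumes the query yields exactly one series, as a single broad,
    public-facing metric would.
    """
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    series = body["data"]["result"]
    if len(series) != 1:
        raise ValueError(f"expected exactly one series, got {len(series)}")
    # Each sample is [unix_timestamp, "value-as-string"].
    return [(int(ts), float(value)) for ts, value in series[0]["values"]]


# Hypothetical endpoint and query, for illustration only.
url = build_query_range_url(
    "https://thanos.example.org",
    "sum(rate(http_requests_total[5m]))",
    start=1700000000,
    end=1700000120,
)

# A canned response in the shape the Prometheus API documents for range queries.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [{
            "metric": {},
            "values": [[1700000000, "12.5"], [1700000060, "13"], [1700000120, "11.75"]],
        }],
    },
})
points = parse_matrix(sample_response)
```

Rejecting anything other than exactly one series is a deliberate choice in this sketch: a public graph should show one aggregate line, so an unexpected label split is treated as an error rather than silently picking a series.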
Said metrics may also be found on a Grafana dashboard that (manually) mirrors Statograph's configuration.
Statograph is executed via a systemd timer that runs once a minute. Runs are idempotent, so running it on multiple hosts is a simple mechanism to give high availability.
More information on its execution model and on statuspage.io's API can be found in its Uploader class.
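To make the idempotency point concrete, here is a minimal sketch of one way a once-a-minute run can be made safe to repeat or to run from several hosts: each run asks the destination for the newest timestamp it already holds and submits only newer points. The `FakeStatusPage` class is an in-memory stand-in invented for this example, and `run_once` is a hypothetical name; whether Statograph's Uploader class works this way is an assumption, not something the text above confirms.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (unix timestamp, value)


class FakeStatusPage:
    """In-memory stand-in for the remote metrics store (hypothetical)."""

    def __init__(self) -> None:
        self.data: Dict[str, Dict[int, float]] = {}

    def latest_timestamp(self, metric_id: str) -> int:
        # 0 means "nothing uploaded yet" in this sketch.
        return max(self.data.get(metric_id, {}), default=0)

    def submit(self, metric_id: str, points: List[Point]) -> None:
        # Keyed by timestamp, so resubmitting a point overwrites it
        # instead of creating a duplicate.
        self.data.setdefault(metric_id, {}).update(points)


def run_once(page: FakeStatusPage, metric_id: str, source_points: List[Point]) -> int:
    """One timer-triggered run: upload only the points the page lacks."""
    cutoff = page.latest_timestamp(metric_id)
    missing = [(ts, value) for ts, value in source_points if ts > cutoff]
    if missing:
        page.submit(metric_id, missing)
    return len(missing)


page = FakeStatusPage()
source = [(60, 1.0), (120, 2.0)]

first = run_once(page, "latency", source)    # uploads both points
second = run_once(page, "latency", source)   # same input again: nothing to do
# A run a minute later (or from a second host) sees one new point.
third = run_once(page, "latency", source + [(180, 3.0)])
```

Because the destination's own state decides what is missing, no host needs local bookkeeping, which is what lets two hosts run the same timer without coordinating.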
See also
Launch task: phab:T202061
External links
Announcing wikimediastatus.net, Wikimedia Blog, March 2022.