WMDE/Wikidata/SSR Service - Wikitech
This page provides a brief overview of the Server-side Rendering (SSR) Service.
Observability
Grafana dashboard for termbox service
Grafana dashboard for envoy proxy, filtered for termbox
Grafana dashboard for Termbox SSR Service Level Objective (SLO)
Grafana dashboard for Wikidata alerts with a panel showing Termbox request errors (requests from MediaWiki to Termbox)
Logstash
Logstash 2
(Todo: create some gadgets to see at a glance whether events are spiking, maybe consolidate this)
Details
Overview
The service was introduced in 2019, initially to serve server-side rendered content for the Wikidata/Wikibase "term box", i.e. the part of the item/property page UI where labels, descriptions and aliases are shown and can be edited.
The service is used as part of generating the HTML output sent from MediaWiki to the user's browser.
The HTML generated server-side can optionally be "enhanced" by client-side JavaScript.
There are a server-side and a client-side variant of the code, which are distributions of the same implementation.
The client-side variant is deployed into Wikibase at the file system level through git submodules.
If no server-side rendering service is configured, or if it malfunctions, the client-side code acts as a fallback.
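The fallback behaviour can be illustrated with a minimal sketch. It is not the Wikibase implementation (which is PHP, see the References); the function name, the `fetch` callback and the placeholder markup are assumptions made for illustration only.

```python
from typing import Callable, Optional

# Hypothetical placeholder markup; the real mount point element is defined in Wikibase.
CLIENT_MOUNT_POINT = '<div class="termbox-mount-point"></div>'


def render_termbox(ssr_url: Optional[str], fetch: Callable[[str], str]) -> str:
    """Return SSR HTML if available, else an empty mount point for the client-side code."""
    if not ssr_url:
        # No SSR service configured: the client-side code renders the term box.
        return CLIENT_MOUNT_POINT
    try:
        return fetch(ssr_url)
    except Exception:
        # SSR service malfunctioning (timeout, server error, ...): fall back to the client.
        return CLIENT_MOUNT_POINT
```

In both failure cases the page is still delivered; only the term box is rendered later, in the browser.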
Technology
The SSR service is a Node.js service written in TypeScript. The code is "compiled" to JavaScript using webpack. The "compiled" code and CSS can be found in the dist folder of the git repository.
The service uses Vue.js as the UI framework.
The service is deployed on the WMF services Kubernetes cluster using Helm. This means that the service is packaged as a Docker image. The Docker image is built by the deployment pipeline.
Deployment
The images that are used in production can be found on the WMF docker registry. New images are built automatically by the deployment pipeline after code is merged to the master branch.
On Beta, the image is just run by Docker. The configuration for this can be found in the git repo in the infrastructure folder. The instructions for applying those changes can also be found there.
In Wikimedia production, the service is managed using Kubernetes and Helm. Kubernetes deployments are configured in the operations/deployment-charts repo. There are four releases in total:
Two production releases, one for the eqiad cluster and one for codfw. These talk to Wikidata (wikidata.org, wikidatawiki) and are used by Wikidata as well.
A staging release, in the staging cluster. This one also talks to Wikidata, but is not used by anything.
A test release, also in the staging cluster. This one talks to Test Wikidata (test.wikidata.org, testwikidatawiki) and is used by Test Wikidata as well.
When deploying a new version of the Termbox, you should usually first update the test release (values-test.yaml) and deploy that to the staging cluster, then test that it works on Test Wikidata (check that a newly created item has an SSR termbox). Then, update the version in the production release (values.yaml; this will also update the staging release, because values-staging.yaml does not override the version). If you want to test the staging release before deploying the production release, you will have to do so using curl, because the staging release is not used by any wiki:
curl 'https://staging.svc.eqiad.wmnet:4004/termbox?entity=Q42&revision=1841500264&language=en&editLink=%2Fw%2Findex.php%2FSpecial%3ASetLabelDescriptionAliases%2FQ42&preferredLanguages=en%7Cde'
echo
# should return some HTML starting with

If this works, then deploy the production release to the eqiad and codfw clusters and check that new Wikidata items have an SSR termbox on mobile.
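The query string in the curl example above is percent-encoded by hand, which is easy to get wrong when testing another entity or revision. As a convenience, the same URL can be assembled programmatically. This is a minimal sketch: the helper name `termbox_url` is hypothetical, and the host, port and parameter names are simply copied from the curl example.

```python
from urllib.parse import urlencode

# Host, port and path copied from the curl example above.
STAGING_BASE = "https://staging.svc.eqiad.wmnet:4004/termbox"


def termbox_url(entity, revision, language, preferred_languages, base=STAGING_BASE):
    """Build a termbox request URL with a correctly percent-encoded query string."""
    params = {
        "entity": entity,
        "revision": revision,
        "language": language,
        "editLink": f"/w/index.php/Special:SetLabelDescriptionAliases/{entity}",
        # Languages are joined with "|", which urlencode turns into %7C.
        "preferredLanguages": "|".join(preferred_languages),
    }
    return f"{base}?{urlencode(params)}"


print(termbox_url("Q42", 1841500264, "en", ["en", "de"]))
```

With these arguments the function reproduces the URL used in the curl command; the result can be passed to curl in single quotes.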
Some useful metrics for monitoring the deployment can be found in Grafana.
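The reason the staging release follows the production version, while the test release does not, is the way Helm layers values files: a later file overrides only the keys it actually sets. The sketch below imitates that merge. The key names and version strings are placeholders rather than the real chart values, and it assumes that each release-specific file is layered on top of values.yaml.

```python
def merge_values(base, override):
    """Recursively layer `override` on top of `base`, similar to Helm values files."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical, heavily simplified file contents (placeholder keys and versions).
values = {"app": {"version": "v2"}}              # values.yaml
values_staging = {"app": {"replicas": 1}}        # values-staging.yaml: no version key
values_test = {"app": {"version": "v3"}}         # values-test.yaml: sets its own version

staging = merge_values(values, values_staging)
test = merge_values(values, values_test)
print(staging["app"]["version"], test["app"]["version"])  # prints "v2 v3"
```

Bumping the version in values.yaml therefore changes both the production and staging releases, while the test release stays on whatever values-test.yaml specifies.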
Architecture
Wikidata Termbox SSR Architecture Diagram
Wikidata Termbox SSR Sequence Diagram
Sequence diagram
"source code"
Initial deployment & load details
The initial responsibility of this service will be the rendering of the term box for Wikidata items and properties for mobile web views.
Currently wikidata.org gets no more than 80k mobile web requests per day (including cached pages and pages that are not item or property pages).
If we were to assume that all of these requests were actually to item and property pages that were not cached, this would result in this SSR service being hit about 55 times per minute.
In reality, some of these page views are not to item or property pages, and some will be cached, so we are looking at no more than 1 call per second.
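The worst-case estimate above is plain arithmetic on the 80k requests per day figure, spelled out here so it can be re-run when traffic changes:

```python
MOBILE_REQUESTS_PER_DAY = 80_000  # upper bound quoted above

# Worst case: every request is an uncached item/property page view.
per_minute = MOBILE_REQUESTS_PER_DAY / (24 * 60)
per_second = per_minute / 60

print(f"{per_minute:.1f} requests/minute, {per_second:.2f} requests/second")
# prints "55.6 requests/minute, 0.93 requests/second"
```

Even under the worst-case assumption the rate stays below one request per second.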
Availability objectives and accepted operational errors
The Service Level Objective (SLO) for the Termbox SSR is an error rate of less than 0.1%. The current error rate and number of errors can be seen at the Grafana Termbox SSR SLO dashboard.
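For a quick manual check, the SLO can be expressed as a comparison of an error ratio against 0.1%. This is only a sketch: the request and error counts would have to come from the dashboard or logs, the function names are hypothetical, and the dashboard's own calculation may differ.

```python
SLO_MAX_ERROR_RATE = 0.001  # 0.1 %


def error_rate(errors: int, total_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return errors / total_requests


def meets_slo(errors: int, total_requests: int) -> bool:
    """True if the error rate is strictly below the 0.1% objective."""
    return error_rate(errors, total_requests) < SLO_MAX_ERROR_RATE
```

For example, 50 errors in 80,000 requests is a rate of 0.0625% and meets the objective, whereas 120 errors in the same number of requests (0.15%) does not.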
Availability is affected by errors triggered inside Termbox SSR (i.e. the Node.js app living in Kubernetes) that are caused by operational or performance issues in MediaWiki. They are unavoidable to a degree and acceptable as long as their overall frequency stays low (see the SLO above). The bulk of those errors consists of the following three error messages:
timeout of 3000ms exceeded
Some of these timeout errors seem to happen surprisingly often during the health checks that are run periodically (config, docs). This is judged to be strange but probably harmless.
Disregarding the health checks that go to the unused datacenter above, these errors also seem to correspond almost perfectly to the errors logged in MediaWiki PHP logstash with the message "Wikibase\View\Termbox\Renderer\TermboxRemoteRenderer: Problem requesting from the remote server" and content "Request failed with status 0. Usually this means network failure or timeout".
The timeout for this connection going out from MediaWiki/PHP to the Termbox SSR is currently based on the wikibase default configuration.
Request failed with status code 500
This means that the MediaWiki API had a server problem.
Request failed with status code 503
These seem to be triggered by the Envoy proxy that sits between the Termbox SSR and the MediaWiki API. More detailed information about that is available in another Phabricator comment.
These errors are discussed in more detail in a Phabricator comment. Detailed descriptions of them are visible on logstash. Note that there seems to be a bug in how Prometheus calculates the numbers shown in Grafana, so they can diverge from what is shown in logstash.
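When looking through exported log messages, it can help to bucket them by the three known error types to see which one dominates. The sketch below is illustrative only: the bucket labels are made up, the message patterns are taken from the list above, and real logstash entries may carry additional fields.

```python
import re
from collections import Counter

# Patterns taken from the three error messages listed above; labels are hypothetical.
KNOWN_ERRORS = [
    (re.compile(r"timeout of \d+ms exceeded"), "timeout"),
    (re.compile(r"Request failed with status code 500"), "mediawiki_500"),
    (re.compile(r"Request failed with status code 503"), "envoy_503"),
]


def classify(message: str) -> str:
    """Map a log message to one of the known error buckets, or "other"."""
    for pattern, label in KNOWN_ERRORS:
        if pattern.search(message):
            return label
    return "other"


def summarize(messages):
    """Count messages per error bucket."""
    return Counter(classify(message) for message in messages)
```

A noticeable share of "other" messages would indicate an error that is not among the accepted operational ones and deserves a closer look.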
Debugging and Testing Production
To connect to the production services for testing, use SSH port forwarding as follows:
ssh -4 -L 3030:termbox.svc.codfw.wmnet:3030 @bast1002.wikimedia.org
You can alter the bastion host as needed.
You can also alter the service e.g. eqiad vs codfw.
References
Source code of the service
wikibase TermboxView falling back to termbox client code mount point DOM element