Purged - Wikitech

purged (pronounced "purge-dee") is a daemon running on all cache hosts that reads Kafka purge messages, parses them, and turns them into HTTP purge requests to be sent to the local ATS and Varnish daemons (via TCP or Unix sockets).
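To make the last step of that flow concrete, here is a minimal sketch of what a single purge request to a local cache daemon could look like when issued by hand. It is not how purged itself is implemented. The helper name purge_cmd, the use of the PURGE HTTP method, and the default listen address 127.0.0.1:3128 are assumptions for illustration; the function only prints the curl command and sends nothing.

```shell
# Hypothetical helper: print (do not run) a curl command that would send an
# HTTP PURGE for one URL to a local cache daemon.
# Assumptions: the PURGE method and the default address below are illustrative.
purge_cmd() {
    url=$1
    addr=${2:-127.0.0.1:3128}   # assumed local cache address, not from this page
    rest=${url#*://}            # drop the scheme
    host=${rest%%/*}            # host is everything before the first slash
    case $rest in
        */*) path=/${rest#*/} ;;
        *)   path=/ ;;
    esac
    printf 'curl -s -X PURGE -H "Host: %s" "http://%s%s"\n' "$host" "$addr" "$path"
}
```

Calling purge_cmd with a full URL prints the command for review, which keeps the sketch safe to try on any machine.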
Detailed information about running purged instances is available on the purged Grafana dashboard.
The daemon is written in Go; see the repo.
CDN purge flow (MW to Kafka to Purged to Varnish/ATS).
Building the package
To build just the binary (not the Debian package), refer to the README.md file in the purged repository.
To target Debian Buster, build the package as follows on the build host:
WIKIMEDIA=yes BACKPORTS=yes ARCH=amd64 DIST=buster GIT_PBUILDER_AUTOCONF=no gbp buildpackage -jauto -us -uc -sa --git-builder=git-pbuilder
A similar build command can be issued for bullseye and bookworm.
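As a sketch of what "similar" could mean in practice, the helper below prints the build command from this page with a different DIST value. The function name purged_build_cmd is hypothetical, and it assumes that DIST is the only variable that changes between releases, which this page does not state explicitly.

```shell
# Hypothetical helper: print the build command for a given Debian release.
# Assumption: only DIST differs between buster, bullseye and bookworm.
purged_build_cmd() {
    dist=$1
    case $dist in
        buster|bullseye|bookworm) ;;
        *) echo "unsupported distribution: $dist" >&2; return 1 ;;
    esac
    printf 'WIKIMEDIA=yes BACKPORTS=yes ARCH=amd64 DIST=%s GIT_PBUILDER_AUTOCONF=no gbp buildpackage -jauto -us -uc -sa --git-builder=git-pbuilder\n' "$dist"
}
```

Printing the command rather than running it makes it easy to check before pasting it on the build host.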
During the bullseye -> bookworm transition, the same purged version (as code) must be available in both distribution releases. In this case the package versions used are, for example, 0.21+deb11u1 for Debian Bullseye (deb11u1) and 0.21+deb12u1 for Debian Bookworm (deb12u1).
This implies that, unless different tags or branches are used for git-buildpackage, every software change results in two debian/changelog entries and two separate package builds and imports into reprepro.
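The versioning scheme above can be sketched as a small helper that derives the package version from the upstream version and the target release. The function name purged_pkg_version and the optional revision argument are hypothetical; the buster mapping to deb10 follows the Debian release number and is not taken from this page.

```shell
# Hypothetical helper: build a per-release package version such as
# 0.21+deb11u1 from an upstream version and a Debian codename.
purged_pkg_version() {
    upstream=$1
    dist=$2
    rev=${3:-1}                 # assumed default revision, as in the examples
    case $dist in
        buster)   debver=10 ;;  # Debian 10, not mentioned in the versioning text
        bullseye) debver=11 ;;
        bookworm) debver=12 ;;
        *) echo "unknown distribution: $dist" >&2; return 1 ;;
    esac
    printf '%s+deb%su%s\n' "$upstream" "$debver" "$rev"
}
```

For example, purged_pkg_version 0.21 bullseye prints 0.21+deb11u1, and the same call with bookworm prints 0.21+deb12u1, matching the two changelog entries described above.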
Alerts
Because the purged daemon can be very resource intensive, especially at startup or when many messages need to be processed, it is preferable to depool the host first if purged.service needs to be restarted for any reason.
The details about which pages need to be purged come from Kafka. For this reason we monitor the amount of time elapsed since the last Kafka message was received and alert if it exceeds a certain threshold. Here is what the alert looks like:
Time elapsed since the last kafka event processed by purged on cp2041 is CRITICAL: cluster=cache_text instance=cp2041 job=purged site=codfw topic={codfw.resource-purge,eqiad.resource-purge} https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2041
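The logic behind this kind of alert can be sketched as a simple age check. This is only an illustration of the idea, not the actual alert definition: the function name kafka_event_age_status, the timestamps passed as arguments, the threshold value, and the exit code 2 for the critical state are all assumptions.

```shell
# Hypothetical sketch: compare the age of the last processed Kafka event
# against a threshold. All arguments are in seconds (epoch or durations).
kafka_event_age_status() {
    last=$1        # time the last event was processed
    now=$2         # current time
    threshold=$3   # assumed threshold, the real value is not given on this page
    age=$((now - last))
    if [ "$age" -gt "$threshold" ]; then
        echo "CRITICAL: last Kafka event ${age}s ago (threshold ${threshold}s)"
        return 2
    fi
    echo "OK: last Kafka event ${age}s ago"
}
```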
If this happens and nobody from Traffic is around to take a look, the best course of action is killing purged and checking the journal to see if it is being restarted properly by systemd:
$ sudo pkill -9 purged
$ sudo journalctl -u purged --since today
Jun 29 08:57:05 cp2041 systemd[1]: purged.service: Main process exited, code=killed, status=9/KILL
Jun 29 08:57:05 cp2041 systemd[1]: purged.service: Failed with result 'signal'.
Jun 29 08:57:05 cp2041 systemd[1]: purged.service: Consumed 2d 16h 43min 48.527s CPU time.
Jun 29 08:57:05 cp2041 systemd[1]: purged.service: Service RestartSec=100ms expired, scheduling restart.
Jun 29 08:57:05 cp2041 systemd[1]: purged.service: Scheduled restart job, restart counter is at 58.
Jun 29 08:57:05 cp2041 systemd[1]: Stopped Purger for ATS and Varnish.
Jun 29 08:57:05 cp2041 systemd[1]: purged.service: Consumed 2d 16h 43min 48.527s CPU time.
Jun 29 08:57:05 cp2041 systemd[1]: Started Purger for ATS and Varnish.
Jun 29 08:57:05 cp2041 purged[44684]: 2020/06/29 08:57:05 Listening for topics eqiad.resource-purge,codfw.resource-purge
Jun 29 08:57:05 cp2041 purged[44684]: 2020/06/29 08:57:05 Process purged started with 48 backend and 4 frontend workers. Metrics at :2112/metrics
Jun 29 08:57:05 cp2041 purged[44684]: 2020/06/29 08:57:05 Start consuming topics [eqiad.resource-purge codfw.resource-purge] from kafka
Jun 29 08:57:05 cp2041 purged[44684]: 2020/06/29 08:57:05 Reading from 239.128.0.112,239.128.0.115 with maximum datagram size 4096
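When scanning the journal by eye is inconvenient, the check can be approximated with a small filter. This is a sketch: the function name purged_restart_ok is hypothetical, and it relies only on the two message texts shown in the journal output above, which may change between purged versions.

```shell
# Hypothetical helper: read journal output on stdin and succeed only if
# systemd reported the unit as started and purged then began consuming
# from Kafka, in that order.
purged_restart_ok() {
    journal=$(cat)
    case $journal in
        *'Started Purger for ATS and Varnish'*'Start consuming topics'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not run here); a narrow time window avoids matching older restarts:
#   sudo journalctl -u purged --since "5 minutes ago" | purged_restart_ok && echo "restart looks fine"
```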
Gather information for bug reports
purged CPU profile diagram
If purged seems to be misbehaving, data such as perf reports, call graphs, and Go profiling can be useful to diagnose the issue.
One minute of perf data can be gathered on the host where purged is running with:
sudo timeout 60 perf record -p `pidof purged`
sudo perf report --stdio
Similarly, Go profile information can be collected with:
curl 'http://localhost:2112/debug/pprof/profile?seconds=60' > cpuprof
Copy the file cpuprof to a system with Go installed, and run the following command to print a breakdown of CPU usage to standard output:
go tool pprof -top cpuprof
A PNG profile diagram can be created with:
go tool pprof -png cpuprof
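If a different sampling window is needed, the profile URL can be built with a small helper. This is a convenience sketch: the function name pprof_url is hypothetical, and the default host and port are taken from the curl command shown in this section.

```shell
# Hypothetical helper: build the CPU profile URL for a given duration.
# Defaults mirror the command on this page: 60 seconds on localhost:2112.
pprof_url() {
    seconds=${1:-60}
    host=${2:-localhost:2112}
    printf 'http://%s/debug/pprof/profile?seconds=%s\n' "$host" "$seconds"
}

# Example (not run here): collect 30 seconds and print the top consumers.
#   curl -s "$(pprof_url 30)" > cpuprof && go tool pprof -top cpuprof
```

Quoting the URL, as the command substitution inside double quotes does here, keeps the shell from treating the question mark as a glob character.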
See also
Source code: