Requests for comment/Multi datacenter strategy for MediaWiki/Progress - MediaWiki
From mediawiki.org
This page is obsolete. It is being retained for archival purposes.
It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date.
Current status at:
wikitech:Performance/Multi-DC MediaWiki
References:
Multi-DC master tracking task
Multi DC strategy RFC:
Requests for comment/Multi datacenter strategy for MediaWiki
Multi-DC sync-up meeting regular attendees:
Aaron
Stas
Gabriel
Brandon
Giuseppe
Filippo
Gilles
Timo
JaimeC
2016-08-17
MediaWiki:
[assigned] Flow cache purges to use WAN cache (
[blocked] action=rollback uses GET (
patch reverted for now (user JS breakage); patch to be tweaked
needs user input; ask comm liaisons, ask Design/Reading?
[in progress] wikidata master queries (
Subtask created:
First patch:
Configuration:
[unstarted] Switch parts of config to something like etcd.
Databases:
[done] pt-heartbeat usage for lag detection (
[in progress] mariadb clients (MediaWiki) to use TLS/SSL(
Make sure new cross-DB TLS connections are rare (opening a connection has 10x worse latency than non-SSL). We already use TLS for replication (one continuous connection) with no visible overhead.
Certificate management???
I need to coordinate with Performance and Availability to standardize all MySQL services on the same HA solution. That may require MediaWiki changes so that most of
gets simplified to a single IP + port per "micro-service". Also, those 2 files should probably disappear, leaving only db.php, given that we will have a single active-active setup (?) T141547
Media storage / Swift:
[unstarted] HTTPS for swift:
swiftrepl/MediaWiki cross-DC writes use HTTP now. Let's clean this up before doing active/active though.
Session storage / redis:
[in progress] Use a dedicated HyperSwitch/cassandra cluster? (
Sync writes for ChronologyProtector (
) and SSL needed
What is the advantage of using restbase vs. direct cassandra?
RestBase allows us to narrow the public interface (no way to drop or list all data, etc.) and gives independence from the backend. Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase.
Last meeting affirmed cautious support for cassandra/hyperswitch
Idea of services team focusing on session before auth storage was floated (would be useful for multi-DC work)
CDN / traffic:
[done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
Patch to distinguish callback updates deployed
Graph for GETs:
Mostly logging, parsercache updates, spreadAnyEditBlock() is 20/minute
[deferred] VCL routing logic:
Services:
[in progress] look into mcrouter to see if it can work for WANCache
Either email some people or use a GitHub question
initial mcrouter debianization:
Firming up design for session & auth service:
Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
ACTION: Gabriel to set up meeting for session storage next week.
The Big Active / Active Goal™
When to call it out / how far away are we from starting active-active operation?
What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
Workboard:
Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?
2016-08-03
MediaWiki:
[assigned] Flow cache purges to use WAN cache (
[blocked] action=rollback uses GET (
patch reverted for now (user JS breakage); patch to be tweaked
needs user input; ask comm liaisons, ask Design/Reading?
[in progress] wikidata master queries (
Subtask created:
First patch:
Configuration:
[unstarted] Switch parts of config to something like etcd.
Databases:
[unblocked] pt-heartbeat usage for lag detection (
Config patch at
datacenter column now present \o/
[unblocked] Deploy MASTER_GTID_WAIT() support (
Patch merged in core
Config patch at
(might do testwiki first though)
[unstarted] mariadb clients (MediaWiki) to use TLS/SSL(
Make sure cross-DB TLS connections are rare (opening a connection has 10x worse latency than non-SSL). We already use TLS for replication (one continuous connection) with no visible overhead.
Certificate management???
[status?] ES compression...blocker?
Not a blocker
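A minimal sketch of the lag check that the pt-heartbeat item above is aiming at, assuming the heartbeat rows have already been read from a replica. The 'datacenter' column comes from the notes; the `ts` field name, the row layout, and the function name `replication_lag` are illustrative assumptions, not MediaWiki's actual implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def replication_lag(rows: List[Dict], master_dc: str, now: datetime) -> Optional[float]:
    """Estimate replica lag in seconds from heartbeat rows.

    rows: dicts with a 'datacenter' key and a 'ts' datetime (assumed layout).
    Only heartbeats written by the master in `master_dc` are considered, so a
    stale heartbeat from another datacenter cannot skew the estimate.
    """
    candidates = [row["ts"] for row in rows if row["datacenter"] == master_dc]
    if not candidates:
        return None  # no heartbeat from that datacenter: lag is unknown
    newest = max(candidates)
    # Clamp at zero so small clock skew never reports negative lag.
    return max(0.0, (now - newest).total_seconds())


now = datetime(2016, 8, 3, 12, 0, 0, tzinfo=timezone.utc)
rows = [
    {"datacenter": "eqiad", "ts": now - timedelta(seconds=3)},
    {"datacenter": "codfw", "ts": now - timedelta(seconds=40)},
]
lag_eqiad = replication_lag(rows, "eqiad", now)
lag_unknown = replication_lag(rows, "esams", now)
```

Filtering by datacenter is the point of the new column: without it, a replica cannot tell which heartbeat row belongs to the currently active master.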
Media storage / Swift:
[unstarted] HTTPS for swift:
swiftrepl uses HTTP now. Do we want to add MediaWiki to this?
[HARD BLOCKER] let's do SSL first
Session storage / redis:
[in progress] Use a dedicated HyperSwitch/cassandra cluster? (
Sync writes for ChronologyProtector (
) and SSL needed
What is the advantage of using restbase vs. direct cassandra?
RestBase allows us to narrow the public interface (no way to drop or list all data, etc.) and gives independence from the backend. Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase.
Last meeting affirmed cautious support for cassandra/hyperswitch
Idea of services team focusing on session before auth storage was floated (would be useful for multi-DC work)
CDN / traffic:
[done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
Patch to distinguish callback updates deployed
Graph for GETs:
Mostly logging, parsercache updates, spreadAnyEditBlock() is 20/minute
[deferred] VCL routing logic:
Services:
[unstarted] look into mcrouter to see if it can work for WANCache
Either email some people or use a GitHub question
initial mcrouter debianization:
Firming up design for session & auth service:
Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
ACTION: Gabriel to set up meeting for session storage next week.
The Big Active / Active Goal™
When to call it out / how far away are we from starting active-active operation?
What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
Use tracking ticket: ACTION: Aaron to create, discuss at next meeting.
Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?
2016-07-20
MediaWiki:
[done] restbase BagOStuff subclass (
[assigned] Flow cache purges to use WAN cache (
[blocked] action=rollback uses GET (
patch reverted for now (user JS breakage); patch to be tweaked
needs user input; ask comm liaisons, ask Design/Reading?
[unstarted] wikidata master queries (T110399)
Subtask created: T138376
[done] notify users to use POST for rollback/markpatrolled/purge tools
Databases:
[blocked] pt-heartbeat usage for lag detection (
Config patch
waiting on 'datacenter' pt-heartbeat table column
[done] MASTER_GTID_WAIT() support (
Initial version done, maybe test in betalabs with mariadb next?
[done] talk to RE about mariadb version (
[unstarted] mariadb clients (MediaWiki) to use TLS/SSL(
Media storage / Swift:
[unstarted] HTTPS for swift:
Session storage / redis:
[in progress] Use a dedicated HyperSwitch/cassandra cluster? (
Old patch for direct cassandra use:
Sync writes for ChronologyProtector (
) and SSL needed
What is the advantage of using restbase vs. direct cassandra?
RestBase allows us to narrow the public interface (no way to drop or list all data, etc.) and gives independence from the backend. Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase.
CDN / traffic:
[done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
T137326: done
[deferred] VCL routing logic:
Services:
[unstarted] change_propagation module for CDN cache purges
[unstarted] look into mcrouter to see if it can work
initial mcrouter debianization:
[unstarted] develop xkey purge strategy: Brandon to set up initial brainstorm meeting
looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes):
new librdkafka based node client looking good, starting beta testing; adds Kafka 0.9/0.10 support
Firming up design for session & auth service:
Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
ACTION: Gabriel to set up meeting for session storage next week.
The Big Active / Active Goal™
When to call it out / how far away are we from starting active-active operation?
What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
Use tracking ticket: ACTION: Aaron to create, discuss at next meeting.
Aaron: I'd rather use a tag and board, TODO
Blocking tasks are now all in the etherpad
Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?
2016-06-22
MediaWiki:
[under review] restbase BagOStuff subclass (
[unassigned] Flow cache purges (
[assigned] action=rollback uses GET (
reverted for now (user JS breakage); patch to be tweaked
[unstarted] wikidata master queries (T110399)
Subtask created: T138376
[in progress] notify users to use POST for rollback/markpatrolled/purge tools
Databases:
[blocked] pt-heartbeat usage for lag detection (
Config patch
waiting on 'datacenter' pt-heartbeat table column
[in progress] MASTER_GTID_WAIT() support (
Initial version done, maybe test in betalabs with mariadb next?
[ACTION] talk to RE about mariadb version
[unstarted] mariadb clients (MediaWiki) to use TLS/SSL(
Media storage / Swift:
[done] Experiment with sync/async and watch statsd for api entry point for multiwrite backend
done and left on; no noticeable effect on api entry points
[unstarted] HTTPS for swift:
Session storage / redis:
[in progress] Use restbase/cassandra cluster? (
Old patch for direct cassandra use:
Sync writes for ChronologyProtector (
) and SSL needed
CDN / traffic:
[done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
T137326: done
[deferred] VCL routing logic:
Services:
change_propagation module for CDN cache purges
[unstarted] look into mcrouter to see if it can work
initial mcrouter debianization:
looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes):
2016-06-08
MediaWiki:
[unassigned] restbase BagOStuff subclass (
[unassigned] Flow cache purges (
[assigned] action=rollback uses GET (
reverted for now (user JS breakage); patch to be tweaked
[unstarted] wikidata master queries (T110399)
Databases:
[blocked] pt-heartbeat usage for lag detection (
Config patch
waiting on 'datacenter' pt-heartbeat table column
[wip] MASTER_GTID_WAIT() support (
Initial version done, maybe test in betalabs with mariadb next?
[unstarted] mariadb clients (MediaWiki) to use TLS/SSL(
Media storage / Swift:
[unstarted] Experiment with sync/async and watch statsd for api entry point for multiwrite backend
statsd graphs finally fixed (at
use 'sync' if not too slow (little upload API speed change per statsd) (
[unstarted] HTTPS for swift:
Session storage / redis:
[unassigned] Use restbase/cassandra cluster? (
Old patch for direct cassandra use:
Sync writes for ChronologyProtector (
) and SSL needed
CDN / traffic:
[deferred] VCL routing logic:
VCL or Apache proxying?
Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes. Is there a task for this?
tracks master queries on GET/HEAD
[ACTION] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
Too many deferred updates and a few sync exceptions (writes will be cross-DC then)
[status?] General Active/Active support (incl non-MW, not sticky-cookie specific):
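The "basic protection" idea raised under VCL routing logic (fail a safe-method request with a 500 if it tries to write outside the primary datacenter) could look roughly like the sketch below. All names here (`check_write_allowed`, `handle_request`, `UnsafeWriteError`) are hypothetical and only illustrate the guard; they are not MediaWiki APIs.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


class UnsafeWriteError(Exception):
    """A safe HTTP method attempted a write outside the primary datacenter."""


def check_write_allowed(method: str, in_primary_dc: bool) -> None:
    """Raise if a safe method is about to write from a non-primary datacenter."""
    if method.upper() in SAFE_METHODS and not in_primary_dc:
        raise UnsafeWriteError(
            f"{method} request attempted a write outside the primary DC"
        )


def handle_request(method: str, in_primary_dc: bool, wants_write: bool) -> int:
    """Return an HTTP status code: 500 when the guard trips, otherwise 200."""
    try:
        if wants_write:
            check_write_allowed(method, in_primary_dc)
            # The actual write would happen here.
        return 200
    except UnsafeWriteError:
        return 500


status_get_secondary = handle_request("GET", in_primary_dc=False, wants_write=True)
status_post_secondary = handle_request("POST", in_primary_dc=False, wants_write=True)
```

The guard deliberately leaves POST alone: under the routing plan discussed here, unsafe methods are expected to reach the primary datacenter, so only safe methods that write are treated as bugs.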
Services:
change_propagation module for CDN cache purges
[unstarted] look into mcrouter to see if it can work
looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes):
2016-05-25
MediaWiki:
EventBus purge relayer for WAN cache
Flow cache purges (
action=rollback uses GET (
Reduce cross DC wiki DB queries
action=purge and wikidata (T110399)
Databases:
pt-heartbeat usage for lag detection (
We need both decent HA and correct lag estimates in all DCs
MASTER_GTID_WAIT() support (
Cross DC writes and TLS/SSL (e.g. writes via DeferredUpdates or CentralAuth autocreate) (
Media storage / Swift:
FileBackendMultiWrite 'async' upload/thumbnail race conditions
Option 1: use 'sync', Option 2: plug into ChronologyProtector to force 'master' backend
Experiment with sync/async and watch statsd for api entry point
HTTPS for swift:
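To make the 'async' upload/thumbnail race and the two options above concrete, here is a toy in-memory model. It is only a sketch under assumptions: `MultiWriteStore` and its methods are invented for illustration and do not mirror the real FileBackendMultiWrite interface.

```python
from typing import Dict, List, Optional, Tuple


class MultiWriteStore:
    """Toy model of a master store plus one replicated store."""

    def __init__(self, sync: bool) -> None:
        self.sync = sync
        self.master: Dict[str, bytes] = {}
        self.replica: Dict[str, bytes] = {}
        self.pending: List[Tuple[str, bytes]] = []

    def write(self, path: str, data: bytes) -> None:
        self.master[path] = data
        if self.sync:
            # Option 1: replicate before the write returns.
            self.replica[path] = data
        else:
            # 'async': replication is deferred, which opens the race window.
            self.pending.append((path, data))

    def flush(self) -> None:
        """Apply deferred replication (stands in for the post-send job)."""
        for path, data in self.pending:
            self.replica[path] = data
        self.pending.clear()

    def read(self, path: str, force_master: bool = False) -> Optional[bytes]:
        # Option 2: a caller that recently wrote can force the master store.
        store = self.master if force_master else self.replica
        return store.get(path)


store = MultiWriteStore(sync=False)
store.write("a/ab/Example.jpg", b"original")
stale = store.read("a/ab/Example.jpg")                      # replica not updated yet
fresh = store.read("a/ab/Example.jpg", force_master=True)   # Option 2
```

In this model the thumbnail request corresponds to the `stale` read: it arrives before the deferred replication has run and finds no original to scale.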
Session storage / redis:
Sync writes for ChronologyProtector (
) and SSL needed
Use another system (like a cassandra cluster?) (
CDN / traffic:
VCL routing logic:
Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes. Is there a task for this?
Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
General Active/Active support (incl non-MW, not sticky-cookie specific):
Experiment with % of traffic to codfw (avoid loops?)
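One possible shape for the "% of traffic to codfw" experiment, shown only as a sketch: requests are bucketed deterministically by a hash, and a request that has already been forwarded once is always served locally, which is one simple way to avoid loops. The function name, the `already_forwarded` flag, and the use of a request identifier are assumptions for illustration; the notes do not specify a mechanism.

```python
import zlib


def route(request_id: str, percent_to_codfw: int, already_forwarded: bool) -> str:
    """Pick a destination for a request.

    already_forwarded: set when another datacenter has already proxied this
    request (for example via a marker header); such requests are never
    forwarded again, so they cannot bounce between datacenters.
    """
    if already_forwarded:
        return "local"
    # crc32 is stable across processes, unlike Python's built-in hash().
    bucket = zlib.crc32(request_id.encode("utf-8")) % 100
    return "codfw" if bucket < percent_to_codfw else "eqiad"


none_to_codfw = route("req-1", 0, already_forwarded=False)
all_to_codfw = route("req-1", 100, already_forwarded=False)
no_loop = route("req-1", 100, already_forwarded=True)
```

Hash-based bucketing keeps a given identifier in the same datacenter as the percentage is raised, which makes the experiment easier to reason about than random sampling.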
Services:
change_propagation module for WAN cache purges
2016-05-11
ACTION ITEMS:
MediaWiki:
EventBus purge relayer for WAN cache
Flow cache purges (
?action=rollback uses GET (
Databases:
pt-heartbeat usage for lag detection (
Related: cross-datacenter state visibility (in general, chronology checks). Use GTID? Use pt-heartbeat? Needs discussion. Joe mentions that it needs to work for "regular/simple" non-WMF MediaWiki setups.
MASTER_POS_WAIT() does not work cross-DC with current file/coords [Jaime will file a task]
Cross DC writes and TLS/SSL (e.g. writes via DeferredUpdates or CentralAuth autocreate)
Parsercache (not really DBs): General consensus on replacing the datastore, moving from MySQL to something else with mult (which should eventually be done). Jaime proposes a couple of fixes to have something quickly.
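On the GTID question above: binlog file/offset coordinates are local to the server that wrote them, while a GTID position means the same thing on every server, which is why it is a candidate for cross-DC chronology checks. The sketch below compares MariaDB-style GTID position strings (comma-separated `domain-server-sequence` entries) in plain Python. The helper names are hypothetical, and comparing only sequence numbers per domain is a simplifying assumption.

```python
from typing import Dict


def parse_gtid_pos(pos: str) -> Dict[int, int]:
    """Map replication domain id -> sequence number for a GTID position string."""
    result: Dict[int, int] = {}
    for part in pos.split(","):
        part = part.strip()
        if not part:
            continue
        domain, _server_id, sequence = part.split("-")
        result[int(domain)] = int(sequence)
    return result


def has_reached(replica_pos: str, target_pos: str) -> bool:
    """True if the replica has applied at least `target_pos` in every domain."""
    replica = parse_gtid_pos(replica_pos)
    target = parse_gtid_pos(target_pos)
    # A domain missing on the replica counts as "not reached".
    return all(replica.get(domain, -1) >= seq for domain, seq in target.items())


caught_up = has_reached("0-171-1050", "0-171-1000")
behind = has_reached("0-171-900", "0-171-1000")
```

A ChronologyProtector-style check would store the master position at write time and only read from a replica once a comparison like this succeeds or a timeout expires.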
Media storage / Swift:
FileBackendMultiWrite 'async' upload/thumbnail race conditions
Option 1: use 'sync', Option 2: plug into ChronologyProtector to force 'master' backend
Experiment with sync/async and watch statsd for api entry point
HTTPS for swift:
Session storage / redis:
Sync writes for ChronologyProtector (
Blocked on TLS/SSL for apaches <=> redis (
not supported)
Maybe use another system (like a cassandra cluster?) (
ElasticSearch:
Basically ready
CDN / traffic:
VCL routing logic:
Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes. Is there a task for this?
Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
General Active/Active support (incl non-MW, not sticky-cookie specific):
Experiment with % of traffic to codfw (avoid loops?)
Services:
change_propagation module for WAN cache purges