T124954: Decrease max object TTL in varnishes
Closed, Resolved · Public
Assigned To: BBlack
Authored By: BBlack on Jan 27 2016, 19:34 UTC
Tags: SRE (Backlog), Traffic (Caching), Performance-Team (Radar) (Limbo)
Referenced Files: None
Subscribers (18 total): Agabi10, Aklapper, BBlack, Danielsberger, EBernhardson, ema, gerritbot, ...
Description
Currently we set a hard cap on object lifetime at 30 days in our VCL for all clusters (in addition to a few tighter restrictions in certain cases). I think we can/should reduce this lifetime if we can.
Possible Concerns
Obviously, cache hitrate could be negatively impacted. However, I suspect this isn't a big problem in practice. If we end up reducing some long-lived objects from 30 days to, say, 14 days, the effective hitrate for a very hot object is virtually unchanged. For example, if an object is requested once per second and virtually never changes, we've gone from an effective hitrate of 99.9999614% to 99.9999173%. The less hot an object is, the less it matters for overall perf/hitrate averages anyway.
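The hitrate figures above fall out of a simple steady-state model: with one request per second and an object that never changes, there is exactly one miss (the refetch) per TTL window. A quick sketch (the function name is mine, not from the task):

```python
def effective_hitrate(ttl_days, reqs_per_sec=1.0):
    """Steady-state hitrate (%) for an object that never changes:
    one miss per TTL window, hits for every other request."""
    requests_per_window = ttl_days * 86400 * reqs_per_sec
    return 100.0 * (1.0 - 1.0 / requests_per_window)

print(round(effective_hitrate(30), 7))  # ~99.9999614
print(round(effective_hitrate(14), 7))  # ~99.9999173
```

The difference between a 30-day and a 14-day cap is in the seventh decimal place for a once-per-second object, which is the point being made.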
Long-lived objects help protect us in certain operational corner cases. The principal example is taking a cache cluster offline from live traffic for multiple days (e.g. due to network link risks), and then bringing it back online later without wiping (because the link was never actually down, and purges were flowing fine). In that scenario, the cache will effectively wipe itself anyway if the downtime exceeds the lifetime of most (or all) objects.
The upside is that by reducing the maximum cache lifetime, we reduce concerns and headaches related to stale objects (or at least, fears of very-stale objects) from code/asset deployers. In other words, we're able to provide a tighter guarantee of the form "Even if all else goes wrong with invalidation, nothing in this cache can possibly be older than X".
I'd like to propose that we come down first from 30 to 21 days, wait a month to make sure we've seen the effects, and then move down to 14 days, and remain at that value for the foreseeable future.
I've taken a few stats samples so far (single cache host, ~10 minute samples) to get some preliminary ideas. On the upload cluster, I'm seeing served Age: headers >= 86400 (1 day) on only 0.01% of responses. On the text cluster, it maps out like:
1s+: 99.70% (age < 1s: 0.30%)
1m+: 90.71% (age < 1m: 9.29%)
1h+: 53.85% (age < 1h: 46.15%)
1d+: 37.33% (age < 1d: 62.67%)
7d+: 12.37% (age < 7d: 87.63%)
14d+: 0.70% (age < 14d: 99.30%)
21d+: 0.67% (age < 21d: 99.33%)
[original figures in description here were flawed, these are more-valid numbers]
Details
Related Changes in Gerrit (subject · repo · branch · lines +/-):
- varnish: swap around backend ttl cap and keep values [2/2] · operations/puppet · production · +12/-9
- Lower default $wgSquidMaxage from 31 days to 14 days · operations/mediawiki-config · master · +1/-2
- cache_misc: raise default_ttl to 1h · operations/puppet · production · +1/-1
- cache_upload: 1d FE TTL cap · operations/puppet · production · +1/-1
- Set $wgSquidMaxage to 14 days on test2wiki · operations/mediawiki-config · master · +1/-1
- Lower $wgSquidMaxage to 1 day for test2wiki · operations/mediawiki-config · master · +3/-2
- cache_upload: experiment with 4h fe ttl cap · operations/puppet · production · +1/-1
- cache_text: cap frontend TTL at 1d · operations/puppet · production · +1/-0
- VCL: lower TTL caps from 14 to 7 days · operations/puppet · production · +4/-4
- VCL: cap all TTLs at 14d (or less in existing cases) · operations/puppet · production · +5/-5
- VCL: drop default ttl_cap to 21 days · operations/puppet · production · +4/-3
- VCL: ttl fixed/cap params vcl_fetch · operations/puppet · production · +14/-29
Related Objects (status · assigned · task):
- Restricted Task
- Duplicate · None · T109331: Deleted files sometimes remain visible to non-privileged users if permanently linked
- Duplicate · None · T133819: upload-lb.ulsfo.wikimedia.org still allow access to some deleted files
- Declined · None · T125920: [EPIC] Future exciting reading web performance endeavours
- Declined · Krinkle · T124966: Inline above-fold CSS in HTML response for MediaWiki to reduce time to first paint
- Duplicate · BBlack · T119038: Image cache issue when 'over-writing' an image on commons
- Resolved · ema · T133821: Make CDN purges reliable
- Resolved · Krinkle · T127328: Optimise critical rendering path
- Resolved · BBlack · T124954: Decrease max object TTL in varnishes
Mentioned In
T422985: WP25EasterEggs disabled but "Birthday mode (Baby Globe) settings" link still present
T373495: Investigate ways to reduce cache retention timespans
T340952: Edge caching issues on Vector 2022 in wmf.16
T341041: Vector 2022 is broken on wmf.16
T270796: Message boxes classes should carry `mw-`
T286835: Port RelatedArticles to Codex
T265543: UI Regression: Personal tools menu is appearing unstyled for anonymous users on cached HTML
T254227: Switch test wikis to new version of vector by default
T119366: Disable caching on the main page for anonymous users
T205355: A/B config flag should be subject to ResourceLoader caching rules not HTML caching rules
T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
T142848: Stop using persistent storage in our backend varnish layers.
T140921: Reduce static asset time on disk from five trains' worth to two
rOPUPe1e727d50a2f: cache_misc: raise default_ttl to 1h
rOPUPf8d67164cc16: cache_upload: 1d FE TTL cap
T138721: Remove duplicated styles in shared.css
rOPUPc9aee4fb850b: cache_upload: experiment with 4h fe ttl cap
rOPUP4fcbec53400d: VCL: cap all TTLs at 14d (or less in existing cases)
rOPUPb06f7d9c0dc6: VCL: lower TTL caps from 14 to 7 days
T135384: Raise cache frontend memory sizes significantly
rOPUP1e0cc7ae15e1: cache_text: cap frontend TTL at 1d
T127328: Optimise critical rendering path
rOPUPfd34f56c0d9e: VCL: lower TTL caps from 14 to 7 days
T50835: Separate Cache-Control header for proxy and client
T133821: Make CDN purges reliable
T131894: Collect Backend-Timing in Prometheus
T127571: Percentage of users with DNT on
T124966: Inline above-fold CSS in HTML response for MediaWiki to reduce time to first paint
T126063: Estimate effective cache time for text
Mentioned Here
T111588: RFC: API-driven web front-end
T46570: Time prior to removal of old wmfbranch directories from cluster MUST be higher than longest cache of ANY kind; leads to missing resources
T127328: Optimise critical rendering path
T50835: Separate Cache-Control header for proxy and client
rOPUPe10801bfe8fd: Add 'Backend-Timing' response header on all Apaches
Duplicates Merged Here
T126063: Estimate effective cache time for text
Event Timeline
There are a very large number of changes, so older changes are hidden.
gerritbot subscribed. Feb 11 2016, 14:19 UTC
Change 269967 had a related patch set uploaded (by BBlack): VCL: ttl fixed/cap params vcl_fetch
gerritbot added a project: Patch-For-Review. Feb 11 2016, 14:19 UTC
Change 269968 had a related patch set uploaded (by BBlack): VCL: drop default ttl_cap to 21 days
gerritbot added a comment. Feb 11 2016, 15:40 UTC
Change 269967 merged by BBlack: VCL: ttl fixed/cap params vcl_fetch
gerritbot added a comment. Feb 16 2016, 13:55 UTC
Change 269968 merged by BBlack: VCL: drop default ttl_cap to 21 days
BBlack added a comment (edited). Feb 16 2016, 17:30 UTC
I took a look at another small sample of data today, over on the cache_upload clusters, which we'd expect to behave very differently. This was a single 10-minute run on an eqiad upload cache. Things to keep in mind:
The upload frontends are (and have been historically) limited to 1h cache lifetime. This seems "bad" from a design perspective - there's no fundamental reason not to let objects live as long as they're able in the frontends, within the 30d limits at the backends. I've left it alone so far simply because it's probably (and perhaps accidentally) helping to paper over fallout from cache purge race conditions, where a frontend might otherwise keep longer-lived objects past their race-losing purge in the backends.
Due to the above, we can't accurately measure this at the frontend layer like we do with cache_text, as all served objects there have 1h TTL or less regardless of how long they live in the backends. Therefore the statistics I pulled were from an eqiad *backend* instance's cache hits, which is backending all datacenters.
All of that said, the results of binning up the Age: values coming out of an eqiad backend, for cache hits only, look like:
Total: 375009
1s+: 99.99%
1m+: 99.77%
1h+: 88.55%
4h+: 54.53%
12h+: 0.01%
1d+: 0.01%
7d+: 0.01%
14d+: 0 (actually 0, not just rounded to 0.00%)
The dropoff somewhere between 4 and 12 hours could be the result of the total set of unique URLs commonly fetched simply not fitting in the total hashed backend storage, leading to a naturally-short cache rollover time. Our total hashed backend storage in eqiad is ballpark 9.36TB, which is further split into subsets for giant objects and regular-sized objects (the split is at 100MB size limit, and ~17% goes to larger objects for ~1.5TB, and 83% to smaller ones for ~7.7TB).
In the net, we know from previous measurements that the cache object hitrate for the upload cluster is ~98% (counting hits at any layer as a hit), so I know we're not in performance trouble from lifetimes and/or LRU eviction in the general case.
BBlack added a comment. Feb 16 2016, 18:00 UTC
I re-ran the parsing script over the exact same input data as the last results, with finer-grained detail on the 4-12h range (I had captured the output at an intermediate stage of the pipeline just in case):
Total: 375009
1s+: 99.99%
1m+: 99.77%
1h+: 88.55%
2h+: 76.09%
3h+: 64.80%
4h+: 54.53%
5h+: 45.59%
6h+: 37.42%
7h+: 29.80%
8h+: 22.96%
9h+: 16.56%
10h+: 10.76%
11h+: 5.23%
12h+: 0.01%
1d+: 0.01%
7d+: 0.01%
14d+: 0
The time falloff there does seem somewhat "natural" in its pattern, although the fact that the natural pattern winds down at exactly 12h is a little smelly of some other limitation there...
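The parsing script referred to above bins served Age: values into cumulative "N+" buckets. A minimal sketch of that kind of binning (my own reconstruction, not the script actually used; the sample ages are made up):

```python
# Cumulative bins: share of responses whose served Age is >= the threshold.
BINS = [
    ("1s+", 1), ("1m+", 60), ("1h+", 3600), ("4h+", 4 * 3600),
    ("12h+", 12 * 3600), ("1d+", 86400), ("7d+", 7 * 86400),
    ("14d+", 14 * 86400),
]

def bin_ages(ages):
    """ages: Age: header values (in seconds) sampled from cache hits."""
    total = len(ages)
    return {label: 100.0 * sum(1 for a in ages if a >= threshold) / total
            for label, threshold in BINS}

sample = [0, 30, 5000, 90000, 90000, 700000]  # made-up ages, not real data
print(f"Total: {len(sample)}")
for label, pct in bin_ages(sample).items():
    print(f"{label}: {pct:.2f}%")
```

Because the bins are cumulative ("age at least N"), each row's percentage is necessarily less than or equal to the row above it, which is the shape of the tables in these comments.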
BBlack added a comment (edited). Feb 16 2016, 18:30 UTC
The next upload datapoint is this one, from an esams backend instance (it pulls from eqiad, and gets requests from esams frontends):
Total: 704980
1s+: 100.00%
1m+: 99.95%
1h+: 98.87%
2h+: 97.52%
3h+: 95.62%
4h+: 93.46%
5h+: 91.03%
6h+: 88.42%
7h+: 85.60%
8h+: 82.64%
9h+: 79.47%
10h+: 76.23%
11h+: 73.04%
12h+: 69.99%
1d+: 32.63%
7d+: 0.00%
14d+: 0 (truly zero)
Things to note about esams:
This data was for hits where the hit was at eqiad or esams backends.
While esams isn't tier-1 (it backends to eqiad caches), the total hashed storage in esams is 11.5TB vs eqiad's 9.36TB.
The implications I see here are:
The extra 2.2TB of storage moves the natural cache rollover times out a bit, so that they seem to be tapering down to zero-ish at ~2 days instead of 12 hours, but otherwise the pattern is similar.
We're still getting zero hits at 14days+ ... ?
BBlack added a comment. Feb 16 2016, 20:37 UTC
(note I've edited some of my cache_upload commentary above to remove questions/mysteries that turned out to mostly be my own braindeadness)
BBlack added a comment. Feb 17 2016, 18:03 UTC
So, I've figured out some of the things that were confusing me yesterday. To recap that:
I now question and need to investigate whether our TTL caps are really effective in the first place. In practice not many hits live as long as the caps anyways, but I think the previous thinking (that capping beresp.ttl on fetch at all layers is effective) is wrong. Capping beresp.ttl may affect the TTL in the local cache, but I don't think it actually affects the cacheability headers sent to the next layer up the chain. So we could, in fact, see objects live longer than the TTL cap in total with our current VCL. In other words, capping at 30 days of life at each of 3 layers of varnishd could equate to an effective 90 day cap when we're talking about absolute limits. This is not the first time I've been confused about related things, though - needs more investigation.
swift doesn't send any cacheability info, so the default is going to be the varnishd default_ttl setting, which is currently 3 days.
The beresp.ttl fixed/cap settings at various upload layers/tiers have to consider that effect. That's why the 1h cap on upload-frontend works at all: the object arrives with no TTL (no cacheability headers), defaults to 3 days, then gets capped down to 1h. For the backends, the current ttl_fixed + ttl_capped set it to 30 days independently at each layer (which could theoretically be additive as in (1) above), but we can't just remove the ttl_fixed at the tier-2 backends to fix that, as that would revert to default_ttl of 3 days.
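The additive-cap concern above can be illustrated with a toy model (assumptions mine: each layer caps only its local TTL and does not reflect that cap in the cacheability headers it passes upstream, so in the worst case each layer up the chain fetches the object just before the copy below it expires and then holds it for a full local cap):

```python
def worst_case_served_age(layer_caps_days):
    """Worst-case absolute age of content served to a client, given the
    local TTL cap (in days) at each cache layer, origin-side first.
    Ages compound because no layer tells the next one how old the
    content already is."""
    age = 0.0
    for cap in layer_caps_days:
        age += cap  # object can sit a full local cap at this layer
    return age

# Three varnishd layers, each independently capping at 30 days:
print(worst_case_served_age([30, 30, 30]))  # 90.0 (days)
```

This is the "30 days at each of 3 layers could equate to an effective 90 day cap" arithmetic from the comment, not a statement about what Varnish actually does; whether the caps really behave this way is exactly what the comment says needs investigation.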
ori mentioned this in T127571: Percentage of users with DNT on. Feb 20 2016, 00:23 UTC
Krinkle mentioned this in T131894: Collect Backend-Timing in Prometheus. Apr 5 2016, 21:48 UTC
Andrew triaged this task as Medium priority. Apr 14 2016, 21:05 UTC
MZMcBride subscribed. Apr 25 2016, 13:40 UTC
BBlack mentioned this in T133821: Make CDN purges reliable. Apr 28 2016, 00:08 UTC
BBlack added a parent task: T133821: Make CDN purges reliable.
gerritbot added a comment. May 5 2016, 16:14 UTC
Change 287109 had a related patch set uploaded (by BBlack): VCL: cap all TTLs at 14d (or less in existing cases)
BBlack added a comment. May 5 2016, 16:17 UTC
We're overdue to circle back to this, but there's also a lot of investigating and thinking left to do, and IMHO the varnish4 transition as well as the Surrogate-Control ideas (T50835) play into this as well. I think we're ultimately going to solve this problem with varnish4 and some custom Surrogate-Control stuff that's initially just inter-cache, and later we can expand that to supporting it from MediaWiki as well. For now, I think further dropping the text TTL cap from 21d to 14d, and dropping the upload cap from 30d to 14d, as in the patch above, will be an improvement and possibly help patch over any current fallout.
gerritbot added a comment. May 5 2016, 16:21 UTC
Change 287109 merged by BBlack: VCL: cap all TTLs at 14d (or less in existing cases)
BBlack mentioned this in T50835: Separate Cache-Control header for proxy and client. May 5 2016, 16:33 UTC
gerritbot added a comment. May 26 2016, 20:41 UTC
Change 291059 had a related patch set uploaded (by BBlack): VCL: lower TTL caps from 14 to 7 days
gerritbot added a comment. May 26 2016, 20:42 UTC
Change 291059 merged by BBlack: VCL: lower TTL caps from 14 to 7 days
BBlack mentioned this in rOPUPfd34f56c0d9e: VCL: lower TTL caps from 14 to 7 days. May 26 2016, 20:47 UTC
Krinkle mentioned this in T127328: Optimise critical rendering path. May 26 2016, 21:03 UTC
gerritbot added a comment. May 27 2016, 13:13 UTC
Change 291220 had a related patch set uploaded (by BBlack): cache_text: cap frontend TTL at 1d
gerritbot added a comment. May 27 2016, 13:16 UTC
Change 291220 merged by BBlack: cache_text: cap frontend TTL at 1d
BBlack mentioned this in rOPUP1e0cc7ae15e1: cache_text: cap frontend TTL at 1d. May 27 2016, 13:19 UTC
ema subscribed. May 27 2016, 15:03 UTC
Gilles subscribed. Jun 8 2016, 12:02 UTC
BBlack mentioned this in T135384: Raise cache frontend memory sizes significantly. Jun 9 2016, 15:14 UTC
BBlack mentioned this in rOPUPb06f7d9c0dc6: VCL: lower TTL caps from 14 to 7 days. Jun 17 2016, 18:07 UTC
BBlack mentioned this in rOPUP4fcbec53400d: VCL: cap all TTLs at 14d (or less in existing cases). Jun 17 2016, 18:10 UTC
gerritbot added a comment. Jun 17 2016, 20:57 UTC
Change 295007 had a related patch set uploaded (by BBlack): cache_upload: experiment with 4h fe ttl cap
gerritbot added a comment. Jun 17 2016, 20:57 UTC
Change 295007 merged by BBlack: cache_upload: experiment with 4h fe ttl cap
BBlack mentioned this in rOPUPc9aee4fb850b: cache_upload: experiment with 4h fe ttl cap. Jun 17 2016, 21:01 UTC
Krinkle added a parent task: T127328: Optimise critical rendering path. Jun 22 2016, 16:01 UTC
Krinkle subscribed and added a comment (edited). Jun 22 2016, 16:13 UTC
How does the cache ttl of Varnish interact with the concept of 304 renewals?
I remember in the past we often had bugs where a cache object had expired (but not yet garbage collected) at which point Varnish does (and should) make a request to the backend with a If-Modified-Since header. At this point, MediaWiki would respond with 304 Not Modified (since the page wasn't edited since that timestamp), and Varnish would renew the cache object.
This would cause data that is not strictly versioned to go stale indefinitely:
Skin html.
Links from that html to other static files (e.g. powered-by image).
Anchor links in the navigation sidebar (configurable through MediaWiki:Sidebar, and extendable from PHP extensions as well, which can get deployed or undeployed, e.g. WikimediaShopLink).
Translated interface messages such as "View history" etc.
I don't know if that problem ever got fixed, but if it isn't, then merely lowering the ttl in Varnish is not enough to unblock T127328.
Note, this "304 renewal" behaviour is very much intended and required in general. (The whole point of 304 is that you determine freshness without computing and transferring the whole page again.) Max-age (in Cache-Control) isn't about how long a client *stores* the content. It's about how long the client may *blindly use* the content without checking with the server (=304).
However if we want fault tolerance and easy migration, max-age isn't the way to do it. A low max-age does not mean that broken html will roll over after it expires. It also doesn't mean that it's safe to remove "unused" end points 30 days after we no longer emit them. So let's make sure that we understand what this "ttl" means exactly; if needed, we may need a second mechanism that (when reached) would result in Varnish requesting the backend without an If-Modified-Since/If-None-Match header.
BBlack added a comment. Jun 22 2016, 16:28 UTC
Varnish 3 and 4 may differ a bit on 304 basics, and Varnish 4 clearly does a better job of managing grace-mode in general, and using it for 304-refreshes, and my current recollections may be more Varnish4-tainted and miss something about Varnish3 without digging deeper. All that being said:
Yes, in general varnish will re-use stale objects for 304-refresh from backends.
I don't think it uses any random object that happens to still exist in storage. It re-uses objects that are still in their grace time, and once they're out of grace they're gone for all practical purposes, regardless of low-level storage GC/reuse.
So if an object has 7 days of real TTL and an additional 1 day of grace time, a request during the 8th day that could theoretically have used the stale object triggers a conditional request (e.g. IMS, or maybe even ETag) to the backend. If the conditional request gives a 304, Varnish refreshes the life of the stale object, reusing the content, and updates the relevant headers from the ones that came with the 304. If there was no request during the 8th day, a request on the 9th day would be a normal cache miss.
IMHO, if MediaWiki is handing out illegitimate 304s in response to conditional requests (saying something is Not Modified when it was, in fact, modified), then that's the bug to be fixed here.
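The TTL/grace decision described above can be sketched as a small table of outcomes (a simplified model in my own naming, not actual Varnish code):

```python
def cache_decision(age, ttl, grace):
    """What a varnish-like cache does with an object of the given age
    (all values in seconds), per the TTL + grace behaviour described
    in the comment above."""
    if age < ttl:
        return "hit"          # still fresh: serve directly
    if age < ttl + grace:
        return "revalidate"   # stale but in grace: conditional fetch (IMS/ETag)
    return "miss"             # past grace: gone for all practical purposes

day = 86400
print(cache_decision(3 * day, 7 * day, day))    # hit
print(cache_decision(7.5 * day, 7 * day, day))  # revalidate (the "8th day" case)
print(cache_decision(9 * day, 7 * day, day))    # miss (the "9th day" case)
```

A 304 answer to the "revalidate" case resets the object's TTL clock without replacing its content, which is what makes the indefinite-renewal scenario in the preceding comments possible.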
BBlack added a comment. Jun 22 2016, 16:30 UTC
I should have noted above: our current maximum grace is 1 hour beyond whatever the TTL is. Basically we're really not using grace very effectively today, but it's enough to be sure we handle the overlap well on fairly hot items that need to be refreshed occasionally.
Krinkle added a comment. Jun 24 2016, 14:29 UTC
@BBlack I agree that technically "Not Modified" is a lie from MediaWiki in that case, but I'm not convinced that behaviour is wrong or needs changing.
In many cases Not Modified means "not *significantly* changed". For two reasons:
Computational overhead to determine exact changes.
Impact of global cache invalidation on insignificant changes.
All of the below are examples of things that technically change the HTML output, but are not currently tracked (they are effectively stateless and just happen in whatever way they are currently configured - unlike content revisions, which have a timestamp and a revision ID).
Vector skin HTML.
e.g. wgReferrerPolicy and other things in wmf-config.
Static file references
e.g. bits.wikimedia.org > $wikidomain, $wikidomain.org/static/1.28-$version > $wikidomain/w.
Sidebar configuration.
e.g. installing or disabling WikimediaShopLink.
Any interface message.
Much more...
The only way to reliably track these is to essentially forego the optimisation for 304 responses, do a full page render, and make a hash digest (and use ETag to communicate it). It also would effectively lead to a full cache invalidation if anything changes anywhere. (Though Varnish and browsers would still be allowed to unconditionally cache for the ttl duration; after that it would always cause a fresh page render to happen in the backend, though it wouldn't need to be transferred per se.)
The computational overhead is probably manageable given that the majority of it already happens anyway (the overhead of contacting Apache backends, initialising MediaWiki WebStart, making several db queries). Wrapping the output is non-trivial, but manageable.
The impact of cache-invalidation may be undesirable though. But the more I think about it, it may not be that bad actually.
BBlack added a comment. Jun 24 2016, 14:42 UTC
Well, it's certainly legal from some point of view. But if you want to claim Not Modified on what are considered minor non-breaking changes then you have to live with the consequences that old content may live on indefinitely due to 304-refresh.
If there are content updates that affect broad swaths of content non-critically (like the examples you mention), couldn't we simply (a) *not* PURGE all related things immediately from traffic caches and (b) update the IMS timestamp (or ETag) when the parsercache entry is regenerated for each item affected by the change, and store that timestamp/etag with the parsercache output? I assume that's a slow/throttled process for massive updates, and it would let 304 still work correctly and efficiently. As items affected by such changes naturally fall out of TTL time in the caches, they'll get new data if the throttled parsercache update has already hit those objects. It puts an upper bound on how old things can get: up to $total_traffic_TTL after the slow parsercache update is done for a given change.
Krinkle added a subscriber: tstarling. Jun 24 2016, 14:50 UTC
For as long as I can remember (at least 6 years), we've made countless breaking changes based on the basic assumption that caches roll over within ttl ("30 days").
For example, earlier today, @tstarling wrote:
Also, in RaggettWrapper, switch to the new class mw-empty-elt, following
Html5Depurate, instead of mw-empty-li. The old class can be removed once
HTML caches have expired.
In this case, we're changing the parser output (as usual, without explicitly invalidating the parser cache key and purging all Varnish HTML cache), and expecting to safely remove the CSS declaration for the old output once the caches have expired. The url response from which the CSS is served will "modify" when that happens, and thus affect all cached content. Even once the parser cache has rolled over (which does truly roll over, given that it isn't HTTP based, but purely TTL/LRU based), per T124954#2399694 Varnish will happily renew the old parser output from its stale content over 304, and live on. For another ttl period, and again, etc., unless the page is edited or otherwise purged.
BBlack added a comment. Jun 25 2016, 11:58 UTC
Yeah it's not great, but what do you expect to happen? That's what we're telling Varnish to do based on the standards. This is the timeline we're talking about (just using a generic integer counter as time moving forward):
1. Fresh object X is generated in MediaWiki.
2. Varnish fetches X and gets some positive TTL N for caching.
3. The underlying object changes in MediaWiki before the TTL is even up.
4. X's TTL expires, at which point there's a small grace-window for "stale-while-revalidate" type behavior (so that new content can be fetched, or existing content re-validated, without stalling out clients).
5. Varnish asks if X has been modified since it was last fetched in (2).
6. MediaWiki says "304 - No, it hasn't, and you can cache it again for another TTL N" <- this is a lie, and if you tell this lie, Varnish is going to believe you.
7. Varnish happily refreshes headers/timestamps on the existing object for new clients going forward, putting it in the same basic state it had in (2); it can infinitely loop through these steps.
There are mitigating factors that probably make it *unlikely* that a bad object gets stuck in this cycle repeatedly:
While our maximum grace in grepping our VCL is 60 minutes, that's only on detection of an unhealthy backend, and our default grace is actually 5 minutes, so that's what applies most of the time. An object has to be hot enough to be requested during the 5 minute grace window at the end of its natural expiry to have a chance at the above. If it misses the 5-minute window it's gone for good and the first requesting client has to stall on reloading whole new content into the cache.
Objects can be pushed out of cache storage before they naturally expire (by newer objects) - surviving in the face of this depends, again, on hotness.
We do wipe caches over time, irregularly, due to maintenance. The frontends more often than the backends.
The ones to worry about the most are the very hot objects that we know never go 5 minutes without a fetch somewhere.
Why don't we update IMS timestamp or ETag when cached parser output actually-changes from slow rollover?
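The refresh loop in the steps above can be sketched as a toy simulation (a simplified model with names of my own; it tracks the age of the content bytes, which a 304 refresh does not replace):

```python
def served_content_age(ttl, grace, request_times):
    """Absolute age (seconds) of the cached content when the last request
    is served. request_times: seconds since the content was first fetched.
    A request in the grace window triggers a conditional fetch; assuming
    the origin always answers 304, only headers are refreshed and the
    original bytes live on. Returns 0 if the object falls out of grace
    (a real miss fetches fresh content)."""
    expires = ttl
    ts = sorted(request_times)
    for t in ts:
        if t < expires:
            continue              # plain hit, no refresh needed
        if t < expires + grace:
            expires = t + ttl     # 304 refresh: same content, new TTL clock
        else:
            return 0              # out of grace: fresh content fetched
    return ts[-1] if ts else 0

day = 86400
# Hot object (TTL 1d, grace 5min) revalidated right at each expiry:
# the same original bytes are still being served three days later.
print(served_content_age(day, 300, [day, 2 * day, 3 * day]) / day)  # 3.0
```

This also shows the mitigation: if the object misses even one grace window (it isn't hot enough), the loop breaks and fresh content is fetched.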
matmarex mentioned this in T138721: Remove duplicated styles in shared.css. Jun 26 2016, 22:00 UTC
Krinkle added a comment. Jun 27 2016, 17:36 UTC
In T124954#2406470, @BBlack wrote:
"Why don't we update IMS timestamp or ETag when cached parser output actually-changes [..]"
There is no detection of that kind of change. We don't version the Parser right now. And even if we would, we'd have to somehow salt it with all relevant configuration and list of activated extensions to be precise. Similar to how it's impractical to do a full hash of the skin output (see previous comment), doing so for parser output would equally require a lot of state tracking and/or doing all the computations we're trying to save in the first place.
Alternatively, we could change MediaWiki to enforce that cached responses will not be used beyond the intended max-age. We'd compare to max(revision.timestamp, now - maxage) instead of just revision.timestamp. That effectively means that if the last tracked change was more than (maxage) in the past, we'll return false from the If-Modified-Since check and respond with a regenerated 200 OK.
Edit: Looks like we do that already! (Done for T46570, which is an example of the kind of bug that happens when secondary content goes stale due to 304-renewal.)
$lastMod = $module->getConditionalRequestData( 'last-modified' );
if ( $lastMod !== null ) {
    $modifiedTimes = [
        'page' => $lastMod,
        'user' => $this->getUser()->getTouched(),
        'epoch' => $this->getConfig()->get( 'CacheEpoch' ),
    ];
    if ( $this->getConfig()->get( 'UseSquid' ) ) {
        // T46570: Stateless data can still change even if the wiki page did not
        $modifiedTimes['sepoch'] = wfTimestamp(
            TS_MW, time() - $this->getConfig()->get( 'SquidMaxage' )
        );
    }
    $lastMod = max( $modifiedTimes );
}
So, with that, we just need to decide what to do with $wgSquidMaxage in wmf-config. That is the effective config for how long untracked content may be served to users, not the Varnish ttl. That maxage is what we should look at when removing unused server endpoints, unused styles, etc. For most purposes, the length of this is merely an inconvenience (shorter allows faster iteration, but migration works either way). For T127328 to be unblocked however, we need it to actually be low enough, since it's not about migration but about freshness of styles across content. Ideally as low as the Varnish ttl (24 hours).
Krinkle updated the task description. Jun 27 2016, 17:42 UTC
gerritbot added a comment. Jun 29 2016, 02:54 UTC
Change 296495 had a related patch set uploaded (by Krinkle): Lower $wgSquidMaxage to 1 day for test2wiki
gerritbot added a comment. Jun 29 2016, 03:05 UTC
Change 296495 merged by jenkins-bot: Lower $wgSquidMaxage to 1 day for test2wiki
BBlack added a comment. Jun 30 2016, 14:39 UTC
@Krinkle - the varnish TTL cap is *per layer*, and it's still 7 days in the backend layers (it's only 1 day in the frontend layers). If the test2wiki change is intended to go to production, IMHO it's not a good idea to drop the squid maxage to 1 day. It needs to at least be 7 days, but I'd start higher than that (14?) until we get past the Varnish4 transition for text and can make grace-mode behaviors work better.
Re: detecting parser output changes, couldn't we just do a hash over the output to generate an ETag?
gerritbot added a comment. Jun 30 2016, 16:02 UTC
Change 296765 had a related patch set uploaded (by Krinkle): Set $wgSquidMaxage to 14 days on test2wiki
Krinkle added a comment. Jun 30 2016, 16:45 UTC
In T124954#2418150, @BBlack wrote:
"Re: detecting parser output changes, couldn't we just do a hash over the output to generate an ETag?"
That's a paradox. If we do that, we'd have to validate the ETag on an If-None-Match request by invoking the parser and extension hooks on those backend requests, hashing the output, and comparing the hashes. That would be rather expensive.
Most things previously mentioned, and dozens more aspects of a MediaWiki page response, are actually not even in the parser output cache. They're also not versioned in a way accessible to the run-time. To verify nothing changed, one would have to build the whole page. Since that's too expensive, we essentially decided long ago to instead only track the critical portion (revision content). The rest is still important, but as long as we can be sure that html responses unconditionally expire and regenerate on demand after X time, it's fine. Slow deployment is acceptable for those, as long as they do get universally deployed, eventually, and within a predictable timeframe.
That timeframe has historically been 31 days. Until last year we did leak a fair amount beyond 31 days due to 304-renewals, but that was fixed after
T46570
by forcing Last-Modified to be
max(revision.timestamp, cacheEpoch, now-smaxage)
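A rough sketch of that clamping rule (the function name is illustrative; MediaWiki's actual implementation differs in detail). Taking the max of the three timestamps guarantees the advertised Last-Modified is never older than now minus smaxage, so conditional requests stop matching once an object exceeds its maximum age:

```python
from datetime import datetime, timedelta, timezone

def clamped_last_modified(revision_ts, cache_epoch, smaxage_seconds):
    """Clamp Last-Modified so a 304 can never renew an object past smaxage."""
    now = datetime.now(timezone.utc)
    floor = now - timedelta(seconds=smaxage_seconds)
    return max(revision_ts, cache_epoch, floor)
```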
In most cases, changes of this kind are not very noticeable and okay to roll out gradually over our content (e.g. for up to 30 days, different articles may have either the old or new version). For example, migration from bits urls to local load.php was okay to roll out slowly. For more user-visible aspects, we tend to use CSS - in which case they do roll out globally at once since that's a separate request url. However that's changing with
T127328
, which will move some styles into the html.
A few years ago we made some major changes to the Vector skin, and for a month users perceived an alternating layout from one page to another. I hope to avoid that in the future with
T111588
(which, like ESI, applies the skin as separate cacheable entity at the edge).
Anyway, back to the topic of this task. Let's start by lowering smaxage from MediaWiki to 14 days?
gerritbot
added a comment.
Jun 30 2016, 9:48 PM
2016-06-30 21:48:20 (UTC+0)
Change 296765 merged by jenkins-bot:
Set $wgSquidMaxage to 14 days on test2wiki
BBlack
added a comment.
Edited
Jul 1 2016, 2:12 AM
2016-07-01 02:12:45 (UTC+0)
@Krinkle
- I think 14d for the maximum s-maxage MW advertises to Varnish is fine for now. We'd obviously like to, in the long run, get the effective lifetimes even lower (both enforced in Varnish, and in the s-maxage or similar from MW), but I don't think it's safe to go much lower until we get through the V4 transition and switch to proper use of Surrogate-Control between layers and using grace-mode correctly to handle the datacenter/network outage cases (as in, have the "normal" TTLs down somewhere in the 1d range, but have grace-mode capable of using stale objects in emergencies for a week).
Danielsberger
subscribed.
Jul 7 2016, 6:48 PM
2016-07-07 18:48:23 (UTC+0)
gerritbot
added a comment.
Jul 14 2016, 2:29 PM
2016-07-14 14:29:58 (UTC+0)
Change 298968 had a related patch set uploaded (by BBlack):
cache_upload: 1d FE TTL cap
gerritbot
added a comment.
Jul 14 2016, 2:29 PM
2016-07-14 14:29:59 (UTC+0)
Change 298970 had a related patch set uploaded (by BBlack):
cache_misc: raise default_ttl to 1h
BBlack
mentioned this in
rOPUPf8d67164cc16: cache_upload: 1d FE TTL cap
Jul 14 2016, 2:33 PM
2016-07-14 14:33:35 (UTC+0)
BBlack
mentioned this in
rOPUPe1e727d50a2f: cache_misc: raise default_ttl to 1h
gerritbot
added a comment.
Jul 14 2016, 2:36 PM
2016-07-14 14:36:18 (UTC+0)
Change 298968 merged by BBlack:
cache_upload: 1d FE TTL cap
gerritbot
added a comment.
Jul 14 2016, 2:36 PM
2016-07-14 14:36:44 (UTC+0)
Change 298970 merged by BBlack:
cache_misc: raise default_ttl to 1h
gerritbot
added a comment.
Jul 15 2016, 2:35 PM
2016-07-15 14:35:35 (UTC+0)
Change 299153 had a related patch set uploaded (by Krinkle):
Lower default $wgSquidMaxage from 31 days to 14 days
gerritbot
added a comment.
Jul 15 2016, 6:26 PM
2016-07-15 18:26:47 (UTC+0)
Change 299153 merged by jenkins-bot:
Lower default $wgSquidMaxage from 31 days to 14 days
Krinkle
removed a project:
Patch-For-Review
Jul 16 2016, 1:30 AM
2016-07-16 01:30:12 (UTC+0)
Krinkle
mentioned this in
T140921: Reduce static asset time on disk from five trains' worth to two
Jul 20 2016, 5:44 PM
2016-07-20 17:44:58 (UTC+0)
BBlack
mentioned this in
T142848: Stop using persistent storage in our backend varnish layers.
Aug 22 2016, 4:35 PM
2016-08-22 16:35:25 (UTC+0)
ema
moved this task from
Backlog
to
Caching
on the
Traffic
board.
Sep 30 2016, 2:33 PM
2016-09-30 14:33:11 (UTC+0)
gerritbot
added a comment.
Mar 21 2017, 10:39 AM
2017-03-21 10:39:46 (UTC+0)
Change 343845 had a related patch set uploaded (by Ema):
[operations/puppet] varnish: swap around backend ttl cap and keep values [2/2]
gerritbot
added a project:
Patch-For-Review
Mar 21 2017, 10:39 AM
2017-03-21 10:39:47 (UTC+0)
Change 343844 had a related patch set uploaded (by Ema):
[operations/puppet] varnish: swap around backend ttl cap and keep values [1/2]
phuedx
subscribed.
Apr 2 2017, 4:30 PM
2017-04-02 16:30:00 (UTC+0)
gerritbot
added a comment.
May 4 2017, 4:48 PM
2017-05-04 16:48:42 (UTC+0)
Change 343845 merged by Ema:
[operations/puppet@production] varnish: swap around backend ttl cap and keep values [2/2]
BBlack
mentioned this in
T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan
May 30 2017, 11:08 PM
2017-05-30 23:08:05 (UTC+0)
BBlack
added a comment.
Jul 10 2017, 3:42 PM
2017-07-10 15:42:37 (UTC+0)
Recap of recent progress: where we're at now is a hard cap of 1 day TTL within each cache layer, regardless of any longer max-age sent by the application layer. Depending on the user's geographic location, there can be anywhere from 2 to 4 cache layers involved in their request. In edge cases with hot objects, the per-layer TTL cap has a natural race condition that can cause the total TTL of the caching stack to be multiplied by the number of layers, resulting in 2-4 days of total TTL before an object is fully expired for all users.
We don't believe it should be possible at this time for an object to exist in the caching layers for more than 4 days, assuming there are no application-layer HTTP bugs in play (e.g. the application incorrectly giving a
304 Not Modified
response to a conditional request from the cache, for content which has in fact been modified).
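As a back-of-the-envelope illustration of the layer-multiplication bound described above: a frontend can fetch an object from the layer behind it moments before that layer's copy expires, restarting its own clock, so in the worst case the independent per-layer caps add up.

```python
def worst_case_total_ttl(per_layer_cap_days, layers):
    """Worst-case total staleness when each cache layer applies its own
    TTL cap independently: the caps add up across layers rather than
    being shared across the stack."""
    return per_layer_cap_days * layers

# With a 1-day cap per layer and 2 to 4 layers in the request path:
assert worst_case_total_ttl(1, 2) == 2  # shortest path
assert worst_case_total_ttl(1, 4) == 4  # longest path
```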
Our next step here is to begin using
Surrogate-Control
headers for inter-cache communication of capped TTLs, which will remove the layer-multiplication issues and give us a hard limit of 1 full day for the total cache stack. There are some interactions between that work and related grace/keep issues (calculating cache-local ttl and grace values as percentages of the total TTL, etc), so they should probably be tackled in tandem.
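A minimal sketch of that idea, assuming the inner cache layer advertises only its remaining (not full) TTL to the layer in front of it (`surrogate_ttl` is an illustrative name, not actual VCL or a real header parser):

```python
def surrogate_ttl(cap_seconds, object_age_seconds):
    """Hypothetical inter-cache TTL signal: the inner layer advertises
    only the TTL it has left, rather than restarting the full cap, so
    the stack-wide total can never exceed the single cap."""
    return max(cap_seconds - object_age_seconds, 0)

# An object already 20h old in the backend layer gets only 4h in the frontend:
assert surrogate_ttl(86400, 72000) == 14400
```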
Krinkle
added a comment.
Jul 11 2017, 4:09 AM
2017-07-11 04:09:29 (UTC+0)
In
T124954#3421257
@BBlack
wrote:
[..] We don't believe it should be possible at this time for an object to exist in the caching layers for more than 4 days, assuming there are no application-layer HTTP bugs in play (e.g. the application incorrectly giving a
304 Not Modified
response to a conditional request from the cache, for content which has in fact been modified).
Is there an upper limit to how long or how often the same cache object can be "304-whitewashed"? (E.g. as long as it keeps being requested in the grace period between ttl expiring and object actually being removed from storage).
I assume that it does allow infinite white-washing, and that that is by design. As I understand it:
ttl
is how long the object is considered fresh.
grace
is how long to keep it around so that it may be served stale to the user while the object is being renewed (by a 304 Not Modified response) or replaced (by a 200 OK response).
I've recently seen new Varnish configuration for a property called obj.keep / beresp.keep
. It's unclear to me how
keep
fits in with this. If an object is beyond
ttl+grace
, what purpose will the object serve? I suppose the only remaining use is, if a request is made after
ttl+grace
but within
keep
it can be used to renew the object if the next user request yields a 304 response.
MediaWiki quite often responds with a 304 Not Modified when the response is in fact different, because we only track the internal wiki page content as the means for validating If-Modified-Since. Changes to MediaWiki core output format, WMF configuration changes, and changes to the Skin are not tracked in a way that the application is aware of. And besides, we wouldn't want to invalidate the entire global cache every time a minor change or a configuration change happens. For the most part, the architecture design for large-scale MediaWiki deployments is that all state outside the actual revision history of wiki pages is treated as static. And we rely on cache expiry as the basis for compatibility decisions, such as:
How long to keep CSS or JS code around for HTML compatibility? (E.g. when changing something in the HTML output that is styled by CSS or enhanced by JS, we keep both the old and new CSS/JS around until we believe any previously generated HTML has dropped out of the CDN caches.)
How long before we remove a file from
/static
after updating MediaWiki configuration to output references to a different file.
This kind of decision happens almost every week. And for that, we need a high-confidence threshold for how long cache is supposed to take to fully turn over. In extreme cases we'll get real data (e.g. tail varnishlog, query
wmf.webrequest
in Hive, ad-hoc use of statsv or EventLogging), but doing that every time doesn't scale. (And shouldn't be needed.)
Historically, the upper limit was a month ("31 days"). Last year this was lowered to 14 days. Over the last few months, some people assumed it to be 7 days, 5 days, or 4 days, but I'm holding on to "14 days" until I hear otherwise.
Assuming infinite white-washing, the upper limit is effectively decided by
wgSquidMaxage
. This is currently 14 days. MediaWiki will always generate a fresh response when the previously stored object is older than this. Precisely to ensure "static" changes (e.g. Skin layout, config changes, core features etc.) will propagate eventually.
Should we lower
$wgSquidMaxage
to, say, 5 days? That would give it a day of breathing room from the Varnish perspective (4 days), while still staying confidently under the deployment frequency (7 days) – which would allow us to reduce HTML-compat to 1 week instead of 3 weeks (rounding up).
Krinkle
removed a project:
Patch-For-Review
Jul 11 2017, 4:09 AM
2017-07-11 04:09:40 (UTC+0)
BBlack
added a comment.
Jul 11 2017, 3:06 PM
2017-07-11 15:06:21 (UTC+0)
In
T124954#3423643
@Krinkle
wrote:
In
T124954#3421257
@BBlack
wrote:
[..] We don't believe it should be possible at this time for an object to exist in the caching layers for more than 4 days, assuming there are no application-layer HTTP bugs in play (e.g. the application incorrectly giving a
304 Not Modified
response to a conditional request from the cache, for content which has in fact been modified).
Is there an upper limit to how long or how often the same cache object can be "304-whitewashed"?
As far as I know, there's no upper limit and Varnish will infinitely whitewash via 304 so long as an object is within its total keep time each time it needs to refresh. The infinite cycle would stop if the object ever went un-accessed long enough (e.g. over a week). To recap varnish behavior, the 3 values in play are
ttl
grace
, and
keep
, and they add up serially (the timers do not run concurrently). TTL is the basic lifetime of the object. After the TTL has expired, if the object is still within the grace period it can be served stale to a user while the content is refreshed in the background (possibly via a conditional request, if applicable). Once the grace period has expired, the object can remain valid in storage for the duration of the keep timer, during which the contents can only be used as the source of a conditional, synchronous verification to the applayer looking for a 304 to refresh the life of the contents (saving transfer bandwidth and storage churn vs a 200). Our current settings are to cap the application-provided TTL at 1 day, use a fixed grace period of 5 minutes, and cap the keep value at an additional 7 days (if the app-provided TTL is <7d, the keep value currently gets lowered to the app-provided TTL, to help minimize bad-304 fallout with shorter-lived objects).
So, in the standard MediaWiki case of a fresh page object with a 14d app-specified TTL, the backendmost cache will end up with ttl=1d + grace=5m + keep=7d, for a total of 8d5m during which the content is considered valid for a conditional refresh via 304, and the 304 cycle can repeat indefinitely AFAIK, keeping stale content alive forever if the application layer always claims it's still unmodified.
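The serial addition of the three timers can be sketched as follows (illustrative Python, not VCL):

```python
from datetime import timedelta

def refresh_window(ttl, grace, keep):
    """The three Varnish timers run serially, not concurrently: an object
    is fresh for `ttl`, servable-stale-while-revalidating for `grace`,
    then eligible only for 304-based refresh for `keep`."""
    return ttl + grace + keep

# Current settings per the comment above: 1d TTL cap, 5m grace, 7d keep.
window = refresh_window(timedelta(days=1), timedelta(minutes=5), timedelta(days=7))
assert window == timedelta(days=8, minutes=5)  # the "8d5m" figure
```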
MediaWiki quite often responds with a 304 Not Modified when the response is in fact different
We'll obviously have to work with what we have today, but for the record this is
Not Ok
, and should probably be addressed in the future. It probably will continue to be a pain point with various future cache and/or proxy technologies. It's a real problem with HTTP semantics, and it's hard to ever hack around it in an appropriate way that doesn't introduce other subtle issues. Referencing the justification below (I'm leaving this argument aside for the remainder): you wouldn't have to invalidate the whole global Varnish cache every time a minor skin change happens in order for the 304 mechanism to work correctly. Skin updates that affect the main page output could correctly change the conditional responses of MediaWiki without sending an explicit purge to Varnish. The existing objects would still get their normal cache lifetimes and refresh correctly to the new Skin as they expire from their normal TTLs.
The other side of the issue is erring on the safe side of the equation as we do today (effectively invalidating for conditional refresh all objects older than
$wgSquidMaxage
). While it's not a semantic problem any more than simply never issuing 304s would be, it's also potentially an unnecessary cause of performance meltdown. There could be cases where we'd hope to rely on 304s to avoid transfer bursts to the caches, but we're getting a full 200 on content that didn't happen to change across that artificial barrier in time. The ideal we'd hope for is that conditional-request semantics apply exactly correctly.
because we only track the internal wiki page content as the means for validating If-Modified-Since. Changes to MediaWiki core output format, WMF configuration changes, and changes to the Skin are not tracked in a way that the application is aware of. And besides, we wouldn't want to invalidate the entire global cache every time a minor change or a configuration change happens. For the most part, the architecture design for large-scale MediaWiki deployments is that all state outside the actual revision history of wiki pages is treated as static. And we rely on cache expiry as the basis for compatibility decisions, such as:
How long to keep CSS or JS code around for HTML compatibility? (E.g. when changing something in the HTML output that is styled by CSS or enhanced by JS, we keep both the old and new CSS/JS around until we believe any previously generated HTML has dropped out of the CDN caches.)
How long before we remove a file from
/static
after updating MediaWiki configuration to output references to a different file.
This kind of decision happens almost every week. And for that, we need a high-confidence threshold for how long cache is supposed to take to fully turn over. In extreme cases we'll get real data (e.g. tail varnishlog, query
wmf.webrequest
in Hive, ad-hoc use of statsv or EventLogging), but doing that every time doesn't scale. (And shouldn't be needed.)
Right, because we're versioning these files, and therefore the core page output changes every time they change, to update the versioning hash in the link reference?
Historically, the upper limit was a month ("31 days"). Last year this was lowered to 14 days. Over the last few months, some people assumed it to be 7 days, 5 days, or 4 days, but I'm holding on to "14 days" until I hear otherwise.
Assuming infinite white-washing, the upper limit is effectively decided by
wgSquidMaxage
. This is currently 14 days. MediaWiki will always generate a fresh response when the previously stored object is older than this. Precisely to ensure "static" changes (e.g. Skin layout, config changes, core features etc.) will propagate eventually.
Should we lower
$wgSquidMaxage
to, say, 5 days? That would give it a day of breathing room from the Varnish perspective (4 days), while still staying confidently under the deployment frequency (7 days) – which would allow us to reduce HTML-compat to 1 week instead of 3 weeks (rounding up).
It's complicated because
$wgSquidMaxage
is actually controlling a few different things: the max age sent to Varnish as a TTL signal, the maximum age for which MW will continue conditionally-verifying content that may have changed due to meta-level changes (Skin, etc), and thus also the artificial barrier after which MW will no longer conditionally-verify content that hasn't changed. It's also indirectly controlling our
keep
-reducing hack, which isn't great since we're hoping the
keep
values save us from cache meltdown when we have our now-short-TTL caches offline for 1-7d periods (by reducing burst transfer on repool).
I'd propose for now to:
Change
$wgSquidMaxage
to 7 days. If you were to go any lower, it would again cause us burst-transfer problems with our 1-week timeline, because MW is going to consider everything older than this value 304-invalid even if it hadn't changed.
We're still going to aim for 1d TTLs in our Varnishes in general, but given the 304 issues and our need for it to work correctly to appropriately handle maintenance and outages, MW's $wgSquidMaxage really shouldn't go under 7d at this time, and it is also the only TTL you can rely on for things like removing old versioned static files.
Separately, I'd like to eliminate (or at least slightly fix) our "cap the keep value to the TTL" hack, since it doesn't work right on a number of levels.
Since MediaWiki is the only complicated case we care a lot about (the reason we went with the paranoid keep-reduction on short TTLs), if we could verify that there aren't other 304 misbehaviors from MW that matter for other short-lived objects (e.g. RL? cacheable short-TTL MW API output cases?), I'd propose we move forward with just using a fixed 7-day keep value as the simplest answer. Alternatively, if there are other shorter-TTL objects that do have 304 misbehavior, we could consider trying to use the actual CC:s-maxage value as a cap on the keep value, rather than the current TTL. But this wouldn't work either for the outputs I'm observing today, because of another oddity: when serving "old" objects, MW seems to count down the TTL in the CC:s-maxage field, when the more-correct behavior would be to keep the CC:s-maxage field constant at the
$wgSquidMaxage
value and count up an
Age:
output header.
In example terms, what we expect is:
[fresh object just parsed for the first time]
GET /wiki/Foo HTTP/1.1
....
Cache-Control: s-maxage=1209600
Age: 0

[next request for same object, 60s later]
GET /wiki/Foo HTTP/1.1
....
Cache-Control: s-maxage=1209600
Age: 60
What we seem to get from MW is:
[fresh object just parsed for the first time]
GET /wiki/Foo HTTP/1.1
....
Cache-Control: s-maxage=1209600
[no Age header]

[next request for same object, 60s later]
GET /wiki/Foo HTTP/1.1
....
Cache-Control: s-maxage=1209540
[no Age header]
Since
Age
is implicitly zero, the calculated TTL of the object (
CC:s-maxage - Age
) is the same, but this denies us the ability to see an object's policy-based max-age, which is useful information when we're trying to do something intelligent with grace and keep behaviors, as there's a big difference between a 2-week-age type of object that has 10 seconds of life left and a freshly generated object that only ever gets to live for 10 seconds.
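The equivalence (and the information loss) can be sketched as follows (`remaining_ttl` is an illustrative name; a real cache follows RFC 7234's current_age calculation, which also accounts for response delay):

```python
def remaining_ttl(s_maxage, age=0):
    """Per RFC 7234, a cache computes remaining freshness as
    freshness_lifetime minus current age. Both response styles yield
    the same remaining TTL, but only the Age-based style preserves the
    object's policy maximum (s-maxage stays constant)."""
    return s_maxage - age

# Expected MW behavior: constant s-maxage, Age counting up.
assert remaining_ttl(1209600, 60) == 1209540
# Observed MW behavior: s-maxage counting down, no Age header (implicitly 0).
assert remaining_ttl(1209540) == 1209540
```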
Tbayer
subscribed.
Jul 15 2017, 1:03 AM
2017-07-15 01:03:23 (UTC+0)
Jhernandez
subscribed.
Jul 31 2017, 4:48 PM
2017-07-31 16:48:52 (UTC+0)
Keegan
subscribed.
Aug 9 2017, 8:55 PM
2017-08-09 20:55:02 (UTC+0)
BBlack
closed this task as
Resolved
Oct 23 2017, 3:11 PM
2017-10-23 15:11:11 (UTC+0)
BBlack
claimed this task.
Closing this ticket as it's getting rather long in the tooth. We did reduce our TTL caps down to 1d across the board at all layers, with up to ~7d keep times, and that accomplished a lot of what was desired here. Further work on rationalizing MediaWiki's output behaviors is complicated to even comprehend fully and not directly related; perhaps new tickets should be filed about that.
Jdlrobson
mentioned this in
T205355: A/B config flag should be subject to ResourceLoader caching rules not HTML caching rules
Sep 24 2018, 10:49 PM
2018-09-24 22:49:48 (UTC+0)
Krinkle
added a project:
Performance-Team (Radar)
Oct 13 2018, 1:21 AM
2018-10-13 01:21:00 (UTC+0)
BBlack
mentioned this in
T119366: Disable caching on the main page for anonymous users
Nov 17 2018, 12:17 AM
2018-11-17 00:17:10 (UTC+0)
Jdlrobson
mentioned this in
T254227: Switch test wikis to new version of vector by default
Jun 3 2020, 4:56 PM
2020-06-03 16:56:23 (UTC+0)
Jdlrobson
mentioned this in
T265543: UI Regression: Personal tools menu is appearing unstyled for anonymous users on cached HTML
Oct 14 2020, 9:40 PM
2020-10-14 21:40:36 (UTC+0)
Krinkle
mentioned this in
T286835: Port RelatedArticles to Codex
Nov 23 2021, 4:52 AM
2021-11-23 04:52:41 (UTC+0)
Jdlrobson
mentioned this in
T270796: Message boxes classes should carry `mw-`
Apr 13 2022, 8:27 PM
2022-04-13 20:27:25 (UTC+0)
Jdlrobson
mentioned this in
T341041: Vector 2022 is broken on wmf.16
Jul 5 2023, 3:08 PM
2023-07-05 15:08:31 (UTC+0)
Jdlrobson
mentioned this in
T340952: Edge caching issues on Vector 2022 in wmf.16
Jul 5 2023, 3:15 PM
2023-07-05 15:15:05 (UTC+0)
Jdrewniak
mentioned this in
T373495: Investigate ways to reduce cache retention timespans
Aug 27 2024, 11:49 PM
2024-08-27 23:49:53 (UTC+0)
Jdrewniak
mentioned this in
T422985: WP25EasterEggs disabled but "Birthday mode (Baby Globe) settings" link still present
Tue, Apr 14, 1:42 AM
2026-04-14 01:42:28 (UTC+0)