I think 5% is huge! As part of T233886 and T189966, I took many work-days to achieve similar gains, and even that is becoming harder and harder without it turning into weeks of multi-person/cross-team dependencies. These kinds of gains will determine how much work it takes to achieve consistent latencies on the new REST API, for example, and they also make latencies generally more consistent.
Let me note that the difference is well below 2% in most cases, and that it's much smaller than the variations in backend response times we see daily due to a variety of other effects, which can be in the 5-10% range or higher.
Moreover, a single badly optimized database query can easily cost us 20% in backend response times for hours, and that happens without us radically changing our production environment.
I will run more extensive tests so I have more precise results, but in terms of performance evaluation, this gain is barely noticeable over a full day, and well below what we would consider significant.
It also makes the memory usage of the whole process about 3x what it is in normal operating conditions, and raises the CPU usage.
From the point of view of overall backend performance (which is what I'm talking about), this is a third-order optimization, notwithstanding whatever your perception of it is. Also, those figures above are quite unscientific; I thought the result was lackluster enough not to justify further analysis. I'll post more precise numbers, testing over a full day on two servers restarted in the same second.
But this is not even my main reason for worry (more on that below).
The l10n store win isn't a one-time cost difference; it scales with call frequency (T99740#5929577). This is among the reasons why cold-cache performance varies so wildly: we have many layers of caching on top of the l10n store. LocalisationCache is global per language, then MessageCache is per-wiki per-language (based on hooks and on-wiki overrides), and then there is MessageBlobStore for ResourceLoader on a per-module basis. Until recently MessageBlobStore had a dedicated DB table and a nightly clean-up cron. This might not have been needed as much if base l10n performed better. Yet even with MessageBlobStore, the cache-miss experience is still pretty bad. The first cache miss for any module, in any language, on any wiki (1001 * 419 * 944 combinations), can currently take upwards of 5 seconds to compute on-demand. And that's on an unstressed server. For a single JS resource. Not to mention end-user latency, HTML/wikitext processing, CSS, and so on: page load performance is lost before it even begins.
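To illustrate the layering (a purely hypothetical sketch, not the real MediaWiki call graph; the helper names are made up):

```php
<?php
// Hypothetical sketch of the caching layers, not the real MediaWiki call graph.
// Each helper stands in for one layer; returning null means a cache miss.

/** Per-module, per-language message blob for ResourceLoader (MessageBlobStore). */
function getFromMessageBlobStore( $module, $lang ) { return null; }

/** Per-wiki, per-language cache, including hooks and on-wiki overrides (MessageCache). */
function getFromMessageCache( $wiki, $lang, $key ) { return null; }

/** Global per-language store (LocalisationCache: CDB files, PHP arrays, or DB). */
function getFromLocalisationCache( $lang, $key ) { return 'value rebuilt from base l10n data'; }

// On a fully cold cache every layer misses, so the request falls all the way
// through to the base l10n store, which is why cold-cache timings vary so much.
$value = getFromMessageBlobStore( 'mediawiki.util', 'de' )
    ?? getFromMessageCache( 'dewiki', 'de', 'some-message-key' )
    ?? getFromLocalisationCache( 'de', 'some-message-key' );
```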
We can't completely revisit how we deploy software for such a tail gain. I find it hard to believe that those 5 seconds are completely due to CDB files. Is that the case?
If so, I'm sure there are ways to keep that time down without polluting a single php-fpm cache with 2.5 GB of additional php data.
Aside from performance, this change also has benefits for the deployment process:
[CUT]
I would expect deployments to be faster, with simpler tools, and future container images to therefore be smaller and take less complexity/time to build.
I don't think the two statements above are correct. And these are my main worries.
For scap deploys, we'll need to perform a full rolling restart of all appservers for every non-sync-file change, as a single train deploy can easily fill up the opcache even on a server using 3 GB of opcache.
When I proposed doing a full rolling restart at every release, it was deemed impractical and dangerous, and was basically refused by the Release Engineering and Performance teams. Let me note that this might allow us to go back to not validating opcache, which would make deployments much more atomic :)
A single deploy (for the train, but probably for most SWATs too) will require a restart, making it significantly slower than it is now. A full, safe rolling restart of our application servers can easily take 5 minutes or more.
As for container images:
- Why should they be smaller using PHP arrays instead of CDB files? I would expect the opposite to be true (we're talking about compressed file sizes).
- Having a 3 GB overhead of RAM usage would kill our ability to run much smaller installations of php-fpm in parallel, and force us to run "fat pods", which is decidedly suboptimal: Kubernetes doesn't like having to allocate very large chunks of one server's memory. Also, we'd get fewer available workers per server, because of the 2.5 GB memory overhead.
Basically, for every pod we'd need 3 GB of opcache space + 3 GB of APCu space *even before* we try to allocate workers. That's a 50% increase (from 4 GB to 6 GB) in baseline occupied memory.
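To make that concrete, here's a rough back-of-the-envelope sketch; the pod memory limit and per-worker figure are illustrative assumptions, not measurements:

```php
<?php
// Back-of-the-envelope pod sizing with the figures discussed above.
// Pod limit and per-worker memory are illustrative assumptions, not measurements.
$opcacheGb   = 3;    // opcache arena per php-fpm instance
$apcuGb      = 3;    // APCu arena per php-fpm instance
$baselineGb  = $opcacheGb + $apcuGb;   // 6 GB before a single worker is started
$podLimitGb  = 16;   // hypothetical memory limit for one pod
$perWorkerGb = 0.5;  // assumed steady-state memory per worker

$workers = (int)floor( ( $podLimitGb - $baselineGb ) / $perWorkerGb );
echo "Baseline: {$baselineGb} GB, workers that still fit in the pod: {$workers}\n";
```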
More generally, php-fpm performs better (by much, much more than 1%) when you can keep its concurrency low; so much so that we've discussed running multiple php-fpm instances with smaller footprints on a physical appserver even before we move to Kubernetes. So increasing the memory footprint of a single daemon seems dangerous.
- Overall memory usage is higher with LCStoreArray, but not significantly enough to be a worry in our current setup. Every PHP worker uses ~1 GB of memory at startup vs ~500k in the normal setup.
We might be able to bring this down a bit. The opcache config I staged was optimised for benchmarking latency, not memory. I rounded the numbers up significantly to make sure it would definitely use opcache and not fall back to re-parsing files from disk. But I don't know if all of that allocated space is actually needed. See https://gerrit.wikimedia.org/r/587299 and T99740#5977799.
Also don't forget the 3 GB of opcache memory usage. I'll post more precise numbers in a followup.
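One way to get those numbers from a running worker (assuming the Zend OPcache extension is enabled for that SAPI) is to compare the configured arena against what opcache_get_status() reports as actually in use:

```php
<?php
// Report how much of the configured opcache arena a worker is actually using.
// Requires the Zend OPcache extension to be enabled for this SAPI.
$allowedMb = (int)ini_get( 'opcache.memory_consumption' ); // configured size, in MB
$status    = opcache_get_status( false );                  // false = skip per-script details

printf(
    "opcache: %d MB allowed, %.0f MB used, %.0f MB free, %.1f%% wasted\n",
    $allowedMb,
    $status['memory_usage']['used_memory'] / 1024 ** 2,
    $status['memory_usage']['free_memory'] / 1024 ** 2,
    $status['memory_usage']['current_wasted_percentage']
);
```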
If our problem is having a local, highly available cache of this data, we can explore other avenues, like storing it in a local memcached on all servers, which we're already thinking of installing for other reasons. On one hand, that would possibly make the cache slower, but on the other it would allow the cache to be shared between php-fpm instances.
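For example, a purely hypothetical sketch (it assumes MediaWiki's LCStore interface and the php-memcached extension; the class name, key scheme and config are made up):

```php
<?php
// Purely hypothetical sketch: an LCStore implementation backed by a host-local
// memcached, so l10n data could be shared between php-fpm instances on the same
// machine. Class name, key scheme and config are made up; no error handling.
class LCStoreLocalMemcached implements LCStore {
    /** @var Memcached */
    private $client;
    /** @var string|null Language code of the batch currently being written */
    private $writeCode = null;

    public function __construct( $host = '127.0.0.1', $port = 11211 ) {
        $this->client = new Memcached();
        $this->client->addServer( $host, $port );
    }

    public function get( $code, $key ) {
        $value = $this->client->get( "l10n:$code:$key" );
        return $value === false ? null : $value;
    }

    public function startWrite( $code ) {
        $this->writeCode = $code;
    }

    public function set( $key, $value ) {
        $this->client->set( "l10n:{$this->writeCode}:$key", $value );
    }

    public function finishWrite() {
        $this->writeCode = null;
    }
}
```

That would keep the data out of each php-fpm instance's own memory, at the cost of a local round-trip per fetch.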
Anyways, given the gain seems significant, I'll run more precise tests today.