⚓ T328872 Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw
Open, Needs Triage
Public
PRODUCTION ERROR
Assigned To
None
Authored By
Yann
Feb 5 2023, 5:34 PM
2023-02-05 17:34:37 (UTC+0)
Tags
Commons
(Incoming)
SRE-swift-storage
(Inbox)
Wikimedia-production-error
(Feb 2023)
MediaWiki-Uploading
Unstewarded-production-error
MW-1.41-notes (1.41.0-wmf.25; 2023-09-05)
MediaWiki-File-management
(Backlog)
MW-1.45-notes (1.45.0-wmf.22; 2025-10-07)
MW-1.46-notes (1.46.0-wmf.14; 2026-02-03)
Referenced Files
F67759759: Screenshot_20251027_130809.png
Oct 27 2025, 8:31 PM
2025-10-27 20:31:42 (UTC+0)
F61153307: grafik.png
Jun 1 2025, 6:55 PM
2025-06-01 18:55:00 (UTC+0)
F36913702: proxy-server errors.png
Mar 15 2023, 11:10 PM
2023-03-15 23:10:13 (UTC+0)
F36912027: UploadChunkFileException frequency.png
Mar 15 2023, 5:23 AM
2023-03-15 05:23:25 (UTC+0)
F36909525: image.png
Mar 13 2023, 4:56 PM
2023-03-13 16:56:52 (UTC+0)
Subscribers
aaron
Aklapper
akosiaris
BeckenhamBear
CDanis
Don-vip
Dragoniez
View All 22 Subscribers
Description
I get
00024: FAILED: internal_api_error_UploadChunkFileException: [9e95f0c8-9cd8-4daf-93bf-996b77705f13] Caught exception of type UploadChunkFileException
while uploading a new version of
using
from /srv/mediawiki/php-1.40.0-wmf.21/includes/upload/UploadFromChunks.php(359)
#0 /srv/mediawiki/php-1.40.0-wmf.21/includes/upload/UploadFromChunks.php(248): UploadFromChunks->outputChunk(string)
#1 /srv/mediawiki/php-1.40.0-wmf.21/includes/api/ApiUpload.php(278): UploadFromChunks->addChunk(string, integer, integer)
#2 /srv/mediawiki/php-1.40.0-wmf.21/includes/api/ApiUpload.php(157): ApiUpload->getChunkResult(array)
#3 /srv/mediawiki/php-1.40.0-wmf.21/includes/api/ApiUpload.php(128): ApiUpload->getContextResult()
#4 /srv/mediawiki/php-1.40.0-wmf.21/includes/api/ApiMain.php(1901): ApiUpload->execute()
#5 /srv/mediawiki/php-1.40.0-wmf.21/includes/api/ApiMain.php(878): ApiMain->executeAction()
#6 /srv/mediawiki/php-1.40.0-wmf.21/includes/api/ApiMain.php(849): ApiMain->executeActionWithErrorHandling()
#7 /srv/mediawiki/php-1.40.0-wmf.21/api.php(90): ApiMain->execute()
#8 /srv/mediawiki/php-1.40.0-wmf.21/api.php(45): wfApiMain()
#9 /srv/mediawiki/w/api.php(3): require(string)
#10 {main}
Details
MediaWiki Version
1.40.0-wmf.21
Request URL
Related Changes in Gerrit:
Subject | Repo | Branch | Lines +/-
swift::proxy: re-try some tracing context propagation | operations/puppet | production | +2 / -0
Revert "swift::proxy: attempt some tracing context propagation" | operations/puppet | production | +0 / -2
swift::proxy: attempt some tracing context propagation | operations/puppet | production | +2 / -0
envoyproxy::tls_terminator: request header rewriting | operations/puppet | production | +46 / -0
envoy: Add 1 retry for swift services | operations/puppet | production | +6 / -0
envoy: Close connections to swift after 10s of inactivity | operations/puppet | production | +2 / -0
shellbox-video: Add swift envoy listeners | operations/deployment-charts | master | +5 / -0
Revert^2 "Use envoy for swift inside mediawiki" | operations/mediawiki-config | master | +4 / -4
services_proxy: Bump swift timeout | operations/puppet | production | +7 / -2
Use envoy for swift inside mediawiki | operations/mediawiki-config | master | +4 / -4
filebackend: Clean up removed config params for multi-write backends | operations/mediawiki-config | master | +4 / -13
FileBackend: Clean up unused private constants | mediawiki/core | master | +0 / -7
filebackend: Remove consistency check for multi-backend | mediawiki/core | wmf/1.45.0-wmf.22 | +2 / -156
filebackend: remove accessibility check from multi-backend | mediawiki/core | master | +37 / -69
filebackend: Remove consistency check for multi-backend | mediawiki/core | master | +2 / -156
filebackend: Include truncated http body for 502 on SwiftFileBackend | mediawiki/core | master | +35 / -31
Provision the revised Swift dashboard | operations/puppet | production | +2 K / -0
Unprovision the "swift" dashboard | operations/puppet | production | +0 / -1 K
Related Objects
Task Graph
Mentions
Duplicates
Status
Subtype
Assigned
Task
Open
PRODUCTION ERROR
None
T328872
Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw
Open
None
T360913
Swift proxy server misbehaviour (no longer calling `accept`?)
Restricted Task
Mentioned In
T423548: Page images disappearing on edit
T422868: Not able to upload files on Commons
T415504: EditCheck: Create beta feature preference
T397244: Private mitigation blocks registration from certain email domains but gives misleading error about rate limits
T411914: [Config] Deploy config change to STOP the Tone Check A/B experiment
T406812: Optimize FileBackend::preloadFileStat and fix "preserveCache" parameter
T406790: Remove fileExists() call from fileStoragePathsForOps() in FileBackendMultiWrite
T382705: High amount of 503/504 for swift uploads
T369388: Upload errors due to swift failures, 503s
T341007: An unknown error occurred in storage backend "local-swift-eqiad"
T348937: Some or all of the undeletion failed
T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad
T331773: Inconsistent time series data on mediawiki-errors Logstash dashboard
T328905: stashfailed: An unknown error occurred in storage backend "local-swift-codfw"
Mentioned Here
T397244: Private mitigation blocks registration from certain email domains but gives misleading error about rate limits
T411914: [Config] Deploy config change to STOP the Tone Check A/B experiment
T415504: EditCheck: Create beta feature preference
T382705: High amount of 503/504 for swift uploads
T369388: Upload errors due to swift failures, 503s
T341007: An unknown error occurred in storage backend "local-swift-eqiad"
T349127: Large number of fails
T206252: Spike of HTTP errors from SwiftFileBackend::doStoreInternal
T228292: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal
T348733: [MediaWiki:EnhancedStash.js gadget on Commons] TypeError while trying to publish files from Special:UploadStash
T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options
T326352: Q3:rack/setup/install ms-be207[0-3]
T328033: Pooling thumbor-k8s causes spikes in swift 500 errors
T331178: Bring ms-fe201[3-4] into service
Duplicates Merged Here
T350455: Unable to upload a specific file via UploadWizard
T349127: Large number of fails
T328905: stashfailed: An unknown error occurred in storage backend "local-swift-codfw"
Event Timeline
There are a very large number of changes, so older changes are hidden.
TheDJ
added a comment.
Dec 23 2024, 12:37 PM
2024-12-23 12:37:32 (UTC+0)
Thank you for reporting, @Yann. I created T382705 for this one.
mdaniels5757
subscribed.
Dec 23 2024, 3:06 PM
2024-12-23 15:06:03 (UTC+0)
GPSLeo
added a comment.
Jan 17 2025, 7:59 PM
2025-01-17 19:59:16 (UTC+0)
I am currently getting the following error for around half of all my upload attempts.
{'error': {'code': 'lockmanager-fail-conflict', 'info': 'Could not acquire lock. Somebody else is doing something to this file.', 'filekey': '1biglw5ddlcc.q7c9ff.6579311.jpg', 'sessionkey': '1biglw5ddlcc.q7c9ff.6579311.jpg', 'docref': 'See https://commons.wikimedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes.'}, 'servedby': 'mw-api-ext.codfw.main-74795b8fcc-r9ftw'}.
Mike_Peel
added a comment.
Mar 29 2025, 9:15 PM
2025-03-29 21:15:25 (UTC+0)
"The MediaWiki error backend-fail-internal occured: An unknown error occurred in storage backend "local-swift-eqiad"." ...
MatthewVernon
added a comment.
Mar 31 2025, 7:51 AM
2025-03-31 07:51:31 (UTC+0)
In T328872#10689851, @Mike_Peel wrote:
"The MediaWiki error backend-fail-internal occured: An unknown error occurred in storage backend "local-swift-eqiad"." ...
We had a couple of incidents with swift over the weekend (including about the time you posted this comment).
MatthewVernon
removed MatthewVernon as the assignee of this task.
May 22 2025, 7:19 AM
2025-05-22 07:19:54 (UTC+0)
Ladsgroup
subscribed.
Jun 1 2025, 6:55 PM
2025-06-01 18:55:00 (UTC+0)
I'm not sure where the best place to put this is, but I think I found something odd about how uploads work. I uploaded a very small SVG file on testwiki with Excimer profiling. 857 ms was spent in the Swift code, but very little of that went to the upload itself. Let me break it down:
First column: it first tries to get the path for the file, which triggers a fileExists() call, which in turn triggers getFileStat(). Since this is a multi-backend, I assume it is making a cross-DC connection, because it takes 94 ms just to respond.
Then it runs consistencyCheck() (why? The file doesn't even exist yet), which triggers a preloadFileStat() call. That obviously makes a new cross-DC connection and takes 410 ms(!) to load the file information (for something that hasn't even been uploaded yet). But at least, since it is preloading the information, it won't make the exact same call again. Right?
In the third column, where the file is actually stored, preloadFileStat() gets called twice. I think the first call is in the primary DC and the second in the secondary DC; each call is followed by actually storing the file. The upload calls take 160 ms in total, but fetching stats adds another 130 ms on top.
The last column is yet another call to fileExists(), which takes a further 40 ms. The reason for this expensive call is to determine whether the file is an "archived" file or a "new" file.
As a result, an upload operation that should take 160 ms gets stretched to 850 ms by eight extra Swift calls (four of which cross the United States and come back) on top of the two that store the file. We probably can't get rid of all of them, but this could be made more efficient.
I also built a flame graph of upload paths from the daily log of May 30. Here it is:
You can see the same issue in the flame graph as well.
I think removing some of these calls could reduce the load on the frontend proxies, definitely reduce the time to store a file, and reduce the chance of one of the calls failing and causing issues.
Apologies if I missed something obvious; I'm new to this codebase and it's not the easiest part of MediaWiki to understand.
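To make the breakdown above easier to check, here is a small Python tally of the timings quoted in the comment. The phase labels are my own shorthand, and the numbers are copied from the comment rather than re-measured.

```python
# Timings in milliseconds, as reported in the profile breakdown above.
phases_ms = {
    "fileExists() for path lookup": 94,
    "consistencyCheck() preload": 410,
    "storing the file in both DCs": 160,
    "preloadFileStat() before each store": 130,
    "final fileExists()": 40,
}

# The only phase that actually writes the file.
STORE_PHASE = "storing the file in both DCs"

total_ms = sum(phases_ms.values())
useful_ms = phases_ms[STORE_PHASE]
overhead_ms = total_ms - useful_ms
overhead_share = overhead_ms / total_ms

for phase, ms in phases_ms.items():
    print(f"{phase:40s} {ms:4d} ms")
print(f"total {total_ms} ms, overhead {overhead_ms} ms ({overhead_share:.0%})")
```

The itemized phases add up to 834 ms, close to the 857 ms reported for the whole Swift portion; roughly four fifths of that is stat and existence checking rather than storing the file.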
MatthewVernon
added a comment.
Jun 2 2025, 9:04 AM
2025-06-02 09:04:09 (UTC+0)
I think the "check the file is in a consistent (presumed-to-be-absent) state" operation is intentional, and probably replicated across other file changing call paths; not least because we get tickets sometimes when these checks fail...
I presume this is/was an intentional decision that if the two backends are in an inconsistent state (with each other, or with MW's idea of what that state should be) then we leave it for an operator to fix rather than just going ahead and replacing that inconsistent-state with whatever the user was trying to do at the time.
[I am even less familiar with the mediawiki codebase...]
Ladsgroup
added a comment.
Jun 5 2025, 1:13 AM
2025-06-05 01:13:58 (UTC+0)
Okay, I investigated further. I uploaded a random file in verbose mode, and here is the result:
It made more than 30 HTTP requests to Swift:
Of those, 14 went to the remote datacenter:
It is clearly making the same requests multiple times back to back. For example, look at the 14 calls to the remote datacenter: HEAD (which is called from ::getContainerStat() or ::doGetFileStatMulti()) is called on the same path back to back before the calls to PUT (and there are more after it).
e.g.
and
are duplicates of each other
And
and
are also duplicates of each other.
And this is just the remote DC; the local DC is the same story.
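One way to spot this kind of repetition in a verbose request log is to group identical consecutive requests. The Python sketch below assumes the log has already been parsed into (method, path) pairs; the sample entries are made up for illustration and are not taken from the actual upload log.

```python
from itertools import groupby


def back_to_back_duplicates(requests):
    """Return (request, run_length) for every run of identical consecutive requests."""
    runs = []
    for request, group in groupby(requests):
        run_length = len(list(group))
        if run_length > 1:
            runs.append((request, run_length))
    return runs


# Hypothetical parsed log entries, in the order they were sent.
sample_log = [
    ("HEAD", "/container/a.jpg"),
    ("HEAD", "/container/a.jpg"),
    ("PUT", "/container/a.jpg"),
    ("HEAD", "/container"),
    ("HEAD", "/container"),
    ("HEAD", "/container"),
]

for (method, path), count in back_to_back_duplicates(sample_log):
    print(f"{method} {path} repeated {count} times in a row")
```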
So there are two major issues:
1. FileBackendMultiWrite::getFileStat() checks for the read backend; with the readAffinity option set in production, it should only read from the local Swift. This is clearly not happening.
2. It is making the same expensive call again and again.
The reason for that seems to be calls to ::preloadFileStat(). For example here in ::consistencyCheck():
// Preload all of the stat info in as few round trips as possible
foreach ( $this->backends as $backend ) {
    $realPaths = $this->substPaths( $paths, $backend );
    $backend->preloadFileStat( [ 'srcs' => $realPaths, 'latest' => true ] );
}
Ironically, it explicitly does this to save round trips and reduce calls (see the comment).
Here is the problem: every time ::preloadFileStat() is called, the function doesn't check whether the information has already been preloaded; it re-preloads it again and again with no regard for the existing cache (see FileBackendStore::preloadFileStat()). Code comments recommend calling this function (e.g. the comment "Ideally, the file stat entry should already be preloaded via preloadFileStat()."), and many parts of the upload code path follow that guideline and call it. As you can see in the Excimer profile above, the same preload call (and then the call to Swift) shows up four times. Just fixing this could easily shave a decent chunk off the upload time of small files and reduce the cross-DC communication of the appservers. There are currently ~60 HEAD reqs/s to Swift; that can get much lower.
Of course, I will keep debugging and see what other improvements I can make to get the system a bit more stable.
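The effect described above can be sketched with a toy cache in Python. This is not MediaWiki's FileBackendStore; the StatCache class, its reuse_cached flag, and the fake HEAD function are hypothetical names used only to contrast "always re-preload" with "skip paths that are already cached".

```python
class StatCache:
    """Toy per-backend stat cache (hypothetical, not the MediaWiki class)."""

    def __init__(self, head):
        self._head = head      # callable doing the expensive HEAD request
        self._entries = {}     # path -> stat dict, or None if the file is absent
        self.head_calls = 0

    def preload(self, paths, reuse_cached=True):
        for path in paths:
            if reuse_cached and path in self._entries:
                continue       # already preloaded: skip the round trip
            self.head_calls += 1
            self._entries[path] = self._head(path)

    def get(self, path):
        if path not in self._entries:
            self.preload([path])
        return self._entries[path]


def fake_head(path):
    """Pretend remote lookup: the file has not been uploaded yet."""
    return None


always_reload = StatCache(fake_head)
respects_cache = StatCache(fake_head)
for _ in range(4):  # the profile showed the same preload four times
    always_reload.preload(["container/a.svg"], reuse_cached=False)
    respects_cache.preload(["container/a.svg"])

print(always_reload.head_calls, respects_cache.head_calls)
```

Note that the sketch also caches negative results (a missing file), which matters here because the upload path mostly stats files that do not exist yet. It ignores locking and the "latest" flag entirely.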
aaron
added a comment.
Edited
Jun 5 2025, 3:07 AM
2025-06-05 03:07:59 (UTC+0)
The idea of preloadFileStat() was to allow concurrent HEAD requests to a list of objects after any relevant locks were acquired. If no locks are acquired, and "latest" is not set, maybe reusing previously loaded state entries is OK. From the perspective of FileBackend, the assumption is mostly that you call doOperations() or doQuickOperations(), which is supposed to do one preload (within any locking) and be done. The FileBackendMultiWrite class (itself a hack due to not having a proper regional Swift cluster, with swift-repl only able to do periodic reconciliation) also has to write to the remote backend and has consistency checks turned on, so it preloads from both the local and remote backends. It also has to repeat the write operation on the remote backend, requiring another preload from the remote. Since FileBackendMultiWrite does its own locking, it seems like a lot of these 'stat' entries could be reused instead of reloaded.
Another matter is that FileRepo batches do a lot of getFileStat() checks before calling doOperations(), so that's more HEAD requests, since they happen before the other preloads. If the batch operation locks the paths, then the FileBackendMultiWrite/FileBackendStore preloads don't need to reload over existing entries.
Generally, if a 'stat' entry in the in-memory fileStatCache was loaded after the outermost lock on that path was acquired (since locks can nest), it should just be reused as sufficient for $latest (as done elsewhere). Of course, the FileBackendStore doesn't know about the FileBackendMultiWrite locks that FileRepo acquired... this would be easy to implement if not for FileBackendMultiWrite. Maybe a preloadStatCache() could take a $knownLockedPaths array, which FileBackendMultiWrite could use (with substPaths() of course). Kind of ugly though.
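The reuse rule in the last paragraph (an entry is good enough for "latest" only if it was loaded after the outermost lock on that path was taken) can be sketched in Python as follows. Everything here is hypothetical: the class, the logical clock, and the method names are illustrative and do not correspond to FileBackendStore internals.

```python
import itertools


class LockAwareStatCache:
    """Toy stat cache that reuses entries loaded under the current outermost lock."""

    def __init__(self, head):
        self._head = head
        self._clock = itertools.count(1)  # logical clock for ordering events
        self._lock_depth = {}             # path -> nesting depth
        self._locked_at = {}              # path -> tick of the outermost lock
        self._entries = {}                # path -> (tick loaded, stat)
        self.head_calls = 0

    def lock(self, path):
        depth = self._lock_depth.get(path, 0)
        if depth == 0:
            self._locked_at[path] = next(self._clock)
        self._lock_depth[path] = depth + 1

    def unlock(self, path):
        self._lock_depth[path] -= 1
        if self._lock_depth[path] == 0:
            del self._lock_depth[path]
            del self._locked_at[path]

    def stat(self, path, latest=False):
        entry = self._entries.get(path)
        if entry is not None:
            loaded_at, value = entry
            locked_at = self._locked_at.get(path)
            loaded_under_lock = locked_at is not None and loaded_at > locked_at
            if loaded_under_lock or not latest:
                return value              # safe to reuse, no HEAD needed
        self.head_calls += 1
        value = self._head(path)
        self._entries[path] = (next(self._clock), value)
        return value


cache = LockAwareStatCache(lambda path: {"size": 1})
cache.stat("a")               # unlocked read: one HEAD
cache.lock("a")
cache.stat("a", latest=True)  # entry predates the lock: HEAD again
cache.lock("a")               # nested lock does not reset the timestamp
cache.stat("a", latest=True)  # loaded after the outermost lock: reused
print(cache.head_calls)
```

The sketch keeps a single cache that knows about the locks, which is exactly the part the comment calls hard when FileBackendMultiWrite takes over the locking.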
Ladsgroup
added a comment.
Jun 5 2025, 11:31 PM
2025-06-05 23:31:12 (UTC+0)
I understand the need for multi-write backends and for doing all write operations in both DCs, but it is a massive leap from there to requiring that practically every operation be replicated in both DCs. For example, why do a consistency check in both DCs? Let's say something is corrupted between these two Swift instances: the chance of the Swift reconciliation script actually finding it or overwriting the secondary DC is much higher than the chance of someone happening to upload a new version and the upload breaking as a result. That is, I think MediaWiki should currently be responsible for double uploads (and other write operations), but it shouldn't try to do integrity checks across two Swift clusters (doubly so during upload). To me it's like MediaWiki checking the primary database and a replica for data integrity during page reads; worse than that, even, because it tries to do so while the replica is thousands of kilometers away. The file backend shouldn't do the work of the infrastructure at run time.
I don't understand the discussion around batching; most uploads don't batch. I'm not seeing any batching, or any performance gain from batching, in the graphs. Can you point me to some data showing that batching has any effect?
I'd rather just remove preload altogether. It demonstrably makes things faster and makes the logic much simpler.
gerritbot
added a comment.
Jul 20 2025, 12:22 PM
2025-07-20 12:22:04 (UTC+0)
Change #1170709 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[mediawiki/core@master] filebackend: Remove consistency check for multi-backend
gerritbot
added a project:
Patch-For-Review
Jul 20 2025, 12:22 PM
2025-07-20 12:22:05 (UTC+0)
aaron
added a comment.
Edited
Aug 1 2025, 1:34 AM
2025-08-01 01:34:40 (UTC+0)
In T328872#10889545, @Ladsgroup wrote:
I understand the need for multi-write backends and for doing all write operations in both DCs, but it is a massive leap from there to requiring that practically every operation be replicated in both DCs. For example, why do a consistency check in both DCs? Let's say something is corrupted between these two Swift instances: the chance of the Swift reconciliation script actually finding it or overwriting the secondary DC is much higher than the chance of someone happening to upload a new version and the upload breaking as a result. That is, I think MediaWiki should currently be responsible for double uploads (and other write operations), but it shouldn't try to do integrity checks across two Swift clusters (doubly so during upload). To me it's like MediaWiki checking the primary database and a replica for data integrity during page reads; worse than that, even, because it tries to do so while the replica is thousands of kilometers away. The file backend shouldn't do the work of the infrastructure at run time.
One reason for the consistency checks is to quarantine object paths that have uncertain values (inconsistent between the DC clusters). The effect of operations is different if the starting state of the files is already different. If we allow more operations, the uncertainty could spread to other object paths. Another reason was to trigger the autoResync logic to fix things and unblock the operation. This was initially before swift-repl existed. When swift-repl was new, autoResync was also useful for unblocking user operations if swift-repl hadn't gotten around to fixing something yet. This might involve users retrying things like re-upload/move/delete/restore *soon* after a failure. AFAIK, swift-repl used to be slow, and was sometimes configured to handle excess file deletion and other times not. There was once a time when 'autoResync' was just true (handling object deletions) and swift-repl deletions were not enabled.
Anyway, I see that "autoResync" is "conservative", so it's not able to sync a list of files in a way that swift-repl cannot. Indeed, most actual resyncing will come from swift-repl, since it doesn't need someone to happen to touch the files again. I think we can just lean into the assumption that whatever is in the primary is "canon" (FileBackendMultiWrite::doOperations() already does this with its Status result) and let swift-repl make the secondary cluster match up soon if something failed.
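The "primary is canon" idea can be illustrated with a short Python sketch. It is a simplification under my own assumptions, not how FileBackendMultiWrite is implemented: FakeBackend, multi_write, and repair_queue are hypothetical names, and the repair queue merely stands in for whatever out-of-band job (swift-repl or its successor) reconciles the clusters.

```python
class FakeBackend:
    """In-memory stand-in for one storage cluster (hypothetical)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.objects = {}

    def put(self, path, data):
        if not self.healthy:
            return False
        self.objects[path] = data
        return True


def multi_write(primary, secondaries, path, data, repair_queue):
    """Write to the primary first and treat its result as canonical."""
    if not primary.put(path, data):
        return False  # primary failed, so the whole operation fails
    for backend in secondaries:
        if not backend.put(path, data):
            # A secondary failure does not fail the operation; record it so
            # an out-of-band sync job can bring the secondary back in line.
            repair_queue.append((backend.name, path))
    return True


primary = FakeBackend("eqiad")
secondary = FakeBackend("codfw", healthy=False)
pending_repairs = []
ok = multi_write(primary, [secondary], "container/a.svg", b"<svg/>", pending_repairs)
print(ok, pending_repairs)
```

Note that no cross-cluster consistency check happens before the write; any divergence is left for the periodic sync to resolve in favour of the primary.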
I don't understand the discussion around batching; most uploads don't batch. I'm not seeing any batching, or any performance gain from batching, in the graphs. Can you point me to some data showing that batching has any effect?
By batching, I just mean anything that passes 2+ operations to doOperations(). File delete/restore can involve batches with operations proportionate to the number of file versions. Every path gets a HEAD operation to check the preconditions of the operations beforehand. The actual COPY/DELETE/POST/PUT operations use concurrency, and the use of preloadFileCache() in doOperations() does the same for the HEAD requests.
I'd rather just remove preload altogether. It demonstrably makes things faster and makes the logic much simpler.
Disabling consistencyCheck() would speed things up half the way. Setting "syncChecks" to 0 in config should produce immediate results.
External callers of preloadFileCache(), basically all but the one in doOperations(), should probably get "keep existing recent cache entries" behavior. Nevertheless, the cache entries would still not be reused within doOperations() without further changes. Doing that correctly would require that the callers of FileRepo::fileExistsBatch() lock the paths first and use "latest" (none of the LocalFile*Batch classes do either). In addition, FileBackend would have to reuse only cache entries on locked paths, tagged "latest", that were fetched after the lock was acquired. Getting FileBackendMultiWrite, which takes over the locking, to cooperate would involve some more pain.
I also don't understand why LocalFile*Batch even need to call removeNonexistentFiles() ->fileExistsBatch() -> preloadFileCache(). It seems like code could be simplified to use "ignoreMissingSource", which would also knock out the HEAD request spam.
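The last point can be illustrated by counting requests for two strategies: checking existence first and then operating only on files that exist, versus issuing the operation directly and tolerating a missing source. The Python sketch below uses a hypothetical in-memory store; only the idea of "ignoreMissingSource" is taken from the comment, and the function names are mine.

```python
class FakeStore:
    """In-memory object store that records every request it receives."""

    def __init__(self, objects):
        self.objects = set(objects)
        self.requests = []

    def head(self, path):
        self.requests.append(("HEAD", path))
        return path in self.objects

    def delete(self, path):
        self.requests.append(("DELETE", path))
        if path not in self.objects:
            return False  # missing source: report it, but do not raise
        self.objects.remove(path)
        return True


def delete_with_precheck(store, paths):
    """Filter out nonexistent files with HEAD requests, then delete the rest."""
    existing = [p for p in paths if store.head(p)]
    return [p for p in existing if store.delete(p)]


def delete_ignoring_missing(store, paths):
    """Issue the deletes directly and simply skip sources that are missing."""
    return [p for p in paths if store.delete(p)]


paths = ["a", "b", "c"]
checked = FakeStore({"a", "c"})
direct = FakeStore({"a", "c"})
deleted_checked = delete_with_precheck(checked, paths)
deleted_direct = delete_ignoring_missing(direct, paths)
print(len(checked.requests), len(direct.requests))
```

Both strategies delete the same files, but the pre-check version sends one extra HEAD per path.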
MatthewVernon
added a comment.
Aug 1 2025, 9:44 AM
2025-08-01 09:44:25 (UTC+0)
"swift-repl" (it's not actually that any more, but something based on rclone) runs only weekly (on Monday Europe-morning).
gerritbot
added a comment.
Oct 8 2025, 9:00 PM
2025-10-08 21:00:47 (UTC+0)
Change #1170709 merged by jenkins-bot:
[mediawiki/core@master] filebackend: Remove consistency check for multi-backend
aaron
mentioned this in
T406790: Remove fileExists() call from fileStoragePathsForOps() in FileBackendMultiWrite
Oct 8 2025, 9:35 PM
2025-10-08 21:35:39 (UTC+0)
ReleaseTaggerBot
added a project:
MW-1.45-notes (1.45.0-wmf.23; 2025-10-14)
Oct 8 2025, 10:00 PM
2025-10-08 22:00:15 (UTC+0)
gerritbot
added a comment.
Oct 8 2025, 11:32 PM
2025-10-08 23:32:26 (UTC+0)
Change #1194781 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):
[mediawiki/core@master] filebackend: remove accessibility check from multi-backend
aaron
mentioned this in
T406812: Optimize FileBackend::preloadFileStat and fix "preserveCache" parameter
Oct 9 2025, 6:16 AM
2025-10-09 06:16:02 (UTC+0)
gerritbot
added a comment.
Oct 9 2025, 11:24 AM
2025-10-09 11:24:42 (UTC+0)
Change #1194781 merged by jenkins-bot:
[mediawiki/core@master] filebackend: remove accessibility check from multi-backend
Maintenance_bot
removed a project:
Patch-For-Review
Oct 9 2025, 11:32 AM
2025-10-09 11:32:21 (UTC+0)
gerritbot
added a comment.
Oct 14 2025, 11:38 AM
2025-10-14 11:38:32 (UTC+0)
Change #1196018 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[mediawiki/core@wmf/1.45.0-wmf.22] filebackend: Remove consistency check for multi-backend
gerritbot
added a project:
Patch-For-Review
Oct 14 2025, 11:38 AM
2025-10-14 11:38:33 (UTC+0)
gerritbot
added a comment.
Oct 14 2025, 11:52 AM
2025-10-14 11:52:52 (UTC+0)
Change #1196018 merged by jenkins-bot:
[mediawiki/core@wmf/1.45.0-wmf.22] filebackend: Remove consistency check for multi-backend
Stashbot
added a comment.
Oct 14 2025, 11:54 AM
2025-10-14 11:54:47 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2025-10-14T11:54:46Z] Started scap sync-world: Backport for [[gerrit:1196018|filebackend: Remove consistency check for multi-backend (T328872)]]
Stashbot
added a comment.
Oct 14 2025, 11:59 AM
2025-10-14 11:59:01 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2025-10-14T11:59:00Z] ladsgroup: Backport for [[gerrit:1196018|filebackend: Remove consistency check for multi-backend (T328872)]] synced to the testservers (see
). Changes can now be verified there.
ReleaseTaggerBot
edited projects, added MW-1.45-notes (1.45.0-wmf.22; 2025-10-07); removed MW-1.45-notes (1.45.0-wmf.23; 2025-10-14)
Oct 14 2025, 12:00 PM
2025-10-14 12:00:38 (UTC+0)
Stashbot
added a comment.
Oct 14 2025, 12:07 PM
2025-10-14 12:07:33 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2025-10-14T12:07:32Z] Finished scap sync-world: Backport for [[gerrit:1196018|filebackend: Remove consistency check for multi-backend (T328872)]] (duration: 12m 46s)
Maintenance_bot
removed a project:
Patch-For-Review
Oct 14 2025, 12:31 PM
2025-10-14 12:31:22 (UTC+0)
aaron
added a comment.
Oct 27 2025, 8:31 PM
2025-10-27 20:31:42 (UTC+0)
In T328872#10873913, @Ladsgroup wrote:
I'm not sure where the best place to put this is, but I think I found something odd about how uploads work. I uploaded a very small SVG file on testwiki with Excimer profiling. 857 ms was spent in the Swift code, but very little of that went to the upload itself. Let me break it down:
I uploaded a tiny SVG and profiled it. It's better, but still slowish.
gerritbot
added a comment.
Jan 31 2026, 5:19 PM
2026-01-31 17:19:16 (UTC+0)
Change #1235490 had a related patch set uploaded (by Func; author: Func):
[mediawiki/core@master] FileBackend: Clean up unused private constants
gerritbot
added a project:
Patch-For-Review
Jan 31 2026, 5:19 PM
2026-01-31 17:19:18 (UTC+0)
gerritbot
added a comment.
Jan 31 2026, 5:24 PM
2026-01-31 17:24:38 (UTC+0)
Change #1235491 had a related patch set uploaded (by Func; author: Func):
[operations/mediawiki-config@master] filebackend: Clean up removed config params for multi-write backends
gerritbot
added a comment.
Feb 2 2026, 11:11 AM
2026-02-02 11:11:52 (UTC+0)
Change #1235490 merged by jenkins-bot:
[mediawiki/core@master] FileBackend: Clean up unused private constants
ReleaseTaggerBot
added a project:
MW-1.46-notes (1.46.0-wmf.14; 2026-02-03)
Feb 2 2026, 12:00 PM
2026-02-02 12:00:50 (UTC+0)
gerritbot
added a comment.
Feb 2 2026, 9:05 PM
2026-02-02 21:05:01 (UTC+0)
Change #1235491 merged by jenkins-bot:
[operations/mediawiki-config@master] filebackend: Clean up removed config params for multi-write backends
Stashbot
mentioned this in
T411914: [Config] Deploy config change to STOP the Tone Check A/B experiment
Feb 2 2026, 9:05 PM
2026-02-02 21:05:27 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-02-02T21:05:24Z] Started scap sync-world: Backport for [[gerrit:1235392|Edit check: turn off the tone a/b test on frwiki, jawiki, ptwiki (T411914)]], [[gerrit:1235111|Enable suggestions BetaFeature on beta wikis (T415504)]], [[gerrit:1230462|WikimediaCustomizations: Set WMCBadEmailDomainsFile (T397244)]], [[gerrit:1235491|filebackend: Clean up removed config params for multi-write backends (T328872)]]
Restricted Application
added a subscriber: Dragoniez
Feb 2 2026, 9:05 PM
2026-02-02 21:05:32 (UTC+0)
Stashbot
mentioned this in
T397244: Private mitigation blocks registration from certain email domains but gives misleading error about rate limits
Feb 2 2026, 9:05 PM
2026-02-02 21:05:35 (UTC+0)
Stashbot
mentioned this in
T415504: EditCheck: Create beta feature preference
Mentioned in SAL (#wikimedia-operations)
[2026-02-02T21:07:21Z] tgr, func, kemayo, esanders: Backport for [[gerrit:1235392|Edit check: turn off the tone a/b test on frwiki, jawiki, ptwiki (T411914)]], [[gerrit:1235111|Enable suggestions BetaFeature on beta wikis (T415504)]], [[gerrit:1230462|WikimediaCustomizations: Set WMCBadEmailDomainsFile (T397244)]], [[gerrit:1235491|filebackend: Clean up removed config params for multi-write backends (T328872)]] synced to
Stashbot
added a comment.
Feb 2 2026, 9:16 PM
2026-02-02 21:16:20 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-02-02T21:16:18Z] Finished scap sync-world: Backport for [[gerrit:1235392|Edit check: turn off the tone a/b test on frwiki, jawiki, ptwiki (T411914)]], [[gerrit:1235111|Enable suggestions BetaFeature on beta wikis (T415504)]], [[gerrit:1230462|WikimediaCustomizations: Set WMCBadEmailDomainsFile (T397244)]], [[gerrit:1235491|filebackend: Clean up removed config params for multi-write backends (T328872)]] (duration: 10
Ladsgroup
added a comment.
Thu, Apr 2, 4:32 PM
2026-04-02 16:32:59 (UTC+0)
I was looking into this a bit yesterday (as part of the more general goal of improving the efficiency and reliability of uploads), and I realized that requests to Swift from MediaWiki are not going through Envoy but are using HTTPS. Given that an upload makes around 30 different requests (half of which go to the remote datacenter), this means that a significant part, if not the majority, of upload time is being spent on TLS handshakes. Maybe I'm missing something obvious, but that's what it looks like from reading the code and config.
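To get a feel for why handshakes could dominate, here is a rough Python cost model. Only the request counts (about 30 requests, half of them remote) come from the comment above. The round-trip times and the figure of two round trips per new TCP plus TLS connection are illustrative assumptions, not production measurements, and the model ignores everything except connection setup.

```python
def handshake_overhead_ms(requests_per_dc, rtt_ms, handshake_round_trips=2, reuse=False):
    """Estimate time spent only on setting up connections.

    requests_per_dc: number of requests sent to each datacenter
    rtt_ms: assumed round-trip time to each datacenter, in milliseconds
    reuse: if True, assume one persistent connection per datacenter
    """
    total = 0.0
    for dc, request_count in requests_per_dc.items():
        connections = min(request_count, 1) if reuse else request_count
        total += connections * handshake_round_trips * rtt_ms[dc]
    return total


requests_per_dc = {"local": 15, "remote": 15}  # "around 30", half remote
assumed_rtt_ms = {"local": 1.0, "remote": 30.0}  # hypothetical values

without_reuse = handshake_overhead_ms(requests_per_dc, assumed_rtt_ms)
with_reuse = handshake_overhead_ms(requests_per_dc, assumed_rtt_ms, reuse=True)
print(f"new connection per request: {without_reuse:.0f} ms")
print(f"one reused connection per DC: {with_reuse:.0f} ms")
```

Under these assumed numbers, connection setup costs far more when every request opens its own connection than when a local proxy keeps one connection per datacenter open, which is the motivation for routing the traffic through Envoy.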
CDanis
subscribed.
Wed, Apr 8, 5:10 PM
2026-04-08 17:10:12 (UTC+0)
Maintenance_bot
removed a project:
Patch-For-Review
Wed, Apr 8, 5:31 PM
2026-04-08 17:31:55 (UTC+0)
gerritbot
added a comment.
Wed, Apr 8, 7:40 PM
2026-04-08 19:40:33 (UTC+0)
Change #1269050 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):
[operations/mediawiki-config@master] Use envoy for swift inside mediawiki
gerritbot
added a project:
Patch-For-Review
Wed, Apr 8, 7:40 PM
2026-04-08 19:40:34 (UTC+0)
gerritbot
added a comment.
Wed, Apr 8, 8:57 PM
2026-04-08 20:57:50 (UTC+0)
Change #1269050 merged by jenkins-bot:
[operations/mediawiki-config@master] Use envoy for swift inside mediawiki
Stashbot
added a comment.
Wed, Apr 8, 8:58 PM
2026-04-08 20:58:13 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-08T20:58:12Z] Started scap sync-world: Backport for [[gerrit:1269050|Use envoy for swift inside mediawiki (T328872)]]
Stashbot
added a comment.
Wed, Apr 8, 9:00 PM
2026-04-08 21:00:05 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-08T21:00:04Z] ladsgroup: Backport for [[gerrit:1269050|Use envoy for swift inside mediawiki (T328872)]] synced to the testservers (see
). Changes can now be verified there.
Stashbot
added a comment.
Wed, Apr 8, 9:04 PM
2026-04-08 21:04:40 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-08T21:04:39Z] Finished scap sync-world: Backport for [[gerrit:1269050|Use envoy for swift inside mediawiki (T328872)]] (duration: 06m 27s)
Maintenance_bot
removed a project:
Patch-For-Review
Wed, Apr 8, 9:31 PM
2026-04-08 21:31:14 (UTC+0)
gerritbot
added a comment.
Thu, Apr 9, 12:05 PM
2026-04-09 12:05:23 (UTC+0)
Change #1269420 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):
[operations/puppet@production] services_proxy: Bump swift timeout
gerritbot
added a project:
Patch-For-Review
Thu, Apr 9, 12:05 PM
2026-04-09 12:05:24 (UTC+0)
gerritbot
added a comment.
Thu, Apr 9, 3:18 PM
2026-04-09 15:18:07 (UTC+0)
Change #1269420 merged by Clément Goubert:
[operations/puppet@production] services_proxy: Bump swift timeout
Maintenance_bot
removed a project:
Patch-For-Review
Thu, Apr 9, 3:31 PM
2026-04-09 15:31:57 (UTC+0)
gerritbot
added a comment.
Thu, Apr 9, 4:39 PM
2026-04-09 16:39:25 (UTC+0)
Change #1269524 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):
[operations/mediawiki-config@master] Revert^2 "Use envoy for swift inside mediawiki"
gerritbot
added a project:
Patch-For-Review
Thu, Apr 9, 4:39 PM
2026-04-09 16:39:26 (UTC+0)
Change #1269524
merged
by jenkins-bot:
[operations/mediawiki-config@master] Revert^2 "Use envoy for swift inside mediawiki"
Stashbot
added a comment.
Thu, Apr 9, 4:41 PM
2026-04-09 16:41:34 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-09T16:41:33Z] Started scap sync-world: Backport for [[gerrit:1269524|Revert^2 "Use envoy for swift inside mediawiki" (
T328872
)]]
Stashbot
added a comment.
Thu, Apr 9, 4:43 PM
2026-04-09 16:43:31 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-09T16:43:30Z] ladsgroup: Backport for [[gerrit:1269524|Revert^2 "Use envoy for swift inside mediawiki" (
T328872
)]] synced to the testservers (see
). Changes can now be verified there.
Stashbot
added a comment.
Thu, Apr 9, 4:48 PM
2026-04-09 16:48:37 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-09T16:48:36Z] Finished scap sync-world: Backport for [[gerrit:1269524|Revert^2 "Use envoy for swift inside mediawiki" (
T328872
)]] (duration: 07m 02s)
Stashbot
added a comment.
Thu, Apr 9, 4:51 PM
2026-04-09 16:51:49 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-09T16:51:48Z] Started scap sync-world: Backport for [[gerrit:1269524|Revert^2 "Use envoy for swift inside mediawiki" (
T328872
)]]
gerritbot
added a comment.
Thu, Apr 9, 5:15 PM
2026-04-09 17:15:42 (UTC+0)
Change #1269535 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):
[operations/deployment-charts@master] shellbox-video: Add swift envoy listeners
gerritbot
added a comment.
Thu, Apr 9, 5:22 PM
2026-04-09 17:22:11 (UTC+0)
Change #1269535
merged
by jenkins-bot:
[operations/deployment-charts@master] shellbox-video: Add swift envoy listeners
Maintenance_bot
removed a project:
Patch-For-Review
Thu, Apr 9, 5:31 PM
2026-04-09 17:31:14 (UTC+0)
Ladsgroup
mentioned this in
T422868: Not able to upload files on Commons
Thu, Apr 9, 9:47 PM
2026-04-09 21:47:32 (UTC+0)
gerritbot
added a comment.
Fri, Apr 10, 12:57 PM
2026-04-10 12:57:47 (UTC+0)
Change #1270031 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):
[operations/puppet@production] envoy: Close connections to swift after 10s of inactivity
gerritbot
added a project:
Patch-For-Review
Fri, Apr 10, 12:57 PM
2026-04-10 12:57:48 (UTC+0)
gerritbot
added a comment.
Mon, Apr 13, 4:41 PM
2026-04-13 16:41:28 (UTC+0)
Change #1270031
merged
by Ladsgroup:
[operations/puppet@production] envoy: Close connections to swift after 10s of inactivity
Maintenance_bot
removed a project:
Patch-For-Review
Mon, Apr 13, 5:34 PM
2026-04-13 17:34:03 (UTC+0)
Ladsgroup
added a comment.
Tue, Apr 14, 12:47 PM
2026-04-14 12:47:20 (UTC+0)
Okay, after four tries (!) we got envoy to work. Now uploads go through envoy which provides good telemetry [1] and some okay-ish performance boost (it's harder to measure given the lack of o11y tools until now. There is
but I haven't seen any noticeable change).
[1]
I'm also looking into ways to incorporate this into trace.wikimedia.org too.
This work so far has exposed another problem: the MW code that calls swift doesn't have any retries set. We (SREs) restart swift frontends from time to time, and an HTTP request might simply fail for various reasons (network partition, etc.). If any one of the many requests it makes fails, the whole upload goes up in the air, leaving the system in an inconsistent state.
It is quite annoying that the MW class decides to do the job of the infrastructure (i.e. file replication across different datacenters) but isn't prepared to handle basic infrastructure failure scenarios, such as an HTTP request simply failing and needing a retry.
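To make the missing piece concrete, here is a minimal sketch of a client-side retry, written in Python rather than in MediaWiki's PHP. It is not the MW file backend code and not what envoy does internally; `TransientError`, `call_with_retry`, and the default values are hypothetical names and numbers chosen for illustration only.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeout, reset, 5xx)."""


def call_with_retry(
    request: Callable[[], T],
    attempts: int = 2,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run request(), retrying transient failures with exponential backoff.

    attempts counts the first try, so attempts=2 means one retry.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return request()
        except TransientError:
            # Out of attempts: let the caller see the original failure.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

A caller would wrap each individual storage request, for example `call_with_retry(lambda: put_chunk(data))`, where `put_chunk` is a placeholder for whatever performs the HTTP call. Blind retries like this are only safe when the request is idempotent, so a real implementation would need to decide that per operation.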
gerritbot
added a comment.
Tue, Apr 14, 1:11 PM
2026-04-14 13:11:23 (UTC+0)
Change #1270931 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):
[operations/puppet@production] envoy: Add 1 retry for swift services
gerritbot
added a project:
Patch-For-Review
Tue, Apr 14, 1:11 PM
2026-04-14 13:11:25 (UTC+0)
gerritbot
added a comment.
Tue, Apr 14, 5:44 PM
2026-04-14 17:44:57 (UTC+0)
Change #1270931
merged
by Ladsgroup:
[operations/puppet@production] envoy: Add 1 retry for swift services
Maintenance_bot
removed a project:
Patch-For-Review
Tue, Apr 14, 6:31 PM
2026-04-14 18:31:20 (UTC+0)
Ladsgroup
added a comment.
Wed, Apr 15, 6:04 PM
2026-04-15 18:04:35 (UTC+0)
Progress report: in the past 24 hours we had 9 cases of requests failing, of which 6 were successfully and automatically retried via envoy. The envoy retry mechanism is also useful during reboots and other issues, such as one swift frontend going down. For the remaining three, let me investigate what is in the swift logs.
gerritbot
added a comment.
Thu, Apr 16, 3:28 PM
2026-04-16 15:28:24 (UTC+0)
Change #1271926 had a related patch set uploaded (by CDanis; author: CDanis):
[operations/puppet@production] envoyproxy::tls_terminator: request header rewriting
gerritbot
added a project:
Patch-For-Review
Thu, Apr 16, 3:28 PM
2026-04-16 15:28:25 (UTC+0)
Change #1271927 had a related patch set uploaded (by CDanis; author: CDanis):
[operations/puppet@production] swift::proxy: attempt some tracing context propagation
Stashbot
added a comment.
Thu, Apr 16, 3:29 PM
2026-04-16 15:29:02 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-16T15:29:01Z] 💔cdanis@cumin1003.eqiad.wmnet ~ 🕦☕ sudo cumin 'A:swift-fe' 'disable-puppet "cdanis deploy I3aaec0ca
T328872
"'
gerritbot
added a comment.
Thu, Apr 16, 3:31 PM
2026-04-16 15:31:06 (UTC+0)
Change #1271926
merged
by CDanis:
[operations/puppet@production] envoyproxy::tls_terminator: request header rewriting
gerritbot
added a comment.
Thu, Apr 16, 3:31 PM
2026-04-16 15:31:08 (UTC+0)
Change #1271927
merged
by CDanis:
[operations/puppet@production] swift::proxy: attempt some tracing context propagation
gerritbot
added a comment.
Thu, Apr 16, 4:17 PM
2026-04-16 16:17:37 (UTC+0)
Change #1272773 had a related patch set uploaded (by CDanis; author: CDanis):
[operations/puppet@production] Revert "swift::proxy: attempt some tracing context propagation"
gerritbot
added a comment.
Thu, Apr 16, 4:18 PM
2026-04-16 16:18:23 (UTC+0)
Change #1272773
merged
by CDanis:
[operations/puppet@production] Revert "swift::proxy: attempt some tracing context propagation"
gerritbot
added a comment.
Thu, Apr 16, 4:26 PM
2026-04-16 16:26:09 (UTC+0)
Change #1272775 had a related patch set uploaded (by CDanis; author: CDanis):
[operations/puppet@production] swift::proxy: re-try some tracing context propagation
gerritbot
added a comment.
Thu, Apr 16, 4:29 PM
2026-04-16 16:29:17 (UTC+0)
Change #1272775
merged
by CDanis:
[operations/puppet@production] swift::proxy: re-try some tracing context propagation
Stashbot
added a comment.
Thu, Apr 16, 4:30 PM
2026-04-16 16:30:12 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-16T16:30:11Z] 💙cdanis@cumin1003.eqiad.wmnet ~ 🕧☕ sudo cumin 'A:swift-fe' 'disable-puppet "cdanis deploy 8ad070a466
T328872
"'
Maintenance_bot
removed a project:
Patch-For-Review
Thu, Apr 16, 4:30 PM
2026-04-16 16:30:53 (UTC+0)
Stashbot
added a comment.
Thu, Apr 16, 4:38 PM
2026-04-16 16:38:38 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-16T16:38:37Z] 💙cdanis@cumin1003.eqiad.wmnet ~ 🕧☕ sudo cumin 'A:swift-fe' 'enable-puppet "cdanis deploy 8ad070a466
T328872
"'
Xover
mentioned this in
T423548: Page images disappearing on edit
Thu, Apr 16, 5:32 PM
2026-04-16 17:32:35 (UTC+0)
CDanis
added a comment.
Fri, Apr 17, 4:51 PM
2026-04-17 16:51:55 (UTC+0)
FYI: as of my Puppet patches above, you can now use an x-request-id value to find all the intra-Swift requests associated with that request.
You do need to remove the final 4 hex digits from an x-request-id, as Swift truncates them. The truncated ids now appear in the Swift access logs of all the ms-fe and ms-be hosts that the request touched:
💙cdanis@cumin1003.eqiad.wmnet ~ 🕐☕ sudo cumin --force 'A:swift AND A:eqiad' 'rg a94b06fe-2eb0-40a4-9ff9-a479f633 /var/log/swift/*.log | wc -l || true'
50 hosts will be targeted:
ms-be[1064-1097].eqiad.wmnet,ms-fe[1009-1024].eqiad.wmnet
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====
(1) ms-be1082.eqiad.wmnet
----- OUTPUT for command #1: 'rg a94b06fe-2eb0... | wc -l || true' -----
===== NODE GROUP =====
(3) ms-be[1084,1095].eqiad.wmnet,ms-fe1023.eqiad.wmnet
----- OUTPUT for command #1: 'rg a94b06fe-2eb0... | wc -l || true' -----
===== NODE GROUP =====
(5) ms-be[1070,1078,1085,1097].eqiad.wmnet,ms-fe1017.eqiad.wmnet
----- OUTPUT for command #1: 'rg a94b06fe-2eb0... | wc -l || true' -----
===== NODE GROUP =====
(5) ms-be[1067,1069,1087,1090,1094].eqiad.wmnet
----- OUTPUT for command #1: 'rg a94b06fe-2eb0... | wc -l || true' -----
===== NODE GROUP =====
(36) ms-be[1064-1066,1068,1071-1077,1079-1081,1083,1086,1088-1089,1091-1093,1096].eqiad.wmnet,ms-fe[1009-1016,1018-1022,1024].eqiad.wmnet
----- OUTPUT for command #1: 'rg a94b06fe-2eb0... | wc -l || true' -----
================
100.0% (50/50) success ratio (>= 100.0% threshold) for command #1: 'rg a94b06fe-2eb0... | wc -l || true'.
100.0% (50/50) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

💙cdanis@cumin1003.eqiad.wmnet ~ 🕐☕ sudo cumin --force 'A:swift AND A:codfw' 'rg a94b06fe-2eb0-40a4-9ff9-a479f633 /var/log/swift/*.log | wc -l || true'
51 hosts will be targeted:
ms-be[2062-2096].codfw.wmnet,ms-fe[2009-2024].codfw.wmnet
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====
(1) ms-be2083.codfw.wmnet
----- OUTPUT for command #1: 'rg a94b06fe-2eb0... | wc -l || true' -----
===== NODE GROUP =====
(12) ms-be[2063-2065,2074-2075,2084,2090-2091,2093,2095-2096].codfw.wmnet,ms-fe2010.codfw.wmnet
----- OUTPUT for command #1: 'rg a94b06fe-2eb0... | wc -l || true' -----
===== NODE GROUP =====
(3) ms-be[2067,2087].codfw.wmnet,ms-fe2015.codfw.wmnet
----- OUTPUT for command #1: 'rg a94b06fe-2eb0... | wc -l || true' -----
===== NODE GROUP =====
(35) ms-be[2062,2066,2068-2073,2076-2082,2085-2086,2088-2089,2092,2094].codfw.wmnet,ms-fe[2009,2011-2014,2016-2024].codfw.wmnet
----- OUTPUT for command #1: 'rg a94b06fe-2eb0... | wc -l || true' -----
================
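For scripting the lookup, the truncation step can be sketched as a small Python helper. This is an illustration only: `swift_log_search_key` is a hypothetical name, and the final four digits in the usage example are made up, since only the truncated id appears in the commands above.

```python
import re

# An x-request-id is expected to end in hex digits; Swift is described
# above as dropping the last four of them.
_HEX_TAIL = re.compile(r"[0-9a-fA-F]{4}$")


def swift_log_search_key(x_request_id: str) -> str:
    """Return the id without its final 4 hex digits, for grepping Swift logs."""
    rid = x_request_id.strip()
    if not _HEX_TAIL.search(rid):
        raise ValueError("expected the id to end in 4 hex digits")
    return rid[:-4]
```

With an invented tail of `abcd`, `swift_log_search_key("a94b06fe-2eb0-40a4-9ff9-a479f633abcd")` gives `a94b06fe-2eb0-40a4-9ff9-a479f633`, which is the string passed to `rg` in the cumin commands above.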
CDanis
added a comment.
Fri, Apr 17, 5:04 PM
2026-04-17 17:04:46 (UTC+0)
BTW -- here are two canned queries for distributed traces of uploads:
and
What Jaeger sees only includes the requests from MediaWiki towards ms-fe hosts, of course -- but you can use the x-request-id from the tags to pivot into logs.
I've no idea if Swift can emit OTel, or if it could use a proxy that could.
Pppery
removed a project:
API Platform
Sat, Apr 18, 5:38 AM
2026-04-18 05:38:37 (UTC+0)