The issue is exactly the problem outlined above.

People have been complaining that T330942 is not resolved, but that part is fixed. E.g. if you go to eqiad, the "master" filebackend is indeed swift in the primary dc (codfw), whereas before the fix it was the local swift (eqiad, see T330942#8660782).

Here is an example. I used https://test.wikipedia.org/wiki/File:Wikitech-2021-blue-large-icon_(copy).png to debug (note that the mwdebug host is in eqiad):

ladsgroup@mwdebug1002:~$ mwscript eval.php --wiki=testwiki --d 3
[debug] [memcached] MemcachedPeclBagOStuff::initializeClient: initializing new client instance.
[debug] [memcached] MainWANObjectCache using store MemcachedPeclBagOStuff
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reallyOpenConnection: opened new connection for 0/testwiki
[debug] [rdbms] Wikimedia\Rdbms\LBFactory::getChronologyProtector: request info {
    "IPAddress": "127.0.0.1",
    "UserAgent": false,
    "ChronologyProtection": false,
    "ChronologyPositionIndex": 0,
    "ChronologyClientId": false
}
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::loadSessionPrimaryPos: executed chronology callback.
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: connecting to db1112...
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reallyOpenConnection: opened new connection for 4/
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::getReaderIndex: using server db1112 for group ''
[debug] [rdbms] Wikimedia\Rdbms\LoadBalancer::reuseOrOpenConnectionForNewRef: reusing connection for 4/testwiki
> $backend = MediaWiki\MediaWikiServices::getInstance()->getFileBackendGroup()->get( 'local-multiwrite' );

> $iterator = $backend->getFileList( [ 'dir' => "mwstore://local-multiwrite/local-thumb/2/2a/Wikitech-2021-blue-large-icon_(copy).png" ] );

> foreach ( $iterator as $file ) { var_dump( $file ); }
[debug] [FileOperation] HTTP start: GET https://ms-fe.svc.codfw.wmnet/auth/v1.0
[debug] [FileOperation] HTTP complete: GET https://ms-fe.svc.codfw.wmnet/auth/v1.0 code=200 size=0 total=0.108473 connect=0.033915
[debug] [FileOperation] HTTP start: GET https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-test-local-thumb
[debug] [FileOperation] HTTP complete: GET https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-test-local-thumb code=200 size=xxx total=0.065867 connect=0.000053
string(47) "1024px-Wikitech-2021-blue-large-icon_(copy).png"
string(47) "1280px-Wikitech-2021-blue-large-icon_(copy).png"
string(46) "320px-Wikitech-2021-blue-large-icon_(copy).png"
string(46) "640px-Wikitech-2021-blue-large-icon_(copy).png"
string(46) "800px-Wikitech-2021-blue-large-icon_(copy).png"

As you can see, it's hitting codfw (the primary dc right now).

But the thumb that is not updating is 120px and only exists in eqiad: https://upload.wikimedia.org/wikipedia/test/thumb/2/2a/Wikitech-2021-blue-large-icon_%28copy%29.png/120px-Wikitech-2021-blue-large-icon_%28copy%29.png?20230413110558

root@ms-fe1009:~# swift list --prefix "2/2a/Wikitech-2021-blue-large-icon_(copy).png" wikipedia-test-local-thumb
2/2a/Wikitech-2021-blue-large-icon_(copy).png/1200px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/120px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/600px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/800px-Wikitech-2021-blue-large-icon_(copy).png

vs.

root@ms-fe2009:~# swift list --prefix "2/2a/Wikitech-2021-blue-large-icon_(copy).png" wikipedia-test-local-thumb
2/2a/Wikitech-2021-blue-large-icon_(copy).png/1024px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/1280px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/320px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/640px-Wikitech-2021-blue-large-icon_(copy).png
2/2a/Wikitech-2021-blue-large-icon_(copy).png/800px-Wikitech-2021-blue-large-icon_(copy).png

Basically what happens is this: a user hits the thumbnail in eqiad but not in codfw (by uploading/loading/etc. via the secondary datacenter), so that thumbnail only exists in eqiad's swift. On reupload, mw gets the list of thumbnails for the image and then removes them, but that listing only hits the primary dc's swift (it used to be the dc's local swift, which was worse). I assume the removals get propagated, but if a thumbnail doesn't show up in the listing of the thumb container, it is never deleted in the primary (because it doesn't exist there), so nothing is propagated for it and the outdated one stays in the secondary dc.
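
The sequence can be modelled with plain sets. This is only a toy sketch, not MediaWiki code: the two sets are the thumbnail sizes from the `swift list` outputs above, `purge_thumbnails` is a made-up stand-in for the purge on reupload, and it assumes (as I do above) that deletions issued against the primary get replicated to the secondary.

```python
# Toy model of the reupload purge; not MediaWiki code.
# Thumbnail sizes taken from the two `swift list` outputs above.
eqiad = {"1200px", "120px", "600px", "800px"}            # secondary dc
codfw = {"1024px", "1280px", "320px", "640px", "800px"}  # primary dc


def purge_thumbnails(primary, secondary):
    """Hypothetical stand-in for the purge: list in primary, delete what was listed.

    Assumes every deletion is replicated to the secondary dc.
    Returns what is left as (primary, secondary).
    """
    listed = set(primary)  # the listing only ever sees the primary dc
    return primary - listed, secondary - listed


left_primary, left_secondary = purge_thumbnails(codfw, eqiad)
print(sorted(left_primary))    # []
print(sorted(left_secondary))  # ['1200px', '120px', '600px']
```

The primary ends up empty, while every thumb that only existed in eqiad (including the 120px one) survives the purge.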

Solutions:

  • Make thumb generation replicate in both directions. That can be done in swift and I hope it isn't too hard to implement. (Databases do that for some specific cases, e.g. ParserCache and x2.)
  • Add a job to delete thumbs in the secondary dc. That will be complicated because I'm not sure we run jobs in the secondary dc, and even if we run it in the primary dc against the remote swift, there is always a chance of a race condition with users or ThumbnailRenderJob. So I prefer the first option.
  • Delegate deletion of thumbs to swift and replicate that. Instead of discovering all thumbnails (which mw is clearly not doing a good job of), make mw issue a command to swift to delete every thumb it has under the given file name; swift would then replicate that command to the secondary dc.
  • ?
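
To make the third option a bit more concrete, here is a rough toy model, again not real swift or MediaWiki code. `delete_by_prefix` is a made-up helper standing for "swift deletes everything under this file name", `OTHER` is an invented unrelated file, and whether swift can actually replicate such a command across dcs is an assumption here.

```python
# Toy model of option 3; not real swift or MediaWiki code.
PREFIX = "2/2a/Wikitech-2021-blue-large-icon_(copy).png/"
OTHER = "a/ab/Some_other_file.png/"  # hypothetical unrelated file

# Thumb container contents per dc, sizes taken from the listings above.
containers = {
    "eqiad": {PREFIX + s for s in ("1200px", "120px", "600px", "800px")}
    | {OTHER + "120px"},
    "codfw": {PREFIX + s for s in ("1024px", "1280px", "320px", "640px", "800px")},
}


def delete_by_prefix(container, prefix):
    """Hypothetical helper: drop every object whose name starts with prefix."""
    return {name for name in container if not name.startswith(prefix)}


# The same command runs in every dc, so nothing depends on a listing
# taken from the primary dc only.
after = {dc: delete_by_prefix(objs, PREFIX) for dc, objs in containers.items()}
print(after)
```

Since the command never depends on what the primary happens to list, the thumbs that only exist in eqiad go away too, and unrelated files are left alone.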

Also, we really should get rid of three sizes in the pregen thumb sizes.