T337446
Rebuild sanitarium hosts
Closed, Resolved
Public
Actions
Edit Task
Edit Related Tasks...
Create Subtask
Edit Parent Tasks
Edit Subtasks
Merge Duplicates In
Close As Duplicate
Edit Related Objects...
Edit Commits
Edit Mocks
Mute Notifications
Protect as security issue
Assigned To
Marostegui
Authored By
Ladsgroup
2023-05-25 05:35:41 (UTC+0)
Tags
DBA (Done)
Data-Services (Wiki replicas)
Data-Engineering (Incoming (new tickets))
cloud-services-team (Inbox)
TaxonBot (Webservice/DB)
User-notice-archive (Backlog)
Referenced Files
F37094416: image.png (2023-06-05 17:49:33 UTC)
F37084024: image.png (2023-05-30 10:22:31 UTC)
Subscribers
1234qwer1234qwer4
1AmNobody24
4TheWynne
AaR888
aborrero
Aklapper
ArielGlenn
(54 total)
Description
NOTE: Per T337446#8882092, replag is likely to keep increasing until mid next week. This only affects s1, s2, s3, s4, s5 and s7. The rest of the sections should be working as normal.
db1154 and db1155 have their replication broken due to different errors. For example, for s5:
PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table dewiki.flaggedpage_pending: Cant find record in flaggedpage_pending, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1161-bin.001646, end_log_pos 385492288
Replication is also broken for s4 and s7, but on different tables.
Sections with broken replication are: s1, s2, s5, s7
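Note: for reference, a broken replication thread like this can be inspected on a multi-source MariaDB replica roughly as follows. This is a minimal sketch; the connection name follows the section, and the mysqlbinlog step assumes access to the master's binlog file named in the error above (the path is illustrative).
# Check the named s5 replication connection (MariaDB multi-source syntax).
sudo mysql -e "SHOW SLAVE 's5' STATUS\G" | grep -E 'Slave_SQL_Running|Last_SQL_Errno|Last_SQL_Error'
# On the master, decode the failing row event around the reported position
# (binlog file and end_log_pos are taken from the error message above).
sudo mysqlbinlog --base64-output=decode-rows --verbose --stop-position=385492288 /srv/sqldata/db1161-bin.001646 | tail -n 100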
Broken summary:
db1154:
s1 (caught up)
s3 (caught up)
s5 (caught up)
db1155:
s2 (caught up)
s4 (caught up)
s7 (caught up)
Recloning process
s1:
clouddb1013
clouddb1017
clouddb1021
s2:
clouddb1014
clouddb1018
clouddb1021
s3:
clouddb1013
clouddb1017
clouddb1021
s4:
clouddb1015
clouddb1019
clouddb1021
s5:
clouddb1016
clouddb1020
clouddb1021
s7:
clouddb1014
clouddb1018
clouddb1021
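Note: in practice, "recloning" here means stopping MariaDB on a known-good source and streaming its data directory to each target. A minimal sketch using the transferpy tool (section, hosts and paths are illustrative, not the exact commands used):
# On the source host: stop the s1 instance so its datadir is consistent.
sudo systemctl stop mariadb@s1
# Stream the datadir to the target; transferpy handles compression and checksums.
transfer.py db1154.eqiad.wmnet:/srv/sqldata.s1 clouddb1013.eqiad.wmnet:/srv/sqldata.s1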
Details
Related Changes in Gerrit:
wikireplicas: restore pybal monitoring (operations/puppet @ production, +8/-12)
db1155: Enable notifications (operations/puppet @ production, +0/-1)
db1154: Enable notifications (operations/puppet @ production, +0/-1)
service: Disable monitors for wikireplicas (operations/puppet @ production, +15/-9)
wiki-replicas.sql: Add meta_p GRANT (operations/puppet @ production, +1/-0)
wiki-replicas.sql: Add heartbeat_p (operations/puppet @ production, +1/-0)
db1156,db1161,db1196,db1212: Disable notifications (operations/puppet @ production, +4/-0)
wiki-replicas.sql: Create role (operations/puppet @ production, +1/-0)
Related Objects
T337446: Rebuild sanitarium hosts (Resolved, assigned to Marostegui)
T337721: Wiki-replicas: investigate why some maintenance operations can cause unwanted pybal impact (Invalid, unassigned)
T337734: Investigate if maintain-replica-indexes is still needed (Resolved, assigned to Ladsgroup)
T337811: Check and enable GTID across sanitarium and clouddb* hosts (Resolved, assigned to Marostegui)
Mentioned In
T344608: WMCS-roots paging responsibilities
T337848: WMCS-roots wiki replica access
T339243: ServiceLVS without monitor breaks spicerack
T337888: guc.toolforge with database error s4
T337829: Requesting access to ops (or wmcs-roots) for TheresNoTime
T337791: CopyPatrol error 500
T337742: Eventmetrics fails with error 500 after login
T337734: Investigate if maintain-replica-indexes is still needed
T337721: Wiki-replicas: investigate why some maintenance operations can cause unwanted pybal impact
T337682: querying metawiki on the replicas gives ERROR 2013 (HY000): Lost connection to MySQL server
T337674: Error 504 on XTools
T337645: Page created 26 May 2023 not found in XTools
T337571: Requesting access to ops group for nskaggs
T336886: Add user_is_temp field to the user table in MediaWiki core
Mentioned Here
T338172: Can't connect to analytics replicas from Toolforge
T337961: Clean up clouddb1021
T337734: Investigate if maintain-replica-indexes is still needed
P48640 dbctl commit (dc=all): 'Depool db1221 (sanitarium s4 master) T337446'
T337682: querying metawiki on the replicas gives ERROR 2013 (HY000): Lost connection to MySQL server
P48598 dbctl commit (dc=all): 'Depool sanitarium masters for s1, s2, s3, s5 T337446'
Duplicates Merged Here
T337888: guc.toolforge with database error s4
T337805: Global user contributions failed
T337733: Fetching sessions from the Trove database sometimes times out
T337706: Internal server error of xtools
T337704: X tools "fatal error" on fr-wp
T337699: PLEASE REPLACE WITH A DESCRIPTION OF THE ERROR
T337693: please replace with a description of the error
T337687: Page stats not displaying on articles
T337682: querying metawiki on the replicas gives ERROR 2013 (HY000): Lost connection to MySQL server
T337645: Page created 26 May 2023 not found in XTools
T337623: Petscan is not updating article size
T337536: Xtools does not work
Event Timeline
There are a very large number of changes, so older changes are hidden.
TheresNoTime
added a comment.
2023-05-30 18:58:58 (UTC+0)
In T337446#8890117, @Marostegui wrote:
Thanks for the report. It was only on clouddb1021 but not on the others (as I did the transfer) before we found this issue. I have fixed it on the other two, sorry for the inconvenience. Lots of moving pieces in all this.
No apologies necessary, and thank you :-) Confirmed working for me/the tool in question.
SWinxy
added a comment.
2023-05-30 19:52:46 (UTC+0)
Is there an estimate for when things'll be fully restored? Y'all are great.
Nintendofan885
subscribed.
2023-05-30 19:59:28 (UTC+0)
Marostegui
added a comment.
2023-05-30 20:04:28 (UTC+0)
In T337446#8890288, @SWinxy wrote:
Is there an estimate for when things'll be fully restored? Y'all are great.
If nothing happens, everything should be back tomorrow.
However, I will probably rebuild s4 tomorrow (it will probably take 2-3 days), as I don't fully trust its data anymore since it broke earlier today. I fixed the row manually, but there could be more issues under the hood.
Marostegui
updated the task description.
2023-05-30 20:05:56 (UTC+0)
Marostegui
updated the task description.
JJMC89
merged a task:
T337805: Global user contributions failed
2023-05-31 02:40:04 (UTC+0)
JJMC89
added a subscriber:
Lemonaka
MJL
awarded a token.
2023-05-31 02:41:22 (UTC+0)
L3X1
subscribed.
2023-05-31 04:09:50 (UTC+0)
Marostegui
updated the task description.
2023-05-31 04:55:41 (UTC+0)
Marostegui
updated the task description.
2023-05-31 04:58:38 (UTC+0)
s1 is fully recloned, and catching up.
I am going to start with s4 to be on the safe side.
Stashbot
added a comment.
2023-05-31 04:59:28 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2023-05-31T04:59:27Z] dbctl commit (dc=all): 'Depool db1221 (sanitarium s4 master) T337446', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20230531-045927-root.json
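Note: the depool recorded in this SAL entry is done with dbctl, the conftool-based DB pooling CLI. A sketch of a typical invocation, inferred from the entry above (exact flags may differ):
# Mark the instance as depooled, then commit the configuration change.
sudo dbctl instance db1221 depool
sudo dbctl config commit -m "Depool db1221 (sanitarium s4 master) T337446"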
Marostegui
updated the task description.
2023-05-31 05:02:57 (UTC+0)
Marostegui
claimed this task.
2023-05-31 05:11:22 (UTC+0)
FatalFit
unsubscribed.
2023-05-31 05:55:21 (UTC+0)
Marostegui
updated the task description.
2023-05-31 06:05:43 (UTC+0)
gerritbot
added a comment.
2023-05-31 06:09:16 (UTC+0)
Change 924772 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] db1154: Enable notifications
gerritbot
added a comment.
2023-05-31 06:10:04 (UTC+0)
Change 924772 merged by Marostegui:
[operations/puppet@production] db1154: Enable notifications
Marostegui
updated the task description.
2023-05-31 07:24:44 (UTC+0)
1AmNobody24
subscribed.
2023-05-31 07:33:49 (UTC+0)
TheresNoTime
mentioned this in
T337829: Requesting access to ops (or wmcs-roots) for TheresNoTime
2023-05-31 09:23:18 (UTC+0)
Marostegui
updated the task description.
2023-05-31 10:09:24 (UTC+0)
Ladsgroup
awarded a token.
2023-05-31 10:11:43 (UTC+0)
Marostegui
updated the task description.
2023-05-31 10:40:11 (UTC+0)
Marostegui
updated the task description.
2023-05-31 11:01:44 (UTC+0)
Marostegui
updated the task description.
2023-05-31 13:34:58 (UTC+0)
s4 on clouddb1021 has been recloned, with views, heartbeat and grants added. Once it has caught up, I will reclone the other two hosts from it.
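Note: a sketch of the post-reclone steps mentioned here, assuming the standard wiki-replicas tooling (maintain-views regenerates the sanitized views; the flags shown are illustrative):
# Regenerate the sanitized views for all databases on the recloned replica.
sudo maintain-views --all-databases --replace-all
# heartbeat_p and the meta_p grant are then restored via the wiki-replicas.sql
# changes listed under "Related Changes in Gerrit" above.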
Marostegui
updated the task description.
2023-05-31 13:50:00 (UTC+0)
Marostegui
updated the task description.
2023-05-31 16:36:09 (UTC+0)
Marostegui
updated the task description.
2023-05-31 16:38:23 (UTC+0)
Ayoub_
subscribed.
2023-05-31 19:23:23 (UTC+0)
TheresNoTime
merged a task:
T337888: guc.toolforge with database error s4
2023-05-31 19:27:22 (UTC+0)
TheresNoTime
added a subscriber:
doctaxon
doctaxon
added a project:
TaxonBot
2023-05-31 19:35:05 (UTC+0)
doctaxon
moved this task from Backlog to Webservice/DB on the TaxonBot board.
2023-05-31 19:38:39 (UTC+0)
Marostegui
updated the task description.
2023-05-31 20:10:17 (UTC+0)
Marostegui
updated the task description.
s4 has been fully recloned; clouddb1019:3314 is now catching up with its master.
Marostegui
mentioned this in
T337888: guc.toolforge with database error s4
2023-05-31 20:24:05 (UTC+0)
Liz
added a comment.
2023-06-01 00:54:31 (UTC+0)
I'm not sure what's causing it (regarding s1), but I'm finding some bots are not returning up-to-date reports. With s1 down for 5 days, there should be a backlog of lengthy reports but I'm seeing short reports or none at all. Did every new edit since May 25th get restored and integrated? Sorry that I don't know the correct terminology.
Marostegui
added a comment.
2023-06-01 01:42:03 (UTC+0)
In T337446#8894111, @Liz wrote:
I'm not sure what's causing it (regarding s1), but I'm finding some bots are not returning up-to-date reports. With s1 down for 5 days, there should be a backlog of lengthy reports but I'm seeing short reports or none at all. Did every new edit since May 25th get restored and integrated? Sorry that I don't know the correct terminology.
Can you give us more details about how to debug this? s1 data is up to date now, so the reports should be providing the right data unless there's a queue and/or a cache layer somewhere.
Marostegui
updated the task description.
2023-06-01 03:58:03 (UTC+0)
gerritbot
added a comment.
2023-06-01 04:36:01 (UTC+0)
Change 925286 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] db1155: Enable notifications
gerritbot
added a comment.
2023-06-01 04:37:23 (UTC+0)
Change 925286 merged by Marostegui:
[operations/puppet@production] db1155: Enable notifications
LilianaUwU
subscribed.
2023-06-01 05:42:27 (UTC+0)
Marostegui
lowered the priority of this task from Unbreak Now! to High.
2023-06-01 08:47:59 (UTC+0)
I am reducing the priority of this as all the hosts have been recloned now and data should be up to date.
We shouldn't be surprised if s6 and s8 (the sections that never break) end up breaking on the sanitarium hosts: if the problem was 10.4.29, data might have been corrupted there too and simply hasn't shown up yet.
I am going to do some data checking now on the recloned versions before closing this task, hopefully by Monday if everything goes fine over the next few days.
Things might still be slow on some of the tools as we are adding the special indexes used in wikireplicas; that can be tracked at T337734
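Note: as an illustration of the kind of data checking described above, one can spot-check that a recloned replica agrees with its source on simple aggregates. Hosts and query below are hypothetical examples, not the actual verification tooling:
# Compare a row count and max(rev_id) between source and recloned replica.
for host in db1154.eqiad.wmnet clouddb1021.eqiad.wmnet; do
  echo -n "$host: "
  mysql -h "$host" -BN -e "SELECT COUNT(*), MAX(rev_id) FROM dewiki.revision"
done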
Marostegui
added a comment.
2023-06-01 08:51:12 (UTC+0)
Just posted on wikitech-l
Don-vip
subscribed.
2023-06-01 11:24:56 (UTC+0)
Marostegui
closed subtask T337811: Check and enable GTID across sanitarium and clouddb* hosts as Resolved.
2023-06-01 11:41:37 (UTC+0)
Quiddity
subscribed.
2023-06-01 19:10:08 (UTC+0)
In T337446#8886722, @MusikAnimal wrote:
Something should be in Tech News
Please could someone suggest how to summarize this for Tech News? Draft wording always helps immensely! (1-3 short sentences, not too technical, 1-2 links for context or more details).
From a skim of all the above, the best I can guess at (probably very inaccurate!) is:
For a few days last week, [readers/editors] in some regions experienced delays seeing edits being visible, which also caused problems for some tools. This was caused by problems in the secondary databases. This should now be fixed.
Ladsgroup
added a comment.
2023-06-01 19:18:47 (UTC+0)
Production databases didn't have lag; only the cloud replicas did, with lag on the order of days. Basically, they stopped getting any updates for around a week due to data integrity issues.
Hope that clears it up a bit. (On phone, otherwise I would have drafted an exact phrasing change.)
Marostegui
added a comment.
2023-06-01 19:25:24 (UTC+0)
In T337446#8897087, @Quiddity wrote:
In T337446#8886722, @MusikAnimal wrote:
Something should be in Tech News
Please could someone suggest how to summarize this for Tech News? Draft wording always helps immensely! (1-3 short sentences, not too technical, 1-2 links for context or more details).
From a skim of all the above, the best I can guess at (probably very inaccurate!) is:
For a few days last week, [readers/editors] in some regions experienced delays seeing edits being visible, which also caused problems for some tools. This was caused by problems in the secondary databases. This should now be fixed.
Wikireplicas had outdated data and were unavailable for around 1 week. There were periods where not even old data was available.
Tools have most likely experienced intermittent unavailability from Wednesday of last week until today. We are still adding indexes, so even though everything is up, some tools may still be slow.
This outage didn't affect production.
Ladsgroup
added a comment.
Edited
2023-06-01 19:29:21 (UTC+0)
The slowdown issue is resolved by now. There are still some replicas that don't have the indexes yet, but all of them are depooled, so there is no user-facing slowdown anymore.
Legoktm
added a comment.
2023-06-01 19:42:22 (UTC+0)
In T337446#8897087, @Quiddity wrote:
Please could someone suggest how to summarize this for Tech News? Draft wording always helps immensely! (1-3 short sentences, not too technical, 1-2 links for context or more details).
Some tools and bots returned outdated information due to database breakage, and may have been down entirely while it was being fixed. These issues have now been fixed.
Possibly could link to
but that's English-only.
Quiddity
added a comment.
2023-06-01 20:18:03 (UTC+0)
Thank you immensely @Legoktm, that's exactly what I needed. :)
Now added. If anyone has changes, please make them directly there, within the next ~23 hours. Thanks.
Quiddity
moved this task from To Triage to In current Tech/News draft on the User-notice board.
2023-06-01 20:18:18 (UTC+0)
Guycn2
unsubscribed.
2023-06-01 22:01:33 (UTC+0)
Lemonaka
awarded a token.
2023-06-02 03:25:01 (UTC+0)
Ladsgroup
closed subtask T337734: Investigate if maintain-replica-indexes is still needed as Resolved.
2023-06-02 10:08:55 (UTC+0)
Ladsgroup
closed this task as Resolved.
2023-06-02 10:14:21 (UTC+0)
Ladsgroup
moved this task from In progress to Done on the DBA board.
The hosts have been fully rebuilt and are working as expected, without any major replag anymore. The indexes have been added too, so I'm closing this. Some follow-ups are needed (like T337961), but the user-facing parts are done. Sorry for the inconvenience, and a major wikilove to @Marostegui, who worked day and night over the last week and weekend to get everything back to normal.
MusikAnimal
awarded a token.
2023-06-02 13:29:48 (UTC+0)
MusikAnimal
added a comment.
2023-06-02 13:44:53 (UTC+0)
In T337446#8898231, @Ladsgroup wrote:
The hosts have been fully rebuilt and are working as expected, without any major replag anymore. The indexes have been added too, so I'm closing this. Some follow-ups are needed (like T337961), but the user-facing parts are done. Sorry for the inconvenience and a major wikilove to Marostegui, who worked day and night over the last week and weekend to get everything back to normal.
Agreed, immense thanks to @Marostegui and also you, Ladsgroup!
I wanted to ask something I've genuinely been curious about for years -- since the wiki replicas are relied upon so heavily by the editing communities (and to some degree, readers), should we as an org treat their health with more scrutiny? This is of course insignificant compared to the production replicas going down, but nonetheless the effects were surely felt all across the movement (editathons don't have live tracking, stewards can't query for global contribs, important bots stop working, etc.). I.e., I wonder if there's any appetite to file an incident report, especially if we feel there are lessons to be learned to prevent similar future outages? I noticed other comparatively low-impact incidents have been documented, such as PAWS outages.
TheDJ
subscribed.
2023-06-02 14:13:31 (UTC+0)
Also, much thanks especially to @Marostegui from my side.
In T337446#8898661, @MusikAnimal wrote:
I wanted to ask something I've genuinely been curious about for years -- since the wiki replicas are relied upon so heavily by the editing communities (and to some degree, readers), should we as an org treat their health with more scrutiny? This is of course insignificant compared to the production replicas going down, but nonetheless the effects were surely felt all across the movement. I.e., I wonder if there's any appetite to file an incident report, especially if we feel there are lessons to be learned to prevent similar future outages? I noticed other comparatively low-impact incidents have been documented, such as PAWS outages.
I do think that, at the very least, we should have some way to recover from severe incidents like these a whole lot faster. Maybe having a delayed replica that we can use as a data source to speed up recovery, or something like a puppet run that preps a 'fresh' replica instance every single day, to make sure all the parts needed for that are known to be good?
I think this one required too much learning on the job for something this critical, and the sole reason is that it luckily doesn't happen too often, but I think the whole process was too involved for everyone. It was affecting and disrupting too many people and teams, which I think is the point of reference we should be using instead of "it's not production".
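Note: the delayed-replica idea suggested above is supported natively by MariaDB. A minimal sketch, assuming a spare multi-source replica (the connection name and the 24-hour delay are illustrative):
# Configure a 24h apply delay on the s1 connection (MariaDB >= 10.2.3).
sudo mysql -e "STOP SLAVE 's1'; CHANGE MASTER 's1' TO MASTER_DELAY = 86400; START SLAVE 's1';"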
SD0001
subscribed.
Edited
2023-06-05 17:31:27 (UTC+0)
s1 looks to be down again.
(Edit: Now tracked at T338172)
sd@tools-sgebastion-10:~$ sql enwiki
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 11
Ladsgroup
added a comment.
2023-06-05 17:49:33 (UTC+0)
DB-wise things are good (see the screenshot attached above).
I think something is broken on the network side of things. Please file a separate ticket.
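Note: a quick way to separate DB-side from network-side failures like the one above is to test plain TCP reachability of the replica endpoint from the client host. The hostname below is the wiki-replicas analytics alias; adjust as needed:
# From a Toolforge bastion: can we open a TCP connection to the replica at all?
nc -vz enwiki.analytics.db.svc.wikimedia.cloud 3306
# If TCP connects but the MySQL handshake still fails (ERROR 2013 above),
# suspect the proxy/network layer rather than the databases themselves.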
Quiddity
moved this task from In current Tech/News draft to Already announced/Archive on the User-notice board.
2023-06-07 17:42:34 (UTC+0)
Clement_Goubert
mentioned this in
T339243: ServiceLVS without monitor breaks spicerack
2023-06-15 14:13:02 (UTC+0)
Jelto
subscribed.
2023-06-15 15:33:17 (UTC+0)
FYI: I started an incident doc at
because it was requested to have this incident in the next incident review ritual on Tuesday. I'll add some more information tomorrow and Monday, but feel free to add anything I missed.
Maintenance_bot
removed a project: Patch-For-Review.
2023-06-15 16:10:52 (UTC+0)
Marostegui
added a subscriber:
KOfori
2023-06-15 16:19:12 (UTC+0)
You might want to sync up with @KOfori because he's also started an IR. I have captured a much more detailed timeline, so maybe we need to merge both.
ClydeFranklin
unsubscribed.
2023-06-19 19:19:56 (UTC+0)
Maintenance_bot
edited projects: added User-notice-archive; removed User-notice.
2023-06-29 19:31:14 (UTC+0)
nskaggs
mentioned this in
T337848: WMCS-roots wiki replica access
2023-08-16 15:35:25 (UTC+0)
Marostegui
mentioned this in
T344608: WMCS-roots paging responsibilities
2023-08-21 15:16:04 (UTC+0)
Lemonaka
unsubscribed.
2023-09-06 22:27:37 (UTC+0)
BBlack
added a comment.
2023-09-15 19:38:03 (UTC+0)
There's a follow-up commit that was never merged, to re-enable pybal health monitoring on all the wikireplicas:
Is it safe to assume we're back in a sane state and can turn this back on?
SammiBrie
unsubscribed.
2023-09-15 19:52:43 (UTC+0)
Marostegui
added a comment.
2023-09-15 20:06:55 (UTC+0)
In T337446#9170734, @BBlack wrote:
There's a follow-up commit that was never merged, to re-enable pybal health monitoring on all the wikireplicas:
Is it safe to assume we're back in a sane state and can turn this back on?
Let's go for it Brandon!
gerritbot
added a comment.
2023-09-18 14:00:35 (UTC+0)
Change 924508 merged by BBlack:
[operations/puppet@production] wikireplicas: restore pybal monitoring
Stashbot
added a comment.
2023-09-18 14:04:18 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2023-09-18T14:04:17Z] lvs1020, lvs1018: restarting pybal to re-enable healthchecks for wikireplicas (T337446 ->
SWinxy
unsubscribed.
2023-10-03 23:25:28 (UTC+0)
taavi
closed subtask T337721: Wiki-replicas: investigate why some maintenance operations can cause unwanted pybal impact as Invalid.
2024-01-22 09:03:13 (UTC+0)