T373243: DNS on toolforge kubernetes seems to fail regularly (20-25% of the time at least)
Closed, Resolved
Public
BUG REPORT
Assigned To
dcaro
Authored By
ArthurPSmith
Aug 24 2024, 1:21 PM
2024-08-24 13:21:11 (UTC+0)
Tags
Toolforge (Toolforge iteration 14)
(Done)
Referenced Files
F57294342: image.png
Aug 26 2024, 7:53 AM
2024-08-26 07:53:15 (UTC+0)
F57294340: image.png
Aug 26 2024, 7:53 AM
2024-08-26 07:53:15 (UTC+0)
Subscribers
Aklapper
Albertoleoncio
Andrew
AntiCompositeNumber
ArthurPSmith
Chlod
Count_Count
View All 22 Subscribers
Description
Steps to replicate the issue (include links if applicable):
Go to:
If it works, repeat half a dozen times until it fails
Note - this is a PHP app running on Kubernetes - see /data/project/author-disambiguator etc.
What happens?
Fatal error: Uncaught mysqli_sql_exception: php_network_getaddresses: getaddrinfo for tools.db.svc.eqiad.wmflabs failed: Temporary failure in name resolution in /data/project/author-disambiguator/public_html/lib/database_tools.php:15
What should have happened instead?
You should have seen the default page for the application (after OAuth login)
Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):
Other information (browser name/version, screenshots, etc.):
Related Objects
Mentioned In
T356163: ChieBot: Intermittent connection reset by peer errors
T373233: Refill tool stuck "waiting for an available worker"
T373293: [builds-api] quota command failing on functional tests on tools
T373269: Tech Contribs does not support parentheses in user names
T373266: failure in name resolution and Uncaught Error in stalktoy on toolforge
Mentioned Here
T373816: Cloud VPS: investigate conntrack table usage on cloudvirt1050
Duplicates Merged Here
T373319: GUC displays a database error
T373293: [builds-api] quota command failing on functional tests on tools
T373266: failure in name resolution and Uncaught Error in stalktoy on toolforge
Event Timeline
There are a very large number of changes, so older changes are hidden.
Soda
subscribed.
Aug 24 2024, 4:07 PM
2024-08-24 16:07:09 (UTC+0)
CropTool has been having similar issues and is unable to connect to mediawiki.org and/or commons.wikimedia.org. See Commons_talk:CropTool#Unable_to_open_any_image_in_CropTool
Samwilson
subscribed.
Aug 24 2024, 10:54 PM
2024-08-24 22:54:56 (UTC+0)
Don-vip
subscribed.
Aug 25 2024, 5:46 PM
2024-08-25 17:46:18 (UTC+0)
Don-vip
added a comment.
Aug 25 2024, 5:50 PM
2024-08-25 17:50:26 (UTC+0)
Same for my tool (pod spacemedia-6fdcc8d798-8sncn). It started to fail at 2024-08-25T17:38:18.469Z with the error message "java.net.UnknownHostException: tools.db.svc.wikimedia.cloud".
I don't see any name resolution problems on the bastion or on my Cloud VPS instances.
Don-vip
awarded a token.
Aug 25 2024, 5:51 PM
2024-08-25 17:51:02 (UTC+0)
Yann
subscribed.
Aug 25 2024, 6:04 PM
2024-08-25 18:04:01 (UTC+0)
Failed on first try:
Fatal error: Uncaught mysqli_sql_exception: php_network_getaddresses: getaddrinfo for tools.db.svc.wikimedia.cloud failed: Temporary failure in name resolution in /data/project/author-disambiguator/public_html/lib/database_tools.php:15 Stack trace: #0 /data/project/author-disambiguator/public_html/lib/database_tools.php(15): mysqli->__construct() #1 /data/project/author-disambiguator/public_html/work_item_oauth.php(7): DatabaseTools->openToolDB() #2 {main} thrown in /data/project/author-disambiguator/public_html/lib/database_tools.php on line 15
AntiCompositeNumber
subscribed.
Edited
Aug 25 2024, 6:07 PM
2024-08-25 18:07:23 (UTC+0)
Getting this for AntiCompositeBot's nolicense task as well (Pod/anticompositebot.nolicense-cron-28743485-x7fqt on tools-k8s-worker-nfs-38):
2024-08-25 18:06:37 nolicense ERROR: (2003, "Can't connect to MySQL server on 'commonswiki.analytics.db.svc.wikimedia.cloud' ([Errno -3] Temporary failure in name resolution)")
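While lookups fail only some of the time, a tool can ride out a transient "Temporary failure in name resolution" by retrying the lookup before giving up. The following is a minimal sketch, not something any tool in this task is known to use: `resolve_with_retry`, its parameters, and the default port 3306 are assumptions chosen for illustration. The `resolver` and `sleep` arguments exist only so the function can be exercised without real DNS.

```python
import socket
import time


def resolve_with_retry(host, port=3306, attempts=4, base_delay=0.5,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host, retrying on socket.gaierror with exponential backoff.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return resolver(host, port)
        except socket.gaierror as error:
            last_error = error
            # Wait 0.5s, 1s, 2s, ... between attempts, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

A retry only hides the symptom; it does not replace fixing the resolver path, and it adds latency to every request that hits a bad lookup.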
Don-vip
mentioned this in
T373266: failure in name resolution and Uncaught Error in stalktoy on toolforge
Aug 25 2024, 6:14 PM
2024-08-25 18:14:51 (UTC+0)
JJMC89
merged a task:
T373266: failure in name resolution and Uncaught Error in stalktoy on toolforge
Aug 25 2024, 6:25 PM
2024-08-25 18:25:17 (UTC+0)
JJMC89
added a subscriber:
Jeff_G
Count_Count
subscribed.
Aug 25 2024, 6:28 PM
2024-08-25 18:28:08 (UTC+0)
Stuartyeates
subscribed.
Aug 25 2024, 7:19 PM
2024-08-25 19:19:03 (UTC+0)
mdaniels5757
subscribed.
Aug 25 2024, 7:52 PM
2024-08-25 19:52:14 (UTC+0)
I think this is related:
ERROR: TjfCliError: The jobs service seems to be down – please retry in a few minutes.
ERROR: Please report this issue to the Toolforge admins if it persists: https://w.wiki/6Zuu
Krinkle
subscribed.
Aug 25 2024, 8:04 PM
2024-08-25 20:04:42 (UTC+0)
tools.krinklebot is facing "Could not resolve host: commons.wikimedia.org" for production hostnames as well. This runs as a scheduled Toolforge job:
[2024-08-24T15:40:46+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/de]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-24T15:41:17+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/en]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-24T20:31:19+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/de]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:10:55+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/de]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:11:27+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/en]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:11:58+00:00] ERROR: Skipping [[Project:Auto-protected files/wikinews/en]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:12:29+00:00] ERROR: Skipping [[Project:Auto-protected files/wiktionary/en]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:13:00+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/fa]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
[2024-08-25T19:13:42+00:00] ERROR: Skipping [[Project:Auto-protected files/wikipedia/fr]] due to RuntimeException: Could not resolve host: commons.wikimedia.org in /data/project/krinklebot/src/fileprotectionsync/src/FileProtectionSyncBot.php:282
Stuartyeates
added a comment.
Aug 25 2024, 8:42 PM
2024-08-25 20:42:54 (UTC+0)
Just got a different message from
?... . This may be a result of a DNS failure not being caught?
Warning: Undefined variable $http_response_header in /data/project/author-disambiguator/public_html/lib/borrowed_utilities.php on line 41
mdaniels5757
triaged this task as
Unbreak Now!
priority.
Aug 25 2024, 9:02 PM
2024-08-25 21:02:44 (UTC+0)
Daimona
awarded a token.
Aug 25 2024, 9:41 PM
2024-08-25 21:41:16 (UTC+0)
Daimona
subscribed.
Chlod
subscribed.
Aug 25 2024, 11:03 PM
2024-08-25 23:03:10 (UTC+0)
Noting here that I'm unable to use Build Service, probably due to the same issue. Related log line:
[step-clone] 2024-08-25T22:59:56.754700588Z {"level":"error","ts":1724626796.754072,"caller":"git/git.go:55","msg":"Error running git [fetch --recurse-submodules=yes --depth=1 origin --update-head-ok --force ]: exit status 128\nfatal: unable to access 'https://gitlab.wikimedia.org/toolforge-repos/techcontribs/': Could not resolve host: gitlab.wikimedia.org\n","stacktrace":"github.com/tektoncd/pipeline/pkg/git.run\n\tgithub.com/tektoncd/pipeline/pkg/git/git.go:55\ngithub.com/tektoncd/pipeline/pkg/git.Fetch\n\tgithub.com/tektoncd/pipeline/pkg/git/git.go:150\nmain.main\n\tgithub.com/tektoncd/pipeline/cmd/git-init/main.go:53\nruntime.main\n\truntime/proc.go:255"}
Chlod
mentioned this in
T373269: Tech Contribs does not support parentheses in user names
Aug 25 2024, 11:05 PM
2024-08-25 23:05:27 (UTC+0)
Novem_Linguae
subscribed.
Aug 25 2024, 11:50 PM
2024-08-25 23:50:08 (UTC+0)
Andrew
subscribed.
Aug 26 2024, 12:32 AM
2024-08-26 00:32:30 (UTC+0)
Are people still seeing this issue? I'm unable to reproduce the specific failure mentioned in the task description.
AntiCompositeNumber
added a comment.
Aug 26 2024, 12:35 AM
2024-08-26 00:35:18 (UTC+0)
The last one I got was 2024-08-25 22:07:47Z. But it's been intermittent the whole time.
Andrew
added a comment.
Aug 26 2024, 12:52 AM
2024-08-26 00:52:55 (UTC+0)
by 'intermittent' do you mean that it's always failing a little bit, or that every few hours it fails a lot, for a few minutes?
Stuartyeates
added a comment.
Aug 26 2024, 12:55 AM
2024-08-26 00:55:29 (UTC+0)
I'm seeing failures of URLs like
"Internal Server Error / The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application."
Don-vip
added a comment.
Aug 26 2024, 5:58 AM
2024-08-26 05:58:24 (UTC+0)
For me the errors are gone (toolforge job service works, I was able to build and deploy my tool. No more DNS errors, everything looks fine).
dcaro
mentioned this in
T373293: [builds-api] quota command failing on functional tests on tools
Aug 26 2024, 7:31 AM
2024-08-26 07:31:47 (UTC+0)
dcaro
merged a task:
T373293: [builds-api] quota command failing on functional tests on tools
Aug 26 2024, 7:41 AM
2024-08-26 07:41:23 (UTC+0)
dcaro
subscribed.
dcaro
added a comment.
Aug 26 2024, 7:53 AM
2024-08-26 07:53:15 (UTC+0)
CoreDNS does not seem to have spikes in CPU or memory usage (graphs attached as F57294342 and F57294340).
Looking
dcaro
added a comment.
Aug 26 2024, 8:06 AM
2024-08-26 08:06:00 (UTC+0)
Hmm... from a webservice shell, we sometimes get a non-authoritative answer:
I have no name!@shell-1724659470:~$ nslookup tools-harbor.wmcloud.org
Server: 10.96.0.10

Name: tools-harbor.wmcloud.org

I have no name!@shell-1724659470:~$ nslookup tools-harbor.wmcloud.org
Server: 10.96.0.10

Non-authoritative answer:
Name: tools-harbor.wmcloud.org
dcaro
added a comment.
Aug 26 2024, 8:33 AM
2024-08-26 08:33:07 (UTC+0)
Just manually scaled up the number of replicas for the coredns deployment from 2 to 4, and things seem to be improving, is anyone still seeing issues?
dcaro
added a comment.
Aug 26 2024, 10:30 AM
2024-08-26 10:30:45 (UTC+0)
Yep, still having issues, looking
RhinosF1
merged a task:
T373319: GUC displays a database error
Aug 26 2024, 11:24 AM
2024-08-26 11:24:40 (UTC+0)
RhinosF1
added subscribers:
Melos
RhinosF1
dcaro
added a comment.
Aug 26 2024, 11:46 AM
2024-08-26 11:46:20 (UTC+0)
Querying from a webservice shell fails pretty frequently, even for internal names (and without domain searching, i.e. with a trailing dot):
I have no name!@shell-1724670591:~$ time nslookup api.svc.tools.eqiad1.wikimedia.cloud.
Server: 10.96.0.10

api.svc.tools.eqiad1.wikimedia.cloud canonical name = k8s.svc.tools.eqiad1.wikimedia.cloud.
Name: k8s.svc.tools.eqiad1.wikimedia.cloud

real 0m0.041s
user 0m0.013s
sys 0m0.017s
########################################################################
I have no name!@shell-1724670591:~$ time nslookup api.svc.tools.eqiad1.wikimedia.cloud.
Server: 10.96.0.10

api.svc.tools.eqiad1.wikimedia.cloud canonical name = k8s.svc.tools.eqiad1.wikimedia.cloud.
Name: k8s.svc.tools.eqiad1.wikimedia.cloud
;; communications error to 10.96.0.10#53: timed out

real 0m5.050s
user 0m0.018s
sys 0m0.014s
It's running on worker-104
tools.wm-lol@tools-bastion-13:~$ kubectl get pods shell-1724670591 -o yaml | grep worker
nodeName: tools-k8s-worker-104
From the coredns pod it's way more reliable:
root@tools-k8s-control-7:~# time nsenter -n -t 1775910 nslookup api.svc.tools.eqiad1.wikimedia.cloud. 10.96.0.10
Server: 10.96.0.10

api.svc.tools.eqiad1.wikimedia.cloud canonical name = k8s.svc.tools.eqiad1.wikimedia.cloud.
Name: k8s.svc.tools.eqiad1.wikimedia.cloud

real 0m0.049s
user 0m0.010s
sys 0m0.030s
Trying with nsenter from a few other containers/workers
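To put a number on "fails pretty frequently", the manual `time nslookup` runs above can be approximated by a small probe that repeats a lookup and reports the failure ratio and the slowest attempt. This is a sketch under assumptions: `probe_dns` is a hypothetical helper, and it goes through the system resolver via `socket.getaddrinfo` rather than calling `nslookup`, so a lookup that times out once and then succeeds would likely show up as a slow attempt rather than a failure.

```python
import socket
import time


def probe_dns(host, tries=20, resolver=socket.getaddrinfo, clock=time.monotonic):
    """Resolve host repeatedly; return (failure_ratio, slowest_seconds).

    Hypothetical helper; resolver and clock are injectable for testing.
    """
    failures = 0
    slowest = 0.0
    for _ in range(tries):
        start = clock()
        try:
            resolver(host, None)
        except socket.gaierror:
            failures += 1
        # Track the slowest attempt, whether it succeeded or failed.
        slowest = max(slowest, clock() - start)
    return failures / tries, slowest
```

Calling `probe_dns("api.svc.tools.eqiad1.wikimedia.cloud.")` from inside a pod would give a rough per-pod figure to compare across workers.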
dcaro
added a comment.
Aug 26 2024, 11:48 AM
2024-08-26 11:48:28 (UTC+0)
I can reproduce with nsenter on the worker:
root@tools-k8s-worker-104:~# time nsenter -t 578510 -n nslookup api.svc.tools.eqiad1.wikimedia.cloud. 10.96.0.10
;; communications error to 10.96.0.10#53: timed out
Server: 10.96.0.10

api.svc.tools.eqiad1.wikimedia.cloud canonical name = k8s.svc.tools.eqiad1.wikimedia.cloud.
Name: k8s.svc.tools.eqiad1.wikimedia.cloud
;; communications error to 10.96.0.10#53: timed out

real 0m2.043s
user 0m0.021s
sys 0m0.020s
MBH
subscribed.
Aug 26 2024, 12:25 PM
2024-08-26 12:25:49 (UTC+0)
When I'm trying to build an image from my github repo, I got this strange issue:
unable to access 'https://github.com/Saisengen/wikibots/': Could not resolve host: github.com\n"
Could it be related to this issue?
Stashbot
added a comment.
Aug 26 2024, 12:42 PM
2024-08-26 12:42:56 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-26T12:42:55Z] START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-104 (T373243)
Stashbot
added a comment.
Aug 26 2024, 12:44 PM
2024-08-26 12:44:12 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-26T12:44:11Z] END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-104 (T373243)
Stashbot
added a comment.
Aug 26 2024, 12:53 PM
2024-08-26 12:53:16 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-26T12:53:14Z] START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-104 (T373243)
Stashbot
added a comment.
Aug 26 2024, 12:53 PM
2024-08-26 12:53:20 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-26T12:53:19Z] END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-104 (T373243)
Stashbot
added a comment.
Aug 26 2024, 1:05 PM
2024-08-26 13:05:08 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-26T13:05:06Z] START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-4, tools-k8s-worker-nfs-15, tools-k8s-worker-nfs-18, tools-k8s-worker-nfs-25, tools-k8s-worker-nfs-51, tools-k8s-worker-nfs-52, tools-k8s-worker-104 (T373243)
Stashbot
added a comment.
Aug 26 2024, 1:12 PM
2024-08-26 13:12:43 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-26T13:12:41Z] END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-4, tools-k8s-worker-nfs-15, tools-k8s-worker-nfs-18, tools-k8s-worker-nfs-25, tools-k8s-worker-nfs-51, tools-k8s-worker-nfs-52, tools-k8s-worker-104 (T373243)
dcaro
added a comment.
Aug 26 2024, 1:14 PM
2024-08-26 13:14:03 (UTC+0)
So going around with cumin, we found some workers that fail often:
tools-k8s-worker-{nfs-{4,15,18,25,51,52},104}
# running this many times to get all the failures
root@cloudcumin1001:~# cumin --force 'O{project:tools name:.*worker.*}' 'nsenter -n -t $(pgrep calico| head -n1) dig +tries=1 tools-harbor.wmcloud.org @10.96.0.10'
The rest of the workers do not seem to fail. The failing ones are restarting right now, though a reboot did not help with worker-104 :/, so we might have to find something else
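Because the failures are intermittent, a single sweep can miss a bad node, which is why the command is run many times. One way to combine the sweeps is to count failures per node and keep the nodes that fail repeatedly. The sketch below assumes the per-node results have already been collected into dictionaries; `flaky_nodes`, the `min_failures` threshold, and the input shape are hypothetical and are not part of cumin.

```python
from collections import Counter


def flaky_nodes(runs, min_failures=2):
    """Return nodes that failed in at least min_failures sweeps, sorted by name.

    runs is a list of dicts mapping node name -> True (lookup succeeded)
    or False (lookup failed or timed out), one dict per sweep.
    Hypothetical helper: the data shape is an assumption, not cumin output.
    """
    failures = Counter()
    for run in runs:
        for node, ok in run.items():
            if not ok:
                failures[node] += 1
    return sorted(node for node, count in failures.items() if count >= min_failures)
```

Requiring more than one failure keeps a single unlucky timeout from flagging an otherwise healthy worker.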
dcaro
added a comment.
Aug 26 2024, 1:17 PM
2024-08-26 13:17:08 (UTC+0)
The reboot did not help xd, and the VMs are spread across different cloudvirts:
root@cloudcontrol1007:~# for node in tools-k8s-worker-{nfs-{4,15,18,25,51,52},104}; do echo "$node -> $(OS_PROJECT_ID=tools openstack server show $node | grep hypervisor_hostname)"; done
tools-k8s-worker-nfs-4 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1048.eqiad.wmnet |
tools-k8s-worker-nfs-15 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1034.eqiad.wmnet |
tools-k8s-worker-nfs-18 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1060.eqiad.wmnet |
tools-k8s-worker-nfs-25 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1032.eqiad.wmnet |
tools-k8s-worker-nfs-51 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1057.eqiad.wmnet |
tools-k8s-worker-nfs-52 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1032.eqiad.wmnet |
tools-k8s-worker-104 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1054.eqiad.wmnet |
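To check quickly whether the failing workers share a hypervisor, the output above can be grouped by its last column. This is a small sketch: `nodes_by_hypervisor` is a hypothetical helper and the parsing assumes exactly the `node -> | field | value |` layout pasted above. On this data it shows that only nfs-25 and nfs-52 share a host (cloudvirt1032); the other five nodes are each on a different cloudvirt.

```python
from collections import defaultdict

# Output pasted from the loop above (node -> hypervisor_hostname row).
OUTPUT = """\
tools-k8s-worker-nfs-4 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1048.eqiad.wmnet |
tools-k8s-worker-nfs-15 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1034.eqiad.wmnet |
tools-k8s-worker-nfs-18 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1060.eqiad.wmnet |
tools-k8s-worker-nfs-25 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1032.eqiad.wmnet |
tools-k8s-worker-nfs-51 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1057.eqiad.wmnet |
tools-k8s-worker-nfs-52 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1032.eqiad.wmnet |
tools-k8s-worker-104 -> | OS-EXT-SRV-ATTR:hypervisor_hostname | cloudvirt1054.eqiad.wmnet |
"""


def nodes_by_hypervisor(text):
    """Group node names by the hypervisor named in the last table cell."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if "->" not in line:
            continue
        node, row = line.split("->", 1)
        cells = [cell.strip() for cell in row.split("|") if cell.strip()]
        groups[cells[-1]].append(node.strip())
    return dict(groups)


# Keep only hypervisors that host more than one of the failing workers.
shared = {hv: nodes for hv, nodes in nodes_by_hypervisor(OUTPUT).items() if len(nodes) > 1}
print(shared)
```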
Stashbot
added a comment.
Aug 26 2024, 2:03 PM
2024-08-26 14:03:26 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-26T14:03:24Z] START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-4 (T373243)
dcaro
added a comment.
Aug 26 2024, 2:29 PM
2024-08-26 14:29:09 (UTC+0)
I have cordoned all the misbehaving workers, users should stop seeing issues right now, will try to debug in more detail and add new nodes if I can't find anything
ArthurPSmith
added a comment.
Aug 26 2024, 2:39 PM
2024-08-26 14:39:37 (UTC+0)
Just to confirm I've done a few dozen actions that would have triggered this problem a few days ago, and everything is working. Thanks!
dcaro
added a comment.
Aug 26 2024, 4:05 PM
2024-08-26 16:05:51 (UTC+0)
New nodes seem to not have the issue, so will continue adding new ones (added worker-nfs-57)
dcaro
lowered the priority of this task from
Unbreak Now!
to
Medium
Aug 27 2024, 7:01 AM
2024-08-27 07:01:51 (UTC+0)
Currently cleaning up the old nodes, but everything seems stable
dcaro
added a comment.
Aug 27 2024, 7:02 AM
2024-08-27 07:02:46 (UTC+0)
In T373243#10091656, @MBH wrote:
When I'm trying to build an image from my github repo, I got this strange issue:
unable to access 'https://github.com/Saisengen/wikibots/': Could not resolve host: github.com\n"
Could it be related to this issue?
Yes, that was caused by this issue, it should be gone now (if not please report otherwise)
Stashbot
added a comment.
Aug 27 2024, 8:24 AM
2024-08-27 08:24:40 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:24:38Z] START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-4 (T373243)
Stashbot
added a comment.
Aug 27 2024, 8:26 AM
2024-08-27 08:26:29 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:26:28Z] END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-4 (T373243)
Stashbot
added a comment.
Aug 27 2024, 8:26 AM
2024-08-27 08:26:55 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:26:55Z] START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-15 (T373243)
Stashbot
added a comment.
Aug 27 2024, 8:29 AM
2024-08-27 08:29:14 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:29:14Z] END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-15 (T373243)
Stashbot
added a comment.
Aug 27 2024, 8:29 AM
2024-08-27 08:29:23 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:29:23Z] START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-18 (T373243)
Stashbot
added a comment.
Aug 27 2024, 8:31 AM
2024-08-27 08:31:13 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:31:12Z] END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-18 (T373243)
Stashbot
added a comment.
Aug 27 2024, 8:31 AM
2024-08-27 08:31:22 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:31:21Z] START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-25 (T373243)
Stashbot
added a comment.
Aug 27 2024, 8:33 AM
2024-08-27 08:33:07 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:33:06Z] END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-25 (T373243)
Stashbot
added a comment.
Aug 27 2024, 8:34 AM
2024-08-27 08:34:07 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:34:07Z] START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-51 (T373243)
Stashbot
added a comment.
Aug 27 2024, 8:35 AM
2024-08-27 08:35:52 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:35:51Z] END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-51 (T373243)
Stashbot
added a comment.
Aug 27 2024, 8:37 AM
2024-08-27 08:37:10 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:37:08Z] START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-nfs-52 (T373243)
Stashbot
added a comment.
Aug 27 2024, 8:38 AM
2024-08-27 08:38:59 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:38:58Z] END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-nfs-52 (T373243)
Stashbot
added a comment.
Aug 27 2024, 8:53 AM
2024-08-27 08:53:38 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:53:37Z] START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-104 (T373243)
Stashbot
added a comment.
Aug 27 2024, 8:55 AM
2024-08-27 08:55:29 (UTC+0)
Mentioned in SAL (#wikimedia-cloud-feed)
[2024-08-27T08:55:28Z] END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-104 (T373243)
MBH
added a comment.
Aug 27 2024, 8:56 AM
2024-08-27 08:56:14 (UTC+0)
Yes, problem is fixed, thanks.
dcaro
closed this task as
Resolved
Aug 27 2024, 9:52 AM
2024-08-27 09:52:44 (UTC+0)
dcaro
claimed this task.
I'll close this as it's been stable for a while and all the misbehaving nodes have been deleted :)
dcaro
moved this task from
Backlog
to
Ready to be worked on
on the
Toolforge
board.
Aug 27 2024, 9:53 AM
2024-08-27 09:53:04 (UTC+0)
dcaro
edited projects, added
Toolforge (Toolforge iteration 14)
; removed
Toolforge
Stuartyeates
added a comment.
Aug 27 2024, 9:53 AM
2024-08-27 09:53:06 (UTC+0)
The issues I was seeing previously appear to have all resolved themselves, thank you.
dcaro
moved this task from
Next Up
to
Done
on the
Toolforge (Toolforge iteration 14)
board.
Aug 27 2024, 9:53 AM
2024-08-27 09:53:21 (UTC+0)
Novem_Linguae
mentioned this in
T373233: Refill tool stuck "waiting for an available worker"
Aug 27 2024, 1:09 PM
2024-08-27 13:09:05 (UTC+0)
dcaro
mentioned this in
T356163: ChieBot: Intermittent connection reset by peer errors
Aug 28 2024, 8:47 AM
2024-08-28 08:47:51 (UTC+0)
MBH
added a comment.
Aug 28 2024, 10:49 AM
2024-08-28 10:49:57 (UTC+0)
@dcaro My tool reads data from a DB replica. Less than an hour ago the tool was working correctly, but now it returns this error (on 100% of attempts):
Unable to connect to any of the specified MySQL hosts. ---> System.ArgumentException: The host name or IP address is invalid.
The host name is ruwiki
dcaro
added a comment.
Aug 28 2024, 10:52 AM
2024-08-28 10:52:51 (UTC+0)
In T373243#10099254, @MBH wrote:
@dcaro My tool reads data from a DB replica. Less than an hour ago the tool was working correctly, but now it returns this error (on 100% of attempts):
Unable to connect to any of the specified MySQL hosts. ---> System.ArgumentException: The host name or IP address is invalid.
The host name is ruwiki
Which tool is it?
Do you have the snippet of code that does the call?
dcaro
added a comment.
Aug 28 2024, 10:56 AM
2024-08-28 10:56:10 (UTC+0)
All the workers seem to be responding ok (might be flaky, but no errors so far):
root@cloudcumin1001:~# cumin --force 'O{project:tools name:.*worker.*}' 'nsenter -n -t $(pgrep calico| head -n1) dig +tries=1 +short ruwiki.analytics.db.svc.wikimedia.cloud @10.96.0.10'
63 hosts will be targeted:
tools-k8s-worker-[102-103,105-108].tools.eqiad1.wikimedia.cloud,tools-k8s-worker-nfs-[1-3,5-14,16-17,19-24,26-50,53-58,60-64].tools.eqiad1.wikimedia.cloud
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====
(63) tools-k8s-worker-[102-103,105-108].tools.eqiad1.wikimedia.cloud,tools-k8s-worker-nfs-[1-3,5-14,16-17,19-24,26-50,53-58,60-64].tools.eqiad1.wikimedia.cloud
----- OUTPUT of 'nsenter -n -t $(...loud @10.96.0.10' -----
s6.analytics.db.svc.wikimedia.cloud.
172.20.255.7
================
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (63/63) [00:05<00:00, 12.16hosts/s]
FAIL | | 0% (0/63) [00:05
100.0% (63/63) success ratio (>= 100.0% threshold) for command: 'nsenter -n -t $(...loud @10.96.0.10'.
100.0% (63/63) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
MBH
added a comment.
Aug 28 2024, 10:56 AM
2024-08-28 10:56:13 (UTC+0)
It's a web tool.
Request:
Log path: /mnt/nfs/labstore-secondary-tools-project/mbh/error.log
Code:
, line 59. The error is raised on line 60.
This tool has worked for the last 3 days (apart from errors unrelated to this issue) with these same lines 59-60.
dcaro
added a comment.
Aug 28 2024, 11:09 AM
2024-08-28 11:09:20 (UTC+0)
@MBH
I'm suspecting this change:
the
wiki
parameter in the url you passed is in position
, not
(you can use the
wiki
string as index instead, less error-prone).
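The suggestion to look up the parameter by name rather than by position can be illustrated as follows. This is only a sketch in Python for illustration; the affected tool's actual code is not shown in this task, and the URL, parameter values, and the `get_param` helper are hypothetical. Only the parameter names `wiki` and `type` come from the discussion.

```python
from urllib.parse import urlsplit, parse_qs


def get_param(url, name):
    """Return the first value of query parameter `name`, or None if absent."""
    values = parse_qs(urlsplit(url).query).get(name)
    return values[0] if values else None


# Hypothetical URLs: same parameters, different order.
wiki_first = "https://example.invalid/tool?wiki=ruwiki&type=edits"
type_first = "https://example.invalid/tool?type=edits&wiki=ruwiki"

# A positional lookup silently picks up the wrong parameter when the order changes.
first_by_position = type_first.split("?", 1)[1].split("&")[0]

# A lookup by name does not depend on the order.
wiki_by_name = get_param(type_first, "wiki")
```

With the positional approach, a reordered URL yields a value that is not a wiki name at all, which would then be used to build an invalid database host name.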
dcaro
added a comment.
Aug 28 2024, 11:10 AM
2024-08-28 11:10:44 (UTC+0)
Ex. this works for me (putting type first):
MBH
added a comment.
Aug 28 2024, 11:28 AM
2024-08-28 11:28:19 (UTC+0)
Thanks. I already used string indexation in other tools, but not this tool, because it's very old code.
Krinkle
unsubscribed.
Aug 28 2024, 9:33 PM
2024-08-28 21:33:47 (UTC+0)
fnegri
subscribed.
Sep 2 2024, 4:03 PM
2024-09-02 16:03:53 (UTC+0)
This could be related to T373816: Cloud VPS: investigate conntrack table usage on cloudvirt1050 (to be verified).
Content licensed under Creative Commons Attribution-ShareAlike (CC BY-SA) 4.0 unless otherwise noted; code licensed under GNU General Public License (GPL) 2.0 or later and other open source licenses. By using this site, you agree to the Terms of Use, Privacy Policy, and Code of Conduct.