⚓ T422860 Migrate Cloudelastic to OpenSearch 2.x
Open, Needs Triage
Public
Assigned To
bking
Authored By
bking
Thu, Apr 9, 4:35 PM
2026-04-09 16:35:16 (UTC+0)
Tags
Discovery-Search (2026.04.06 - 2026.05.01)
(Incoming)
Patch-For-Review
Data-Platform-SRE (2026-04-24 - 2026-05-15)
(In Progress)
Referenced Files
None
Subscribers
Aklapper
bking
EBernhardson
MoritzMuehlenhoff
RKemper
Description
Our usual test cluster (relforge) is occupied by a Semantic Search experiment (ref T413969), so we will start our OpenSearch 2.x migration on the cloudelastic cluster. Creating this ticket to:
Update the cluster from OpenSearch 1.x->2.x
Document any lessons learned
Details
Other Assignee
RKemper
Related Changes in Gerrit:
| Subject | Repo | Branch | Lines +/- |
| --- | --- | --- | --- |
| cloudelastic: set role-level hiera for OpenSearch 2/Trixie | operations/puppet | production | +6 -6 |
| cloudelastic: prepare cloudelastic1011 for Trixie/OpenSearch 2 | operations/puppet | production | +56 -1 |
| prometheus: fix wmf-elasticsearch-exporter listen address on Trixie | operations/puppet | production | +2 -1 |
| cloudelastic1012: move back to production role | operations/puppet | production | +1 -5 |
| cloudelastic1012: move back to insetup | operations/puppet | production | +5 -1 |
| Cirrussearch: remove unused hiera files | operations/puppet | production | +0 -297 |
| cloudelastic1012: Set LVS config for opensearch_2 | operations/puppet | production | +13 -0 |
| cloudelastic1012: full common_settings override for OS2 | operations/puppet | production | +11 -4 |
| cloudelastic1012: full common_settings override for OS2 | operations/puppet | production | +17 -2 |
| cloudelastic1012: full common_settings override for OS2 | operations/puppet | production | +17 -2 |
| cloudelastic1012: remove the deliberately-introduced typo | operations/puppet | production | +1 -1 |
| opensearch: move var up so we can use it earlier | operations/puppet | production | +4 -2 |
| OpenSearch: Control which plugins we use via systemd PrivateMounts | operations/puppet | production | +29 -0 |
| opensearch: strip bundled plugins before WMF pkg | operations/puppet | production | +25 -0 |
| opensearch: allowlist upstream plugins + overwrite | operations/puppet | production | +54 -13 |
| cloudelastic: temporarily add "working typos" for plugins | operations/puppet | production | +17 -0 |
| cloudelastic: fix java path typo | operations/puppet | production | +1 -1 |
| opensearch: correct o11y usage in comment | operations/puppet | production | +1 -1 |
| opensearch: hack around upstream 2.x+ packages | operations/puppet | production | +20 -0 |
| nginx tls proxy: remove defunct directive | operations/puppet | production | +0 -1 |
| cloudelastic: remove logstash profile | operations/puppet | production | +0 -1 |
| opensearch: move cloudelastic1012 back into prod role | operations/puppet | production | +1 -5 |
| cirrussearch: move cloudelastic1012 to insetup | operations/puppet | production | +6 -2 |
| cloudelastic: Prepare for opensearch 2 | operations/puppet | production | +32 -4 |
Related Objects
Task Graph
| Status | Assigned | Task |
| --- | --- | --- |
| Open | None | T421757 ☂️ Migrate production OpenSearch clusters from 1.x-2.x ☂️ |
| Open | bking | T422860 Migrate Cloudelastic to OpenSearch 2.x |
| Resolved | bking | T423291 Build new wmf-opensearch-search-plugins package for opensearch 2.x/trixie and ensure we don't install/enable any unwanted plugins in prod |
| Resolved | bking | T423327 Explore options for OpenSearch 2.x/3.x plugin packaging and distribution |
| Resolved | bking | T423523 Handle typos/possibly update opensearch-analysis-stconvert plugin |
Mentioned In
T421757: ☂️ Migrate production OpenSearch clusters from 1.x-2.x ☂️
P90344 Reprepro error T422860
T421763: Migrate beta cluster to OpenSearch 2.x
Mentioned Here
T390592: Build updated opensearch-madvise .deb and update puppet with new cli argument
T368950: Consider migrating our Elastic TLS termination from nginx to envoy
T324335: Remove logstash from the CirrusSearch servers
T413969: Make semantic search accessible through Action API
Event Timeline
bking
mentioned this in
T421763: Migrate beta cluster to OpenSearch 2.x
Thu, Apr 9, 4:41 PM
2026-04-09 16:41:02 (UTC+0)
bking
updated the task description.
bking
moved this task from
Backlog - project
to
In Progress
on the
Data-Platform-SRE (2026-03-27 - 2026-04-17)
board.
gerritbot
added a comment.
Thu, Apr 9, 4:45 PM
2026-04-09 16:45:14 (UTC+0)
Change #1269531 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cloudelastic: Prepare for opensearch 2
gerritbot
added a project:
Patch-For-Review
Thu, Apr 9, 4:45 PM
2026-04-09 16:45:15 (UTC+0)
Stashbot
added a comment.
Thu, Apr 9, 7:16 PM
2026-04-09 19:16:49 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-09T19:16:48Z]
T422860
Stashbot
added a comment.
Thu, Apr 9, 8:45 PM
2026-04-09 20:45:24 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-09T20:45:23Z]
T422860
gerritbot
added a comment.
Fri, Apr 10, 2:08 PM
2026-04-10 14:08:20 (UTC+0)
Change #1269531
merged
by Bking:
[operations/puppet@production] cloudelastic: Prepare for opensearch 2
Maintenance_bot
removed a project:
Patch-For-Review
Fri, Apr 10, 2:31 PM
2026-04-10 14:31:38 (UTC+0)
ops-monitoring-bot
added a comment.
Fri, Apr 10, 3:16 PM
2026-04-10 15:16:19 (UTC+0)
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1012.eqiad.wmnet with OS trixie
ops-monitoring-bot
added a comment.
Fri, Apr 10, 3:41 PM
2026-04-10 15:41:11 (UTC+0)
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1012.eqiad.wmnet with OS trixie executed with errors:
cloudelastic1012 (FAIL)
Downtimed on Icinga/Alertmanager
Disabled Puppet
Removed from Puppet and PuppetDB if present and deleted any certificates
Removed from Debmonitor if present
Forced UEFI HTTP Boot for next reboot
Host rebooted via Redfish
Host up (Debian installer)
Host up (new fresh trixie OS)
Generated Puppet certificate
Signed new Puppet certificate
Run Puppet in NOOP mode to populate exported resources in PuppetDB
Found Nagios_host resource for this host in PuppetDB
Downtimed the new host on Icinga/Alertmanager
Removed previous downtime on Alertmanager (old OS)
First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202604101534_bking_2776092_cloudelastic1012.out, asking the operator what to do
First Puppet run failed and the operator skipped it
configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cloudelastic1012.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
gerritbot
added a comment.
Fri, Apr 10, 4:52 PM
2026-04-10 16:52:20 (UTC+0)
Change #1270061 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cirrussearch: move cloudelastic1012 to insetup
gerritbot
added a project:
Patch-For-Review
Fri, Apr 10, 4:52 PM
2026-04-10 16:52:21 (UTC+0)
gerritbot
added a comment.
Fri, Apr 10, 4:54 PM
2026-04-10 16:54:47 (UTC+0)
Change #1270061
merged
by Bking:
[operations/puppet@production] cirrussearch: move cloudelastic1012 to insetup
ops-monitoring-bot
added a comment.
Fri, Apr 10, 4:57 PM
2026-04-10 16:57:00 (UTC+0)
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1012.eqiad.wmnet with OS trixie
ops-monitoring-bot
added a comment.
Fri, Apr 10, 5:27 PM
2026-04-10 17:27:26 (UTC+0)
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1012.eqiad.wmnet with OS trixie completed:
cloudelastic1012 (WARN)
Downtimed on Icinga/Alertmanager
Unable to disable Puppet, the host may have been unreachable
Removed from Puppet and PuppetDB if present and deleted any certificates
Removed from Debmonitor if present
Forced UEFI HTTP Boot for next reboot
Host rebooted via Redfish
Host up (Debian installer)
Host up (new fresh trixie OS)
Generated Puppet certificate
Signed new Puppet certificate
Run Puppet in NOOP mode to populate exported resources in PuppetDB
Found Nagios_host resource for this host in PuppetDB
Downtimed the new host on Icinga/Alertmanager
Removed previous downtime on Alertmanager (old OS)
First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604101712_bking_2990479_cloudelastic1012.out
configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
Rebooted
Automatic Puppet run was successful
Forced a re-check of all Icinga services for the host
Icinga status is optimal
Icinga downtime removed
Updated Netbox data from PuppetDB
Maintenance_bot
removed a project:
Patch-For-Review
Fri, Apr 10, 5:30 PM
2026-04-10 17:30:57 (UTC+0)
gerritbot
added a comment.
Fri, Apr 10, 6:39 PM
2026-04-10 18:39:13 (UTC+0)
Change #1270071 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] opensearch: move cloudelastic1012 back into prod role
gerritbot
added a project:
Patch-For-Review
Fri, Apr 10, 6:39 PM
2026-04-10 18:39:15 (UTC+0)
Change #1270071
merged
by Bking:
[operations/puppet@production] opensearch: move cloudelastic1012 back into prod role
bking
added a comment.
Fri, Apr 10, 6:54 PM
2026-04-10 18:54:23 (UTC+0)
Puppet is failing on the logstash installation:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, java_package: openjdk-21-jdk not yet supported (file: /srv/puppet_code/environments/production/modules/logstash/manifests/init.pp, line: 77, column: 24) on node cloudelastic1012.eqiad.wmnet
Cirrussearch hosts run a local logstash to improve log formatting before piping to the Observability logstash infra. We tried (and failed) to remove the local logstash in
T324335
, deciding it was too much work for not enough payoff. In the 2 1/2 years since that ticket, we've migrated from Elastic->OpenSearch, so now is a good time to revisit.
Maintenance_bot
removed a project:
Patch-For-Review
Fri, Apr 10, 7:30 PM
2026-04-10 19:30:53 (UTC+0)
gerritbot
added a comment.
Fri, Apr 10, 8:30 PM
2026-04-10 20:30:46 (UTC+0)
Change #1270082 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cloudelastic: remove logstash profile
gerritbot
added a project:
Patch-For-Review
Fri, Apr 10, 8:30 PM
2026-04-10 20:30:47 (UTC+0)
gerritbot
added a comment.
Fri, Apr 10, 8:40 PM
2026-04-10 20:40:17 (UTC+0)
Change #1270082
merged
by Bking:
[operations/puppet@production] cloudelastic: remove logstash profile
bking
added a comment.
Fri, Apr 10, 9:08 PM
2026-04-10 21:08:30 (UTC+0)
We might also revisit
T368950
, as our current nginx configuration is too old for the version in Trixie:
unknown directive "ssl" in /etc/nginx/sites-enabled/cloudelastic-chi-eqiad:14
Ref
this stack exchange question
For now, I'll try and update our nginx config if possible.
gerritbot
added a comment.
Fri, Apr 10, 9:24 PM
2026-04-10 21:24:44 (UTC+0)
Change #1270084 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] nginx tls proxy: remove defunct directive
gerritbot
added a comment.
Mon, Apr 13, 1:31 PM
2026-04-13 13:31:33 (UTC+0)
Change #1270084
merged
by Bking:
[operations/puppet@production] nginx tls proxy: remove defunct directive
pfischer
edited projects, added
Discovery-Search (2026.04.06 - 2026.05.01)
; removed
Discovery-Search (2026.03.03 - 2026.04.03)
Mon, Apr 13, 1:58 PM
2026-04-13 13:58:13 (UTC+0)
bking
added a comment.
Mon, Apr 13, 2:05 PM
2026-04-13 14:05:44 (UTC+0)
We have a new puppet failure this time, related to the
opensearch-madvise
package not being available.
We'll need to republish this package to the Debian trixie
component/opensearch2
repo. Ref
T390592
for additional context.
Stashbot
added a comment.
Mon, Apr 13, 2:14 PM
2026-04-13 14:14:12 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-13T14:14:11Z]
T422860
Maintenance_bot
removed a project:
Patch-For-Review
Mon, Apr 13, 2:31 PM
2026-04-13 14:31:13 (UTC+0)
bking
added a comment.
Edited
Mon, Apr 13, 2:57 PM
2026-04-13 14:57:30 (UTC+0)
I'm having issues getting the package to be recognized by the
component/opensearch2
trixie repo, ref
this phab paste
bking
added a comment.
Mon, Apr 13, 3:48 PM
2026-04-13 15:48:19 (UTC+0)
I manually installed the deb on
cloudelastic1012
(the first Trixie server), now I'm getting the error:
Running OpenSearch Post-Installation Script
ERROR: Something went wrong during demo configuration installation. Please see the logs in /var/log/opensearch/install_demo_configuration.log
Unfortunately, the upstream OpenSearch deb package requires installing the security demo.
There's a CR out to fix this
, but it's 10 months old.
I just pinged the developers in OpenSearch Slack
, but we'll probably have to come up with a workaround (I've already done this with an Ansible playbook in my homelab, and the linked GitHub issue has some examples of how to do it in Puppet).
gerritbot
added a comment.
Mon, Apr 13, 6:05 PM
2026-04-13 18:05:26 (UTC+0)
Change #1270511 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] opensearch: hack around upstream 2.x+ packages
gerritbot
added a project:
Patch-For-Review
Mon, Apr 13, 6:05 PM
2026-04-13 18:05:28 (UTC+0)
gerritbot
added a comment.
Tue, Apr 14, 1:25 PM
2026-04-14 13:25:23 (UTC+0)
Change #1270511
merged
by Bking:
[operations/puppet@production] opensearch: hack around upstream 2.x+ packages
gerritbot
added a comment.
Tue, Apr 14, 2:15 PM
2026-04-14 14:15:25 (UTC+0)
Change #1270953 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] opensearch: correct o11y usage in comment
bking
added a subtask:
T423291: Build new wmf-opensearch-search-plugins package for opensearch 2.x/trixie and ensure we don't install/enable any unwanted plugins in prod
Tue, Apr 14, 2:41 PM
2026-04-14 14:41:49 (UTC+0)
gerritbot
added a comment.
Tue, Apr 14, 4:16 PM
2026-04-14 16:16:24 (UTC+0)
Change #1270953
merged
by Cwhite:
[operations/puppet@production] opensearch: correct o11y usage in comment
bking
added a subtask:
T423327: Explore options for OpenSearch 2.x/3.x plugin packaging and distribution
Tue, Apr 14, 6:22 PM
2026-04-14 18:22:07 (UTC+0)
gerritbot
added a comment.
Wed, Apr 15, 7:02 AM
2026-04-15 07:02:36 (UTC+0)
Change #1271473 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] opensearch: strip bundled plugins before WMF pkg
MoritzMuehlenhoff
subscribed.
Wed, Apr 15, 7:19 AM
2026-04-15 07:19:48 (UTC+0)
@bking
Regarding the broken import of the madvise package: I went ahead and rebuilt it as 0.2+deb13u1. While the only dependency of that package is on glibc, which has a stable ABI, it's still preferable to rebuild it with GCC 15 from Trixie. The new version also resolves the versioning/import issue. I've synced the debs to my home directory on cloudelastic1012, but haven't installed them yet since I didn't want to interfere with any ongoing tests of yours. When the time is right, please install them on 1012, and if they are fine, I'll import them to apt.w.o.
bking
added a comment.
Wed, Apr 15, 3:10 PM
2026-04-15 15:10:23 (UTC+0)
@MoritzMuehlenhoff
, I've installed the packages as you requested and I can confirm they installed cleanly. Feel free to publish them to the repos.
Thanks for your help!
bking
mentioned this in
P90344 Reprepro error T422860
Wed, Apr 15, 3:12 PM
2026-04-15 15:12:03 (UTC+0)
bking
changed the status of subtask
T423291: Build new wmf-opensearch-search-plugins package for opensearch 2.x/trixie and ensure we don't install/enable any unwanted plugins in prod
from
Duplicate
to
Resolved
gerritbot
added a comment.
Wed, Apr 15, 3:19 PM
2026-04-15 15:19:45 (UTC+0)
Change #1271473
merged
by Bking:
[operations/puppet@production] opensearch: strip bundled plugins before WMF pkg
CodeReviewBot
added a comment.
Wed, Apr 15, 4:42 PM
2026-04-15 16:42:20 (UTC+0)
bking
merged
Remove already-installed packages
gerritbot
added a comment.
Wed, Apr 15, 5:06 PM
2026-04-15 17:06:21 (UTC+0)
Change #1271818 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cloudelastic: fix java path typo
gerritbot
added a comment.
Wed, Apr 15, 5:13 PM
2026-04-15 17:13:18 (UTC+0)
Change #1271818
merged
by Bking:
[operations/puppet@production] cloudelastic: fix java path typo
ops-monitoring-bot
added a comment.
Wed, Apr 15, 9:06 PM
2026-04-15 21:06:17 (UTC+0)
Icinga downtime and Alertmanager silence (ID=396a17ce-b27d-41be-a6ce-921c607989da) set by bking@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: still fixing Puppet
cloudelastic1012.eqiad.wmnet
gerritbot
added a comment.
Wed, Apr 15, 9:21 PM
2026-04-15 21:21:20 (UTC+0)
Change #1271929 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cloudelastic: temporarily add "working typos" for plugins
gerritbot
added a comment.
Wed, Apr 15, 9:27 PM
2026-04-15 21:27:26 (UTC+0)
Change #1271929
merged
by Bking:
[operations/puppet@production] cloudelastic: temporarily add "working typos" for plugins
gerritbot
added a comment.
Wed, Apr 15, 10:03 PM
2026-04-15 22:03:35 (UTC+0)
Change #1271947 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] opensearch: allowlist upstream-only plugins
Stashbot
added a comment.
Thu, Apr 16, 6:55 AM
2026-04-16 06:55:17 (UTC+0)
Mentioned in SAL (#wikimedia-operations)
[2026-04-16T06:55:17Z]
T422860
MoritzMuehlenhoff
added a comment.
Thu, Apr 16, 6:57 AM
2026-04-16 06:57:04 (UTC+0)
In
T422860#11824928
@bking
wrote:
@MoritzMuehlenhoff
, I've installed the packages as you requested and I can confirm they installed cleanly. Feel free to publish them to the repos.
Nice! I've imported the new package into component/opensearch2 for trixie-wikimedia
gerritbot
added a comment.
Fri, Apr 17, 5:03 PM
2026-04-17 17:03:54 (UTC+0)
Change #1273887 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] OpenSearch: Control which plugins we use via systemd PrivateMounts
gerritbot
added a comment.
Fri, Apr 17, 7:22 PM
2026-04-17 19:22:41 (UTC+0)
Change #1273887
merged
by Bking:
[operations/puppet@production] OpenSearch: Control which plugins we use via systemd PrivateMounts
gerritbot
added a comment.
Fri, Apr 17, 7:37 PM
2026-04-17 19:37:50 (UTC+0)
Change #1273937 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] opensearch: move var up so we can use it earlier
gerritbot
added a comment.
Fri, Apr 17, 7:42 PM
2026-04-17 19:42:38 (UTC+0)
Change #1273937
merged
by Bking:
[operations/puppet@production] opensearch: move var up so we can use it earlier
gerritbot
added a comment.
Fri, Apr 17, 7:55 PM
2026-04-17 19:55:19 (UTC+0)
Change #1273943 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cloudelastic1012: remove the deliberately-introduced typo
gerritbot
added a comment.
Fri, Apr 17, 7:57 PM
2026-04-17 19:57:24 (UTC+0)
Change #1273943
merged
by Bking:
[operations/puppet@production] cloudelastic1012: remove the deliberately-introduced typo
gerritbot
added a comment.
Fri, Apr 17, 9:48 PM
2026-04-17 21:48:13 (UTC+0)
Change #1274061 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] cloudelastic1012: override common_settings merge to first
gerritbot
added a comment.
Fri, Apr 17, 9:55 PM
2026-04-17 21:55:12 (UTC+0)
Change #1274075 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] cloudelastic1012: full common_settings override for OS2
gerritbot
added a comment.
Fri, Apr 17, 9:55 PM
2026-04-17 21:55:57 (UTC+0)
Change #1274075
abandoned
by Ryan Kemper:
[operations/puppet@production] cloudelastic1012: full common_settings override for OS2
Reason:
meant to update https://gerrit.wikimedia.org/r/c/operations/puppet/+/1274061; abandoning
gerritbot
added a comment.
Fri, Apr 17, 10:00 PM
2026-04-17 22:00:17 (UTC+0)
Change #1274061
merged
by Ryan Kemper:
[operations/puppet@production] cloudelastic1012: full common_settings override for OS2
gerritbot
added a comment.
Fri, Apr 17, 10:54 PM
2026-04-17 22:54:13 (UTC+0)
Change #1274134 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] cloudelastic1012: full common_settings override for OS2
gerritbot
added a comment.
Fri, Apr 17, 10:57 PM
2026-04-17 22:57:50 (UTC+0)
Change #1274134
merged
by Ryan Kemper:
[operations/puppet@production] cloudelastic1012: full common_settings override for OS2
gerritbot
added a comment.
Mon, Apr 20, 3:04 PM
2026-04-20 15:04:40 (UTC+0)
Change #1275435 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cloudelastic1012: Set LVS config for opensearch_2
gerritbot
added a comment.
Mon, Apr 20, 3:09 PM
2026-04-20 15:09:17 (UTC+0)
Change #1275435
merged
by Bking:
[operations/puppet@production] cloudelastic1012: Set LVS config for opensearch_2
gerritbot
added a comment.
Mon, Apr 20, 3:38 PM
2026-04-20 15:38:09 (UTC+0)
Change #1275444 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] Cirrussearch: remove unused hiera files
gerritbot
added a comment.
Mon, Apr 20, 3:41 PM
2026-04-20 15:41:19 (UTC+0)
Change #1275444
merged
by Bking:
[operations/puppet@production] Cirrussearch: remove unused hiera files
gerritbot
added a comment.
Mon, Apr 20, 5:00 PM
2026-04-20 17:00:06 (UTC+0)
Change #1275473 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cloudelastic1012: move back to insetup
gerritbot
added a comment.
Mon, Apr 20, 5:00 PM
2026-04-20 17:00:57 (UTC+0)
Change #1275473
merged
by Bking:
[operations/puppet@production] cloudelastic1012: move back to insetup
ops-monitoring-bot
added a comment.
Mon, Apr 20, 5:03 PM
2026-04-20 17:03:00 (UTC+0)
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1012.eqiad.wmnet with OS trixie
ops-monitoring-bot
added a comment.
Mon, Apr 20, 5:35 PM
2026-04-20 17:35:39 (UTC+0)
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1012.eqiad.wmnet with OS trixie completed:
cloudelastic1012 (PASS)
Downtimed on Icinga/Alertmanager
Disabled Puppet
Removed from Puppet and PuppetDB if present and deleted any certificates
Removed from Debmonitor if present
Forced UEFI HTTP Boot for next reboot
Host rebooted via Redfish
Host up (Debian installer)
Host up (new fresh trixie OS)
Generated Puppet certificate
Signed new Puppet certificate
Run Puppet in NOOP mode to populate exported resources in PuppetDB
Found Nagios_host resource for this host in PuppetDB
Downtimed the new host on Icinga/Alertmanager
Removed previous downtime on Alertmanager (old OS)
First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604201718_bking_545568_cloudelastic1012.out
configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
Rebooted
Automatic Puppet run was successful
Forced a re-check of all Icinga services for the host
Icinga status is optimal
Icinga downtime removed
Updated Netbox data from PuppetDB
gerritbot
added a comment.
Mon, Apr 20, 5:45 PM
2026-04-20 17:45:26 (UTC+0)
Change #1275485 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cloudelastic1012: move back to production role
gerritbot
added a comment.
Mon, Apr 20, 5:46 PM
2026-04-20 17:46:28 (UTC+0)
Change #1275485
merged
by Bking:
[operations/puppet@production] cloudelastic1012: move back to production role
bking
added a comment.
Edited
Mon, Apr 20, 7:36 PM
2026-04-20 19:36:01 (UTC+0)
cloudelastic1012 has been reimaged and is back up, ready for testing.
I've removed it from load balancer rotation, shut off Puppet, and have stopped all instances except
psi
(port 9600), which I'm using as our guinea pig.
So far, I've gotten the following errors:
Caused by: java.lang.IllegalStateException: index [.ltrstore/vCo9DZu5Qt-3QbtmBy1d7Q] version not supported: 6.5.4 minimum compatible index version is: 7.
This index is part
of OpenSearch's machine learning/Learning to Rank feature set
. It is not used in cloudelastic, but for production we may have to do something like*:
create a new named store
reload the data (maybe via the reindex API; needs testing)
repoint queries at the new feature store
get rid of the old one
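The four steps above can be sketched as a dry-run plan. This is a minimal illustration, not a tested procedure: it only builds the requests and sends nothing. The `PUT /_ltr/<name>` endpoint and the `.ltrstore_<name>` backing-index naming are assumptions about the Learning to Rank plugin that should be checked against its documentation, and the helper name `ltr_store_migration_plan` and the store name `migrated` are hypothetical.

```python
import json
from typing import List, Optional, Tuple

# (description, request line or None for a manual step, JSON body or None)
Step = Tuple[str, Optional[str], Optional[dict]]


def ltr_store_migration_plan(new_store: str, old_index: str = ".ltrstore") -> List[Step]:
    """Build the ordered migration steps; nothing is sent to any cluster."""
    # Assumption: a named store "X" is backed by the index ".ltrstore_X".
    new_index = f".ltrstore_{new_store}"
    return [
        # Assumption: the LTR plugin creates a named store via PUT /_ltr/<name>.
        ("create a new named store", f"PUT /_ltr/{new_store}", None),
        # Standard reindex API body; whether this works for LTR data is untested.
        ("reload the data", "POST /_reindex",
         {"source": {"index": old_index}, "dest": {"index": new_index}}),
        # Application-side change, so there is no cluster request to issue.
        ("repoint queries at the new feature store", None, None),
        # Plain index deletion of the old store's backing index.
        ("get rid of the old one", f"DELETE /{old_index}", None),
    ]


if __name__ == "__main__":
    for description, request, body in ltr_store_migration_plan("migrated"):
        print(f"{description}: {request or '(manual step)'}")
        if body is not None:
            print(json.dumps(body, indent=2))
```

Keeping the plan as data makes it easy to review the requests with the search team before anything is run against a production cluster.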
The next error I've seen is very similar:
java.lang.IllegalStateException: index [mw_cirrus_metastore_1659365741/ugKwuXOpRjiti8dY67m9OA] version not supported: 6.8.23 minimum compatibl
Per
codesearch
, CirrusSearch (the MediaWiki extension that provides OpenSearch support) uses the
mw_cirrus_metastore
index to store the state of administrative tasks. We're still working out a plan to upgrade this index gracefully as I write this.
*suggested by
@EBernhardson
in
Wikimedia-Search
IRC
bking
added a subscriber:
EBernhardson
Mon, Apr 20, 7:47 PM
2026-04-20 19:47:01 (UTC+0)
bking
added a comment.
Mon, Apr 20, 9:01 PM
2026-04-20 21:01:00 (UTC+0)
We had a few more indices to delete before the existing OpenSearch 1.x clusters would allow an OpenSearch 2 node to join. We can find the problem indices with this one-liner:
curl -s localhost:${PORT}/_all/_settings | jq -r 'to_entries[] | "\(.key) \(.value.settings.index.version.created)"' | grep -v 135249827
(135249827 means the index was created on OpenSearch 1; anything not matching that will be a problem.)
The problem indices for Cloudelastic were:
.ltrstore
as described above
mw_cirrus_metastore
also described above
.tasks
- used internally by OpenSearch to keep track of running tasks. Safe enough to delete in most circumstances (if you just lost a bunch of data and were waiting for OpenSearch to recreate shards, probably not).
We will have to be a bit more cautious for the production clusters, but I think we just need a few reimages to get Cloudelastic onto OpenSearch 2.x.
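The numeric IDs printed by the one-liner above can be decoded into readable versions. The sketch below assumes the usual encoding (major * 1,000,000 + minor * 10,000 + patch * 100 + build) and assumes OpenSearch marks its own IDs by setting the 0x08000000 bit; under those assumptions 135249827 decodes to OpenSearch 1.3.20. The function names and the sample data are hypothetical, and the "older than Elasticsearch 7" rule is taken from the error messages earlier in this task.

```python
from typing import Dict, List, Tuple

# Assumption: OpenSearch sets this bit in its version IDs to tell them
# apart from legacy Elasticsearch IDs.
OPENSEARCH_MASK = 0x08000000


def decode_version_id(version_id: int) -> Tuple[str, str]:
    """Turn a raw index.version.created value into (product, 'x.y.z')."""
    if version_id & OPENSEARCH_MASK:
        product, raw = "opensearch", version_id ^ OPENSEARCH_MASK
    else:
        product, raw = "elasticsearch", version_id
    major = raw // 1_000_000
    minor = (raw // 10_000) % 100
    patch = (raw // 100) % 100
    return product, f"{major}.{minor}.{patch}"


def problem_indices(all_settings: Dict[str, dict]) -> List[Tuple[str, str, str]]:
    """List indices created before Elasticsearch 7 from a /_all/_settings body."""
    problems = []
    for name, body in sorted(all_settings.items()):
        created = int(body["settings"]["index"]["version"]["created"])
        product, version = decode_version_id(created)
        if product == "elasticsearch" and int(version.split(".")[0]) < 7:
            problems.append((name, product, version))
    return problems


if __name__ == "__main__":
    # Hypothetical sample shaped like the /_all/_settings response.
    sample = {
        ".ltrstore": {"settings": {"index": {"version": {"created": "6050499"}}}},
        "example_content": {"settings": {"index": {"version": {"created": "135249827"}}}},
    }
    for name, product, version in problem_indices(sample):
        print(name, product, version)
```

Unlike the `grep -v` filter, this flags only indices created before major version 7, so indices created on other 1.x releases would not show up as false positives.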
gerritbot
added a comment.
Mon, Apr 20, 9:31 PM
2026-04-20 21:31:00 (UTC+0)
Change #1275535 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] prometheus: fix wmf-elasticsearch-exporter listen address on Trixie
gerritbot
added a comment.
Mon, Apr 20, 9:43 PM
2026-04-20 21:43:31 (UTC+0)
Change #1275535
merged
by Ryan Kemper:
[operations/puppet@production] prometheus: fix wmf-elasticsearch-exporter listen address on Trixie
bking
added a subscriber:
RKemper
Mon, Apr 20, 10:01 PM
2026-04-20 22:01:28 (UTC+0)
Note that we also ran into a problem with the prometheus exporter and Python 3.13, which comes with Trixie.
@RKemper
's patch above fixes that.
bking
mentioned this in
T421757: ☂️ Migrate production OpenSearch clusters from 1.x-2.x ☂️
Mon, Apr 20, 10:11 PM
2026-04-20 22:11:54 (UTC+0)
bking
closed subtask
T423327: Explore options for OpenSearch 2.x/3.x plugin packaging and distribution
as
Resolved
Tue, Apr 21, 10:35 PM
2026-04-21 22:35:04 (UTC+0)
gerritbot
added a comment.
Thu, Apr 23, 9:32 PM
2026-04-23 21:32:21 (UTC+0)
Change #1276804 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cloudelastic: prepare cloudelastic1011 for Trixie/OpenSearch 2
gerritbot
added a comment.
Thu, Apr 23, 9:35 PM
2026-04-23 21:35:19 (UTC+0)
Change #1276804
merged
by Bking:
[operations/puppet@production] cloudelastic: prepare cloudelastic1011 for Trixie/OpenSearch 2
ops-monitoring-bot
added a comment.
Thu, Apr 23, 9:36 PM
2026-04-23 21:36:46 (UTC+0)
Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1011.eqiad.wmnet with OS trixie
gerritbot
added a comment.
Thu, Apr 23, 10:04 PM
2026-04-23 22:04:32 (UTC+0)
Change #1276818 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cloudelastic: set role-level hiera for OpenSearch 2/Trixie
ops-monitoring-bot
added a comment.
Thu, Apr 23, 10:21 PM
2026-04-23 22:21:26 (UTC+0)
Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1011.eqiad.wmnet with OS trixie completed:
cloudelastic1011 (WARN)
Downtimed on Icinga/Alertmanager
Disabled Puppet
Removed from Puppet and PuppetDB if present and deleted any certificates
Removed from Debmonitor if present
Forced UEFI HTTP Boot for next reboot
Host rebooted via Redfish
Host up (Debian installer)
Host up (new fresh trixie OS)
Generated Puppet certificate
Signed new Puppet certificate
Run Puppet in NOOP mode to populate exported resources in PuppetDB
Found Nagios_host resource for this host in PuppetDB
Downtimed the new host on Icinga/Alertmanager
Removed previous downtime on Alertmanager (old OS)
First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604232154_bking_3545146_cloudelastic1011.out
configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
Rebooted
Automatic Puppet run was successful
Forced a re-check of all Icinga services for the host
Icinga status is not optimal, downtime not removed
Updated Netbox data from PuppetDB
Gehel
edited projects, added
Data-Platform-SRE (2026-04-24 - 2026-05-15)
; removed
Data-Platform-SRE (2026-03-27 - 2026-04-17)
Fri, Apr 24, 9:16 AM
2026-04-24 09:16:13 (UTC+0)
Gehel
moved this task from
Backlog - project
to
In Progress
on the
Data-Platform-SRE (2026-04-24 - 2026-05-15)
board.
Content licensed under Creative Commons Attribution-ShareAlike (CC BY-SA) 4.0 unless otherwise noted; code licensed under GNU General Public License (GPL) 2.0 or later and other open source licenses. By using this site, you agree to the Terms of Use, Privacy Policy, and Code of Conduct.