Add Link - Wikitech
This page contains information about the infrastructure used for the link recommendation service and the data pipeline used to support the
"Add a Link" structured task project
High-level summary
The
Link Recommendation Service
recommends phrases of text in an article to link to other articles on a wiki. Users can then accept or reject these recommendations.
The service is an application hosted on kubernetes with an API accessible via HTTP (see
task T258978
). It accepts a POST request containing the wikitext of an article and returns a structured response of link recommendations for the article. It does not have caching or storage; the client (MediaWiki) is responsible for that (
task T261411
).
The search index stores metadata about which articles have link recommendations via a field we set per article (
task T261407
,
task T262226
).
A MySQL table per wiki is used for caching the actual link recommendations (
task T261411
); each row contains serialized link recommendations for a particular article.
A maintenance script (
task T261408
) runs hourly per enabled wiki to generate link recommendations by iterating over each
Search/articletopic
and calling the Link Recommendation Service to request recommendations.
The maintenance script caches the results in the MySQL table, then sends an event to
Event_Platform/EventGate
, where the
pipeline ensures that the index is updated with the links/nolinks metadata for the article.
On page edit (when the edit is not done via the Add Link UX), link recommendations are regenerated via the job queue, using the same code and APIs as the maintenance script (n.b. we might do this differently; not yet implemented).
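The refresh flow described above can be sketched as follows. This is a minimal illustration, not the GrowthExperiments implementation: the function names, the shape of a recommendation, and the event fields are all hypothetical, and the MySQL table and EventGate are replaced by in-memory stand-ins. It also assumes that an event is emitted for pages with no recommendations, so that the index can record the "nolinks" state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class RefreshResult:
    # Stand-in for the per-wiki MySQL cache table: page title -> recommendations.
    cache: Dict[str, List[dict]] = field(default_factory=dict)
    # Stand-in for the events that would be sent to EventGate.
    events: List[dict] = field(default_factory=list)


def refresh_recommendations(
    titles: Iterable[str],
    get_wikitext: Callable[[str], str],
    request_recommendations: Callable[[str], List[dict]],
) -> RefreshResult:
    """Request recommendations per page, cache them, and record an index event."""
    result = RefreshResult()
    for title in titles:
        links = request_recommendations(get_wikitext(title))
        if links:
            # Only pages with at least one recommendation get a cache row.
            result.cache[title] = links
        # Assumption: the index is told about both outcomes (links / nolinks).
        result.events.append({"page_title": title, "has_links": bool(links)})
    return result


def fake_service(wikitext: str) -> List[dict]:
    """Hypothetical stand-in for the HTTP service; recommends one fixed link."""
    if "Berlin" in wikitext:
        return [{"phrase": "Berlin", "target": "Berlin"}]
    return []


pages = {"Germany": "Berlin is the capital.", "Stub": "Nothing to link here."}
result = refresh_recommendations(list(pages), pages.__getitem__, fake_service)
print(result.cache)
print(result.events)
```

Passing the wikitext fetcher and the service call in as callables keeps the loop independent of how either is actually reached.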
Diagram: Fetching and completing link recommendation tasks
Source:
Add_Link/Diagram:_Fetching_and_completing_link_recommendation_tasks
Link Recommendation Service
Repository
The repository for training the link recommendation model as well as for the query service is available:
Source code:
research/mwaddlink
Machine learning model
Some explanation of how the model works can be found on the
meta-research-page
Local development
Please see the
README
in the research/mwaddlink repository for options available, including docker-compose, Vagrant, and host system setups.
API
API documentation
Sandbox
Deployment
The service is deployed in production using the
Deployment pipeline
. The configuration specific to the service is in the deployment-charts repository:
Source code:
charts/linkrecommendation
helmfile.d/services/linkrecommendation
Dataset pipeline
The link recommendation model is trained on the
stat1008
server (chosen because training is CPU-intensive and stat1008 has access to the required production systems) with the
run-pipeline.sh
script. That script aggregates MediaWiki data from hive into several MySQL lookup tables per wiki. (For more details, see the
Training the model
section of the readme.) Those tables (stored in the
staging
database with an
lr_
prefix) are then exported and published via
datasets.wikimedia.org
with the
publish-datasets.sh
command. The production query service (that MediaWiki interacts with) will poll for changes and import those datasets into its own MySQL instance in Kubernetes (
task T266826
).
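The "poll for changes and import" step can be pictured with the sketch below. It assumes, purely for illustration, that each published dataset carries a checksum and that the query service remembers the checksum of what it last imported; the function name and the checksum mechanism are hypothetical and not taken from research/mwaddlink.

```python
from typing import Dict, List


def datasets_to_import(published: Dict[str, str], imported: Dict[str, str]) -> List[str]:
    """Return the wikis whose published checksum differs from the imported one.

    Both arguments map a wiki ID to a checksum string. A wiki that has never
    been imported is missing from `imported` and is therefore always returned.
    """
    return sorted(
        wiki for wiki, checksum in published.items()
        if imported.get(wiki) != checksum
    )


published = {"cswiki": "abc123", "enwiki": "def456", "simplewiki": "777aaa"}
imported = {"cswiki": "abc123", "enwiki": "old000"}
stale = datasets_to_import(published, imported)
print(stale)
```

Here `enwiki` would be re-imported because its checksum changed, and `simplewiki` because it has not been imported yet.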
The canonical location for training new models and publishing datasets is at
/home/mgerlach/REPOS/mwaddlink-gerrit
Monitoring
Grafana dashboard
Logstash
Resolved questions / decisions
10 December: How to get a MySQL database from a stat* server to a production MySQL instance (SRE/Analytics) (
task T266826
)
23 October: Store the link recommendations in WANObjectCache or in a MySQL table?
task T261411
(needs SRE/DBA input)
15 October: use wikitext for training model, generating dictionary data, and as input to the mwaddlink query service. Will search for phrases in VE's editable content surface rather than attempt to apply offsets from wikitext / parsoid HTML.
Deployment
The canonical documentation is at
Deployments on kubernetes
If you change the default values.yaml, you need to release a new chart version by bumping the version in Chart.yaml.
Prepare the deployment patch
Make a patch in
operations/deployment-charts
that updates the value of the
main_app.version
field in
helmfile.d/services/linkrecommendation/values.yaml
, to the new image tag that was mentioned in PipelineBot's comment on the last merged
research/mwaddlink
patch (
example
).
Example commit message
linkrecommendation: Bump version
* app/api: Use locale-specific lowercasing
T308244 / I962037e614fa5cdd1fce443caf94ce84b7c7b421
Bug: T308244
Commit message guidelines
Subject line can always be: "linkrecommendation: Bump version"
Add a bullet point for each patch in
research/mwaddlink
that is part of this release. The first line should specify what relevant code is affected (api, app, etc) followed by the subject line of the commit. On the second line, include a reference to the task from the patch and a link to the Gerrit Change-Id.
Finally, the last line should include "Bug: " and reference the relevant phabricator task for this deployment.
All of the above guidelines help to keep a paper trail and to document what was deployed, and when.
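The guidelines above are mechanical enough to sketch as a small helper. The `Patch` class and `build_commit_message` function are hypothetical names introduced for illustration; the blank lines between subject, body, and footer follow the usual Git convention and are an assumption about the layout of the example message.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Patch:
    area: str       # affected code, e.g. "app/api"
    subject: str    # subject line of the research/mwaddlink commit
    task: str       # Phabricator task referenced by that patch
    change_id: str  # Gerrit Change-Id of that patch


def build_commit_message(patches: List[Patch], bug: str) -> str:
    """Compose a deployment commit message following the guidelines."""
    lines = ["linkrecommendation: Bump version", ""]
    for patch in patches:
        # First line: affected code followed by the original subject line.
        lines.append(f"* {patch.area}: {patch.subject}")
        # Second line: task and Change-Id of the patch being released.
        lines.append(f"  {patch.task} / {patch.change_id}")
    # Footer: the task tracking this deployment.
    lines += ["", f"Bug: {bug}"]
    return "\n".join(lines)


message = build_commit_message(
    [Patch("app/api", "Use locale-specific lowercasing", "T308244",
           "I962037e614fa5cdd1fce443caf94ce84b7c7b421")],
    bug="T308244",
)
print(message)
```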
helmfile.d/services/linkrecommendation/values.yaml
diff --git a/helmfile.d/services/linkrecommendation/values.yaml b/helmfile.d/services/linkrecommendation/values.yaml
index b843d7f..025e203 100644
--- a/helmfile.d/services/linkrecommendation/values.yaml
+++ b/helmfile.d/services/linkrecommendation/values.yaml
@@ -18,7 +18,7 @@
requests:
cpu: 1750m
memory: 500Mi # Based on data from https://grafana.wikimedia.org/goto/JKjTBSQGz
- version: 2022-05-18-231105-production
+ version: 2022-06-22-142950-production
monitoring:
enabled: true
resources:
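The one-line change in the diff above could also be scripted. The sketch below is only an illustration: `bump_version` is a hypothetical helper, and it assumes the file contains exactly one `version:` line whose value has the `YYYY-MM-DD-HHMMSS-production` shape seen in the diff. It edits the text directly rather than parsing YAML, so comments and layout are left untouched.

```python
import re

TAG = r"\d{4}-\d{2}-\d{2}-\d{6}-production"
# Match a whole "version: <tag>" line, keeping its indentation in group 1.
VERSION_LINE = re.compile(rf"^([ \t]*version:[ \t]*){TAG}[ \t]*$", re.MULTILINE)


def bump_version(values_yaml: str, new_tag: str) -> str:
    """Replace the image tag on the single `version:` line of values.yaml."""
    if not re.fullmatch(TAG, new_tag):
        raise ValueError(f"unexpected tag format: {new_tag!r}")
    updated, count = VERSION_LINE.subn(lambda m: m.group(1) + new_tag, values_yaml)
    if count != 1:
        # Refuse to guess when the line is missing or ambiguous.
        raise ValueError(f"expected exactly one version line, found {count}")
    return updated


sample = (
    "  requests:\n"
    "    cpu: 1750m\n"
    "  version: 2022-05-18-231105-production\n"
    "\n"
    "monitoring:\n"
    "  enabled: true\n"
)
updated = bump_version(sample, "2022-06-22-142950-production")
print(updated)
```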
See also
See
Deployments on kubernetes
for tips, and note that 1) self merges are OK in this repository, and 2) a cron script on the deployment server will fetch the latest contents of the repository every minute.
Deploy the patch
Now, SSH to a
Deployment server
and run the commands for each environment below.
Staging
$ cd /srv/deployment-charts/helmfile.d/services/linkrecommendation/
$ git log # Make sure your deployment patch is there
$ helmfile -e staging -i apply # scan output to see if the changes are expected, press "enter"
$ service-checker-swagger staging.svc.eqiad.wmnet
-t 2 -s /apispec_1.json
# Manually verifying requests
$ curl "
# Against production
$ diff <(curl -s "
" | jq .) <(curl -s "
" | jq .)
eqiad
$ cd /srv/deployment-charts/helmfile.d/services/linkrecommendation/
$ git log # Make sure your deployment patch is there
$ helmfile -e eqiad -i apply # scan output to see if the changes are expected, press "enter"
# Internal traffic release
$ service-checker-swagger linkrecommendation.discovery.wmnet
-t 2 -s /apispec_1.json
# External traffic release
$ service-checker-swagger linkrecommendation.discovery.wmnet
-t 2 -s /apispec_1.json
# Manually verifying requests
$ curl "
$ curl "
codfw
$ cd /srv/deployment-charts/helmfile.d/services/linkrecommendation/
$ git log # Make sure your deployment patch is there
$ helmfile -e codfw -i apply # scan output to see if the changes are expected, press "enter"
# NB the following requests will go to the active datacenter, so if eqiad is active and you're deploying to codfw, these requests will go to eqiad.
# Internal traffic release
$ service-checker-swagger linkrecommendation.discovery.wmnet
-t 2 -s /apispec_1.json
# External traffic release
$ service-checker-swagger linkrecommendation.discovery.wmnet
-t 2 -s /apispec_1.json
# Manually verifying requests
$ curl "
$ curl "
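The staging-versus-production comparison in the Staging steps pipes two responses through `jq .` and diffs the text. When the textual diff is noisy, a structural comparison can help. The sketch below is a hypothetical helper, not part of any deployment tooling, and the response shape in the usage example is invented for illustration.

```python
import json
from typing import Any, List


def json_diff(a: Any, b: Any, path: str = "$") -> List[str]:
    """Return the paths at which two parsed JSON documents differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        differences: List[str] = []
        for key in sorted(set(a) | set(b)):
            if key not in a or key not in b:
                # Key present on one side only.
                differences.append(f"{path}.{key}")
            else:
                differences.extend(json_diff(a[key], b[key], f"{path}.{key}"))
        return differences
    if isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        differences = []
        for index, (left, right) in enumerate(zip(a, b)):
            differences.extend(json_diff(left, right, f"{path}[{index}]"))
        return differences
    # Scalars, mismatched types, or lists of different length.
    return [] if a == b else [path]


# Invented response bodies; only the comparison technique matters here.
staging = json.loads('{"links": [{"score": 0.9}], "count": 1}')
production = json.loads('{"count": 1, "links": [{"score": 0.8}]}')
changed = json_diff(staging, production)
print(changed)
```

Unlike a line-based diff, this ignores differences in object key order and reports only the paths whose values differ.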
Checking output from a container
$ kube_env linkrecommendation staging
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
linkrecommendation-staging-7476db744d-w8bms 3/3 Running 0 7h47m
tiller-974b97fc7-rq4dn 1/1 Running 0 30h
$ kubectl logs -f linkrecommendation-staging-7476db744d-w8bms
Error from server (BadRequest): a container name must be specified for pod linkrecommendation-staging-7476db744d-w8bms, choose one of: [linkrecommendation-staging staging-metrics-exporter linkrecommendation-staging-tls-proxy]
$ kubectl logs -f linkrecommendation-staging-7476db744d-w8bms -c linkrecommendation-staging
Enabling on a new wiki
Enabling on a new wiki (once the models have been set up) is a multi-step process:
First, set
$wgGENewcomerTasksLinkRecommendationsEnabled
to true for the target wikis. This allows the pool of link recommendations to start populating, but no recommendations are surfaced to end users at this point.
Wait a few days (to allow the
GrowthExperiments:refreshLinkRecommendations.php
mw-cron job
to start running for the wiki).
Once enough suggestions are generated, set
$wgGELinkRecommendationsFrontendEnabled
to true. This will start surfacing the link recommendations to the end users. If needed, the size of the task pool can be verified via the
Special:NewcomerTasksInfo
special page, the
GrowthExperiments:listTaskCounts.php
maintenance script or
in Grafana
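The two-flag rollout can be summarised as a small state function. This is only a reading aid: `rollout_stage` and the stage names are hypothetical, and treating "frontend enabled without recommendations enabled" as disabled is an assumption, since the steps above do not describe that combination.

```python
def rollout_stage(recommendations_enabled: bool, frontend_enabled: bool) -> str:
    """Map the two configuration flags to the rollout stage they imply.

    recommendations_enabled stands for $wgGENewcomerTasksLinkRecommendationsEnabled,
    frontend_enabled for $wgGELinkRecommendationsFrontendEnabled.
    """
    if not recommendations_enabled:
        # Assumption: nothing is generated or shown without the first flag.
        return "disabled"
    if not frontend_enabled:
        # The task pool fills up, but nothing is surfaced to users yet.
        return "populating"
    return "live"


stages = [
    rollout_stage(False, False),
    rollout_stage(True, False),
    rollout_stage(True, True),
]
print(stages)
```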
Pre-populating excluded sections configuration
Optionally, it is possible to also pre-populate the excluded sections configuration for the wiki. Until
task T345562
is resolved, this is only possible for certain wikis (those created before a certain date). Pre-generating the configuration is based on
section alignment data
, which has been formatted into the
wiki_sections.jsonl
file.
To proceed, download the
wiki_sections.jsonl
file from
F35092312 on Phabricator
to the currently active
deployment host
and run:
export PHAB=Txxxx
export WIKI=testwiki
jq "select(.wiki==\"$WIKI\" and .probability > 0.25) | .section" wiki_sections.jsonl |
  jq --slurp --compact-output "unique" |
  mwscript-k8s --attach -- CommunityConfiguration:ChangeWikiConfig --wiki $WIKI --summary "machine-generated configuration for excluding sections from link recommendations ([[phab:$PHAB]]), feel free to improve" --file php://stdin GrowthSuggestedEdits link_recommendation.excludedSections
Then, go to Special:CommunityConfiguration/SuggestedEdits on the wiki in question and verify the configuration was stored correctly. Community-appointed admins can use the same page to edit the list of excluded sections (or to create it from scratch, if it wasn't auto-populated at all).
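To preview what the jq filter will produce before writing it to the wiki, the same selection can be reproduced locally. The sketch below assumes each line of `wiki_sections.jsonl` is a JSON object with `wiki`, `probability`, and `section` fields, which are the field names the jq filter uses; `excluded_sections` is a hypothetical helper and the sample rows are invented.

```python
import json
from typing import Iterable, List


def excluded_sections(lines: Iterable[str], wiki: str,
                      min_probability: float = 0.25) -> List[str]:
    """Collect the sections for one wiki above the probability threshold.

    The result is de-duplicated and sorted, as with jq's `unique`.
    """
    sections = set()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        if row.get("wiki") == wiki and row.get("probability", 0.0) > min_probability:
            sections.add(row["section"])
    return sorted(sections)


# Invented sample rows in the assumed JSON Lines layout.
sample_lines = [
    '{"wiki": "testwiki", "section": "see also", "probability": 0.9}',
    '{"wiki": "testwiki", "section": "references", "probability": 0.6}',
    '{"wiki": "testwiki", "section": "see also", "probability": 0.4}',
    '{"wiki": "testwiki", "section": "history", "probability": 0.1}',
    '{"wiki": "otherwiki", "section": "notes", "probability": 0.8}',
]
sections = excluded_sections(sample_lines, "testwiki")
# Compact output, comparable to `jq --compact-output`.
print(json.dumps(sections, ensure_ascii=False, separators=(",", ":")))
```

To run it against the real file, pass an open file handle instead of `sample_lines`, for example `excluded_sections(open("wiki_sections.jsonl", encoding="utf-8"), "testwiki")`.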
Updates
December 2025
Growth: Released on Chinese, Japanese, & Urdu Wikipedias (
task T407818
)
Growth: Released on ~30 new Wikipedias (
task T410469
)
9 November - 10 December 2020
Growth / Research: Continued refactoring of research/mwaddlink for production ready status
Growth: Backend patches for GrowthExperiments for consuming research/mwaddlink data
Growth / SRE: Deployed linkrecommendation service to production (no datasets yet though)
DBA: Created database and read/write users for production kubernetes instance to access
Search: Working on consuming event(s) generated by service
2 - 6 November 2020
Growth / Analytics Engineering:
Discuss pipeline for MySQL on stat1008 -> production MySQL
26 - 30 October 2020
Growth / Research: Recap architecture and discuss milestones
Growth / SRE / DBA: Agreed to use MySQL for lookup tables for the link recommendation service
Growth: Continued prototyping of the VisualEditor integration; continued work on deployment pipeline; initial work on HTTP API via Flask; addition of MySQL cache table in GrowthExperiments along with general infrastructure for reading/writing to the cache
19 - 23 October 2020
Growth / Research: Working on deployment pipeline for mwaddlink
Growth: Prototyping VisualEditor integration
Growth: Beginning work on maintenance script and supporting classes
12 - 16 October 2020
Growth / Research: Parsoid HTML vs wikitext, repo structure, MySQL vs SQLite, misc other things
Growth: Engineers meet to discuss schedule, order of tasks, etc
5 - 9 October 2020
Growth / Editing: Exploring ways to bring link recommendation data into VisualEditor
Growth / Research: Discussing repository structures in preparation for deployment pipeline setup
Growth / SRE / Research: Discussing how to get mwaddlink-query / mwaddlink into production
Teams / Contact
Growth
(primary stakeholder, technical contact for project is
Sergio Gimeno
, product owner is
Kirsten Stoller
). Other teams: Search Platform, SRE, Release Engineering, Research, Editing, Parsing.
Roles / responsibilities
Growth: User facing code, integration with our existing newcomer tasks framework, plus maintenance script to populate cache with recommendations
Research: Implementing code to train models and provide a query client (research/mwaddlink repo)
SRE: Working with Growth + Research to put the link recommendation service into production
Search Platform: Implementing the event pipeline to update the search index metadata for a document when new link recommendations are generated
Release Engineering: Consulting with Growth for deployment pipeline
Editing: Consulting with Growth for VE integration
Parsing: Consulting with Growth for VE integration
Background reading
Engineering notes (public for view only)
WMF link for edit access
Summary of project architecture
Link Recommendation Project Architecture
Technical Plan for Productization of "Add a Link"
See also
Add Image
, a similar structured task project (but with a fairly different architecture)