| Wiki Experiences (WE) Hypotheses
[ WE Key Results ]
|
| Hypothesis shortname
|
Q3 text
|
Details & Discussion
|
| WE1.1.3
|
If we consult 20 event organizers and 20 WikiProject organizers on the best use of topics available via LiftWing, then we can prioritize revisions to the topic model that will improve topical connections between events and WikiProjects.
|
|
| WE1.1.5
|
If we implement at least 2 methods to discover the Collaboration List, then we will increase pageviews of the Collaboration List, thereby allowing more people to discover events and WikiProjects that interest them.
|
|
| WE1.1.6
|
If we identify and then contact 20 affiliates and/or groups connected to wikis that have high organizer activity in Q2, we can build advocacy networks that will set the stage for the extension being enabled on 3 more wikis by the end of Q3.
|
|
| WE1.1.7
|
If we add at least 2 improvements to the Collaboration List for events, then at least 50% of surveyed respondents will find the Collaboration List to be more useful in finding events than before the changes were made.
|
|
| WE1.2.5
|
If we conduct an A/B/C test with the alt-text suggested edits prototype in the production version of the iOS app, we can learn whether adding alt-text to images is a task newcomers can complete successfully and, ultimately, decide if it is impactful enough to implement as a suggested edit on the Web and/or in the Apps.
|
|
| WE1.2.7
|
If we deploy the Multi-Check sidebar (desktop) at all wikis where the Reference Check is available, we will unlock our ability to present multiple Edit Checks within a new "mid-edit" moment without negatively impacting the quality of new content edits newcomers publish.
|
|
| WE1.2.9
|
If we surface the ‘Add a Link’ Structured Task to new account holders who are reading Wikipedia articles through an A/B test on pilot wikis, then we expect to increase the percentage of these people who constructively activate on mobile by 10% compared to the control group.
|
|
| WE1.2.10
|
If the Structured Content team improves the code health of the Article-level Image Suggestions data pipeline (reaching 90% code deduplication and separating article-level from section-level image suggestions at the index level), and adapts the image suggestion evaluation tool so that we can establish quality baselines for suggestions on target wikis, then the “Add an Image” task can be released to newcomers on additional Wikipedias. This will enable the Growth team to pursue a follow-up hypothesis focused on increasing constructive activation across at least 10 additional Wikipedias.
|
|
| WE1.2.11
|
If we release the “Add a Link” Structured Task to at least 5% of newcomers on English Wikipedia, then newcomers with access to this structured task will demonstrate a constructive activation rate on mobile that is 10% higher than the baseline, as measured through an A/B test.
|
|
| WE1.3.3
|
If we improve the user experience and features of the Nuke extension during Q2, we will increase administrator satisfaction with the product by 5pp by the end of the quarter.
|
|
| WE1.3.4
|
If we improve the user experience and features of Recent Changes, we will increase administrator satisfaction with the product by 5pp.
|
|
| WE1.5.1
|
If we create a strategy brief by February 2025, including a prioritized strategy and trade-offs, we can use it as one of the main inputs for APP25/26.
|
|
| WE1.5.2
|
If we develop a unified measurement strategy, we will enable evaluation of the multi-year product strategy for contributors and lay the groundwork for prioritizing next steps in metric development and reporting.
|
|
| WE2.1.5
|
If we expose topic-based translation suggestions more broadly and analyze their initial impact, we will learn which aspects of the translation funnel to act on in order to obtain more quality translations.
|
|
| WE2.1.6
|
If we offer list-making as a service, we’ll enable at least 5 communities to make more targeted contributions in their topic areas as measured by (1) change in standard quality coverage of relevant topics on the relevant wiki and (2) a brief survey of organizer satisfaction with topic area coverage on-wiki.
|
|
| WE2.1.7
|
If we develop a proof of concept that adds translation tasks sourced from WikiProjects and other list-building initiatives, and presents them as suggestions within the CX mobile workflow, then more editors will discover and translate articles focused on topical gaps.
By introducing an option that allows editors to select translation suggestions based on topical lists, we will test whether this approach increases content coverage in our projects.
|
|
| WE2.2.4
|
If we document the pre-incubator, incubator, and post-incubator journeys for the five pilot wikis with quantitative and qualitative data, we will be able to better support new languages in the future.
|
|
| WE2.4.4
|
If we develop a live proof-of-concept, using MediaWiki’s async content processing pipeline, for the first use case of Wikifunctions in Wikipedia, we will be ready to switch it on in the new year for the Dagbani community.
|
|
| WE2.6.1
|
If we propagate the integration of Wikifunctions from Test2Wiki to a small production Wikipedia with the MVP user experience, we will see the feature used organically without being reverted.
|
|
| WE2.6.2
|
If we make it possible to translate sentences in Wikifunctions from something “abstract” like a function, we will see an organic increase of at least 5 multilingual functions that generate natural language sentences. This is a milestone towards building an Abstract Wikipedia.
|
|
| WE3.1.6
|
If we introduce a personalized rabbit hole feature in the Android app and recommend condensed versions of articles based on the types of topics and sections a user is interested in, we will learn if the feature is sticky enough to result in multi-day usage by 10% of users exposed to the experiment over a 30-day period, and a higher pageview rate than users not exposed to the feature.
|
|
| WE3.1.8
|
(Q2-Q3, web) If we build one feature which provides additional article-level recommendations, we will see an increase in clickthrough rate of 10% over existing recommendation options and a significant increase in external referrals for users who actively interact with the new feature.
|
|
| WE3.1.9
|
If we create a daily-use Wikipedia-based trivia game in the Android app, logged-out readers who engage with this feature will open the app on multiple days within a 20-day period at a rate at least 5% higher than those who do not engage with the feature.
|
|
| WE3.1.10
|
If we develop and test design prototypes for tabbed browsing in the Wikipedia iOS app, we will gain and incorporate actionable insights on usability, while also enabling engineers to assess technical feasibility of different approaches, building a solid foundation for adding Tabs to the app in Q4.
|
|
| WE3.1.11
|
If we make the article search bar more prominent, we will increase the number of users who initiate searches by 8%, possibly leading to a 1% increase in search retention rate for logged out users.
|
|
| WE3.2.3
|
If we make the “Donate” button in the iOS App more prominent by making it one click or less away from the main navigation screen, we will learn if discoverability was a barrier to non-banner donations.
|
|
| WE3.2.4
|
If we update the contributions page for logged-in users in the app to include an active badge for app donors, and to display an inactive state with a prompt to donate for users who decided not to donate in-app, we will learn whether this recognition is of value to current donors and encourages prospective donors to give, informing whether the concept of donor badges is worth expanding or abandoning.
|
|
| WE3.2.7
|
If we increase the prominence of entry points to donations in the logged-out Minerva mobile and desktop web experiences, the clickthrough rate of the donate link will increase by 30% YoY.
|
|
| WE3.2.8
|
If we make improvements to the personalized and collective content of the iOS app's Year in Review, and scale its availability, we will learn if this is an effective fundraising method.
|
|
| WE3.4.1
|
If we run an experiment to explore the feasibility of setting up smaller PoPs (points of presence) in cloud providers such as Amazon, we can expand our data center map and reach more users around the world, at reduced cost and with faster turnaround.
|
|
| WE3.5.1
|
If we make it possible for Commons Data namespace pages to be categorized, and surface their usage across wikis, Commons admins will have the minimum tools they need to manage the increased usage of the Data namespace, ensuring we can sustainably scale up deployment to all wikis.
|
|
| WE3.5.2
|
If we improve test coverage and documentation for Charts, we will be comfortable handing off maintenance and future feature development [to reading engineering, contractors, and volunteers], allowing us to wind down the project and task force.
|
|
| WE3.5.3
|
If we seed the Community Wishlist with Charts features we know volunteers have asked for that are out of scope for the MVP, there will be a central place for volunteers and staff to discuss future Charts-related work, allowing the future maintainers to manage expectations and source input for annual planning.
|
|
| WE4.1.3
|
If we deploy the Incident Reporting System MVP to x more wikis (a representative sample), we will be able to gather valuable data that will help us identify patterns of harmful conduct across wikis.
|
|
| WE4.1.4
|
If we engage stakeholders across key departments in structured discussions, we can collaboratively define a shared vision and realistic scope for the Incident Reporting System, aligned with organizational priorities and compliance requirements, providing valuable insights to inform annual planning.
|
|
| WE4.2.11a
|
If we define terminology and thresholds for revert risk scores across wikis, we will make it possible to use revert risk scores in a wider range of user-facing anti-abuse tools. This hypothesis impacts the WE4.2 KR by doing the background work necessary to build upon revert risk scores.
|
|
| WE4.2.20
|
If we implement a trial enablement that gathers data on the efficacy of the new CAPTCHA at preventing sockpuppet account creation and bot-based spam edits on enabled wikis, we will be able to measure the value of a production rollout of the new technology.
|
|
| WE4.2.15
|
If we analyze attributes of blocked user accounts on multiple wikis, we will identify patterns across these accounts and assign weights based on each attribute's relative influence on block rates, for use in calculating a user account reputation score. The success of this hypothesis would be measured by whether we can define a formula that combines weighted account attributes into an account reputation score that maps to blocked users.
|
|
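As an illustration of the formula this hypothesis describes, a weighted combination of account attributes could look like the following minimal sketch. The attribute names and weights are hypothetical, not the actual model under analysis:

```python
# Hypothetical sketch of an account-reputation score: each normalized
# account attribute (0.0-1.0) gets a weight reflecting its relative
# influence on block rates, and the weighted attributes combine into a
# single score. Names and weights here are illustrative only.

def reputation_score(attributes, weights):
    """Higher scores mean the account looks more like previously blocked accounts."""
    return sum(weights.get(name, 0.0) * value for name, value in attributes.items())

weights = {
    "created_recently": 0.5,   # account age below some cutoff
    "no_user_page": 0.2,
    "high_revert_ratio": 0.9,  # many of the account's edits were reverted
}

account = {"created_recently": 1.0, "no_user_page": 1.0, "high_revert_ratio": 0.3}
print(round(reputation_score(account, weights), 2))  # 0.97
```

Defining such a formula is the measurable output of the hypothesis; validating it would mean checking that high scores concentrate among accounts that were actually blocked.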
| WE4.2.10
|
If we add two more data points to the client hints collection pipeline, we will have more entropy to better identify sockpuppets and potential ban evasion. We will know we are successful when we are able to use the client hints data to identify X% of confirmed sockpuppets on en:Wikipedia:Sockpuppet investigations, or to identify Y% of suspected ban evasion pairs. This hypothesis directly contributes to the KR by providing new signals (browser canvas fingerprint, list of fonts) that will allow CheckUsers to more precisely target sockpuppets and accounts attempting to evade bans.
|
|
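To illustrate how added client-hints data points increase entropy, the collected signals could be canonicalized and hashed into a comparable fingerprint. This is a sketch only; the signal names below are assumptions, not the actual pipeline fields:

```python
# Illustrative sketch: hash a set of client-hint signals into one
# fingerprint so two accounts can be compared. Each added signal (more
# entropy) makes unrelated accounts less likely to collide.
import hashlib

def fingerprint(signals: dict) -> str:
    canonical = "|".join(f"{k}={signals[k]}" for k in sorted(signals))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

a = fingerprint({"platform": "Windows", "fonts": "Arial,Calibri", "canvas": "9f2c"})
b = fingerprint({"platform": "Windows", "fonts": "Arial,Calibri", "canvas": "9f2c"})
print(a == b)  # True: identical signals produce matching fingerprints
```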
| WE4.2.14b
|
If we introduce IP reputation data as AbuseFilter variables, we will enable mitigations that can reduce the volume of vandalism, spam, and abuse submissions.
Context: This directly contributes to the KR goal by introducing a new signal (IP reputation) that allows for more precision in mitigations (only actions matching the variable are impacted). We could measure the impact of this hypothesis by examining the volume of reverted edits on wikis before and after the variables are introduced. (Other ideas?) We would initially introduce variables like “is likely a VPN” or “is likely a proxy”. We could also consider exposing other variables, depending on discussions in T354599: Make IP reputation available as a variable in AbuseFilter.
|
|
| WE4.2.14a
|
If we analyze IP reputation data associated with problematic editing activity and user accounts, we will be able to prioritize a set of IP reputation facets that can be provided as variables in AbuseFilter. This analysis would then be used by WE4.2.14b later in Q3 to build out the variables in AbuseFilter, along with specific guidance about which mitigations would be reasonable to use alongside a given set of IP reputation variables. For example, the recommended mitigation for one IP reputation variable might be to block edits outright, while the recommended mitigation for a different IP reputation variable might be to tag the edit for further review, or to show a CAPTCHA.
|
|
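The per-variable mitigation guidance described above could be sketched as a simple lookup. The variable and mitigation names here are hypothetical placeholders, not the actual AbuseFilter variables under discussion:

```python
# Hypothetical mapping from an IP-reputation variable to its recommended
# mitigation. When several variables match an edit, the strongest
# recommendation wins. All names are illustrative assumptions.

MITIGATIONS = {
    "ip_likely_vpn": "tag_for_review",
    "ip_likely_proxy": "show_captcha",
    "ip_known_attack_source": "block_edit",
}

# Ordered weakest to strongest.
SEVERITY = ["tag_for_review", "show_captcha", "block_edit"]

def recommended_mitigation(variables):
    """Return the strongest recommended mitigation for the matched variables."""
    matched = [MITIGATIONS[v] for v in variables if v in MITIGATIONS]
    if not matched:
        return None
    return max(matched, key=SEVERITY.index)

print(recommended_mitigation(["ip_likely_vpn", "ip_likely_proxy"]))  # show_captcha
```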
| WE4.2.18a
|
If we design and build a clickable component to display public data related to user account reputation to functionaries, we will be able to learn if this is useful to them by observing the number of repeat usages of the tool.
|
|
| WE4.3.3b
|
If we deploy a proof of concept of the 'Liberica' load balancer, we will measure a 33% improvement in our capacity to handle TCP SYN floods.
|
|
| WE 4.3.6b
|
If we integrate the output of the models we built in WE 4.3.1 with the dynamic thresholds of per-IP concurrency limits we've built for our TLS terminators in WE 4.3.2, we should be able to automatically neutralize attacks with 20% more volume, as measured with the simulation framework we're building.
|
|
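A minimal sketch of the mechanism this hypothesis combines: per-IP concurrency limits whose thresholds a model can tighten dynamically during an attack. This is an assumption-laden illustration, not the actual TLS-terminator implementation:

```python
# Illustrative per-IP concurrency limiter with dynamic thresholds: a
# model output can lower the limit for a suspicious IP while other
# clients keep the default. Not the production implementation.
from collections import defaultdict

class ConcurrencyLimiter:
    def __init__(self, default_limit=10):
        self.default_limit = default_limit
        self.overrides = {}             # per-IP thresholds set by the model
        self.active = defaultdict(int)  # currently open connections per IP

    def set_threshold(self, ip, limit):
        self.overrides[ip] = limit

    def try_acquire(self, ip):
        limit = self.overrides.get(ip, self.default_limit)
        if self.active[ip] >= limit:
            return False                # reject: over the concurrency limit
        self.active[ip] += 1
        return True

    def release(self, ip):
        self.active[ip] = max(0, self.active[ip] - 1)

limiter = ConcurrencyLimiter(default_limit=2)
limiter.set_threshold("203.0.113.9", 1)    # model flags this IP as suspicious
print(limiter.try_acquire("203.0.113.9"))  # True
print(limiter.try_acquire("203.0.113.9"))  # False: dynamic limit reached
```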
| WE 4.3.8
|
If we deploy the Liberica load balancers to all datacenters, we will increase our capacity to handle TCP SYN floods by 33% everywhere.
|
|
| WE 4.3.9
|
If we establish and follow a verified procedure for the regular testing of large-scale abuse scenarios, then we will consistently measure and improve our ability to respond effectively to such incidents.
|
|
| WE 4.3.10
|
If we define a policy for review and maintenance of requestctl rules, we will keep the system understandable and manageable over time.
|
|
| WE 4.3.11
|
If we can identify patterns and separate web scraping from general traffic, we will be able to create systematic reporting, reduce that traffic, and maintain the sustainability of our serving infrastructure.
|
|
| WE 4.4.3
|
If we improve the interface of the iOS app, we will be able to clearly communicate how temporary accounts work to users as they edit without logging in, and the iOS app will be prepared for the imminent release of temporary accounts to all projects.
|
|
| WE 4.4.4
|
If we update the data models in the data lake, and the corresponding data pipelines and dashboards, to accurately represent the new user account types, we'll be able to provide accurate analytics reporting related to activities of corresponding user types.
|
|
| WE 4.4.5
|
If we resolve all remaining product, design, and legal blockers for the engineering work that needs to be done before the major pilot deployment, we will be able to complete the engineering work on time for the next round of pilot deployment.
|
|
| WE5.1.9
|
If we enable Parsoid on Incubator and all newly created wikis by Q2, we’ll further ensure sustainability by not allowing the number of wikis that run on the legacy parser to grow. We will measure the success of this rollout through detailed evaluations using the Confidence Framework reports, with a particular focus on Visual Diff reports and the metrics related to performance and usability. Additionally, we will assess the reduction in the list of potential blockers, ensuring that critical issues are addressed prior to wider deployment.
|
|
| WE5.1.11
|
The Observability team aims to sunset Graphite by enabling read-only mode and disabling new metric ingestion by the end of Q3 FY2024/2025. To achieve this goal, the team has set a 90% target for converting the remaining dashboards and retiring legacy metrics and panels that point to Graphite metrics.
|
|
| WE5.1.12
|
If we release an interactive documentation sandbox for MediaWiki REST APIs, it will introduce a repeatable pattern for low maintenance, high quality API documentation while making the APIs easier to adopt for developers around the world. This will ensure that our API documentation is fully up to date, testable, and localized for generations of developers, while reducing the maintenance cost and increasing sustainability for API publishers.
|
|
| WE5.1.13
|
If we roll out SUL3 for all existing accounts and new account creation across all wikis, we will ensure compatibility with browser anti-tracking measures and improve security, by moving authentication to a dedicated domain that requires user interaction and further prevents XSS vulnerabilities.
|
|
| WE5.2.5
|
If we model at least one more page state change (e.g. PageDelete) as a PHP event and drive further adoption of in-process domain events across MediaWiki components and extensions currently utilizing event-like hooks, then we will build confidence in events as a platform sustainability pattern by improving component boundaries, improving interface flexibility, and reducing high risk boilerplate code.
|
|
| WE5.2.6
|
If we explore designing an architecture for serializing and broadcasting events generated within MediaWiki core, we will create a foundation for offering first class event support that will enable us to consume events outside of the originating MediaWiki PHP process (e.g. JobQueue, EventBus). This will make MediaWiki data more reusable beyond the MediaWiki platform.
|
|
| WE5.2.7
|
If we identify and align on a set of domains that can be used for MediaWiki platform events by the end of Q3, we will have an initial map of core component boundaries and can improve consistency across MediaWiki interfaces by utilizing the same domains for the MediaWiki REST API modules.
|
|
| WE5.2.8
|
If we clearly define the concept of extension interfaces in the MediaWiki documentation, we can make it easier to develop new functionality on top of MediaWiki and provide a clearer path for defining new extension interfaces, such as Domain Events. We will measure this by identifying places in the documentation where extension interfaces are presented as “extension types” and replacing 100% of those instances.
|
|
| WE5.4.3
|
If we enable developers with PHP 8.1 MediaWiki images and infrastructure for testing them on Kubernetes, they will be able to validate and certify them for deployment to production. If we also develop infrastructure for progressive traffic migration and use it to safely migrate production to PHP 8.1, this will help MediaWiki drop unsupported PHP versions in the upcoming May release. Success will be observed by the ability to ramp up production traffic to PHP 8.1 instances.
|
|
| WE5.4.4
|
If we decouple the legacy dumps processes from their current bare-metal hosts and instead run them as workloads on the DSE Kubernetes cluster, this will bring demonstrable benefit to the maintainability of these data pipelines and facilitate the upgrade to PHP 8.1 by using shared MediaWiki containers.
|
|
| WE5.4.6
|
If the beta cluster is configured to run MediaWiki with PHP 8.1 then the Data Platform Engineering group and their SRE team will be able to validate whether the existing dumps code functions correctly, or whether any significant functional changes would be required.
|
|
| WE5.5.1
|
If, by the end of January, we are able to measure and monitor Wikimedia hosted dumps traffic using log data, we will have clarity on how users are consuming the different dumps formatting options and access points. This will unblock additional metrics for overall consumption across streams, and improve our understanding of what users care about in terms of recency, data completion, and structure, so that we can tailor the overall API strategy accordingly.
|
|
| WE5.5.2
|
If, by the end of Q3, we create a consolidated view of developer personas and use cases collected through a listening and discovery tour, then we will uncover lesser understood gaps and opportunities in this space. This will leverage existing work completed by stakeholder teams in their respective areas (eg: Dumps, WME), in addition to creating new insights by conducting interviews with WMF staff, technical volunteers, and high impact content reuse partners (eg: WME customers and prospects).
|
|
| WE6.1.7
|
If we review the user feedback, decide on a code search and code browsing solution, deploy it to the production infrastructure as an officially supported service and enable indexing of both existing and new repositories from both code tracking systems, we will increase the scope of code that is indexed and searchable and simplify the process of locating code in day to day operations as well as during incident response.
|
|
| WE6.1.8
|
If we analyze the documentation metrics scores from our test dataset, we can evaluate the usefulness and effectiveness of the draft metrics, collect feedback, and provide actionable insights for implementing automated metrics computation.
|
|
| WE6.1.9
|
If we transition 5 additional access groups to management within the Identity Management system, it will enhance access governance by improving efficiency, significantly reducing toil, and improving the onboarding experience for incoming Wikimedia staff and new members of the technical communities.
|
|
| WE6.2.2
|
If we replace the backend infrastructure of our existing shared MediaWiki development and testing environments (moving from Apache virtual servers to Kubernetes), we will be able to extend their uses to developing MediaWiki services, in addition to the existing ability to develop MediaWiki core, extensions, and skins in an isolated environment. We will develop one environment that includes MediaWiki, one or more extensions, and one or more services.
|
|
| WE6.2.3
|
If we create a new deployment UI that provides a web interface open to existing deployers, backporters will have a shared view of deployments in progress and greater visibility into them.
|
|
| WE6.2.5
|
If we publish a planning doc to move single-version routing out of MediaWiki and gather comments from stakeholders on the implementation, then we will reduce friction during implementation.
|
|
| WE6.2.6
|
If we gather feedback from QTE, SRE, and individuals with domain-specific knowledge, and use their feedback to write a design document for deploying and using the wmf/next OCI container, then we will reduce friction when we start deploying that container.
|
|
| WE6.2.7
|
If we make a deployment web UI available behind our single sign-on system and open it to the Wikimedia development community, it will increase the number of backport deployers.
|
|
| WE6.2.8
|
Building on Catalyst's ability to deliver pre-merge test environments of MediaWiki and its extensions & skins on Kubernetes, if we facilitate deployments of pre-merge patches for MediaWiki services by running pre-merge tests for Wikifunctions, then contributors will be able to test more MediaWiki projects with stable, well-defined, isolated test environments.
|
|
| WE6.2.9
|
If we test the proposed MediaWiki routing implementation with a single wiki, we will have proven the plan works, can proceed with an accelerated rollout to other wikis, and will be able to route a single-version container to Wikimedia’s wiki hosting infrastructure.
|
|
| WE6.3.7
|
By establishing detailed measurement criteria and evolution guidelines for our sustainability framework, we will create an actionable scoring system for platform improvements.
|
|
| WE6.3.8
|
Engaging with prospective users to explore Toolforge UI’s early design prototype will help us uncover improvement opportunities and risks to be addressed in a follow-up iteration.
|
|