Tone Check AB Test Report
Overview
The Wikimedia Foundation’s Editing team is working on a set of improvements to the visual editor to help new volunteers understand and follow some of the policies necessary to make constructive changes to Wikipedia projects.
In this A/B test, we are evaluating the impact of Tone Check. Tone Check is an Edit Check that uses a language model to prompt people adding promotional, derogatory, or otherwise subjective language to consider “neutralizing” the tone of what they are writing. Tone Check is the first Edit Check that uses machine learning: a BERT language model, initially selected and fine-tuned by the Research team, identifies biased language within the new text people are attempting to publish to Wikipedia.
This A/B test will help us make the following decision:
What changes – if any – will we make to the Tone Check UX, and/or the model that enables it, before we can be confident in the following?
Newcomers and Junior Contributors that encounter Tone Check are more likely to publish new content edits in the main namespace that are devoid of biased language.
Newcomers and Junior Contributors will intuitively interact with the Tone Check experience in ways that are NOT disruptive to them or to the wikis.
This work is guided by the Wikimedia Foundation Annual Plan, specifically by the Wiki Experiences 1.1 objective key result: Increase the rate at which editors with ≤100 cumulative edits publish constructive edits on mobile web by 4%, as measured by controlled experiments (by the end of Q2).
You can find more details about this check on the Project Page.
The Tone Check A/B test was deployed on 3 September 2025 to French, Japanese, and Portuguese Wikipedias.
Methodology
AB Test Design
The team ran an AB test from 3 September 2025 through 28 January 2026 to determine the impact of presenting Tone Check to eligible editing sessions and evaluate the extent to which the feature, in its current form, warrants being deployed to all wikis.
Specifically, we want to test the following hypothesis:
If we prompt newcomers and Junior Contributors to reconsider the tone they are writing in when software detects them using – what experienced volunteers would agree is – non-neutral/peacock language, then we will decrease the percentage of new content edits newcomers publish that are reverted on the grounds of WP:NPOV (and related policies).
During this experiment, 50% of users editing a desktop or mobile main namespace page using Visual Editor were randomly assigned to the test group and could be shown Tone Check if their edit met the specified requirements during their edit, and 50% were randomly assigned to the control group and could not be shown Tone Check.
The test included all mobile web and desktop contributors (both registered and unregistered) to the 3 participating wikis that started an edit with Visual Editor. Users remained in the same test group for the duration of the test. We limited the analysis to edits completed by unregistered users and users with 100 or fewer edits, as those are the users that would be shown Tone Check under the default config settings.
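The persistence of the 50/50 assignment described above can be pictured with a small sketch. This is not the production MediaWiki bucketing code; it only illustrates how deriving the group from a stable per-user token makes assignment deterministic, so the same user always lands in the same group for the duration of the test.

```r
# Illustrative only: not the actual MediaWiki bucketing logic.
# Hashing a stable per-user token and taking its parity yields a
# deterministic 50/50 split: the same token always maps to the same group.
assign_bucket <- function(user_token) {
  hash_int <- sum(utf8ToInt(user_token) * seq_along(utf8ToInt(user_token)))
  if (hash_int %% 2 == 0) "control" else "test"
}

assign_bucket("user-abc123")  # repeated calls always return the same group
```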
Figure 1: Tone Check AB Test Bucketing Overview
As shown in Figure 1, not all edits bucketed in the A/B test experiment met the requirements for being shown Tone Check. Tone Check was shown in about 11% of all published new content edits in the test group (989 edits). It was shown at similar rates on both desktop and mobile web.
In this analysis, we compared all new content edits that were shown Tone Check to edits that were eligible but not shown Tone Check in the control group (based on instrumentation added in this task). This comparison was done to ensure the analysis is focused on the actual effects of the feature.
Evaluation Plan
We used a set of primary and secondary metrics to evaluate the impact of this feature. We also reviewed a set of guardrails to ensure that Tone Check was not disruptive to the contributor or to the Wikipedias. These metrics are documented in the task.
For each metric, we reviewed the following dimensions: overall by experiment group (test and control), by platform (mobile web or desktop), by user experience and status, and by partner Wikipedia. We also reviewed some indicators such as edit completion rate by the number of checks shown within a single editing session to determine if there was a significant impact at a certain number of checks presented.
Note: For the user experience analysis, we split newer editors into three experience level groups: (1) unregistered, (2) newcomer (registered user making their first edit on Wikipedia), and (3) Junior Contributor (user that has made between 1 and 100 edits).
Please refer to the data collection notebook for more details on the steps taken to collect the data reviewed in this report.
Summary of Results
New content edits published without biased language
Tone Check successfully decreases the frequency of non-neutral language in published content. Users with access to Tone Check were 15.6% less likely to publish edits containing non-neutral language (falling from 9.6% to 8.1%; a -1.5 pp decrease) compared to the control group. We have 99.8% confidence that this improvement is directly attributable to the tool.
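The relative and absolute figures quoted throughout this report are linked by simple arithmetic. As a quick sanity check on the headline numbers:

```r
# Relating the absolute (percentage-point) and relative changes reported above
control_rate <- 0.096   # control group: 9.6% of edits contained non-neutral language
test_rate    <- 0.081   # test group: 8.1%

absolute_change <- test_rate - control_rate        # change in percentage points
relative_change <- absolute_change / control_rate  # change relative to the control baseline

round(absolute_change, 3)   # -0.015  (-1.5 pp)
round(relative_change, 3)   # -0.156  (-15.6%)
```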
However, Tone Check’s level of impact depends heavily on the platform. Results confirm a highly significant impact on desktop, where we observed the largest reduction in the rate of edits containing non-neutral language. In contrast, there was no detectable effect yet on mobile web.
New content edits revert rate
Edits made by users shown Tone Check are also 15% less likely to be reverted than eligible control edits (29.5% → 25.1%; a -4.4 pp decrease).
This reduction is primarily driven by Junior Contributors. While we observed a statistically significant -33% relative [-10.2 pp] decrease in reverts for Junior Contributors, we did not confirm any change in the revert rate of newcomers or unregistered users. These trends indicate that Tone Check may be more effective for people who have already succeeded in completing at least one edit on a Wikipedia namespace. Since these users are more experienced, their edits are less likely to be reverted for other policy violations compared to registered users completing their first edit or unregistered users.
New Content edit revert rate: impact of removing non-neutral language
When a user removes non-neutral language in response to a Tone Check, the likelihood of that edit being reverted decreases significantly. Across both platforms, there was a -44.1% decrease in the revert rate for edits where the prompt was addressed. This confirms that Tone Check is highly effective at helping people identify and correct edits that would otherwise be reverted.
We observed decreases on both platforms, but there is a larger impact on desktop compared to mobile web. On desktop, we observed a significant -47% decrease [-13.4 pp] in revert rate for people who revised their text in response to Tone Check. On mobile web, there was a -14.8% decrease [-4.8 pp] in revert rate for edits where non-neutral language was removed. Mobile web edits appear to be inherently trickier for newcomers and are still more likely to be reverted compared to desktop edits, even when non-neutral language is removed.
Edit Completion Rate
Tone Check does not appear to be causing any significant disruption to most people’s editing experience. Edit completion rates for people shown Tone Check decreased only slightly, by 3.2% (-1.6 percentage points). This decrease was primarily concentrated on desktop (-2.6%), with no significant change on mobile web.
The decrease in completion rate does not exceed 10% until more than 10 tone checks are presented in a single editing session. For these edits, the completion rate decreased to 44.3% (a -12% decrease from the control). These edits represent only 3% of all edits and are potentially low-quality edits that we’d want to deter.
While completion rates slightly decreased for newcomers and unregistered users, they slightly increased for Junior Contributors, suggesting the check encourages and helps a portion of people complete their edit successfully.
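The bucketed completion-rate view described above boils down to a per-bucket mean of a completion flag. The snippet below sketches the shape of that computation on toy data (the vectors and values here are hypothetical, not drawn from the experiment):

```r
# Toy data: per-session completion flags by number-of-checks bucket.
# Values are hypothetical; this only illustrates the computation.
bucket    <- c("1", "1", "1", "over 10", "over 10", "over 10", "over 10")
completed <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)

# mean() of a logical vector gives the proportion TRUE, i.e. the completion rate
completion_rate <- tapply(completed, bucket, mean)
completion_rate  # named vector: one completion rate per bucket
```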
Constructive Edit Rate
Tone Check improved the rate of constructive edits by +6.2% [+4.4 percentage points]. We observed improvements in overall edit quality at each of the three partner Wikipedias.
Aligned with the revert rate findings, the magnitude of impact varies by platform. On desktop, constructive edit rate increased by +6.4%, while we observed no statistically significant change in mobile web constructive edits.
Tone Check appears especially effective at increasing the constructive edit rate of registered Junior Contributors, where we observed a +14.8% increase [+10.2 pp] in constructive edit rates. When limited to desktop edits, there was a +19.7% increase in constructive edits by Junior Contributors.
Retention Rate
We further found that people shown Tone Check were more likely to return, indicating that the feature results in a positive editing experience for most contributors.
People who encountered Tone Check are 24% more likely to return to make a constructive edit in their second week. Retention rates increased from 5.8% to 7.2% when Tone Check was shown (+1.4 percentage points).
We observed increases for both mobile web and desktop users and across all user types as well.
Guardrails
Tone Check is not causing significant disruption on either desktop or mobile web, based on analysis of the identified guardrails. The decline rate is lower than for other existing Edit Checks, and there was no spike in user blocks or revert rates.
Code
# load packages
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))

shhh({
    library(lubridate)
    library(ggplot2)
    library(dplyr)
    library(gt)
    library(IRdisplay)
    library(tidyr)
    # Modeling completed using the relax package developed by Mikhail Popov (WMF)
    # https://gitlab.wikimedia.org/repos/product-analytics/experimentation-lab/relax
    library(relax)
    set.seed()  # seed value not preserved in the source
})

# set preferences
options(dplyr.summarise.inform = FALSE)
options(repr.plot.width = 15, repr.plot.height = 10)

# colorblind-friendly palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
Data Cleaning
Code
# load tone check save data (initial dataset)
tone_check_publish_data_1 <- read.csv(
    file = 'data/tone_check_save_data_AB.tsv',
    header = TRUE,
    sep = '\t',
    stringsAsFactors = TRUE
)

# load tone check save data (second dataset)
# Second dataset was created to obtain updated event data while preserving the initial
# aggregated dataset, which could no longer be queried in the Data Lake due to data retention policies.
tone_check_publish_data_2 <- read.csv(
    file = 'data/tone_check_save_data_AB_pt2.tsv',
    header = TRUE,
    sep = '\t',
    stringsAsFactors = TRUE
)

# Combine the two datasets
tone_check_publish_data <- rbind(tone_check_publish_data_1, tone_check_publish_data_2)
Code
# Clean up dataset and rename fields to clarify meanings

# Set experience level group and factor levels
tone_check_publish_data <- tone_check_publish_data |>
    mutate(
        experience_level_group = case_when(
            user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
            user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
            user_edit_count > 0 & user_edit_count <= 100 ~ "Junior Contributor",
            user_edit_count > 100 ~ "Non-Junior Contributor"  # these users should already be filtered out of the dataset but adding to confirm
        ),
        experience_level_group = factor(
            experience_level_group,
            levels = c("Unregistered", "Newcomer", "Non-Junior Contributor", "Junior Contributor")
        )
    )

# rename test group field to clarify groups
tone_check_publish_data <- tone_check_publish_data |>
    mutate(
        test_group = factor(
            test_group,
            levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
            labels = c("control (eligible but not shown tone check)", "test (tone check shown)")
        )
    )

# rename platform from phone to mobile web to clarify meaning
tone_check_publish_data <- tone_check_publish_data |>
    mutate(
        platform = factor(
            platform,
            levels = c('phone', 'desktop'),
            labels = c("mobile web", "desktop")
        )
    )

# rename wiki values to human readable form
wiki_name_lookup <- c(
    "jawiki" = "Japanese Wikipedia",
    "ptwiki" = "Portuguese Wikipedia",
    "frwiki" = "French Wikipedia"
)

tone_check_publish_data <- tone_check_publish_data %>%
    mutate(wiki = recode(wiki, !!!wiki_name_lookup))
Code
# Set fields and factor levels to assess number of checks shown
tone_check_publish_data <- tone_check_publish_data |>
    mutate(
        multiple_checks_shown = case_when(
            test_group == "test (tone check shown)" & n_checks_shown == 1 ~ "one tone check",
            test_group == "test (tone check shown)" & n_checks_shown > 1 ~ "multiple tone checks",
            TRUE ~ "no tone checks"  # default if no conditions met
        ),
        multiple_checks_shown = factor(
            multiple_checks_shown,
            levels = c('no tone checks', 'one tone check', 'multiple tone checks')
        )
    )

# note these buckets can be adjusted as needed based on distribution of data
tone_check_publish_data <- tone_check_publish_data |>
    mutate(
        checks_shown_bucket = case_when(
            test_group == "test (tone check shown)" & is.na(n_checks_shown) ~ '0',
            test_group == "test (tone check shown)" & n_checks_shown == 1 ~ '1',
            test_group == "test (tone check shown)" & n_checks_shown == 2 ~ '2',
            test_group == "test (tone check shown)" & n_checks_shown >= 3 & n_checks_shown <= 5 ~ "3-5",
            test_group == "test (tone check shown)" & n_checks_shown >= 6 & n_checks_shown <= 10 ~ "6-10",
            test_group == "test (tone check shown)" & n_checks_shown > 10 ~ "over 10"
        ),
        checks_shown_bucket = factor(
            checks_shown_bucket,
            levels = c("0", "1", "2", "3-5", "6-10", "over 10")
        )
    )

# define set of all eligible edits to review (eligible in control and shown tone check in test)
# Note there are 5 edits in the control group that were identified as eligible in VEFU instrumentation
# but did not have the eligible tag applied
tone_check_publish_data <- tone_check_publish_data |>
    mutate(
        is_test_eligible = ifelse(
            (test_group == 'test (tone check shown)' & was_tone_check_shown_tag == 1) |
                (test_group == 'control (eligible but not shown tone check)' & is_tone_check_eligible == 1),
            'eligible',
            'not eligible'
        ),
        is_test_eligible = factor(is_test_eligible, levels = c("eligible", "not eligible"))
    )

# use tone check eligible tag to define test edits where tone check was addressed (is_tone_check_eligible == 0)
tone_check_publish_data <- tone_check_publish_data |>
    mutate(
        is_tone_check_addressed = case_when(
            test_group == 'control (eligible but not shown tone check)' & is_tone_check_eligible == 1 ~ 'Eligible control edits',
            test_group == 'test (tone check shown)' & was_tone_check_shown_tag == 1 & is_tone_check_eligible == 0 ~ 'Tone check shown and addressed',
            TRUE ~ "Tone check shown but not addressed"
        ),
        is_tone_check_addressed = factor(
            is_tone_check_addressed,
            levels = c('Eligible control edits', 'Tone check shown but not addressed', 'Tone check shown and addressed')
        )
    )

# We also removed all edits that were published before the model returned an evaluation.
# These events would not have the `editcheck-tone` tag applied to indicate if the published edit
# includes promotional language.
# This was done using events added in [T388716](https://phabricator.wikimedia.org/T388716#10872915).
tone_check_publish_data <- tone_check_publish_data |>
    filter(was_saved_before_check == FALSE)
New content edits published without biased language (Primary Metric)
Hypothesis:
The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will not contain non-neutral language.
Methodology
: As part of this hypothesis, we first evaluated if Tone Check reduces the frequency of non-neutral language in published edits.
We reviewed the proportion of all new content edits published without biased language (identified by the editcheck-tone tag, created in T388716 to identify when the model detected non-neutral language at the time of publishing).
Overall
Code
tone_issue_edits_overall <- tone_check_publish_data |>
    filter(is_new_content == TRUE) |>  # limit to new content edits
    group_by(test_group) |>
    summarise(
        n_edits = n_distinct(editing_session),
        n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1])  # tone issues detected
    ) |>
    mutate(
        non_neutral_language_rate = paste0(round(n_tone_issues / n_edits * 100, 1), "%")
    )
Code
# plot visualization of non-neutral edits
dodge <- position_dodge(width = 0.9)

tone_issue_edits_overall |>
    ggplot(aes(x = test_group, y = n_tone_issues / n_edits, fill = test_group)) +
    geom_col(position = 'dodge') +
    scale_y_continuous(labels = scales::percent) +
    geom_text(
        aes(label = paste(non_neutral_language_rate, "\n", n_tone_issues, "edits\nwith non-neutral language")),
        fontface = "bold",
        vjust = 1.2,
        size = 10,
        color = "white"
    ) +
    scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
    scale_x_discrete(
        breaks = c("control (eligible but not shown tone check)", "test (tone check shown)"),
        # renaming as this metric is not limited to shown tone checks
        labels = c("Control (no tone check)", "Test (tone check available)")
    ) +
    labs(
        y = "Percent of new content edits",
        x = "Experiment Group",
        title = "New content edits with non-neutral language",
        caption = "Limited to published new content edits"
    ) +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 24),
        axis.text.x = element_text(size = 24),
        axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
        legend.position = "none",
        axis.line = element_line(colour = "black")
    )
Tone Check successfully decreases the prevalence of non-neutral language in published content. Across both platforms, there was a -15.6% decrease [-1.5 percentage points] in the proportion of new content edits published with non-neutral language for the test group, where Tone Check was available.
Note: The rate observed for the control (9.6%) is similar to the rates we observed in an initial baseline analysis estimating the frequency of these types of edits, and to rates identified in the leading indicator analysis.
By Platform
Code
tone_issue_edits_byplatform <- tone_check_publish_data |>
    filter(is_new_content == TRUE) |>
    group_by(platform, test_group) |>
    summarise(
        n_edits = n_distinct(editing_session),
        n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1])  # tone issues detected
    ) |>
    mutate(
        non_neutral_language = paste0(round(n_tone_issues / n_edits * 100, 1), "%")
    ) |>
    select(-c(n_edits, n_tone_issues)) %>%  # removing granular data columns
    gt() |>
    tab_header(title = md("New Content edits with non-neutral language by platform")) |>
    opt_stylize() |>
    cols_label(
        platform = "Platform",
        test_group = "Experiment Group",
        #n_edits = "Number of published edits",
        #n_tone_issues = "Number of edits with non-neutral language",
        non_neutral_language = "Proportion of edits with non-neutral language"
    ) |>
    tab_source_note(gt::md('Limited to published new content edits'))

display_html(as_raw_html(tone_issue_edits_byplatform))
New Content edits with non-neutral language by platform

| Platform | Experiment Group | Proportion of edits with non-neutral language |
|---|---|---|
| mobile web | control (eligible but not shown tone check) | 9.2% |
| mobile web | test (tone check shown) | 9.4% |
| desktop | control (eligible but not shown tone check) | 9.7% |
| desktop | test (tone check shown) | 7.6% |

Limited to published new content edits
Trends vary by platform. On desktop, we observed a -21% decrease [-2 pp] in the proportion of edits with non-neutral language. There was no statistically significant change on mobile web.
By User Experience
Code
tone_issue_edits_byuserexp <- tone_check_publish_data |>
    filter(is_new_content == TRUE) |>
    group_by(experience_level_group, test_group) |>
    summarise(
        n_edits = n_distinct(editing_session),
        n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1])  # tone check issues detected
    ) |>
    mutate(
        non_neutral_language = paste0(round(n_tone_issues / n_edits * 100, 1), "%")
    ) |>
    select(-c(n_edits, n_tone_issues)) %>%  # removing granular data columns
    gt() |>
    tab_header(title = "New content edits with non-neutral language by user experience") |>
    opt_stylize() |>
    cols_label(
        experience_level_group = "User Experience",
        test_group = "Experiment Group",
        #n_edits = "Number of published edits",
        #n_tone_issues = "Number of edits with non-neutral language",
        non_neutral_language = "Proportion of edits with non-neutral language"
    ) |>
    tab_source_note(gt::md('Limited to published new content edits'))

display_html(as_raw_html(tone_issue_edits_byuserexp))
New content edits with non-neutral language by user experience

| User Experience | Experiment Group | Proportion of edits with non-neutral language |
|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 13.2% |
| Unregistered | test (tone check shown) | 12.5% |
| Newcomer | control (eligible but not shown tone check) | 13.1% |
| Newcomer | test (tone check shown) | 10.6% |
| Junior Contributor | control (eligible but not shown tone check) | 8% |
| Junior Contributor | test (tone check shown) | 6.6% |

Limited to published new content edits
Tone Check decreases the frequency of non-neutral language for all reviewed user types.
We saw the largest absolute decrease in the proportion of non-neutral edits published by newcomers (-19% [-2.5 pp]). Unregistered users saw the smallest change (-5.3% [-0.7 pp]).
By Wikipedia
Code
tone_issue_edits_bywiki <- tone_check_publish_data |>
    filter(is_new_content == TRUE) |>
    group_by(wiki, test_group) |>
    summarise(
        n_edits = n_distinct(editing_session),
        n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1])  # tone issues detected
    ) |>
    mutate(
        non_neutral_language = paste0(round(n_tone_issues / n_edits * 100, 1), "%")
    ) |>
    select(-c(n_edits, n_tone_issues)) %>%  # removing granular data columns
    gt() |>
    tab_header(title = "New content edits with non-neutral language by Wikipedia") |>
    opt_stylize() |>
    cols_label(
        wiki = "Wikipedia",
        test_group = "Experiment Group",
        #n_edits = "Number of published edits",
        #n_tone_issues = "Number of edits with non-neutral language",
        non_neutral_language = "Proportion of edits with non-neutral language"
    ) |>
    tab_source_note(gt::md('Limited to new content edits'))

display_html(as_raw_html(tone_issue_edits_bywiki))
New content edits with non-neutral language by Wikipedia

| Wikipedia | Experiment Group | Proportion of edits with non-neutral language |
|---|---|---|
| French Wikipedia | control (eligible but not shown tone check) | 11.5% |
| French Wikipedia | test (tone check shown) | 10.5% |
| Japanese Wikipedia | control (eligible but not shown tone check) | 6.7% |
| Japanese Wikipedia | test (tone check shown) | 3.2% |
| Portuguese Wikipedia | control (eligible but not shown tone check) | 6.9% |
| Portuguese Wikipedia | test (tone check shown) | 6.1% |

Limited to new content edits
We also observed decreases in the proportion of edits with non-neutral language at each partner Wikipedia. At Japanese Wikipedia, there was a significant -52.2% decrease [-3.5 pp] in the proportion of edits with non-neutral language when Tone Check was shown to eligible edits.
Confirming the impact of Tone Check on edits published without biased language
We analyzed the above results using two complementary statistical frameworks (Bayesian and Frequentist) to correctly infer the impact of offering Tone Check on decreasing the likelihood a new content edit includes biased language when published. This allows us to confirm if the observed changes detailed above are statistically significant (did not occur due to random chance).
Since multiple edits can be made by the same user, we first calculated the rates for each user (proportion of all edits saved by a user that include non-neutral language).
Note: This is an implementation of the Bayesian and Frequentist engines also used in Test Kitchen’s automated analytics.
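To make the frequentist side of this comparison concrete, here is a simplified sketch using base R’s prop.test() on aggregate counts. The counts below are hypothetical, chosen only to match the reported 9.6% and 8.1% rates; the relax package used in this report instead works on user-level rates, so this illustrates the underlying idea rather than reproducing the actual analysis.

```r
# Hypothetical aggregate counts matching the reported rates (9.6% vs 8.1%);
# the real analysis uses per-user rates via the relax package.
control_issues <- 480; control_edits <- 5000   # 480 / 5000 = 9.6%
test_issues    <- 405; test_edits    <- 5000   # 405 / 5000 = 8.1%

# Two-sample test of equal proportions (test vs control)
result <- prop.test(c(test_issues, control_issues), c(test_edits, control_edits))

result$estimate   # sample proportions: 0.081 and 0.096
result$p.value    # a small p-value means the difference is unlikely under chance
```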
Code
# calculate the proportion for each user
tone_issue_edits_overall_byuser <- tone_check_publish_data |>
    filter(is_new_content == TRUE) |>
    group_by(test_group, platform, user_id) |>
    summarise(
        n_edits = n_distinct(editing_session),
        n_tone_issues = n_distinct(editing_session[is_tone_check_eligible == 1])  # tone issues detected
    ) |>
    mutate(non_neutral_language_rate = n_tone_issues / n_edits)
Code
# rename field names to align with relax package naming convention
tone_issue_edits_overall_byuser <- tone_issue_edits_overall_byuser |>
    mutate(
        variation = factor(
            test_group,
            levels = c("control (eligible but not shown tone check)", "test (tone check shown)"),
            labels = c("control", "treatment")
        )
    )

tone_issue_edits_overall_byuser$outcome <- tone_issue_edits_overall_byuser$non_neutral_language_rate
Code
overall_impact_toneissues <- tone_issue_edits_overall_byuser |>
    analyze_relative_lift(metric_type = "proportion") |>
    gt() |>
    tab_header(
        title = md("**Evaluating Tone Check impact on edits published with non-neutral language**"),
        subtitle = md("Difference in Metric (Test Group - Control Group)")
    ) |>
    tab_spanner(
        label = md("**Bayesian Analysis**"),
        columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
    ) |>
    tab_spanner(
        label = md("**Frequentist Analysis**"),
        columns = c(estimate_freq, p_value, conf_lower, conf_upper)
    ) |>
    # Rename columns for clarity
    cols_label(
        estimate_bayes = md("Point Estimate"),
        chance_to_win = md("Chance to Win"),
        cred_lower = md("95% CI Lower"),
        cred_upper = md("95% CI Upper"),
        estimate_freq = md("Point Estimate"),
        p_value = md("*p*-value"),
        conf_lower = md("95% CI Lower"),
        conf_upper = md("95% CI Upper")
    ) |>
    # Apply formatting (decimals and CI grouping)
    fmt_number(
        columns = everything(),
        decimals = 3  # use 3 decimals for precision
    ) |>
    # Highlight key finding
    tab_footnote(
        footnote = md("The 95% intervals do not cross zero, indicating the results are statistically significant."),
        locations = cells_column_labels(columns = c(cred_lower, conf_lower))
    ) %>%
    # Style the table
    tab_options(
        table.border.top.color = "lightgray",
        column_labels.border.bottom.color = "black",
        column_labels.border.bottom.width = px(2),
        data_row.padding = px(4)
    )

display_html(as_raw_html(overall_impact_toneissues))
Evaluating Tone Check impact on edits published with non-neutral language
Difference in Metric (Test Group - Control Group)

| Analysis | Point Estimate | Chance to Win / *p*-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| Bayesian | −0.136 | 0.002 | −0.228 | −0.043 |
| Frequentist | −0.139 | 0.004 | −0.232 | −0.046 |

The 95% intervals do not cross zero, indicating the results are statistically significant.
Analysis of the A/B test data confirms a statistically significant reduction in non-neutral language across all platforms. We have high confidence (>99.8%) that this effect is driven by Tone Check.
Code
# check by platform numbers
platform_impact_toneissues <- tone_issue_edits_overall_byuser |>
    group_by(platform) |>
    group_modify(~ analyze_relative_lift(.x, metric_type = "proportion")) |>
    gt() |>
    tab_header(
        title = md("**Evaluating Tone Check impact on edits published in non-neutral language**"),
        subtitle = md("Difference in Metric (Test Group - Control Group)")
    ) |>
    tab_spanner(
        label = md("**Bayesian Analysis**"),
        columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)
    ) |>
    tab_spanner(
        label = md("**Frequentist Analysis**"),
        columns = c(estimate_freq, p_value, conf_lower, conf_upper)
    ) |>
    # Rename columns for clarity
    cols_label(
        platform = md("Platform"),
        estimate_bayes = md("Point Estimate"),
        chance_to_win = md("Chance to Win"),
        cred_lower = md("95% CI Lower"),
        cred_upper = md("95% CI Upper"),
        estimate_freq = md("Point Estimate"),
        p_value = md("*p*-value"),
        conf_lower = md("95% CI Lower"),
        conf_upper = md("95% CI Upper")
    ) |>
    # Apply formatting (decimals and CI grouping)
    fmt_number(
        columns = everything(),
        decimals = 3  # use 3 decimals for precision
    ) |>
    # Highlight key finding
    tab_footnote(
        footnote = md("Where the 95% intervals do not cross zero, the results are statistically significant."),
        locations = cells_column_labels(columns = c(cred_lower, conf_lower))
    ) %>%
    # Style the table
    tab_options(
        table.border.top.color = "lightgray",
        column_labels.border.bottom.color = "black",
        column_labels.border.bottom.width = px(2),
        data_row.padding = px(4)
    )

display_html(as_raw_html(platform_impact_toneissues))
Evaluating Tone Check impact on edits published in non-neutral language
Difference in Metric (Test Group - Control Group)

| Platform | Analysis | Point Estimate | Chance to Win / *p*-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|
| mobile web | Bayesian | −0.013 | 0.444 | −0.192 | 0.167 |
| mobile web | Frequentist | −0.014 | 0.883 | −0.203 | 0.174 |
| desktop | Bayesian | −0.186 | 0.000 | −0.291 | −0.081 |
| desktop | Frequentist | −0.192 | 0.000 | −0.299 | −0.086 |

Where the 95% intervals do not cross zero, the results are statistically significant.
However, Tone Check’s effectiveness depends heavily on the platform. Results confirm a highly significant impact on Desktop (p < 0.001), where the reduction was most pronounced. In contrast, there was no detectable effect on Mobile Web (p = 0.883), suggesting that people respond to Tone Check differently when making mobile edits.
New Content Edit Revert Rate (Primary Metric)
Hypothesis:
The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will not contain non-neutral language.
Methodology
In addition to evaluating if Tone Check reduces the frequency of non-neutral language, we also wanted to assess the impact of Tone Check on edit revert rate.
To do this, we reviewed the proportion of all published new content edits where Tone Check was shown at least once in an editing session (identified by the editCheck-tone-shown tag) that were reverted within 48 hours. This was compared to the revert rate of edits in the control group identified as eligible for Tone Check (identified by the editcheck-tone tag).
Note:
This metric does not consider the final text of the published edit. It’s possible edits shown Tone Check still included non-neutral language at the time of publishing if the Tone Check was not addressed. It’s also possible that non-neutral language was removed but the edit was still reverted for other reasons. The purpose of this metric is to evaluate whether presenting a tone check to a user while editing increases the overall quality of new content edits.
Overall
Code
tone_check_reverts_overall <- tone_check_publish_data |>
    filter(is_new_content == TRUE & is_test_eligible == 'eligible') %>%  # limit to edits shown or eligible to be shown tone check
    group_by(test_group) |>
    summarise(
        n_edits = n_distinct(editing_session),
        n_reverts = n_distinct(editing_session[was_reverted == TRUE])  # reverted within 48 hours
    ) %>%
    mutate(
        revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")
    )
Code
# plot visualization of overall edit revert rates
dodge <- position_dodge(width = 0.9)

tone_check_reverts_overall |>
    ggplot(aes(x = test_group, y = n_reverts / n_edits, fill = test_group)) +
    geom_col(position = 'dodge') +
    scale_y_continuous(labels = scales::percent) +
    geom_text(
        aes(label = paste(revert_rate, "\n", n_reverts, "reverted edits")),
        fontface = "bold",
        vjust = 1.2,
        size = 10,
        color = "white"
    ) +
    scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
    labs(
        y = "Percent of edits reverted",
        x = "Experiment Group",
        title = "New content edit revert rate",
        caption = "Limited to published new content edits shown or eligible to be shown Tone Check"
    ) +
    theme(
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 24),
        axis.text.x = element_text(size = 24),
        axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
        legend.position = "none",
        axis.line = element_line(colour = "black")
    )
Edits by people shown Tone Check are less likely to be reverted. Across both platforms, there was a -15% decrease [-4.4 pp] in the revert rate of edits shown Tone Check in the test group compared to edits eligible but not shown Tone Check in the control group.
By if multiple checks were shown
Code
tone_check_reverts_bymultiple <- tone_check_publish_data |>
    filter(
        is_new_content == TRUE &
            is_test_eligible == 'eligible' &
            multiple_checks_shown != "no tone checks" &
            # Removing 3 events where eligible edits in control were incorrectly tagged as being shown checks
            test_group == 'test (tone check shown)'
    ) |>
    group_by(multiple_checks_shown) %>%
    summarise(
        n_edits = n_distinct(editing_session),
        n_reverts = n_distinct(editing_session[was_reverted == TRUE])
    ) |>
    mutate(
        revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")
    ) |>
    select(-c(n_edits, n_reverts)) %>%  # removing granular data columns for publication
    gt() |>
    tab_header(title = "New content edit revert rate by if multiple checks were shown") |>
    opt_stylize() |>
    cols_label(
        multiple_checks_shown = "Multiple Check",
        #n_edits = "Number of published new content edits",
        #n_reverts = "Number of edits reverted",
        revert_rate = "Proportion of new content edits that were reverted"
    ) |>
    tab_source_note(gt::md('Limited to published new content edits shown or eligible to be shown tone check'))

display_html(as_raw_html(tone_check_reverts_bymultiple))
New content edit revert rate by if multiple checks were shown

Multiple Check       | Proportion of new content edits that were reverted
one tone check       | 25.7%
multiple tone checks | 25.2%

Limited to published new content edits shown or eligible to be shown Tone Check
The number of Tone Checks shown within a single editing session does not impact the likelihood that an edit is reverted: the revert rate is about the same (~25%) whether an edit was shown one or multiple tone checks.
While we initially observed a lower revert rate for edits shown a single tone check in the leading indicator analysis, additional test data indicates that the revert rate of these edits is similar to edits shown multiple tone checks.
By Platform
Code
tone_check_publish_byplatform <- tone_check_publish_data |>
  filter(is_test_eligible == 'eligible') |>
  group_by(platform, test_group) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == TRUE]) # reverted within 48 hours
  ) |>
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")) |>
  select(-c(n_edits, n_reverts)) %>% # removing granular data columns for publication
  gt() |>
  tab_header(title = "New content edit revert rate by platform") |>
  opt_stylize() |>
  cols_label(
    test_group = "Experiment Group",
    platform = "Platform",
    #n_edits = "Number of published new content edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  ) |>
  tab_source_note(gt::md('Limited to published new content edits shown or eligible to be shown Tone Check'))

display_html(as_raw_html(tone_check_publish_byplatform))
New content edit revert rate by platform

Experiment Group | Proportion of new content edits that were reverted
mobile web
  control (eligible but not shown tone check) | 34.5%
  test (tone check shown) | 34.6%
desktop
  control (eligible but not shown tone check) | 25%
  test (tone check shown) | 20.2%

Limited to published new content edits shown or eligible to be shown Tone Check
The decrease in the new content edit revert rate is primarily driven by a decrease in the revert rate of desktop edits.
We observed a statistically significant -19% [-4.8 pp] decrease in the revert rate of desktop edits shown Tone Check. On mobile web, there was no statistically significant change.
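Throughout this report, changes are quoted both as relative percent changes and as absolute percentage-point (pp) differences. A quick sketch of how the two relate, using the desktop rates from the table above (Python here purely for illustration; it is not part of the report's R pipeline):

```python
def change_summary(control_rate: float, test_rate: float) -> tuple[float, float]:
    """Return (relative % change, absolute percentage-point change).

    Rates are expressed as percentages, e.g. 25.0 for a 25% revert rate.
    """
    absolute_pp = test_rate - control_rate                       # percentage points
    relative_pct = (test_rate - control_rate) / control_rate * 100  # % of the control rate
    return round(relative_pct, 1), round(absolute_pp, 1)

# Desktop revert rates from the table above: 25% control, 20.2% test
relative_pct, absolute_pp = change_summary(25.0, 20.2)
```

A -4.8 pp drop from a 25% baseline is a -19.2% relative change, which is where the "-19% [-4.8 pp]" phrasing comes from.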
By User Experience
Code
tone_check_revert_byuserexp <- tone_check_publish_data |>
  filter(is_new_content == TRUE,
         is_test_eligible == 'eligible') |>
  group_by(experience_level_group, test_group) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == TRUE]) # reverted within 48 hours
  ) |>
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")) |>
  select(-c(n_edits, n_reverts)) |> # removing granular data columns for publication
  gt() |>
  tab_header(title = "New content edit revert rate by user experience") |>
  opt_stylize() |>
  cols_label(
    test_group = "Experiment Group",
    experience_level_group = "User Status",
    #n_edits = "Number of published new content edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  ) |>
  tab_source_note(gt::md('Limited to published new content edits shown or eligible to be shown Tone Check'))

display_html(as_raw_html(tone_check_revert_byuserexp))
New content edit revert rate by user experience

Experiment Group | Proportion of new content edits that were reverted
Unregistered
  control (eligible but not shown tone check) | 34.7%
  test (tone check shown) | 37%
Newcomer
  control (eligible but not shown tone check) | 21%
  test (tone check shown) | 26%
Junior Contributor
  control (eligible but not shown tone check) | 31%
  test (tone check shown) | 20.8%

Limited to published new content edits shown or eligible to be shown Tone Check
Results vary based on user experience. While we observed a statistically significant -33% relative [-10.2 pp] decrease in reverts for Junior Contributors, we did not confirm any change in the revert rate of Newcomers or unregistered users.
These trends indicate that Tone Check may be more effective for people who have already succeeded in completing at least one edit on a Wikipedia. Because these users are more experienced, their edits are less likely to be reverted for other policy violations compared to users completing their first edit.
By Partner Wikipedia
Code
tone_check_revert_bywiki <- tone_check_publish_data |>
  filter(is_new_content == TRUE,
         is_test_eligible == 'eligible') |>
  group_by(wiki, test_group) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == TRUE])
  ) |>
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")) |>
  select(-c(n_edits, n_reverts)) %>% # removing granular data columns for publication
  gt() |>
  tab_header(title = "New content edit revert rate by partner Wikipedia") |>
  opt_stylize() |>
  cols_label(
    test_group = "Experiment Group",
    wiki = "Wikipedia",
    #n_edits = "Number of published new content edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of new content edits that were reverted"
  ) |>
  tab_source_note(gt::md('Limited to wikis with > 100 published new content edits'))

display_html(as_raw_html(tone_check_revert_bywiki))
New content edit revert rate by partner Wikipedia

Experiment Group | Proportion of new content edits that were reverted
French Wikipedia
  control (eligible but not shown tone check) | 30.8%
  test (tone check shown) | 29%
Japanese Wikipedia
  control (eligible but not shown tone check) | 31.2%
  test (tone check shown) | 10.9%
Portuguese Wikipedia
  control (eligible but not shown tone check) | 21.4%
  test (tone check shown) | 16.9%

Limited to wikis with > 100 published new content edits
The new content edit revert rate decreased for users shown Tone Check at all three partner Wikipedias, by at least 5% in relative terms.
We again see an especially high impact on edit quality at Japanese Wikipedia, where there was a -65% decrease in the revert rate of edits shown Tone Check compared to eligible edits in the control group.
Due to the small per-Wikipedia sample sizes, we are currently not able to confirm statistical significance of the decreases at any individual Wikipedia, but the direction and magnitude of the changes indicate that Tone Check is having a positive effect on edit quality at each partner Wikipedia.
Confirming the impact of Tone Check on revert rate
Code
# calculate the proportion for each user
tone_check_reverts_overall_byuser <- tone_check_publish_data |>
  filter(is_test_eligible == 'eligible') |> # limit to edits shown or eligible to be shown tone check
  group_by(test_group, platform, user_id) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == TRUE]) # reverted within 48 hours
  ) |>
  mutate(revert_rate = n_reverts / n_edits) # proportion for each user
Code
# rename field names to align with relax package naming convention
tone_check_reverts_overall_byuser <- tone_check_reverts_overall_byuser |>
  mutate(variation = factor(test_group,
                            levels = c("control (eligible but not shown tone check)",
                                       "test (tone check shown)"),
                            labels = c("control", "treatment")))

tone_check_reverts_overall_byuser$outcome <- tone_check_reverts_overall_byuser$revert_rate
Code
overall_impact_reverts <- tone_check_reverts_overall_byuser |>
  analyze_relative_lift(metric_type = "proportion", ci_level = 0.90) |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on new content revert rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(label = md("**Bayesian Analysis**"),
              columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)) |>
  tab_spanner(label = md("**Frequentist Analysis**"),
              columns = c(estimate_freq, p_value, conf_lower, conf_upper)) |>
  # Rename columns for clarity ---
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("90% CI Lower"),
    cred_upper = md("90% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("90% CI Lower"),
    conf_upper = md("90% CI Upper")
  ) |>
  # Apply formatting (decimals and CI grouping) ---
  fmt_number(columns = everything(), decimals = 3) |> # Use 3 decimals for precision
  # Highlight key finding (Inconclusive) ---
  # tab_footnote(
  #   footnote = md("The 95% intervals cross zero, indicating no statistically conclusive difference."),
  #   locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  # ) %>%
  # Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(4)
  )

display_html(as_raw_html(overall_impact_reverts))
Evaluating Tone Check impact on new content revert rate
Difference in Metric (Test Group - Control Group)

Bayesian Analysis: Point Estimate −0.080, Chance to Win 0.043, 90% CI [−0.158, −0.003]
Frequentist Analysis: Point Estimate −0.082, p-value 0.083, 90% CI [−0.161, −0.004]
Results confirm a slight but statistically significant decrease in the revert rate of edits shown Tone Check across all edits. The Bayesian analysis indicates a 95.7% chance that Tone Check reduces the likelihood of an edit being reverted, and at a 90% confidence level the frequentist result is statistically significant (p = 0.083).
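The "Chance to Win" column can be read as a posterior probability. As a rough sketch of how such a probability can be computed for two revert proportions using independent Beta posteriors (illustrative Python with hypothetical counts; this is not the internals of the analyze_relative_lift function used above):

```python
import random

def chance_lower_rate(x_ctrl, n_ctrl, x_test, n_test, draws=20000, seed=0):
    """Monte Carlo estimate of P(test revert rate < control revert rate)
    under independent Beta(1 + reverts, 1 + non-reverts) posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_ctrl = rng.betavariate(1 + x_ctrl, 1 + n_ctrl - x_ctrl)
        p_test = rng.betavariate(1 + x_test, 1 + n_test - x_test)
        wins += p_test < p_ctrl
    return wins / draws

# Hypothetical counts, NOT the experiment's data: 290 of 1000 control edits
# reverted vs 250 of 1000 test edits reverted
prob = chance_lower_rate(290, 1000, 250, 1000)
```

With these made-up counts the posterior probability of a lower test-group revert rate comes out high (near 1), playing the same role as the ~95.7% figure quoted above.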
Code
# check by platform numbers
platform_impact_reverts <- tone_check_reverts_overall_byuser |>
  group_by(platform) |>
  group_modify(~ analyze_relative_lift(.x, metric_type = "proportion")) |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on revert rate by platform**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(label = md("**Bayesian Analysis**"),
              columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)) |>
  tab_spanner(label = md("**Frequentist Analysis**"),
              columns = c(estimate_freq, p_value, conf_lower, conf_upper)) |>
  # Rename columns for clarity ---
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  # Apply formatting (decimals and CI grouping) ---
  fmt_number(columns = everything(), decimals = 3) |> # Use 3 decimals for precision
  # Highlight key finding (Inconclusive) ---
  tab_footnote(
    footnote = md("The 95% intervals cross zero, indicating no statistically conclusive difference."),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  # Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(4)
  )

display_html(as_raw_html(platform_impact_reverts))
Evaluating Tone Check impact on revert rate by platform
Difference in Metric (Test Group - Control Group)

mobile web: Bayesian Point Estimate −0.024, Chance to Win 0.370, 95% CI [−0.166, 0.118]; Frequentist Point Estimate −0.026, p-value 0.732, 95% CI [−0.172, 0.121]
desktop: Bayesian Point Estimate −0.092, Chance to Win 0.063, 95% CI [−0.210, 0.026]; Frequentist Point Estimate −0.096, p-value 0.119, 95% CI [−0.217, 0.025]

The 95% intervals cross zero, indicating no statistically conclusive difference.
While we do not have sufficient data to confirm statistical significance at the strict 95% level on a per-platform basis, the results strongly indicate that Tone Check is decreasing the revert rate on desktop.
The Bayesian analysis shows a 93.7% probability that the tool reduces desktop reverts, with a projected impact of -9.6%. On mobile web, there was almost no change in the overall new content revert rate.
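On the frequentist side, a per-platform comparison of two revert proportions can be sketched as a standard two-proportion z-test (illustrative Python with hypothetical counts; the report's actual analysis uses analyze_relative_lift in R):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)            # pooled proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical desktop-scale counts, NOT the experiment's data:
# 250/1000 control edits reverted vs 202/1000 test edits reverted
z, p = two_proportion_z(250, 1000, 202, 1000)
```

With larger hypothetical samples like these, a 4.8 pp gap is easily significant; with the experiment's smaller per-platform samples, the same gap can fail to clear the 95% bar, as seen above.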
New Content edit revert rate: Impact of removing non-neutral language (Primary Metric)
Hypothesis:
The quality of new content edits newcomers and Junior Contributors make in the main namespace will increase because a greater percentage of these edits will not contain non-neutral language.
Methodology:
As the final piece of evaluating this hypothesis, we reviewed the revert rate of new content edits in the test group for people who removed non-neutral language in response to Tone Check. Here we are measuring the impact of a person making the change Tone Check prompts: does removing non-neutral language decrease the likelihood that an edit is reverted?
In this section, we isolated the direct impact of Tone Check by comparing a specific subset: control edits that contained non-neutral language versus test edits where the user actively removed that language in response to a Tone Check. To do this, we used the revision tag created in T388716 to identify when the model detects non-neutral language within a new content edit at the time of publishing.
Overall
Code
tone_check_eligible_revert_overall <- tone_check_publish_data |>
  filter(is_new_content == TRUE,
         is_test_eligible == 'eligible') |> # limit to edits shown or eligible to be shown tone check
  group_by(is_tone_check_addressed) |> # group by presence of non-neutral language
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == TRUE]) # reverted within 48 hours and tone check issues addressed
  ) |>
  mutate(revert_rate = round(n_reverts / n_edits, 3)) |>
  ungroup() |>
  mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
         n_reverts = ifelse(n_reverts < 50, "<50", n_reverts)) # sanitizing per data publication guidelines
Code
# plot visualization of overall edit revert rates
dodge <- position_dodge(width = 0.9)

tone_check_eligible_revert_overall |>
  filter(is_tone_check_addressed != 'Tone check shown but not addressed') |>
  # removing edits in test group where tone issues were not addressed for this analysis
  ggplot(aes(x = is_tone_check_addressed, y = revert_rate, fill = is_tone_check_addressed)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(revert_rate * 100, "%", "\n", n_reverts, "reverted edits"),
                fontface = 2),
            vjust = 1.2, size = 10, color = "white") +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
  labs(y = "Percent of edits reverted",
       x = "Experiment Group",
       title = "New Content revert rate: Impact of removing non-neutral language",
       caption = "Limited to published new content edits by unregistered users or users with 100 or fewer edits") +
  theme(panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 20),
        axis.text.x = element_text(size = 20),
        axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
        legend.position = "none",
        axis.line = element_line(colour = "black"))
When the Tone Check successfully prompts a user to remove non-neutral language, the likelihood of that edit being reverted drops significantly. There was a
-44.1%
decrease in the revert rate of edits where people removed non-neutral language in response to a Tone Check prompt.
By Platform
Code
tone_check_eligible_revert_byplatform <- tone_check_publish_data |>
  filter(is_new_content == TRUE,
         is_test_eligible == 'eligible',
         is_tone_check_addressed != 'Tone check shown but not addressed') |>
  # limit to edits where tone check was addressed
  group_by(platform, is_tone_check_addressed) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == TRUE]) # look at reverted
  ) |>
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")) |>
  select(-c(n_edits, n_reverts)) |> # removing granular data columns
  gt() |>
  tab_header(title = "New Content revert rate: Impact of removing non-neutral language by platform") |>
  opt_stylize() |>
  cols_label(
    platform = "Platform",
    is_tone_check_addressed = "Were tone issues detected at time of save?",
    #n_edits = "Number of published edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of edits that were reverted"
  ) |>
  tab_source_note(gt::md('Limited to published new content edits shown or eligible to be shown Tone Check'))

display_html(as_raw_html(tone_check_eligible_revert_byplatform))
New Content revert rate: Impact of removing non-neutral language by platform

Were tone issues detected at time of save? | Proportion of edits that were reverted
mobile web
  Eligible control edits | 32.4%
  Tone check shown and addressed | 27.6%
desktop
  Eligible control edits | 28.5%
  Tone check shown and addressed | 15.1%

Limited to published new content edits shown or eligible to be shown Tone Check
We observed decreases on both platforms, but the impact was larger on desktop than on mobile web. On desktop, we observed a significant -47% decrease [-13.4 pp] in revert rate for people who revised their text in response to Tone Check.
On mobile web, there was a -14.8% [-4.8 pp] decrease in revert rate for edits where non-neutral language was removed. Mobile web edits appear to be inherently more challenging for newcomers and are still more likely to be reverted compared to desktop edits, even when non-neutral language is removed.
By User Experience
Code
tone_check_eligible_revert_byuserexp <- tone_check_publish_data |>
  filter(is_new_content == TRUE,
         is_test_eligible == 'eligible',
         is_tone_check_addressed != 'Tone check shown but not addressed') |>
  # limit to edits where tone check was addressed
  group_by(experience_level_group, is_tone_check_addressed) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == TRUE]) # look at reverted
  ) |>
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")) |>
  select(-c(n_edits, n_reverts)) |> # removing granular data columns
  gt() |>
  tab_header(title = "New Content revert rate: Impact of removing non-neutral language by user experience") |>
  opt_stylize() |>
  cols_label(
    experience_level_group = "User Experience",
    is_tone_check_addressed = "Were tone issues detected at time of save?",
    #n_edits = "Number of published edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of edits that were reverted"
  ) |>
  tab_source_note(gt::md('Limited to published new content edits shown or eligible to be shown Tone Check'))

display_html(as_raw_html(tone_check_eligible_revert_byuserexp))
New Content revert rate: Impact of removing non-neutral language by user experience

Were tone issues detected at time of save? | Proportion of edits that were reverted
Unregistered
  Eligible control edits | 34.7%
  Tone check shown and addressed | 23.8%
Newcomer
  Eligible control edits | 21%
  Tone check shown and addressed | 21.5%
Junior Contributor
  Eligible control edits | 31%
  Tone check shown and addressed | 13.7%

Limited to published new content edits shown or eligible to be shown Tone Check
We observed the highest impact for Junior Contributors, where there was a -55.8% decrease [-17.3 pp] in revert rate, compared to a slight +2.5% [+0.5 pp] increase for Newcomers and a -31.4% decrease for unregistered users.
For Newcomers and unregistered users, addressing tone issues may have less of an impact because their edits are frequently reverted for other policy violations that Tone Check is not designed to catch. Junior Contributors have already successfully completed at least one edit and are more likely to publish an edit where non-neutral language is the only issue.
By Partner Wikipedia
Code
tone_check_eligible_revert_bywiki <- tone_check_publish_data |>
  filter(is_new_content == TRUE,
         is_test_eligible == 'eligible',
         is_tone_check_addressed != 'Tone check shown but not addressed') |>
  # limit to edits where tone check was addressed
  group_by(wiki, is_tone_check_addressed) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == TRUE]) # look at reverted
  ) |>
  mutate(revert_rate = paste0(round(n_reverts / n_edits * 100, 1), "%")) |>
  select(-c(n_edits, n_reverts)) %>% # removing granular data columns
  gt() |>
  tab_header(title = "New Content revert rate: Impact of removing non-neutral language by partner Wikipedia") |>
  opt_stylize() |>
  cols_label(
    wiki = "Wikipedia",
    is_tone_check_addressed = "Were tone issues detected at time of save?",
    #n_edits = "Number of published edits",
    #n_reverts = "Number of edits reverted",
    revert_rate = "Proportion of edits that were reverted"
  ) |>
  tab_source_note(gt::md('Limited to published new content edits shown or eligible to be shown Tone Check'))

display_html(as_raw_html(tone_check_eligible_revert_bywiki))
New Content revert rate: Impact of removing non-neutral language by partner Wikipedia

Were tone issues detected at time of save? | Proportion of edits that were reverted
French Wikipedia
  Eligible control edits | 30.8%
  Tone check shown and addressed | 22.5%
Japanese Wikipedia
  Eligible control edits | 31.2%
  Tone check shown and addressed | 8%
Portuguese Wikipedia
  Eligible control edits | 21.4%
  Tone check shown and addressed | 4.5%

Limited to published new content edits shown or eligible to be shown Tone Check
We also observed decreases across all three partner Wikipedias; however, the magnitude of the impact varies, highlighting different revert behavior in each community.
At Japanese and Portuguese Wikipedias, removing non-neutral language from edits reduces the revert rate to less than 10% (over a 70% relative decrease), while there was less of an impact on French Wikipedia. See specific changes below:
Japanese Wikipedia: -74.4% [-23.2 pp]
Portuguese Wikipedia: -79% [-16.9 pp]
French Wikipedia: -26.9% [-8.3 pp]
Note: There is a smaller sample size of published edits eligible for Tone Check on a per-Wikipedia basis (< 300 edits), so these rates are more susceptible to noise.
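To give a sense of why per-Wikipedia rates built on fewer than 300 edits are noisy, here is a rough sketch of the normal-approximation confidence interval for a proportion (illustrative Python with a made-up sample size, not the report's analysis code):

```python
import math

def proportion_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# At roughly 300 edits, a 20% revert rate carries about +/- 4.5 pp of uncertainty
lo, hi = proportion_ci(0.20, 300)
```

An interval that wide easily spans several of the per-wiki differences reported above, which is why per-wiki significance cannot be confirmed despite consistent directions.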
Confirming the impact of removing non-neutral language on new content revert rate
I then evaluated the impact of removing non-neutral language on the new content revert rate, controlling for variance by user and wiki. For this analysis, I specifically compared edits containing non-neutral language in the control group (eligible control edits) to edits in the test group where Tone Check was shown and non-neutral language was removed.
Code
tone_check_eligible_revert_overall_byuser <- tone_check_publish_data |>
  filter(is_new_content == TRUE,
         is_test_eligible == 'eligible',
         is_tone_check_addressed != 'Tone check shown but not addressed') |>
  # directly comparing eligible control to test where tone check addressed
  group_by(is_tone_check_addressed, platform, user_id) |>
  # use presence of non-neutral language as the variation
  summarise(
    n_edits = n_distinct(editing_session),
    n_reverts = n_distinct(editing_session[was_reverted == TRUE]) # reverted within 48 hours
  ) |>
  mutate(revert_rate = n_reverts / n_edits)
Code
# rename field names to align with relax package naming convention
tone_check_eligible_revert_overall_byuser <- tone_check_eligible_revert_overall_byuser |>
  mutate(variation = factor(is_tone_check_addressed,
                            levels = c("Eligible control edits",
                                       "Tone check shown and addressed"),
                            labels = c("control", "treatment")))

tone_check_eligible_revert_overall_byuser$outcome <- tone_check_eligible_revert_overall_byuser$revert_rate
Code
overall_impact_toneeligible <- tone_check_eligible_revert_overall_byuser |>
  analyze_relative_lift(metric_type = "proportion") |>
  gt() |>
  tab_header(
    title = md("**Evaluating the impact of removing non-neutral language on revert rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(label = md("**Bayesian Analysis**"),
              columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)) |>
  tab_spanner(label = md("**Frequentist Analysis**"),
              columns = c(estimate_freq, p_value, conf_lower, conf_upper)) |>
  # Rename columns for clarity ---
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  # Apply formatting (decimals and CI grouping) ---
  fmt_number(columns = everything(), decimals = 3) |> # Use 3 decimals for precision
  # Highlight key finding ---
  tab_footnote(
    footnote = md("The 95% intervals do not cross zero, indicating the results are statistically significant."),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  # Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(4)
  )

display_html(as_raw_html(overall_impact_toneeligible))
Evaluating the impact of removing non-neutral language on revert rate
Difference in Metric (Test Group - Control Group)

Bayesian Analysis: Point Estimate −0.350, Chance to Win 0.000, 95% CI [−0.534, −0.165]
Frequentist Analysis: Point Estimate −0.388, p-value 0.000, 95% CI [−0.583, −0.193]

The 95% intervals do not cross zero, indicating the results are statistically significant.
We confirmed a statistically significant reduction in revert rate for edits where non-neutral language was removed in the final published edit.
For users who removed non-neutral language in response to a Tone Check, we observed an estimated 38% relative reduction in the likelihood of an edit being reverted. Both the Bayesian "Chance to Win" and the frequentist p-value round to 0.000, confirming that the reduction is very unlikely to be due to chance and is consistent with the improved language driving the effect.
Code
# check by platform numbers
platform_impact_toneeligible <- tone_check_eligible_revert_overall_byuser |>
  group_by(platform) |>
  group_modify(~ analyze_relative_lift(.x, metric_type = "proportion")) |>
  gt() |>
  tab_header(
    title = md("**Evaluating the impact of removing non-neutral language on platform revert rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(label = md("**Bayesian Analysis**"),
              columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)) |>
  tab_spanner(label = md("**Frequentist Analysis**"),
              columns = c(estimate_freq, p_value, conf_lower, conf_upper)) |>
  # Rename columns for clarity ---
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  # Apply formatting (decimals and CI grouping) ---
  fmt_number(columns = everything(), decimals = 3) |> # Use 3 decimals for precision
  # Style the table ---
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(4)
  )

display_html(as_raw_html(platform_impact_toneeligible))
Evaluating the impact of removing non-neutral language on platform revert rate
Difference in Metric (Test Group - Control Group)

mobile web: Bayesian Point Estimate −0.088, Chance to Win 0.330, 95% CI [−0.479, 0.303]; Frequentist Point Estimate −0.158, p-value 0.559, 95% CI [−0.699, 0.384]
desktop: Bayesian Point Estimate −0.342, Chance to Win 0.001, 95% CI [−0.551, −0.133]; Frequentist Point Estimate −0.392, p-value 0.001, 95% CI [−0.616, −0.167]
Per-platform findings:
Desktop: The model shows a highly significant -34.2% relative drop in the revert rate for edits where non-neutral language was removed (p = 0.001). There is a 99.9% probability that desktop users who address Tone Check suggestions are less likely to be reverted.
Mobile web: Results show directional signs that removing non-neutral language decreases reverts, with a projected relative impact of -8.8%. While there is a 67% chance that addressing Tone Check issues decreases reverts, we cannot confirm statistical significance (p = 0.559). This is likely due to a smaller sample size and higher noise on mobile, where other factors (such as technical errors or other policy violations) often result in reverts even when non-neutral language is removed.
Edit Completion Rate (Primary Metric)
Hypothesis:
Newcomers and Junior Contributors will experience Tone Check as encouraging because it will offer them more clarity about what is expected of the new information they add to Wikipedia.
Methodology:
We reviewed the proportion of attempted edits that were successfully published (and not reverted). For this analysis, we limited to edits that reached the point where Tone Check was or would have been shown, to reduce noise from edits abandoned earlier in the editing workflow.
We excluded edits that were reverted to ensure we were measuring Tone Check's impact on productive contributions.
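The metric reduces to "distinct sessions saved and not reverted, divided by distinct sessions that reached the check point". A toy sketch of that counting logic (Python with hypothetical session records, not experiment data):

```python
# Toy session records (hypothetical): each tuple is
# (editing_session, saved_edit, was_reverted)
sessions = [
    ("s1", True, False),   # saved and kept: completed
    ("s2", True, True),    # saved but reverted: not counted as completed
    ("s3", False, False),  # abandoned before saving: not counted
    ("s1", True, False),   # duplicate event for the same session
]

# Count distinct sessions, mirroring n_distinct(editing_session)
distinct = {s for s, _, _ in sessions}
completed = {s for s, saved, reverted in sessions if saved and not reverted}
completion_rate = len(completed) / len(distinct)
```

Deduplicating by session matters because instrumentation can emit multiple events per editing session; the R pipeline handles this with n_distinct.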
Code
# load data for assessing edit completion rate
tone_check_completion_rates_1 <- read.csv(
  file = 'data/tone_check_completion_data.tsv',
  header = TRUE,
  sep = '\t',
  stringsAsFactors = FALSE
)
Code
# load edit completion rate (second dataset)
# Second dataset was created to obtain updated event data while preserving the initial
# aggregated dataset, which could no longer be queried in the Data Lake due to data retention policies.
tone_check_completion_rates_2 <- read.csv(
  file = 'data/tone_check_completion_data_pt2.tsv',
  header = TRUE,
  sep = '\t',
  stringsAsFactors = TRUE
)
Code
# Combine the two datasets
tone_check_completion_rates <- rbind(tone_check_completion_rates_1, tone_check_completion_rates_2)
Code
# Set experience level group and factor levels
tone_check_completion_rates <- tone_check_completion_rates |>
  mutate(
    experience_level_group = case_when(
      user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
      user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 & user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count > 100 ~ "Non-Junior Contributor"
    ),
    experience_level_group = factor(experience_level_group,
                                    levels = c("Unregistered", "Newcomer",
                                               "Non-Junior Contributor", "Junior Contributor"))
  )

# rename experiment field to clarify
tone_check_completion_rates <- tone_check_completion_rates |>
  mutate(test_group = factor(test_group,
                             levels = c('2025-09-editcheck-tone-control',
                                        '2025-09-editcheck-tone-test'),
                             labels = c("control (eligible but not shown tone check)",
                                        "test (tone check shown)")))

# rename platform from phone to mobile web to clarify meaning
tone_check_completion_rates <- tone_check_completion_rates |>
  mutate(platform = factor(platform,
                           levels = c('phone', 'desktop'),
                           labels = c("mobile web", "desktop")))

tone_check_completion_rates <- tone_check_completion_rates |>
  mutate(wiki = recode(wiki, !!!wiki_name_lookup))
Code
# Set fields and factor levels to assess number of checks shown
tone_check_completion_rates <- tone_check_completion_rates |>
  mutate(
    multiple_checks_shown = ifelse(n_checks_shown > 1, "multiple checks shown", "one check shown"),
    multiple_checks_shown = factor(multiple_checks_shown,
                                   levels = c("one check shown", "multiple checks shown"))
  )

# note these buckets can be adjusted as needed based on distribution of data
tone_check_completion_rates <- tone_check_completion_rates |>
  mutate(
    checks_shown_bucket = case_when(
      is.na(n_checks_shown) ~ '0',
      n_checks_shown == 1 ~ '1',
      n_checks_shown == 2 ~ '2',
      n_checks_shown > 2 & n_checks_shown <= 5 ~ "3-5",
      n_checks_shown > 5 & n_checks_shown <= 10 ~ "6-10",
      n_checks_shown > 10 ~ "over 10"
    ),
    checks_shown_bucket = factor(checks_shown_bucket,
                                 levels = c("0", "1", "2", "3-5", "6-10", "over 10"))
  )
Overall
Code
tone_check_completion_rate_overall <- tone_check_completion_rates %>%
  filter(tone_check_shown == TRUE) %>% # limit to sessions where tone check was shown
  group_by(test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit == TRUE & was_reverted == FALSE]) # saved and not reverted
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%"))
Code
# plot visualization of overall edit completion rates
dodge <- position_dodge(width = 0.9)

p <- tone_check_completion_rate_overall %>%
  ggplot(aes(x = test_group, y = n_saves / n_edits, fill = test_group)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(completion_rate, "\n", n_saves, "saved edits"), fontface = 2),
            vjust = 1.2, size = 10, color = "white") +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
  labs(y = "Percent of edit attempts completed",
       x = "Experiment Group",
       title = "Edit completion rate",
       caption = "Limited to edits shown or eligible to be shown at least one Tone Check and not reverted") +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    axis.text.x = element_text(size = 24),
    axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
    legend.position = "none",
    axis.line = element_line(colour = "black")
  )
p
Edit completion rates for people shown Tone Check decreased only slightly: a -3.2% relative decrease (-1.6 percentage points).
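As a sanity check on how the relative and percentage-point figures relate, the relative change is the percentage-point change divided by the control rate. This is a hypothetical sketch: the input rates below are approximate overall values implied by the report, not exact study data.

```r
# Approximate overall completion rates implied by the report (assumption, not study data)
control_rate <- 0.502
test_rate    <- 0.486

pp_change  <- (test_rate - control_rate) * 100                 # absolute change, percentage points
rel_change <- (test_rate - control_rate) / control_rate * 100  # relative change, percent

round(pp_change, 1)   # about -1.6
round(rel_change, 1)  # about -3.2
```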
By whether multiple checks were shown
Code
tone_check_completion_rate_bymulti <- tone_check_completion_rates %>%
  filter(tone_check_shown == TRUE & test_group == 'test (tone check shown)') %>%
  group_by(test_group, multiple_checks_shown) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit == TRUE & was_reverted == FALSE])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Tone Check edit completion rate by whether multiple checks were shown") %>%
  opt_stylize() %>%
  cols_label(
    test_group = "Experiment group",
    multiple_checks_shown = "Multiple Tone Checks shown",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to edits shown at least one Tone Check and not reverted'))

display_html(as_raw_html(tone_check_completion_rate_bymulti))
Tone Check edit completion rate by whether multiple checks were shown

| Experiment group | Multiple Tone Checks shown | Number of edit attempts shown Tone Check | Number of published edits | Proportion of edits saved |
|---|---|---|---|---|
| test (tone check shown) | one check shown | 2049 | 1287 | 62.8% |
| test (tone check shown) | multiple checks shown | 2798 | 1782 | 63.7% |

Limited to edits shown at least one Tone Check and not reverted
By number of checks shown
Code
tone_check_completion_rate_bynchecks <- tone_check_completion_rates %>%
  filter(tone_check_shown == TRUE & test_group == 'test (tone check shown)') %>% # limit to Tone Checks shown in the test group
  group_by(test_group, checks_shown_bucket) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit == TRUE & was_reverted == FALSE])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  ungroup() %>%
  mutate(
    n_edits = ifelse(n_edits < 50, "<50", n_edits),
    n_saves = ifelse(n_saves < 50, "<50", n_saves)
  ) %>% # sanitizing per data publication guidelines
  group_by(test_group) %>%
  gt() %>%
  tab_header(title = "Tone Check edit completion rate by the number of checks shown") %>%
  opt_stylize() %>%
  cols_label(
    checks_shown_bucket = "Number of Tone Checks shown",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to edits shown at least one Tone Check in the test group and not reverted'))

display_html(as_raw_html(tone_check_completion_rate_bynchecks))
Tone Check edit completion rate by the number of checks shown

Experiment group: test (tone check shown)

| Number of Tone Checks shown | Number of edit attempts shown Tone Check | Number of published edits | Proportion of edits saved |
|---|---|---|---|
| 1 | 2049 | 1011 | 49.3% |
| 2 | 1452 | 718 | 49.4% |
| 3-5 | 834 | 394 | 47.2% |
| 6-10 | 327 | 150 | 45.9% |
| over 10 | 185 | 82 | 44.3% |

Limited to edits shown at least one Tone Check in the test group and not reverted
The majority of published new content edits (73%) were shown two or fewer Tone Checks within a single editing session. When two or fewer checks were presented, we see only about a 1.6% decrease in edit completion rate.
The decrease in completion rate does not exceed 10% until more than 10 Tone Checks are presented in a single editing session. For these edits, the completion rate fell to 44.3% (a -12% relative decrease from the control). However, editing sessions with over 10 Tone Checks represent only 3% of published edits where Tone Check was shown and are likely an indicator of very low quality edits that we would want to deter.
By Platform
Code
tone_check_completion_rate_byplatform <- tone_check_completion_rates %>%
  filter(tone_check_shown == TRUE) %>%
  group_by(platform, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit == TRUE & was_reverted == FALSE])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  # mutate(n_saves = ifelse(n_saves < 50, "<50", n_saves)) %>% # sanitizing per data publication guideline
  # select(-c(3, 4)) %>%
  gt() %>%
  tab_header(title = "Tone Check edit completion rate by platform") %>%
  opt_stylize() %>%
  cols_label(
    test_group = "Experiment Group",
    platform = "Platform",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to edits shown or eligible to be shown at least one Tone Check and not reverted'))

display_html(as_raw_html(tone_check_completion_rate_byplatform))
Tone Check edit completion rate by platform

| Platform | Experiment Group | Number of edit attempts shown Tone Check | Number of published edits | Proportion of edits saved |
|---|---|---|---|---|
| mobile web | control (eligible but not shown tone check) | 864 | 336 | 38.9% |
| mobile web | test (tone check shown) | 1266 | 489 | 38.6% |
| desktop | control (eligible but not shown tone check) | 2983 | 1597 | 53.5% |
| desktop | test (tone check shown) | 3581 | 1866 | 52.1% |

Limited to edits shown or eligible to be shown at least one Tone Check and not reverted
This decrease was primarily concentrated on desktop (-2.6% relative; -1.4 percentage points), with no significant change in completion rates observed for mobile web. Mobile users are nearly as likely to publish their edit whether or not they see a Tone Check.
By User Experience
Code
tone_check_completion_rate_byuserstatus <- tone_check_completion_rates %>%
  filter(tone_check_shown == TRUE) %>%
  group_by(experience_level_group, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit == TRUE & was_reverted == FALSE])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  # select(-c(3, 4)) %>% # data sanitizing for publication
  gt() %>%
  tab_header(title = "Tone check edit completion rate by user experience") %>%
  opt_stylize() %>%
  cols_label(
    test_group = "Experiment Group",
    experience_level_group = "User Experience",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to edits shown or eligible to be shown at least one Tone Check and not reverted'))

display_html(as_raw_html(tone_check_completion_rate_byuserstatus))
Tone check edit completion rate by user experience

| User Experience | Experiment Group | Number of edit attempts shown Tone Check | Number of published edits | Proportion of edits saved |
|---|---|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 1144 | 433 | 37.8% |
| Unregistered | test (tone check shown) | 1613 | 555 | 34.4% |
| Newcomer | control (eligible but not shown tone check) | 712 | 323 | 45.4% |
| Newcomer | test (tone check shown) | 861 | 367 | 42.6% |
| Junior Contributor | control (eligible but not shown tone check) | 1991 | 1177 | 59.1% |
| Junior Contributor | test (tone check shown) | 2373 | 1433 | 60.4% |

Limited to edits shown or eligible to be shown at least one Tone Check and not reverted
The impacts of Tone Check on edit completion rate vary based on user experience. See relative changes below:
Unregistered: -9.0% decrease [-3.4pp]
Newcomers: -6.4% decrease [-2.9pp]
Junior Contributors: +2.2% increase [1.3pp]
While Tone Check resulted in slight decreases in edit completion rates for Newcomers and unregistered users, it caused minimal disruption to Junior Contributors. In fact, we observed a +2.2% relative increase in the completion rate of Junior Contributors shown Tone Check, suggesting the check encourages and facilitates successful publishing for a subset of users.
By Partner Wikipedia
Code
tone_check_completion_rate_bywiki <- tone_check_completion_rates %>%
  filter(tone_check_shown == TRUE) %>%
  group_by(wiki, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit == TRUE & was_reverted == FALSE])
  ) %>%
  mutate(completion_rate = paste0(round(n_saves / n_edits * 100, 1), "%")) %>%
  # filter(n_edits > 200) %>% # limit to wikis with sufficient events
  # select(-c(3, 4)) %>% # data sanitizing for publication
  gt() %>%
  tab_header(title = "Tone Check edit completion rate by Wikipedia") %>%
  opt_stylize() %>%
  cols_label(
    test_group = "Experiment Group",
    wiki = "Wikipedia",
    n_edits = "Number of edit attempts shown Tone Check",
    n_saves = "Number of published edits",
    completion_rate = "Proportion of edits saved"
  ) %>%
  tab_source_note(gt::md('Limited to Wikipedias with at least 200 edit attempts during reviewed timeframe'))

display_html(as_raw_html(tone_check_completion_rate_bywiki))
Tone Check edit completion rate by Wikipedia

| Wikipedia | Experiment Group | Number of edit attempts shown Tone Check | Number of published edits | Proportion of edits saved |
|---|---|---|---|---|
| French Wikipedia | control (eligible but not shown tone check) | 2579 | 1245 | 48.3% |
| French Wikipedia | test (tone check shown) | 3332 | 1600 | 48% |
| Japanese Wikipedia | control (eligible but not shown tone check) | 704 | 420 | 59.7% |
| Japanese Wikipedia | test (tone check shown) | 902 | 476 | 52.8% |
| Portuguese Wikipedia | control (eligible but not shown tone check) | 564 | 268 | 47.5% |
| Portuguese Wikipedia | test (tone check shown) | 613 | 279 | 45.5% |

Limited to Wikipedias with at least 200 edit attempts during reviewed timeframe
The most significant decrease in edit completion rate was at Japanese Wikipedia (-11.6% relative decrease [-6.9pp]), while edit completion at French Wikipedia was essentially unchanged (-0.3pp).
Results indicate an inverse correlation between edit completion and revert rates at each Wikipedia. The most significant decrease in completion occurred on the Japanese Wikipedia, which also saw the most substantial decrease in revert rate. In contrast, the French Wikipedia saw almost no change in edit completion rates and only a small decrease in revert rates.
This correlation suggests that the Tone Check is effectively deterring some lower-quality edits that would have been reverted.
Confirming impact of Tone Check on edit completion rate
Code
# calculate the proportion for each user
tone_check_completion_rate_overall_byuser <- tone_check_completion_rates %>%
  filter(tone_check_shown == TRUE) %>% # limit to sessions where tone check was shown
  group_by(test_group, platform, user_id) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_saves = n_distinct(editing_session[saved_edit == TRUE & was_reverted == FALSE])
  ) %>%
  mutate(completion_rate = n_saves / n_edits)
Code
# rename field names to align with relax package naming convention
tone_check_completion_rate_overall_byuser <- tone_check_completion_rate_overall_byuser |>
  mutate(variation = factor(
    test_group,
    levels = c("control (eligible but not shown tone check)", "test (tone check shown)"),
    labels = c("control", "treatment")
  ))

tone_check_completion_rate_overall_byuser$outcome <- tone_check_completion_rate_overall_byuser$completion_rate
Code
overall_impact_completes <- tone_check_completion_rate_overall_byuser |>
  analyze_relative_lift(metric_type = "proportion") |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on edit completion rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(label = md("**Bayesian Analysis**"),
              columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)) |>
  tab_spanner(label = md("**Frequentist Analysis**"),
              columns = c(estimate_freq, p_value, conf_lower, conf_upper)) |>
  # Rename columns for clarity
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  # Apply formatting (decimals and CI grouping)
  fmt_number(columns = everything(), decimals = 3) |> # use 3 decimals for precision
  # Highlight key finding
  tab_footnote(
    footnote = md("The 95% intervals do not cross zero, indicating a statistically conclusive difference"),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  # Style the table
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(3),
    data_row.padding = px(5)
  )

display_html(as_raw_html(overall_impact_completes))
Evaluating Tone Check impact on edit completion rate
Difference in Metric (Test Group - Control Group)

| Analysis | Point Estimate | Chance to Win | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|
| Bayesian | −0.078 | 0.000 | — | −0.123 | −0.033 |
| Frequentist | −0.079 | — | 0.001 | −0.124 | −0.034 |

The 95% intervals do not cross zero, indicating a statistically conclusive difference.
Results indicate that Tone Check introduced a small but statistically significant (p = 0.001) amount of friction into the editing process. The per-user estimates indicate that Tone Check likely decreased edit completion rate by 7.9 percentage points.
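`analyze_relative_lift()` comes from the team's internal `relax` package. As a rough, hypothetical sketch of the frequentist side of such a comparison (not the package's actual implementation), a difference in per-user completion proportions can be checked with a Welch two-sample t-test; the data below are simulated stand-ins, not the study data:

```r
# Simulated per-user completion outcomes (stand-in data, not the study data)
set.seed(42)
control   <- rbinom(2000, size = 1, prob = 0.50)
treatment <- rbinom(2000, size = 1, prob = 0.48)

res <- t.test(treatment, control)                      # Welch two-sample t-test
diff_est <- unname(res$estimate[1] - res$estimate[2])  # estimated difference in proportions
ci <- res$conf.int                                     # 95% confidence interval for the difference
```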
Code
# check by platform numbers
platform_impact_completes <- tone_check_completion_rate_overall_byuser |>
  group_by(platform) |>
  group_modify(~ analyze_relative_lift(.x, metric_type = "proportion")) |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on completion rate by platform**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(label = md("**Bayesian Analysis**"),
              columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)) |>
  tab_spanner(label = md("**Frequentist Analysis**"),
              columns = c(estimate_freq, p_value, conf_lower, conf_upper)) |>
  # Rename columns for clarity
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("95% CI Lower"),
    cred_upper = md("95% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("95% CI Lower"),
    conf_upper = md("95% CI Upper")
  ) |>
  # Apply formatting (decimals and CI grouping)
  fmt_number(columns = everything(), decimals = 3) |> # use 3 decimals for precision
  # Highlight key finding (inconclusive on mobile web)
  tab_footnote(
    footnote = md("The 95% intervals cross zero, indicating no statistically conclusive difference."),
    locations = cells_column_labels(columns = c(cred_lower, conf_lower))
  ) %>%
  # Style the table
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(3),
    data_row.padding = px(5)
  )

display_html(as_raw_html(platform_impact_completes))
Evaluating Tone Check impact on completion rate by platform
Difference in Metric (Test Group - Control Group)

| Platform | Analysis | Point Estimate | Chance to Win | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| mobile web | Bayesian | −0.024 | 0.342 | — | −0.140 | 0.092 |
| mobile web | Frequentist | −0.025 | — | 0.679 | −0.143 | 0.093 |
| desktop | Bayesian | −0.082 | 0.000 | — | −0.131 | −0.034 |
| desktop | Frequentist | −0.083 | — | 0.001 | −0.131 | −0.035 |

For mobile web, the 95% intervals cross zero, indicating no statistically conclusive difference.
The per-platform analysis reveals that the 3.2% overall relative decrease in completion rates is driven by desktop editors.
On desktop, we confirmed a statistically significant decrease in edit completion rate: Tone Check likely caused a decrease of around 8.3 percentage points in per-user completion rate. On mobile web, we observed a small decrease that was not statistically significant.
Constructive Edit Rate
Hypothesis
: A larger proportion of new content edits by Newcomers and Junior Contributors will be constructive because they will be made aware that the new text they're attempting to publish needs to be written in a neutral tone, when they don't first think or know to write this way themselves.
Methodology
: The proportion of all published main namespace edits by users with ≤100 cumulative edits that are constructive (not reverted within 48 hours). As with revert rate, the analysis was limited to new content edits shown or eligible to be shown Tone Check so we can isolate the data to edits that would be impacted by this feature.
Note: This metric is also the
WE 1.1 Key Result
. We will include Tone Check’s impact on this metric as part of our evaluation of the collective impact of interventions deployed under WE 1.1 on this metric.
Overall
Code
tone_check_constructive_overall <- tone_check_publish_data |>
  filter(is_new_content == TRUE & is_test_eligible == 'eligible') |> # limit to eligible new content edits
  group_by(test_group, user_id) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_const = n_distinct(editing_session[was_reverted == FALSE])
  ) |> # edits not reverted
  mutate(constructive_edit_rate = n_const / n_edits) |>
  group_by(test_group) |>
  summarise(avg_rate = mean(constructive_edit_rate))
Code
# plot visualization of overall constructive edit rates
dodge <- position_dodge(width = 0.9)

p <- tone_check_constructive_overall |>
  ggplot(aes(x = test_group, y = n_const / n_edits, fill = test_group)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(constructive_edit_rate, "\n", n_const, "constructive edits"), fontface = 2),
            vjust = 1.2, size = 10, color = "white") +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
  labs(y = "Percent of edits that were constructive",
       x = "Experiment Group",
       title = "Constructive edit rate",
       caption = "Limited to published new content edits shown or eligible to be shown Tone Check") +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    plot.title = element_text(hjust = 0.5),
    text = element_text(size = 24),
    axis.text.x = element_text(size = 24),
    axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
    legend.position = "none",
    axis.line = element_line(colour = "black")
  )
p
Overall, constructive edit rates increased by +6.2% [+4.4 percentage points] for people shown Tone Check in the test group.
By platform
Code
tone_check_constructive_byplatform <- tone_check_publish_data |>
  filter(is_test_eligible == 'eligible') |>
  group_by(platform, test_group) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_const = n_distinct(editing_session[was_reverted == FALSE])
  ) |>
  mutate(constructive_edit_rate = paste0(round(n_const / n_edits * 100, 1), "%")) |>
  select(-c(n_edits, n_const)) %>% # removing granular data columns for publication
  gt() |>
  tab_header(title = "Constructive edit rate by platform") |>
  opt_stylize() |>
  cols_label(
    test_group = "Test Group",
    platform = "Platform",
    # n_edits = "Number of published new content edits",
    # n_const = "Number of constructive edits",
    constructive_edit_rate = "Proportion of new content edits that were constructive"
  ) |>
  tab_source_note(gt::md('Limited to published new content edits shown or eligible to be shown Tone Check'))

display_html(as_raw_html(tone_check_constructive_byplatform))
Constructive edit rate by platform

| Platform | Test Group | Proportion of new content edits that were constructive |
|---|---|---|
| mobile web | control (eligible but not shown tone check) | 65.5% |
| mobile web | test (tone check shown) | 65.4% |
| desktop | control (eligible but not shown tone check) | 75% |
| desktop | test (tone check shown) | 79.8% |

Limited to published new content edits shown or eligible to be shown Tone Check
We continue to see differing trends on mobile web compared to desktop. On desktop, the constructive edit rate increased by 6.4%, while we observed no statistically significant change in mobile web constructive edits.
By user experience
Code
tone_check_constructive_byexp <- tone_check_publish_data |>
  filter(platform == 'desktop' & is_new_content == TRUE & is_test_eligible == 'eligible') |>
  group_by(experience_level_group, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_const = n_distinct(editing_session[was_reverted == FALSE])
  ) |>
  mutate(constructive_edit_rate = paste0(round(n_const / n_edits * 100, 1), "%")) |>
  select(-c(n_edits, n_const)) |> # removing granular data columns for publication
  gt() |>
  tab_header(title = "Constructive edit rate by user experience") %>%
  opt_stylize() |>
  cols_label(
    test_group = "Test Group",
    experience_level_group = "User type",
    # n_edits = "Number of published new content edits",
    # n_const = "Number of constructive edits",
    constructive_edit_rate = "Proportion of new content edits that were constructive"
  ) |>
  tab_source_note(gt::md('Limited to published new content edits shown or eligible to be shown Tone Check'))

display_html(as_raw_html(tone_check_constructive_byexp))
Constructive edit rate by user experience

| User type | Test Group | Proportion of new content edits that were constructive |
|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 74.4% |
| Unregistered | test (tone check shown) | 69.1% |
| Newcomer | control (eligible but not shown tone check) | 79.6% |
| Newcomer | test (tone check shown) | 77.6% |
| Junior Contributor | control (eligible but not shown tone check) | 68% |
| Junior Contributor | test (tone check shown) | 81.4% |

Limited to published new content edits shown or eligible to be shown Tone Check
The increase in constructive edit rate appears to be primarily due to an increase in constructive edits by Junior Contributors shown Tone Check, where we observed a +14.8% increase [10.2 pp]. When limited to desktop edits, there was a 19.7% increase in constructive edits by Junior Contributors.
By Partner Wikipedia
Code
tone_check_constructive_bywiki <- tone_check_publish_data |>
  filter(is_new_content == TRUE & is_test_eligible == 'eligible') |>
  group_by(wiki, test_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    n_const = n_distinct(editing_session[was_reverted == FALSE])
  ) |>
  mutate(constructive_edit_rate = paste0(round(n_const / n_edits * 100, 1), "%")) |>
  # filter(n_edits > 100) %>% # limit to wikis with sufficient events
  select(-c(n_edits, n_const)) |> # removing granular data columns for publication
  gt() |>
  tab_header(title = "Constructive edit rate by Wikipedia") |>
  opt_stylize() |>
  cols_label(
    test_group = "Test Group",
    wiki = md("**Wikipedia**"),
    # n_edits = "Number of published new content edits",
    # n_const = "Number of constructive edits",
    constructive_edit_rate = "Proportion of new content edits that were constructive"
  )

display_html(as_raw_html(tone_check_constructive_bywiki))
Constructive edit rate by Wikipedia

| Wikipedia | Test Group | Proportion of new content edits that were constructive |
|---|---|---|
| French Wikipedia | control (eligible but not shown tone check) | 69.2% |
| French Wikipedia | test (tone check shown) | 71% |
| Japanese Wikipedia | control (eligible but not shown tone check) | 68.8% |
| Japanese Wikipedia | test (tone check shown) | 89.1% |
| Portuguese Wikipedia | control (eligible but not shown tone check) | 78.6% |
| Portuguese Wikipedia | test (tone check shown) | 83.1% |
Tone Check increased constructive edit rates at all three partner Wikipedias.
Aligned with the decreased revert rate findings, we confirmed that Tone Check had the biggest impact on constructive edit rates at Japanese Wikipedia, where there was a +29.5% increase in the constructive edit rate for users shown Tone Check compared to eligible edits in the control group.
Due to the small per-Wikipedia sample sizes, we are currently not able to confirm statistical significance of the increases at any of these Wikipedias, but the direction and magnitude of change indicate that Tone Check is having a positive effect on edit quality at each partner Wikipedia.
Confirming impact of Tone Check on constructive edit rate
We also modeled the impact of Tone Check on constructive edit rates to confirm the magnitude and direction of Tone Check's effect on a user completing a higher proportion of constructive edits. This helps account for random effects of the user and wiki.
Code
tone_check_constructive_overall_byuser <- tone_check_publish_data |>
  filter(is_test_eligible == 'eligible') |> # limit to eligible edits
  group_by(test_group, platform, user_id) |>
  summarise(
    n_edits = n_distinct(editing_session),
    n_const = n_distinct(editing_session[was_reverted == FALSE])
  ) |> # edits not reverted
  mutate(constructive_edit_rate = n_const / n_edits)
Code
# rename test group field names to align with relax package naming convention
tone_check_constructive_overall_byuser <- tone_check_constructive_overall_byuser |>
  mutate(variation = factor(
    test_group,
    levels = c("control (eligible but not shown tone check)", "test (tone check shown)"),
    labels = c("control", "treatment")
  ))
Code
# create new column name to align with relax package naming
tone_check_constructive_overall_byuser$outcome <- tone_check_constructive_overall_byuser$constructive_edit_rate
Code
# overall impact
overall_impact_const_edits <- tone_check_constructive_overall_byuser |>
  analyze_relative_lift(metric_type = "proportion", ci_level = 0.9) |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on overall constructive edit rate**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(label = md("**Bayesian Analysis**"),
              columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)) |>
  tab_spanner(label = md("**Frequentist Analysis**"),
              columns = c(estimate_freq, p_value, conf_lower, conf_upper)) |>
  # Rename columns for clarity
  cols_label(
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("90% CI Lower"),
    cred_upper = md("90% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("90% CI Lower"),
    conf_upper = md("90% CI Upper")
  ) |>
  # Apply formatting (decimals and CI grouping)
  fmt_number(columns = everything(), decimals = 3) |> # use 3 decimals for precision
  # Style the table
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(3),
    data_row.padding = px(5)
  )

display_html(as_raw_html(overall_impact_const_edits))
Evaluating Tone Check impact on overall constructive edit rate
Difference in Metric (Test Group - Control Group)

| Analysis | Point Estimate | Chance to Win | p-value | 90% CI Lower | 90% CI Upper |
|---|---|---|---|---|---|
| Bayesian | 0.032 | 0.948 | — | 0.000 | 0.064 |
| Frequentist | 0.032 | — | 0.103 | 0.000 | 0.064 |
The Tone Check feature was associated with a slight increase in the overall constructive edit rate. The frequentist result falls short of conventional significance thresholds (p = 0.103), but the Bayesian Chance to Win indicates a 94.8% probability that Tone Check increases the likelihood a user completes a constructive edit.
Code
# check by platform numbers
platform_impact_constr_edits <- tone_check_constructive_overall_byuser |>
  group_by(platform) |>
  group_modify(~ analyze_relative_lift(.x, metric_type = "proportion", ci_level = 0.9)) |>
  gt() |>
  tab_header(
    title = md("**Evaluating Tone Check impact on constructive edit rate by platform**"),
    subtitle = md("Difference in Metric (Test Group - Control Group)")
  ) |>
  tab_spanner(label = md("**Bayesian Analysis**"),
              columns = c(estimate_bayes, chance_to_win, cred_lower, cred_upper)) |>
  tab_spanner(label = md("**Frequentist Analysis**"),
              columns = c(estimate_freq, p_value, conf_lower, conf_upper)) |>
  # Rename columns for clarity
  cols_label(
    platform = md("Platform"),
    estimate_bayes = md("Point Estimate"),
    chance_to_win = md("Chance to Win"),
    cred_lower = md("90% CI Lower"),
    cred_upper = md("90% CI Upper"),
    estimate_freq = md("Point Estimate"),
    p_value = md("*p*-value"),
    conf_lower = md("90% CI Lower"),
    conf_upper = md("90% CI Upper")
  ) |>
  # Apply formatting (decimals and CI grouping)
  fmt_number(columns = everything(), decimals = 3) |> # use 3 decimals for precision
  # Style the table
  tab_options(
    table.border.top.color = "lightgray",
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(3),
    data_row.padding = px(5)
  )

display_html(as_raw_html(platform_impact_constr_edits))
Evaluating Tone Check impact on constructive edit rate by platform
Difference in Metric (Test Group - Control Group)

| Platform | Analysis | Point Estimate | Chance to Win | p-value | 90% CI Lower | 90% CI Upper |
|---|---|---|---|---|---|---|
| mobile web | Bayesian | 0.015 | 0.630 | — | −0.060 | 0.091 |
| mobile web | Frequentist | 0.016 | — | 0.737 | −0.061 | 0.092 |
| desktop | Bayesian | 0.030 | 0.926 | — | −0.004 | 0.064 |
| desktop | Frequentist | 0.030 | — | 0.147 | −0.004 | 0.064 |
Results confirm that Tone Check is most effective at increasing the quality of edits on desktop, where we see a strong positive trend: the Bayesian 92.6% Chance to Win suggests the tool is highly likely to be increasing constructive edits.
Mobile Web results are slightly directionally positive (+1.5 pp) but are not yet statistically significant. Consistent with other findings, presenting Tone Check on mobile web does not appear to be disruptive but has less of an impact on user behavior compared to desktop.
Constructive Retention Rate (Second Week)
Hypothesis
: Newcomers and Junior Contributors will be more likely to return to publish a new content edit in the future that does not include non-neutral language because Tone Check will have caused them to realize when they are at risk of this not being true.
Methodology
: First, we reviewed the proportion of Newcomers and Junior Contributors who published an edit on a main namespace where Tone Check was shown and successfully returned to make an unreverted edit to a main namespace between 7 and 14 days after their first edit (second week retention).
Code
# load retention data (first dataset)
constructive_retention_rate_1 <- read.csv(
  file = 'data/constructive_retention_data_14day.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE
)
Code
# load constructive retention rate (second dataset)
constructive_retention_rate_2 <- read.csv(
  file = 'data/constructive_retention_data_14day_pt2.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = TRUE
)
Code
# Combine the two datasets
constructive_retention_rate <- rbind(constructive_retention_rate_1, constructive_retention_rate_2)
Code
# Cleaning up dataset and renaming fields to clarify meanings
# Set experience level group and factor levels
constructive_retention_rate <- constructive_retention_rate %>%
  mutate(
    experience_level_group = case_when(
      user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
      user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 & user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count > 100 ~ "Non-Junior Contributor"
    ),
    experience_level_group = factor(
      experience_level_group,
      levels = c("Unregistered", "Newcomer", "Non-Junior Contributor", "Junior Contributor")
    ))

# rename experiment field to clarify
constructive_retention_rate <- constructive_retention_rate %>%
  mutate(test_group = factor(
    test_group,
    levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
    labels = c("control (eligible but not shown tone check)", "test (tone check shown)")
  ))

# rename platform from phone to mobile web to clarify meaning
constructive_retention_rate <- constructive_retention_rate %>%
  mutate(platform = factor(
    platform,
    levels = c('phone', 'desktop'),
    labels = c("mobile web", "desktop")
  ))
Overall
Code
constructive_retention_overall <- constructive_retention_rate %>%
  group_by(test_group) %>%
  summarise(
    return_editors = sum(return_editors),
    editors = sum(editors),
    retention_rate = paste0(round(return_editors / editors * 100, 1), "%")
  )
Code
constructive_retention_overall_table <- constructive_retention_overall %>%
  gt() %>%
  tab_header(title = "Constructive second week retention rate") %>%
  cols_label(
    test_group = "Experiment group",
    return_editors = "Number of editors that returned second week",
    editors = "Number of first week editors",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize() %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check their first week",
    locations = cells_column_labels(columns = 'retention_rate')
  )

display_html(as_raw_html(constructive_retention_overall_table))
Constructive second week retention rate

| Experiment group | Number of editors that returned second week | Number of first week editors | Retention rate |
|---|---|---|---|
| control (eligible but not shown tone check) | 115 | 1995 | 5.8% |
| test (tone check shown) | 167 | 2309 | 7.2% |

Limited to users shown or eligible to be shown at least one Tone Check their first week
People who encountered Tone Check are 24% more likely to return to make a constructive edit in their second week. 7.2% of people in the test group (shown Tone Check) returned to make a subsequent constructive edit, compared to 5.8% in the control group (+1.4 percentage points).
This suggests that rather than discouraging users, Tone Check may make them feel more supported or successful in their contributions, leading them to return at higher rates.
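The headline figure can be reproduced from the table above. This is a quick sketch, assuming the 24% was computed from the rounded retention rates rather than the raw counts (the raw counts give roughly 25%):

```r
# Relative lift in constructive second-week retention, reproduced from the
# rounded rates reported in the table. Assumption: the 24% figure uses the
# rounded 7.2% and 5.8% values rather than the raw counts.
control_rate <- 5.8
test_rate <- 7.2
relative_lift <- round((test_rate / control_rate - 1) * 100)
relative_lift  # 24
```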
By Platform
Code
```r
constructive_retention_byplatform <- constructive_retention_rate %>%
  group_by(platform, test_group) %>%
  summarise(
    return_editors = sum(return_editors),
    editors = sum(editors),
    retention_rate = paste0(round(return_editors / editors * 100, 1), "%")
  )
```
Code
```r
constructive_retention_byplatform_table <- constructive_retention_byplatform %>%
  select(-return_editors, -editors) %>%  # removing granular data columns for publication
  gt() %>%
  tab_header(title = "Constructive second week retention rate by platform") %>%
  cols_label(
    test_group = "Experiment group",
    platform = "Platform",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize() %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
    locations = cells_column_labels(columns = 'retention_rate')
  )

display_html(as_raw_html(constructive_retention_byplatform_table))
```
Constructive second week retention rate by platform

| Platform | Experiment group | Retention rate |
|---|---|---|
| mobile web | control (eligible but not shown tone check) | 3.1% |
| mobile web | test (tone check shown) | 5.2% |
| desktop | control (eligible but not shown tone check) | 6.8% |
| desktop | test (tone check shown) | 7.9% |

Limited to users shown or eligible to be shown at least one Tone Check
While we don’t have sufficient sample size to confirm statistical significance on a per-platform basis, we observed large relative increases in the constructive retention rate on both platforms.
By User Experience
Code
```r
constructive_retention_byuserexp <- constructive_retention_rate %>%
  group_by(experience_level_group, test_group) %>%
  summarise(
    return_editors = sum(return_editors),
    editors = sum(editors),
    retention_rate = paste0(round(return_editors / editors * 100, 1), "%")
  )
```
Code
```r
constructive_retention_byuserexp_table <- constructive_retention_byuserexp %>%
  select(-return_editors, -editors) %>%  # removing granular data columns for publication
  gt() %>%
  tab_header(title = "Constructive second week retention rate by user experience") %>%
  cols_label(
    test_group = "Experiment group",
    experience_level_group = "Experience level group",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize() %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
    locations = cells_column_labels(columns = 'retention_rate')
  )

display_html(as_raw_html(constructive_retention_byuserexp_table))
```
Constructive second week retention rate by user experience

| Experience level group | Experiment group | Retention rate |
|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 0.4% |
| Unregistered | test (tone check shown) | 1.3% |
| Newcomer | control (eligible but not shown tone check) | 2.7% |
| Newcomer | test (tone check shown) | 3.8% |
| Junior Contributor | control (eligible but not shown tone check) | 9.5% |
| Junior Contributor | test (tone check shown) | 11% |

Limited to users shown or eligible to be shown at least one Tone Check
We observed increases in retention rate across all user groups as well.
Confirming Impact on Retention Rate
Because retention events can be assumed to be independent of one another (a user can only be retained once), we use a simple test of proportions to confirm significance.
Code
```r
# Reframe data for the model
constructive_retention_overall <- constructive_retention_rate %>%
  group_by(test_group) %>%
  summarise(
    return_editors = sum(return_editors),
    editors = sum(editors),
    retention_rate = return_editors / editors
  )
```
Code
```r
# Extract the vectors
successes <- constructive_retention_overall$return_editors
totals <- constructive_retention_overall$editors

# Run proportion test
res <- prop.test(x = successes, n = totals, conf.level = 0.90)
res
```
The result is directionally positive and shows a clear improvement in constructive retention.
We are 90% confident that Tone Check results in a relative increase in retention of somewhere between 3.3% and 47.7%.
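For reference, `prop.test` reports a confidence interval on the absolute difference in proportions; the relative bounds quoted above can be recovered by dividing that interval by the control group's retention rate. This is a sketch of one plausible calculation (the division by the control rate is an assumption about how the quoted range was derived), using the counts from the retention table:

```r
# Translate prop.test's absolute CI on the difference in proportions into a
# relative increase over the control rate. Counts come from the retention
# table; the derivation itself is an assumption for illustration.
res <- prop.test(x = c(167, 115), n = c(2309, 1995), conf.level = 0.90)
control_rate <- 115 / 1995
rel_ci <- res$conf.int / control_rate
round(rel_ci * 100, 1)  # close to the quoted 3.3% and 47.7%
```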
Guardrails
We identified a set of 5 guardrails to make sure that Tone Check is not negatively impacting peoples’ experience completing an edit or causing disruption on the wikis. These were identified through a pre-mortem
task
completed at the beginning of the project. We’ve confirmed that Tone Check did not cause any decrease in edit quality (see the New Content edit rate section) or a significant (>20%) decrease in edit completion rate (see the Edit completion rate section). We also confirmed that Tone Check did not result in high block rates or false positive rates (see the sections below).
Guardrail #1: False Positive Rate
Description:
Proportion of contributors that decline revising the text they have drafted and indicate that it was irrelevant.
Methodology: For this check we define a false positive as the proportion of contributors that declined revising the text they had drafted (`event.feature = 'editCheck-tone' AND event.action = 'action-dismiss'`) and selected “The tone is appropriate” when declining the check. We further limited the analysis to edits that were not reverted within 48 hours (an indicator of a quality edit).
Overall
Code
```r
# Overall false positive rate
tone_check_false_positive_overall <- tone_check_reject_data %>%
  filter(was_tone_check_shown == TRUE, is_new_content == TRUE) %>%  # limit to where shown
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one tone check declined as "tone is appropriate" and edit not reverted
    n_rejects = n_distinct(editing_session[n_rejects > 0 &
                                           reject_reason == 'The tone is appropriate' &
                                           was_reverted == FALSE])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Edits where user declined to revise text because the tone was appropriate") %>%
  opt_stylize() %>%
  cols_label(
    n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check as irrelevant",
    dismissal_rate = "Decline Rate"
  ) %>%
  tab_source_note(gt::md('Limited to unreverted published edits where at least one Tone Check was shown'))

display_html(as_raw_html(tone_check_false_positive_overall))
```
Edits where user declined to revise text because the tone was appropriate

| Number of edits shown Tone check | Number of edits that declined Tone Check as irrelevant | Decline Rate |
|---|---|---|
| 1729 | 283 | 16.4% |

Limited to unreverted published edits where at least one Tone Check was shown
Editors declined a Tone Check and selected “The tone is appropriate” on 16.4% of all published edits where Tone Check was shown. This excludes edits that were reverted within 48 hours.
For comparison, this is higher than the rates observed for
Reference Check
(6.6% of editors indicated that the content they were adding did not require a reference) and lower than
Paste Check
(30% of editors indicated that they wrote the content).
By Platform
Code
```r
# False positive rate by platform
tone_check_false_positive_byplatform <- tone_check_reject_data %>%
  group_by(platform) %>%
  filter(was_tone_check_shown == TRUE, is_new_content == TRUE) %>%  # limit to where shown
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one tone check declined as "tone is appropriate" and edit not reverted
    n_rejects = n_distinct(editing_session[n_rejects > 0 &
                                           reject_reason == 'The tone is appropriate' &
                                           was_reverted == FALSE])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  select(-n_edits, -n_rejects) %>%  # removing granular data columns for publication
  gt() %>%
  tab_header(title = "Edits where user declined to revise text because the tone was appropriate by platform") %>%
  opt_stylize() %>%
  cols_label(
    platform = "Platform",
    dismissal_rate = "Decline Rate"
  ) %>%
  tab_source_note(gt::md('Limited to unreverted published edits where at least one Tone Check was shown'))

display_html(as_raw_html(tone_check_false_positive_byplatform))
```
Edits where user declined to revise text because the tone was appropriate by platform

| Platform | Decline Rate |
|---|---|
| mobile web | 15% |
| desktop | 16.8% |

Limited to unreverted published edits where at least one Tone Check was shown
Decline rates are similar on mobile web and desktop.
By User Experience
Code
```r
# False positive rate by user experience
tone_check_false_positive_byuserexp <- tone_check_reject_data %>%
  group_by(experience_level_group) %>%
  filter(was_tone_check_shown == TRUE, is_new_content == TRUE) %>%  # limit to where shown
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one tone check declined as "tone is appropriate" and edit not reverted
    n_rejects = n_distinct(editing_session[n_rejects > 0 &
                                           reject_reason == 'The tone is appropriate' &
                                           was_reverted == FALSE])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  select(-n_edits, -n_rejects) %>%  # removing granular data columns for publication
  gt() %>%
  tab_header(title = "Edits where user declined to revise text because the tone was appropriate by user experience") %>%
  opt_stylize() %>%
  cols_label(
    experience_level_group = "User experience",
    dismissal_rate = "Decline Rate"
  ) %>%
  tab_source_note(gt::md('Limited to unreverted published edits where at least one Tone Check was shown'))

display_html(as_raw_html(tone_check_false_positive_byuserexp))
```
Edits where user declined to revise text because the tone was appropriate by user experience

| User experience | Decline Rate |
|---|---|
| Unregistered | 18.2% |
| Newcomer | 16% |
| Junior Contributor | 16% |

Limited to unreverted published edits where at least one Tone Check was shown
By Partner Wikipedia
Code
```r
# Per-wiki false positive rate
tone_check_false_positive_bywiki <- tone_check_reject_data %>%
  group_by(wiki) %>%
  filter(was_tone_check_shown == TRUE, is_new_content == TRUE) %>%  # limit to where shown
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one tone check declined as "tone is appropriate" and edit not reverted
    n_rejects = n_distinct(editing_session[n_rejects > 0 &
                                           reject_reason == 'The tone is appropriate' &
                                           was_reverted == FALSE])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  select(-n_edits, -n_rejects) %>%  # removing granular data columns for publication
  gt() %>%
  tab_header(title = "Edits where user declined to revise text because the tone was appropriate by partner Wikipedia") %>%
  opt_stylize() %>%
  cols_label(
    wiki = "Wikipedia",
    dismissal_rate = "Decline Rate"
  ) %>%
  tab_source_note(gt::md('Limited to unreverted published edits where at least one Tone Check was shown'))

display_html(as_raw_html(tone_check_false_positive_bywiki))
```
Edits where user declined to revise text because the tone was appropriate by partner Wikipedia

| Wikipedia | Decline Rate |
|---|---|
| French Wikipedia | 17.4% |
| Japanese Wikipedia | 7.3% |
| Portuguese Wikipedia | 19.1% |

Limited to unreverted published edits where at least one Tone Check was shown
At Japanese Wikipedia, editors declined Tone Check as irrelevant (tone was appropriate) on only 7.3% of edits where it was shown. This is notably lower than the rates observed at French Wikipedia (17.4%) and Portuguese Wikipedia (19.1%).
Notably, Japanese Wikipedia is also where we observed the highest increase in edit quality as measured by decrease in revert rates.
Guardrail #2: Block Rate
Description
Proportion of contributors blocked after publishing an edit where Tone Check was shown, compared to contributors eligible but not shown Tone Check.
Methodology: We gathered all edits where the check was shown from the `mediawiki_revision_change_tag` table and joined with `mediawiki_private_cu_changes` to gather user name info. We then reviewed both global and local blocks made within 6 hours of the Tone Check event, as identified in the logging table.
Code
```r
# Load data for assessing blocks
edit_check_blocks <- read.csv(
  file = 'data/tone_check_eligible_users_blocked.csv',
  header = TRUE,
  sep = ",",
  stringsAsFactors = FALSE
)
```
Code
```r
# Rename experiment field to clarify
edit_check_blocks <- edit_check_blocks %>%
  mutate(
    test_group = factor(
      bucket,
      levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
      labels = c("control (eligible but not shown tone check)", "test (tone check shown)")
    )
  )
```
Code
```r
edit_check_local_blocks_overall <- edit_check_blocks %>%
  # filter(user_id == 0) %>%  # filter to identify logged-out users
  group_by(test_group) %>%
  summarise(
    # count users with either a local or a global block
    blocked_users = n_distinct(ip[is_local_blocked == 'True' | is_global_blocked == 'True']),
    all_users = n_distinct(ip)
  ) %>%
  mutate(prop_blocks = paste0(round(blocked_users / all_users * 100, 1), "%")) %>%
  select(-blocked_users, -all_users) %>%  # removing granular data columns
  gt() %>%
  tab_header(title = "Proportion of users blocked by experiment group") %>%
  opt_stylize() %>%
  cols_label(
    test_group = "Test Group",
    prop_blocks = "Proportion of users blocked"
  ) %>%
  tab_source_note(gt::md('Limited to users blocked 6 hours after publishing an edit where Tone Check was shown'))

display_html(as_raw_html(edit_check_local_blocks_overall))
```
Proportion of users blocked by experiment group

| Test Group | Proportion of users blocked |
|---|---|
| control (eligible but not shown tone check) | 0.8% |
| test (tone check shown) | 0.8% |

Limited to users blocked 6 hours after publishing an edit where Tone Check was shown
People shown Tone Check were not blocked at higher rates than users in the control group: 0.8% of users were blocked in both the test and control groups.
Appendix
We reviewed a number of additional secondary metrics and curiosities. These help us learn more about the impact of Tone Check on editing behavior but are not primary targets of the intervention.
Tone Check Decline Rates and Reasons
Hypothesis
Knowing the reasons people elect not to revise tone when the check prompts them to do so (by platform) will help us decide what, if anything, can be done to decrease the proportion of people on desktop who decline.
Methodology
: We reviewed the proportion of published new content edits shown Tone Check where people elected not to revise the tone of the text they added (i.e., the Tone Check was dismissed).
This was determined by edits where the user dismissed a Tone Check at least once in a session (
event.feature = 'editCheck-tone' AND event.action = 'action-dismiss'
). The analysis includes splits by the reason the user selected for dismissing the check.
Code
```r
# Load data for assessing edit reject frequency
tone_check_reject_data_1 <- read.csv(
  file = 'data/tone_check_rejects_data_ab.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE
)
```
Code
```r
# Load the second reject dataset
tone_check_reject_data_2 <- read.csv(
  file = 'data/tone_check_rejects_data_ab_pt2.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE
)
```
Code
```r
# Combine the two datasets
tone_check_reject_data <- rbind(tone_check_reject_data_1, tone_check_reject_data_2)
```
Code
```r
# Set experience level group and factor levels
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    experience_level_group = case_when(
      user_edit_count == 0 & user_status == 'registered' ~ 'Newcomer',
      user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 & user_edit_count <= 100 ~ "Junior Contributor",
      user_edit_count > 100 ~ "Non-Junior Contributor"
    ),
    experience_level_group = factor(
      experience_level_group,
      levels = c("Unregistered", "Newcomer", "Non-Junior Contributor", "Junior Contributor")
    )
  )

# Rename experiment field to clarify
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    test_group = factor(
      test_group,
      levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
      labels = c("control (eligible but not shown Tone Check)", "test (shown Tone Check)")
    )
  )

# Rename platform from 'phone' to 'mobile web' to clarify meaning
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    platform = factor(
      platform,
      levels = c('phone', 'desktop'),
      labels = c("mobile web", "desktop")
    )
  )

# Rename wiki names
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(wiki = recode(wiki, !!!wiki_name_lookup))
```
Code
```r
# Set fields and factor levels to assess the number of checks shown
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    multiple_checks_shown = ifelse(n_checks_shown > 1,
                                   "multiple checks shown",
                                   "single check shown"),
    multiple_checks_shown = factor(multiple_checks_shown,
                                   levels = c("single check shown", "multiple checks shown"))
  )

# Note: these buckets can be adjusted as needed based on the distribution of the data
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    checks_shown_bucket = case_when(
      is.na(n_checks_shown) ~ '0',
      n_checks_shown == 1 ~ '1',
      n_checks_shown == 2 ~ '2',
      n_checks_shown > 2 & n_checks_shown <= 5 ~ "3-5",
      n_checks_shown > 5 & n_checks_shown <= 10 ~ "6-10",
      n_checks_shown > 10 ~ "over 10"
    ),
    checks_shown_bucket = factor(checks_shown_bucket,
                                 levels = c("0", "1", "2", "3-5", "6-10", "over 10"))
  )
```
Code
```r
# Shorten and clarify reason field names
tone_check_reject_data <- tone_check_reject_data %>%
  mutate(
    reject_reason = case_when(
      reject_reason == 'no_reject_reason' ~ 'No reason provided',
      reject_reason == 'edit-check-feedback-reason-other' ~ 'None applies',
      reject_reason == 'edit-check-feedback-reason-appropriate' ~ 'The tone is appropriate',
      reject_reason == 'edit-check-feedback-reason-uncertain' ~ 'Not sure how to revise tone'
    ),
    reject_reason = factor(reject_reason,
                           levels = c("No reason provided", "None applies",
                                      "The tone is appropriate", "Not sure how to revise tone"))
  )
```
Overall
Code
```r
# Overall dismissal rate
tone_check_dismissal_overall <- tone_check_reject_data %>%
  filter(was_tone_check_shown == TRUE, is_new_content == TRUE) %>%  # limit to where shown
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one tone check declined
    n_rejects = n_distinct(editing_session[n_rejects > 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Tone Check decline rate") %>%
  opt_stylize() %>%
  cols_label(
    n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
  tab_source_note(gt::md('Limited to published edits where at least one Tone Check was shown'))

display_html(as_raw_html(tone_check_dismissal_overall))
```
Tone Check decline rate

| Number of edits shown Tone check | Number of edits that declined Tone Check | Proportion of edits where Tone Check was declined |
|---|---|---|
| 1729 | 646 | 37.4% |

Limited to published edits where at least one Tone Check was shown
Tone Check was declined at 37.4% of all new content edits where at least one Tone Check was shown during an editing session. This is lower than the rates reported for other available checks, including Paste Check (54.8% decline rate).
Code
```r
### By decline reason
tone_check_dismissal_byreason_overall <- tone_check_reject_data %>%
  filter(is_new_content == TRUE,
         was_tone_check_shown == TRUE,
         n_rejects > 0) %>%  # limit to where shown and user elected not to revise text
  group_by(reject_reason) %>%
  summarise(n_edits_rejected = n_distinct(editing_session)) %>%
  mutate(select_rate = paste0(round(n_edits_rejected / sum(n_edits_rejected) * 100, 1), "%"))
```
Code
```r
# Plot bar chart of reason selection
dodge <- position_dodge(width = 0.9)

p <- tone_check_dismissal_byreason_overall %>%
  ggplot(aes(x = reject_reason, y = n_edits_rejected / sum(n_edits_rejected))) +
  geom_col(position = 'dodge', fill = 'dodgerblue4') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(select_rate, "\n", n_edits_rejected, "edits")),
            fontface = 2, vjust = 1.2, size = 10, color = "white") +
  scale_fill_manual(values = cbPalette, name = "Reason") +
  labs(y = "Percent of edits", x = "Selected reason",
       title = "Reasons users selected for not revising text",
       caption = "Limited to published edits where a user elected to not revise text") +
  theme(panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 24),
        axis.text.x = element_text(size = 18),
        axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
        legend.position = "none",
        axis.line = element_line(colour = "black"))

p
```
Editors selected “The tone is appropriate” in over half (58.9%)
of all published new content edits where the user elected to not revise their text.
By whether multiple checks were shown
Code
```r
tone_check_dismissal_bymultiple <- tone_check_reject_data %>%
  filter(is_new_content == TRUE, was_tone_check_shown == TRUE) %>%  # limit to where shown
  group_by(multiple_checks_shown) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one tone check declined and edit not reverted
    n_rejects = n_distinct(editing_session[n_rejects > 0 & was_reverted == FALSE])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  gt() %>%
  tab_header(title = "Tone Check decline rate by whether multiple checks were shown") %>%
  opt_stylize() %>%
  cols_label(
    multiple_checks_shown = "Multiple Checks",
    n_edits = "Number of edits shown Tone Check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
  tab_source_note(gt::md('Limited to published edits where at least one Tone Check was shown'))

display_html(as_raw_html(tone_check_dismissal_bymultiple))
```
Tone Check decline rate by whether multiple checks were shown

| Multiple Checks | Number of edits shown Tone Check | Number of edits that declined Tone Check | Proportion of edits where Tone Check was declined |
|---|---|---|---|
| single check shown | 512 | 178 | 34.8% |
| multiple checks shown | 1217 | 292 | 24% |

Limited to published edits where at least one Tone Check was shown
As we also observed in the
leading indicators report
, the decline rate decreases when multiple checks are shown.
Edits where multiple checks are shown are likely longer edits, where the user may have more incentive to ensure their edit does not get reverted.
By Platform
Code
```r
tone_check_dismissal_byplatform <- tone_check_reject_data %>%
  filter(is_new_content == TRUE, was_tone_check_shown == TRUE) %>%  # limit to where shown
  group_by(platform) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one tone check declined
    n_rejects = n_distinct(editing_session[n_rejects > 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  ungroup() %>%
  # mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
  #        n_rejects = ifelse(n_rejects < 50, "<50", n_rejects)) %>%  # sanitizing per data publication guidelines
  # select(-2) %>%
  gt() %>%
  tab_header(title = "Tone Check decline rate by platform") %>%
  opt_stylize() %>%
  cols_label(
    platform = "Platform",
    n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
  tab_source_note(gt::md('Limited to published edits where at least one Tone Check was shown'))

display_html(as_raw_html(tone_check_dismissal_byplatform))
```
Tone Check decline rate by platform

| Platform | Number of edits shown Tone check | Number of edits that declined Tone Check | Proportion of edits where Tone Check was declined |
|---|---|---|---|
| mobile web | 380 | 151 | 39.7% |
| desktop | 1349 | 496 | 36.8% |

Limited to published edits where at least one Tone Check was shown
Tone checks are declined only slightly more frequently on mobile compared to desktop. 36.8% of all published desktop edits where Tone Check was shown include at least one check that was declined compared to 39.7% of all published mobile edits.
This suggests that the lower impact Tone Check has on mobile web edit quality is not due to users explicitly rejecting the check.
Code
```r
### Decline reason by platform
tone_check_dismissal_byreason_byplatform <- tone_check_reject_data %>%
  filter(is_new_content == TRUE,
         was_tone_check_shown == TRUE,
         n_rejects > 0) %>%  # limit to where shown and user did not revise text
  group_by(platform, reject_reason) %>%
  summarise(n_edits_rejected = n_distinct(editing_session)) %>%
  mutate(select_rate = round(n_edits_rejected / sum(n_edits_rejected), 3))
```
Code
```r
# Plot bar chart of reason selection
dodge <- position_dodge(width = 0.9)

# Slightly larger chart needed here
options(repr.plot.width = 18, repr.plot.height = 10)

p <- tone_check_dismissal_byreason_byplatform %>%
  ggplot(aes(x = reject_reason, y = select_rate, fill = reject_reason)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste0(select_rate * 100, "%")),
            fontface = 2, vjust = 1.2, size = 10, color = "white") +
  facet_grid(~ platform) +
  labs(y = "Percent of edits", x = "Selected reason",
       title = "Reasons users selected for not revising text") +
  scale_fill_manual(values = cbPalette, name = "Reason") +
  theme(panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 24),
        legend.position = "bottom",
        legend.text = element_text(size = 18),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.line = element_line(colour = "black"))

p
```
On both mobile web and desktop, “the tone is appropriate” is the most frequently selected reason for electing not to revise text. The other decline options, including “None applies” and “Not sure how to revise tone”, see similar rates of selection on both platforms.
By User Experience
Code
```r
tone_check_dismissal_byuserexp <- tone_check_reject_data %>%
  filter(is_new_content == TRUE, was_tone_check_shown == TRUE) %>%  # limit to where shown
  group_by(experience_level_group) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one tone check declined
    n_rejects = n_distinct(editing_session[n_rejects > 0])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  ungroup() %>%
  mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
         n_rejects = ifelse(n_rejects < 50, "<50", n_rejects)) %>%  # sanitizing per data publication guidelines
  # select(-2) %>%
  gt() %>%
  tab_header(title = "Tone Check decline rate by user experience") %>%
  opt_stylize() %>%
  cols_label(
    experience_level_group = "User Experience",
    n_edits = "Number of edits shown Tone check",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
  tab_source_note(gt::md('Limited to published edits where at least one Tone Check was shown'))

display_html(as_raw_html(tone_check_dismissal_byuserexp))
```
Tone Check decline rate by user experience

| User Experience | Number of edits shown Tone check | Number of edits that declined Tone Check | Proportion of edits where Tone Check was declined |
|---|---|---|---|
| Unregistered | 291 | 141 | 48.5% |
| Newcomer | 406 | 141 | 34.7% |
| Junior Contributor | 1032 | 366 | 35.5% |

Limited to published edits where at least one Tone Check was shown
Unregistered users are more likely to decline a Tone Check than registered users: Tone Check was declined on 48.5% of published new content edits by unregistered users, compared to roughly 35% for registered users.
Code
```r
### Dismissal reason by user experience
tone_check_dismissal_byreason_byuserexp <- tone_check_reject_data %>%
  filter(is_new_content == TRUE,
         was_tone_check_shown == TRUE,
         n_rejects > 0) %>%  # limit to where shown and user did not revise their text
  group_by(experience_level_group, reject_reason) %>%
  summarise(n_edits_rejected = n_distinct(editing_session)) %>%
  mutate(select_rate = round(n_edits_rejected / sum(n_edits_rejected), 3))
```
Code
```r
# Plot bar chart of reason selection
dodge <- position_dodge(width = 0.9)

# Slightly larger chart needed here
options(repr.plot.width = 18, repr.plot.height = 10)

p <- tone_check_dismissal_byreason_byuserexp %>%
  ggplot(aes(x = reject_reason, y = select_rate, fill = reject_reason)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste0(select_rate * 100, "%")),
            fontface = 2, vjust = 1.2, size = 10, color = "white") +
  facet_grid(~ experience_level_group) +
  labs(y = "Percent of edits", x = "Selected reason",
       title = "Reasons users selected for not revising their text") +
  scale_fill_manual(values = cbPalette, name = "Reason") +
  theme(panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 24),
        legend.position = "bottom",
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.line = element_line(colour = "black"))

p
```
The distribution of decline reasons was broadly similar across the three reviewed user groups.
A somewhat higher proportion of unregistered editors (62%) selected “The tone is appropriate” compared to registered editors (~57%). Unregistered editors were also slightly more likely to select “Not sure how to revise tone” and less likely to select “No reason provided” than registered editors.
By partner Wikipedia
Code
```r
tone_check_dismissal_bywiki <- tone_check_reject_data %>%
  filter(is_new_content == TRUE, was_tone_check_shown == TRUE) %>%  # limit to where shown
  group_by(wiki) %>%
  summarise(
    n_edits = n_distinct(editing_session),
    # at least one tone check declined and edit not reverted
    n_rejects = n_distinct(editing_session[n_rejects > 0 & was_reverted == FALSE])
  ) %>%
  mutate(dismissal_rate = paste0(round(n_rejects / n_edits * 100, 1), "%")) %>%
  # filter(n_edits > 50) %>%  # limit to wikis with over 50 edits
  ungroup() %>%
  mutate(n_edits = ifelse(n_edits < 50, "<50", n_edits),
         n_rejects = ifelse(n_rejects < 50, "<50", n_rejects)) %>%  # sanitizing per data publication guidelines
  select(-n_edits) %>%
  gt() %>%
  tab_header(title = "Tone Check decline rate by partner Wikipedia") %>%
  opt_stylize() %>%
  cols_label(
    wiki = "Wikipedia",
    n_rejects = "Number of edits that declined Tone Check",
    dismissal_rate = "Proportion of edits where Tone Check was declined"
  ) %>%
  tab_source_note(gt::md('Limited to published edits where at least one Tone Check was shown'))

display_html(as_raw_html(tone_check_dismissal_bywiki))
```
Tone Check decline rate by partner Wikipedia

| Wikipedia | Number of edits that declined Tone Check | Proportion of edits where Tone Check was declined |
|---|---|---|
| French Wikipedia | 335 | 27.3% |
| Japanese Wikipedia | 61 | 27.9% |
| Portuguese Wikipedia | 75 | 26.5% |

Limited to published edits where at least one Tone Check was shown
Decline rates are very similar across all three Partner Wikipedias.
Distinct users that publish a reverted edit
Hypothesis
: Newcomers and Junior Contributors will be more aware of the need to write in a neutral tone when contributing new text because the visual editor will prompt them to do so in cases where they have written text that contains non-neutral language.
Methodology
: The proportion of newcomers and Junior Contributors shown or eligible to be shown Tone Check that publish at least one new content edit that was reverted.
This metric is similar to the revert rate analysis except that it looks at the proportion of distinct editors rather than distinct edits. There were no significant differences from the results reported in the Primary Metric 1: Revert rate section, as the majority of newcomers and Junior Contributors published just one new content edit during the reviewed time period. See details below.
Overall
Code
```r
tone_check_reverts_byuser_overall <- tone_check_publish_data |>
  filter(is_new_content == TRUE,
         is_test_eligible == 'eligible') |>  # limit to eligible edits
  group_by(test_group) |>
  summarise(
    n_users = n_distinct(user_id),
    # reverted within 48 hours
    n_users_revert = n_distinct(user_id[was_reverted == TRUE])
  ) |>
  mutate(revert_rate = paste0(round(n_users_revert / n_users * 100, 1), "%"))
```
Code
```r
# Plot visualization of overall users reverted
dodge <- position_dodge(width = 0.9)

p <- tone_check_reverts_byuser_overall |>
  ggplot(aes(x = test_group, y = n_users_revert / n_users, fill = test_group)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = paste(revert_rate, "\n", n_users_revert, "users reverted")),
            fontface = 2, vjust = 1.2, size = 10, color = "white") +
  scale_fill_manual(values = c("#999999", "dodgerblue4"), name = "Experiment Group") +
  labs(y = "Percent of distinct users reverted", x = "Experiment Group",
       title = "Proportion of users with at least one reverted edit",
       caption = "Limited to published new content edits shown or eligible to be shown Tone Check") +
  theme(panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 24),
        axis.text.x = element_text(size = 24),
        axis.title.x = element_text(margin = margin(t = 20, unit = "pt")),
        legend.position = "none",
        axis.line = element_line(colour = "black"))

p
```
By whether multiple checks were shown
Code
```r
tone_check_revert_byuser_bymultiple <- tone_check_publish_data |>
  filter(
    is_new_content == TRUE,
    is_test_eligible == 'eligible',
    test_group == 'test (tone check shown)',
    multiple_checks_shown != "no tone checks"
  ) |>
  group_by(multiple_checks_shown) |>
  summarise(
    n_users        = n_distinct(user_id),
    n_revert_users = n_distinct(user_id[was_reverted == TRUE]) # reverted within 48 hours
  ) |>
  mutate(revert_rate = paste0(round(n_revert_users / n_users * 100, 1), "%")) |>
  select(-c(n_users, n_revert_users)) |> # removing granular data columns for publication
  gt() |>
  tab_header(title = "Users with at least one reverted edit by whether multiple checks were shown") |>
  opt_stylize() |>
  cols_label(
    multiple_checks_shown = "Multiple checks",
    revert_rate           = "Proportion of distinct users that were reverted"
  ) |>
  tab_source_note(gt::md('Limited to published new content edits shown or eligible to be shown Tone Check'))

display_html(as_raw_html(tone_check_revert_byuser_bymultiple))
```
Users with at least one reverted edit by whether multiple checks were shown

| Multiple checks | Proportion of distinct users that were reverted |
|---|---|
| one tone check | 26% |
| multiple tone checks | 26.6% |

Limited to published new content edits shown or eligible to be shown Tone Check
By Platform
Code
```r
tone_check_revert_byuser_byplatform <- tone_check_publish_data |>
  filter(is_new_content == TRUE, is_test_eligible == 'eligible') |>
  group_by(platform, test_group) |>
  summarise(
    n_users        = n_distinct(user_id),
    n_revert_users = n_distinct(user_id[was_reverted == TRUE]) # reverted within 48 hours
  ) |>
  mutate(revert_rate = paste0(round(n_revert_users / n_users * 100, 1), "%")) |>
  select(-c(n_users, n_revert_users)) |> # removing granular data columns for publication
  gt() |>
  tab_header(title = "Users with at least one reverted edit by platform") |>
  opt_stylize() |>
  cols_label(
    test_group  = "Experiment Group",
    platform    = "Platform",
    revert_rate = "Proportion of distinct users that were reverted"
  ) |>
  tab_source_note(gt::md('Limited to users who published new content edits shown or eligible to be shown Tone Check'))

display_html(as_raw_html(tone_check_revert_byuser_byplatform))
```
Users with at least one reverted edit by platform

| Platform | Experiment Group | Proportion of distinct users that were reverted |
|---|---|---|
| mobile web | control (eligible but not shown tone check) | 33% |
| mobile web | test (tone check shown) | 37.4% |
| desktop | control (eligible but not shown tone check) | 24.1% |
| desktop | test (tone check shown) | 22.7% |

Limited to users who published new content edits shown or eligible to be shown Tone Check
By User Experience
Code
```r
tone_check_revert_byuser_byuserexp <- tone_check_publish_data |>
  filter(is_new_content == TRUE, is_test_eligible == 'eligible') |>
  group_by(experience_level_group, test_group) |>
  summarise(
    n_users        = n_distinct(user_id),
    n_revert_users = n_distinct(user_id[was_reverted == TRUE]) # reverted within 48 hours
  ) |>
  mutate(revert_rate = paste0(round(n_revert_users / n_users * 100, 1), "%")) |>
  select(-c(n_users, n_revert_users)) |> # removing granular data columns for publication
  gt() |>
  tab_header(title = "Users with at least one reverted edit by user experience") |>
  opt_stylize() |>
  cols_label(
    test_group             = "Test group",
    experience_level_group = "User experience",
    revert_rate            = "Proportion of distinct users that were reverted"
  ) |>
  tab_source_note(gt::md('Limited to users who published new content edits shown or eligible to be shown Tone Check'))

display_html(as_raw_html(tone_check_revert_byuser_byuserexp))
```
Users with at least one reverted edit by user experience

| User experience | Test group | Proportion of distinct users that were reverted |
|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 34.9% |
| Unregistered | test (tone check shown) | 37.2% |
| Newcomer | control (eligible but not shown tone check) | 21.5% |
| Newcomer | test (tone check shown) | 26.4% |
| Junior Contributor | control (eligible but not shown tone check) | 25.5% |
| Junior Contributor | test (tone check shown) | 22.5% |

Limited to users who published new content edits shown or eligible to be shown Tone Check
Constructive Retention Rate (Tone Check not shown again)
We also reviewed the proportion of newcomers and Junior Contributors that published an edit in which Tone Check was activated and then returned 7 to 14 days later to make a new content edit in which Tone Check was not shown.
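The retention calculation itself is simple: first-week editors form the denominator, and those who also published a qualifying edit 7 to 14 days later form the numerator. A minimal sketch, using a hypothetical `returned_wk2` flag rather than the report's actual columns:

```r
# Hypothetical cohort: 5 first-week editors, 1 of whom returned in week two
cohort <- data.frame(
  user_id      = 1:5,
  returned_wk2 = c(TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Second-week retention rate, formatted the same way as the report's tables
retention_rate <- paste0(round(sum(cohort$returned_wk2) / nrow(cohort) * 100, 1), "%")
retention_rate  # "20%"
```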
Code
```r
# load retention data (first dataset)
retention_rate_norepeat_check_1 <- read.csv(
  file = 'data/retention_notone_data.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE
)
```
Code
```r
# load constructive retention rate (second dataset)
retention_rate_norepeat_check_2 <- read.csv(
  file = 'data/retention_notone_data_pt2.tsv',
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE # match the first dataset so the rbind below combines cleanly
)
```
Code
```r
# Combine the two datasets
retention_rate_norepeat_check <- rbind(
  retention_rate_norepeat_check_1,
  retention_rate_norepeat_check_2
)
```
Code
```r
# Clean up dataset and rename fields to clarify meanings
# Set experience level group and factor levels
retention_rate_norepeat_check <- retention_rate_norepeat_check %>%
  mutate(
    experience_level_group = case_when(
      user_edit_count == 0 & user_status == 'registered'   ~ 'Newcomer',
      user_edit_count == 0 & user_status == 'unregistered' ~ 'Unregistered',
      user_edit_count > 0 & user_edit_count <= 100         ~ "Junior Contributor",
      user_edit_count > 100                                ~ "Non-Junior Contributor"
    ),
    experience_level_group = factor(
      experience_level_group,
      levels = c("Unregistered", "Newcomer", "Non-Junior Contributor", "Junior Contributor")
    )
  )

# rename experiment field to clarify
retention_rate_norepeat_check <- retention_rate_norepeat_check %>%
  mutate(test_group = factor(
    test_group,
    levels = c('2025-09-editcheck-tone-control', '2025-09-editcheck-tone-test'),
    labels = c("control (eligible but not shown tone check)", "test (tone check shown)")
  ))

# rename platform from phone to mobile web to clarify meaning
retention_rate_norepeat_check <- retention_rate_norepeat_check %>%
  mutate(platform = factor(
    platform,
    levels = c('phone', 'desktop'),
    labels = c("mobile web", "desktop")
  ))
```
Overall
Code
```r
retention_rate_norepeat_overall <- retention_rate_norepeat_check %>%
  group_by(test_group) %>%
  summarise(
    return_editors = sum(return_editors),
    editors        = sum(editors),
    retention_rate = paste0(round(return_editors / editors * 100, 1), "%")
  )
```
Code
```r
retention_rate_norepeat_overall_table <- retention_rate_norepeat_overall %>%
  select(-c(return_editors, editors)) %>% # removing granular data columns for publication
  gt() %>%
  tab_header(title = "Constructive second week retention rate (tone check not shown again)") %>%
  cols_label(
    test_group     = "Experiment group",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize() %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
    locations = cells_column_labels(columns = 'retention_rate')
  )

display_html(as_raw_html(retention_rate_norepeat_overall_table))
```
Constructive second week retention rate (tone check not shown again)

| Experiment group | Retention rate |
|---|---|
| control (eligible but not shown tone check) | 1.6% |
| test (tone check shown) | 1.7% |

Limited to users shown or eligible to be shown at least one Tone Check
Less than 2% of editors in both the control and test group returned to make another new content edit where Tone Check was not shown or not eligible to be shown in their second week. While we see a slight increase in retention rate for editors shown Tone Check, we are unable to confirm statistical significance due to the small effect and sample size.
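The significance check referenced above can be sketched as a two-proportion test. The editor counts below are hypothetical values chosen to match the reported 1.6% and 1.7% rates; the real denominators live in the source data.

```r
# Hypothetical counts matching the reported retention rates (illustrative only)
returned <- c(control = 16, test = 17)
editors  <- c(control = 1000, test = 1000)

# Two-proportion test of control vs. test retention
result <- prop.test(returned, editors)
result$p.value  # well above 0.05: a difference this small is not detectable at this n
```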
By Platform
Code
```r
retention_rate_norepeat_byplatform <- retention_rate_norepeat_check %>%
  group_by(platform, test_group) %>%
  summarise(
    return_editors = sum(return_editors),
    editors        = sum(editors),
    retention_rate = paste0(round(return_editors / editors * 100, 1), "%")
  )
```
Code
```r
retention_rate_norepeat_byplatform_table <- retention_rate_norepeat_byplatform %>%
  select(-c(return_editors, editors)) %>% # removing granular data columns for publication
  gt() %>%
  tab_header(title = "Constructive second week retention rate (tone check not shown again) by platform") %>%
  cols_label(
    test_group     = "Experiment group",
    platform       = "Platform",
    retention_rate = "Retention rate"
  ) %>%
  opt_stylize() %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
    locations = cells_column_labels(columns = 'retention_rate')
  )

display_html(as_raw_html(retention_rate_norepeat_byplatform_table))
```
Constructive second week retention rate (tone check not shown again) by platform

| Platform | Experiment group | Retention rate |
|---|---|---|
| mobile web | control (eligible but not shown tone check) | 0.5% |
| mobile web | test (tone check shown) | 1.7% |
| desktop | control (eligible but not shown tone check) | 2.1% |
| desktop | test (tone check shown) | 1.7% |

Limited to users shown or eligible to be shown at least one Tone Check
By User Experience
Code
```r
retention_rate_norepeat_byuserexp <- retention_rate_norepeat_check %>%
  group_by(experience_level_group, test_group) %>%
  summarise(
    return_editors = sum(return_editors),
    editors        = sum(editors),
    retention_rate = paste0(round(return_editors / editors * 100, 1), "%")
  )
```
Code
```r
retention_rate_norepeat_byuserexp_table <- retention_rate_norepeat_byuserexp %>%
  select(-c(return_editors, editors)) %>% # removing granular data columns for publication
  gt() %>%
  tab_header(title = "Constructive second week retention rate (tone check not shown again) by user experience") %>%
  cols_label(
    test_group             = "Experiment group",
    experience_level_group = "Experience level group",
    retention_rate         = "Retention rate"
  ) %>%
  opt_stylize() %>%
  tab_footnote(
    footnote = "Limited to users shown or eligible to be shown at least one Tone Check",
    locations = cells_column_labels(columns = 'retention_rate')
  )

display_html(as_raw_html(retention_rate_norepeat_byuserexp_table))
```
Constructive second week retention rate (tone check not shown again) by user experience

| Experience level group | Experiment group | Retention rate |
|---|---|---|
| Unregistered | control (eligible but not shown tone check) | 0% |
| Unregistered | test (tone check shown) | 0.5% |
| Newcomer | control (eligible but not shown tone check) | 1.1% |
| Newcomer | test (tone check shown) | 2.3% |
| Junior Contributor | control (eligible but not shown tone check) | 2.7% |
| Junior Contributor | test (tone check shown) | 2% |

Limited to users shown or eligible to be shown at least one Tone Check
Reuse
CC BY-SA 4.0