Abstract
Scientific English is currently undergoing rapid change, with words like “delve,” “intricate,” and “underscore” appearing far more frequently than just a few years ago. It is widely assumed that scientists’ use of large language models (LLMs) is responsible for such trends. We develop a formal, transferable method to characterize these linguistic changes. Application of our method yields 21 focal words whose increased occurrence in scientific abstracts is likely the result of LLM usage. We then pose “the puzzle of lexical overrepresentation”:
why are such words overused by LLMs? We fail to find evidence that lexical overrepresentation is caused by model architecture, algorithm choices, or training data. To assess whether reinforcement learning from human feedback (RLHF) contributes to the overuse of focal words, we undertake comparative model testing and conduct an exploratory online study. While the model testing is consistent with RLHF playing a role, our experimental results suggest that participants may be reacting differently to “delve” than to other focal words. With LLMs quickly becoming a driver of global language change, investigating these potential sources of lexical overrepresentation is important. We note that while insights into the workings of LLMs are within reach, a lack of transparency surrounding model development remains an obstacle to such research.
Introduction
Like all human language, Scientific English has changed substantially over time (Degaetano-Ortlieb and Teich 2018; Degaetano-Ortlieb et al. 2018; Bizzoni et al. 2020; Menzel 2022). New discoveries have fueled (and perhaps been fueled by) the introduction of new lexical items into scientific discourse (Degaetano-Ortlieb and Teich 2018).
Figure 1:
We formalize a procedure for identifying words whose increasing prevalence is likely the result of LLM usage. Although our focus is Scientific English, the method can be applied across domains and languages.
Changes in dominant methodological and explanatory frameworks – such as the rise of mechanical philosophy, or the mathematization of scientific fields – have been accompanied by changes in word usage and syntactic structures as well (Degaetano-Ortlieb and Teich 2018; Krielke 2024). Such changes continue through the present (Banks 2017; Leong 2020).
Over the last two years, however, Scientific English has witnessed increasing usage of certain lexical items at a seemingly unprecedented pace. Discussions on social media (e.g., Koppenburg 2024; Nguyen 2024; Shapira 2024) and in academic discourse (Gray 2024; Kobak et al. 2024; Liang et al. 2024b; Liu and Bu 2024; Matsui 2024) have pointed out that words such as “delve,” “intricate,” and “nuanced” have appeared far more frequently in scientific abstracts from 2023 and 2024 compared to earlier years. Unlike many previous changes in Scientific English, these trends do not seem to be explained by changes in the content of science or in wider language use. Instead, it is widely assumed that the sharp increase is due to the use of large language models (LLMs) like ChatGPT for scientific writing. Evidence supporting this hunch has recently emerged (e.g., Cheng et al. 2024; Liang et al. 2024a).
The goals of the present research were twofold. First, we aimed to provide a systematic characterization of this linguistic phenomenon. Some existing work has relied on informal methods to identify words observed to occur more frequently in AI-generated writing (e.g., Matsui 2024). We developed a method for extracting lexical items of interest, described in Section , which is rigorous, reproducible, and transferable to other data and models. We identified 21 “focal words”: lexical items that have recently spiked in Scientific English and are overused by ChatGPT-3.5 in scientific writing tasks, as illustrated in Figure .
Prior research has focused on quantifying such focal words’ increasing prevalence and estimating how much recent scientific writing has been produced with LLM assistance (e.g., Kobak et al. 2024; Liang et al. 2024b). By contrast, our second goal was to explore the factors that might contribute to the phenomenon of lexical overrepresentation: Why does ChatGPT use “delve” (and other focal words) so frequently when generating scientific text? We identified a set of possible factors, characterized in Section , and began to assess them. We did not find evidence that model architecture or algorithmic decisions play a major role in the overrepresentation of focal words (Section ), nor that lexical overrepresentation stems from training or fine-tuning data (Section ).
LLM training often involves reinforcement learning based on information about quality outputs from human evaluators. We found mixed evidence that reinforcement learning from human feedback (RLHF) contributes to the overrepresentation of our focal words in LLM-generated text. Positive evidence comes from model testing on Meta’s Llama LLM (Section ). An exploratory experiment described in Section  is inconclusive, although our findings indicate that participants became wary of the word “delve” in the first sentence of an abstract (e.g., “This article delves into …”). Since the experiment’s inconclusiveness stems partly from methodological issues, we believe a follow-up study is warranted. Many important questions about the future of LLM-driven language change remain (Section ).
Corpus Analysis: Identification of Overrepresented Lexical Items
To probe recent changes in Scientific English, we used PubMed’s publicly available repository of scientific abstracts, which focuses on biomedical literature (National Library of Medicine 2023; downloaded through the PubMed API using a Python script, Python Software Foundation 2024; snapshot: May 4, 2024; all code on our GitHub). Our analysis includes more than 5.2 billion tokens (inflected forms) from 26.7 million abstracts. To track changes in word usage over time, we measured occurrences per million (opm) of a given token in each year. Figure 2 illustrates the usage trajectories of some baseline items over time. We focus on the period from 1975 to May 2024, as data prior to 1975 are less extensive.
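The opm measure can be sketched in a few lines of Python. This is a toy illustration only; the function and variable names below are ours, not the project’s actual code.

```python
from collections import Counter

def opm(token: str, tokens_in_year: list[str]) -> float:
    """Occurrences per million: how often `token` appears per
    million tokens among a given year's abstract tokens."""
    counts = Counter(tokens_in_year)
    return counts[token] / len(tokens_in_year) * 1_000_000

# Toy example: "delve" occurs twice in a 1,000-token sample.
sample = ["delve"] * 2 + ["study"] * 998
print(opm("delve", sample))  # 2000.0
```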
Figure 2:
Selected lexical entries: change over time.
The goal of our corpus analysis was to identify words whose recent overuse in scientific writing is likely the result of LLM deployment. Our approach involved three steps. First, we determined which words were more prevalent in abstracts from 2024 compared to 2020 (since LLMs were not widespread pre-2021). We calculated the percentage increase in opm for each token in the database between 2020 and 2024. Unsurprisingly, there was a straightforward explanation for why some words spiked in usage during that time. For example, “omicron” and “metaverse” were two of the words that showed the largest percentage increase (for “omicron”, see Figure 2). We only considered increases deemed significant by chi-square tests, of which there were about 7,300.
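The chi-square filter on 2020-vs-2024 counts can be sketched as a 2×2 contingency test. This is an illustrative implementation with a hand-rolled statistic and the df = 1 critical value; the paper’s actual scripts (on its GitHub) may differ.

```python
def chi_square_2x2(count_a: int, total_a: int, count_b: int, total_b: int) -> float:
    """Pearson chi-square statistic for a 2x2 contingency table
    comparing a token's occurrence count in two corpora/years."""
    table = [[count_a, total_a - count_a],
             [count_b, total_b - count_b]]
    grand = total_a + total_b
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / grand
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Invented counts: 30 vs. 120 occurrences in two 1M-token years.
stat = chi_square_2x2(30, 1_000_000, 120, 1_000_000)
print(stat > 3.841)  # exceeds the df = 1, alpha = 0.05 critical value
```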
We were interested in isolating words whose spike in usage was unexplained. The authors functioned as annotators and independently reviewed the list of words that had the highest percentage change to exclude irrelevant tokens (like year numbers) and words whose spiking had an explanation in terms of scientific advances or world events. In cases of disagreement, we included the word on our list. We stopped once we had 50 words whose usage spiked without any obvious explanation (see incl.ods on GitHub). This list contained several of the words that had been the focus of online conversation, including “delve” and “intricate”.
However, a spike without an obvious explanation is not necessarily LLM-induced. For example, the usage of “mash” increased tenfold, but it is not a word that ChatGPT is known to overuse. The second step of our method involved identifying words that are overrepresented in AI-generated scientific abstracts compared to human-generated abstracts. In producing AI-generated abstracts, our aim was to imitate the process by which researchers might have deployed an LLM from 2022 to early 2024, while paying attention to careful prompt formulation (Wei et al. 2022; Zhou et al. 2022). After some exploration, we settled on a two-stage process: (1) We randomly sampled 10,000 abstracts from papers published in 2020 from the PubMed database. Via the API, ChatGPT-3.5 then summarized the associated paper (Prompt: “The following is an abstract of an article. Summarize it in a couple of sentences.”). (2) The ChatGPT-generated summary was then used to ask ChatGPT-3.5 for a corresponding scientific abstract (Prompt: “Please write an abstract for a scientific paper, about 200 words in length, based on the following notes.”). We suspect that the most common way of using an LLM to generate an abstract, back when ChatGPT could not accept paper-length inputs, involved providing important fragments of a paper. We used ChatGPT-3.5 for the entirety of our project because, if scientific abstracts in our dataset contain AI-generated language, it is most likely from ChatGPT-3 or ChatGPT-3.5 (Sarkar 2023). In total, from 10,000 human abstracts, we generated 9,953 AI abstracts. (For a small number, ChatGPT would not provide a response, presumably due to topic sensitivity.) We then compared word usage in the AI-generated abstracts with word usage in the original abstracts. We only considered words for which a chi-square test indicated a significant difference in opm between the human- and AI-produced text. This gave us a list of items overused by ChatGPT.
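The two-stage process can be sketched as below. Here `call_llm` is a hypothetical stand-in for the actual API call (e.g., via the OpenAI client), and `two_stage_abstract` is our own name; the prompt templates are quoted from the description above.

```python
from typing import Callable

SUMMARIZE = ("The following is an abstract of an article. "
             "Summarize it in a couple of sentences.\n\n{abstract}")
REWRITE = ("Please write an abstract for a scientific paper, about "
           "200 words in length, based on the following notes.\n\n{notes}")

def two_stage_abstract(call_llm: Callable[[str], str], human_abstract: str) -> str:
    """Stage 1: summarize the human abstract. Stage 2: generate a new
    abstract from that summary, mimicking 2022-era LLM usage."""
    summary = call_llm(SUMMARIZE.format(abstract=human_abstract))
    return call_llm(REWRITE.format(notes=summary))

# Stub LLM for illustration; in practice call_llm would wrap the API.
stub = lambda prompt: f"<response to: {prompt[:25]}...>"
print(two_stage_abstract(stub, "We study X in population Y."))
```

Injecting the LLM as a callable keeps the pipeline testable without network access.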
Figure 3:
Our method for the systematic identification of focal words.
In the third step of our analysis, we returned to the list of 50 spiking words to ask: Is the word also on the ChatGPT-overuse list? If so, then it became a “focal word” (Figure 3). This gave us a list of 21 focal words (Figure 4 and Appendix ). Each focal word (a) shows a significant spike in opm between 2020 and 2024, (b) lacks an obvious explanation for its spike, and (c) is used significantly more often by ChatGPT than by humans when writing scientific abstracts (Figure ). Thus, a plausible explanation for the increasing prevalence of each focal word in Scientific English is the use of AI.
Figure 4:
Occurrences per million words in PubMed abstracts for our 21 focal words.
This systematic, three-step method for identifying focal words is novel. It improves on more informal ways of identifying AI-associated words, and it can be applied to other corpora and LLMs beyond ChatGPT-3.5. (Appendix  reports similar results for ChatGPT-4.0(-mini).) Future research can use the method to investigate whether the same words are overrepresented in the outputs of different models – or whether there are LLMs that do not exhibit lexical overrepresentation at all.
Searching for Overrepresentation in Possible Training Data
Our focal words are overrepresented in text generated by ChatGPT compared to earlier PubMed abstracts. Other research indicates that such words also appear less frequently in related datasets in the pre-LLM era (Liang et al. 2024b; Gray 2024). Although we do not know exactly what data LLMs have been trained on, these results cast doubt on the hypothesis that ChatGPT is using words like “delve” and “surpass” frequently because they occur frequently in its training data.
To further demonstrate that the focal words are probably not overrepresented in the training data, we analyzed several additional datasets, namely: Arxiv abstracts (accessed 4 Aug 2024; contains data from 1986 onwards, averaged over all years), the Leipzig Corpora Collection (Goldhahn et al. 2012; the English LCC contains mostly news texts and transcriptions, data from 2005 onwards; preprocessed snapshot from a previous project), and Wikipedia articles and discussions (Foundation 2024, accessed 4 Aug 2024). The results are presented in Appendix . The opm of the focal words in our ChatGPT-3.5-generated abstracts far exceeds their opm in the four datasets examined.
Second, we conducted a similar analysis for various varieties of English using the International Corpus of English (ICE; Kirk and Nelson 2018). Although ICE is relatively small compared to the other datasets (the subcorpora for most varieties contain about one million words), we do not find evidence that the focal words are especially prevalent in any particular variety of English (see Appendix ). This suggests that the overrepresentation of focal words in ChatGPT’s outputs is probably not due to an overrepresentation of a certain variety of English in its training data. It has been hypothesized that LLMs might frequently use words like “delve” because they are more common in varieties of English spoken by human evaluators who provide fine-tuning data, such as Nigerian English (Hern 2024). Our initial analysis of ICE does not support this hypothesis.
Model Choices: Architecture and Algorithms
Could choices about model architecture or algorithms be responsible for the puzzle of lexical overrepresentation? To probe this, we would ideally build an LLM ourselves and test the impact of each potential factor on the prevalence of focal words. This requires vast resources, however, and is beyond most researchers’ capabilities, including our own. A more feasible alternative would be to investigate a model that has several released variants – e.g., different versions of the same model using different optimization algorithms. Such a model must also be queryable with respect to information-theoretic measures like entropy (Shannon 1948). To our knowledge, no LLM offers such fine-grained releases.
The closest we could find is the comparison between Llama 2-Base (Llama-2-7b-hf) and Llama 2-Chat (Llama-2-7b-chat-hf; Touvron et al. 2023). We used the Llama 2 models because they are more similar to ChatGPT-3.5 than Llama 3 (Chiang et al. 2024; but Llama 3 produces similar results; Appendix ). The main difference between these two versions of Llama is that Llama 2-Chat includes fine-tuning and RLHF, whereas Llama 2-Base does not. Llama models can also be queried for per-word entropy (Jurafsky and Martin 2024):
H_{\text{p-w ent}} = -\frac{1}{L} \sum_{i=1}^{L} p(x_i) \log p(x_i)    (1)
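Assuming the sum in Equation (1) runs over the L tokens of a sequence and that p(x_i) is the model’s probability for the i-th token, the per-word entropy can be computed as below. Log base 2 is our assumption here; the base only rescales the values and does not affect comparisons between models.

```python
import math

def per_word_entropy(token_probs: list[float]) -> float:
    """Per-word entropy of a sequence (cf. Eq. 1): the negative mean
    of p(x_i) * log p(x_i) over the sequence's L tokens, where
    token_probs[i] is the model's probability for the i-th token."""
    L = len(token_probs)
    return -sum(p * math.log2(p) for p in token_probs) / L

# A model assigning probability 0.5 to each of two tokens:
print(per_word_entropy([0.5, 0.5]))  # 0.5
# Higher token probabilities -> lower entropy ("less surprised").
print(per_word_entropy([0.9, 0.9]) < per_word_entropy([0.5, 0.5]))  # True
```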
By comparing the two models’ per-word entropy for human- and AI-generated abstracts, we could assess which was more “surprised” by abstracts with an overrepresentation of focal words. Any difference between the models provides evidence about the source of lexical overrepresentation. We provided our sample of 10,000 human-written abstracts to both versions of Llama 2, followed by the abstracts rewritten by ChatGPT-3.5 (see Section ). The results are presented in Table 1.
Table 1:
Per-word entropy for human abstracts compared to ChatGPT-generated abstracts. Higher values of entropy mean that the model is more “surprised.”
We observe that Llama 2-Base is slightly less “surprised” by human-written text, while Llama 2-Chat is considerably less “surprised” by AI-generated abstracts, in which the focal words are overrepresented. This suggests the overuse of focal words might be driven by some factor that differs between the models. Given that model architecture and many algorithms are held constant across Llama 2-Base and Llama 2-Chat, our findings suggest that these factors are not the primary causes of lexical overrepresentation. Instead, they indicate that fine-tuning and RLHF – which differ between the models – might be important contributors.
These results are necessarily limited. We cannot claim definitively that the observed difference between the models is driven by the prevalence of focal words rather than some other feature of AI-generated text. Moreover, most of our paper is concerned with ChatGPT rather than Llama. The difficulty is that no version of ChatGPT (v3 or above) can be queried in the described fashion. We think Llama is a useful approximation.
RLHF: An Experimental Approach
Our model testing with Llama suggested that RLHF might contribute to lexical overrepresentation. This hypothesis has intuitive plausibility: when human evaluators assess alternative answers to a query, perhaps they are exhibiting a preference for answers containing certain words. Since LLMs are trained to align their answers with human preferences, they would learn to use those words more frequently (Christiano et al. 2017; Ziegler et al. 2019). To further investigate this potential explanation, we conducted an exploratory online study in which participants indicated whether they preferred scientific abstracts that contained our focal words.
Materials. We randomly sampled shorter PubMed abstracts (70-100 words) from the year 2020 and, with Python and the OpenAI API, used ChatGPT-3.5 to rewrite them with and without focal words. (Shorter abstracts were used to keep stimuli at a manageable length for participants.) For the focal-word abstracts, the prompt included four randomly selected words from our list of 21 focal words. An example prompt is: “Please write a 100-word abstract for the following scientific paper, using words such as ‘delves,’ ‘underscores,’ ‘surpasses,’ and ‘emphasizing’: [SUMMARY].” (The summary was generated via the procedure described in Section .) The script instructed ChatGPT to generate and revise an abstract until it contained at least three focal words. For the no-focal-word abstracts, we used a similar prompt: “Please write a 100-word abstract for the following scientific paper, making sure not to use words such as [list of blockwords]: [SUMMARY].” The blockwords included the 21 focal words plus another 21 words identified using the methodology described in Section . The script prompted ChatGPT to generate and revise an abstract until it contained none of the blockwords.
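The generate-and-revise loop might look like the following sketch. The names `count_hits` and `generate_until`, the whole-word matching rule, and the retry cap are our own hypothetical choices, not the study’s exact script; `call_llm` again stands in for the API call.

```python
import re
from typing import Callable, Optional

def count_hits(text: str, words: set[str]) -> int:
    """Number of distinct target words present in the text
    (case-insensitive, whole-word match on inflected forms)."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & words)

def generate_until(call_llm: Callable[[str], str], prompt: str,
                   focal: set[str], minimum: int,
                   max_tries: int = 5) -> Optional[str]:
    """Regenerate until the abstract contains at least `minimum`
    focal words; give up (return None) after max_tries attempts."""
    for _ in range(max_tries):
        abstract = call_llm(prompt)
        if count_hits(abstract, focal) >= minimum:
            return abstract
    return None
```

An analogous loop with `count_hits(abstract, blockwords) == 0` as the acceptance condition would cover the no-focal-word abstracts.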
We created 200 items, each consisting of one abstract with focal words and one without (for the same paper). We manually filtered out a handful of ungrammatical or nonsensical abstracts. Considerably more than half of the abstracts with focal words included “delve” in the first sentence; we call items containing these abstracts “delve-initial” items. To compile a bank of 30 critical items, we selected the 15 delve-initial items and the 15 other items with the smallest difference in length between the abstracts with and without focal words. (We capped delve-initial items at 50% to prevent participants from detecting the study’s purpose.) We also constructed 30 pairs of distractor items in the same manner as the critical items, except both abstracts were generated using the no-focal-word prompt. A full list of experimental items can be found on GitHub, and two examples are in Appendix .
Participants. We used Prolific (prolific.com) to recruit participants. Public information about the human evaluators employed to provide feedback in RLHF is limited (Ouyang et al. 2022; Perrigo 2023), so we recruited 201 participants from India (140 male, 61 female). Average age was 31.3 years (stdev: 10.6). We also collected data on self-assessed English proficiency and first languages (see our GitHub). Participants were compensated at an average rate of $15 per hour.
Task and Exclusions. The study began with IRB information, followed by task instructions, and then the items. An image of the interface can be found in Appendix . Participants evaluated 20 items in total, indicating which of the two presented abstracts they preferred. The first item was a calibration item, followed by (in random order) 5 critical items, 10 distractor items, 2 items checking language abilities, and 2 attention checks (“This is not a real item, please click on the left button” inserted in the middle of the text). Thus, the proportion of critical items was 25%. Each time an item was displayed, it was randomly determined which abstract appeared on the left vs. the right. If a participant failed one of the attention checks, their data were disregarded. Participants were warned if they were proceeding unrealistically fast (faster than 0.25 × (225 ms + 25 ms × character length of an item)), and items with excessively fast rating times were excluded from our analysis (following the methodology of Häussler and Juzek 2017). We also excluded data from participants who completed fewer than 10 of the 20 items. After exclusions, we analyzed a total of 1,822 ratings: 1,215 for distractor items and 607 for critical items, with each critical item receiving an average of 20.2 ratings (stdev: 3.4). Given the study compensation, the high exclusion rate came as a surprise.
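The speed threshold follows directly from the formula above (function names are ours; an illustrative sketch):

```python
def min_plausible_ms(item_chars: int) -> float:
    """Fastest plausible rating time for an item: a quarter of a
    225 ms baseline plus 25 ms per character of the item."""
    return 0.25 * (225 + 25 * item_chars)

def keep_rating(rating_ms: float, item_chars: int) -> bool:
    """Exclude ratings made faster than the plausibility floor."""
    return rating_ms >= min_plausible_ms(item_chars)

print(min_plausible_ms(400))   # 2556.25 ms floor for a 400-character item
print(keep_rating(1500, 400))  # False: implausibly fast, excluded
```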
Analysis. Our original plan was to test all 30 critical items together in a chi-square analysis against the distractor items (an approximation of random choices), to assess whether participants preferred abstracts containing focal words. These results are reported below. However, during the generation of the abstracts, we noticed the aforementioned excess of “delve” in the first sentence and split the critical items into delve-initial items and other items. A lower N per condition and a higher-than-expected exclusion rate left us considerably below the sample size originally estimated in a pre-study power analysis. Thus, we added an exploratory mixed-effects logistic regression model, with rating as the dependent variable and condition as the independent variable, including items as a random effect (rating ~ condition + (1 | item_id)). Distractor items served as the intercept condition. For delve-initial items and other items, a preference for the focal-word abstract was encoded as 0, and a preference for the no-focal-word abstract as 1. For the distractor items, there are two no-focal-word abstracts, randomly encoded as 0 or 1.
Results. Contrary to our expectations, when all critical items are analyzed together, there is a slight preference for the no-focal-word abstracts. However, this overall difference between all critical items and distractor items is not significant in a chi-square test (p = 0.174). The follow-up analysis suggests that this outcome might be driven by the delve-initial items, as Figure 5 illustrates. In the logistic regression model, we observe that the coefficient for the distractor items, represented by the intercept condition, is 0.500 (rounded to the third digit). This indicates that participants did not exhibit a significant preference between the distractor item abstracts, validating our methodology (Appendix ). The analysis also shows that delve-initial items differed significantly from the distractors (p = 0.023), with a coefficient of 0.082, indicating that for the delve-initial items, participants preferred the abstracts without focal words. Participants exhibited a slight but non-significant preference for abstracts with focal words for the other critical items (coefficient = -0.017; p = 0.651). The group variance was small (0.003), indicating that most of the variability in the ratings was due to the fixed effects. The model converged successfully (log-likelihood = -1324.9522, mean group size = 30.4). A Wald test to determine whether delve-initial items and the other items differed from each other was statistically significant (p = 0.03, Wald test statistic: 4.77).
In looking at the responses for each individual item, we consider a preference for the focal-word or no-focal-word abstract of a given pair to be robust if a random outcome falls outside the margin of error, and marginal otherwise (illustrated for the distractors in Appendix ). This analysis shows a slight difference between delve-initial items and the other critical items: participants exhibit a preference for the no-focal-word abstract in more of the delve-initial items, and a preference for the focal-word abstract in more of the other items.
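One plausible formalization of this robust/marginal criterion uses the normal-approximation margin of error around the observed per-item proportion. This is a sketch under our own assumptions: z = 1.96 presumes a 95% interval, and the counts in the example are invented.

```python
import math

def is_robust(prefs_for_focal: int, n: int, z: float = 1.96) -> bool:
    """A per-item preference counts as robust if chance (0.5) lies
    outside the margin of error around the observed proportion of
    focal-word preferences; otherwise it is marginal."""
    p_hat = prefs_for_focal / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return abs(p_hat - 0.5) > margin

print(is_robust(16, 20))  # 0.80 with margin ~0.18: robust
print(is_robust(11, 20))  # 0.55 with margin ~0.22: marginal
```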
Figure 5:
Experimental results: Preferences between focal-word and no-focal-word abstracts in delve-initial and other items.
What explains the difference between the delve-initial and the other critical items? We suspect that some participants became, or already were, sensitive to the occurrence of “delve.” Participants were probably disproportionately young people with an affinity for technology, and so more likely to be familiar with the discourse surrounding AI language use. Wariness about the word “delve” might explain why participants preferred the abstracts without focal words in the delve-initial items (this coincides with a general downturn in sentiment towards LLMs; cf. Leiter et al. 2024), though we would like to see these results confirmed with a larger sample.
Having split the critical items in two, we need a higher N to draw any conclusions about RLHF as a source of lexical overrepresentation, particularly given that we would expect a preference for focal-word abstracts to be subtle. The study warrants a follow-up. We believe that forcing ChatGPT to use certain words when generating abstracts was suboptimal. For example, if an abstract does not initially convey anything about exceeding or outperforming, then a rewritten abstract that includes the focal word “surpasses” will naturally be worse than the no-focal-word baseline. We suspect that generating critical items in a different way would yield clearer results.
Discussion and Concluding Remarks
It has been observed that LLMs overuse certain lexical items, a fact even acknowledged by OpenAI (OpenAI 2024). Our work formalized this finding and identified 21 focal words whose usage has spiked in scientific abstracts and that are overused by ChatGPT-3.5. These results provide additional evidence that recent changes to Scientific English are partly driven by LLMs. Our work also explored possible explanations of the puzzle of lexical overrepresentation. We failed to find evidence that training data, model architecture, or algorithm choices play a role. However, model testing with Llama was consistent with the hypothesis that RLHF contributes to the overuse of particular words by ChatGPT. Our experimental results suggest that human evaluators may treat “delve” differently from other focal words.
Future research should further probe the impact of each factor canvassed in Section  on lexical overrepresentation. (This includes model choices and training data; despite our negative results, we suspect that these factors do influence the lexical choices of LLMs.) We would especially like to see further work on the role of RLHF. Unfortunately, there are several obstacles to such research, particularly the lack of procedural and data transparency surrounding LLM development (Longpre et al. 2024). Moreover, it seems that companies building LLMs often solicit feedback from workers who are underpaid, stressed, and under time pressure (Toxtli et al. 2021; Roberts 2022; Novick 2023). It is difficult to simulate these conditions ethically in a research environment. Many online recruitment platforms, including Prolific, rightly require decent compensation.
Although it complicates further study, we think this economic reality lends plausibility to RLHF as a source of lexical overrepresentation. Rushed human evaluators might base their evaluations on the presence of particular words rather than on content, as the former is easier and quicker to evaluate than the latter. If certain words are treated as a proxy for quality, that could explain their overrepresentation in LLM outputs. (We suspect, however, that Scientific English in particular played a minor role in the training of LLMs. It seems more likely that human evaluators rated academic writing in general, with their preferences shaping LLMs’ scientific writing through overspill.) This mechanism coheres with our impression that a major social consequence of LLMs is the decoupling of form and content. Many of us take fluency or style as a signal of quality content (McNamara et al. 2010; in an L2 context, Kim and Crossley 2018). Because LLMs are masterful at generating fluid text in just about any style, this heuristic is radically undermined by the increasing ubiquity of LLM-generated text. The irony is that, if our hypothesis about RLHF proves correct, this heuristic has shaped model training as well. LLMs may be undercutting the very same heuristic that has shaped their own lexical preferences.
It would be interesting to apply our method for identifying focal words to alternative datasets. Although we drew abstracts exclusively from PubMed, future work could examine whether the same focal words have been spiking in scientific disciplines besides biomedicine, in domains beyond Scientific English, and in non-English-language corpora. The method could also be used to probe lexical overrepresentation in LLMs other than ChatGPT. Our impression is that ChatGPT and Llama overuse many of the same words, but a systematic investigation is needed. Finally, additional work on the quirks of LLM-generated language could look beyond the word level (Ortmann et al. 2021). A virtue of our formalized approach to identifying focal words is that it can be extended in these and any number of other ways to better understand how LLMs are driving linguistic change.
More generally, our research shows that despite the opacity of LLMs, there are ways of probing their behavior and internal workings. Understanding LLMs’ linguistic behavior is complicated by their complexity and by secrecy and other industry practices, as mentioned above. Nevertheless, our work indicates that the puzzle of lexical overrepresentation is tractable. Indirect investigative methods can help us explain LLMs’ linguistic behavior.
Such research is important because we need to better understand how LLMs are changing language. Almost all of our 21 focal words were already increasing in usage in the years leading up to the release of ChatGPT, suggesting that LLMs may accelerate language change (Matsui 2024; also see Geng et al. 2024 and Yakura et al. 2024). With the increasing prevalence of AI-generated text in many areas of life, LLMs are arguably influencing the language usage even of people who do not themselves interact with these models. Our findings also show that lexical overrepresentation remains a feature of current iterations of ChatGPT (Appendix ), indicating that the phenomenon is here to stay.
Still, it is difficult to predict just how AI will shape language in the future. Discussions on social media and in academic discourse, plus our exploratory findings for items with “delve,” indicate that there is some public awareness of LLMs’ overuse of particular words. This awareness could influence future rounds of RLHF, leading to a realignment of AI and human preferences. At the same time, the language of today – lexical overrepresentations and all – will become the training data for the models of tomorrow, raising concerns about model degradation over time (Alemohammad et al. 2023; Briesch et al. 2023; Hataya et al. 2023; Shumailov et al. 2023).
One thing is certain: through LLMs, tech companies are having a global impact on language usage. We believe this strengthens the case for broader societal debate about the power and responsibilities of these companies. Moreover, our speculations about how the feedback of rushed and underpaid workers might contribute to lexical overrepresentation compound ethical worries about the poor working conditions of tech companies’ employees in the Global South (Kwet 2019; Gray 2024; Rohde et al. 2024). There are thus both moral and non-moral reasons to apply greater scrutiny to how human feedback is collected and used in the training of LLMs.