1. Definitions
The commonly accepted definition of Internet research ethics (IRE) has
been used by Buchanan and Ess (2008, 2009), Buchanan (2011), and Ess
& Association of Internet Researchers (AoIR) (2002):
IRE is defined as the analysis of ethical issues and
application of research ethics principles as they pertain to research
conducted on and in the Internet. Internet-based research, broadly
defined, is research which utilizes the Internet to collect
information through an online tool, such as an online survey; studies
about how people use the Internet, e.g., through collecting data
and/or examining activities in or on any online environments; and/or,
uses of online datasets, databases, or repositories.
These examples were broadened in 2013 by the United States
Secretary’s Advisory Committee to the Office for Human Research
Protections (SACHRP 2013), and included under the umbrella term
Internet Research:
Research studying information that is already available on or via
the Internet without direct interaction with human subjects
(harvesting, mining, profiling, scraping, observation or recording of
otherwise-existing data sets, chat room interactions, blogs, social
media postings, etc.)
Research that uses the Internet as a vehicle for recruiting or
interacting, directly or indirectly, with subjects (Self-testing
websites, survey tools, Amazon Mechanical Turk, etc.)
Research about the Internet itself and its effects (use patterns
or effects of social media, search engines, email, etc.; evolution of
privacy issues; information contagion; etc.)
Research about Internet users: what they do, and how the Internet
affects individuals and their behaviors
Research that utilizes the Internet as an interventional tool, for
example, interventions that influence subjects’ behavior
Others (emerging and cross-platform types of research and methods,
including m-research (mobile))
Recruitment in or through Internet locales or tools, for example
social media, push technologies
A critical distinction in the definition of Internet research ethics
is that between the Internet as a research tool versus a research
venue. The distinction between tool and venue plays out across
disciplinary and methodological orientations. As a tool, Internet
research is enabled by search engines, data aggregators, digital
archives, application programming interfaces (APIs), online survey
platforms, and crowdsourcing platforms. Internet-based research venues
include such spaces as conversation applications (instant messaging
and discussion forums, for example), online multiplayer games, blogs
and interactive websites, and social networking platforms.
Another way of conceptualizing the distinction between tool and venue
comes from Kitchin (2008), who has referred to a distinction in
Internet research using the concepts of “engaged web-based
research” versus “non-intrusive web-based
research:”
Non-intrusive analyses refer to techniques of data collection that do
not interrupt the naturally occurring state of the site or
cybercommunity, or interfere with premanufactured text. Conversely,
engaged analyses reach into the site or community and thus engage the
participants of the web source (2008: 15).
These two constructs provide researchers with a way of recognizing
when consideration of human subject protections might need to occur.
McKee and Porter (2009), as well as Banks and Eble (2007), provide
guidance on the continuum of human-subjects research, noting a
distinction between person-based and text-based research. For example,
McKee and Porter provide a range of research variables (public/private,
topic sensitivity, degree of interaction, and subject vulnerability)
which are useful in determining where on the continuum between
text-based and person-based a given study falls, and whether or not
subjects would need to consent to the research (2009: 87–88).
While conceptually useful for determining human subjects
participation, the distinction between tool and venue or engaged
versus non-intrusive web-based research is increasingly blurring in
the face of social media and their third-party applications. Buchanan
(2016) has conceptualized three phases of Internet research: the first
began in the late 1990s, when the Internet was used as a tool for
research, and the second, circa 2006–2014, is characterized by the
emergence of social media. The concept of social media entails
A group of Internet-based applications that build on the ideological
and technological foundations of Web 2.0, and that allow the creation
and exchange of user-generated content (Kaplan & Haenlein 2010:
61).
A “social network site” is a category of websites with
profiles, semi-persistent public commentary on the profile, and a
traversable publicly articulated social network displayed in relation
to the profile.
This collapse of tool and venue can be traced primarily to the
increasing use of third-party sites and applications such as Facebook,
X/Twitter, or any of the myriad online platforms where subject or
participant recruitment, data collection, data analysis, and data
dissemination can all occur in the same space. With these collapsing
boundaries, the terms of “inter-jurisdictional
coordination” (Gilbert 2009: 3) are inherently challenging;
Gilbert has specifically argued against the terms of use or end-user
license agreement stipulations in virtual worlds, noting that such
agreements are often “flawed”, as they rely on laws and
regulations from a specific locale and attempt to enforce them in a
non place-based environment. Nonetheless, researchers now make
frequent use of data aggregation tools, scraping data from user
profiles or transaction logs, harvesting data from social media
streams, or storing data on cloud servers only after agreeing to the
terms of service that go along with those sites. The use of such
third-party applications or tools changes fundamental aspects of
research, oftentimes displacing the researcher or research team as the
sole owner of their data. These unique characteristics implicate
concepts and practicalities of privacy, consent, ownership, and
jurisdictional boundaries.
The definition of Internet research, thus, has expanded significantly
beyond studying online communities or content. It now includes the
collection and analysis of data generated through internet-connected
devices, such as smartphones, wearables, and IoT technologies, which
produce continuous streams of behavioral, biometric, and location
data. Researchers increasingly access social media and platform data
via APIs, enabling large-scale, automated studies of online
interactions and trends. Internet research also encompasses the
analysis of digital infrastructures—like algorithms, recommender
systems, and moderation tools—as objects of study. And internet
research might also include the widespread collection and analysis of
large-scale datasets used to train machine learning and artificial
intelligence models. These shifts reflect a broader turn toward big
data and computational methods, raising new ethical questions about
consent, privacy, and data governance in increasingly pervasive and
often invisible digital environments.
2. Human Subjects Research
The practical, professional, and theoretical implications of human
subjects protections have been covered extensively in scholarly
literature, ranging from medical/biomedical to social sciences to
computing and technical disciplines (see Beauchamp & Childress
2008; Emanuel et al. 2003; PRIM&R et al. 2021; Sieber 1992; Wright
2006). Relevant protections and regulations continue to receive much
attention in the face of research ethics violations (see, for example,
Skloot 2010, on Henrietta Lacks; the U.S. Government’s admission
and apology to the Guatemalan Government for STD testing in the 1940s
(BBC 2011); and Gaw & Burns 2011, on how lessons from the past
might inform current research ethics and conduct).
The history of human subjects protections (Resnik & Hofweber 2025
[Other Internet Resources]) grew out of atrocities such as Nazi human
experimentation during World War II, which resulted in the Nuremberg
Code (1947), subsequently followed by the Declaration of Helsinki on
Ethical
Principles for Medical Research Involving Human Subjects (World
Medical Association 1964/2008). Partially in response to the Tuskegee
syphilis experiment, an infamous clinical study conducted between 1932
and 1972 by the U.S. Public Health Service studying the natural
progression of untreated syphilis in rural African-American men in
Alabama under the guise of receiving free health care from the
government, the U.S. Department of Health and Human Services put forth
a set of basic regulations governing the protection of human subjects
(45 C.F.R. § 46) (see the links in the Other Internet Resources
section, under Laws and Government Documents). This was later followed
by the publication of the “Ethical Principles and Guidelines for
the Protection of Human Subjects of Research” by the National
Commission for the Protection of Human Subjects of Biomedical and
Behavioral Research, known as the Belmont Report (NCPHSBBR 1979). The
Belmont Report identifies three fundamental ethical principles for all
human subjects research: Respect for Persons, Beneficence, and
Justice.
To ensure consistency in human subjects protections across federal
agencies in the United States, the Federal Policy for the Protection of
Human Subjects, also known as the “Common Rule”, was codified in 1991;
the Revised Common Rule was released in the
Federal Register on 19 January 2017, and went into effect 19 July
2018. Similar regulatory frameworks for the protection of human
subjects exist across the world, and include, for example, the
Canadian Tri-Council, the Australian Research Council, the European
Commission, the Research Council of Norway and its National Committee
for Research Ethics in the Social Sciences and Humanities (NESH 2006;
NESH 2019), the U.K.’s NHS National Research Ethics Service
and the Research Ethics Framework (REF) of the ESRC (Economic and
Social Research Council) General Guidelines, and the Forum for Ethical
Review Committees in Asia and the Western Pacific (FERCAP).
In the United States, the various regulatory agencies that have signed
on to the Common Rule (45 C.F.R. 46 Subpart A) have not issued formal
guidance on Internet research (see the links in the Other Internet
Resources section, under Laws and Government Documents). The Preamble
to the Revised Rule referenced significant changes in the research
environment, recognizing a need to broaden the scope of the Rule.
However, changes to the actual Rule with regard to Internet research,
in its broadest context, were minimal.
For example, the Preamble states:
This final rule recognizes that in the past two decades a paradigm
shift has occurred in how research is conducted. Evolving
technologies—including imaging, mobile technologies, and the
growth in computing power—have changed the scale and nature of
information collected in many disciplines. Computer scientists,
engineers, and social scientists are developing techniques to
integrate different types of data so they can be combined, mined,
analyzed, and shared. The advent of sophisticated computer software
programs, the Internet, and mobile technology has created new areas of
research activity, particularly within the social and behavioral
sciences (Federal Register 2017 and HHS 2017).
Modest changes to the definition of human subjects included changing
“data” to “information” and
“biospecimens;” the definition now reads:
(e)(1) Human subject means a living individual about whom an
investigator (whether professional or student) conducting research:
(i) Obtains information or biospecimens through intervention or
interaction with the individual, and uses, studies, or analyzes the
information or biospecimens; or
(ii) Obtains, uses, studies, analyzes, or generates identifiable
private information or identifiable biospecimens.
(2) Intervention includes both physical procedures by which
information or biospecimens are gathered (e.g., venipuncture) and
manipulations of the subject or the subject’s environment that are
performed for research purposes.
(3) Interaction includes communication or interpersonal contact
between investigator and subject.
(4) Private information includes information about behavior that
occurs in a context in which an individual can reasonably expect that
no observation or recording is taking place, and information that has
been provided for specific purposes by an individual and that the
individual can reasonably expect will not be made public (e.g., a
medical record).
(5) Identifiable private information is private information for which
the identity of the subject is or may readily be ascertained by the
investigator or associated with the information.
(6) An identifiable biospecimen is a biospecimen for which the
identity of the subject is or may readily be ascertained by the
investigator or associated with the biospecimen (45 C.F.R. §
46.102 (2018)).
However, the Revised Rule does have a provision that stands to be of
import in regards to Internet research; the Rule calls for
implementing departments or agencies to,
[(e)(7)] (i) Upon consultation with appropriate experts (including
experts in
data matching and re-identification), reexamine the meaning of
“identifiable private information”, as defined in
paragraph (e)(5) of this section, and “identifiable
biospecimen”, as defined in paragraph (e)(6) of this section.
This reexamination shall take place within 1 year and regularly
thereafter (at least every 4 years). This process will be conducted by
collaboration among the Federal departments and agencies implementing
this policy. If appropriate and permitted by law, such Federal
departments and agencies may alter the interpretation of these terms,
including through the use of guidance.
(ii) Upon consultation with appropriate experts, assess whether there
are analytic technologies or techniques that should be considered by
investigators to generate “identifiable private
information”, as defined in paragraph (e)(5) of this section, or
an “identifiable biospecimen”, as defined in paragraph
(e)(6) of this section. This assessment shall take place within 1 year
and regularly thereafter (at least every 4 years). This process will
be conducted by collaboration among the Federal departments and
agencies implementing this policy. Any such technologies or techniques
will be included on a list of technologies or techniques that produce
identifiable private information or identifiable biospecimens. This
list will be published in the Federal Register after notice and an
opportunity for public comment. The Secretary, HHS, shall maintain the
list on a publicly accessible Web site (45 C.F.R. § 46.102
(2018)).
As of this writing, there has not yet been a reexamination of the
concepts of “identifiable private information” or
“identifiable biospecimens”. However, as data analytics,
AI, and machine learning continue to expose ethical issues in human
subjects research, we expect to see engaged discussion at the federal
level and amongst research communities (PRIM&R 2021). Those
discussions may refer to previous conceptual work by Carpenter and
Dittrich (2012) and Aycock et al. (2012) that is concerned with risk
and identifiability. Secondary uses of identifiable, private data, for
example, may pose downstream harms, or unintentional risks, causing
reputational or informational harms. Reexaminations of
“identifiable private information” cannot occur without
serious consideration of risk and “human harming
research”. Carpenter and Dittrich (2012) argue that
“[r]eview boards should transition from an informed consent driven
review to a risk analysis review that addresses potential harms
stemming from research in which a researcher does not directly
interact with the at-risk individuals” (p. 4) as “[T]his
distance between researcher and affected individual indicates that a
paradigm shift is necessary in the research arena. We must transition
our idea of research protection from ‘human subjects
research’ to ‘human harming research’” (p.
14).
Similarly, Aycock et al. (2012) assert that
Researchers and boards must balance presenting risks related to the
specific research with risks related to the technologies in use. With
computer security research, major issues around risk arise, for
society at large especially. The risk may not seem evident to an
individual but in the scope of security research, larger populations
may be vulnerable. There is a significant difficulty in quantifying
risks and benefits, in the traditional sense of research
ethics….An aggregation of surfing behaviors collected by a bot
presents greater distance between researcher and respondent than an
interview done in a virtual world between avatars. This distance leads
us to suggest that computer security research focus less concern
around human subjects research in the traditional sense and more
concern with human harming research (p. 3, italics original).
These two conceptual notions are relevant for considering emergent
forms of identities or personally identifiable information (PII) such
as avatars, virtual beings, bots, and textual and graphical information.
Under the Code of Federal Regulations (45 C.F.R. § 46.102(f) 2009), new
forms of representation are considered human subjects if PII about
living individuals is obtained. PII can be obtained by
researchers through scraping data sources, profiles or avatars, or
other pieces of data made available by the platform. Fairfield agrees:
“An avatar, for example, does not merely represent a collection
of pixels—it represents the identity of the user” (2012:
701).
The multiple academic disciplines already long engaged in human
subjects research (medicine, sociology, anthropology, psychology,
communication) have established ethical guidelines intended to assist
researchers and those charged with ensuring that research on human
subjects follows both legal requirements and ethical practices. But
with research involving the Internet—where individuals
increasingly share personal information on platforms with porous and
shifting boundaries, where both the spread and aggregation of data
from disparate sources has become the norm, and where web-based
services, and their privacy policies and terms of service statements,
morph and evolve rapidly—the ethical frameworks and assumptions
traditionally used by researchers and research ethics boards (REBs)
are frequently challenged.
Research ethics boards themselves are increasingly challenged with the
unique ethical dimensions of internet-based research protocols. In a
2008 survey of U.S. IRBs, fewer than half of the ethical review boards
identified internet-based research as “an area of concern or
importance” at that time, and only 6% had guidelines or
checklists in place for reviewing internet-based research protocols
(Buchanan & Ess 2009). By 2015, 93% of IRBs surveyed acknowledged
that there are ethical issues unique to research using “online
data”, yet only 55% said they felt their IRBs are well versed in
the technical aspects of online data collection, and only 57% agreed
that their IRB has the expertise to stay abreast of changes in online
technology. IRBs are now further challenged with the growth of big
data research (see §4.5 below), which increasingly relies on large
datasets of personal information
generated via social media, digital devices, or other means often
hidden from users. A 2019 study of IRBs at 77 U.S. institutions
revealed only 25% felt prepared to evaluate protocols relying on big
data, and only 6% had tools sufficient for considering this emerging
area of internet research (Zimmer & Chapman 2020). Further, when
respondents were presented with various hypothetical research scenarios
utilizing big data and asked how their IRB would likely review such a
protocol, their viewpoints differed strongly in many cases. Consider the
following scenario:
Researchers plan to scrape public comments from online newspaper pages
to predict election outcomes. They will aggregate their analysis to
determine public sentiment. The researchers don’t plan to inform
commenters, and they plan to collect potentially-identifiable user
names. Scraping comments violates the newspaper’s terms of
service.
18% of respondents indicated their IRB would view this as exempt, 21%
indicated expedited review, 33% suggested it would need full board
review, while 28% did not think this was even human subjects research
that would fall under their IRB’s purview (Zimmer & Chapman
2020). This points to potential gaps and inconsistencies in how IRBs
review the ethical implications of big data research protocols.
As research protocols increasingly include data collected through
individuals’ digital interactions—such as social media
activity, wearable device data, and online behaviors—without
direct engagement with the individuals themselves, a more nuanced
understanding of what constitutes human subjects research is emerging
that recognizes how individuals can be impacted by studies even
without direct participation or awareness (Shilton et al. 2021;
Fiesler et al. 2024).
3. History and Development of IRE as a Discipline
An extensive body of literature has developed since the 1990s around
the use of the Internet for research (S. Jones 1999; Hunsinger,
Klastrup, & Allen (eds.) 2010; Consalvo & Ess (eds.) 2011;
Zimmer & Kinder-Kurlanda (eds.) 2017), with a growing emphasis on
the ethical dimensions of Internet research.
A flurry of Internet research, and explicit concern for the ethical
issues concurrently at play in it, began in the mid-1990s. In 1996,
Storm King recognized the growing use of the Internet as a venue for
research. His work explored the American Psychological
Association’s guidelines for human subjects research with
emergent forms of email, chat, listservs, and virtual communities.
With careful attention to risk and benefit to Internet subjects, King
offered a cautionary note:
When a field of study is new, the fine points of ethical
considerations involved are undefined. As the field matures and
results are compiled, researchers often review earlier studies and
become concerned because of the apparent disregard for the human
subjects involved (1996: 119).
The 1996 issue of Information Society dedicated to Internet research
is considered a watershed moment, and included much seminal
research still of impact and relevance today (Allen 1996; Boehlefeld
1996; Reid 1996).
Sherry Turkle’s 1997 Life on the Screen: Identity in the Age of the
Internet called direct attention to the human element of online game
environments. Moving squarely towards person-based versus text-based
research, Turkle pushed researchers to consider human subjects
implications of Internet research. Similarly, Markham’s Life Online:
Researching Real Experience in Virtual Space (1998) highlighted the
methodological complexities of online ethnographic studies, as did
Jacobson’s 1999 methodological
treatment of Internet research. The “field” of study
changed the dynamics of researcher-researched roles, identity, and
representation of participants from virtual spaces. Markham’s
work in qualitative online research has been influential across
disciplines, as research in nursing, psychology, and medicine has
recognized the potential of this paradigm for online research (Flicker et
al. 2004; Eysenbach & Till 2001; Seaboldt & Kupier 1997; Sharf
1997).
Then, in 1999, the American Association for the Advancement of Science
(AAAS), with a contract from the U.S. Office for Protection from
Research Risks (now known as the Office for Human Research
Protections), convened a workshop, with the goal of assessing the
alignment of traditional research ethics concepts to Internet
research. The workshop acknowledged
The vast amount of social and behavioral information potentially
available on the Internet has made it a prime target for researchers
wishing to study the dynamics of human interactions and their
consequences in this virtual medium. Researchers can potentially
collect data from widely dispersed populations at relatively low cost
and in less time than similar efforts in the physical world. As a
result, there has been an increase in the number of Internet studies,
ranging from surveys to naturalistic observation (Frankel & Siang
1999: 1).
In the medical/biomedical contexts, Internet research has grown
rapidly. Also in 1999, Gunther Eysenbach wrote the first editorial to
the newly formed Journal of Medical Internet Research. There were
three driving forces behind the inception of this journal, and
Eysenbach called attention to the growing social and interpersonal
aspects of the Internet:
First, Internet protocols are used for clinical information and
communication. In the future, Internet technology will be the platform
for many telemedical applications. Second, the Internet revolutionizes
the gathering, access and dissemination of non-clinical information in
medicine: Bibliographic and factual databases are now world-wide
accessible via graphical user interfaces, epidemiological and public
health information can be gathered using the Internet, and
increasingly the Internet is used for interactive medical education
applications. Third, the Internet plays an important role for consumer
health education, health promotion and teleprevention. (As an aside,
it should be emphasized that “health education” on the
Internet goes beyond the traditional model of health education, where
a medical professional teaches the patient: On the Internet, much
“health education” is done
“consumer-to-consumer” by means of patient self support
groups organizing in cyberspace. These patient-to-patient interchanges
are becoming an important part of healthcare and are redefining the
traditional model of preventive medicine and health promotion).
With scholarly attention growing and with the 1999 AAAS report
(Frankel & Siang 1999) calling for action, other professional
associations took notice and began drafting statements or guidelines,
or addenda to their extant professional standards. For example, the
Board of Scientific Affairs (BSA) of the American Psychological
Association, which established an Advisory Group on Conducting Research
on the Internet in 2001; the American Counseling Association, in the
2005 revision to its Code of Ethics; the Association of Internet
Researchers (AoIR), through its Ethics Working Group Guidelines; and
the National Committee for Research Ethics in the Social Sciences and
the Humanities (NESH Norway), among others, have directed researchers
and review boards to the ethics of Internet research, with attention to
the most common areas of ethical concern (see Other Internet Resources
for links).
While many researchers focus on traditional research ethics
principles, conceptualizations of Internet research ethics depend on
disciplinary perspectives. Some disciplines, notably from the arts and
humanities, posit that Internet research is more about context and
representation than about “human subjects”, suggesting that
there is no intent to engage in research about actual persons, and thus
minimal or no harm. The debate has continued since the early 2000s.
White (2002) argued against extant regulations that favored or
privileged specific ideological, disciplinary and cultural
prerogatives, which limit the freedoms and creativity of arts and
humanities research. For example, she notes that the AAAS report
“confuses physical individuals with constructed materials and
human subjects with composite cultural works”, again calling
attention to the person versus text divide that has permeated Internet
research ethics debates. Another example of disciplinary differences
comes from the Oral History Association, which acknowledged the
growing use of the Internet as a site for research:
Simply put, oral history collects memories and personal commentaries
of historical significance through recorded interviews. An oral
history interview generally consists of a well-prepared interviewer
questioning an interviewee and recording their exchange in audio or
video format. Recordings of the interview are transcribed, summarized,
or indexed and then placed in a library or archives. These interviews
may be used for research or excerpted in a publication, radio or video
documentary, museum exhibition, dramatization or other form of public
presentation. Recordings, transcripts, catalogs, photographs and
related documentary materials can also be posted on the Internet
(Ritchie 2003: 19).
While the American Historical Association (A. Jones 2008) has argued
that such research be “explicitly exempted” from ethical
review board oversight, the use of the Internet could complicate such
a stance if such data became available in public settings or available
“downstream” with potential, unforeseeable risks to
reputation or economic standing, or of psychological harm, should
identification occur.
Under the concept of text rather than human subjects, Internet
research rests on arguments of publication and copyright; consider the
venue of a blog, which does not meet the definition of human subject
as in 45 C.F.R. § 46.102(f) (2009), as interpreted by most ethical
review boards. A researcher need not obtain consent to use text from
an open blog, as it is generally considered publicly available,
textual, published material. This “public park” analogy, which has
been generally accepted by researchers, is appropriate for some
Internet venues and tools, but not all: context, intent, sensitivity of
data, and expectations of Internet participants were identified in 2004
by Sveningsson as crucial markers in Internet research ethics
considerations.
By the mid-2000s, with three major anthologies published, and a
growing literature base, there was ample scholarly literature
documenting IRE across disciplines and methodologies, and
subsequently, there was anecdotal data emerging from the review boards
evaluating such research. In search of empirical data regarding the
actual review board processes of Internet research from a human
subjects perspective, Buchanan and Ess surveyed over 700 United States
ethics review boards, and found that boards were primarily concerned
with privacy, data security and confidentiality, and ensuring
appropriate informed consent and recruitment procedures (Buchanan
& Ess 2009; Buchanan & Hvizdak 2009).
In 2008, the Canadian Tri-Council’s Social Sciences and
Humanities Research Ethics Special Working Committee: A Working
Committee of the Interagency Advisory Panel on Research Ethics was
convened (Blackstone et al. 2008); and in 2010, a meeting at the
Secretary’s Advisory Committee to the Office for Human Research
Protections highlighted Internet research (SACHRP 2010). Such
prominent professional organizations as the Public Responsibility in
Medicine and Research (PRIM&R) and the American Educational
Research Association (AERA) have begun featuring Internet research
ethics regularly at their conferences and related publications.
Increasingly, disciplines not traditionally involved in human subjects
research have begun their own explorations of IRE. For example,
researchers in computer security have actively examined the tenets of
research ethics in CS and ICT (Aycock et al. 2012; Dittrich, Bailey,
& Dietrich 2011; Carpenter & Dittrich 2012; Buchanan et al.
2011). The U.S. Federal Register requested comments on “The
Menlo Report” in December 2011, calling for a commitment by
computer science researchers to the three principles of respect for
persons, beneficence, and justice, while also adding a fourth
principle on respect for law and public interest (Homeland Security
2011). SIGCHI, an international society for professionals, academics,
and students interested in human-technology and human-computer
interaction (HCI), has increasingly focused on how IRE applies to work
in their domain (Frauenberger et al. 2017; Fiesler et al. 2022).
Further, SIGCHI and other computationally oriented venues, such
as the Conference on Neural Information Processing Systems (NeurIPS),
have started incorporating the consideration of research ethics into
their submission requirements and peer review processes (Ashurst et
al. 2022). Reflecting this increased focus within computational
domains, the National Science Foundation (NSF) launched a
cross-directorate program supporting research and training efforts on
“Ethical and Responsible Research” (ER2) in 2019
(Bauchspies et al. 2023).
4. Key Ethical Issues in Internet Research
4.1 Privacy
Principles of research ethics dictate that researchers must ensure
there are adequate provisions to protect the privacy of research
subjects and to maintain the confidentiality of any data collected. A
violation of privacy or breach of confidentiality presents a risk of
serious harm to participants, ranging from the exposure of personal or
sensitive information and the divulgence of embarrassing or illegal
conduct to the release of data otherwise protected under law.
Research ethics concerns around individual privacy are often expressed
in terms of the level of linkability of data to individuals, and the
potential harms from disclosure of information. As Internet research
has grown in complexity and computational sophistication, ethics
concerns have focused on current and future uses of data, and the
potential downstream harms that could occur. Protecting research
participants’ privacy and confidentiality is typically achieved
through a combination of research tactics and practices, including
engaging in data collection under controlled or anonymous
environments, the scrubbing of data to remove personally identifiable
information (PII), or the use of access restrictions and related data
security methods. The specificity and characteristics of the data
will often dictate whether there are regulatory considerations, in
addition to the methodological considerations around privacy and
confidentiality. For example, personally identifiable information
(PII) typically demands the most stringent protections. The National
Institutes of Health (NIH) defines PII as:
any information about an individual maintained by an agency,
including, but not limited to, education, financial transactions,
medical history, and criminal or employment history and information
which can be used to distinguish or trace an individual’s
identity, such as their name, SSN, date and place of birth,
mother’s maiden name, biometric records, etc., including any
other personal information that is linked or linkable to an individual
(NIH 2010).
Typically, examples of identifying pieces of information have included
personal characteristics (such as date of birth, place of birth,
mother’s maiden name, gender, sexual orientation, and other
distinguishing features and biometrics information, such as height,
weight, physical appearance, fingerprints, DNA and retinal scans),
unique numbers or identifiers assigned to an individual (such as a
name, address, phone number, social security number, driver’s
license number, financial account numbers), and descriptions of
physical location (GIS/GPS log data, electronic bracelet monitoring
information).
The 2018 EU General Data Protection Regulation lays out the legal and
regulatory requirements for data use across the EU. Mondschein &
Monda (2018) provide a thorough discussion of the different types of
data that are considered in the GDPR: Personal data, such as names,
identification numbers, location data, and so on; Special categories
of personal data, such as race or ethnic origin, political opinions, or
religious beliefs; Pseudonymous data, referring to data that has been
altered so the subject cannot be directly identified without having
further information; Anonymous data, information which does not relate
to an identifiable natural person or to personal data rendered
anonymous in such a manner that the data subject is not or no longer
identifiable. They also emphasize that considering
data protection issues at an early stage of a research project is of
great importance specifically in the context of large-scale research
endeavours that make use of personal data (2018: 56).
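The GDPR category of pseudonymous data can be illustrated with a minimal Python sketch. It assumes that a keyed hash (HMAC-SHA256) is an acceptable pseudonymization technique for a given project; the field names, the SECRET_KEY value, and the helper functions are hypothetical and are not drawn from Mondschein & Monda. The key plays the role of the "further information" needed to link a record back to a subject, so it would need to be stored separately from the data.

```python
import hashlib
import hmac

# Hypothetical key: in practice it would be randomly generated and stored
# apart from the research data, since it allows pseudonyms to be recomputed.
SECRET_KEY = b"replace-with-a-randomly-generated-key"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable pseudonym for a direct identifier using HMAC-SHA256."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened only for readability


def pseudonymize_record(record: dict, direct_identifiers: set) -> dict:
    """Replace direct identifiers with pseudonyms; other fields are unchanged."""
    return {
        field: pseudonymize(str(value)) if field in direct_identifiers else value
        for field, value in record.items()
    }


record = {"name": "Alice Example", "email": "alice@example.org", "answer": 4}
safe = pseudonymize_record(record, {"name", "email"})
```

Data treated this way are pseudonymous, not anonymous: anyone holding the key can recompute the pseudonyms, and the untouched fields may still help single out an individual.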
Internet research introduces new complications to these longstanding
definitions and regulatory frameworks intended to protect subject
privacy. For example, researchers increasingly are able to collect
detailed data about individuals from sources such as Facebook,
X/Twitter, blogs or public email archives, and these rich data sets
can more easily be processed, compared, and combined with other data
(and datasets) available online. In numerous cases, both researchers
and members of the general public have been able to re-identify
individuals by analyzing and comparing such datasets, using
data-fields as benign as one’s zip code (Sweeney 2002), random
Web search queries (Barbaro & Zeller 2006), or movie ratings
(Narayanan & Shmatikov 2008) as the vital key for reidentification
of a presumed anonymous user. Prior to widespread Internet-based data
collection and processing, few would have considered one’s movie
ratings or zip code as personally identifiable. Yet, these cases reveal
that merely stripping traditional “identifiable”
information such as a subject’s name, address, or social
security number is no longer sufficient to ensure data remains
anonymous (Ohm 2010), and requires the reconsideration of what is
considered “personally identifiable information” (Schwartz
& Solove 2011). This points to the critical distinction between
data that is kept confidential versus data that is truly anonymous.
Increasingly, data are rarely completely anonymous, as researchers
have routinely demonstrated they can often reidentify individuals
hidden in “anonymized” datasets with ease (Ohm 2010). This
reality places new pressure on ensuring datasets are kept, at the
least, suitably confidential through both physical and computational
security measures. These measures may also include requirements to
store data in “clean rooms”, or in non-networked
environments in an effort to control data transmission.
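The re-identification risk described above can be made concrete with a small sketch. Assuming a dataset from which names have already been removed, the following Python fragment counts how many records share each combination of quasi-identifiers. This is a k-anonymity-style check offered purely as an illustration, not the method used in the cited studies; the field names, sample records, and helper functions are hypothetical.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count records per combination of quasi-identifier values."""
    return Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group (the 'k' in k-anonymity); 0 if no records."""
    return min(group_sizes(records, quasi_identifiers).values(), default=0)


def unique_records(records, quasi_identifiers):
    """Records whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        record
        for record in records
        if sizes[tuple(record[field] for field in quasi_identifiers)] == 1
    ]


# Hypothetical "de-identified" records: no names, but several quasi-identifiers.
records = [
    {"zip": "53211", "birth_year": 1980, "gender": "F", "rating": 5},
    {"zip": "53211", "birth_year": 1980, "gender": "F", "rating": 2},
    {"zip": "53211", "birth_year": 1975, "gender": "M", "rating": 4},
    {"zip": "53202", "birth_year": 1980, "gender": "F", "rating": 3},
]

quasi = ["zip", "birth_year", "gender"]
k = smallest_group_size(records, quasi)
at_risk = unique_records(records, quasi)
```

A record that is unique on its quasi-identifiers could, in principle, be matched against an outside dataset containing the same fields alongside a name, which is the kind of linkage the cases above relied on.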
Similarly, new types of data often collected in Internet research
might also be used to identify a subject within a previously-assumed
anonymous dataset. For example, Internet researchers might collect
Internet Protocol (IP) addresses when conducting online surveys or
analyzing transaction logs. An IP address is a unique identifier that
is assigned to every device connected to the Internet; in most cases,
individual computers are assigned a unique IP address, while in some
cases the address is assigned to a larger node or Internet gateway for
a collection of computers. Many websites and Internet service
providers store activity logs linking IP addresses to online activity,
which can often be connected to specific devices or users (Mayer &
Mitchell 2012). U.S. privacy legislation remains fragmented and only
some laws treat IP addresses as personally identifiable information
(PII) in limited contexts. For example, the Children’s Online
Privacy Protection Act (COPPA) defines IP addresses as personal
information (16 C.F.R. § 312.2), and the California Consumer
Privacy Act (CCPA) also considers IP addresses to be personal
information (Cal. Civ. Code § 1798.140). In contrast, under the
European Union’s General Data Protection Regulation (GDPR), IP
addresses are explicitly considered personal data when they can be
linked to an identifiable individual. There could potentially be a
reconsideration by other federal regulatory agencies over IP addresses
as PII, and researchers and boards will need to be attentive should
such change occur.
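One common way to reduce the identifiability of logged IP addresses is to truncate them before storage. The sketch below uses the standard Python ipaddress module to zero the host portion of an address. The prefix lengths (/24 for IPv4, /48 for IPv6) and the truncate_ip helper are illustrative assumptions; whether a truncated address still counts as personal information will depend on the jurisdiction and on what other data it can be combined with.

```python
import ipaddress


def truncate_ip(address: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero the host bits of an IP address, keeping only the network prefix."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets us pass a host address and receive its enclosing network.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


# Addresses from the ranges reserved for documentation.
coarse_v4 = truncate_ip("203.0.113.77")
coarse_v6 = truncate_ip("2001:db8:abcd:12:3456::1")
```

Truncation coarsens the address to a block shared by many devices; it lowers, but does not remove, the chance that a log entry can be tied to a specific user.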
A similar complication emerges when we consider the meaning of
“private information” within the context of Internet-based
research. U.S. federal regulations define “private
information” as:
[A]ny information about behavior that occurs in a context in which an
individual can reasonably expect that no observation or recording is
taking place, and information that has been provided for specific
purposes by an individual and that the individual can reasonably
expect will not be made public (for example, a medical record) (45
C.F.R. § 46.102(f) 2009).
This standard definition of “private information” has two
key components. First, private information is that which subjects
reasonably expect is not normally monitored or collected. Second,
private information is that which subjects reasonably expect is not
typically publicly available. Conversely, the definition also suggests
the opposite is true: if users cannot reasonably expect data
isn’t being observed or recorded, or they cannot expect data
isn’t publicly available, then the data does not rise to the
level of “private information” requiring particular
privacy protections. Researchers and REBs have routinely worked with
this definition of “private information” to ensure the
protection of individuals’ privacy.
These distinctions take on greater weight, however, when considering
the data environments and collection practices common with
Internet-based research. Researchers interested in collecting or
analyzing online actions of subjects—perhaps through the mining
of online server logs, the use of tracking cookies, or the scraping of
social media profiles and feeds—could argue that subjects do not
have a reasonable expectation that such online activities are not
routinely monitored since nearly all online transactions and
interactions are routinely logged by websites and service providers.
Thus, online data trails might not rise to the level of “private
information”. However, numerous studies have indicated that
average Internet users have incomplete understandings of how their
activities are routinely tracked, and the related privacy practices
and policies of the sites they visit (Hoofnagle & King 2008 [Other
Internet Resources]; Milne & Culnan 2004; Tsai et al. 2006).
Hudson and Bruckman (2005) conducted empirical research on
users’ expectations and understandings of privacy, finding that
participants’ expectations of privacy within public chatrooms
conflicted with what was actually a very public online space.
Rosenberg (2010) examined the public/private distinction in the realm
of virtual worlds, suggesting researchers must determine what kind of
social norms and relations predominate in an online space before making
assumptions about the “publicness” of information shared
within. Thus, it remains unclear whether Internet users truly
understand if and when their online activity is regularly monitored
and tracked, and what kind of reasonable expectations truly exist.
This ambiguity creates new challenges for researchers and REBs when
trying to apply the definition of “private information” to
ensure subject privacy is properly addressed (Zimmer 2010).
This complexity in addressing subject privacy in Internet research is
further compounded by the rise of social networking as a place for
the sharing of information, and a site for research. Users
share more and more personal information on platforms
like Facebook, Instagram, and TikTok. For researchers, social media
platforms provide a rich resource for study, and much of the content
is available to be viewed and downloaded with minimal effort. Since
much of the information posted to social media sites is publicly
viewable, it thus fails to meet the standard regulatory definition of
“private information”. Therefore, researchers attempting
to collect and analyze social media postings might not treat the data
as requiring any particular privacy considerations. Yet, social media
platforms represent a complex environment of social interaction where
users are often required to place friends, lovers, colleagues, and
minor acquaintances within the same singular category of
“friends”, where privacy policies and terms of service are
not fully understood (Madejski et al. 2011), and where the technical
infrastructures fail to truly support privacy protections (Bonneau
& Preibush 2010) and regularly change with little notice (Stone
2009 [Other Internet Resources]; Zimmer 2009 [Other Internet
Resources]). As a result, it is difficult to understand with any
certainty what a user’s intention was when posting an item onto
a social media platform (Acquisti & Gross 2006). The user may have
intended the post for a private group but failed to completely
understand how to adjust the privacy settings accordingly. Or, the
information might have previously been restricted to only certain
friends, but a change in the technical platform suddenly made the data
more visible to all.
Ohm (2010) warns that
the utility and privacy of data are linked, and so long as data is
useful, even in the slightest, then it is also potentially
reidentifiable (2010: 1751).
With the rapid growth of Internet-based research, Ohm’s concern
becomes even more dire. The traditional definitions and approaches to
understanding the nature of privacy, anonymity, and precisely what
kind of information deserves protection become strained, forcing
researchers and REBs to consider more nuanced theories of privacy and
approaches to protecting subject privacy (Markham 2012; Zimmer 2010).
Zimmer (2018), for example, employs Nissenbaum’s (2009) theory
of privacy as “contextual integrity” when assessing the
ethics of using publicly-available data in research, urging a move
beyond binary classifications of data and towards a context-aware
ethical framework. Similarly, Fiesler et al. (2024) argue that
researchers who justify large-scale data collection from Reddit based
on its public availability must reconsider privacy as a contextual
concept—one shaped by user expectations, community norms, and
the potential harms of removing information from its original
environment.
4.2 Recruitment
Depending on the type of Internet research being carried out,
recruitment of participants may be done in a number of ways. As with
any form of research, the study population or participants are
selected for specific purposes (e.g., an ethnographic study of a
particular group of online game players), or can be selected using a
range of sampling techniques (e.g., a convenience sample gleaned from
the users of Amazon’s Mechanical Turk crowdsourcing platform).
In the U.S. context, a recruitment plan is considered part of the
informed consent process, and as such, any recruitment script or
posting must be reviewed and approved by an IRB prior to posting or
beginning solicitation (if the project is human subjects research).
Further, the selection of participants must be impartial and unbiased,
and any risks and benefits must be justly distributed. This concept is
challenging to apply in Internet contexts, in which populations are
often self-selected and can be exclusive, depending on membership and
access status, as well as the common disparities of online access
based on economic and social variables. Researchers also face
recruitment challenges due to online subjects’ potential
anonymity, especially as it relates to the frequent use of pseudonyms
online, having multiple or alternative identities online, and the
general challenges of verifying a subject’s age and demographic
information. Moreover, basic ethical principles for approaching and
recruiting participants involve protecting their privacy and
confidentiality. Internet research can both maximize these
protections, as an individual may never be known beyond a screen name
or avatar existence; or, conversely, the use of IP addresses,
placement of cookies, availability and access to more information than
necessary for the research purposes, may minimize the protections of
privacy and confidentiality.
Much recruitment now takes place via social media. Examples include
push technologies, a synchronous approach in which a text or tweet is
sent from a researcher to potential participants based on profile
data, platform activity, or geolocation. Pull-technology recruitment
methods include direct email, dedicated web pages,
YouTube videos, direct solicitation via “stickies” posted
on fora or web sites directing participants to a study site, and data
aggregation or scraping to identify potential participants. Just as
researchers must first comply with their institution’s research
ethics policies and review procedures, they must also respect the
terms and conditions of the platform, including both the specific
norms and community expectations of a given site or locale, as well as
the legal obligations imposed by terms of service agreements. For
example, early pro-anorexia web sites (see Overbeke 2008) were often
treated as sensitive spaces deserving special consideration, and
researchers were asked to respect the privacy of the participants and
not engage in research (Walstrom 2004). In the gaming context,
Reynolds and de Zwart (2010) ask:
Has the researcher disclosed the fact that he or she is engaged in
research and is observing/interacting with other players for the
purposes of gathering research data? How does the research project
impact upon the community and general game play? Is the research
project permitted under the Terms of Service?
Colvin and Lanigan (2005: 38) suggest researchers
Seek permission from Web site owners and group moderators before
posting recruitment announcements. Then, preface the recruitment
announcement with a statement that delineates the permission that has
been granted, including the contact person and date received. Identify
a concluding date (deadline) for the research study and make every
effort to remove recruitment postings, which often become embedded
within Web site postings.
Barratt and Lenton, among others, agree:
It is critical, therefore, to form partnerships with online community
moderators by not only asking their permission to post the request,
but eliciting their feedback and support as well (2010: 71).
Mendelson (2007) and Smith and Leigh (1997) note that recruitment
notices need to contain more information than typical flyers or
newspaper advertisements. Mentioning the
approval of moderators is important for establishing authenticity, and
so is providing detailed information about the study and how to
contact both the researchers and the appropriate research ethics
board.
Given the array of techniques possible for recruitment, the concept of
“research spam” requires attention. The Council of
American Survey Research Organizations (CASRO) warns
Research Organizations should take steps to limit the number of survey
invitations sent to targeted respondents by email solicitations or
other methods over the Internet so as to avoid harassment and response
bias caused by the repeated recruitment and participation by a given
pool (or panel) of data subjects (CASRO 2011: I.B.3).
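A limit of this kind can be enforced mechanically. The following Python sketch tracks invitations per respondent and refuses to send more than a fixed number within a time window. The cap of two invitations per 30 days, the InvitationLog class, and the respondent identifiers are all hypothetical; the CASRO code does not specify numeric thresholds.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class InvitationLog:
    """Tracks invitations per respondent and enforces a cap within a window."""

    def __init__(self, max_invites: int = 2, window: timedelta = timedelta(days=30)):
        self.max_invites = max_invites  # assumed cap, not a CASRO figure
        self.window = window            # assumed look-back period
        self._sent = defaultdict(list)  # respondent id -> list of send times

    def may_invite(self, respondent_id: str, now: datetime) -> bool:
        """True if the respondent is still under the cap for the current window."""
        recent = [t for t in self._sent[respondent_id] if now - t < self.window]
        return len(recent) < self.max_invites

    def record_invite(self, respondent_id: str, now: datetime) -> bool:
        """Log an invitation if permitted; return whether it may be sent."""
        if not self.may_invite(respondent_id, now):
            return False
        self._sent[respondent_id].append(now)
        return True


log = InvitationLog(max_invites=2, window=timedelta(days=30))
start = datetime(2024, 1, 1)
first = log.record_invite("respondent-1", start)
second = log.record_invite("respondent-1", start + timedelta(days=1))
third = log.record_invite("respondent-1", start + timedelta(days=2))
```

Here the third invitation is refused because two have already been sent within the window; once the earlier invitations age out, the respondent becomes eligible again.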
Ultimately, Internet researchers must take care to ensure that online
recruitment practices provide prospective participants with clear,
accessible, and sufficient information, both in the initial
recruitment message and in any subsequent consent processes.
Transparency is essential, particularly when recruitment occurs in
public or semi-public digital spaces where individuals may not expect
to be targeted for research. As Fiesler et al. (2024) point out in the
context of studying participants in online communities such as Reddit,
researchers must assess whether their recruitment methods could
inadvertently expose an individual’s identity without their
explicit consent, especially when usernames, comments, or other
digital traces can be linked back to a person. Ethical recruitment in
online spaces, they argue, requires a context-sensitive approach that
balances the visibility of digital data with respect for
individuals’ autonomy, anonymity, and safety.
4.3 Informed Consent
As the cornerstone of human subjects protections, informed consent
means that participants are voluntarily participating in the research
with adequate knowledge of relevant risks and benefits. Providing
informed consent typically includes the researcher explaining the
purpose of the research, the methods being used, the possible outcomes
of the research, as well as associated risks or harms that the
participants might face. The process involves providing the recipient
clear and understandable explanations of these issues in a concise
way, providing sufficient opportunity to consider them and enquire
about any aspect of the research prior to granting consent, and
ensuring the subject has not been coerced into participating. Gaining
consent in traditional research is typically done verbally or in
writing, either in a face-to-face meeting where the researcher reviews
the document, through telephone scripts, through mailed documents,
fax, or video, and can be obtained with the assistance of an advocate
in the case of vulnerable populations. Most importantly, informed
consent was built on the ideal of “process” and the
verification of understanding, and thus, requires an ongoing
communicative relationship between and among researchers and their
participants. The emergence of the Internet as both a tool and a venue
for research has introduced challenges to this traditional approach to
informed consent.
In most regulatory frameworks, there are instances when informed
consent might be waived, or the standard processes of obtaining
informed consent might be modified, if approved by a research ethics
board.
Various forms of Internet research require different approaches to
the consent process. Some standards have emerged, depending on venue
(e.g., an online survey platform versus a private Facebook group).
However, researchers are encouraged to consider waiver of consent
and/or documentation, if appropriate, by using the flexibilities of
their extant regulations.
Where consent is required but documentation has been waived by an
ethical review board, a “portal” can be used to provide
consent information. For example, a researcher may send an email to
the participant with a link to a separate portal or site information page
where information on the project is contained. The participant can
read the documentation and click on an “I agree”
submission. Rosser et al. (2010) recommend using a
“chunked” consent document, whereby individuals can read
specific sections, agree, and then continue onwards to completion of
the consent form, until reaching the study site.
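The "chunked" approach can be sketched as a simple data structure in which each section must be agreed to, in order, before the participant reaches the study site. The Python fragment below is a minimal illustration under that assumption; the class names, section titles, and placeholder text are hypothetical and do not reproduce the consent form used by Rosser et al.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentSection:
    title: str
    text: str


@dataclass
class ChunkedConsent:
    sections: list
    agreed: list = field(default_factory=list)  # titles agreed to, in order

    def current_section(self):
        """Return the next section awaiting agreement, or None when finished."""
        if len(self.agreed) < len(self.sections):
            return self.sections[len(self.agreed)]
        return None

    def agree(self, title: str) -> None:
        """Record agreement to the current section; skipping ahead is rejected."""
        section = self.current_section()
        if section is None or section.title != title:
            raise ValueError("sections must be agreed to in order")
        self.agreed.append(title)

    @property
    def complete(self) -> bool:
        """True once every section has been agreed to."""
        return len(self.agreed) == len(self.sections)


form = ChunkedConsent(
    sections=[
        ConsentSection("Purpose", "Placeholder text describing the study."),
        ConsentSection("Risks", "Placeholder text describing possible risks."),
        ConsentSection("Data sharing", "Placeholder text on data storage and reuse."),
    ]
)
form.agree("Purpose")
```

A portal built on such a structure would admit the participant to the study site only when `form.complete` is true, and could log the time of each agreement if documentation of consent has not been waived.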
In addition to portals, researchers will often make use of consent
cards or tokens; this alleviates concerns that unannounced researcher
presence is unacceptable, or, that a researcher’s presence is
intrusive to the natural flow and movement of a given locale. Hudson
and Bruckman (2004, 2005) highlighted the unique challenges in gaining
consent in chat rooms, while Lawson (2004) offers an array of consent
possibilities for synchronous computer-mediated communication. There
are different practical challenges in the consent process in Internet
research, given the fluidity and temporal nature of Internet
spaces.
If documentation of consent is required, some researchers have
utilized alternatives such as electronic signatures, which can range
from a simple electronic check box to acknowledge acceptance of the
terms to more robust means of validation using encrypted digital
signatures, although the validity of electronic signatures varies by
jurisdiction.
Regardless of venue, informed consent documents are undergoing changes
in the information provided to research participants. While the basic
elements of consent remain intact, researchers must now acknowledge
with less certainty specific aspects of their data longevity, risks to
privacy, confidentiality and anonymity (see §4.1 Privacy, above),
and access to or ownership of data. Research participants must
understand that their terms of service or end user license agreement
consent is distinct from their consent to participate in research.
And, researchers must address and inform participants/subjects about
potential risk of data intrusion or misappropriation of data if
subsequently made public or available outside of the confines of the
original research. Statements should be revised to reflect such
realities as cloud storage (see §4.4 below) and data sharing.
For example, Aycock et al. (2012: 141) describe a continuum of
security and access statements used in informed consent documents:
“No others will have access to the data”
“Anonymous identifiers will be used during all data
collection and analysis and the link to the subject identifiers will
be stored in a secure manner”
“Data files that contain summaries of chart reviews and
surveys will only have study numbers but no data to identify the
subject. The key [linking] subject names and these study identifiers
will be kept in a locked file”
“Electronic data will be stored on a password protected and
secure computer that will be kept in a locked office. The software
‘File Vault’ will be used to protect all study data loaded
to portable laptops, flash drives or other storage media. This will
encode all data… using Advanced Encryption Standard with
128-bit keys (AES-128)”
The use of encryption in the last statement may be necessary in
research involving sensitive data, such as medical, sexual, health, or
financial information. Barratt and Lenton (2010), in their research on
illicit drug use and online forum behaviors, also provide guidance
about use of secure transmission and encryption as part of the consent
process.
In addition to informing participants about potential risks and
employing technological protections, NIH-funded researchers whose work
includes projects with identifiable, sensitive information will
automatically be issued a Certificate of Confidentiality:
CoCs protect the privacy of research subjects by prohibiting
disclosure of identifiable, sensitive research information to anyone
not connected to the research except when the subject consents or in a
few other specific situations (NIH 2021 [Other Internet
Resources]).
However, CoCs do not protect against release of data outside of the
U.S. Given that Internet research inherently spans national and
cultural boundaries, new models of informed consent and data
protection may be necessary to ensure meaningful confidentiality and
respect for participants. In traditional international research,
models of informed consent already face fundamental challenges due to
cultural norms and local expectations (Annas 2009; Boga et al. 2011;
Krogstad et al. 2010). In the context of Internet research, where
researchers may not even know the geographic or cultural context of an
individual participant, obtaining valid, culturally appropriate
consent becomes even more complex and demanding. While current
standards of practice show that consent models stem from the
jurisdiction of the researcher and sponsoring research institution,
complications arise in the face of age verification, age of
majority/consent, reporting of adverse effects or complaints with the
research process, and authentication of identity. Various
jurisdictional laws around privacy are relevant for the consent
process; a useful tool is DLA Piper’s “Data Protection
Laws of the World” resource, which relies on in-depth analyses
of the data privacy-related laws and cultures of countries around the
world, helping researchers design appropriate approaches to privacy
and data protection given the particular context (see
Other Internet Resources).
In addition, as more federal agencies and funding bodies around the
world require researchers to make their data publicly available (e.g.,
NSF, NIH, Wellcome Trust, Research Councils U.K.), the language used
in consent documents will change accordingly to represent this
intended longevity of data and opportunities for future, unanticipated
use. Given the ease with which Internet data can flow between and
among Internet venues, changes in the overall accessibility of data
might occur (early “private” newsgroup conversations were
made “publicly searchable” when Google bought DejaNews),
and reuse and access by others is increasingly possible with shared
datasets. Current data sharing mandates must be considered in the
consent process. Alignment between a data sharing policy and an
informed consent document is imperative. Both should include
provisions for appropriate protection of privacy, confidentiality,
security, and intellectual property.
There is general agreement in the U.S. that individual consent is not
necessary for researchers to use publicly available data, such as
public X/Twitter feeds. Recommendations were made by the National
Human Research Protections Advisory Committee (NHRPAC) in 2002
regarding publicly available data sets (see Other Internet Resources).
Data use or data restriction agreements are commonly used and set the
parameters of use for researchers.
The U.K. Data Archive (2011 [Other Internet Resources])
provides guidance on consent and data sharing:
When research involves obtaining data from people, researchers are
expected to maintain high ethical standards such as those recommended
by professional bodies, institutions and funding organisations, both
during research and when sharing data. Research data — even
sensitive and confidential data — can be shared ethically and
legally if researchers pay attention, from the beginning of research,
to three important aspects:
when gaining informed consent, include provision for data sharing
where needed, protect people’s identities by anonymising data
consider controlling access to data
These measures should be considered jointly. The same measures form
part of good research practice and data management, even if data
sharing is not envisioned. Data collected from and about people may
hold personal, sensitive or confidential information. This does not
mean that all data obtained by research with participants are personal
or confidential. (p. 23)
The ethical complexities of consent and data sharing made public
headlines in 2016 when a Danish researcher released a data set
comprised of scraped data from nearly 70,000 users of the OkCupid
online dating site. The data set was highly reidentifiable and
included potentially sensitive information, including usernames, age,
gender, geographic location, what kind of relationship (or sex)
they’re interested in, personality traits, and answers to
thousands of profiling questions used by the site. The researcher
claimed the data were public and thus, such sharing and use was
unproblematic and no consent was necessary. Zimmer (2016) was among
many privacy and ethics scholars who critiqued this stance.
Shilton et al. (2021) further explore the ethical challenges of
applying traditional informed consent models to Internet and social
data-driven research. They argue that in contexts where data is
passively collected or repurposed, such as through social media
platforms or mobile devices, conventional informed consent mechanisms
often fall short: users may not fully comprehend how
their data is utilized, and the dynamic nature of data collection
makes it difficult to obtain meaningful consent. They suggest that
relying solely on informed consent is insufficient for ethical data
practices in Internet research, and advocate for a more holistic
approach that includes transparency, accountability, and user
empowerment to ensure ethical standards are upheld in the evolving
landscape of big data research.
4.4 Cloud-Based Platforms and Research Ethics
The rise of cloud-based platforms and distributed computing
environments has created new opportunities—and ethical
challenges—for researchers working with internet-based tools,
data, and collaborations. While “cloud computing” once
referred narrowly to the remote delivery of storage and computing
power via services like Amazon Web Services or Microsoft Azure, the
term now broadly encompasses a range of tools and platforms that
enable the collection, processing, and sharing of data online.
Researchers today rely on a variety of cloud-enabled services,
including collaborative workspaces (e.g., Google Drive, Microsoft
365), social platforms (e.g., Reddit, Facebook, X/Twitter),
API-enabled data collection tools, and crowdsourcing platforms (e.g.,
Amazon Mechanical Turk, Prolific). These tools are used for tasks such
as subject recruitment, data scraping, analysis, storage, remote
collaboration, and even real-time AI-powered interventions. As these
tools grow in sophistication and reach, so do the ethical risks
associated with their use.
A central ethical concern in cloud-based research is ensuring the
protection of personal data. Researchers must verify that datasets
stored in the cloud are secured through appropriate access controls,
encryption, and data minimization practices. It is critical to assess
whether third-party platforms (including storage providers and APIs)
collect metadata or reserve access rights through their terms of
service. These contracts may permit advertisers, law enforcement, or
platform owners to access data in ways that conflict with the
expectations of research participants or IRB approvals.
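As a minimal sketch of the data minimization practice described above, the following Python fragment keeps only the fields a study needs and replaces a direct identifier with a keyed hash before a record is uploaded to cloud storage. The field names, the helper functions, and the handling of the secret key are illustrative assumptions rather than a prescribed procedure. Keyed hashing is pseudonymization, not anonymization, so re-identification may still be possible from the fields that remain.

```python
import hashlib
import hmac

# Hypothetical field names; a real study would take these from its approved protocol.
ALLOWED_FIELDS = {"age_band", "region", "response"}


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest to stand in for the raw identifier."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize_record(record: dict, secret_key: bytes) -> dict:
    """Keep only the allowed fields and swap the username for a pseudonym."""
    reduced = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    reduced["participant_id"] = pseudonymize(record["username"], secret_key)
    return reduced
```

In this sketch the secret key is assumed to stay with the research team and never to be stored alongside the uploaded data, so that the platform provider cannot reverse the mapping by hashing known usernames.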
Geographic distribution of data storage adds complexity, particularly
in relation to jurisdictional privacy laws (e.g., GDPR in the EU, CCPA
in California). Ethical data stewardship includes ensuring that
sensitive research data is handled in accordance with relevant legal
standards–including rights such as the right to be forgotten or
the right to erasure–and that deletion or withdrawal of data is
possible, even when stored in distributed or redundant systems.
A more distinctive application of cloud computing for research involves the
crowdsourcing of data analysis and processing functions, that is,
leveraging the thousands of users of various online products and
services to complete research-related tasks remotely. Examples include
using a distributed network of video game players to assist in solving
protein folding problems (Markoff 2010), and leveraging Amazon’s
Mechanical Turk crowdsourcing marketplace platform to assist with
large-scale data processing and coding functions that cannot be
automated (Conley & Tosti-Kharas 2014; J. Chen et al. 2011). Using
cloud-based platforms can raise various critical ethical and
methodological issues.
First, new concerns over data privacy and security emerge when
research tasks are widely distributed across a global network of
users. Researchers must take great care in ensuring research data
containing personal or sensitive information isn’t accessible by
outsourced labor, or that none of the users providing crowdsourced
labor are able to aggregate and store their own copy of the research
dataset. Second, crowdsourcing presents ethical concerns over trust
and validity of the research process itself. Rather than a local team
of research assistants usually under a principal investigator’s
supervision and control, crowdsourcing tends to be distributed beyond
the direct management or control of the researcher, providing less
opportunity to ensure sufficient training for the required tasks.
Thus, researchers will need to create additional means of verifying
data results to confirm tasks are completed correctly.
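One common way to verify crowdsourced coding, offered here only as a sketch, is to assign each item to several workers and accept a label only when enough of them agree, routing the remainder to a trained reviewer. The function name, the input layout, and the 0.6 agreement threshold below are assumptions chosen for illustration, not values drawn from the literature cited above.

```python
from collections import Counter


def aggregate_labels(labels_by_item: dict, min_agreement: float = 0.6) -> tuple:
    """Split items into accepted majority labels and items flagged for expert review."""
    accepted, flagged = {}, []
    for item_id, labels in labels_by_item.items():
        if not labels:
            # No worker responded, so nothing can be verified.
            flagged.append(item_id)
            continue
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[item_id] = label
        else:
            flagged.append(item_id)
    return accepted, flagged
```

Majority agreement only shows that workers were consistent with one another, not that they were right, so a researcher would likely still audit a sample of the accepted items.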
Two additional ethical concerns with crowdsourcing involve labor
practices and authorship. Platforms like Amazon Mechanical Turk were
originally designed to facilitate paid microtasks such as data
labeling and transcription–not to serve as recruitment tools for
research participants. When researchers use such platforms for human
subjects research, they must ensure that workers are not exploited,
that they are legally eligible for paid work, and that compensation is
fair, meaningful, and appropriate to the nature of the research task
(Scholz 2008; Williams 2010 [Other Internet Resources]).
Finally, at the conclusion of a research project that has relied on
crowdsourcing, researchers may face the ethical challenge of how to
appropriately acknowledge the contributions of crowd workers–who
are often anonymous. Ethical research demands a fair and accurate
account of authorship and contribution. Disciplinary norms for
reporting contributions by collaborators and research assistants vary,
and these complexities are amplified when the work of anonymous crowd
laborers has shaped the research (Silberman et al. 2010).
4.5 Big Data Considerations
Algorithmic processing is a corollary of big data research, and
newfound ethical considerations have emerged. From “algorithmic
harms” to “predictive analytics”, the power of
today’s algorithms exceeds long-standing privacy beliefs and
norms. Specifically, the National Science and Technology Council
characterizes
“Analytical algorithms” as algorithms for prioritizing,
classifying, filtering, and predicting. Their use can create privacy
issues when the information used by algorithms is inappropriate or
inaccurate, when incorrect decisions occur, when there is no
reasonable means of redress, when an individual’s autonomy is
directly related to algorithmic scoring, or when the use of predictive
algorithms chills desirable behavior or encourages other privacy
harms. (NSTC 2016: 18).
Although the concept of big data has existed since the 1990s in
technical circles, public awareness of and critical engagement with big
data research have only emerged more recently. In particular, Buchanan
(2016) traces the rise of big data-driven research–especially
involving social media platforms–from around 2012 onward, noting
that this trend shows no signs of slowing.
Big data research is challenging for research ethics boards, often
presenting what the computer ethicist James Moor would call
“conceptual muddles”: the inability to properly
conceptualize the ethical values and dilemmas at play in a new
technological context. Subject privacy, for example, is typically
protected within the context of research ethics through a combination
of various tactics and practices, including engaging in data
collection under controlled or anonymous environments, limiting the
personal information gathered, scrubbing data to remove or obscure
personally identifiable information, and using access restrictions and
related data security methods to prevent unauthorized access and use
of the research data itself. The nature and understanding of privacy
become muddled, however, in the context of big data research, and as a
result, ensuring it is respected and protected in this new domain
becomes challenging.
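To illustrate why scrubbing direct identifiers may not be enough in large datasets, the following sketch computes the size of the smallest group of records that share the same combination of indirect attributes, the quantity usually called k in k-anonymity. The function name and the choice of attributes are assumptions for illustration. A result of 1 means at least one person is unique on those attributes and could be singled out even with names removed.

```python
from collections import Counter


def smallest_group_size(records: list, quasi_identifiers: list) -> int:
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    # An empty dataset has no groups; report 0 rather than raising an error.
    return min(groups.values()) if groups else 0
```

A small k is a warning sign rather than a full risk assessment, since the measure says nothing about how sensitive the remaining fields are or what outside data could be linked to them.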
For example, the determination of what constitutes “private
information”—and thus triggering particular privacy
concerns—becomes difficult within the context of big data
research. Distinctions within the regulatory definition of
“private information”—namely, that it only applies
to information which subjects reasonably expect is not normally
monitored or collected and not normally publicly
available—become less clearly applicable when considering the
data environments and collection practices that typify big data
research, such as the wholesale scraping of Facebook news feed content
or public OkCupid accounts.
When considered through the lens of the regulatory definition of
“private information”, social media postings are often
considered public, especially when users take no visible, affirmative
steps to restrict access. As a result, big data researchers might
conclude subjects are not deserving of particular privacy
consideration. Yet, the social media platforms frequently used for big
data research purposes represent a complex environment of
socio-technical interactions, where users often fail to understand
fully how their social activities might be regularly monitored,
harvested, and shared with third parties, where privacy policies and
terms of service are not fully understood and change frequently, and
where the technical infrastructures and interfaces are designed to
make restricting information flows and protecting one’s privacy
difficult.
As noted in §4.1 above, it becomes difficult to confirm a user’s
intention when sharing
information on a social media platform, and whether users recognize
that providing information in a social environment also opens it up
for widespread harvesting and use by researchers. This uncertainty in
the intent and expectations of users of social media and
internet-based platforms—often fueled by the design of the
platforms themselves—creates numerous conceptual muddles in our
ability to properly alleviate potential privacy concerns in big data
research.
The conceptual gaps that exist regarding privacy and the definition of
personally identifiable information in the context of big data
research inevitably lead to similar gaps regarding when informed
consent is necessary. Researchers mining Facebook profile information
or public X/Twitter streams, for example, typically argue that no
specific consent is necessary because the information was
publicly available. It remains unknown whether users truly understood
the technical conditions under which they made information visible on
these social media platforms or if they foresaw their data being
harvested for research purposes, rather than just appearing onscreen
for fleeting glimpses by their friends and followers (Fiesler &
Proferes 2018). In the case of the Facebook emotional contagion
experiment (Kramer, Guillory, & Hancock 2014), which involved
nearly 690,000 users, the failure to obtain informed consent was
initially justified by invoking Facebook’s broad terms of
service–a document over 9,000 words long that makes only a
passing reference to “research” in its data use policy. It
was later revealed, however, that the data use policy in effect when
the experiment was conducted never mentioned “research” at
all (Hill 2014).
Additional ethical concerns have arisen surrounding the large-scale
data collection practices connected to machine learning and the
development of artificial intelligence. For example, negative public
attention has surrounded algorithms designed to infer sexual
orientation from photographs and facial recognition algorithms trained
on videos of transgender people. In both cases, ethical concerns have
been raised about both the purpose of these algorithms and the fact
that the data that trained them (dating profile photos and YouTube
videos, respectively) was “public” but collected from
potentially vulnerable populations without consent (Metcalf 2017;
Keyes 2019). While those building AI systems cannot always control the
conditions under which the data they utilize is collected, their
increased use of big datasets captured from social media or related
sources raises a number of concerns beyond what typically is
considered part of the growing focus on AI ethics: fairness,
accountability and transparency in AI can only be fully possible when
data collection is achieved in a fair, ethical, and just manner (Stahl
& Wright 2018; Kerry 2020).
Shilton et al. (2021) expand on these concerns by highlighting how
traditional models of privacy and informed consent are often
ill-suited to the realities of large-scale, big data collection. They
challenge researchers to move beyond simplistic public/private
distinctions and instead adopt context-aware approaches rooted in user
expectations, potential harms, and community norms, arguing that the
ethical use of big data requires not just technical compliance with
terms of service or regulatory frameworks, but a deeper engagement
with issues of power, marginalization, and the lived realities of data
subjects.
4.6 Internet Research and Industry Ethics
The Facebook emotional contagion experiment, discussed above, is just
one example in a larger trend of big data research conducted outside
of traditional university-based research ethics oversight mechanisms.
Nearly all online companies and platforms analyze data and test
theories that often rely on data from individual users. Industry-based
data research, once limited to marketing-oriented “A/B
testing” of benign changes in interface designs or corporate
communication messages, now encompasses information about how users
behave online, what they click and read, how they move, eat, and
sleep, the content they consume online, and even how they move about
their homes. Such research produces inferences about
individuals’ tastes and preferences, social relations,
communications, movements, and work habits. It implies pervasive
testing of products and services that are an integral part of intimate
daily life, ranging from connected home products to social networks to
smart cars. Except in cases where they are partnering with academic
institutions, companies typically do not put internal research
activities through a formal ethical review process, since results are
typically never shared publicly and the perceived impact on users is
minimal.
The growth of industry-based big data research, however, presents new
risks to individuals’ privacy, on the one hand, and to
organizations’ legal compliance, reputation, and brand, on the
other hand. When organizations process personal data outside of their
original context, individuals may in some cases greatly benefit, but
in other cases may be surprised, outraged, or even harmed. Soliciting
consent from affected individuals can be impractical: Organizations
might collect data indirectly or based on identifiers that do not
directly match individuals’ contact details. Moreover, by
definition, some non-contextual uses—including the retention of
data for longer than envisaged for purposes of a newly emergent
use—may be unforeseen at the time of collection. As Crawford and
Schultz (2014) note,
how does one give notice and get consent for innumerable and perhaps
even yet-to-be-determined queries that one might run that create
“personal data”? (2014: 108)
With corporations developing vast “living laboratories”
for big data research, research ethics has become a critical component
of the design and oversight of these activities. For example, in
response to the controversy surrounding the emotional contagion
experiment, Facebook developed an internal ethical review process
that, according to its facilitators,
leverages the company’s organizational structure, creating
multiple training opportunities and research review checkpoints in the
existing organizational flow (Jackman & Kanerva 2016: 444).
While such efforts are important and laudable, they remain open for
improvement. Hoffmann (2016), for example, has criticized Facebook for
launching an ethics review process that “innovates on process
but tells us little about the ethical values informing their product
development.” Further, in their study of employees doing the
work of ethics inside of numerous Silicon Valley companies, Metcalf
and colleagues found considerable tension between trying to resolve
thorny ethical dilemmas that emerge within an organization’s
data practices and the broader business model and corporate logic that
dominates internal decision-making (Metcalf, Moss, & boyd
2019).
Moreover, new uses of AI systems trained on platform data--often
scraped or mined without individuals’ awareness--have raised new
ethical concerns. Examples include predictive algorithms trained on
sensitive data sources such as dating profile photos, biometric scans,
and YouTube videos of vulnerable communities—often without the
knowledge or consent of the individuals depicted. As King and
Meinhardt (2024) emphasize, organizations now often generate sensitive
personal data not through direct collection, but by using AI systems
to infer characteristics such as mental health status or sexual
orientation from innocuous data like search queries or social media
activity. These practices raise urgent questions about how to ensure
meaningful consent, particularly when individuals are unaware their
data has been included in training datasets, and highlight the broader
challenge of addressing emergent privacy harms that traditional
regulatory frameworks are ill-equipped to manage. While the data
used may be “public,” the intent, impact, and power
asymmetries involved often demand a higher ethical standard (Shilton
et al. 2021).
Taken together, these developments underscore the urgent need for
industry-specific ethical frameworks that go beyond self-regulation or
internal ethics-by-design protocols. Proposals include establishing
third-party ethics audits, enforcing algorithmic transparency
requirements, and building external oversight boards for high-risk
data projects (Bernstein et al. 2021 [Other Internet Resources]).
5. Conclusion
As the Internet continues to serve as a research tool, venue, and
object of study, the ethical landscape of Internet research remains
both expansive and evolving. From early debates about whether existing
frameworks like consequentialism or deontology could adequately
address online research, to current concerns about datafication,
algorithmic harms, and cross-jurisdictional data flows, the core
challenges of Internet Research Ethics persist, but in increasingly
complex and uncertain forms. This entry has highlighted how
foundational principles of human subjects research – such as
privacy, informed consent, and justice – are being reshaped in
the face of cloud-based platforms, big data practices, and the growth
of computational methods. Key ethical issues such as participant
recruitment, consent mechanisms, and the ambiguous status of public
data now require not only updated definitions but also flexible,
context-aware interpretations that consider both the technological
infrastructures and social expectations at play. Research ethics
boards remain central, yet they are uneven in their ability to
navigate emerging dilemmas, especially when industry-based research
often falls outside traditional oversight mechanisms.
Ultimately, Internet research ethics is not a static set of rules but
a dynamic, interdisciplinary endeavor that must keep pace with rapidly
shifting digital norms, evolving technologies, and emerging forms of
harm. The challenges ahead demand thoughtful engagement across
disciplines and sectors, renewed attention to fairness and
accountability, and ongoing efforts to adapt ethical guidance and
governance in ways that remain sensitive to the lives, identities, and
vulnerabilities of those whose data becomes the object of
research.