Privacy Risks from Genomic Data-Sharing Beacons - PMC
Official websites use .gov
.gov
website belongs to an official
government organization in the United States.
Secure .gov websites use HTTPS
lock
) or
means you've safely
connected to the .gov website. Share sensitive
information only on official, secure websites.
Journal List
User Guide
PERMALINK
As a library, NLM provides access to scientific literature. Inclusion in an NLM database does not imply endorsement of, or agreement with,
the contents by NLM or the National Institutes of Health.
Learn more:
PMC Disclaimer
PMC Copyright Notice
. 2015 Oct 29;97(5):631–646. doi:
10.1016/j.ajhg.2015.09.010
Privacy Risks from Genomic Data-Sharing Beacons
Suyash S Shringarpure
Suyash S Shringarpure
Department of Genetics, Stanford University, Stanford, CA 94305, USA
Find articles by
Suyash S Shringarpure
1,
Carlos D Bustamante
Carlos D Bustamante
Department of Genetics, Stanford University, Stanford, CA 94305, USA
Find articles by
Carlos D Bustamante
1,
∗∗
Department of Genetics, Stanford University, Stanford, CA 94305, USA
Corresponding author
suyashs@stanford.edu
∗∗
Corresponding author
cdbustam@stanford.edu
Received 2015 Aug 11; Accepted 2015 Sep 23; Issue date 2015 Nov 5.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
PMC Copyright notice
PMCID: PMC4667107  PMID:
26522470
Abstract
The human genetics community needs robust protocols that enable secure sharing of genomic data from participants in genetic research. Beacons are web servers that answer allele-presence queries—such as “Do you have a genome that has a specific nucleotide (e.g., A) at a specific genomic position (e.g., position 11,272 on chromosome 1)?”—with either “yes” or “no.” Here, we show that individuals in a beacon are susceptible to re-identification even if the only data shared include presence or absence information about alleles in a beacon. Specifically, we propose a likelihood-ratio test of whether a given individual is present in a given genetic beacon. Our test is not dependent on allele frequencies and is the most powerful test for a specified false-positive rate. Through simulations, we showed that in a beacon with 1,000 individuals, re-identification is possible with just 5,000 queries. Relatives can also be identified in the beacon. Re-identification is possible even in the presence of sequencing errors and variant-calling differences. In a beacon constructed with 65 European individuals from the 1000 Genomes Project, we demonstrated that it is possible to detect membership in the beacon with just 250 SNPs. With just 1,000 SNP queries, we were able to detect the presence of an individual genome from the Personal Genome Project in an existing beacon. Our results show that beacons can disclose membership and implied phenotypic information about participants and do not protect privacy a priori. We discuss risk mitigation through policies and standards such as not allowing anonymous pings of genetic beacons and requiring minimum beacon sizes.
Introduction
In the coming decade, a great deal of human genomic data, along with linked phenotypes in electronic health records, will be collected in the context of health care. A major goal of the human genomics community is to enable efficient sharing, aggregation, and analysis of these data in order to understand the genetic contributors of health and disease. Previous large-scale data-sharing approaches have had limited success because of the potential for privacy breaches and risks of participant re-identification. Homer et al.
and others
showed that subjects in a genome-wide association study could be re-identified with the use of allele frequencies, resulting in the removal of publicly available allele-frequency data.
The Beacon Project by the Global Alliance for Genomics & Health (GA4GH) aims to simplify data sharing through a web service (“beacon”) that provides only allele-presence information. Users can query institutional beacons for information about genomic data available at the institution. Queries are of the form “Do you have a genome that has a specific nucleotide (e.g., A) at a specific genomic position (e.g., position 11,272 on chromosome 1)?” and the beacon server can answer “yes” or “no.” Beacons are intended to be easily set up and to allow data sharing while protecting participant privacy. By providing only allele-presence information, beacons are safe from attacks that require allele frequencies.
However, a privacy breach from a beacon would be troubling given that beacons often summarize data with a particular disease of interest. For instance, identifying that a given genome is part of the SFARI beacon, which contains genomic data from families with a child affected by autism spectrum disorder, means that the individual belongs to a family where some member has autism spectrum disorder. Thus, beacons could leak not only membership information but also phenotype information. Although genetic privacy is protected to some extent by the Genetic Information Nondiscrimination Act (GINA), the offered protections are limited, and GINA does not apply to long-term care insurance, life insurance, disability insurance, or other special cases.
Therefore, all data-sharing mechanisms, including beacons, must protect participant privacy.
To examine the question of re-identification in a beacon, we have developed a likelihood-ratio test (LRT) that uses allele presence or absence responses from a beacon to predict whether a given individual genome is present in the beacon database. Our approach is independent of allele frequencies. The statistical properties of the LRT guarantee that it is the most powerful test for this problem. A variation of our LRT can detect relatives of the query individual in the beacon. Our results suggest that anonymous-access beacons do not protect individual privacy and are open to re-identification attacks. As a result, they can also disclose phenotype information about individuals whose genomes are present in the beacon.
Material and Methods
We assume a beacon composed of unrelated individuals from a single population. Given query
= {
}, the beacon answers “yes” (represented as 1) if allele
is an alternate allele at position
on chromosome
and has a non-zero frequency in the sample used for constructing the beacon, and it answers “no” (represented as 0) otherwise. We consider only bi-allelic SNPs for our analysis.
Thus, given a set of
queries
= {
, …,
}, the beacon returns a set of responses
= {
, …,
}. For our scenario, we assume that the attacker has access to more information—the number of individuals (
) in the beacon database and the site frequency spectrum (SFS) of the population in the beacon—parameterized as a beta distribution with shape parameters (
′,
′). Thus, we assume that alternate allele frequencies
for all SNPs observed in the population are distributed as
∼ beta(
′,
′).
For our attack scenario, we assume a setting identical to that used by Homer et al.
and others. In this setting, the attacker receives a VCF file listing all the SNP positions at which the query individual has an alternate allele and the genotype calls at the corresponding positions. The attacker then queries the beacon for all heterozygous positions by using the alternate allele listed in the VCF and obtains the set of responses
from the beacon. We develop a LRT that can use the responses
to decide whether the query genome is in the beacon.
If the query individual is present in the beacon, then every allele in the query genome must be present in the beacon. Thus, the beacon will return a “yes” (1) response to every query. If a query individual is not present in the beacon, then the beacon response will be “yes” (1) if some individual in the beacon has the allele and “no” (0) otherwise. By calculating the likelihood of the responses, we can differentiate query individuals in the beacon from those not in the beacon. Our approach for re-identifying individuals within a beacon is based on a LRT that uses this information. For each query genome, we calculate the likelihood of the beacon responses to
allele-presence queries under the null hypothesis that a given individual is not in the beacon and the alternative hypothesis that the given individual is in the beacon. We then calculate the test statistic as the ratio of the two likelihoods.
To make our LRT generalizable across populations, we will remove direct dependence on allele frequencies given that frequencies can vary considerably for a given allele across populations. Instead, we will allow our test to depend on the shape of the SFS, which is described by (
′,
′), the parameters of the beta distribution. Although allele frequencies for a given allele can vary considerably across populations, the SFS parameters for most populations are similar to each other (
Modeling SFSs by Beta Distributions
in
Appendix A
). Therefore, the results from a test that depends on the shape of the SFS but is independent of the actual allele frequencies can be generalized to many populations (
Figure S1
).
Our LRT evaluates the likelihood of the beacon response under two possible hypotheses.
Null hypothesis
: query genome is not in the beacon database.
Alternative hypothesis
: query genome is in the beacon database.
LRT
In an ideal setting, we would expect
… =
= 1 if a query genome
is in the beacon
. In practice, because of sequencing errors and differences in variant-calling pipelines, we might have some mismatches between the query copy of a genome and its copy in the beacon. We assume that this happens with probability
Let the alternate allele frequency at the SNP corresponding to query
be
. Because the beacon is only queried at the positions where the query genome is heterozygous,
is not distributed as beta(
′,
′) but shows an ascertainment bias. We can show that
∼ beta(
), where
′ + 1 and
′ + 1 in theory (
Posterior Distribution of Allele Frequencies
in
Appendix A
).
The log-likelihood of a response set
= {
, …,
} can be written as
log
log
(Equation 1)
For the LRT, we need to evaluate this log-likelihood under the null hypothesis and the alternative hypothesis. The null hypothesis is that the query genome is not present in the beacon, and the alternative hypothesis is that the query genome is present in the beacon.
We can show that under the alternative hypothesis, the log-likelihood can be calculated as
log
log
(Equation 2)
where
− 1
is the probability that none of
− 1 genomes has an alternate allele at a given position (see
Likelihood under the Alternative Hypothesis
in
Appendix A
).
Similarly, the log-likelihood under the null hypothesis is
log
log
(Equation 3)
(see
Likelihood under the Null Hypothesis
in
Appendix A
).
The log of the likelihood-ratio statistic can then be written as
log
log
where we have defined
log
and
log
(see
LRT Statistic
in
Appendix A
). For
, we have
< 0. In practice, because
, and mismatch rate
, this will always be true.
Therefore, the LRT statistic can be stated as
(Equation 4)
The LRT stated above can be understood to be a test for a simple null hypothesis
= 1 −
against a simple alternative hypothesis
= 1 −
δD
when we are given {
, …,
} sampled as
∼ Bernoulli(
). By the Neyman-Pearson lemma, the LRT is the most powerful test for a given test size
Binomial Test
The null hypothesis is rejected if Λ <
for some threshold
. Let
be such that
(Λ <
) =
. This is equivalent to rejecting the null hypothesis if
, where
Because the
are independent and identically distributed (i.i.d.) under both hypotheses,
binomial
and
binomial
. Therefore, the power of the exact test can be calculated as
, where
is chosen such that
A sufficient statistic for the LRT is the number of “yes” responses from the beacon.
Relationship between the Number of Queries Required and Beacon Size
In the null and alternative hypotheses,
is a Bernoulli random variable. Therefore, by the central limit theorem, the LRT statistic has a Gaussian distribution. We can therefore use the parameters of the Gaussian distribution to obtain a relationship between the number of queries (required for achieving a desired power and false-positive rate) and the number of individuals in the beacon.
Let
and
be the mean and SD, respectively, of the LRT statistic under the null hypothesis, and let
and
be the corresponding values under the alternative hypothesis.
For an LRT statistic with false-positive rate
, power 1 −
, and a normal distribution, we have that
(Equation 5)
where
is the
quantile of the standard normal distribution.
For the LRT we describe, this relationship is equivalent to
(Equation 6)
(see
Gaussian LRT Power Approximation
in
Appendix A
). The right-hand side of the equation is independent of both
and
for a specified false-positive rate
and power 1 −
. Thus, we have that
LRT for Detecting Relatives
The relatedness of two individuals can be parameterized with a single parameter
, which is the probability that the two individuals share an allele at a single SNP. Thus, identical twins have
= 1, parent-offspring and sibling pairs have
= 0.5, first cousins have
= 0.25, and so on.
The likelihood for the null hypothesis remains the same as before. Under the alternate hypothesis (a relative of the query genome
with relatedness
is present in beacon
), the log-likelihood is given by
log
log
(Equation 7)
(see
Likelihood under the Alternate Hypothesis
in
Appendix B
).
We can use this form to calculate the LRT statistic for this setting. Here, the exact test uses
as the sufficient statistic (as before), and the sufficient statistic is binomially distributed under both hypotheses. The distributions are given by
binomial
and
binomial
Therefore, the power of the exact test can be calculated as
, where
is chosen such that
Simulation Experiments
We simulated 500,000 SNPs in a sample of 1,000 diploid individuals. Alternate allele frequencies were sampled from a multinomial distribution with probabilities obtained from the expected allele-frequency distribution for a standard neutral model under the assumption of a population size of 10,000 individuals.
We constructed a beacon by using the 1,000 simulated individuals. The query set of individuals consisted of
200 diploid individuals from the beacon
200 diploid individuals not in the beacon and whose genotypes were simulated according to the generated allele frequencies at all SNPs.
For initial experiments, the mismatch rate between the beacon and query copies of the same genomes was set to 10
−6
to simulate near-ideal data.
The null distribution of the LRT statistic was obtained with the exact-test calculation for the 200 individuals not in the beacon. Power was calculated as the proportion of successfully rejected tests (out of 200) for the query genomes in the beacon.
Detecting Relatives
To examine whether relatives could be identified from the beacon, we used 200 individuals from the beacon to generate query genomes with varying degrees of relatedness to the original individual.
Effect of Noise
Genome sequencing is more error prone than array genotyping. Even with high-coverage data, biological replicates of the same individual could have 1%–5% SNPs unique to each replicate. On the same sequenced sample, different variant-calling pipelines can produce SNP calls at positions that might differ from each other. We model this in our simulation by allowing for a mismatch probability (
) that for a query individual who is in the beacon and is heterozygous at the query SNP, the copy in the beacon is a homozygous reference, i.e., the beacon will (erroneously) return 0 as the response to the query.
Table S2
shows the levels of mismatch modeled in our experiments.
Experiments with Real Data
1000 Genomes Phase 1 CEU Beacon
We created a beacon by using the CEU population (Utah residents with ancestry from northern and western Europe from the CEPH collection) from phase 1 of the 1000 Genomes Project.
Of the 85 CEU samples present in phase 1, 65 were used in the beacon. 20 samples from the beacon and the remaining 20 samples were used as query genomes.
Figure S4
shows the setup of the 1000 Genomes phase 1 CEU beacon.
To test the effect of censoring on power, we constructed a beacon by using the same data as above, except that the beacon always responded “no” to queries for singletons. We then used whole genomes to query the beacon, as before.
To test whether sharing SNP array data was more secure than sharing whole genomes, we repeated the setup of
Figure S4
with Affymetrix array data for the CEU samples. We then used SNP array data to query the beacon.
Combining Multiple Datasets
We used the scheme of
Figure S5
to create beacons that contained either a single population (65 CEU individuals) or multiple populations (a CEU + YRI [Yoruba in Ibadan, Nigeria] beacon with 32 CEU and 33 YRI individuals and a CEU + JPT [Japanese in Tokyo, Japan] beacon with 32 CEU and 33 JPT individuals). We used 40 CEU individuals as query individuals, 20 of whom belonged to all beacons and 20 of whom belonged to none of the beacons.
Re-identifying a Personal Genome Project Individual
To test our method on existing beacons, we selected from the Personal Genome Project (PGP)
a single genome (ID hu48C4EB or PGP 183). We chose 1,000 heterozygous SNPs from the selected individual’s genome and used the GA4GH Beacon Network query interface to query all existing beacons for the alternate allele at the chosen SNPs. If a beacon of size
produced
“yes” responses to
queries, the p value was calculated under the null hypothesis as
binomial
Through metadata (see
Web Resources
), we were able to ascertain that the selected individual was present in the PGP beacon and the Kaviar
10
beacon.
Results
Re-identification in a Simulated Beacon
We validated our LRT framework by simulating a beacon with 1,000 individuals and 500,000 total SNPs. From the power curve (
Figure 1
A), we can see that the LRT had more than 95% power to detect whether an individual was in the beacon with just 5,000 SNP queries. We also see that our theoretical analysis matches the empirical results. For the same number of SNPs queried, the power for detecting relatives was reduced but still considerable (
Figure 1
B;
Figure S2
). Sequencing errors and variant-calling differences reduced the power of the test (
Figure S3
).
Figure 1.
Open in a new tab
Power of Re-identification Attacks on Beacons Constructed with Simulated Data
Power curves for the likelihood-ratio test (LRT) on (A) a simulated beacon with 1,000 individuals and (B) detecting relatives in the simulated beacon. The false-positive rate was set to 0.05 for all scenarios.
Re-identification in Phase 1 CEU Beacon
For evaluation with real data, we set up a beacon by using 65 CEU individuals from phase 1 of the 1000 Genomes Project
Figure S4
). With just 250 SNPs, beacon membership could be detected with 95% power and a 5% false-positive rate (
Figure 2
A). A beacon constructed with the same individuals but with SNP array data showed a reduction in power, as did a beacon that used sequence data but censored responses by always replying “no” to queries for singletons (
Figure 2
B). Even in these scenarios, the LRT had greater than 90% power if 10,000 or more queries were permitted.
Figure 2.
Open in a new tab
Power of Re-identification Attacks on Beacons Constructed with Real Data
Power curves for the LRT on (A) a beacon constructed from 65 CEU individuals from 1000 Genomes phase 1 and (B) CEU beacons of size 65 and constructed with array data, censored WGS data (without singletons), and WGS data. The false-positive rate was set to 0.05 for all scenarios.
Re-identification in Multi-population Beacon
From our theoretical analysis, we can see that increasing beacon size increases the number of SNPs required for achieving a given power level at a specified threshold for the false-positive rate. Combining multiple datasets can make detection more difficult in the same way. A question of interest is whether combining multiple datasets can also make detection more difficult by affecting the SFS of the samples in the beacon.
Figure 3
shows the power curves for beacons containing multiple populations. The results show that for a fixed number of SNPs to query, the power for the multi-population beacons is higher than that for the CEU-only beacon. A single-population beacon is therefore more secure than a multi-population beacon of the same size. Because the protective effect of extra samples in the beacon against re-identification depends on their allele sharing with the query genome, including other populations in a beacon is less effective than including the same number of individuals from the population of the query genome.
Figure 3.
Open in a new tab
Power of the LRT for Multi-population Datasets
Power is larger for multi-population beacons than for the CEU-only beacon.
Re-identification in Existing Beacons
We used our theoretical analysis to estimate the number of queries our framework would require to re-identify individuals and relatives from existing beacons. We used publicly available beacon metadata to infer the number of individuals present in the beacon. Where this was not possible (the AMPlab, ICGC, and NCBI beacons), we used conservative estimates based on the size of the underlying datasets. For SFS parameters, we used the estimates we obtained for our simulation data. The Kaviar beacon contains 63,500 exomes and 8,400 whole genomes. Because exomes are only 1% of entire genomes in length, this beacon can be assumed to consist of two beacons—an exome beacon with 72,000 exomes and a genome beacon with 8,400 whole genomes. Re-identification is possible in the genome beacon if queries for SNPs in the coding regions are avoided. From
Table 1
, we see that only the Cafe CardioKit gene-panel beacon, the Broad Institute exome beacon, and the Kaviar beacon are safe from our re-identification attack, given that the gene panels and exomes have much fewer SNPs than genomes. For all other beacons, re-identification is possible with 95% power and fewer than 50,000 allele queries. Thus, our approach is computationally feasible with existing beacons.
Table 1.
Estimated Number of SNP Queries Required for Re-identification in Real Beacons with a 5% False-Positive Rate and 95% Power
Beacon Name
Number of Samples
SNPs Required for Re-identification
Identical Genomes
First-Degree Relatives
Second-Degree Relatives
1000 Genomes Project
1,092
3,649
34,467
157,861
1000 Genomes Project phase 3
2,535
8,469
79,976
366,276
AMPLab
2,535
8,469
79,976
366,276
Broad Institute
60,706
202,770
1,914,581
8,768,007
Cafe CardioKit
1,070
3,575
33,773
154,684
ICGC
12,807
42,779
403,936
1,849,878
Known VARiants
72,000
240,494
2,270,772
10,399,218
Known VARiants (genomes only)
8,400
28,059
264,947
1,213,368
NCBI
14,466
48,320
456,258
2,089,490
PGP
174
582
5,515
25,273
IBD
5,070
16,936
159,926
732,410
Native American + Egyptian
100
335
3,181
14,586
UK10K
6,322
21,118
199,411
913,239
SFARI
10,400
34,739
328,024
1,502,231
Open in a new tab
Re-identifying a PGP Individual
We demonstrated the feasibility of re-identification in existing beacons by querying them 1,000 times with a single genome from the PGP.
To avoid overloading the beacon servers, we inserted a delay of 5 s between queries, and all 1,000 queries were completed in 3 hr 53 min from a single computer. In beacons where the presence of the individual could be confirmed from metadata, we obtained 100% “yes” responses (
Table 2
). The null hypothesis (the query genome is not in the beacon) could be rejected only for the PGP beacon (p = 0.0033), but not for the larger Kaviar beacon (p = 0.98), demonstrating that re-identification is more difficult in larger beacons.
Table 2.
Theoretical p Values for 1,000 Queries for SNPs from a Genome in the Personal Genome Project
Beacon Name
Beacon Size
“Yes” Responses
p Value
Known VARiants
72,000
1000
0.98
Broad Institute
60,706
27
1000 Genomes Project
1,092
711
PGP
174
1000
0.0033
Cafe CardioKit
1,070
Wellcome Trust Sanger Institute
11,492
960
NCBI
14,466
947
ICGC
12,807
134
AMPLab
2,535
946
1000 Genomes Project phase 3
2,535
946
Open in a new tab
Beacons known to contain the individual (from metadata).
Discussion
We have developed a LRT for identifying whether a given individual genome is part of a beacon. Our experiments show that in a variety of settings, detecting membership in a beacon is possible with high power for not only individuals in the beacon but also their relatives. Because beacons are often designed to share samples with a certain phenotype, this also discloses phenotype information about the individual who is detected to be part of the beacon. Although detecting membership does not breach privacy, disclosure of potentially sensitive phenotype information is a serious privacy breach. In
Table 1
, of the nine beacons that index non-publically available genomic data (see
Table S3
for details of beacon datasets and phenotypes), four are associated with a single phenotype (Cafe CardioKit, ICGC, IBD, and SFARI beacons), four are associated with multiple phenotypes (Broad Institute, Kaviar, NCBI, and UK10K beacons), and one is not associated with any phenotype (Native American + Egyptian beacon). For instance, identifying that a given genome is part of the SFARI beacon, which contains genomic data from families with a child affected by autism spectrum disorder, means that the individual belongs to a family where some member has autism spectrum disorder. The LRT we describe can be used in a number of undesirable ways. For instance, a United States direct-to-consumer genetic-testing company that collects genome-wide data from customers could use it to infer phenotype or disease information without their customers’ knowledge by querying beacons.
Because the re-identification attack we describe requires the attacker to have access to an individual’s genome, an alternative is that the attacker can use the query genome to directly predict disease risk by using existing risk-prediction methods, such as genomic risk scores
11
or machine-learning methods.
12
A comparison of the performance of risk prediction and the re-identification LRT would be useful in understanding whether re-identification discloses any extra information about the query individual. However, most risk-prediction methods focus on the risk that the subject will develop the disease (in 10 years or at some future time), whereas identifying beacon membership gives a direct estimate of the probability that the queried individual currently has the disease studied in the beacon sample. A fair comparison of the two is therefore not possible. If our LRT (with false-positive rate
= 5%) identifies an individual as belonging to a case-only beacon (i.e., rejects the null hypothesis) for a disease with population prevalence (prior probability that an individual has the disease)
= 1%, the posterior probability that the individual has the disease is given by (1 −
) +
αp
= 0.9505 according to Bayes’ theorem. For the same result in a case-control beacon with equal numbers of case and control individuals, the probability that the individual has the disease is given by 0.5 × (1 −
) + 0.05
= 0.4755. In contrast, although genomic risk prediction has high success rates for Mendelian diseases with highly penetrant alleles and in some cancers, the success of such approaches for predicting common disease risk is modest.
13
An upper bound on performing genomic risk prediction by using an individual’s genome can be obtained if one considers the (broad-sense) heritability of the disease being studied. Polderman et al.
14
examined the heritability of 17,804 human traits. From their analysis, we can see that 26 out of 43 ICD-10 (International Classification of Diseases, Tenth Revision) and ICF (International Classification of Functioning, Disability, and Health) subchapter-level disease categories have heritability less than 50%, suggesting that the performance of genomic risk prediction for many disease categories will be limited.
Our approach makes some simplifying assumptions. We assume that the beacon samples and the query genome belong to the same population. This is a reasonable assumption given that beacons often publish the ethnicity of the datasets included, whereas the ethnicity of the query genome can be identified by comparison to reference panels such as 1000 Genomes. Genotypes are assumed to be distributed according to Hardy-Weinberg equilibrium. We also assume that allele queries are independent, which can lead to overly confident predictions for common SNPs. We expect that it will not affect our results significantly, given that most SNPs are rare (<5% frequency) in human populations. Inaccurate estimates of the shape of the SFS can affect our theoretical analysis. However, as
Figure S1
shows for the theoretical power, the power of the test is similar for populations with different SFS parameters, and
Figure 2
A shows good agreement between theoretical and empirical power curves on the CEU beacon. In addition, the empirical power of the test does not depend on the SFS parameters (
Binomial Test
in
Appendix A
). This suggests that our test is robust to different SFS parameters. A computational limitation is that establishing high confidence might need millions of queries. In our experiments with existing beacons, we were able to make 1,000 queries to the beacon server in 3 hr 53 min, with a 5 s delay between queries.
An important caveat is that the proposed LRT is only a demonstration that individual privacy can be compromised through beacons. It aims to show that beacon membership can be identified with only the query genome, even if allele frequencies are not known. As a result, the bounds we obtain for the number of queries required for re-identification (
Table 1
) are conservative and should not be used directly to guide the construction of beacons. A re-identification test that uses only rare SNPs and/or incorporates the allele frequencies at SNPs will be more powerful than our method and will require fewer queries than our estimates. Because the LRT we describe requires access to genomic data, such attacks might not be frequent or imminent at this time. However, as access to genomic data becomes easier, such attacks might need to be accounted for in the design of data-sharing mechanisms.
Our results have important implications for setting up beacons to allow data sharing and protect individual privacy. Beacons are designed to help researchers find datasets relevant to their research interests (e.g., datasets containing an allele that the researchers might suspect to be associated with a rare Mendelian disorder). Access to individual-level genotype data is usually controlled,
and a researcher might spend considerable time and effort applying for access only to find that the dataset is not relevant to his or her study. An advantage of a beacon is that any researcher can use it to query access-controlled data without applying for access. This will allow researchers to establish whether an access-controlled dataset might be of interest to them and apply for access only for relevant datasets. Two desirable features in beacons might therefore be that they contain a single dataset (so researchers who find a relevant dataset by querying a beacon can get data access through a single request) and that they return accurate information about the presence of rare alleles. Solutions for protecting privacy in beacons must also maintain the utility of beacons by supporting these features. We examine two ways in which security can be improved for anonymous-access beacons: (1) making detection of membership in the beacon harder and (2) reducing the leakage of phenotype information from the beacon.
A number of approaches can be used for making detection of membership in the beacon harder. Increasing beacon size can make detection harder, but protection against genome-wide re-identification attacks will require tens of thousands of individuals. Beacons sharing small genomic regions (single genes or exomes) are more secure than those sharing whole genomes. Beacons containing multiple populations are less secure than single-population beacons of the same size. Publishing metadata—such as the ethnicity of samples, beacon size, or the names of datasets included—reduces beacon security. Limiting the number and/or rate of queries per IP address can only slow down attackers and is therefore ineffective. Data-anonymization
15
approaches, such as using only common variation or censoring (
Figure 2
B;
Censoring Beacon Responses
in
Appendix B
), make re-identification harder but not impossible. All of these methods make detection of membership in the beacon harder, but they also reduce the utility of beacons to users.
An alternative way of improving beacon security is to address the leakage of phenotype information instead of the possibility of genomic re-identification. As described earlier, the probability that a re-identified sample has the disease associated with the beacon dataset depends on the proportion of case samples in the beacon dataset. Therefore, adding a suitable number of control samples or aggregating responses from multiple beacons (implemented as an option in the Beacon Network) might reduce the probability that a re-identified sample has the disease to an acceptable level. Heritability estimates can be used for determining an acceptable probability level for a particular disease. By including non-case samples, these solutions reduce the phenotype information that can be obtained from a beacon while keeping the reduction in the utility of the beacon to a minimum.
We expect that, because of the lack of monitoring and access control, anonymous-access beacons will always be open to re-identification attempts. The most important step for improving security and reducing loss of privacy through beacons would be to prohibit anonymous access. Requiring users to authenticate their identity to access beacons will allow the research community to discourage re-identification attacks through policies outlining acceptable uses of beacons.
16
Acknowledgments
The authors would like to acknowledge Snehit Prabhu, Christopher Gignoux, and Katie Kanagawa for helpful comments on the manuscript and the Stanford Genetics Bioinformatics Service Center for computing resources. This research was partially supported by the NIH under award number U01HG007436. C.D.B is on the scientific advisory boards (SABs) of
Ancestry.com
, Personalis, Liberty Biosecurity, and Etalon DX. He is also a founder and chair of the SAB of IdentifyGenomics. None of these entities played a role in the design, interpretation, or presentation of these results.
Published: October 29, 2015
Footnotes
This is an open access article under the CC BY-NC-ND license (
).
Supplemental Data include five figures and three tables and can be found with this article online at
Contributor Information
Suyash S. Shringarpure, Email: suyashs@stanford.edu.
Carlos D. Bustamante, Email: cdbustam@stanford.edu.
Appendix A: LRT
In an ideal setting, we would expect
… =
= 1 if a query genome
is in the beacon
. In practice, because of sequencing errors and differences in variant-calling pipelines, we might have some mismatches between the query copy of a genome and its copy in the beacon. We assume that this happens with probability
Let the alternate allele frequency at the SNP corresponding to query
be
. Because the beacon is only queried at the positions where the query genome is heterozygous,
is not distributed as beta(
′,
′) but shows an ascertainment bias. We can show that
∼ beta(
), where
′ + 1 and
′ + 1 in theory (see
Posterior Distribution of Allele Frequencies
in
Appendix A
for details).
The log-likelihood of a response set
= {
, …,
} can be written as
log
log
(Equation A1)
For the LRT, we need to evaluate this log-likelihood under the null hypothesis and the alternative hypothesis. The null hypothesis is that the query genome is not present in the beacon, and the alternative hypothesis is that the query genome is present in the beacon.
Likelihood under the Alternative Hypothesis
Under the alternative hypothesis (query genome
is present in beacon
), the response
is given by
I.
= 1 if
(a)
there is no mismatch between the query genome and its copy in the beacon or
(b)
there is a mismatch between the query genome and its copy in the beacon but at least one of the other
− 1 genomes in the beacon has the alternate allele.
II.
= 0 if there is a mismatch between the query genome and its copy in the beacon and none of the other
− 1 genomes in the beacon has the alternate allele.
The log-likelihood under the alternative hypothesis is given by
log
log
(Equation A2)
We first calculate
mismatch
mismatch
none
of
the
other
genomes
has
the
alternate
allele
none
of
the
other
genomes
has
the
alternate
allele
none
of
the
other
genomes
has
the
alternate
allele
beta
beta
beta
Let
is therefore the probability that none of
− 1 genomes has an alternate allele.
Therefore, we have that
(Equation A3)
and
(Equation A4)
The log-likelihood can be calculated as
log
log
log
log
Likelihood under the Null Hypothesis
Under the null hypothesis (query genome
is not in beacon
), the response
is given by
I.
= 1 if at least one of the
genomes in the beacon contains the alternate allele.
II.
= 0 if none of the
genomes in the beacon contains the alternate allele.
The log-likelihood under the null hypothesis is given by
log
log
(Equation A5)
We first calculate
none
of
the
genomes
has
the
alternate
allele
from
our
earlier
definition
Therefore, we have that
(Equation A6)
and
(Equation A7)
The log-likelihood can be calculated as
log
log
log
log
Thus, the log-likelihood under the null hypothesis is
log
log
(Equation A8)
LRT Statistic
The log of the likelihood-ratio statistic can then be written as
log
log
log
log
log
log
log
log
log
log
log
log
log
where we have defined
log
and
log
. For
, we have
. In practice, because
, and mismatch rate
, this will always be true.
Therefore, the LRT statistic can be stated as
(Equation A9)
Neyman-Pearson Optimality of LRT
The LRT stated above can be understood to be a test for a simple null hypothesis
against a simple alternative hypothesis
δD
N −
when we are given {
, …,
} sampled as
∼ Bernoulli(
). By the Neyman-Pearson lemma, the LRT is the most powerful test for a given test size
Binomial Test
The null hypothesis is rejected if Λ <
for some threshold
. Let
be such that
(Λ <
) =
This is equivalent to
(Equation A10)
(Equation A11)
(Equation A12)
because
(Equation A13)
where
(Equation A14)
Because the
are i.i.d. under both hypotheses,
binomial
and
binomial
. Therefore, the power of the exact test can be calculated as
, where
is chosen such that
Thus, a sufficient statistic for the LRT is the number of “yes” responses from the beacon. If a “truth set” of individuals known to be (or not be) in the beacon is available, the critical value and power of the test can be computed with only the number of “yes” responses from the beacon. Thus, the empirical power is not dependent on the SFS parameters, which suggests that the test is robust to SFS parameters.
Gaussian LRT Power Approximation
In the null and alternative hypotheses,
is a Bernoulli random variable. Therefore, by the central limit theorem, the LRT statistic has a Gaussian distribution. We can therefore use the parameters of the Gaussian distribution to obtain a relationship between the false-positive rate and power of the test.
Let
and
be the mean and SD, respectively, of the LRT statistic under the null hypothesis, and let
and
be the corresponding values under the alternative hypothesis.
From earlier results, we have that
(Equation A15)
(Equation A16)
(Equation A17)
(Equation A18)
(Equation A19)
and
Var
(Equation A20)
Var
(Equation A21)
(Equation A22)
(Equation A23)
(Equation A24)
where we have chosen the square root of
, which is larger than 0 (because
< 0).
Similarly, we can show that
(Equation A25)
and
(Equation A26)
For an LRT statistic with false-positive rate
, power 1 −
, and a normal distribution, we have that
(Equation A27)
where
is the
quantile of the standard normal distribution.
Substituting from above, we get
(Equation A28)
(Equation A29)
We have
(Equation A30)
(Equation A31)
Also,
Therefore, we get
(Equation A32)
(Equation A33)
(Equation A34)
Thus, for a given false-positive rate
, the number of SNPs that must be queried for obtaining power 1 −
is given as
(Equation A35)
Also, given a number of SNPs
and a false-positive rate
, the power that will be achieved is
(Equation A36)
where Φ is the cumulative distribution function of the standard normal distribution
0,1
We can show that
can be approximated as
(Equation A37)
(see
Approximating Probability of No Alternate Alleles
in
Appendix A
).
For
, given that
and
, we assume
, and
. Also, because
, we assume
. Then, we can write
In terms of the SFS of the entire population, we can write this as
(Equation A38)
The right-hand side of the equation is independent of both
and
for a specified false-positive rate
and power 1 −
. Thus, we have that
Modeling SFSs by Beta Distributions
We model alternate allele frequencies for populations by beta distributions similarly to the approaches used by Sankararaman et al.
and Clayton.
If
are distributed as
beta
, then the sample mean and variance, respectively, are given by
(Equation A39)
and
(Equation A40)
The method-of-moments estimators for the parameters of the beta distribution are
(Equation A41)
and
(Equation A42)
provided that
Table S1
shows that the estimates of the SFS parameters for simulated data and some public datasets (1000 Genomes, SSMP,
17
and GoNL
18
) are similar to each other.
Figure S1
shows the effect of different estimates of SFS parameters for the populations in
Table S1
on the theoretical power of the LRT. We see that estimates of SFS parameters affect the theoretical predictions, although results remain qualitatively similar. The test has the least power for SNP array data given that it has relatively less rare variation.
Posterior Distribution of Allele Frequencies
According to the SFS of the population, the alternate allele frequency
at a SNP is distributed as
beta
. If we assume Hardy-Weinberg equilibrium and
is the frequency observed at a SNP where the query genome is heterozygous (
gt
= 1), the posterior distribution of
at the SNP is given by Bayes’ rule as
(Equation A43)
(Equation A44)
(Equation A45)
(Equation A46)
beta
(Equation A47)
Therefore, the posterior distribution of the alternate allele frequency
at heterozygous sites is given as
∼ beta(
), where
′ + 1 and
′ + 1. In practice, for both simulated and real genomic data, the observed values of the parameters of the posterior distribution are slightly different from their expectations. In the theoretical power curves for our analyses, we use the correct estimated values of the parameters for the simulated data rather than the theoretical expectation or estimates from real data.
Approximating Probability of No Alternate Alleles
We have defined
as the probability that none of
individuals has an alternate allele. Here, we show that
(Equation A48)
For this result, we use Stirling’s approximation to the Gamma function, given by
(Equation A49)
We use the result that for
(Equation A50)
We also assume that
and
We have
(Equation A51)
(Equation A52)
(Equation A53)
(Equation A54)
(Equation A55)
(Equation A56)
(Equation A57)
0.5
(Equation A58)
Given that
, we can simplify this expression further as
0.5
(Equation A59)
(Equation A60)
(Equation A61)
(Equation A62)
(Equation A63)
(Equation A64)
(Equation A65)
(Equation A66)
(Equation A67)
(Equation A68)
(Equation A69)
Appendix B: LRT Variations
We consider variations of the likelihood test to detect relatives and to examine the effect of censoring the SFS on the power of the test.
LRT for Detecting Relatives
The relatedness of two individuals can be parameterized with a single parameter
, which is the probability that the two individuals share an allele at a single SNP. Thus, identical twins have
= 1, parent-offspring and sibling pairs have
= 0.5, first cousins have
= 0.25 and so on.
The likelihood for the null hypothesis remains the same as before. Below, we show the likelihood computation for the alternate hypothesis
Likelihood under the Alternate Hypothesis
Under the alternate hypothesis (a relative of the query genome
with relatedness
is present in beacon
), the response
is given by
I.
= 0 if none of the other
− 1 genomes in the beacon has the alternate allele and
(a)
there is no mismatch between the query genome and its copy in the beacon and the relative is a homozygous reference at the SNP or
(b)
there is a mismatch between the query genome and its copy in the beacon and the relative is not a homozygous reference at the SNP.
II.
= 1 otherwise.
For an individual who is heterozygous at a SNP with alternate allele frequency
, the genotype probabilities for a relative with relatedness
can be shown to be
relative
relative
and
relative
(Equation B1)
where
gt
and
gt
relative
are the genotypes of the individual and the relative, respectively, at the SNP.
The log-likelihood under the alternate hypothesis is given by
log
log
(Equation B2)
We first calculate
relative
none
of
the
other
genomes
has
the
alternate
allele
relative
relative
none
of
the
other
genomes
has
the
alternate
allele
relative
relative
relative
relative
For
= 1, this expression collapses to the form of Equation
A3
Therefore, we have that
(Equation B3)
and
(Equation B4)
Thus, the log-likelihood under the alternate hypothesis is
log
log
(Equation B5)
We can use this form to calculate the LRT statistic. Here, the exact test uses
as the sufficient statistic (as before), and the sufficient statistic is binomially distributed under both hypotheses. The distributions are given by
binomial
and
binomial
Therefore, the power of the exact test can be calculated as
, where
is chosen such that
Censoring Beacon Responses
One solution to the re-identification problem is to return accurate responses only for common variants. We consider a setting where the beacon chooses a threshold frequency
and returns “no” responses to queries for alleles that have frequency less than or equal to
in the population (not necessarily in the beacon samples). For alleles at frequency larger than
, the beacon will return the true answer.
Likelihood under the Alternative Hypothesis
Under the alternative hypothesis (query genome
is present in beacon
), the response
is given by
I.
= 0 if
(a)
the frequency of the allele
or
(b)
the frequency of the allele
and there is a mismatch between the query genome and its copy in the beacon and none of the other
− 1 genomes in the beacon has the alternate allele.
II.
= 1 otherwise.
The log-likelihood under the alternative hypothesis is given by
log
log
(Equation B6)
We first calculate
none
of
the
other
genomes
has
the
alternate
allele
where
Here,
) is the cumulative distribution function for a beta distribution with shape parameters
and
, evaluated at value
Therefore, we have that
(Equation B7)
and
(Equation B8)
The log-likelihood can be calculated as
log
log
log
log
Likelihood under the Null Hypothesis
Under the null hypothesis (query genome
is not in beacon
), the response
is given by
I.
= 0 if
(a)
the frequency of the allele
or
(b)
the frequency of the allele
and none of the other
genomes in the beacon has the alternate allele.
II.
= 1 otherwise.
The log-likelihood under the null hypothesis is given by
log
log
(Equation B9)
We first calculate
none
of
the
other
genomes
has
the
alternate
allele
where
Therefore, we have that
(Equation B10)
and
(Equation B11)
The log-likelihood can be calculated as
log
log
log
log
Finding the Optimal Censoring Threshold
We use the Gaussian approximation described earlier to obtain an estimate of the optimal censoring threshold frequency
. We have that
(Equation B12)
(Equation B13)
(Equation B14)
and
(Equation B15)
We have
(Equation B16)
(Equation B17)
Also,
Therefore, we get
(Equation B18)
(Equation B19)
(Equation B20)
(Equation B21)
All terms in this equation depend on
, and
. Thus, while allowing
queries given a desired false-positive rate
and maximum allowable power 1 −
in a beacon with
individuals from a population with SFS parametrized by
beta
, we can find the censoring threshold
. An analytical solution cannot be obtained for
because of the form of the cumulative distribution function. However, a grid search over
can be used for finding the optimal value for the censoring threshold.
Web Resources
The URLs for data presented herein are as follows:
GA4GH Beacon Project,
GA4GH Beacon Network,
Kaviar (Known Variants) beacon metadata,
PGP beacon metadata,
SFARI Beacon,
Supplemental Data
Document S1. Figures S1–S5 and Tables S1–S3
mmc1.pdf
(422.2KB, pdf)
Document S2. Article plus Supplemental Data
mmc2.pdf
(1.1MB, pdf)
References
1.
Homer N., Szelinger S., Redman M., Duggan D., Tembe W., Muehling J., Pearson J.V., Stephan D.A., Nelson S.F., Craig D.W. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 2008;4:e1000167. doi: 10.1371/journal.pgen.1000167.
DOI
] [
PMC free article
] [
PubMed
] [
Google Scholar
2.
Sankararaman S., Obozinski G., Jordan M.I., Halperin E. Genomic privacy and limits of individual detection in a pool. Nat. Genet. 2009;41:965–967. doi: 10.1038/ng.436.
DOI
] [
PubMed
] [
Google Scholar
3.
Jacobs K.B., Yeager M., Wacholder S., Craig D., Kraft P., Hunter D.J., Paschal J., Manolio T.A., Tucker M., Hoover R.N. A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies. Nat. Genet. 2009;41:1253–1257. doi: 10.1038/ng.455.
DOI
] [
PMC free article
] [
PubMed
] [
Google Scholar
4.
Visscher P.M., Hill W.G. The limits of individual identification from sample allele frequencies: theory and statistical analysis. PLoS Genet. 2009;5:e1000628. doi: 10.1371/journal.pgen.1000628.
DOI
] [
PMC free article
] [
PubMed
] [
Google Scholar
5.
Clayton D. On inferring presence of an individual in a mixture: a Bayesian approach. Biostatistics. 2010;11:661–673. doi: 10.1093/biostatistics/kxq035.
DOI
] [
PMC free article
] [
PubMed
] [
Google Scholar
6.
Zerhouni E.A., Nabel E.G. Protecting aggregate genomic data. Science. 2008;322:44. doi: 10.1126/science.322.5898.44b.
DOI
] [
PubMed
] [
Google Scholar
7.
Rothstein M.A. Putting the Genetic Information Nondiscrimination Act in context. Genet. Med. 2008;10:655–656. doi: 10.1097/gim.0b013e31818337bd.
DOI
] [
PubMed
] [
Google Scholar
8.
Abecasis G.R., Auton A., Brooks L.D., DePristo M.A., Durbin R.M., Handsaker R.E., Kang H.M., Marth G.T., McVean G.A., 1000 Genomes Project Consortium An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. doi: 10.1038/nature11632.
DOI
] [
PMC free article
] [
PubMed
] [
Google Scholar
9.
Church G.M. The personal genome project. Mol. Syst. Biol. 2005;1:0030. doi: 10.1038/msb4100040.
DOI
] [
PMC free article
] [
PubMed
] [
Google Scholar
10.
Glusman G., Caballero J., Mauldin D.E., Hood L., Roach J.C. Kaviar: an accessible system for testing SNV novelty. Bioinformatics. 2011;27:3216–3217. doi: 10.1093/bioinformatics/btr540.
DOI
] [
PMC free article
] [
PubMed
] [
Google Scholar
11.
Goldstein B.A., Yang L., Salfati E., Assimes T.L. Contemporary Considerations for Constructing a Genetic Risk Score: An Empirical Approach. Genet. Epidemiol. 2015;39:439–445. doi: 10.1002/gepi.21912.
DOI
] [
PMC free article
] [
PubMed
] [
Google Scholar
12.
Wei Z., Wang W., Bradfield J., Li J., Cardinale C., Frackelton E., Kim C., Mentch F., Van Steen K., Visscher P.M., International IBD Genetics Consortium Large sample size, wide variant spectrum, and advanced machine-learning technique boost risk prediction for inflammatory bowel disease. Am. J. Hum. Genet. 2013;92:1008–1012. doi: 10.1016/j.ajhg.2013.05.002.
DOI
] [
PMC free article
] [
PubMed
] [
Google Scholar
13.
Schrodi S.J., Mukherjee S., Shan Y., Tromp G., Sninsky J.J., Callear A.P., Carter T.C., Ye Z., Haines J.L., Brilliant M.H. Genetic-based prediction of disease traits: prediction is very difficult, especially about the future. Front. Genet. 2014;5:162. doi: 10.3389/fgene.2014.00162.
DOI
] [
PMC free article
] [
PubMed
] [
Google Scholar
14.
Polderman T.J., Benyamin B., de Leeuw C.A., Sullivan P.F., van Bochoven A., Visscher P.M., Posthuma D. Meta-analysis of the heritability of human traits based on fifty years of twin studies. Nat. Genet. 2015;47:702–709. doi: 10.1038/ng.3285.
DOI
] [
PubMed
] [
Google Scholar
15.
Erlich Y., Narayanan A. Routes for breaching and protecting genetic privacy. Nat. Rev. Genet. 2014;15:409–421. doi: 10.1038/nrg3723.
DOI
] [
PMC free article
] [
PubMed
] [
Google Scholar
16.
Erlich Y., Williams J.B., Glazer D., Yocum K., Farahany N., Olson M., Narayanan A., Stein L.D., Witkowski J.A., Kain R.C. Redefining genomic privacy: trust and empowerment. PLoS Biol. 2014;12:e1001983. doi: 10.1371/journal.pbio.1001983.
DOI
] [
PMC free article
] [
PubMed
] [
Google Scholar
17.
Wong L.P., Ong R.T., Poh W.T., Liu X., Chen P., Li R., Lam K.K., Pillai N.E., Sim K.S., Xu H. Deep whole-genome sequencing of 100 southeast Asian Malays. Am. J. Hum. Genet. 2013;92:52–66. doi: 10.1016/j.ajhg.2012.12.005.
DOI
] [
PMC free article
] [
PubMed
] [
Google Scholar
18.
Genome of the Netherlands Consortium Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat. Genet. 2014;46:818–825. doi: 10.1038/ng.3021.
DOI
] [
PubMed
] [
Google Scholar
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Document S1. Figures S1–S5 and Tables S1–S3
mmc1.pdf
(422.2KB, pdf)
Document S2. Article plus Supplemental Data
mmc2.pdf
(1.1MB, pdf)
ACTIONS
PERMALINK
RESOURCES
Cite
Download .nbib
.nbib
Format:
Add to Collections