Evaluation of the Algorithms Implemented in TAGster | National Institute of Environmental Health Sciences
Source: https://www.niehs.nih.gov/research/resources/software/epidemiology/tagster/evaluation
Archived: 2026-04-23 17:15
Evaluation of the Algorithms Implemented in TAGster
Table of Contents
1. Introduction
2. Data
3. Single Population Tag SNP
4. Multiple Population Tag SNP
5. Multiple SNP Bin Tag SNP
1. Introduction
We implemented 3 algorithms for tag SNP selection in TAGster. These algorithms are:
Algorithm 1: A greedy algorithm for single- or multi-population tag SNP selection;
Algorithm 2: An efficient exhaustive search algorithm for single-population tag SNP selection;
Algorithm 3: A two-stage solution algorithm for multi-population tag SNP selection.
We evaluated these algorithms against the algorithms in the existing software packages ldSelect (Carlson et al., 2004), FESTA (Qin et al., 2006), and MultiPop-TagSelect (Howie et al., 2006) using SNP genotype data from the Environmental Genome Project (EGP).
2. Data
2.1 EGP Panel 2
At the time of this study, 207 genes were resequenced by EGP across 95 DNA samples from 4 populations (27 Africans, 24 Asians, 22 Europeans, and 22 Hispanics). There were a total of 16,153 SNPs with minor allele frequency (MAF) ≥ 0.05 in at least one of the 4 populations.
2.2 HapMap ENCODE
The HapMap ENCODE (Encyclopedia of DNA Elements) Project resequenced ten 500 kb genomic regions in 48 individuals and subsequently genotyped all discovered SNPs, as well as all SNPs in dbSNP at the time, in 270 HapMap DNA samples from 3 populations: 30 CEPH trios (Utah residents with ancestry from northern and western Europe), 90 Asians (45 unrelated JPT, Japanese in Tokyo, Japan, and 45 unrelated CHB, Han Chinese in Beijing, China), and 30 YRI trios (Yoruba from Ibadan, Nigeria). There were a total of 11,700 SNPs with minor allele frequency (MAF) ≥ 0.05 in at least one of the 3 populations.
3. Single Population Tag SNP
We applied both the refined greedy algorithm in TAGster and the greedy algorithm in ldSelect to select population-specific tag SNPs at an r² threshold of 0.8 from each population-specific data set. Table 1 shows that, in the EGP data, the modified greedy algorithm selected 142 fewer tag SNPs than the greedy algorithm as implemented in ldSelect (Carlson et al., 2004). For 62 genes the modified greedy algorithm selected fewer tags in at least one of the 4 populations, whereas the ldSelect greedy algorithm had fewer tag SNPs in only 2 genes in one population. Table 2 shows that the modified greedy algorithm selected 30 fewer tag SNPs than ldSelect using HapMap ENCODE data.
Table 1. Comparison between the refined greedy algorithm in TAGster and the greedy algorithm in ldSelect using EGP Panel 2 data.
Table 2. Comparison between the refined greedy algorithm in TAGster and the greedy algorithm in ldSelect using HapMap ENCODE data.
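For readers unfamiliar with greedy LD binning, the general idea can be sketched as below. This is a generic greedy procedure under stated assumptions, not the refined algorithm in TAGster and not a faithful reimplementation of ldSelect: the function name, the pairwise `r2` dictionary, and the tie-breaking rule (first SNP in input order) are all hypothetical choices for the example.

```python
from typing import Dict, List, Tuple


def greedy_tag_snps(
    snps: List[str],
    r2: Dict[Tuple[str, str], float],
    threshold: float = 0.8,
) -> List[Tuple[str, List[str]]]:
    """Greedy LD binning: returns a list of (tag SNP, bin members).

    r2 maps unordered SNP pairs to pairwise r^2; missing pairs count as 0.
    """

    def linked(a: str, b: str) -> bool:
        # A SNP always tags itself; otherwise require r^2 >= threshold.
        return a == b or r2.get((a, b), r2.get((b, a), 0.0)) >= threshold

    untagged = list(snps)
    bins = []
    while untagged:
        # Pick the SNP that tags the most still-untagged SNPs.
        best = max(untagged, key=lambda s: sum(linked(s, t) for t in untagged))
        members = [t for t in untagged if linked(best, t)]
        bins.append((best, members))
        untagged = [t for t in untagged if t not in members]
    return bins
```

Because each step only looks at the locally best SNP, the result depends on how ties are broken and is not guaranteed to be the smallest possible tag set, which leaves room for refinements of the kind compared in Tables 1 and 2.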
We applied both the exhaustive search algorithm in TAGster and the comprehensive search algorithm in FESTA (Qin et al., 2006) to select population-specific tag SNPs for each of the 4 populations in EGP Panel 2, using an r² threshold of 0.8 and an exhaustive search step limit of 1,000,000 (the default setting of FESTA) for both algorithms.
Table 3 shows that the exhaustive search algorithm in TAGster greatly improved computational efficiency in all 4 populations. Moreover, FESTA did not find an optimal solution for the number of tag SNPs for 1 gene in Africans and 1 gene in Europeans. FESTA exceeded the 1,000,000 step limit and defaulted to the greedy algorithm 20 times in order to provide a result, while TAGster used the greedy algorithm only 4 times (Table 4). Evaluation of the HapMap ENCODE data (Table 5) showed a similar pattern of computational efficiency and of defaulting to the greedy algorithm.
Table 3. Comparison between FESTA and TAGster using EGP Panel 2 data
Table 4. List of EGP genes for which the greedy algorithm had to be used for selection of tag SNPs
Table 5. Comparison between FESTA and TAGster using HapMap ENCODE data
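The step-limit behavior described above (search exhaustively, and fall back to a greedy result once the limit is exceeded) can be sketched as follows. This is a deliberately naive brute-force search, not the efficient exhaustive algorithm in TAGster or the search used by FESTA; the function names and the definition of a "step" (one candidate tag set examined) are assumptions for the example.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def make_linked(r2: Dict[Tuple[str, str], float], threshold: float):
    def linked(a: str, b: str) -> bool:
        return a == b or r2.get((a, b), r2.get((b, a), 0.0)) >= threshold
    return linked


def greedy_tags(snps: List[str], linked) -> List[str]:
    """Fallback: repeatedly pick the SNP tagging the most untagged SNPs."""
    untagged, tags = list(snps), []
    while untagged:
        best = max(untagged, key=lambda s: sum(linked(s, t) for t in untagged))
        tags.append(best)
        untagged = [t for t in untagged if not linked(best, t)]
    return tags


def select_tags(
    snps: List[str],
    r2: Dict[Tuple[str, str], float],
    threshold: float = 0.8,
    step_limit: int = 1_000_000,
) -> Tuple[List[str], str]:
    """Return (tag SNPs, method), where method is 'exhaustive' or 'greedy'."""
    linked = make_linked(r2, threshold)
    steps = 0
    # Try tag sets of increasing size, so the first hit is a minimum tag set.
    for size in range(1, len(snps) + 1):
        for candidate in combinations(snps, size):
            steps += 1
            if steps > step_limit:
                return greedy_tags(snps, linked), "greedy"
            if all(any(linked(t, s) for t in candidate) for s in snps):
                return list(candidate), "exhaustive"
    return [], "exhaustive"  # only reached when snps is empty
```

With a search this naive the number of candidate sets grows combinatorially, so the limit is hit quickly on large genes; how often the fallback is needed therefore depends directly on how efficiently the search space is explored.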
4. Multiple Population Tag SNP
We applied the modified greedy algorithm (Algorithm 1) and the two-stage method (Algorithm 3) to select multi-population tag SNPs in 207 genes for the 4 populations from EGP Panel 2, and used as a benchmark the number of tag SNPs found using ldSelect followed by MultiPop-TagSelect (Howie et al., 2006). The generalized modified greedy algorithm (Algorithm 1 generalized to multiple populations) reduced tag SNP requirements by 183 SNPs, whereas the two-stage method (Algorithm 3) reduced tag SNP requirements by 159 SNPs. Selecting, for each gene, the smaller of the two tag sets reduced tag SNP requirements by 233 SNPs below that required by ldSelect followed by MultiPop-TagSelect (Table 6). Evaluation in 3 populations from HapMap ENCODE shows a similar pattern of reduction (Table 7).
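Taking the per-gene minimum can save more than either method alone because the two methods win on different genes. The sketch below illustrates the arithmetic with made-up per-gene counts; the gene names and numbers are hypothetical and are not taken from the evaluation tables.

```python
from typing import Dict


def total_saving(benchmark: Dict[str, int], method: Dict[str, int]) -> int:
    """Total tag SNPs saved by a method relative to the benchmark, over all genes."""
    return sum(benchmark[gene] - method[gene] for gene in benchmark)


def per_gene_minimum(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """For each gene, keep the smaller of the two tag SNP counts."""
    return {gene: min(a[gene], b[gene]) for gene in a}


# Hypothetical tag SNP counts per gene (illustration only).
benchmark = {"geneA": 10, "geneB": 8, "geneC": 12}
greedy = {"geneA": 8, "geneB": 8, "geneC": 11}
two_stage = {"geneA": 9, "geneB": 6, "geneC": 12}

best_of_both = per_gene_minimum(greedy, two_stage)
```

In this toy case each method saves 3 tag SNPs on its own, while the per-gene minimum saves 5, mirroring the pattern in which the combined saving exceeds both individual savings.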
Both TAGster and MultiPop-TagSelect allow an investigator to specify a priori a set of SNPs for inclusion as tag SNPs. The MultiPop-TagSelect algorithm selects from population-specific tag SNPs. Thus, if an investigator-specified SNP is not one of these population-specific tag SNPs, it cannot serve as a proxy for any population-specific LD bin. Conversely, in the TAGster selection process, every investigator-specified SNP can serve as a proxy for other SNPs unless it is a singleton SNP.
Table 6. Multi-population tag SNPs for 4 populations from EGP Panel 2
Table 7. Multi-population tag SNPs for 3 populations from HapMap ENCODE
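One way to picture multi-population selection with investigator-specified SNPs is to treat every (population, SNP) pair as an item to be covered and to let forced-in SNPs cover items before the greedy step starts. The sketch below follows that idea; it is an illustration under assumptions, not TAGster's Algorithm 1 or Algorithm 3, and the function name, argument layout, and alphabetical tie-breaking are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def multipop_greedy(
    common: Dict[str, Set[str]],
    r2_by_pop: Dict[str, Dict[Tuple[str, str], float]],
    forced: Sequence[str] = (),
    threshold: float = 0.8,
) -> List[str]:
    """Greedy tag selection over (population, SNP) pairs.

    common[pop] is the set of SNPs to be tagged in that population;
    r2_by_pop[pop] maps unordered SNP pairs to r^2 in that population.
    """

    def covers(tag: str, pop: str, snp: str) -> bool:
        if tag == snp:
            return True
        pairs = r2_by_pop.get(pop, {})
        return pairs.get((tag, snp), pairs.get((snp, tag), 0.0)) >= threshold

    uncovered = [(pop, s) for pop in sorted(common) for s in sorted(common[pop])]
    candidates = sorted({s for snps in common.values() for s in snps})

    tags: List[str] = []
    # Investigator-specified SNPs are included first and act as proxies.
    for t in forced:
        tags.append(t)
        uncovered = [(p, s) for p, s in uncovered if not covers(t, p, s)]

    while uncovered:
        best = max(candidates, key=lambda c: sum(covers(c, p, s) for p, s in uncovered))
        tags.append(best)
        uncovered = [(p, s) for p, s in uncovered if not covers(best, p, s)]
    return tags
```

In this formulation a forced-in SNP removes every pair it tags from the work list, so it is used as a proxy wherever it is in sufficient LD with other SNPs rather than simply being added on top of an already complete tag set.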
5. Multiple SNP Bin Tag SNP
In order to further reduce the number of tag SNPs, investigators may choose to select tag SNPs only for bins that contain multiple SNPs. The minimum bin size can be specified using the parameter -minimum in the parameter file params.txt. For example, setting -minimum: 2 requires that bins contain at least two SNPs and eliminates singleton bin tag SNPs. Elimination of singleton bin tag SNPs can dramatically cut down the number of tag SNPs while still capturing the majority of SNPs. It is particularly useful when selecting multiple-population tag SNPs. For example, with -minimum set to 2, TAGster selected 4094 multiple population multiple SNP bin (MPMS) tag SNPs for the 4 populations in EGP, compared to 7429 SNPs required if singleton bins are tagged. This smaller set of tag SNPs still captures ~95% of common SNPs in the Asian and CEPH populations, 91% in the Hispanic population, and 84% in Africans. For HapMap ENCODE data, 2095 MPMS tag SNPs (out of a total of 3882 tag SNPs if singleton bin tags are included) capture ~96% of common SNPs in Asian and CEU and 86% of SNPs in YRI.
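The effect of a minimum bin size can be sketched as a simple filter over bins, together with the reduction implied by the counts quoted above. The bin representation (tag SNP plus member list) and the function names are assumptions for illustration; only the four tag SNP counts come from the text.

```python
from typing import List, Tuple

Bin = Tuple[str, List[str]]  # (tag SNP, SNPs in the bin, including the tag)


def filter_bins(bins: List[Bin], minimum: int = 2) -> List[Bin]:
    """Keep only bins with at least `minimum` SNPs (drops singleton bins at 2)."""
    return [b for b in bins if len(b[1]) >= minimum]


def fraction_captured(bins: List[Bin], total_snps: int) -> float:
    """Fraction of SNPs that fall inside the retained bins."""
    return sum(len(members) for _, members in bins) / total_snps


def percent_reduction(with_singletons: int, without_singletons: int) -> float:
    """Percentage of tag SNPs saved by dropping singleton bins."""
    return round(100 * (with_singletons - without_singletons) / with_singletons, 1)


# Tag SNP counts quoted in the text.
egp_reduction = percent_reduction(7429, 4094)     # EGP, 4 populations
encode_reduction = percent_reduction(3882, 2095)  # HapMap ENCODE, 3 populations
```

By these counts, dropping singleton bins removes roughly 45% of the tag SNPs in EGP and roughly 46% in HapMap ENCODE.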
Last Reviewed: February 18, 2026