A graph-based filtering method for top-d

The top-down MS/MS spectrum of a proteoform consists of a precursor mass and a list of fragment ion peaks. The precursor mass corresponds to the molecular mass of the proteoform and the fragment ion peaks correspond to fragments of the proteoform. Because of the existence of highly charged fragment ions and isotopic peaks, top-down MS/MS spectra are usually very complex. In data preprocessing, spectral deconvolution tools [23, 24] are used to converted fragment ion peaks into monoisotopic fragment masses. The intensity of a monoisotopic mass is computed as the sum of the intensities of its corresponding fragment ions peaks.

Noise masses need to be removed from deconvoluted top-down MS/MS spectra to improve the accuracy of protein sequence filtering. Generally speaking, low-intensity masses are more possible to be noise ones than high-intensity masses. An intensity-based method is used to remove noise masses from a deconvoluted spectrum. For each mass x in the spectrum, we rank all the masses in the interval [x−100,x+100] with a width of 200 Dalton (Da) in the decreasing order of their intensities. If x is not one of the top λ masses, x is treated as a noise mass and removed, where λ is a user-specified parameter. All parameters used in the SGM filtering algorithm are summarized in Additional file 1.

Spectrum graphs

Spectrum graphs have been used in sequence tag-based protein sequence filtration to extract sequence tags from mass spectra [25]. There are two steps to generate a spectrum graph from a deconvoluted query spectrum: (a) a node is added to the graph for each fragment mass in the spectrum, and (b) a directed edge from a node u to a node v is added to the graph if mass(v)−mass(u) matches the mass of one amino acid residue within an error tolerance, where mass(u) and mass(v) are the masses corresponding to u and v, respectively. The edge is labeled with the amino acid explaining the mass difference (Fig. 1b). Each path in a spectrum graph corresponds to a sequence tag, which is the readout of the labeled amino acids on its edges. For example, the only path from node v₀ to v₄ in the spectrum graph in Fig. 1b corresponds to the sequence tag NVRS. There are many existing methods for extracting sequence tags for protein sequence filtration from spectrum graphs [17, 26].

When a spectrum misses many fragment masses, the mass differences of many node pairs in the spectrum graph are explained by 2 or more amino acids, not single amino acids. In this case, the spectrum graph approach described in the previous paragraph may fail to extract long correct sequence tags, which are extremely useful in protein sequence filtration.

Blocked patterns [20] or gapped sequence tags [19] are often used to address the missing peak problem in tag generation. To extract blocked patterns, we change step (b) in spectrum graph generation as follows: a directed edge is added into the graph from a node u to a node v if the mass difference mass(v)−mass(u) is explained by a combination of several amino acids and is no larger than a predefined bound α (α is chosen between 200 and 400 Da in practice). The edge is labeled with the mass difference mass(v)−mass(u) (Fig. 1c). Each path in a spectrum graph with labeled masses corresponds to a mass sequence, called a blocked pattern or a gapped sequence tag [20]. In block pattern-based filtration, multiple blocked patterns (gapped sequence tags) extracted from a spectrum graph are searched against protein sequences to find candidate ones. The number of blocked patterns in a spectrum graph may be an exponential function of the length of the longest blocked pattern, making it inefficient to explicitly extract all blocked patterns in the spectrum graph. As a result, it is common to report only a limited number of paths and their corresponding blocked patterns in a spectrum graph. Because of the existence of noise peaks, it is a challenging problem to determine which paths in a spectrum graph correspond to blocked patterns that match the target protein sequence. We propose to circumvent the blocked pattern selection step and directly search spectrum graphs against protein sequences for filtration.

The spectrum graph matching problem

For a gene or a transcript, most protein databases contain only one unmodified reference sequence. To simplify the description of the SGM filtering method, we assume that protein databases used in spectral identification contain only unmodified sequences and that amino acid residue masses are integers. All amino acid residue masses are discretized by first multiplying the masses by a scale factor (100 was used in the experiments) and then rounding the results to integers.

Notations introduced by Deng et al. [21] are used in the description of the SGM problem. An amino acid string is represented as a list of discretized residue masses: the mass representation of an amino acid string a₁,a₂,…,a_n is a residue mass string S=s₁,s₂,…,s_n, in which s_i is the discretized residue mass of a_i for 1≤i≤n. Residue mass strings are called text strings in the following analysis. The sum of all the masses in a substring s_i,s_i+1,…,s_j of S, i.e., \(\sum _{k=i}^{j} s_{k}\), is called the mass of the substring. Two substrings of S are called consecutive substrings if the first residue mass of the second string directly follows the last residue mass of the first string. For example, s_i,s_i+1,…,s_j and s_j+1,s_j+2,…,s_k are consecutive substrings of S. A sequence of consecutive substrings A₁,A₂,…,A_k that include all residue masses in S is called a partition of S. The masses of the consecutive substrings in a partition is called the mass string of the partition. For example, let S=114,156,99,87, A₁=114,A₂=156,99, and A₃=87, then A₁,A₂,A₃ is a partition of S and the mass string of the partition is 114,255,87, where 255 is the mass of A₂. A blocked pattern obtained from a path in a spectrum graph is represented by a sequence of masses, which are labels of the edges in the path. For example, the blocked pattern corresponding to the bold path in Fig. 1c is 114.04, 225.17, 87.03. A blocked pattern can be further discretized to an integer sequence using the same method for residue mass discretization. In the following analysis, we assume that all blocked patterns are integer ones. Unlike text strings, a blocked pattern contains not only single amino acid residue masses, but also those of combinations of several amino acids. A blocked pattern P matches a text string S if P is the mass string of a partition of S. For example, the blocked pattern 114,255,87 matches the text string 114,156,99,87 because the blocked pattern is the mass string of a partition of the text string.

In protein sequence filtration, we search blocked patterns extracted from a spectrum graph against a protein database to find matched amino acid sequences (sequence tags). All protein sequences in the database are concatenated and represented as a text string over an alphabet Σ containing the discretized residue masses of the 20 common amino acids. Because the masses of leucine and isoleucine are the same, the size of Σ is 19 instead of 20. Given a text string S over an alphabet Σ and a blocked pattern P, the blocked pattern matching problem is to find all substrings of S that match P.

Definition 1

Given a text string S over an alphabet Σ and a pattern P, the blocked pattern matching (BPM) problem is to find all substrings of S that match P.

The SGM problem is an extension of BPM problem that considering the noises in the spectrum. It is more complex than the blocked pattern matching problem.

Definition 2

Given a text string S over an alphabet Σ and a spectrum graph G, the spectrum graph matching (SGM) problem is to find all substrings of S that match a blocked pattern in G.

We first study a restricted version of the SGM problem, in which a start node and an end node in G are specified and only paths from the start node to the end node are used to find matched substrings of the text string. The SGM problem is solved by enumerating all node pairs in G as the start and end nodes in the restricted spectrum graph matching (RSGM) problem.

Definition 3

Given a text string S, a spectrum graph G over an alphabet Σ, a start node s, and an end node t, the restricted spectrum graph matching (RSGM) problem is to find all substrings of S that match a blocked pattern corresponding to a path from s to t in G.

A suffix tree based algorithm for the RSGM problem

To solve the RSGM problem, we present a suffix tree based algorithm that extends the algorithm proposed by Deng et al. for the blocked pattern matching problem [21]. A suffix tree is used to represent the string S. We assume that each edge in the suffix tree is labeled with only one letter (integer residue mass) to simplify the algorithm description.

We first review the algorithm for the blocked pattern matching problem, which was proposed by Deng et al. [21]. A blocked pattern is represented as a spectrum graph in which all edges are in one path. Let G={V,E} be the graph representation of a blocked pattern P=p₁,p₂,…,p_m, where V={v₀,v₁…,v_m} and v₀,v₁,…,v_m is the only path from v₀ to v_m in G. Each prefix p₁,p₂,…,p_k of the blocked pattern P corresponds to a path v₀,v₁,…,v_k. A text string over Σ that matches the prefix p₁,p₂,…,p_k is called a prefix text string of v_k. A prefix text string is identifiable if it is a substring of S. For example, when P=114,255,87, both 114,156,99 and 114,99,156 are prefix text string of P. When S=114,156,99,87, the string 114,156,99 is an identifiable prefix text string of P, but 114,99,156 is not. If a prefix text string is not identifiable, then all text strings with the prefix are not identifiable, making it not necessary to search these text strings in S. Using identifiable prefix text strings significantly improves in the speed of searching a blocked pattern against S represented as a suffix tree.

Let B_i be the set of nodes in the suffix tree corresponding to all identifiable prefix text strings of v_i for 0≤i≤m, where m is the length of the blocked pattern. Specifically, B₀ contains only the root node of the suffix tree. The blocking pattern matching algorithm progressively finds the node sets B₁,…,B_m in the suffix tree. The node set B_m contains the solution to the blocked pattern matching problem.

Unlike the blocked pattern matching problem with only one blocked pattern, the spectrum graph G in the RSGM problem contains many paths from the start node to the end node, each path corresponding to a blocked pattern. The number of all paths in a spectrum graph may be an exponential function of the number of nodes. So it is an inefficient approach to directly extract all paths in a spectrum graph and search each path against a suffix tree separately.

Next, we describe our new algorithm for the RSGM problem, which extends the blocked pattern matching algorithm. Let V={v₀,v₁,…v_m} be the set of nodes in the increasing order of their corresponding masses, in which the start node s is v₀ and the end node t is v_m. A text string is a prefix text string of node v_i if it matches a blocked pattern corresponding to a path from v₀ to v_i. Let B_i be the set of nodes in the suffix tree corresponding to all identifiable prefix text strings of v_i, and C(e) be the set of all text strings whose masses match the labeled mass of e (Table 1). In practice, a precomputed table is used to quickly obtain C(e). Because the fragment masses in query spectra have measured errors, an error tolerance ε is allowed in generating the text strings in the table. Denote E_i⊆E the set of all edges whose tail nodes are v_i. For each edge e=(v_j,v_i) in E_i, we search from each suffix tree node in B_j to find all the paths corresponding to a text string in C(e) and add the end nodes of these paths to B_i (Algorithm 1).

Table 1 An example set of text strings matched a mass 270.14

After the last set B_m is found, the RSGM problem is solved by reporting all the text strings corresponding to a suffix tree node in B_m and their positions in S, which are stored in the suffix tree. Because the string S is represented by a suffix tree, the space complexity and time complexity of the algorithm are both O(n), where n is the length of the string S.

Time complexity

We use the idea proposed by Deng et al. to prove the time complexity of the RSGM algorithm [21]. The preprocessing step, in which the text S is represented as a suffix tree, is implemented using the Ukkonen’s algorithm [27], and its time complexity is O(n). Below we study the time complexity of the pattern query steps in the RSGM algorithm (Steps 2-11 of the algorithm in Algorithm 1).

We divide the set C(e) of text strings for an edge e in the spectrum graph into subsets C(e,1),C(e,2), …,C(e,l), where l is the length of the longest text string in C(e). The subset C(e,j) contains all text strings in C(e) with length j. The expansion factor of the set C(e,j) is defined as \(r(e,j) = |C(e,j)|^{\frac {1} {j}}\), where |C(e,j)| is the size of C(e,j). The largest expansion factor of all edges in G is denoted as r= maxe,jr(e,j). Let N be the maximum size of C(e) for all edges in G, that is, N= maxe∈E|C(e)|.

The running time of the graph query (Steps 2 - 11) in the RSGM algorithm is related to \(\sum _{i=1}^{m}|B_{i}|\), the sum of the sizes of the sets B_i. Next we will describe how to compute \(\sum _{i=1}^{m} |B_{i}|\). Let U be the set of all prefix text strings of v₁,…,v_m, and L be the length of the longest prefix text string in U. We define X_l as the set of all length l prefix text strings in U, and Y_l as the set of all length l identifiable prefix strings in U, for 1≤l≤L. Each node in \(\cup _{i=1}^{m}B_{i}\) corresponds to an identifiable prefix text string in \(\cup _{l=1}^{L}Y_{l}\), that is, \(\sum _{i=1}^{m}|B_{i}| = \sum _{l=1}^{L}|Y_{l}|\). We compute the expectation of |Y_l| by multiplying the size of X_l and the probability that a length l text string can be found in S.

A path in G starting from node s is called a prefix path. Let Q be a prefix path of G and X_l(Q) the set all prefix text strings with length l that match a prefix subpath of Q. Deng et al. proved the following Lemma about the size of X_l(Q) [21].

Lemma 1

The size of X_l(Q) is bounded by (2r)^l, where r is the maximum expansion factor.

Based on Lemma 1, we give an upper bound of the size of X_l.

Lemma 2

The size of X_l is bounded by (2rd)^l, where d is the maximum out-degree of the nodes in G.

Proof

Each length l prefix text string in X_l matches at least one prefix path. The total number of prefix paths in G with length l is bounded by d^l. Because of Lemma 1, the number of prefix text strings in X_l matching one prefix path in G is bounded by (2r)^l. As a result, the size of X_l is bounded by (2r)^ld^l=(2rd)^l. □

Let g=|Σ| be the size of the alphabet Σ. Using the same proof method in Lemmas 1 and 2 in the supplementary material in [21], we can prove the following two lemmas.

Lemma 3

When 2rd<g, the expectation of |Y_l| is bounded by \(\phantom {\dot {i}\!}(2rd)^{{\text {log}_{g}}n}=n^{{\text {log}_{g}}(2rd)}\), where n is the length of the text string S.

Lemma 4

When 1<2rd<g, the expectation of \({\sum \nolimits }_{i=1}^{m}|B_{i}|\) in the RSGM algorithm is bounded by \(\phantom {\dot {i}\!}c\,n^{{\text {log}_{g}}(2rd)}\), where \(\phantom {\dot {i}\!}c=\frac {2rd}{2rd-1}+\frac {g}{g-2rd}\).

Theorem 1

When Σ is a finite set and 1<2rd<g, the average-case time complexity of the graph query steps in the RSGM algorithm is O(dNn^k+M), where d is the maximum out-degree of the nodes in G, N= maxe∈E|C(e)|, k= logg(2rd)<1, and M is the number of matched substrings in S.

Proof

For each suffix tree node u in \(\cup _{i=0}^{m} B_{i}\), we find its corresponding node v in the spectrum graph. In Step 6 of the RSGM algorithm, the length of the text string x is a constant. In addition, the set B_i is stored as a hash table, and the number of operations to check if u^′ exists in B_i is a constant. As a result, the time complexity of one search in Step 6 is O(1). The total number of searches starting from u is bounded by the out-degree of v multiplied by the largest size of C(e) for all edges e whose head node is v. That is, the number of searches of u is bounded by dN. Because \(\left |\cup _{i=0}^{m} B_{i}\right |\) is O(n^k), the total number of searches in the algorithm is O(dNn^k). As a result, the time complexity of the algorithm is O(dNn^k). □

Theorem 1 given the time complexity of the graph query steps in the RSGM algorithm when 2rd is no larger than the size of the alphabet Σ. In practice, the default value of the parameter α (the maximum difference between the masses corresponding to two nodes connected by an edge) is 350 Da and the scale factor is 100. In this case, the value of r is about 2.91. In addition, the out-degree of a node v in G is restricted to be at most 3 by ranking all the edges with the head node v in the increasing order with respect to the labeled masses and keeping only the top 3 ones. Because the maximum out-degree of the nodes in G is 3, the condition 2rd<g (2rd<18<g=19) is satisfied. In the experiments, the maximum out-degree of nodes in spectrum graphs was set to 3.

When 2rd≥g, the number of the suffix tree nodes searched by the algorithm is O(n), that is, the number of searched nodes is bounded by the size of the suffix tree. Each suffix tree node u corresponds to a spectrum graph node v. The total number of searches starting from u is bounded by the out-degree of v multiplied by the largest size of C(e) for all edges e whose head node is v. Therefore, the number of searches starting from u is bounded by dN, and the time complexity of the algorithm is O(dNn).

The RSGM algorithm gives all substrings in S that match a path from v₀ to v_i by reporting the suffix tree nodes in B_i. That is, the algorithm reports all substrings in S that match a path starting from v₀. We execute the RSGM algorithm m−1 rounds to solve the SGM problem. In the ith round, the start node is set to v_i and the end node to v_m. All matched substrings in the m−1 rounds are combined as the solution to the SGM problem.

In practice, we are interested in finding only text strings that match a long path in G. Let V_s be the set of nodes v in G such that the sum of the labeled masses on the shortest path from v₀ to v is no larger than β, where β is a user-specified parameter. Let V_t be the set of nodes v in G such that the sum of the labeled masses on the shortest path from v to v_m is no larger than β. We only report text strings that match a path from a node in V_s to a node in V_t.

Protein sequence filtration

The spectrum graph of a top-down MS/MS spectrum usually consists of several disconnected components because the spectrum often has low fragment coverage. Based on this observation, we propose to extract γ mass intervals (subspectra) that are abundant with fragment masses from a query spectrum and construct a spectrum graph for each extracted mass interval, where γ is a user-specified parameter. In practice, the value of γ is chosen between 1 and 20. The spectrum graphs are searched against the protein database separately and the search results are combined to find the best sequence candidates. The filtering algorithm is called the SGM filtering algorithm.

In mass interval selection, mass intervals with the same width δ are reported and the width δ is a user-specified parameter. In practice, the value of δ is chosen between 500 and 1400 Da so that each reported mass interval corresponds to a sequence tag with 5−14 amino acids. (See “Parameter settings” section) The score of a mass interval is defined as the number of masses in it. The overlapping ratio of two intervals [a,a+δ] and [b,b+δ] with a<b<a+δ is defined as the ratio between the width of the overlapping region [b,a+δ] and the interval width δ, that is, \(\frac {a+\delta -b}{\delta }\). If a<a+δ<b, the overlapping ratio is 0. A greedy approach is employed to find multiple top-scoring mass intervals. First, the top-scoring mass interval of the spectrum is reported. Second, we remove all mass intervals that have an overlapping ratio ≥ρ with the reported one, where ρ is a user-specified parameter, and then report the best scoring one from the remaining mass intervals. The second step is performed iteratively until a total of γ mass intervals are reported or no candidate mass intervals can be reported. In addition, only mass intervals with at least 6 fragment masses are reported.

We use reversed mass intervals to find text strings that match suffix fragment masses in the query spectrum. Let b₁,b₂,…,b_k be the masses in a mass interval [m_l,m_r] extracted from a collision-induced dissociation (CID) MS/MS spectrum with a precursor mass M. We generate a reversed mass interval as follows: (a) the mass interval [m_l,m_r] is converted into [M−m_r,M−m_l], and (b) for each mass b_i, 1≤i≤k, a reversed mass M−b_i is added to the reversed mass list. For example, the reversed mass interval of a mass interval [100,400] with masses 114,213 and a precursor mass 644 is [244,544] with masses 431,530. Each extracted mass interval is further reversed to obtain a reversed mass interval in which suffix fragment masses are converted into prefix fragment masses. We search the spectrum graphs generated from the extracted and reversed mass intervals against the protein database represented by a suffix tree.

We describe three functions for assigning scores to fragment masses. For a fragment mass (x,h) with a mass value x and its intensity h in a query spectrum, the first function assigns the score 1 to the mass, that is, score(x,h)=1. The second function uses the logarithm function to normalize mass intensities to compute scores of masses. We find the lowest mass intensity b in the query spectrum, the score for the mass (x,h) is computed as \(\text {score}(x,h) = \log _{2} \left (\frac {2h}{b}\right)\). The constant number 2 in the function is to guarantee that all scores are positive. The third function is based on the rankings of mass intensities. The fragment masses in the query spectrum are sorted in the increasing order of their corresponding intensities. Let i be the rank of the mass (x,h). The score of the mass (x,h) is defined as score(x,h)=1+i/k where k is the total number of fragment masses in the spectrum. The score of the lowest intensity mass is about 1 and the score of the highest intensity mass is 2. The three scoring functions are called the mass count score, log intensity score, and rank score, respectively. The score of a node in a spectrum graph is assigned as the score of its corresponding fragment mass.

Here we introduce a similarity score between a query spectrum and the protein sequence. A protein sequence matches a blocked pattern P if its text string representation contains a substring that matches P. The score of the pattern P is the number of the nodes in the corresponding path in the spectrum graph. The similarity score between a protein sequence and a spectrum graph is the score of the best scoring pattern extracted from the graph that matches the protein sequence. Let G₁,G₂,…,G_k be the set of spectrum graphs extracted from a query spectrum, the similarity score between a protein sequence and the query spectrum is the best similarity score between the protein sequence and G₁,G₂,…,G_k.

After finding the best scoring pattern from G₁, G₂,…,G_k that matches a substring of the protein sequence, we compute the mass shift between the experimental fragment masses in the spectrum and the theoretical fragment masses of the protein sequence. By shifting all theoretical masses by the mass shift, we recompute the similarity score between the experimental masses and shifted theoretical ones. The new score is called the extended similarity score between the protein sequence and the query spectrum, which is used to rank and filter protein sequences.

A graph-based filtering method for top-down mass spectral identification | BMC Genomics | Springer Nature Link

Spectrum graphs

The spectrum graph matching problem

Definition 1

Definition 2

Definition 3

A suffix tree based algorithm for the RSGM problem

Time complexity

Lemma 1

Lemma 2

Proof

Lemma 3

Lemma 4

Theorem 1

Proof

Protein sequence filtration