The top-down MS/MS spectrum of a proteoform consists of a precursor mass and a list of fragment ion peaks. The precursor mass corresponds to the molecular mass of the proteoform and the fragment ion peaks correspond to fragments of the proteoform. Because of the existence of highly charged fragment ions and isotopic peaks, top-down MS/MS spectra are usually very complex. In data preprocessing, spectral deconvolution tools [23, 24] are used to converted fragment ion peaks into monoisotopic fragment masses. The intensity of a monoisotopic mass is computed as the sum of the intensities of its corresponding fragment ions peaks.

Noise masses need to be removed from deconvoluted top-down MS/MS spectra to improve the accuracy of protein sequence filtering. Generally speaking, low-intensity masses are more possible to be noise ones than high-intensity masses. An intensity-based method is used to remove noise masses from a deconvoluted spectrum. For each mass x in the spectrum, we rank all the masses in the interval [x−100,x+100] with a width of 200 Dalton (Da) in the decreasing order of their intensities. If x is not one of the top λ masses, x is treated as a noise mass and removed, where λ is a user-specified parameter. All parameters used in the SGM filtering algorithm are summarized in Additional file 1.

Spectrum graphs

Spectrum graphs have been used in sequence tag-based protein sequence filtration to extract sequence tags from mass spectra [25]. There are two steps to generate a spectrum graph from a deconvoluted query spectrum: (a) a node is added to the graph for each fragment mass in the spectrum, and (b) a directed edge from a node u to a node v is added to the graph if mass(v)−mass(u) matches the mass of one amino acid residue within an error tolerance, where mass(u) and mass(v) are the masses corresponding to u and v, respectively. The edge is labeled with the amino acid explaining the mass difference (Fig. 1b). Each path in a spectrum graph corresponds to a sequence tag, which is the readout of the labeled amino acids on its edges. For example, the only path from node v0 to v4 in the spectrum graph in Fig. 1b corresponds to the sequence tag NVRS. There are many existing methods for extracting sequence tags for protein sequence filtration from spectrum graphs [17, 26].

Fig. 1
Fig. 1The alternative text for this image may have been generated using AI.

Spectrum graph generation. Illustration of spectrum graph generation using an example deconvoluted MS/MS spectrum of the protein LNRVSG. a In the spectrum, the mass of the N-terminal fragment LNR is missing, and there is a noise mass peak (bold) between the fragment masses of LNR and LNRV. b In the spectrum graph, each node corresponds to a peak in the spectrum. Two nodes are connected by a directed edge if the difference between their corresponding masses matches the residue mass of one amino acid; the edge is labeled with the amino acid. The sequence tag NVRS extracted from the spectrum is incorrect because of the noise mass peak and its node v2. c In the spectrum graph, each node corresponds to a peak in the spectrum. Two nodes are connected by a directed edge if the difference between their corresponding masses is less than 400 Da and matches the residue mass of one or several amino acids; the edge is labeled by the mass difference. The mass sequence of a path is a blocked pattern of the spectrum. For example, the bold path v0,v1,v3,v4 corresponds to a blocked pattern 114.04, 255.17, 87.03, which matches a correct sequence tag NRVS because 255.07 is the sum of the mass 156.10 of R and the mass 99.07 of V

When a spectrum misses many fragment masses, the mass differences of many node pairs in the spectrum graph are explained by 2 or more amino acids, not single amino acids. In this case, the spectrum graph approach described in the previous paragraph may fail to extract long correct sequence tags, which are extremely useful in protein sequence filtration.

Blocked patterns [20] or gapped sequence tags [19] are often used to address the missing peak problem in tag generation. To extract blocked patterns, we change step (b) in spectrum graph generation as follows: a directed edge is added into the graph from a node u to a node v if the mass difference mass(v)−mass(u) is explained by a combination of several amino acids and is no larger than a predefined bound α (α is chosen between 200 and 400 Da in practice). The edge is labeled with the mass difference mass(v)−mass(u) (Fig. 1c). Each path in a spectrum graph with labeled masses corresponds to a mass sequence, called a blocked pattern or a gapped sequence tag [20]. In block pattern-based filtration, multiple blocked patterns (gapped sequence tags) extracted from a spectrum graph are searched against protein sequences to find candidate ones. The number of blocked patterns in a spectrum graph may be an exponential function of the length of the longest blocked pattern, making it inefficient to explicitly extract all blocked patterns in the spectrum graph. As a result, it is common to report only a limited number of paths and their corresponding blocked patterns in a spectrum graph. Because of the existence of noise peaks, it is a challenging problem to determine which paths in a spectrum graph correspond to blocked patterns that match the target protein sequence. We propose to circumvent the blocked pattern selection step and directly search spectrum graphs against protein sequences for filtration.

The spectrum graph matching problem

For a gene or a transcript, most protein databases contain only one unmodified reference sequence. To simplify the description of the SGM filtering method, we assume that protein databases used in spectral identification contain only unmodified sequences and that amino acid residue masses are integers. All amino acid residue masses are discretized by first multiplying the masses by a scale factor (100 was used in the experiments) and then rounding the results to integers.

Notations introduced by Deng et al. [21] are used in the description of the SGM problem. An amino acid string is represented as a list of discretized residue masses: the mass representation of an amino acid string a1,a2,…,an is a residue mass string S=s1,s2,…,sn, in which si is the discretized residue mass of ai for 1≤in. Residue mass strings are called text strings in the following analysis. The sum of all the masses in a substring si,si+1,…,sj of S, i.e., \(\sum _{k=i}^{j} s_{k}\), is called the mass of the substring. Two substrings of S are called consecutive substrings if the first residue mass of the second string directly follows the last residue mass of the first string. For example, si,si+1,…,sj and sj+1,sj+2,…,sk are consecutive substrings of S. A sequence of consecutive substrings A1,A2,…,Ak that include all residue masses in S is called a partition of S. The masses of the consecutive substrings in a partition is called the mass string of the partition. For example, let S=114,156,99,87, A1=114,A2=156,99, and A3=87, then A1,A2,A3 is a partition of S and the mass string of the partition is 114,255,87, where 255 is the mass of A2. A blocked pattern obtained from a path in a spectrum graph is represented by a sequence of masses, which are labels of the edges in the path. For example, the blocked pattern corresponding to the bold path in Fig. 1c is 114.04, 225.17, 87.03. A blocked pattern can be further discretized to an integer sequence using the same method for residue mass discretization. In the following analysis, we assume that all blocked patterns are integer ones. Unlike text strings, a blocked pattern contains not only single amino acid residue masses, but also those of combinations of several amino acids. A blocked pattern P matches a text string S if P is the mass string of a partition of S. For example, the blocked pattern 114,255,87 matches the text string 114,156,99,87 because the blocked pattern is the mass string of a partition of the text string.

In protein sequence filtration, we search blocked patterns extracted from a spectrum graph against a protein database to find matched amino acid sequences (sequence tags). All protein sequences in the database are concatenated and represented as a text string over an alphabet Σ containing the discretized residue masses of the 20 common amino acids. Because the masses of leucine and isoleucine are the same, the size of Σ is 19 instead of 20. Given a text string S over an alphabet Σ and a blocked pattern P, the blocked pattern matching problem is to find all substrings of S that match P.

Definition 1

Given a text string S over an alphabet Σ and a pattern P, the blocked pattern matching (BPM) problem is to find all substrings of S that match P.

The SGM problem is an extension of BPM problem that considering the noises in the spectrum. It is more complex than the blocked pattern matching problem.

Definition 2

Given a text string S over an alphabet Σ and a spectrum graph G, the spectrum graph matching (SGM) problem is to find all substrings of S that match a blocked pattern in G.

We first study a restricted version of the SGM problem, in which a start node and an end node in G are specified and only paths from the start node to the end node are used to find matched substrings of the text string. The SGM problem is solved by enumerating all node pairs in G as the start and end nodes in the restricted spectrum graph matching (RSGM) problem.

Definition 3

Given a text string S, a spectrum graph G over an alphabet Σ, a start node s, and an end node t, the restricted spectrum graph matching (RSGM) problem is to find all substrings of S that match a blocked pattern corresponding to a path from s to t in G.

A suffix tree based algorithm for the RSGM problem

To solve the RSGM problem, we present a suffix tree based algorithm that extends the algorithm proposed by Deng et al. for the blocked pattern matching problem [21]. A suffix tree is used to represent the string S. We assume that each edge in the suffix tree is labeled with only one letter (integer residue mass) to simplify the algorithm description.

We first review the algorithm for the blocked pattern matching problem, which was proposed by Deng et al. [21]. A blocked pattern is represented as a spectrum graph in which all edges are in one path. Let G={V,E} be the graph representation of a blocked pattern P=p1,p2,…,pm, where V={v0,v1…,vm} and v0,v1,…,vm is the only path from v0 to vm in G. Each prefix p1,p2,…,pk of the blocked pattern P corresponds to a path v0,v1,…,vk. A text string over Σ that matches the prefix p1,p2,…,pk is called a prefix text string of vk. A prefix text string is identifiable if it is a substring of S. For example, when P=114,255,87, both 114,156,99 and 114,99,156 are prefix text string of P. When S=114,156,99,87, the string 114,156,99 is an identifiable prefix text string of P, but 114,99,156 is not. If a prefix text string is not identifiable, then all text strings with the prefix are not identifiable, making it not necessary to search these text strings in S. Using identifiable prefix text strings significantly improves in the speed of searching a blocked pattern against S represented as a suffix tree.

Let Bi be the set of nodes in the suffix tree corresponding to all identifiable prefix text strings of vi for 0≤im, where m is the length of the blocked pattern. Specifically, B0 contains only the root node of the suffix tree. The blocking pattern matching algorithm progressively finds the node sets B1,…,Bm in the suffix tree. The node set Bm contains the solution to the blocked pattern matching problem.

Unlike the blocked pattern matching problem with only one blocked pattern, the spectrum graph G in the RSGM problem contains many paths from the start node to the end node, each path corresponding to a blocked pattern. The number of all paths in a spectrum graph may be an exponential function of the number of nodes. So it is an inefficient approach to directly extract all paths in a spectrum graph and search each path against a suffix tree separately.

Next, we describe our new algorithm for the RSGM problem, which extends the blocked pattern matching algorithm. Let V={v0,v1,…vm} be the set of nodes in the increasing order of their corresponding masses, in which the start node s is v0 and the end node t is vm. A text string is a prefix text string of node vi if it matches a blocked pattern corresponding to a path from v0 to vi. Let Bi be the set of nodes in the suffix tree corresponding to all identifiable prefix text strings of vi, and C(e) be the set of all text strings whose masses match the labeled mass of e (Table 1). In practice, a precomputed table is used to quickly obtain C(e). Because the fragment masses in query spectra have measured errors, an error tolerance ε is allowed in generating the text strings in the table. Denote EiE the set of all edges whose tail nodes are vi. For each edge e=(vj,vi) in Ei, we search from each suffix tree node in Bj to find all the paths corresponding to a text string in C(e) and add the end nodes of these paths to Bi (Algorithm 1).

Table 1 An example set of text strings matched a mass 270.14

After the last set Bm is found, the RSGM problem is solved by reporting all the text strings corresponding to a suffix tree node in Bm and their positions in S, which are stored in the suffix tree. Because the string S is represented by a suffix tree, the space complexity and time complexity of the algorithm are both O(n), where n is the length of the string S.

Time complexity

We use the idea proposed by Deng et al. to prove the time complexity of the RSGM algorithm [21]. The preprocessing step, in which the text S is represented as a suffix tree, is implemented using the Ukkonen’s algorithm [27], and its time complexity is O(n). Below we study the time complexity of the pattern query steps in the RSGM algorithm (Steps 2-11 of the algorithm in Algorithm 1).

We divide the set C(e) of text strings for an edge e in the spectrum graph into subsets C(e,1),C(e,2), …,C(e,l), where l is the length of the longest text string in C(e). The subset C(e,j) contains all text strings in C(e) with length j. The expansion factor of the set C(e,j) is defined as \(r(e,j) = |C(e,j)|^{\frac {1} {j}}\), where |C(e,j)| is the size of C(e,j). The largest expansion factor of all edges in G is denoted as r= maxe,jr(e,j). Let N be the maximum size of C(e) for all edges in G, that is, N= maxeE|C(e)|.

The running time of the graph query (Steps 2 - 11) in the RSGM algorithm is related to \(\sum _{i=1}^{m}|B_{i}|\), the sum of the sizes of the sets Bi. Next we will describe how to compute \(\sum _{i=1}^{m} |B_{i}|\). Let U be the set of all prefix text strings of v1,…,vm, and L be the length of the longest prefix text string in U. We define Xl as the set of all length l prefix text strings in U, and Yl as the set of all length l identifiable prefix strings in U, for 1≤lL. Each node in \(\cup _{i=1}^{m}B_{i}\) corresponds to an identifiable prefix text string in \(\cup _{l=1}^{L}Y_{l}\), that is, \(\sum _{i=1}^{m}|B_{i}| = \sum _{l=1}^{L}|Y_{l}|\). We compute the expectation of |Yl| by multiplying the size of Xl and the probability that a length l text string can be found in S.

A path in G starting from node s is called a prefix path. Let Q be a prefix path of G and Xl(Q) the set all prefix text strings with length l that match a prefix subpath of Q. Deng et al. proved the following Lemma about the size of Xl(Q) [21].

Lemma 1

The size of Xl(Q) is bounded by (2r)l, where r is the maximum expansion factor.

Based on Lemma 1, we give an upper bound of the size of Xl.

Lemma 2

The size of Xl is bounded by (2rd)l, where d is the maximum out-degree of the nodes in G.

Proof

Each length l prefix text string in Xl matches at least one prefix path. The total number of prefix paths in G with length l is bounded by dl. Because of Lemma 1, the number of prefix text strings in Xl matching one prefix path in G is bounded by (2r)l. As a result, the size of Xl is bounded by (2r)ldl=(2rd)l. □

Let g=|Σ| be the size of the alphabet Σ. Using the same proof method in Lemmas 1 and 2 in the supplementary material in [21], we can prove the following two lemmas.

Lemma 3

When 2rd<g, the expectation of |Yl| is bounded by \(\phantom {\dot {i}\!}(2rd)^{{\text {log}_{g}}n}=n^{{\text {log}_{g}}(2rd)}\), where n is the length of the text string S.

Lemma 4

When 1<2rd<g, the expectation of \({\sum \nolimits }_{i=1}^{m}|B_{i}|\) in the RSGM algorithm is bounded by \(\phantom {\dot {i}\!}c\,n^{{\text {log}_{g}}(2rd)}\), where \(\phantom {\dot {i}\!}c=\frac {2rd}{2rd-1}+\frac {g}{g-2rd}\).

Theorem 1

When Σ is a finite set and 1<2rd<g, the average-case time complexity of the graph query steps in the RSGM algorithm is O(dNnk+M), where d is the maximum out-degree of the nodes in G, N= maxeE|C(e)|, k= logg(2rd)<1, and M is the number of matched substrings in S.

Proof

For each suffix tree node u in \(\cup _{i=0}^{m} B_{i}\), we find its corresponding node v in the spectrum graph. In Step 6 of the RSGM algorithm, the length of the text string x is a constant. In addition, the set Bi is stored as a hash table, and the number of operations to check if u exists in Bi is a constant. As a result, the time complexity of one search in Step 6 is O(1). The total number of searches starting from u is bounded by the out-degree of v multiplied by the largest size of C(e) for all edges e whose head node is v. That is, the number of searches of u is bounded by dN. Because \(\left |\cup _{i=0}^{m} B_{i}\right |\) is O(nk), the total number of searches in the algorithm is O(dNnk). As a result, the time complexity of the algorithm is O(dNnk). □

Theorem 1 given the time complexity of the graph query steps in the RSGM algorithm when 2rd is no larger than the size of the alphabet Σ. In practice, the default value of the parameter α (the maximum difference between the masses corresponding to two nodes connected by an edge) is 350 Da and the scale factor is 100. In this case, the value of r is about 2.91. In addition, the out-degree of a node v in G is restricted to be at most 3 by ranking all the edges with the head node v in the increasing order with respect to the labeled masses and keeping only the top 3 ones. Because the maximum out-degree of the nodes in G is 3, the condition 2rd<g (2rd<18<g=19) is satisfied. In the experiments, the maximum out-degree of nodes in spectrum graphs was set to 3.

When 2rdg, the number of the suffix tree nodes searched by the algorithm is O(n), that is, the number of searched nodes is bounded by the size of the suffix tree. Each suffix tree node u corresponds to a spectrum graph node v. The total number of searches starting from u is bounded by the out-degree of v multiplied by the largest size of C(e) for all edges e whose head node is v. Therefore, the number of searches starting from u is bounded by dN, and the time complexity of the algorithm is O(dNn).

The RSGM algorithm gives all substrings in S that match a path from v0 to vi by reporting the suffix tree nodes in Bi. That is, the algorithm reports all substrings in S that match a path starting from v0. We execute the RSGM algorithm m−1 rounds to solve the SGM problem. In the ith round, the start node is set to vi and the end node to vm. All matched substrings in the m−1 rounds are combined as the solution to the SGM problem.

In practice, we are interested in finding only text strings that match a long path in G. Let Vs be the set of nodes v in G such that the sum of the labeled masses on the shortest path from v0 to v is no larger than β, where β is a user-specified parameter. Let Vt be the set of nodes v in G such that the sum of the labeled masses on the shortest path from v to vm is no larger than β. We only report text strings that match a path from a node in Vs to a node in Vt.

Protein sequence filtration

The spectrum graph of a top-down MS/MS spectrum usually consists of several disconnected components because the spectrum often has low fragment coverage. Based on this observation, we propose to extract γ mass intervals (subspectra) that are abundant with fragment masses from a query spectrum and construct a spectrum graph for each extracted mass interval, where γ is a user-specified parameter. In practice, the value of γ is chosen between 1 and 20. The spectrum graphs are searched against the protein database separately and the search results are combined to find the best sequence candidates. The filtering algorithm is called the SGM filtering algorithm.

In mass interval selection, mass intervals with the same width δ are reported and the width δ is a user-specified parameter. In practice, the value of δ is chosen between 500 and 1400 Da so that each reported mass interval corresponds to a sequence tag with 5−14 amino acids. (See “Parameter settings” section) The score of a mass interval is defined as the number of masses in it. The overlapping ratio of two intervals [a,a+δ] and [b,b+δ] with a<b<a+δ is defined as the ratio between the width of the overlapping region [b,a+δ] and the interval width δ, that is, \(\frac {a+\delta -b}{\delta }\). If a<a+δ<b, the overlapping ratio is 0. A greedy approach is employed to find multiple top-scoring mass intervals. First, the top-scoring mass interval of the spectrum is reported. Second, we remove all mass intervals that have an overlapping ratio ≥ρ with the reported one, where ρ is a user-specified parameter, and then report the best scoring one from the remaining mass intervals. The second step is performed iteratively until a total of γ mass intervals are reported or no candidate mass intervals can be reported. In addition, only mass intervals with at least 6 fragment masses are reported.

We use reversed mass intervals to find text strings that match suffix fragment masses in the query spectrum. Let b1,b2,…,bk be the masses in a mass interval [ml,mr] extracted from a collision-induced dissociation (CID) MS/MS spectrum with a precursor mass M. We generate a reversed mass interval as follows: (a) the mass interval [ml,mr] is converted into [Mmr,Mml], and (b) for each mass bi, 1≤ik, a reversed mass Mbi is added to the reversed mass list. For example, the reversed mass interval of a mass interval [100,400] with masses 114,213 and a precursor mass 644 is [244,544] with masses 431,530. Each extracted mass interval is further reversed to obtain a reversed mass interval in which suffix fragment masses are converted into prefix fragment masses. We search the spectrum graphs generated from the extracted and reversed mass intervals against the protein database represented by a suffix tree.

We describe three functions for assigning scores to fragment masses. For a fragment mass (x,h) with a mass value x and its intensity h in a query spectrum, the first function assigns the score 1 to the mass, that is, score(x,h)=1. The second function uses the logarithm function to normalize mass intensities to compute scores of masses. We find the lowest mass intensity b in the query spectrum, the score for the mass (x,h) is computed as \(\text {score}(x,h) = \log _{2} \left (\frac {2h}{b}\right)\). The constant number 2 in the function is to guarantee that all scores are positive. The third function is based on the rankings of mass intensities. The fragment masses in the query spectrum are sorted in the increasing order of their corresponding intensities. Let i be the rank of the mass (x,h). The score of the mass (x,h) is defined as score(x,h)=1+i/k where k is the total number of fragment masses in the spectrum. The score of the lowest intensity mass is about 1 and the score of the highest intensity mass is 2. The three scoring functions are called the mass count score, log intensity score, and rank score, respectively. The score of a node in a spectrum graph is assigned as the score of its corresponding fragment mass.

Here we introduce a similarity score between a query spectrum and the protein sequence. A protein sequence matches a blocked pattern P if its text string representation contains a substring that matches P. The score of the pattern P is the number of the nodes in the corresponding path in the spectrum graph. The similarity score between a protein sequence and a spectrum graph is the score of the best scoring pattern extracted from the graph that matches the protein sequence. Let G1,G2,…,Gk be the set of spectrum graphs extracted from a query spectrum, the similarity score between a protein sequence and the query spectrum is the best similarity score between the protein sequence and G1,G2,…,Gk.

After finding the best scoring pattern from G1, G2,…,Gk that matches a substring of the protein sequence, we compute the mass shift between the experimental fragment masses in the spectrum and the theoretical fragment masses of the protein sequence. By shifting all theoretical masses by the mass shift, we recompute the similarity score between the experimental masses and shifted theoretical ones. The new score is called the extended similarity score between the protein sequence and the query spectrum, which is used to rank and filter protein sequences.