RAId


Robust Accurate Identification (RAId) is a suite of proteomics tools for analyzing tandem mass spectrometry data with accurate statistics. In addition to providing accurate statistics, RAId offers users several modes of data analysis: database search, generation of the de novo score distribution of all possible peptides using different scoring functions (RAId, XCorr, Hyperscore, Kscore), and statistical confidence reassignment. In particular, RAId's integrated knowledge databases incorporate known single amino acid polymorphisms (SAPs), post-translational modifications (PTMs), and disease information, providing dynamic information retrieval for biomedical applications.

RAId modes: (click on a mode to start)

> Database search
Search database with specified spectrum using different scoring functions.
> Generate histogram
Generate the score histogram for a MS/MS spectrum.
> Compute TNPP
Compute total number of possible peptides for a given molecular weight.

Statistical methods:

Controlling the number of false positives (FPs) or the proportion of false discoveries (PFD) without reducing the number of identified true positives (TPs) is among the most important goals in retrieval problems. This section briefly describes the strategy employed by RAId and how it differs from other strategies.

Let us begin with some definitions. Imagine that there are in total m hypotheses, tested and sorted according to some significance measure. Out of the m hypotheses, m0 are true null hypotheses (false positives) while m − m0 are untrue null hypotheses (true positives). At a specified threshold, some true null hypotheses and some untrue null hypotheses will be declared significant or insignificant, as shown in the table below:
                               Declared insignificant   Declared significant   Total
True null hypotheses (FPs)     U                        V                      m0 = m π0
Untrue null hypotheses (TPs)   T                        S                      m − m0 = m (1 − π0)
Total                          m − R                    R                      m

At this threshold, the number of FPs (denoted by V) and the number of TPs (denoted by S) are unobservable, but their sum R, the total number of hypotheses passing this threshold, is observable. The PFD is defined to be Q = V/(V + S) = V/R. As described in the first sentence of the second paragraph of section 2.3 of Benjamini & Hochberg1, controlling the random variable Q for each realization (i.e., each experiment) is most desirable. This, however, is impossible when m0 = m, because in this case Q can only take the value 0 or 1 per cutoff per realization. Benjamini & Hochberg thus proposed to control the expectation value of Q (𝔼[Q]). However, as we elaborate below, the cases of m0 = m, or of V > 0 while S = 0, are not expected to occur in proteomics data analysis, so we can employ a simpler approach similar to that of Sorić2. Sorić's method is to control, with P-value cutoff α (valid only when m > m0 and V remains zero until S > 0),
Qmax = (𝔼[V]/R)max = (α m0/R)max < (α m/R) ≡ ε
Since ε is an upper bound of Q, Sorić's method of using ε yields conservative statistics. Storey3 recently addressed this issue by estimating π0 and including it explicitly, but with 𝔼[Q] essentially replaced by 𝔼[V]/𝔼[R].
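As a concrete illustration, Sorić's bound ε = α m / R can be computed directly. The sketch below (with made-up numbers, not RAId output) shows how a P-value cutoff α, the number of hypotheses m, and the observed discovery count R combine into this upper bound on the PFD:

```python
# Sketch of Soric's upper bound on the proportion of false discoveries (PFD).
# All numbers here are hypothetical illustrations.

def soric_epsilon(alpha: float, m: int, r: int) -> float:
    """Upper bound epsilon = alpha * m / R on Q = V / R.

    alpha : P-value cutoff applied to each of the m hypotheses
    m     : total number of hypotheses tested
    r     : observed number of hypotheses passing the cutoff
    """
    if r == 0:
        raise ValueError("no discoveries: Q is undefined when R = 0")
    return alpha * m / r

# Example: 10,000 hypotheses, alpha = 0.001, 50 discoveries.
eps = soric_epsilon(alpha=0.001, m=10_000, r=50)
print(eps)  # 0.2: at most ~20% of the 50 discoveries are expected to be false
```

Because ε uses m rather than m0 in the numerator, it always overestimates the true proportion of false discoveries, which is the conservativeness discussed above.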

Evidently, the error control by Sorić's ε is conservative when peptide identifications are analyzed via the so-called peptide-spectrum match protocol. In this protocol, one picks only the best-scoring peptide for each spectrum (hence peptide-spectrum match), and each spectrum is treated as a hypothesis that yields either a TP (if the top hit of that spectrum is a TP) or an FP (if the top hit of that spectrum is an FP). For this type of method, the π0 factor, being the ratio of the number of FP spectra to the total number of spectra (TP spectra + FP spectra), should be carefully included to avoid conservative statistics.

In our method, all scored peptides are considered for each spectrum. Therefore, even a spectrum with a TP as the top hit can still yield FPs; hence the additional factor π0 is not needed, as elaborated below. The PFD in our case is estimated by
PFD = (Ec × Nσ) / R(E ≤ Ec)
where R(E ≤ Ec) records the number of peptide hits with E-values E ≤ Ec, Ec is our cutoff E-value reflecting the expected number of false positives per spectrum, and Nσ is the number of MS/MS spectra contained in the experimental data. As explained in Ref. 4, accurate spectrum-specific E-values can be used to rank peptides across spectra and even across experiments. If one superficially views Ec as the corresponding α and Nσ as m, then our PFD estimate appears as if we were using ε. Careful scrutiny shows, however, that this is not the case. Since Ec gives the expected number of false positives per spectrum, Ec × Nσ yields the expected number of false positives with E-values E ≤ Ec, which is exactly the expectation value of V. That is, our Ec × Nσ = 𝔼[V(E ≤ Ec)]. Therefore, our estimate for the PFD, being 𝔼[V]/R, is neither too conservative nor too aggressive. We now describe in more detail what our m and m0 are. For a given set of experimental data, the qualified candidate peptides for spectrum σi may include ti true positives (ti > 1 for spectra of co-eluted peptides) and ni false positives. In this case, m and m0 are given as
m = ∑_{i=1}^{Nσ} [ti + ni];    m0 = ∑_{i=1}^{Nσ} ni
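The PFD estimate above can be sketched in a few lines of code. The per-spectrum E-value lists below are hypothetical illustrations, not RAId output:

```python
# Sketch of the PFD estimate PFD = (Ec * N_sigma) / R(E <= Ec),
# where R counts peptide hits across all spectra with E-value <= Ec.

def pfd_estimate(evalues_per_spectrum, e_cut):
    """evalues_per_spectrum: list of per-spectrum lists of peptide E-values."""
    n_sigma = len(evalues_per_spectrum)              # number of MS/MS spectra
    r = sum(1 for spectrum in evalues_per_spectrum
              for e in spectrum if e <= e_cut)       # R(E <= Ec)
    expected_fp = e_cut * n_sigma                    # Ec * N_sigma = E[V(E <= Ec)]
    return expected_fp / r if r > 0 else 0.0

# Three spectra; two peptide hits fall below the cutoff Ec = 0.05.
spectra = [[0.005, 0.8, 2.3], [0.02, 1.5], [0.9, 3.0]]
print(round(pfd_estimate(spectra, e_cut=0.05), 6))   # 0.075
```

Note that all scored peptides of every spectrum enter R, which is why no π0 factor appears in this estimate.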
We now justify why the pathological case m = m0 does not apply in MS/MS data analyses. First, m = m0 can occur only when the database searched contains only false protein sequences, implying that one is using the wrong database for the peptide search; this can be corrected by using the correct database. Second, once a correct database is in place, any decent scoring function is expected to identify some true positive peptides before finding false positive peptides. That is, before V reaches one, S ≥ 1 already.

Finally, let us comment briefly on our protein identification statistics. Given that a protein's significance is obtained by combining the significances of its evidence peptides, users might ask whether this strategy suffers from the so-called "File Drawer Problem" of Rosenthal5. The "File Drawer Problem" emerges because people tend to report only studies showing high significance while keeping results of low significance in the "file drawer". Therefore, even for a fixed hypothesis with many possible independent tests, only the few that show high significance are reported, and when one combines their statistical significances, the combined significance tends to be exaggerated. This is why Rosenthal proposed a test of the effect of adding some number of insignificant results into the combined statistics. However, this does not apply to the way a protein is identified via its peptides, as described below. In a proteomics experiment, one does not expect an MS/MS spectrum to contain more than a few true positive peptides, owing to the use of chromatography and to the fact that only a small m/z window of precursor ions is further fragmented. Thus, a spectrum not reporting a peptide contained in a certain protein γ should not be viewed as yielding a poor statistical significance for the presence of γ. That is, spectra not reporting peptides contained in γ are NOT to be considered as studies put inside the "file drawer" when the presence of protein γ is concerned.

In addition, setting the cutoff peptide E-value to one for our protein identification means that the expected number of false positive peptides per spectrum is one. Given that the majority of spectra yield no true positive identifications and the remaining spectra yield no more than a few true peptides each, a spectrum-specific E-value of one is already very insignificant. With an E-value cutoff of one, the total number of false positive or insignificant peptides included is about the total number of spectra analyzed, while the number of true positive peptides is only a fraction of the total number of spectra.

Furthermore, each peptide’s E-value E is first transformed into a database P-value Pdb = 1 − e^(−E), and the statistics are then combined based on the database P-values. With E-value E = 1, the corresponding database P-value is Pdb = 1 − 1/e ≈ 0.632. This actually corresponds to a negative Z score when a Gaussian is assumed. That is, in terms of Rosenthal’s argument, we include many events with negative Z scores when combining peptide evidence to form a protein identification. In other words, with an E-value cutoff of one, when analyzing experimental data containing tens of thousands of spectra, the majority of the identifications considered are not in the top tier (with very high significance) but have mediocre or poor significances.
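The transformation from E-value to database P-value to Z score can be checked numerically with the Python standard library alone. In this sketch the helper-function names are our own; it confirms that E = 1 maps to Pdb ≈ 0.632 and hence to a negative Z score:

```python
# Sketch: convert a peptide E-value to the database P-value
# P_db = 1 - exp(-E), then to a Gaussian Z score (stdlib only).
import math
from statistics import NormalDist

def db_pvalue(e_value: float) -> float:
    """Database P-value P_db = 1 - exp(-E)."""
    return 1.0 - math.exp(-e_value)

def z_score(p: float) -> float:
    # One-sided Z score: P = 0.5 maps to Z = 0; P > 0.5 gives a
    # negative Z, as in Rosenthal's framework.
    return NormalDist().inv_cdf(1.0 - p)

p = db_pvalue(1.0)
print(round(p, 3))     # 0.632
print(z_score(p) < 0)  # True: E = 1 already corresponds to a negative Z
```

Thus every peptide admitted near the E-value cutoff of one contributes a negative Z score to the combined protein statistics, rather than inflating them.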

Combining the reasoning of the paragraphs above, the so-called “File Drawer Problem” does not happen in our protein identification method.

References:

1. Benjamini Y, Hochberg Y (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological), 57: 289-300.

2. Sorić B (1989) Statistical "Discoveries" and Effect-Size Estimation. Journal of the American Statistical Association, 84: 608-10.
DOI: 10.1080/01621459.1989.10478811

3. Storey JD, Tibshirani R (2003) Statistical significance for genomewide studies. PNAS, 100: 9440-5.
DOI: 10.1073/pnas.1530509100

4. Alves G, Ogurtsov AY, Yu YK (2010) RAId_aPS: MS/MS analysis with multiple scoring functions and spectrum-specific statistics. PLoS One, 5: 5.
PMID: 21103371

5. Rosenthal R (1979) The file drawer problem and tolerance for null results. Psychological Bulletin, 86: 638-41.
DOI: 10.1037/0033-2909.86.3.638


Relevant group publications:

1. Alves G, Yu YK (2016) Confidence assignment for mass spectrometry based peptide identifications via the extreme value distribution. Bioinformatics, 32: 32.
PMID: 27153659

2. Alves G, Yu YK (2015) Mass spectrometry-based protein identification with accurate statistical significance assignment. Bioinformatics, 31: 31.
PMID: 25362092

3. Alves G, Yu YK (2014) Accuracy evaluation of the unified P-value from combining correlated P-values. PLoS One, 9: 9.
PMID: 24663491

4. Alves G, Yu YK (2011) Combining independent, weighted P-values: achieving computational stability by a systematic expansion with controllable accuracy. PLoS One, 6: 6.
PMID: 21912585

5. Alves G, Ogurtsov AY, Yu YK (2010) RAId_aPS: MS/MS analysis with multiple scoring functions and spectrum-specific statistics. PLoS One, 5: 5.
PMID: 21103371

6. Alves G, Ogurtsov AY, Yu YK (2011) Assigning statistical significance to proteotypic peptides via database searches. J Proteomics, 74: 74.
PMID: 21055489

7. Alves G, Yu YK (2008) Statistical Characterization of a 1D Random Potential Problem - with applications in score statistics of MS-based peptide sequencing. Physica A, 387: 387.
PMID: 19918268

8. Alves G, Ogurtsov AY, Yu YK (2008) RAId_DbS: mass-spectrometry based peptide identification web server with knowledge integration. BMC Genomics, 9: 9.
PMID: 18954448

9. Alves G, Ogurtsov AY, Kwok S, Wu WW, Wang G, Shen RF, Yu YK (2008) Detection of co-eluted peptides using database search methods. Biol Direct, 3: 3.
PMID: 18597684

10. Alves G, Wu WW, Wang G, Shen RF, Yu YK (2008) Enhancing peptide identification confidence by combining search methods. J Proteome Res, 7: 7.
PMID: 18558733

11. Alves G, Ogurtsov AY, Wu WW, Wang G, Shen RF, Yu YK (2007) Calibrating E-values for MS2 database search methods. Biol Direct, 2: 2.
PMID: 17983478

12. Alves G, Ogurtsov AY, Yu YK (2007) RAId_DbS: peptide identification using database searches with realistic statistics. Biol Direct, 2: 2.
PMID: 17961253

13. Doerr TP, Alves G, Yu YK (2005) Ranked solutions to a class of combinatorial optimizations - with applications in mass spectrometry based peptide sequencing and a variant of directed paths in random media. Physica A, 354: 558-70.
DOI: 10.1016/j.physa.2005.03.004

14. Sardiu ME, Alves G, Yu YK (2005) Score statistics of global sequence alignment from the energy distribution of a modified directed polymer and directed percolation problem. Phys Rev E Stat Nonlin Soft Matter Phys, 72: 72.
PMID: 16485984

15. Alves G, Yu YK (2005) Robust accurate identification of peptides (RAId): deciphering MS2 data using a structured library search with de novo based statistics. Bioinformatics, 21: 21.
PMID: 16105903