A sparse PLS for variable selection when integrating omics data

Kim-Anh Lê Cao; Debra Rossouw; Christèle Robert-Granié; Philippe Besse

doi:10.2202/1544-6115.1390

Stat Appl Genet Mol Biol. 2008;7(1):Article 35. doi: 10.2202/1544-6115.1390. Epub 2008 Nov 18.

Authors

Kim-Anh Lê Cao¹, Debra Rossouw, Christèle Robert-Granié, Philippe Besse

Affiliation

¹ INRA UR 631, Université de Toulouse. k.lecao@imb.uq.edu.au

PMID: 19049491
DOI: 10.2202/1544-6115.1390

Abstract

Recent biotechnology advances allow for multiple types of omics data, such as transcriptomic, proteomic or metabolomic data sets to be integrated. The problem of feature selection has been addressed several times in the context of classification, but needs to be handled in a specific manner when integrating data. In this study, we focus on the integration of two-block data that are measured on the same samples. Our goal is to combine integration and simultaneous variable selection of the two data sets in a one-step procedure using a Partial Least Squares regression (PLS) variant to facilitate the biologists' interpretation. A novel computational methodology called "sparse PLS" is introduced for a predictive analysis to deal with these newly arisen problems. The sparsity of our approach is achieved with a Lasso penalization of the PLS loading vectors when computing the Singular Value Decomposition. Sparse PLS is shown to be effective and biologically meaningful. Comparisons with classical PLS are performed on a simulated data set and on real data sets. On one data set, a thorough biological interpretation of the obtained results is provided. We show that sparse PLS provides a valuable variable selection tool for highly dimensional data sets.
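
The abstract's central idea, a Lasso penalty applied to the PLS loading vectors while computing the Singular Value Decomposition, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the SVD is taken on the cross-product matrix M = XᵀY and that the penalty is applied by soft-thresholding inside an alternating (power-iteration style) update, and it extracts only the first pair of loading vectors, with no deflation and no tuning of the penalties. The function names, the penalty values `lam_u` and `lam_v`, and the toy data are all hypothetical.

```python
import math

def soft_threshold(vec, lam):
    """Lasso-style soft-thresholding: shrink each entry toward zero by lam."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in vec]

def normalize(vec):
    """Scale vec to unit Euclidean norm (an all-zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def matvec(mat, vec):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, vec)) for row in mat]

def transpose(mat):
    return [list(col) for col in zip(*mat)]

def cross_product(X, Y):
    """M = X^T Y, where rows of X and Y are the same samples."""
    return [[sum(a * b for a, b in zip(xc, yc)) for yc in transpose(Y)]
            for xc in transpose(X)]

def sparse_pls_loadings(X, Y, lam_u, lam_v, n_iter=100, tol=1e-8):
    """First pair of sparse loading vectors (u for X, v for Y).

    Assumed scheme: alternate u <- soft(M v), v <- soft(M^T u),
    renormalizing after each step, until the vectors stop changing.
    """
    M = cross_product(X, Y)
    Mt = transpose(M)
    v = normalize([1.0] * len(Mt))   # arbitrary starting point (assumption)
    u = normalize(matvec(M, v))
    for _ in range(n_iter):
        u_new = normalize(soft_threshold(matvec(M, v), lam_u))
        v_new = normalize(soft_threshold(matvec(Mt, u_new), lam_v))
        change = max(abs(a - b) for a, b in zip(u + v, u_new + v_new))
        u, v = u_new, v_new
        if change < tol:
            break
    return u, v

# Toy, column-centered two-block data: 4 samples, 3 X variables, 2 Y variables.
# Only the first X variable and the first Y variable are strongly related.
X = [[2.0, 0.1, -0.1], [-2.0, 0.1, 0.1], [1.0, -0.1, 0.1], [-1.0, -0.1, -0.1]]
Y = [[1.9, 0.1], [-2.1, -0.1], [1.1, -0.1], [-0.9, 0.1]]

u_sparse, v_sparse = sparse_pls_loadings(X, Y, lam_u=0.5, lam_v=0.5)
u_dense, v_dense = sparse_pls_loadings(X, Y, lam_u=0.0, lam_v=0.0)
```

With both penalties set to zero the update reduces to an ordinary power iteration on M, so every variable keeps a nonzero loading. With the penalties switched on, the weakly related variables in this toy example receive loadings of exactly zero, which is the sense in which variable selection and integration happen in one step.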

Publication types

Comparative Study

MeSH terms

Animals
Biometry / methods*
Data Interpretation, Statistical
Fermentation / genetics
Gene Expression Profiling / statistics & numerical data
Genomics / statistics & numerical data
Least-Squares Analysis*
Liver / drug effects
Liver / metabolism
Male
Metabolomics / statistics & numerical data
Multivariate Analysis
Proteomics / statistics & numerical data
Rats
Saccharomyces cerevisiae / genetics
Saccharomyces cerevisiae / metabolism
Toxicology / statistics & numerical data