Variable Selection For Generalized Canonical Correlation Analysis

 

 

Arthur Tenenhaus a, Cathy Philippe b,c, Vincent Guillemot b, Kim-Anh Le Cao d,
Jacques Grill c, Vincent Frouin b

 

a Supelec, Department of Signal Processing and Electronics Systems, Gif-sur-Yvette, France

b Laboratoire de Neuroimagerie Assistée par Ordinateur, NeuroSpin Center, I2BM, DSV, CEA, Gif-sur-Yvette, France

c Unité Mixte de Recherche 8203 du Centre National de la Recherche Scientifique "Vectorology and Anticancer Therapeutics", Gustave Roussy Cancer Institute, University Paris XI, Villejuif, France

d Institute for Molecular Biosciences and ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane QLD 4072, Australia

 

 

Contact: arthur.tenenhaus@supelec.fr

 

 

Abstract

 

Regularized Generalized Canonical Correlation Analysis (RGCCA) is a generalization of Regularized Canonical Correlation Analysis to three or more sets of variables. RGCCA is a component-based approach which aims at studying the relationships between several sets of variables. The quality and interpretability of the RGCCA components are likely to be affected by the usefulness and relevance of the variables in each block. Accordingly, it is important to identify within each block a subset of significant variables which are active in the relationships between blocks. In this paper, RGCCA is extended to address this issue of variable selection. Specifically, RGCCA is combined with an ℓ1-penalty in a unified framework, giving rise to Sparse GCCA (SGCCA).

 

Within this framework, blocks are not necessarily fully connected, and this allows SGCCA to recover, with a single monotonically convergent algorithm, a large number of well-known methods as particular cases. Finally, the usefulness of SGCCA is illustrated on a 3-block dataset which combines Gene Expression, Comparative Genomic Hybridization and a qualitative phenotype.
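
As a rough illustration of why an ℓ1-penalty leads to variable selection, the sketch below applies soft-thresholding to a weight vector and rescales the result to unit Euclidean norm. It is not the SGCCA algorithm itself: the function names, the example weights and the threshold `lam` are hypothetical, chosen only to show that small weights are set exactly to zero while the remaining variables are kept.

```python
import math

def soft_threshold(values, lam):
    """Shrink each entry toward zero by lam; entries with |v| <= lam become zero."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in values]

def sparse_unit_vector(values, lam):
    """Soft-threshold, then rescale to unit Euclidean norm if anything survives."""
    shrunk = soft_threshold(values, lam)
    norm = math.sqrt(sum(v * v for v in shrunk))
    if norm == 0.0:
        return shrunk  # every variable was removed; nothing to rescale
    return [v / norm for v in shrunk]

# Hypothetical (illustrative) weights for one block of five variables.
weights = [0.9, -0.05, 0.4, 0.02, -0.6]
sparse = sparse_unit_vector(weights, lam=0.1)

# Indices of the variables that remain active after thresholding.
selected = [i for i, v in enumerate(sparse) if v != 0.0]
print(selected)
```

In this toy setting a larger `lam` removes more variables, which mirrors the role a sparsity parameter would play within each block.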

 

 

Supplementary material