
Review Article

A Review of Feature Extraction Software for Microarray Gene Expression Data

Ching Siang Tan, Wai Soon Ting, Mohd Saberi Mohamad, Weng Howe Chan, Safaai Deris, and Zuraini Ali Shah

Artificial Intelligence and Bioinformatics Research Group, Faculty of Computing, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia

Correspondence should be addressed to Mohd Saberi Mohamad; saberi@utm.my

Received 23 April 2014; Revised 24 July 2014; Accepted 24 July 2014; Published 31 August 2014

Academic Editor: Dongchun Liang

Copyright © 2014 Ching Siang Tan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

When gene expression data are too large to be processed, they are transformed into a reduced representation set of genes. Transforming large-scale gene expression data into a set of genes is called feature extraction. If the genes extracted are carefully chosen, this gene set can extract the relevant information from the large-scale gene expression data, allowing further analysis by using this reduced representation instead of the full-size data. In this paper, we review numerous software applications that can be used for feature extraction. The software reviewed is mainly for Principal Component Analysis (PCA), Independent Component Analysis (ICA), Partial Least Squares (PLS), and Local Linear Embedding (LLE). A summary and sources of the software are provided in the last section for each feature extraction method.

1. Introduction

The advances of microarray technology allow the expression levels of thousands of genes to be measured simultaneously [1]. This technology has caused an explosion in the amount of microarray gene expression data. However, the gene expression data generated are high-dimensional, containing a huge number of genes and a small number of samples. This is called the "large $p$, small $n$" problem [2]. The high dimensionality of the data is the main problem when analysing them. As a result, besides gene selection methods, feature extraction methods are also important in order to reduce the dimensionality of high-dimensional data. Instead of eliminating irrelevant genes, feature extraction methods work by transforming the original data into a new representation. Feature extraction is usually better than gene selection in terms of causing less information loss. As a result, the high-dimensionality problem can be addressed using feature extraction.

Software is a set of machine-readable instructions that direct a computer's processor to perform specific operations. With increases in the volume of data generated by modern biomedical studies, software is required to facilitate and ease the understanding of biological processes. Bioinformatics has emerged as a discipline in which emphasis is placed on easily understanding biological processes. Gheorghe and Mitrana [3] relate bioinformatics to computational biology and natural computing. Higgs and Attwood [4] believe that bioinformatics is important in the context of evolutionary biology.

In this paper, the software applications that can be used for feature extraction are reviewed. The software reviewed is mainly for Principal Component Analysis (PCA), Independent Component Analysis (ICA), Partial Least Squares (PLS), and Local Linear Embedding (LLE). In the last section for each feature extraction method, a summary and sources are provided.

2. Software for Principal Component Analysis (PCA)

In the domain of dimension reduction, PCA is one of the most renowned techniques. The fundamental concept of PCA is to decrease the dimensionality of a given data set whilst maintaining as much as possible of the variation present in the initial predictor variables. This is attained by transforming the $p$ initial variables $X = [x_1, x_2, \ldots, x_p]$ to a new set of $q$ predictor variables $T = [t_1, t_2, \ldots, t_q]$, which are linear combinations of the initial variables. In mathematical terms, PCA successively maximizes the variance of a linear combination of the initial predictor variables,

$$u_q = \operatorname*{arg\,max}_{u^{T}u = 1} \operatorname{Var}(Xu), \tag{1}$$

subject to the constraint $u_i^{T} S_X u_j = 0$ for all $1 \le i < j$. The orthogonality constraint ensures that the linear combinations are uncorrelated, that is, $\operatorname{Cov}(Xu_i, Xu_j) = 0$ for $i \ne j$. These linear combinations are denoted as the principal components (PCs):

$$t_i = X u_i. \tag{2}$$

The projection vectors (also known as the weighting vectors) $u$ can be obtained by eigenvalue decomposition of the covariance matrix $S_X$,

$$S_X u_i = \gamma_i u_i, \tag{3}$$

where $\gamma_i$ is the $i$th eigenvalue in decreasing order, for $i = 1, \ldots, q$, and $u_i$ is the corresponding eigenvector. The eigenvalue $\gamma_i$ measures the variance of the $i$th PC, and the eigenvector $u_i$ gives the weights for the linear transformation (projection).
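To make (1)-(3) concrete, the following minimal R sketch (our illustration, not taken from any of the reviewed packages) computes principal components by eigendecomposition of the covariance matrix on simulated data; the variable names mirror the notation above.

# Simulated data: 20 samples (rows) by 5 predictor variables (columns)
set.seed(1)
X <- matrix(rnorm(100), nrow = 20, ncol = 5)
X <- scale(X, center = TRUE, scale = FALSE)  # centre the predictors

S.X <- cov(X)          # covariance matrix S_X
eig <- eigen(S.X)      # solves S_X u_i = gamma_i u_i, eigenvalues in decreasing order
u <- eig$vectors       # projection (weighting) vectors u_i
gamma <- eig$values    # gamma_i: variance of the i-th PC

T.scores <- X %*% u       # principal components t_i = X u_i, as in (2)
round(cov(T.scores), 10)  # off-diagonal entries are ~0: the PCs are uncorrelated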

2.1. FactoMineR. FactoMineR is an R package that provides various functions for the analysis of multivariate data [5]. The newest version of this package is maintained by Husson et al. [6]. There are a few main features provided by this package; for example, different types of variables, data structures, and supplementary information can be taken into account. Besides that, it offers dimension reduction methods such as Principal Component Analysis (PCA), Multiple Correspondence Analysis (MCA), and Correspondence Analysis (CA). The steps in implementing PCA are described in Le et al. [5] and Hoffmann [7]. For PCA, there are three main functions: for performing the PCA, plotting it, and printing its results. This package is mainly for Windows, MacOS, and Linux.
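As a brief illustration, a PCA run in FactoMineR typically looks like the following sketch; the decathlon dataset shipped with the package stands in for a gene expression matrix here.

library(FactoMineR)

data(decathlon)                    # example data shipped with the package
res.pca <- PCA(decathlon[, 1:10],  # PCA on the quantitative columns
               scale.unit = TRUE,  # standardize the variables
               ncp = 5,            # number of components to keep
               graph = FALSE)      # suppress the default graphs

summary(res.pca)              # print the results
plot(res.pca, choix = "ind")  # plot the individuals (samples)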

2.2. ExPosition. ExPosition is an R package for the multivariate analysis of quantitative and qualitative data. ExPosition stands for Exploratory Analysis with the Singular Value Decomposition. The newest version of this package is maintained by Beaton et al. [8]. A variety of multivariate methods are provided in this package, such as PCA, multidimensional scaling (MDS), and Generalized PCA. All of these methods can be performed by using the corePCA function in this package. Another function, epPCA, can be applied to implement PCA. Besides that, Generalized PCA can be implemented using the function epGPCA as well. All of these methods are used to analyse quantitative data. A plotting function is also offered by this package in order to plot the results of the analysis. This package can be installed on Windows, Linux, and MacOS.
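A minimal epPCA call might look as follows (a sketch; argument and output names follow the package documentation, and the iris measurements stand in for expression values).

library(ExPosition)

X <- as.matrix(iris[, 1:4])  # quantitative data, a stand-in for expression values
res <- epPCA(X, center = TRUE, scale = TRUE, graphs = FALSE)

res$ExPosition.Data$fi[1:5, 1:2]  # factor scores of the first five observations
res$ExPosition.Data$t             # percentage of explained variance per component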

2.3. amap. The R package "amap" was developed for clustering as well as PCA, with both parallelized functions and robust methods. It is an R package for multidimensional analysis. The newest version is maintained by Lucas [9]. Three different types of PCA are provided by this package: PCA, Generalized PCA, and Robust PCA. The PCA methods can be implemented using the functions acp and pca for PCA, acpgen for Generalized PCA, and acprob for Robust PCA. This package also allows the implementation of correspondence factorial analysis through the function afc. Besides that, a plotting function is also provided for plotting the results of PCA as a graphical representation. The clustering methods offered by this package are k-means and hierarchical clustering. The dissimilarity matrix and distance matrix can be computed using this package as well. This package is mainly for Windows, Linux, and MacOS.
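For example, a basic acp run might look like this sketch (USArrests is a stand-in dataset):

library(amap)

p <- acp(USArrests, center = TRUE, reduce = TRUE)  # PCA with centring and scaling
p        # print the component summary
plot(p)  # graphical representation of the PCA results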

2.4. ADE-4. ADE-4 was originally developed by Thioulouse et al. [10] as software for analyzing multivariate data and displaying graphics. This software includes a variety of methods, such as PCA, CA, Principal Component Regression, PLS, Canonical Correspondence Analysis, Discriminant Analysis, and others. Besides that, this software is implemented in an R environment as the R package "ade4". The newest version of this package is maintained by Penel [37]. In this package, PCA can be performed by using the dudi.pca function. A visualization function is also provided in order to visualize the results as a graphical representation. In previous studies, this package was implemented by Dray and Dufour [38] to identify and understand ecological community structures. This package is mainly for Linux, Windows, and MacOS.
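In ade4, a dudi.pca call and the accompanying display would look roughly like this sketch, using a dataset shipped with the package in place of expression data:

library(ade4)

data(deug)                        # example dataset shipped with ade4
pca1 <- dudi.pca(deug$tab,        # data frame of quantitative variables
                 scannf = FALSE,  # do not ask interactively for the number of axes
                 nf = 2)          # keep two principal axes

pca1$eig       # eigenvalues
scatter(pca1)  # graphical display of the results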

2.5. MADE4. MADE4 (microarray ade4) was developed by Culhane et al. [11] for multivariate analysis of gene expression data based on the R package "ade4". Basically, it is an extension of the R package "ade4" for microarray data. The purpose of writing this software was to help users in the analysis of microarray data using multivariate analysis methods. This software is able to handle a variety of gene expression data formats, and new visualization software has been added to the package in order to facilitate the visualization of microarray data. Other extra features, such as data preprocessing and gene filtering, are included as well. This package was further improved by the addition of the LLSimpute algorithm, added by Moorthy et al. [39] to handle the missing values in microarray data. It is implemented in an R environment. The advance of this package is that multiple datasets can be integrated to carry out analysis of microarray data. The newest version is maintained by Culhane [40]. This package can be installed on Linux, Windows, and MacOS.

2.6. XLMiner. XLMiner is add-in software for Microsoft Excel that offers numerous data mining methods for analysing data [12]. It offers a quick start in the use of a variety of data mining methods. This software can be used for data reduction using PCA, classification using Neural Networks or Decision Trees [41, 42], class prediction, data exploration, affinity analysis, and clustering. In this software, PCA can be implemented using the Principal Component tab [43]. This software is implemented in Excel; as a result, the dataset should be in an Excel spreadsheet. In order to start the implementation of XLMiner, the dataset needs to be manually partitioned into training, validation, and test sets. Please see http://www.solver.com/xlminer-data-mining for further details. This software can be installed on Windows and MacOS.

2.7. ViSta. ViSta stands for Visual Statistics System and can be used for multivariate data analysis and visualization in order to provide a better understanding of the data [13]. This software is based on the Lisp-Stat system [44]. It is an open source system that can be freely distributed for multivariate analysis and visualization. PCA and multiple and simple CA are provided in this software. Its main advance is that the data analysis is guided in a visualization environment in order to generate more reliable and accurate results. The four state-of-the-art visualization methods offered by this software are GuideMaps [45], WorkMaps [46], Dynamic Statistical Visualization [47], and Statistical Re-Vision [48]. The plug-ins for PCA can be downloaded from http://www.mdp.edu.ar/psicologia/vista/vista.htm. An example of implementation of the analysis using PCA can be viewed in Valero-Mora and Ledesma [49]. This software can be installed on Windows, Unix, and Macintosh.

2.8. imDEV. Interactive Modules for Data Exploration and Visualization (imDEV) [14] is an application of RExcel that integrates R and Excel for the analysis, visualization, and exploration of multivariate data. It is used in Microsoft Excel as an add-in implemented through an R package. Basically, it is implemented in Visual Basic and R. In this software, numerous dimension reduction methods are provided, such as PCA, ICA, PLS regression, and Discriminant Analysis. Besides that, this software also offers clustering, imputation of missing values, feature selection, and data visualization. Two- and three-dimensional visualization methods are offered, such as dendrograms, distribution plots, biplots, and correlation networks. This software is compatible with a few versions of Microsoft Excel, such as Excel 2007 and 2010.

2.9. Statistics Toolbox. The Statistics Toolbox offers a variety of algorithms and tools for data modelling and data analysis. Multivariate data analysis methods are offered by this toolbox, including PCA, clustering, dimension reduction, factor analysis, visualization, and others. In the Statistics Toolbox of MATLAB, several PCA functions are provided for multivariate analysis, for example, pcacov, princomp, and pcares (MathWorks). Most of these functions are used for dimension reduction: pcacov is used for covariance matrices, princomp for raw data matrices, and pcares for residuals from PCA. All of these functions are implemented in MATLAB.

2.10. Weka. Weka [16] is data mining software that provides a variety of machine learning algorithms. This software offers feature selection, data preprocessing, regression, classification, and clustering methods [50]. It is implemented in a Java environment. PCA is used as a dimension reduction method in Weka to reduce the dimensionality of complex data through transformation. However, not all datasets are complete. Prabhume and Sathe [51] introduced a new PCA filter for Weka in order to solve the problem of incomplete datasets; it works by estimating the complete dataset from the incomplete dataset. This software is mainly for Windows, Linux, and MacOS.

211 NAG Library In NAG Library the function of PCAis provided as the g03aa routine [17] in both C and Fort-ran This routine performs PCA on data matrices This soft-ware was developed by the Numerical Algorithms GroupIn the NAG Library more than 1700 algorithms are offeredfor mathematical and statistical analysis For PCA it is suit-able for multivariate methods G03 Other methods prov-ided are correlation analysis wavelet transforms and par-tial differential equations Please refer to httpwwwnagcomnumericMBmanual 22 1pdfG03g03aapdf for fur-ther details about the g03aaa routine This software can beinstalled on Windows Linux MacOS AIX HP UX andSolaris

2.12. Case Study. In this section, we discuss the implementation of coinertia analysis (CIA) for cross-platform visualization in MADE4 and ADE4 to perform multivariate analysis of microarray datasets. As a demonstration, the analysis was applied to data on 4 childhood tumors (NB, BL-NHL, EWS, and RMS) from a microarray gene expression profiling study [52]. From these data, a subset (khan$train, 306 genes × 64 cases), a factor denoting the class of each case (khan$train.classes, length = 64), and a gene annotation data frame are available in the khan dataset in MADE4.

> library(made4)
> data(khan)
> dataset = khan$train
> fac = khan$train.classes
> geneSym = khan$annotation$Symbol
> results.coa <- ord(dataset, type = "coa")
> par(mfrow = c(1, 2))
> plotarrays(results.coa, classvec = fac)
> plotgenes(results.coa, genelabels = geneSym)

Figure 1 shows the PCA of the 306-gene subset. Taking the origin as the point of reference, the further a gene and a case are projected in the same direction, the stronger the association between that gene and that case (the gene is upregulated in that array sample).

2.13. Summary of PCA Software. Tables 1 and 2 show the summary and sources of PCA software, respectively. Table 3 discusses related work on this software.


Figure 1: Plot of genes.

3. Software for Independent Component Analysis (ICA)

ICA is considered a valuable extension of PCA and has been established considering the blind separation of independent sources from their linear mixture [53]. In a way, the starting point of ICA is the uncorrelatedness property of general PCA. Based on an $n \times p$ data matrix $X$, whose rows $r_i$ ($i = 1, \ldots, n$) correspond to observational variables and whose columns $c_j$ ($j = 1, \ldots, p$) are the individuals of the corresponding variables, the ICA model of $X$ can be written as

$$X = AS, \tag{4}$$

where, without loss of generality, $A$ is an $n \times n$ mixing matrix and $S$ is an $n \times p$ source matrix, subject to the requirement that the rows of $S$ be as statistically independent as possible. "Independent components" are the new variables contained in the rows of $S$; that is, the observed variables are linear mixtures of the independent components. Independence is measured by the mutual information $I = \sum_k H(S_k) - H(S)$, where $H(S_k) = -\int p(S_k) \log p(S_k)\, dS_k$ is the marginal entropy of the variable $S_k$, $p(S_k)$ is the probability density function, and $H(S)$ is the joint entropy [54]. The values of the independent components can be obtained by discovering the correct linear combinations of the observational variables, since the mixing can be inverted as

$$U = S = A^{-1} X = W X. \tag{5}$$

3.1. FastICA. FastICA is the most widely used method of ICA [55]. It is implemented in an R environment as the R package "fastICA" for performing ICA and Projection Pursuit using the FastICA algorithm. FastICA was first introduced by Hyvärinen [54] for single and multiple component extraction. The FastICA algorithm is based on a fixed-point iteration scheme maximizing non-Gaussianity as a measure of statistical independence. This package is maintained by Marchini et al. [18]. ICA is used to extract the informative features through a transformation of the observed multidimensional random vectors into independent components. This package is mainly for Windows, Linux, and MacOS. FastICA is also implemented in MATLAB, where it provides a simple user interface and a powerful implementation of the fast fixed-point algorithm for ICA as well as projection pursuit.
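The sketch below illustrates the model in (4)-(5) with the fastICA package: two independent non-Gaussian sources are mixed linearly and then recovered from the mixtures (the data are simulated, mirroring the package's documentation example).

library(fastICA)

set.seed(1)
S <- cbind(sin((1:1000) / 20), runif(1000))      # independent sources
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)  # mixing matrix
X <- S %*% A                                     # observed mixtures, X = SA

ica <- fastICA(X, n.comp = 2)  # fixed-point algorithm maximizing non-Gaussianity
head(ica$S)  # estimated independent components
ica$A        # estimated mixing matrix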

3.2. JADE. JADE is an R package that provides a function for implementing ICA. The package is maintained by Nordhausen et al. [19]. In this package, Cardoso's JADE algorithm [56] is provided for ICA. In addition to the JADE algorithm, other Blind Source Separation (BSS) methods, such as the SOBI [57] and AMUSE [58] methods, are offered; both are mainly used for solving second order BSS problems. The Amari error [59] is offered to evaluate the performance of the ICA algorithm. This package can be installed on Linux, Windows, and MacOS.
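A minimal JADE run on a synthetic mixture could look like this sketch (output field names as documented for the package):

library(JADE)

set.seed(2)
S <- cbind(sin((1:500) / 10), sign(rnorm(500)))  # two independent sources
A <- matrix(c(1, 0.5, 0.4, 1), nrow = 2)         # mixing matrix
X <- S %*% t(A)                                  # observed mixtures

res <- JADE(X, n.comp = 2)  # Cardoso's JADE algorithm
head(res$S)  # estimated independent components
res$A        # estimated mixing matrix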

3.3. High Performance Signal Analysis Tools (HiPerSAT). HiPerSAT is written in C++ for processing electroencephalography (EEG) data with data whitening and ICA [20]. MPI and OpenMP are used to perform parallel analysis of ICA. Basically, this software is used to analyse EEG data in order to understand the neurological components of brain activity. In this software, the FastICA, SOBI, and Infomax algorithms are offered. HiPerSAT is integrated into MATLAB and EEGLAB [60]. EEGLAB is MATLAB-based software that is used for analysing EEG data. The advantage of HiPerSAT is that it can handle larger datasets than MATLAB: HiPerSAT is able to handle large datasets without partitioning, whereas EEGLAB requires data partitioning. Data whitening is performed before implementing the algorithms. This software can be installed on all platforms.

3.4. MineICA. MineICA is an R package that supplies the implementation of ICA on transcriptomic data [21]. The main purpose of MineICA is to provide an easier way of interpreting the decomposition results from ICA. Besides that, this software also provides a correlation-based graph for comparing the components from different datasets. The newest version of this package is maintained by Biton [61]. This package provides features such as storage of ICA results, annotation of features, and visualization of the results of ICA. This package can be installed on Linux, MacOS, and Windows.

3.5. Pearson Independent Component Analysis. Karvanen [22] developed an R package for a feature extraction technique based on the Pearson ICA algorithm. This is a mutual information-based blind source separation approach which applies the Pearson system as a parametric model. In order to extract the independent components using the ICA algorithm, the mutual information of the components has to be minimized. Minimization of mutual information requires a score function, and the Pearson system is used to model this score function; the parameters of the Pearson system are estimated by the method of moments. In order to speed up the algorithm, tanh nonlinearity is used when the distribution is far from Gaussian.
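An illustrative call to the package's main function might look like the following sketch (function and output names per the PearsonICA documentation; the data are simulated):

library(PearsonICA)

set.seed(3)
S <- cbind(runif(1000), sin((1:1000) / 30))  # independent source signals
A <- matrix(c(2, 1, 1, 2), nrow = 2)         # mixing matrix
X <- S %*% A                                 # observed mixtures

res <- PearsonICA(X, n.comp = 2)  # minimize mutual information via the Pearson system
head(res$S)                       # estimated independent components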


Table 1: A summary of PCA software (number, software, author/year, language, features).

1. FactoMineR — Le et al. [5] — R — (i) Various dimension reduction methods such as PCA, CA, and MCA. (ii) Different types of variables, data structures, and supplementary information are considered. (iii) The PCA function can handle missing values.

2. ExPosition — Beaton et al. [8] — R — (i) Numerous multivariate analysis methods such as PCA and Generalized Principal Component Analysis (GPCA). (ii) Can be applied to quantitative and qualitative data. (iii) Implementation of Singular Value Decomposition.

3. amap — Lucas [9] — R — (i) Different types of PCA are provided: PCA, Generalized PCA, and Robust PCA. (ii) Clustering methods are provided, such as hierarchical clustering and k-means clustering. (iii) Plotting function for PCA. (iv) Computing distance and dissimilarity matrices.

4. ADE-4 — Thioulouse et al. [10] — R — A variety of methods such as PCA, CA, Principal Component Regression, PLS, and others are offered.

5. MADE4 — Culhane et al. [11] — R — (i) Functions provided by ADE-4. (ii) Integration of multiple datasets for multivariate analysis. (iii) Functions for visualizing and plotting the results of analysis, including 3D plots. (iv) Addition of the LLSimpute algorithm for imputation of missing values.

6. XLMiner — Witten and Frank [12] — implemented in Excel — (i) Provision of data reduction methods such as PCA. (ii) Can be used for classification, clustering, data preprocessing, data normalization, and others.

7. ViSta — Young et al. [13] — C++, Fortran, XLisp, and ViDAL — (i) Multivariate analysis methods are offered, such as PCA, Interactive Cluster Analysis, and Parallel Boxplots. (ii) Provision of dynamic and high-interaction visualization for displaying multiple views of data.

8. imDEV — Grapov and Newman [14] — Visual Basic and R — (i) Data preprocessing, missing value imputation, and data transformations. (ii) Clustering methods. (iii) Dimension reduction methods: PCA and ICA. (iv) Feature selection methods. (v) Visualization of data dependencies.

9. Statistics Toolbox — The MathWorks [15] — MATLAB — (i) Multivariate statistics such as PCA, clustering, and others. (ii) Statistical plots, probability distributions, linear models, nonlinear models for regression, and others are provided.

10. Weka — Hall et al. [16] — Java — A variety of machine learning algorithms are provided, such as feature selection, data preprocessing, regression, dimension reduction, classification, and clustering methods.

11. NAG Library — NAG Toolbox for MATLAB [17] — Fortran and C — (i) Provision of more than 1700 mathematical and statistical algorithms. (ii) Multivariate analysis using PCA can be implemented using the g03aa routine.


3.6. Maximum Likelihood Independent Component Analysis. Teschendorff [23] developed an R package for ICA using maximum likelihood estimation. This method was first introduced by Hyvärinen et al. [62]. It uses a fixed-point algorithm for the maximum likelihood estimation. For a fixed set of data and an underlying statistical model, maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Maximum likelihood estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution. By using a maximum likelihood framework and controlling the number of algorithm runs, this fixed-point algorithm provides a very fast implementation of likelihood maximization.

3.7. Sample Case Study. In this section, we utilize MineICA on microarray-based gene expression data of 200 breast cancer tumors stored in the package breastCancerMAINZ [63], based on a study done by Biton et al. [21].


Table 2: Sources of PCA software.

1. FactoMineR: http://cran.r-project.org/web/packages/FactoMineR/index.html
2. ExPosition: http://cran.r-project.org/web/packages/ExPosition/index.html
3. amap: http://cran.r-project.org/web/packages/amap/index.html
4. ADE-4: http://cran.r-project.org/web/packages/ade4/index.html
5. MADE4: http://www.bioconductor.org/packages/2.11/bioc/html/made4.html
6. XLMiner: http://www.solver.com/xlminer-data-mining
7. ViSta: http://www.visualstats.org; http://www.mdp.edu.ar/psicologia/vista/vista.htm
8. imDEV: http://sourceforge.net/projects/imdev/
9. Statistics Toolbox: http://www.mathworks.com/matlabcentral/fileexchange/30792-pca-principal-component-analysis
10. Weka: http://www.cs.waikato.ac.nz/ml/weka/downloading.html
11. NAG Library: http://www.nag.com/downloads/cldownloads.asp

Table 3: Related work.

FactoMineR — Le et al. [5] — Motivation: (i) providing a multivariate data analytic technique for applications in biological systems; (ii) to combine "omics" data structured into groups; (iii) to help in their functional interpretation. Advantages: (i) it provides a geometrical point of view and a lot of graphical outputs; (ii) it can take into account a structure on the data; (iii) a GUI is available.

MADE4 — Culhane et al. [11] — Motivation: to provide a simple-to-use tool for multivariate analysis of microarray data. Advantages: (i) accepts a wide variety of gene expression data input formats; (ii) no additional data processing is required.

Statistics Toolbox — The MathWorks [15] — Motivation: high-dimensional and complex microarray data need automatic, computer-aided tools for analysis. Advantages: elegant matrix support and visualization.

imDEV — Grapov and Newman (2012) [14] — Motivation: omics experiments generate complex high-dimensional data requiring multivariate analyses. Advantages: (i) user-friendly graphical interface; (ii) visualizations can be exported directly from the R plotting interface in a variety of file formats; (iii) dynamic loading of R objects between analysis sessions.

In this study, we focus on how MineICA can be utilized to study an ICA-based decomposition. The pseudocode for this case study is as follows:

(1) Load the library and the data.
(2) Create an IcaSet object:
(2.1) load an example of expression data;
(2.2) run ICA;
(2.3) create a MineICAParams object (function buildMineICAParams);
(2.4) create an IcaSet instance (function buildIcaSet);
(2.5) IcaSet basics.
(3) Run the global analysis.
(4) Run the analysis by calling individual functions:
(4.1) write the description of contributing genes or features (function writeProjByComp);
(4.2) plot heatmaps of the contributing elements (function plot_heatmapsOnSel);
(4.3) gene enrichment analysis (function runEnrich);
(4.4) association with sample variables;
(4.5) clustering of the samples according to each component;
(4.6) comparison of IcaSet objects (function runCompareIcaSets).

Figure 2 shows the correlation-based graph denoting the relationships between the independent components (ICs) obtained from the microarray data of four breast cancer sample sets. Every node represents an IC, and its color denotes the dataset of origin. Edge thickness represents the extent of correlation between the linked ICs; black edges link reciprocal nodes.

3.8. Summary of ICA Software. Tables 4 and 5 show the summary and sources of ICA software, respectively.


Figure 2: Correlation-based graph.

4. Software for Partial Least Squares (PLS)

The fundamental hypothesis of PLS is that the observed data are generated by a system or process which is driven by a small number of latent features. Accordingly, PLS aims at finding uncorrelated linear transformations of the initial predictor features which have high covariance with the response features. Based on these latent components, PLS predicts the response features $y$ (the task of regression) and reconstructs the initial matrix $X$ (the task of data modelling) at the same time. The purpose of building components in PLS is to maximize the covariance between the response variable $y$ and the initial predictor variables $X$,

$$w_q = \operatorname*{arg\,max}_{w^{T}w = 1} \operatorname{Cov}(Xw, y), \tag{6}$$

subject to the constraint $w_i^{T} S_X w_j = 0$ for all $1 \le i < j$. The crucial task of PLS is to obtain the vectors of optimal weights $w_i$ ($i = 1, \ldots, q$) to build a small number of components, whereas PCA is an "unsupervised" method that utilizes the $X$ data only. To develop the components $[t_1, t_2, \ldots, t_q]$, PLS decomposes $X$ and $y$ to yield a bilinear representation of the data [64]:

$$X = t_1 w_1^{T} + t_2 w_2^{T} + \cdots + t_q w_q^{T} + e,$$
$$y = t_1 v_1^{T} + t_2 v_2^{T} + \cdots + t_q v_q^{T} + f, \tag{7}$$

where the $w$'s are vectors of weights for building the PLS components $t = Xw$, the $v$'s are scalars, and $e$ and $f$ are the residuals. The idea of PLS is to estimate $w$ and $v$ by regression.
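To connect (6)-(7) with practice, the sketch below fits a two-component PLS regression with the R package pls (the package listed as a source in Table 7); the gasoline data shipped with that package stand in for an expression matrix.

library(pls)

data(gasoline)                        # NIR spectra with an octane response
fit <- plsr(octane ~ NIR, ncomp = 2,  # two latent components t_1, t_2
            data = gasoline,
            validation = "CV")        # cross-validated fit

summary(fit)                 # explained variance and cross-validation error
head(scores(fit))            # component scores t = Xw
loading.weights(fit)[1:5, ]  # weight vectors w for the first five predictors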

4.1. Partial Least Squares Discriminant Analysis. Barker and Rayens [24] developed a PLS method for discriminant analysis, as the original PLS was not designed for discriminant purposes. PLS Discriminant Analysis is used to find a linear regression model by projecting the dependent features and the independent features to a new space; the fundamental relations can then be extracted from the latent variables. This method was implemented in software called Unscrambler, which was first developed by Martens and Naes [65]. Unscrambler is a commercial software product for multivariate data analysis, used for analysing large and complex datasets quickly and easily using the power of multivariate analysis. Moreover, it also offers exceptional data visualization.

4.2. Least Squares Partial Least Squares. Jørgensen et al. [25] proposed a method using an iterative combination of PLS and ordinary least squares to extract the relationship between the predictor variables and the responses. This method is based on a combination of least squares estimates for the design variables and PLS regression on the spectra. The PLS scores are incorporated into the ordinary least squares equation on the spectra; the idea is to separate the information from the spectral and design matrices cleanly. This method is able to extract the information even when fewer components are used. In addition, it is insensitive to the relative scaling of the spectra and the process. Moreover, this combination method is also less biased than the individual PLS technique.

4.3. Powered Partial Least Squares Discriminant Analysis. Liland and Indahl [26] extended Powered PLS to Powered PLS Discriminant Analysis to overcome the difficulty of extracting information in multivariate classification problems. This method can construct more efficient group separation and generate more interpretable outcomes than the ordinary Partial Least Squares Discriminant Analysis technique. The features extracted by Powered PLS can contribute to revealing the relevance of particular predictors and often require smaller and simpler components than ordinary PLS. Moreover, the optimization task is equivalent to maximizing the correlation between the transformed predictors and the groups, which makes it possible to discard the influence of less important predictors. The authors also made this method available in an R package.

4.4. Penalized Partial Least Squares. Kramer et al. [27] proposed a combination of the feature extraction technique PLS with a penalization framework. This method is an extension of PLS regression using a penalization technique. Ordinary PLS is suited to regression problems, minimizing a quadratic loss function iteratively. In addition, the representation in terms of kernel matrices provides an intuitive geometric interpretation of the penalty term; the penalty terms control the roughness of the estimated functions. This method is used to extract relevant information for high-dimensional regression problems and also for noisy data. It was developed by Kramer and her colleagues [66] as an R package.


Table 4: Summary of ICA software (number, software, author/year, language, features).

1. FastICA — Marchini et al. [18] — R and MATLAB — The FastICA algorithm is provided for implementing the analysis using ICA.

2. JADE — Nordhausen et al. [19] — R — (i) The JADE algorithm is provided for ICA. (ii) Other BSS methods, such as AMUSE and SOBI, are offered.

3. HiPerSAT — Keith et al. [20] — C++, MATLAB, and EEGLAB — (i) Integration of the FastICA, Infomax, and SOBI algorithms. (ii) Data whitening is provided.

4. MineICA — Biton et al. [21] — R — (i) Storage and visualization of ICA results. (ii) Annotation of features.

5. Pearson ICA — Karvanen [22] — R — Extraction of the independent components by minimizing mutual information, using the Pearson system.

6. Maximum Likelihood ICA — Teschendorff [23] — R — Implementation of maximum likelihood estimation and a fixed-point algorithm for ICA.

Table 5: Sources of ICA software.

1. FastICA: R: http://cran.r-project.org/web/packages/fastICA/index.html; MATLAB: http://research.ics.aalto.fi/ica/fastica/
2. JADE: http://cran.r-project.org/web/packages/JADE/index.html
3. HiPerSAT: http://nic.uoregon.edu/projects/hipersat/index.php
4. MineICA: http://www.bioconductor.org/packages/2.12/bioc/html/MineICA.html
5. Pearson ICA: http://cran.r-project.org/web/packages/PearsonICA/index.html
6. Maximum Likelihood ICA: http://cran.r-project.org/web/packages/mlica2/index.html

4.5. SlimPLS. Gutkin et al. [33] proposed a feature extraction method based on PLS called SlimPLS. Ranking-based filters usually utilize a univariate method when selecting features. Such filter methods can produce reasonable feature sets, especially when the original features are uncorrelated; however, the chosen feature set will be suboptimal when the original features are dependent, as some features will add little discriminative power on top of previously selected ones. SlimPLS is a multivariate feature extraction method which incorporates feature dependencies into the calculation. This multivariate property is obtained by combining a highly predictive feature with some less predictive but correlated features, because the added features provide more information on the behaviour of the samples.

4.6. Sparse Partial Least Squares Discriminant Analysis and Sparse Generalized Partial Least Squares. Chung and Keles [28] proposed two extended feature extraction approaches based on Sparse PLS: Sparse PLS Discriminant Analysis and Sparse Generalized PLS for high-dimensional datasets. These two approaches improve ordinary PLS by employing feature extraction and dimension reduction simultaneously, and they perform well even with unbalanced class sample sizes. Sparse PLS Discriminant Analysis is computationally efficient because it only requires the computational time of one run of Sparse PLS plus a classifier. Moreover, Sparse Generalized PLS extends Sparse PLS to the generalized linear model framework. These methods were also made available by Chung and Keles in an R package.
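A typical Sparse PLS fit with the authors' spls package might look like the following sketch, using the yeast cell-cycle data shipped with the package (argument names per its documentation):

library(spls)

data(yeast)  # yeast cell-cycle data shipped with the package
# eta (between 0 and 1) controls sparsity; K is the number of latent components
fit <- spls(yeast$x, yeast$y, K = 2, eta = 0.7)
print(fit)   # selected variables and fit summary

# A grid search via cross-validation can guide the choice of eta and K, e.g.:
# cv <- cv.spls(yeast$x, yeast$y, eta = seq(0.1, 0.9, 0.1), K = 1:5)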

4.7. Degrees of Freedom of Partial Least Squares. Kramer and Sugiyama [29] proposed a method for unbiased estimation of the degrees of freedom of PLS regression. The authors note that the construction of latent components from the independent variables also depends on the dependent variable. For PLS regression, the optimal number of components needs to be determined first, and one way of determining it is through the degrees of freedom of the fitted models. Moreover, the degrees of freedom estimate can be used for the comparison of different regression methods. The two implementations of the degrees of freedom utilize the connection between PLS regression and methods from numerical linear algebra. The authors also developed an R package for this unbiased estimation of the degrees of freedom of PLS.

4.8. Surrogate Variable Analysis Partial Least Squares. Chakraborty and Datta [30] proposed a surrogate variable analysis method based on PLS. In differential gene expression analysis, one of the important issues is to account for hidden confounders in the dataset. Hidden confounding of gene expression is caused by differing environmental conditions of the samples, and this problem cannot simply be overcome by normalizing the gene expression data. This method extracts the informative features by identifying the hidden effects of the underlying latent factors using ordinary PLS and then applying analysis of covariance (ANCOVA), with the PLS signatures of these hidden effects as covariates, in order to identify the genes that are truly differentially expressed. This method was also made available by the authors in an R package.

4.9. Partial Least Squares Path Modelling. Sanchez and Trinchera [31] developed an R package for Partial Least Squares Path Modelling (PLS-PM). PLS-PM was first introduced by Wold [67] and is also known as Structural Equation Modelling (SEM); it can be used as a composite-based alternative to factor-based SEM. PLS-PM can be used when the distributions are highly skewed, and it can estimate relationships between latent variables with several indicators even when the sample size is small. Basically, PLS-PM consists of two sets of linear equations: the inner model and the outer model. The inner model specifies the relations between latent variables, while the outer model specifies the relations between a latent variable and its observed indicators. PLS-PM is a multivariate feature extraction analysis technique based on the cause-effect relationships of the unobserved and observed features.

4.10. Partial Least Squares Regression for Generalized Linear Models. Bertrand et al. [32] developed a software application of PLS regression for generalized linear models. Generalized linear models are important because they allow the response features to have a distribution other than the normal; ordinary linear models can be viewed as the special case of a generalized linear model in which the distribution function is the normal distribution with constant variance and the link function is the identity (the canonical link when the variance is known). The generalized linear models preserve all the predictive power of the features even when the predicted means are not normally distributed. PLS regression is used to extract the predictive features from the generalized linear models.

4.11. Case Study. In this section, we discuss the R package svapls, which provides the function svpls. This function calls the fitModel function in order to fit a number of ANCOVA models, specified by pmax, to the data and selects the best model by looking for the minimum value of Akaike's Information Criterion (AIC) [68]. Subsequently, this model is utilized to predict the true pattern of differential gene expression. The command lines in R are as follows:

> # Fitting the optimal ANCOVA model to the data gives:
> fit <- svpls(10, 10, hidden_fac.dat, pmax = 5, fdr = 0.05)
> # The optimal ANCOVA model, its AIC value, and the positive genes
> # detected from it are given by:
> fit$opt.model
> fit$AIC.opt
> fit$genes
> # The corrected gene expression matrix obtained after removing the
> # effects of the hidden variability is given by:
> Y.corrected <- fit$Y.corr
> pval.adj <- fit$pvalues.adj

For instance, we study the efficacy of svapls on the preprocessed ALL/AML dataset [69]. These data consist of log-transformed expression levels of 7129 genes over two sets of patients: 47 patients reported to suffer from Acute Lymphoblastic Leukemia (ALL) and 25 from Acute Myeloid Leukemia (AML). Using the svpls function, we obtained the corrected expression matrix for the first 1000 genes. The samples in this dataset are drawn from four different sources, and the correction removes the extra effects owing to the reported batch-specific clustering in the original data. In this respect, svapls performed comparably to another popular R package, ber, for removing batch effects in microarray data, as shown in Figure 3.

4.12. Summary of PLS Software. Tables 6 and 7 show the summary and sources of PLS software, respectively. Table 8 shows related work on the discussed software.

5. Software for Local Linear Embedding (LLE)

Straightforward geometric intuitions are the basis of the LLE algorithm. Assume that the given data comprise $N$ real-valued vectors $X_i$, each of dimensionality $D$, sampled from some underlying manifold. Provided that there are adequate data, every data point and its neighbors are expected to lie on or near a locally linear patch of the manifold. These patches are described by linear coefficients that reconstruct every data point from its neighbors. Equation (8) is the cost function used to calculate reconstruction errors, which sums the squared distances between all the data points and their reconstructions. The weights $W_{ij}$ summarize the contribution of the $j$th data point to the $i$th reconstruction, and the optimal weights $W_{ij}$ are found by solving a least-squares problem [70]:

$$\epsilon(W) = \sum_{i=1}^{N} \Bigl| X_i - \sum_{j=1}^{K} W_{ij} X_j \Bigr|^2 = \sum_{i=1}^{N} \epsilon_i(W), \tag{8}$$

$$\epsilon_i(W) = \Bigl| \sum_{j=1}^{K} W^{i}_{j} (x_i - x_j) \Bigr|^2 = \sum_{j=1}^{K} \sum_{m=1}^{K} W^{i}_{j} W^{i}_{m} Q^{i}_{jm}, \tag{9}$$

$$Q^{i}_{jm} = (x_i - x_j)^{T} (x_i - x_m) = \frac{D_{ij} + D_{im} - D_{jm}}{2}. \tag{10}$$

5.1. lle. An R package "lle" has been developed to implement LLE for feature extraction. This package provides the LLE algorithm for transforming high-dimensional data into low-dimensional data. The newest version of this package is maintained by Diedrich and Abel [34].

Figure 3: (a, b, and c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data. (a) Heatmap for the first 1000 genes in the original Golub expression data. (b) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by use of the R package ber. (c) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by use of the R package svapls.

Table 6: A summary of PLS software (number, software, author/year, language, features).

1. PLS Discriminant Analysis — Barker and Rayens [24] — C/C++, Visual Basic — PLS for discriminant analysis.

2. Least Squares-PLS — Jørgensen et al. [25] — R — Implementation combining PLS and ordinary least squares.

3. Powered PLS Discriminant Analysis — Liland and Indahl [26] — R — Extraction of information for multivariate classification problems.

4. Penalized PLS — Kramer et al. (2008) [27] — R — Extension of PLS regression using a penalization technique.

5. SlimPLS — Gutkin et al. [33] — R — Multivariate feature extraction method which incorporates feature dependencies.

6. Sparse PLS Discriminant Analysis, Sparse Generalized PLS — Chung and Keles [28] — R — Sparse techniques employing feature extraction and dimension reduction simultaneously.

7. PLS Degrees of Freedom — Kramer and Sugiyama [29] — R — Unbiased estimation of the degrees of freedom for PLS regression.

8. Surrogate Variable Analysis PLS — Chakraborty and Datta [30] — R — Extraction of the informative features in the presence of hidden confounders which are otherwise unaccounted for.

9. PLS Path Modelling — Sanchez and Trinchera [31] — R — A multivariate feature extraction analysis technique based on the cause-effect relationships of the unobserved and observed features.

10. PLS Regression for Generalized Linear Models — Bertrand et al. (2013) [32] — R — PLS regression is used to extract the predictive features from the generalized linear models.


Table 7: Sources of PLS software.

1. PLS Discriminant Analysis: http://www.camo.com/downloads/sample-data.html
2. Least Squares-PLS: http://cran.r-project.org/web/packages/lspls/index.html
3. Powered PLS Discriminant Analysis: http://cran.r-project.org/web/packages/pls/index.html
4. Penalized PLS: http://cran.r-project.org/web/packages/ppls/index.html
5. SlimPLS: http://cran.r-project.org/web/packages/SlimPLS/index.html
6. Sparse PLS Discriminant Analysis, Sparse Generalized PLS: http://cran.r-project.org/web/packages/spls/index.html
7. Degrees of Freedom of PLS: http://cran.r-project.org/web/packages/plsdof/index.html
8. Surrogate Variable Analysis PLS: http://cran.r-project.org/web/packages/svapls/index.html
9. PLS Path Modelling: http://cran.r-project.org/web/packages/plspm/index.html
10. PLS Regression for Generalized Linear Models: http://cran.r-project.org/web/packages/plsRglm/index.html

Table 8: Related work.

plsRglm (R package) — Bertrand et al. (2010) [32] — Motivation: (i) to deal with incomplete datasets using cross-validation; (ii) to extend PLS regression to generalized linear models. Advantages: (i) provides formula support; (ii) several new classes and their generics; (iii) custom GLR models and graphics to assess the bootstrap-based significance of the predictors.

SVA-PLS — Chakraborty and Datta [30] — Motivation: (i) to identify the genes that are differentially expressed between the samples from two different tissue types; (ii) to identify the hidden effects of the underlying latent factors in a gene expression profiling study. Advantages: (i) relatively better at discovering a higher proportion of the truly significant genes; (ii) low error rate; (iii) high sensitivity and specificity.

SlimPLS — Gutkin et al. [33] — Motivation: to obtain a low-dimensional approximation of a matrix that is "as close as possible" to a given vector. Advantages: (i) focuses solely on feature selection; (ii) can be used as a preprocessing stage with different classifiers.

The main functions of this package allow users to perform LLE and to plot its results. The implementation of LLE is based on the idea of Ridder and Duin [71]. Besides that, some enhancements, such as subset selection and calculation of the intrinsic dimension, are offered. This package can be installed on Windows, Linux, and MacOS.
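A minimal run of the package's main function might look like this sketch (simulated swiss-roll-style data; argument names follow the lle documentation):

library(lle)

# Simulated 3-D data lying near a 2-D manifold
set.seed(4)
t <- runif(400, 0, 3 * pi)
X <- cbind(t * cos(t), t * sin(t), runif(400))

# Embed into m = 2 dimensions using k = 10 neighbours
res <- lle(X, m = 2, k = 10, reg = 2)
plot(res$Y, main = "2-D LLE embedding", xlab = "", ylab = "")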

5.2. RDRToolbox. RDRToolbox is an R package developed for nonlinear dimension reduction with LLE and Isomap. The package is maintained by Bartenhagen [35]. It offers the transformation of high-dimensional data to low-dimensional data by using either LLE or Isomap. Besides that, a plotting function is provided to plot the results. In addition, the Davis-Bouldin Index is provided for the purpose of validating clusters. It is mainly for Linux, MacOS, and Windows.

5.3. Scikit-Learn. Scikit-learn is software implemented in Python integrating machine learning algorithms [36]. It is simple-to-use software that allows users to implement a variety of machine learning algorithms, including classification, clustering, feature extraction, model selection, manifold learning, and other methods. Isomap, LLE, and Local Tangent Space Alignment (LTSA) are provided by this software. Please see http://scikit-learn.org/stable/ for further details. This software can be installed on a variety of platforms, such as Windows and Ubuntu.

5.4. Case Study. This section demonstrates the dimension reduction workflow for the publicly available Golub et al. leukemia dataset (see Figure 5). The data are available as an R package and can be downloaded and loaded via

> source("http://bioconductor.org/biocLite.R")
> biocLite("golubEsets")
> library(golubEsets)
> data(Golub_Merge)

The dataset consists of 72 samples, divided into 47 ALL and 25 AML patients, and 7129 expression values. In this example, we compute a two-dimensional LLE and Isomap embedding and plot the results. First, we extract the features and class labels:

> golubExprs = t(exprs(Golub_Merge))
> labels = pData(Golub_Merge)$ALL.AML
> dim(golubExprs)


Figure 4: Plot of dimension versus residual variance.

Figure 5: Two-dimensional embedding of the Golub et al. [69] leukemia dataset (top: Isomap; bottom: LLE).

Table 9: A summary of LLE software (number, software, author/year, language, features).

1. lle — Diedrich and Abel [34] — R — (i) The LLE algorithm is provided for transforming high-dimensional data into low-dimensional data. (ii) Subset selection and calculation of the intrinsic dimension are provided.

2. RDRToolbox — Bartenhagen [35] — R — (i) LLE and Isomap for feature extraction. (ii) Davis-Bouldin Index for the purpose of validating clusters.

3. Scikit-learn — Pedregosa et al. [36] — Python — (i) Classification, manifold learning, feature extraction, clustering, and other methods are offered. (ii) LLE, Isomap, and LTSA are provided.


Table 10: Sources of LLE software.

1. lle: http://cran.r-project.org/web/packages/lle/index.html
2. RDRToolbox: http://www.bioconductor.org/packages/2.12/bioc/html/RDRToolbox.html
3. Scikit-learn: http://scikit-learn.org/dev/install.html

Table 11: Related work.

RDRToolbox — Bartenhagen [35] — Motivation: (i) to reduce high-dimensional microarray data; (ii) to preserve most of the significant information and generate data with characteristics similar to the high-dimensional original. Advantages: (i) combines information from all features; (ii) suited for low-dimensional representations of the whole data.

Scikit-learn — Pedregosa et al. [36] — Motivation: to calculate activity index parameters through clustering. Advantages: (i) easy-to-use interface; (ii) can easily be integrated into applications outside the traditional range of statistical data analysis.

lle — Diedrich and Abel [34] — Motivation: currently available data dimension reduction methods are either supervised, where data need to be labeled, or computationally complex. Advantages: (i) fast; (ii) purely unsupervised.

The residual variance of Isomap can be used to estimate the intrinsic dimension of the dataset:

> Isomap(data = golubExprs, dims = 1:10, plotResiduals = TRUE, k = 5)

Based on Figure 4, looking at the dimensions for which the residual variances stop decreasing significantly, we can expect a low intrinsic dimension of two or three and therefore a visualization true to the structure of the original data. Next, we compute the LLE and Isomap embeddings for two target dimensions:

> golubIsomap = Isomap(data = golubExprs, dims = 2, k = 5)
> golubLLE = LLE(data = golubExprs, dim = 2, k = 5)

The Davis-Bouldin Index shows that the ALL and AML patients are well separated into two clusters:

> DBIndex(data = golubIsomap$dim2, labels = labels)
> DBIndex(data = golubLLE, labels = labels)

Finally, we use plotDR to plot the two-dimensional data:

> plotDR(data = golubIsomap$dim2, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "Isomap")
> plotDR(data = golubLLE, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "LLE")

Both visualizations, using either Isomap or LLE, show distinct clusters of ALL and AML patients, although the clusters overlap less in the Isomap embedding. This is consistent with the DB-Index, which is very low for both methods but slightly higher for LLE. A three-dimensional visualization can be generated in the same manner and is best analyzed interactively within R, as the sketch below illustrates.
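For instance, a three-dimensional Isomap embedding can be requested with the same calling conventions used above; the following is a minimal sketch, assuming that plotDR renders three-dimensional input on an interactive graphics device as described in the RDRToolbox documentation:

> # compute a three-dimensional Isomap embedding of the same expression matrix
> golubIsomap3 = Isomap(data = golubExprs, dims = 3, k = 5)
> # the result list is indexed by target dimension, as with $dim2 above
> plotDR(data = golubIsomap3$dim3, labels = labels, legend = TRUE)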

5.5. Summary of LLE Software. Tables 9 and 10 show the summary and sources of LLE software, respectively. Table 11 shows the related work on the software discussed.
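As a practical note, both R packages can be installed from the repositories listed in Table 10; the following is a minimal sketch (install.packages pulls lle from CRAN, while biocLite, the Bioconductor installer at the time of writing, pulls RDRToolbox):

> # install 'lle' from CRAN
> install.packages("lle")
> # install 'RDRToolbox' from Bioconductor
> source("http://bioconductor.org/biocLite.R")
> biocLite("RDRToolbox")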

6. Conclusion

Nowadays, numerous software applications have been developed to help users implement feature extraction of gene expression data. In this paper, we present a comprehensive review of software for feature extraction methods. The methods are PCA, ICA, PLS, and LLE. These software applications have some limitations in terms of statistical aspects as well as computational performance. In conclusion, there is a need for the development of better software.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Universiti Teknologi Malaysia for funding this research through Research University Grants


(Grant nos. J130000262808J80 and J130000250705H50). The authors would also like to thank the Research Management Centre (RMC), Universiti Teknologi Malaysia, for supporting this research.

References

[1] S. Van Sanden, D. Lin, and T. Burzykowski, "Performance of gene selection and classification methods in a microarray setting: a simulation study," Communications in Statistics: Simulation and Computation, vol. 37, no. 1-2, pp. 409–424, 2008.
[2] Q. Liu, A. H. Sung, Z. Chen et al., "Gene selection and classification for cancer microarray data based on machine learning and similarity measures," BMC Genomics, vol. 12, supplement 5, article S1, 2011.
[3] M. Gheorghe and V. Mitrana, "A formal language-based approach in biology," Comparative and Functional Genomics, vol. 5, no. 1, pp. 91–94, 2004.
[4] P. G. Higgs and T. Attwood, "Bioinformatics and molecular evolution," Comparative and Functional Genomics, vol. 6, pp. 317–319, 2005.
[5] S. Le, J. Josse, and F. Husson, "FactoMineR: an R package for multivariate analysis," Journal of Statistical Software, vol. 25, no. 1, pp. 1–18, 2008.
[6] F. Hussen, J. Josse, S. Le, and J. Mazet, "Package 'FactoMineR'," 2013, http://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf.
[7] I. Hoffmann, "Principal Component Analysis with FactoMineR," 2010, http://www.statistik.tuwien.ac.at/public/filz/students/seminar/ws1011/hoffmann_ausarbeitung.pdf.
[8] D. Beaton, C. R. C. Fatt, and H. Abdi, Package 'ExPosition', 2013, http://cran.r-project.org/web/packages/ExPosition/ExPosition.pdf.
[9] A. Lucas, "Package 'amap'," 2013, http://cran.r-project.org/web/packages/amap/vignettes/amap.pdf.
[10] J. Thioulouse, D. Chessel, S. Doledec, and J.-M. Olivier, "ADE-4: a multivariate analysis and graphical display software," Journal of Statistics and Computing, vol. 7, no. 1, pp. 75–83, 1997.
[11] A. C. Culhane, J. Thioulouse, G. Perriere, and D. G. Higgins, "MADE4: an R package for multivariate analysis of gene expression data," Bioinformatics, vol. 21, no. 11, pp. 2789–2790, 2005.
[12] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier, 2nd edition, 2005.
[13] F. W. Young and C. M. Bann, "ViSta: a visual statistics system," in Statistical Computing Environments for Social Research, R. A. Stine and J. Fox, Eds., pp. 207–235, Sage, 1992.
[14] D. Grapov and J. W. Newman, "imDEV: a graphical user interface to R multivariate analysis tools in Microsoft Excel," Bioinformatics, vol. 28, no. 17, Article ID bts439, pp. 2288–2290, 2012.
[15] The MathWorks, Statistics Toolbox for Use with MATLAB: User's Guide, Version 4, 2003, http://www.pi.ingv.it/~longo/CorsoMatlab/OriginalManuals/stats.pdf.
[16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[17] NAG Toolbox for Matlab, g03aa, G03: Multivariate Methods, http://www.nag.com/numeric/MB/manual_22_1/pdf/G03/g03aa.pdf.
[18] J. L. Marchini, C. Heaton, and B. D. Ripley, "Package 'fastICA'," http://cran.r-project.org/web/packages/fastICA/fastICA.pdf.
[19] K. Nordhausen, J.-F. Cardoso, J. Miettinen, H. Oja, E. Ollila, and S. Taskinen, "Package 'JADE'," http://cran.r-project.org/web/packages/JADE/JADE.pdf.
[20] D. Keith, C. Hoge, R. Frank, and A. D. Malony, HiPerSAT Technical Report, 2005, http://nic.uoregon.edu/docs/reports/HiPerSATTechReport.pdf.
[21] A. Biton, A. Zinovyev, E. Barillot, and F. Radvanyi, "MineICA: independent component analysis of transcriptomic data," 2013, http://www.bioconductor.org/packages/2.13/bioc/vignettes/MineICA/inst/doc/MineICA.pdf.
[22] J. Karnanen, "Independent component analysis using score functions from the Pearson system," 2006, http://cran.r-project.org/web/packages/PearsonICA/PearsonICA.pdf.
[23] A. Teschenforff, Independent Component Analysis Using Maximum Likelihood, 2012, http://cran.r-project.org/web/packages/mlica2/mlica2.pdf.
[24] M. Barker and W. Rayens, "Partial least squares for discrimination," Journal of Chemometrics, vol. 17, no. 3, pp. 166–173, 2003.
[25] K. Jørgensen, V. Segtnan, K. Thyholt, and T. Næs, "A comparison of methods for analysing regression models with both spectral and designed variables," Journal of Chemometrics, vol. 18, no. 10, pp. 451–464, 2004.
[26] K. H. Liland and U. G. Indahl, "Powered partial least squares discriminant analysis," Journal of Chemometrics, vol. 23, no. 1, pp. 7–18, 2009.
[27] N. Kramer, A. Boulesteix, and G. Tutz, "Penalized Partial Least Squares with applications to B-spline transformations and functional data," Chemometrics and Intelligent Laboratory Systems, vol. 94, no. 1, pp. 60–69, 2008.
[28] D. Chung and S. Keles, "Sparse partial least squares classification for high dimensional data," Statistical Applications in Genetics and Molecular Biology, vol. 9, no. 1, 2010.
[29] N. Kramer and M. Sugiyama, "The degrees of freedom of partial least squares regression," Journal of the American Statistical Association, vol. 106, no. 494, pp. 697–705, 2011.
[30] S. Chakraborty and S. Datta, "Surrogate variable analysis using partial least squares (SVA-PLS) in gene expression studies," Bioinformatics, vol. 28, no. 6, pp. 799–806, 2012.
[31] G. Sanchez and L. Trinchera, Tools for Partial Least Squares Path Modeling, 2013, http://cran.r-project.org/web/packages/plspm/plspm.pdf.
[32] F. Bertrand, N. Meyer, and M. M. Bertrand, "Partial Least Squares Regression for generalized linear models," http://cran.r-project.org/web/packages/plsRglm/plsRglm.pdf.
[33] M. Gutkin, R. Shamir, and G. Dror, "SlimPLS: a method for feature selection in gene expression-based disease classification," PLoS ONE, vol. 4, no. 7, Article ID e6416, 2009.
[34] H. Diedrich and M. Abel, "Package 'lle'," http://cran.r-project.org/web/packages/lle/lle.pdf.
[35] C. Bartenhagen, "RDRToolbox: a package for nonlinear dimension reduction with Isomap and LLE," 2013, http://bioconductor.org/packages/2.13/bioc/vignettes/RDRToolbox/inst/doc/vignette.pdf.
[36] F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[37] S. Penel, Package 'ade4', 2013, http://cran.r-project.org/web/packages/ade4/ade4.pdf.
[38] S. Dray and A. Dufour, "The ade4 package: implementing the duality diagram for ecologists," Journal of Statistical Software, vol. 22, no. 4, pp. 1–20, 2007.
[39] K. Moorthy, M. S. Mohamad, S. Deris, and Z. Ibrahim, "Multivariate analysis of gene expression data and missing value imputation based on llsimpute algorithm," International Journal of Innovative Computing, Information and Control, vol. 6, no. 5, pp. 1335–1339, 2012.
[40] A. Culhane, Package 'made4', 2013, http://bioconductor.org/packages/release/bioc/manuals/made4/man/made4.pdf.
[41] C. V. Subbulakshmi, S. N. Deepa, and N. Malathi, "Comparative analysis of XLMiner and WEKA for pattern classification," in Proceedings of the IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT '12), pp. 453–457, Ramanathapuram, Tamil Nadu, India, August 2012.
[42] S. Jothi and S. Anita, "Data mining classification techniques applied for cancer disease: a case study using Xlminer," International Journal of Engineering Research & Technology, vol. 1, no. 8, 2012.
[43] T. Anh and S. Magi, Principal Component Analysis: Final Paper in Financial Pricing, National Cheng Kung University, 2009.
[44] L. Tierney, Lisp-Stat: An Object-Oriented Environment for Statistical Computing & Dynamic Graphics, Addison-Wesley, Reading, Mass, USA, 1990.
[45] F. W. Young and D. J. Lubinsky, "Guiding data analysis with visual statistical strategies," Journal of Computational and Graphical Statistics, vol. 4, pp. 229–250, 1995.
[46] F. W. Young and J. B. Smith, "Towards a structured data analysis environment: a cognition-based design," in Computing and Graphics in Statistics, A. Buja and P. A. Tukey, Eds., vol. 36, pp. 253–279, Springer, New York, NY, USA, 1991.
[47] F. W. Young, R. A. Faldowski, and M. M. McFarlane, "Multivariate statistical visualization," in Handbook of Statistics, C. R. Rao, Ed., pp. 958–998, 1993.
[48] M. McFarlane and F. W. Young, "Graphical sensitivity analysis for multidimensional scaling," Journal of Computational and Graphical Statistics, vol. 3, no. 1, pp. 23–34, 1994.
[49] P. M. Valero-Mora and R. D. Ledesma, "Using interactive graphics to teach multivariate data analysis to psychology students," Journal of Statistics Education, vol. 19, no. 1, 2011.
[50] E. Frank, M. Hall, G. Holmes et al., "Weka: a machine learning workbench for data mining," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 1269–1277, 2010.
[51] S. S. Prabhume and S. R. Sathe, "Reconstruction of a complete dataset from an incomplete dataset by PCA (principal component analysis) technique: some results," International Journal of Computer Science and Network Security, vol. 10, no. 12, pp. 195–199, 2010.
[52] J. Khan, J. S. Wei, M. Ringner et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.
[53] P. Comon, "Independent component analysis, a new concept?" Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[54] A. Hyvarinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.
[55] V. Zarzoso and P. Comon, "Comparative speed analysis of FastICA," in Independent Component Analysis and Signal Separation, M. E. Davies, C. J. James, S. A. Abdallah, and M. D. Plumbley, Eds., vol. 4666 of Lecture Notes in Computer Science, pp. 293–300, Springer, Berlin, Germany, 2007.
[56] J. F. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," IEE Proceedings Part F: Radar and Signal Processing, vol. 140, no. 6, pp. 362–370, 1993.
[57] A. Belouchrani, K. Abed-Meraim, J. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434–444, 1997.
[58] L. Tong, V. C. Soon, Y. F. Huang, and R. Liu, "AMUSE: a new blind identification algorithm," in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1784–1787, May 1990.
[59] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Proceedings of the Advances in Neural Information Processing Systems Conference, pp. 757–763, 1996.
[60] A. Delorme and S. Makeig, "EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
[61] A. Biton, "Package 'MineICA'," 2013, http://www.bioconductor.org/packages/2.13/bioc/manuals/MineICA/man/MineICA.pdf.
[62] A. Hyvaerinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.
[63] M. Schmidt, D. Bohm, C. von Torne et al., "The humoral immune system has a key prognostic impact in node-negative breast cancer," Cancer Research, vol. 68, no. 13, pp. 5405–5413, 2008.
[64] I. S. Helland, "On the structure of partial least squares regression," Communications in Statistics: Simulation and Computation, vol. 17, no. 2, pp. 581–607, 1988.
[65] H. Martens and T. Naes, Multivariate Calibration, John Wiley & Sons, London, UK, 1989.
[66] N. Kramer and A. Boulesteix, "Package 'ppls'," 2013, http://cran.r-project.org/web/packages/ppls/ppls.pdf.
[67] H. Wold, "Soft modeling: the basic design and some extensions," in Systems under Indirect Observations: Causality, Structure, Prediction, K. G. Joreskog and H. Wold, Eds., Part 2, pp. 1–54, North-Holland, Amsterdam, The Netherlands, 1982.
[68] H. Akaike, "Likelihood and the Bayes procedure," Trabajos de Estadística y de Investigación Operativa, vol. 31, no. 1, pp. 143–166, 1980.
[69] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[70] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[71] D. Ridder and R. P. W. Duin, Locally Linear Embedding, University of Technology, Delft, The Netherlands, 2002.

Page 2: A Review of Feature Extraction Software for Microarray ... › a58e › 818fcd408e... · A Review of Feature Extraction Software for Microarray Gene Expression Data ... When gene

2 BioMed Research International

to decrease the dimensionality of a given data set whilstmaintaining as plentiful as possible the variation existing inthe initial predictor variablesThis is attained by transformingthe 119901 initial variables 119883 = [119909

1 1199092 119909

119901] to a latest set

of 119902 predictor variables Linear amalgamation of the initialvariables is 119879 = [119905

1 1199052 119905

119902] In mathematical domain PCA

successively optimizes the variance of a linear amalgamationof the initial predictor variables

119906119902

= argmax (Var (119883119906))

119906119905119906 = 1

(1)

conditional upon the constraint 119906119879

119894119878119883

119906119895

= 0 for every 1 le

119894 le 119895 The orthogonal constraint makes sure that the linearcombinations are uncorrelated that is Cov(119883119906

119894 119883119906119895) = 0

119894 = 119895 These linear combinations are denoted as the principlecomponents (PCs)

119905119894

= 119883119906119894 (2)

The projection vectors (or known as the weighting vectors)119906 can be attained by eigenvalue decomposition on thecovariance matrix 119878

119883

119878119883

119906119894

= 120574119894119906119894 (3)

where 120574119894is the 119894th eigenvalue in the decreasing order for 119894 =

1 119902 and 119906119894is the resultant eigenvector The eigenvalue 120574

119894

calculates the variance of the 119894th PC and the eigenvector 119906119894

gives the weights for the linear transformation (projection)

21 FactoMineR FactoMineR is an R package that providesvarious functions for the analysis of multivariate data [5]Thenewest version of this package is maintained by Hussen et al[6] There are a few main features provided by this packagefor example different types of variables data structuresand supplementary information can be taken into accountBesides that it offers dimension reduction methods such asPrincipal Component Analysis (PCA) Multiple Correspon-dence Analysis (MCA) and Correspondence Analysis (CA)The steps in implementing PCA are described in Le et al [5]and Hoffmann [7] For PCA there are three main functionsfor performing the PCA plotting it and printing its resultsThis package is mainly for Windows MacOS and Linux

22 ExPosition ExPosition is an R package for the multivari-ate analysis of quantitative and qualitative data ExPositionstands for Exploratory Analysis with the Singular ValueDecomposition The newest version of this package is main-tained by Beaton et al [8] A variety of multivariate methodsare provided in this package such as PCA multidimensionalscaling (MDS) and Generalized PCA All of these methodscan be performed by using the corePCA function in this pack-age Another function epPCA can be applied to implementPCA Besides that Generalized PCA can be implementedusing the function epGPCA as well All of these methods areused to analyse quantitative data A plotting function is alsooffered by this package in order to plot the results of theanalysis This package can be installed on Windows Linuxand MacOS

23 amap The R package ldquoamaprdquo was developed for clus-tering as well as PCA for both parallelized functions androbust methods It is an R package for multidimensionalanalysisThe newest version ismaintained by Lucas [9]Threedifferent types of PCA are provided by this package Themethods are PCA Generalized PCA and Robust PCA PCAmethods can be implemented using the functions acp andpca for PCA acpgen for Generalized PCA and acprob forRobust PCA This package also allows the implementationof correspondence factorial analysis through the functionafc Besides that a plotting function is also provided forplotting the results of PCA as a graphical representationTheclustering methods offered by this package are k-means andhierarchical clustering The dissimilarity matrix and distancematrix can be computed using this package as well Thispackage is mainly for Windows Linux and MacOS

24 ADE-4 ADE-4 was originally developed by Thioulouseet al [10] as software for analyzing multivariate data and dis-playing graphics This software includes a variety of methodssuch as PCA CA Principal Component Regression PLSCanonical Correspondence Analysis Discriminant Analysisand others Besides that this software is implemented in anR environment as an R package ldquoade4rdquo The newest versionof this package is maintained by Penel [37] In this packagePCA can be performed by using the dudipca function Avisualization function is also provided in order to visualizethe results as a graphical representation In previous studiesthis package was implemented by Dray and Dufour [38] toidentify and understand ecological community structuresThis package is mainly for Linux Windows and MacOS

25 MADE4 MADE4 (microarray ade4) was developed byCulhane et al [11] for multivariate analysis of gene expressiondata based on the R package ldquoade4rdquo Basically it is the ext-ensions of the R package ldquoade4rdquo formicroarray dataThe pur-pose of writing this software was to help users in the analysisof microarray data using multivariate analysis methods Thissoftware is able to handle a variety of gene expression dataformats and new visualization software has been added to thepackage in order to facilitate the visualization of microarraydata Other extra features such as data preprocessing andgene filtering are included as well However this packagewas further improved by the addition of the LLSimpute alg-orithm to handle themissing values in themicroarray data byMoorthy et al [39] It is implemented in an R environmentThe advance of this package is that multiple datasets canbe integrated to carry out analysis of microarray data Thenewest version is maintained by Culhane [40] This packagecan be installed on Linux Windows and MacOS

26 XLMiner XLMiner is add-in software for MicrosoftExcel that offers numerous data mining methods for ana-lysing data [12] It offers a quick start in the use of a variety ofdata miningmethods for analysing dataThis software can beused for data reduction using PCA classification using Neu-ral Networks or Decision Trees [41 42] class prediction dataexploration affinity analysis and clustering In this softwarePCA can be implemented using the Principle Component

BioMed Research International 3

tab [43] This software is implemented in Excel As a resultthe dataset should be in an Excel spreadsheet In order tostart the implementation of XLMiner the dataset needs tobe manually partitioned into training validation and testsets Please see httpwwwsolvercomxlminer-data-miningfor further detailsThis software can be installed onWindowsand MacOS

27 ViSta ViSta stands for Visual Statistics System and canbe used for multivariate data analysis and visualization inorder to provide a better understanding of the data [13]This software is based on the Lisp-Stat system [44] It isan open source system that can be freely distributed formultivariate analysis and visualization PCA and multipleand simple CA are provided in this software Its mainadvance is that the data analysis is guided in a visualizationenvironment in order to generate more reliable and accu-rate results The four state-of-the-art visualization methodsoffered by this software are GuideMaps [45] WorkMaps[46] Dynamic Statistical Visualization [47] and StatisticalRe-Vision [48] The plug-ins for PCA can be downloadedfrom httpwwwmdpeduarpsicologiavistavistahtm Anexample of implementation of the analysis using PCA can beviewed in Valero-Mora and Ledesma [49] This software canbe installed on Windows Unix and Macintosh

28 imDEV Interactive Modules for Data Exploration andVisualization (imDEV) [14] is an application of RExcel thatintegrates R and Excel for the analysis visualization andexploration of multivariate data It is used in Microsoft Excelas add-ins by using an R package Basically it is implementedin Visual Basic and R In this software numerous dimensionreduction methods are provided such as PCA ICA PLSregression andDiscriminant Analysis Besides that this soft-ware also offers clustering imputing ofmissing values featureselection and data visualization The 2 times 3 visualizationmethods are offered such as dendrograms distribution plotsbiplots and correlation networksThis software is compatiblewith a few versions of Microsoft Excel such as Excel 2007 and2010

29 Statistics Toolbox Statistical Toolbox offers a variety ofalgorithms and tools for data modelling and data analy-sis Multivariate data analysis methods are offered by thistoolbox The methods include PCA clustering dimensionreduction factor analysis visualization and others In thestatistical toolbox of MATLAB several PCA functions areprovided for multivariate analysis for example pcacov prin-comp and pcares (MathWorks) Most of these functions areused for dimensional reduction pcacov is used for covariancematrices princomp for raw data matrices and pcares forresiduals from PCA All of these functions are implementedin MATLAB

210 Weka Weka [16] is data mining software that pro-vides a variety of machine learning algorithms This soft-ware offers feature selection data preprocessing regressionclassification and clustering methods [50] This software isimplemented in a Java environment PCA is used as a dim-

ension reduction method in Weka to reduce the dimension-ality of complex data through transformation However notall of the datasets are complete Prabhume and Sathe [51]introduced a new filter PCA for Weka in order to solve theproblem of incomplete datasets It works by estimating thecomplete dataset from the incomplete dataset This softwareis mainly for Windows Linux and MacOS

211 NAG Library In NAG Library the function of PCAis provided as the g03aa routine [17] in both C and Fort-ran This routine performs PCA on data matrices This soft-ware was developed by the Numerical Algorithms GroupIn the NAG Library more than 1700 algorithms are offeredfor mathematical and statistical analysis For PCA it is suit-able for multivariate methods G03 Other methods prov-ided are correlation analysis wavelet transforms and par-tial differential equations Please refer to httpwwwnagcomnumericMBmanual 22 1pdfG03g03aapdf for fur-ther details about the g03aaa routine This software can beinstalled on Windows Linux MacOS AIX HP UX andSolaris

212 Case Study In this section we will discuss the imple-mentation of coinertia analysis (CIA) to cross-platformvisualization in MADE4 and ADE4 to perform multivariateanalysis of microarray datasets To demonstrate PCA wasapplied on 4 childhood tumors (NB BL-NHL EWS andRMS) from amicroarray gene expression profiling study [52]From these data a subset (khan$train 206 genes times 64 cases)each casersquos factor denoting the respective class (khan$trainclasses length = 64) and a gene annotationrsquos data frame areaccessible in aforementioned dataset in MADE4

lt library (made4)

lt data (khan)

lt dataset = khan$train

lt fac = khan$trainclasses

lt geneSym = khan$annotation$Symbol

lt resultscoa lt- ord (dataset type = ldquocoardquo)

lt par (mfrow = c (1 2))

lt plotarrays (resultscoa classvec = fac)

lt plotgenes (resultscoa genelabels = geneSym)

Figure 1 shows the PCA of a 306-gene subset As origin asthe point of reference the more advanced gene and case areprojected in the similar direction the stronger the associationbetween involved gene and case is (gene is upregulated in thatarray sample)

213 Summary of PCA Software Tables 1 and 2 show thesummary and sources of PCA software respectively Table 3discusses the related work of this software

4 BioMed Research International

LYN

TNNT2 IGF2

AIF1

- MYL4 IGF2

HCLS1 ELF1 HLA-DMA

TNNT1

CDKN3 ISG20 BUB1 MME

COL3A1

PRKAR2B -

FNDC5

MYC

METAP2 GSTM5

TNA CAV1 TLE2

TUBB5

TNFAIP6

MYC

PTPN13

LOXL2

FCGRT OLFM1

GYG2

NFIX FVT1

d = 05

Figure 1 Plot of genes

3 Software for IndependentComponent Analysis (ICA)

ICA is considered as a valuable extension of PCA thathas been established considering the blind separation ofindependent sources from their linear combination [53] In away the initial point of ICA is the property of uncorrelation ofgeneral PCA Based on 119899times119901datamatrix119883 whose rows 119903

119894(119895 =

1 119899) tally to observational variables and whose columns119888119895

(119895 = 1 119901) are the individuals of the correspondingvariables the ICA model of 119883 can be written as

119883 = 119860119878 (4)

With generality intact119860 is a 119899times119899 mixingmatrix whereas 119878 isa 119899times119901 sourcematrix under the necessity of 119878being statisticallyindependent as possible ldquoIndependent componentsrdquo are thenew variables confined in the rows of 119878 to wit the variablesobserved are linearly collected independent componentsMutual information 119868 = sum

119896119867(119878119896) minus 119867(119878) where 119867(119878

119896) =

minus int 119901(119878119896) log119901(119878

119896)119889119904119896is themarginal entropy of the variables

119878119896 119901(119878119896) is the probabilistic density function and 119867(119878) is the

joint entropy [54] Value the independent components able tobe attained by discovering the correct linear mixtures of theobservational variables since mixing can be inverted as

119880 = 119878 = 119860minus1

119883 = 119882119883 (5)

31 FastICA FastICA is themost widely usedmethod of ICA[55] It is implemented in an R environment as the R packageldquoFastICArdquo for performing ICA and Projection Pursuit byusing the FastICA algorithm FastICA was first introducedby Hyvarinen [54] for single and multiple componentextraction The FastICA algorithm is based on a fixed-pointiteration scheme maximizing non-Gaussianity as a measureof statistical independence This package is maintained byMarchini et al [18] ICA is used to extract the informativefeatures through a transformation of the observed multidi-mensional random vectors into independent components

This package is mainly for Windows Linux and MacOSFastICA is also implemented in MATLAB In MATLABFastICA implements a fast fixed-point algorithm for ICA aswell as projection pursuit It provides a simple user interfaceand also a powerful algorithm for computation

32 JADE JADE is an R package that provides a functionfor implementing ICA This package is maintained by Nord-hausen et al [19] In this package Cardosorsquos JADE algorithm[56] is provided for ICA Instead of the JADE algorithm otherBlind Source Separation (BSS)methods such as the SOBI [57]andAMUSE [58]methods are offered Both of thesemethodsare mainly used for solving second order BSS problemsAmari error [59] is offered to evaluate the performance ofthe ICA algorithm This package can be installed on LinuxWindows and MacOS

33 High Performance Signal Analysis Tools (HiPerSAT)HiPerSAT is written in C++ for processing electroen-cephalography (EEG) data with whitening of data and ICA[20] MPI and OpenMP are used to perform parallel analysisof ICA Basically this software is used to analyse EEGdata in order to understand the neurological components ofbrain activity In this software FastICA SOBI and Informaxalgorithms are offered HiPerSAT is integrated intoMATLABand EEGLAB [60] EEGLAB is MATLAB-based softwarethat is used for analysing EEG data However the advantageof HiPerSAT is that it can handle larger datasets comparedto MATLAB In comparison to EEGLAB HiPerSAT is ableto handle large datasets without partitioning but EEGLABrequires data partitioning Data whitening is performedbefore implementing the algorithms This software can beinstalled on all platforms

34 MineICA MineICA is an R package that supplies theimplementation of ICA on transcriptomic data [21] Themain purpose of MineICA is to provide an easier way ofinterpreting the decomposition results from ICA Besidesthat this software also provides a correlation-based graphfor comparing the components from different datasets Thenewest version of this package is maintained by Biton [61]This package provides some features such as storage of ICAresults annotation of features and visualization of the resultsof ICA This package can be installed on Linux MacOS andWindows

35 Pearson Independent Component Analysis Karnanen[22] developed an R package for a feature extraction tech-nique based on the Pearson ICA algorithm This is a mutualinformation-based blind source separation approach whichapplies the Pearson system as a parametric model In orderto extract the independent components using the ICA algo-rithm the mutual information of the components has to beminimized However minimization of mutual information isrequired to use a score functionThePearson systemwas usedto model the score function The parameters of the Pearsonsystem are estimated by the method of moments In order to

BioMed Research International 5

Table 1 A summary for PCA software

Number Software Authoryear Language Features

1 FactoMineR Le et al [5] R

(i) Various dimension reduction methods such as PCA CA and MCA(ii) Different types of variables data structures and supplementaryinformation are considered(iii) The PCA function can handle missing values

2 ExPosition Beaton et al [8] R

(i) Numerous multivariate analysis methods such as PCA and GeneralizedPrincipal Component Analysis (GPCA)(ii) Can be applied to quantitative and qualitative data(iii) Implementation of Singular Value Decomposition

3 amap Lucas [9] R

(i) Different types of PCA are provided PCA Generalized PCA and RobustPCA(ii) Clustering methods are provided such as hierarchical clustering and119896-means clustering(iii) Plotting function for PCA(iv) Computing distance and dissimilarity matrices

4 ADE-4 Thioulouse et al[10] R A variety of methods such as PCA CA Principal Analysis Regression PLS

and others are offered

5 MADE4 Culhane et al[11] R

(i) Functions provided by ADE-4(ii) Integration of multiple datasets for multivariate analysis(iii) Functions for visualizing and plotting the results of analysis including 3Dplots(iv) Addition of LLSimpute algorithm for imputation of missing values

6 XLMiner Witten andFrank [12]

Implemented inExcel

(i) Provision of data reduction methods such as PCA(ii) Can be used for classification clustering data preprocessing datanormalization and others

7 ViSta Young et al [13]C++ FortranXLisp andViDAL

(i) Multivariate analysis methods are offered such as PCA Interactive ClusterAnalysis and Parallel Boxplots(ii) Provision of dynamic and high-interaction visualization for displayingmultiple views of data

8 imDEV Grapov andNewman [14]

Visual Basic andR

(i) Data preprocessing missing values imputation and data transformations(ii) Clustering methods are offered(iii) Dimension reduction methods PCA and ICA(iv) Feature selection methods(v) Visualization of data dependencies

9 StatisticsToolbox

The MathWorks[15] MATLAB

(i) Multivariate statistics such as PCA clustering and others(ii) Statistical plots probability distributions linear models nonlinear modelsfor regression and others are provided

10 Weka Hall et al [16] JavaA variety of machine learning algorithms are provided such as featureselection data preprocessing regression dimension reduction classificationand clustering methods

11 NAG LibraryNAG Toolboxfor MATLAB[17]

Fortran and C(i) Provision of more than 1700 mathematical and statistical algorithms(ii) Multivariate analysis using PCA can be implemented using the g03aaroutine

speed up the algorithm tanh nonlinearity is used when thedistribution is far from Gaussian

36 Maximum Likelihood Independent Component Analy-sis Teschenforff [23] developed an R package for ICA byusing maximum likelihood estimation This method wasfirst introduced by Hyvaerinen et al [62] This methoduses a fixed-point algorithm as the Maximum Likelihoodestimation For a fixed set of data and underlying statisticalmodel Maximum Likelihood selects the set of values of themodel parameters that maximizes the likelihood function

Maximum Likelihood estimation gives a unified approachto estimation which is well-defined in the case of normaldistribution By using a maximum likelihood frameworkand controlling the number of algorithm runs this fixed-point algorithm provides a very fast implementation formaximization of likelihood

37 Sample Case Study In this section we utilize MineICAfor microarray-based gene expression data of 200 breastcancer tumors kept in the package breastCancerMAINZ [63]based on a study done by Biton et al [21] In this study we

6 BioMed Research International

Table 2 Sources of PCA software

Number Software Sources1 FactoMineR httpcranr-projectorgwebpackagesFactoMineRindexhtml2 ExPosition httpcranr-projectorgwebpackagesExPositionindexhtml3 Amap httpcranr-projectorgwebpackagesamapindexhtml4 ADE-4 httpcranr-projectorgwebpackagesade4indexhtml5 MADE4 httpwwwbioconductororgpackages211biochtmlmade4html6 XLMiner httpwwwsolvercomxlminer-data-mining

7 ViSta httpwwwvisualstatsorghttpwwwmdpeduarpsicologiavistavistahtm

8 imDEV httpsourceforgenetprojectsimdev9 Statistics Toolbox httpwwwmathworkscommatlabcentralfileexchange30792-pca-principal-component-analysis10 Weka httpwwwcswaikatoacnzmlwekadownloadinghtml11 NAG Library httpwwwnagcomdownloadscldownloadsasp

Table 3 Related work

Software Author Motivation Advantage

FactoMineR Le et al (2009) [5]

(i) Providing a multivariate data analytictechnique for applications in biological systems(ii) To combine ldquoOmicsrdquo data structured intogroups(iii) To help on their functional interpretations

(i) It provides a geometrical point of view and alot of graphical outputs(ii) It can take into account a structure on thedata(iii) A GUI is available

MADE4 Culhane et al [11] To provide a simple-to-use tool for multivariateanalysis of microarray data

(i) Accepts a wide variety of gene-expressiondata input formats(ii) No additional data processing is required

Statistictoolbox

The MathWorks[15]

High-dimensional and complex microarraydata need automaticcomputer aided tools foranalysis

Elegant matrix support visualization

imDev Grapov andNewman 2012 [14]

Omics experiments generate complexhigh-dimensional data requiring multivariateanalyses

(i) User-friendly graphical interface(ii) Visualizations can be exported directlyfrom the R plotting interface in a variety of fileformats(iii) Dynamic loading of R objects betweenanalyses sessions

focused on how MineICA can be utilized to study an ICA-based decomposition Pseudo code for this case study is asfollows

(1) Loading the library and the data(2) Creation of an IcaSet object

(21) Load an example of expression data(22) Run ICA(23) Create aMineICAParams object function build-

MineICAParams(24) Create an IcaSet instance function buildIcaSet(25) IcaSet basics

(3) Run global analysis(4) Run analysis by calling individual functions

(41) Write description of contributing genes or fea-tures function writeProjByComp

(42) Plot heatmaps of the contributing elementsfunction plot heatmapsOnSel

(43) Gene enrichment analysis function runEnrich(44) Association with sample variables(45) Clustering of the samples according to each

component(46) Comparison of IcaSet objects function run-

CompareIcaSets

Figure 2 explains the correlation based graph denoting rela-tionship between independent components (IC) attained onfour breast cancer samplesrsquo microarray data Every noderepresents an IC and respective colors denote the origin ofdatasetThickness of edge represents the extent of correlationamong the linked ICs Black edges represent reciprocal nodes

38 Summary of ICA Software Tables 4 and 5 show thesummary and sources of ICA software respectively

BioMed Research International 7

22

22

5

4

1

3

3

1

5

4

3

1

5

44

1

3

5

Figure 2 Correlation-based graph

4 Software for Partial Least Squares (PLS)

The fundamental hypothesis of PLS is that the experimentalinformation is created by a framework ormethodologywhichis determined by a small number of latent characteristicsThusly PLS goes for discovering uncorrelated linear transfor-mation of the initial indicator characteristics which have highcovariance with the reaction characteristics In light of theselatent components PLS predicts reaction characteristics 119910the assignment of regression and reproduce initial matrix119883 the undertaking of data modelling in the meantime Thepurpose of building components in PS is to optimize thecovariance among the variable 119910 and the initial predictorvariables 119883

119908119902

= argmax (Cov (119883119908

119910))

119908119879

119908 = 1

(6)

Restricted to constraint 119908119879

119894119878119909119908119895

= 0 for all 1 le 119894 lt 119895The crucial assignment of PLS is to attain the vectors ofmaximum weights 119908

119894(119894 = 1 119902) to build a small number

of components while PCA is an ldquounsupervisedrdquo methodthat utilizes the 119883 data only To develop the components[1199051 1199052 119905

119902] PLS decomposes 119883 and 119910 to yield a bilinear

denotation of the data [64]

119883 = 1199051119908119879

1+ 1199052119908119879

2+ sdot sdot sdot + 119905

119902119908119879

119902+ 119890

119910 = 1199051V1198791

+ 1199052V1198792

+ sdot sdot sdot + 119905119870V119879119902

+ 119891

(7)

where 119908rsquos are vectors of weights for building the PLS compo-nents 119905 = 119883

119908 Vrsquos are scalars and 119890 and 119891 are the residuals

The concept of PLS is to assume 119908 and V by regression

41 Partial Least Squares Discriminant Analysis Barker andRayens [24] developed a PLS for discriminant analysis

However the original PLS was not designed for discriminantpurposes PLS Discriminant Analysis is used to find alinear regression model by projecting the dependent fea-tures and the independent features to a new space Thenthe fundamental relations can be extracted from the latentvariables This method was developed for software calledUnscrambler which was first developed by Martens andNaes [65] Unscrambler is a commercial software product formultivariate data analysis Unscrambler is used for analysinglarge and complex datasets quickly and easily using thepower of multivariate analysis Moreover this multivariatedata analysis also offers exceptional data visualization

42 Least Squares Partial Least Squares Joslashrgensen et al [25]proposed a method of using an iterative combination of PLSand ordinary least squares to extract the relationship betweenthe predictor variable and the responsesThismethod is basedon a combination of least squares estimates for the designvariables and PLS regression on the spectra The PLS scoreswere incorporated into the ordinary least squares equationon the spectra The idea is to separate the information fromthe spectral and design matrices in a nice way Howeverthis method is able to extract the information even whenfewer components are used In addition this method isinsensitive to the relative scaling of the spectra and theprocess Moreover this combination method is also lessbiased than the individual PLS technique

43 Powered Partial Least Squares Discriminant AnalysisLiland and Indahl [26] extended the Powered PLS to Pow-ered PLS Discriminant Analysis to overcome the extractionof information for the multivariate classification problemThis method can construct more efficient group separationand generate more interpretive outcomes than the ordinaryPartial Least Square Discriminant Analysis technique Thefeatures extracted by the Powered PLS can contribute torevealing the relevance of particular predictors and oftenrequires smaller and simpler components than ordinary PLSMoreover the optimization task is equivalent to maximizingthe correlation between the transformed predictors and thegroups This makes it possible to discard the influence of lessimportant predictorsThis method was also developed by theauthors for availability in an R package

44 Penalized Partial Least Squares Kramer et al [27] pro-posed a combination of the feature extraction technique PLSwith a penalization framework This method is an extensionof PLS regression using a penalization technique OrdinaryPLS is suited for regression problems by minimizing aquadratic loss function iteratively In addition the repre-sentation in terms of kernel matrices provides an intuitivegeometric interpretation of the penalty term The penaltyterms control the roughness of the estimated functions Withthe incorporation of penalization into this framework theresearch direction became more promising This method isused to extract relevant information for high-dimensionalregression problems and also for noisy dataThismethod wasalso developed by the Kramer and her colleagues colleagues[66] for availability in an R package

8 BioMed Research International

Table 4 Summary of ICA software

Number Software Authoryear Language Features

1 FastICA Marchini et al[18] R and MATLAB ICA algorithm is provided for implementing the analysis using

ICA

2 JADE Nordhausen etal [19] R (i) JADE algorithm is provided for ICA

(ii) Other BSS methods such as AMUSE and SOBI are offered

3 HiPerSAT Keith et al [20] C++ MATLABand EEGLAB

(i) Integration of FastICA Informax and SOBI algorithms(ii) Data whitening is provided

4 MineICA Biton et al [21] R (i) Storage and visualization of ICA results(ii) Annotation of features

5 Pearson ICA Karnanen [22] R Extraction of the independent components using the minimizationof mutual information from the Pearson system

6 MaximumLikelihood ICA

Teschenforff[23] R Implementation of the Maximum Likelihood and fixed-point

algorithm into ICA

Table 5 Sources of ICA software

Number Software Sources

1 FastICA R httpcranr-projectorgwebpackagesfastICAindexhtmlMATLAB httpresearchicsaaltofiicafastica

2 JADE httpcranr-projectorgwebpackagesJADEindexhtml3 HiPerSAT httpnicuoregoneduprojectshipersatindexphp4 MineICA httpwwwbioconductororgpackages212biochtmlMineICAhtml5 Pearson ICA httpcranr-projectorgwebpackagesPearsonICAindexhtml6 Maximum Likelihood ICA httpcranr-projectorgwebpackagesmlica2indexhtml

45 SlimPLS Gutkin et al [33] proposed a feature extractionmethod based on PLS called SlimPLS Ranking-based filtersusually utilize a univariate method when selecting featuresThe filter methods can produce reasonable feature setsespecially when the original feature sets are uncorrelatedHowever the chosen feature set will be suboptimal whenthe features of the original set are dependent Some ofthe features will add little discriminative power on top ofpreviously selected features SlimPLS is a multivariate featureextraction method which incorporates feature dependenciesinto calculation This multivariate property is constructedby combining the highly predictive feature with some lesspredictive but correlated features This is because the addedfeatures will provide more information on the behaviour ofthe samples

46 Sparse Partial Least Squares Discriminant Analysis andSparse Generalized Partial Least Squares Chung and Keles[28] proposed two extension feature extraction approachesbased on Sparse PLS These approaches are Sparse PLSDiscriminant Analysis and Sparse Generalized PLS for high-dimensional datasets These two approaches improved ordi-nary PLS by employing feature extraction and dimensionreduction simultaneously These two approaches performwell even with unbalanced sample sizes of the classes SparsePLS Discrimination Analysis is computationally efficientbecause it only requires computational time for one run ofSparse PLS and a classifier Moreover Sparse GeneralizedPLS extends Sparse PLS to the generalized linear model

framework These methods were also developed by Chungand Keles for availability in an R package

47 Degrees of Freedom of Partial Least Squares Kramer andSugiyama [29] proposed a method of unbiased estimationof the degrees of freedom for PLS regression The authorsstated that the construction of latent components fromthe independent variable also depended on the dependentvariable However for PLS regression the optimal number ofcomponents needs to be determined first One of the ways ofdetermining the optimal number of components is throughthe degrees of freedom for the complexity of fitted modelsMoreover the degrees of freedom estimate can be used for thecomparison of different regression methods Furthermorethe two implementations for the degrees of freedom utilizethe connection between PLS regression and numerical linearmethods from numerical linear The authors also developedan R package for this unbiased estimation of the degrees offreedom of PLS

48 Surrogate Variable Analysis Partial Least SquaresChakraborty and Datta [30] proposed a surrogate variableanalysismethod based on PLS In differential gene expressionanalysis one of the important issues is to avoid the hiddenconfounders in the dataset The hidden confounders ofgene expression are caused by different environmentalconditions of the samples However this problem cannot besimply overcome by modifying the gene expression data byusing a normalizing technique This method can extract the

BioMed Research International 9

informative features by identifying the hidden effects of theunderlying latent factors using ordinary PLS and applyinganalysis of covariance (ANCOVA) ANCOVA is appliedwith the PLS signatures of these hidden effects as covariatesin order to identify the genes that are truly differentiallyexpressed This method was also developed by the authorsfor availability in an R package

49 Partial Least Squares Path Modelling Sanchez andTrinchera [31] developed an R package for Partial LeastSquares Path Modelling (PLS-PM) PLS-PM was first intro-duced byWold [67] and is also known as Structural EquationModelling (SEM) It can be used as a composite-basedalternative to factor-based SEM PLS-PM can be used whenthe distributions are highly skewed Moreover PLS-PM canalso be used to estimate relationships between latent variableswith several indicators even though the sample size is smallBasically PLS-PM consists of two sets of linear equationsthe inner model and the outer model The inner modelspecifies the relations between latent variables while theouter model specifies the relations between a latent variableand its observed indicator PLS-PM is a multivariate featureextraction analysis technique based on the cause-effect rela-tionships of the unobserved and observed features

410 Partial Least Squares Regression for Generalized LinearModels Bertrand et al [32] developed a software applicationof PLS regression for generalized linear models Generalizedlinear models are important to allow the response featuresto have a distribution other than normal Generalized linearmodels can be viewed as a case of generalized linear modelswith an identity link From the perspective of generalizedlinear models however it is useful to suppose that thedistribution function is the normal distributionwith constantvariance and the link function is the identity which isthe canonical link if the variance is known However thegeneralized linear models preserve all the predictive powerof the features where the predicted means are not assumed tobe normally distributed PLS regression is used to extract thepredictive features from the generalized linear models

411 Case Study In this section wewill discuss the R packageconsists of svpls This function will call fitModel function inorder to appropriate a number of ANCOVA models that areidentified by pmax to the data and opt for the best modelby looking the minimum value of the Akaikersquos informationCriterion (AIC) [68] Subsequently this model is utilized toforecast the real pattern of genesrsquo differential expression Thecommand lines in R are as follows

gt Fitting the optimal ANCOVA model to the datagivesgt fit lt-svpls (10 10 hidden facdat pmax = 5 fdr =005)gt The optimal ANCOVAmodel its AIC value andthe positive genes detectedgt from it are givenLgt fit$optmodel [1]

gt fit$AICopt [1]

gt fit$genesgt The corrected gene expression matrix obtainedafter removing the effects of the hidden variability isgiven bygt Ycorrected lt- fit$Ycorrgt pvaladj lt-fit$pvaluesadj

For instance we study the efficacy of svapls on ALLAMLpreprocessed dataset [69] This data consists of expressionlevels of 7129 genes that have been log-transformed overtwo samples of patients These two sets of 47 patients and25 patients reported to suffer from Acute lymphoblasticLeukemia (ALL) and Acute Myeloid Leukemia (AML)respectively By using svpls function we yielded initial 1000genes with corrected expression matrix Random samplesrsquodistribution from four sources in the abovementionedmatrixremoves the extra effects owing to reported batch specificclustering in the initial data In this context svapls performedequally efficient relative to another popular R package berfor removing batch effects in microarray data as shown inFigure 3

412 Summary of PLS Software Tables 6 and 7 show thesummary and sources of PLS software respectively Table 8shows the related works on discussed software

5 Software for Local Linear Embedding (LLE)

Straightforward geometric intuitions are the basis for LLEalgorithm Assume that given data comprise of 119873 real-valuedvectors 119883

119894 for each 119863 dimensionality tested by some core

manifold Given that there is adequate data every data pointand their neighbors are expected to be situated on or nearto a locally linear patch of the manifold Abovementionedpatches are described by linear coefficients that rebuild everydata point from respective neighbors Equation (8) is the costfunction used to calculate reconstruction errors which sumsthe squared distances between all the data points and theirreconstructionsTheweights119882

119894119895summarize the contribution

of the 119895th data point to the 119894th reconstruction The optimalweights119882

119894119895are found by solving a least-squares problem [70]

120598 (119882) =

119873

sum

119894=1

10038161003816100381610038161003816100381610038161003816100381610038161003816

119883119894

minus

119870

sum

119895=1

119882119894119895

119883119895

10038161003816100381610038161003816100381610038161003816100381610038161003816

2

= sum

119894=1

120598119894(119882) (8)

120598119894(119882) =

10038161003816100381610038161003816100381610038161003816100381610038161003816

119870

sum

119895=1

119882119894

119895(119909119894

minus 119909119895)

10038161003816100381610038161003816100381610038161003816100381610038161003816

2

=

119896

sum

119895=1

119896

sum

119898=1

119882119894

119895119882119894

119898119876119894

119895119898 (9)

119876119894

119895119898= (119909119894

minus 119909119895)119879

(119909119894

minus 119909119898

) =(119863119894119895

+ 119863119894119898

minus 119863119895119898

)

2 (10)

51 lle An R package ldquollerdquo has been developed in order toimplement LLE for feature extraction This package providesthe algorithm of LLE in order to transform high-dimensionaldata into low-dimensional data The newest version of this

10 BioMed Research International

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

I

CALG

B

CALG

B

CALG

BCA

LGB

CALG

BCA

LGB

CALG

B

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

B

CCG

CCG

CCG

CCG

CCG

St-Ju

de

St-Ju

deSt

-Jude

St-Ju

deSt

-Jude

St-Ju

deSt

-Jude

Sample source

(a)

DFC

ID

FCI

DFC

I

DFC

I

DFC

I

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

CALG

B

CALG

B

CALG

B

CALG

B

CALG

B

CALG

B

CALG

BCA

LGB

CALG

BCA

LGB

CALG

B

CALG

B

CALG

BCA

LGB

CALG

B

CCG

CCG

CCG

CCG

CCG

St-Ju

de

St-Ju

de

St-Ju

de

St-Ju

deSt

-Jude

St-Ju

de

St-Ju

de

St-Ju

de

Sample source

(b)

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

I

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

B

CALG

B

CALG

B

CCG

CCG

CCG

CCG

St-Ju

de

St-Ju

de

St-Ju

de

St-Ju

de

St-Ju

deSt

-Jude

St-Ju

deSt

-Jude

Sample source

(c)

Figure 3 (a b and c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data (a) Heatmapfor the first 1000 genes in the original Golub expression data (b) Heatmap for the first 1000 genes in the adjusted Golub expression dataobtained by use of the R package ber (c) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by the use of our Rpackage svapls

Table 6 A summary of PLS software

Number Software Authoryear Language Features

1 PLS DiscriminantAnalysis

Barker and Rayens[24]

CC++Visual Basic PLS for discriminant analysis

2 Least SquaresndashPLS Joslashrgensen et al [25] R Implementation combining PLS and ordinary least squares

3Powered PLSDiscriminantAnalysis

Liland and Indahl[26] R Extraction of information for multivariate classification problems

4 Penalized PLS Kramer et al (2008)[27] R Extension of PLS regression using penalization technique

5 SlimPLS Gutkin et al [22] R Multivariate feature extraction method which incorporates featuredependencies

6

Sparse PLSDiscriminantAnalysis SparseGeneralized PLS

Chung and Keles[28] R Sparse version techniques employing feature extraction and dimension

reduction simultaneously

7 PLS Degrees ofFreedom

Kramer andSugiyama [29] R Using an unbiased estimation of the degrees of freedom for PLS

regression

8 Surrogate VariableAnalysis PLS

Chakraborty andDatta [30] R Extraction of the informative features with hidden confounders which

are unaccounted for

9 PLS PathModelling

Sanchez andTrinchera [31] R A multivariate feature extraction analysis technique based on the

cause-effect relationships of the unobserved and observed features

10PLS Regression forGeneralized LinearModels

Bertrand et al (2013)[32] R PLS regression is used to extract the predictive features from the

generalized linear models

BioMed Research International 11

Table 7 Sources of PLS software

Number Software Sources1 PLS Discriminant Analysis httpwwwcamocomdownloadssample-datahtml2 Least SquaresndashPLS httpcranr-projectorgwebpackageslsplsindexhtml3 Powered PLS Discriminant Analysis httpcranr-projectorgwebpackagesplsindexhtml4 Penalized PLS httpcranr-projectorgwebpackagespplsindexhtml5 SlimPLS httpcranr-projectorgwebpackagesSlimPLSindexhtml6 Sparse PLS Discriminant Analysis Sparse Generalized PLS httpcranr-projectorgwebpackagessplsindexhtml7 Degrees of Freedom of PLS httpcranr-projectorgwebpackagesplsdofindexhtml8 Surrogate Variable Analysis PLS httpcranr-projectorgwebpackagessvaplsindexhtml9 PLS Path Modelling httpcranr-projectorgwebpackagesplspmindexhtml10 PLS Regression for Generalized Linear Models httpcranr-projectorgwebpackagesplsRglmindexhtml

Table 8: Related work.

| Software | Author | Motivation | Advantage |
| --- | --- | --- | --- |
| plsRglm (R package) | Bertrand et al. (2010) [32] | (i) To deal with incomplete datasets using cross-validation. (ii) To extend PLS regression to generalized linear models. | (i) Provides formula support. (ii) Several new classes and their generics. (iii) Custom GLR models and graphics to assess the bootstrap-based significance of the predictors. |
| SVA-PLS | Chakraborty and Datta [30] | (i) To identify the genes that are differentially expressed between the samples from two different tissue types. (ii) To identify the hidden effects of the underlying latent factors in a gene expression profiling study. | (i) Relatively better at discovering a higher proportion of the truly significant genes. (ii) Low error rate. (iii) High sensitivity and specificity. |
| SlimPLS | Gutkin et al. [33] | To obtain a low-dimensional approximation of a matrix that is "as close as possible" to a given vector. | (i) Focuses solely on feature selection. (ii) Can be used as a preprocessing stage with different classifiers. |

package is maintained by Diedrich and Abel [34]. The main functions of this package allow users to perform LLE and to plot its results. The implementation of LLE is based on the idea of de Ridder and Duin [71]. In addition, enhancements such as subset selection and calculation of the intrinsic dimension are offered. This package can be installed on Windows, Linux, and MacOS.
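As a concrete illustration, the following is a minimal sketch of an lle session on simulated data; the matrix X and all parameter values are illustrative rather than taken from the original article, and the argument names (m for the target dimension, k for the neighbourhood size, and id = TRUE to request an intrinsic dimension estimate) follow the package documentation:

> library(lle)
> set.seed(1)
> X <- matrix(rnorm(100 * 20), nrow = 100)  # 100 samples, 20 simulated features
> res <- lle(X, m = 2, k = 10, reg = 2, id = TRUE)
> head(res$Y)   # the two-dimensional embedding
> res$id        # estimated intrinsic dimension
> plot(res$Y, xlab = "LLE 1", ylab = "LLE 2")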

5.2. RDRToolbox. RDRToolbox is an R package developed for nonlinear dimension reduction with LLE and Isomap. The package is maintained by Bartenhagen [35]. It transforms high-dimensional data into low-dimensional data using either LLE or Isomap. A plotting function is provided to visualize the results, and the Davies-Bouldin Index is provided for validating clusters. It is mainly for Linux, MacOS, and Windows.

5.3. Scikit-Learn. Scikit-learn is machine learning software implemented in Python [36]. It is simple to use and allows users to apply a variety of machine learning algorithms, including classification, clustering, feature extraction, model selection, and manifold learning. Isomap, LLE, and Local Tangent Space Alignment (LTSA) are provided by this software. Please see http://scikit-learn.org/stable for further details. This software can be installed on a variety of platforms, such as Windows and Ubuntu.

5.4. Case Study. This section demonstrates the dimension reduction workflow for the publicly available Golub et al. leukemia dataset (see Figure 5). The data are available as an R package and can be downloaded and loaded via:

> source("http://bioconductor.org/biocLite.R")
> biocLite("golubEsets")
> library(golubEsets)
> data(Golub_Merge)

The dataset consists of 72 samples, divided into 47 ALL and 25 AML patients, with 7129 expression values per sample. In this example, we compute a two-dimensional LLE and Isomap embedding and plot the results. First, we extract the features and class labels:

> golubExprs = t(exprs(Golub_Merge))
> labels = pData(Golub_Merge)$ALL.AML
> dim(golubExprs)


[Figure 4: plot of residual variance against embedding dimension; x-axis "Dimension" (2 to 10), y-axis "Residual variance" (roughly 0.1 to 0.6).]

Figure 4: Plot of dimension versus residual variance.

[Figure 5: two scatter plots of the two-dimensional embeddings, with ALL and AML samples marked; (a) Isomap, (b) LLE.]

Figure 5: Two-dimensional embedding of the Golub et al. [69] leukemia dataset (top: Isomap; bottom: LLE).

Table 9: A summary of LLE software.

| Number | Software | Author/year | Language | Features |
| --- | --- | --- | --- | --- |
| 1 | lle | Diedrich and Abel [34] | R | (i) LLE algorithm for transforming high-dimensional data into low-dimensional data. (ii) Subset selection and calculation of the intrinsic dimension. |
| 2 | RDRToolbox | Bartenhagen [35] | R | (i) LLE and Isomap for feature extraction. (ii) Davies-Bouldin Index for cluster validation. |
| 3 | Scikit-learn | Pedregosa et al. [36] | Python | (i) Classification, manifold learning, feature extraction, clustering, and other methods. (ii) LLE, Isomap, and LTSA. |


Table 10: Sources of LLE software.

| Number | Software | Source |
| --- | --- | --- |
| 1 | lle | http://cran.r-project.org/web/packages/lle/index.html |
| 2 | RDRToolbox | http://www.bioconductor.org/packages/2.12/bioc/html/RDRToolbox.html |
| 3 | Scikit-learn | http://scikit-learn.org/dev/install.html |
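Note that lle is hosted on CRAN, whereas RDRToolbox is a Bioconductor package, so the two are installed differently. An illustrative session, using the biocLite installer that was current for the Bioconductor 2.x releases listed above (the same installer used in the case study), is:

> install.packages("lle")
> source("http://bioconductor.org/biocLite.R")
> biocLite("RDRToolbox")
> library(lle)
> library(RDRToolbox)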

Table 11: Related work.

| Software | Author | Motivation | Advantage |
| --- | --- | --- | --- |
| RDRToolbox | Bartenhagen [35] | (i) To reduce high-dimensionality microarray data. (ii) To preserve most of the significant information and generate data with characteristics similar to the high-dimensional original. | (i) Combines information from all features. (ii) Suited for low-dimensional representations of the whole data. |
| Scikit-learn | Pedregosa et al. [36] | To calculate activity index parameters through clustering. | (i) Easy-to-use interface. (ii) Can easily be integrated into applications outside the traditional range of statistical data analysis. |
| lle | Diedrich and Abel [34] | Currently available data dimension reduction methods are either supervised, where data need to be labeled, or computationally complex. | (i) Fast. (ii) Purely unsupervised. |

The residual variance of Isomap can be used to estimate the intrinsic dimension of the dataset:

> Isomap(data = golubExprs, dims = 1:10, plotResiduals = TRUE, k = 5)

Based on Figure 4, looking at the dimensions at which the residual variance stops decreasing significantly, we can expect a low intrinsic dimension of two or three, and therefore a visualization true to the structure of the original data. Next, we compute the LLE and Isomap embeddings for two target dimensions:

> golubIsomap = Isomap(data = golubExprs, dims = 2, k = 5)
> golubLLE = LLE(data = golubExprs, dim = 2, k = 5)

The Davies-Bouldin Index shows that the ALL and AML patients are well separated into two clusters:

> DBIndex(data = golubIsomap$dim2, labels = labels)
> DBIndex(data = golubLLE, labels = labels)

Finally, we use plotDR to plot the two-dimensional data:

> plotDR(data = golubIsomap$dim2, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "Isomap")
> plotDR(data = golubLLE, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "LLE")

Both visualizations, using either Isomap or LLE, show distinct clusters of ALL and AML patients, although the clusters overlap less in the Isomap embedding. This is consistent with the DB-Index, which is very low for both methods but slightly higher for LLE. A three-dimensional visualization can be generated in the same manner and is best analyzed interactively within R.
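As a sketch of that three-dimensional variant, the same RDRToolbox calls can be repeated with three target dimensions; the object names below are illustrative, and we assume plotDR accepts a three-column embedding for interactive viewing, as the remark above suggests:

> golubIsomap3 = Isomap(data = golubExprs, dims = 3, k = 5)
> golubLLE3 = LLE(data = golubExprs, dim = 3, k = 5)
> plotDR(data = golubIsomap3$dim3, labels = labels, legend = TRUE)
> title(main = "Isomap, 3D")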

5.5. Summary of LLE Software. Tables 9 and 10 show the summary and sources of LLE software, respectively. Table 11 shows related work on the discussed software.

6. Conclusion

Nowadays, numerous software applications have been developed to help users implement feature extraction of gene expression data. In this paper, we presented a comprehensive review of software for feature extraction methods, namely, PCA, ICA, PLS, and LLE. These software applications still have limitations in terms of statistical aspects as well as computational performance, so there remains a need for the development of better software.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Universiti Teknologi Malaysia for funding this research through Research University Grants (Grant nos. J130000262808J80 and J130000250705H50). The authors would also like to thank the Research Management Centre (RMC), Universiti Teknologi Malaysia, for supporting this research.

References

[1] S. Van Sanden, D. Lin, and T. Burzykowski, "Performance of gene selection and classification methods in a microarray setting: a simulation study," Communications in Statistics: Simulation and Computation, vol. 37, no. 1-2, pp. 409–424, 2008.
[2] Q. Liu, A. H. Sung, Z. Chen et al., "Gene selection and classification for cancer microarray data based on machine learning and similarity measures," BMC Genomics, vol. 12, supplement 5, article S1, 2011.
[3] M. Gheorghe and V. Mitrana, "A formal language-based approach in biology," Comparative and Functional Genomics, vol. 5, no. 1, pp. 91–94, 2004.
[4] P. G. Higgs and T. Attwood, "Bioinformatics and molecular evolution," Comparative and Functional Genomics, vol. 6, pp. 317–319, 2005.
[5] S. Le, J. Josse, and F. Husson, "FactoMineR: an R package for multivariate analysis," Journal of Statistical Software, vol. 25, no. 1, pp. 1–18, 2008.
[6] F. Hussen, J. Josse, S. Le, and J. Mazet, "Package 'FactoMineR'," 2013, http://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf.
[7] I. Hoffmann, "Principal Component Analysis with FactoMineR," 2010, http://www.statistik.tuwien.ac.at/public/filz/students/seminar/ws1011/hoffmann_ausarbeitung.pdf.
[8] D. Beaton, C. R. C. Fatt, and H. Abdi, "Package 'ExPosition'," 2013, http://cran.r-project.org/web/packages/ExPosition/ExPosition.pdf.
[9] A. Lucas, "Package 'amap'," 2013, http://cran.r-project.org/web/packages/amap/vignettes/amap.pdf.
[10] J. Thioulouse, D. Chessel, S. Doledec, and J.-M. Olivier, "ADE-4: a multivariate analysis and graphical display software," Statistics and Computing, vol. 7, no. 1, pp. 75–83, 1997.
[11] A. C. Culhane, J. Thioulouse, G. Perriere, and D. G. Higgins, "MADE4: an R package for multivariate analysis of gene expression data," Bioinformatics, vol. 21, no. 11, pp. 2789–2790, 2005.
[12] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier, 2nd edition, 2005.
[13] F. W. Young and C. M. Bann, "ViSta: a visual statistics system," in Statistical Computing Environments for Social Research, R. A. Stine and J. Fox, Eds., pp. 207–235, Sage, 1992.
[14] D. Grapov and J. W. Newman, "imDEV: a graphical user interface to R multivariate analysis tools in Microsoft Excel," Bioinformatics, vol. 28, no. 17, Article ID bts439, pp. 2288–2290, 2012.
[15] The MathWorks, Statistics Toolbox for Use with MATLAB: User's Guide, Version 4, 2003, http://www.pi.ingv.it/~longo/CorsoMatlab/OriginalManuals/stats.pdf.
[16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[17] NAG Toolbox for MATLAB, g03aa, G03: Multivariate Methods, http://www.nag.com/numeric/MB/manual_22_1/pdf/G03/g03aa.pdf.
[18] J. L. Marchini, C. Heaton, and B. D. Ripley, "Package 'fastICA'," http://cran.r-project.org/web/packages/fastICA/fastICA.pdf.
[19] K. Nordhausen, J.-F. Cardoso, J. Miettinen, H. Oja, E. Ollila, and S. Taskinen, "Package 'JADE'," http://cran.r-project.org/web/packages/JADE/JADE.pdf.
[20] D. Keith, C. Hoge, R. Frank, and A. D. Malony, HiPerSAT Technical Report, 2005, http://nic.uoregon.edu/docs/reports/HiPerSATTechReport.pdf.
[21] A. Biton, A. Zinovyev, E. Barillot, and F. Radvanyi, "MineICA: independent component analysis of transcriptomic data," 2013, http://www.bioconductor.org/packages/2.13/bioc/vignettes/MineICA/inst/doc/MineICA.pdf.
[22] J. Karnanen, "Independent component analysis using score functions from the Pearson system," 2006, http://cran.r-project.org/web/packages/PearsonICA/PearsonICA.pdf.
[23] A. Teschenforff, Independent Component Analysis Using Maximum Likelihood, 2012, http://cran.r-project.org/web/packages/mlica2/mlica2.pdf.
[24] M. Barker and W. Rayens, "Partial least squares for discrimination," Journal of Chemometrics, vol. 17, no. 3, pp. 166–173, 2003.
[25] K. Jørgensen, V. Segtnan, K. Thyholt, and T. Næs, "A comparison of methods for analysing regression models with both spectral and designed variables," Journal of Chemometrics, vol. 18, no. 10, pp. 451–464, 2004.
[26] K. H. Liland and U. G. Indahl, "Powered partial least squares discriminant analysis," Journal of Chemometrics, vol. 23, no. 1, pp. 7–18, 2009.
[27] N. Kramer, A. Boulesteix, and G. Tutz, "Penalized partial least squares with applications to B-spline transformations and functional data," Chemometrics and Intelligent Laboratory Systems, vol. 94, no. 1, pp. 60–69, 2008.
[28] D. Chung and S. Keles, "Sparse partial least squares classification for high dimensional data," Statistical Applications in Genetics and Molecular Biology, vol. 9, no. 1, 2010.
[29] N. Kramer and M. Sugiyama, "The degrees of freedom of partial least squares regression," Journal of the American Statistical Association, vol. 106, no. 494, pp. 697–705, 2011.
[30] S. Chakraborty and S. Datta, "Surrogate variable analysis using partial least squares (SVA-PLS) in gene expression studies," Bioinformatics, vol. 28, no. 6, pp. 799–806, 2012.
[31] G. Sanchez and L. Trinchera, Tools for Partial Least Squares Path Modeling, 2013, http://cran.r-project.org/web/packages/plspm/plspm.pdf.
[32] F. Bertrand, N. Meyer, and M. M. Bertrand, "Partial least squares regression for generalized linear models," http://cran.r-project.org/web/packages/plsRglm/plsRglm.pdf.
[33] M. Gutkin, R. Shamir, and G. Dror, "SlimPLS: a method for feature selection in gene expression-based disease classification," PLoS ONE, vol. 4, no. 7, Article ID e6416, 2009.
[34] H. Diedrich and M. Abel, "Package 'lle'," http://cran.r-project.org/web/packages/lle/lle.pdf.
[35] C. Bartenhagen, "RDRToolbox: a package for nonlinear dimension reduction with Isomap and LLE," 2013, http://bioconductor.org/packages/2.13/bioc/vignettes/RDRToolbox/inst/doc/vignette.pdf.
[36] F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[37] S. Penel, "Package 'ade4'," 2013, http://cran.r-project.org/web/packages/ade4/ade4.pdf.
[38] S. Dray and A. Dufour, "The ade4 package: implementing the duality diagram for ecologists," Journal of Statistical Software, vol. 22, no. 4, pp. 1–20, 2007.
[39] K. Moorthy, M. S. Mohamad, S. Deris, and Z. Ibrahim, "Multivariate analysis of gene expression data and missing value imputation based on llsimpute algorithm," International Journal of Innovative Computing, Information and Control, vol. 6, no. 5, pp. 1335–1339, 2012.
[40] A. Culhane, "Package 'made4'," 2013, http://bioconductor.org/packages/release/bioc/manuals/made4/man/made4.pdf.
[41] C. V. Subbulakshmi, S. N. Deepa, and N. Malathi, "Comparative analysis of XLMiner and WEKA for pattern classification," in Proceedings of the IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT '12), pp. 453–457, Ramanathapuram, Tamil Nadu, India, August 2012.
[42] S. Jothi and S. Anita, "Data mining classification techniques applied for cancer disease: a case study using XLMiner," International Journal of Engineering Research & Technology, vol. 1, no. 8, 2012.
[43] T. Anh and S. Magi, Principal Component Analysis, Final Paper in Financial Pricing, National Cheng Kung University, 2009.
[44] L. Tierney, Lisp-Stat: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics, Addison-Wesley, Reading, Mass, USA, 1990.
[45] F. W. Young and D. J. Lubinsky, "Guiding data analysis with visual statistical strategies," Journal of Computational and Graphical Statistics, vol. 4, pp. 229–250, 1995.
[46] F. W. Young and J. B. Smith, "Towards a structured data analysis environment: a cognition-based design," in Computing and Graphics in Statistics, A. Buja and P. A. Tukey, Eds., vol. 36, pp. 253–279, Springer, New York, NY, USA, 1991.
[47] F. W. Young, R. A. Faldowski, and M. M. McFarlane, "Multivariate statistical visualization," in Handbook of Statistics, C. R. Rao, Ed., pp. 958–998, 1993.
[48] M. McFarlane and F. W. Young, "Graphical sensitivity analysis for multidimensional scaling," Journal of Computational and Graphical Statistics, vol. 3, no. 1, pp. 23–34, 1994.
[49] P. M. Valero-Mora and R. D. Ledesma, "Using interactive graphics to teach multivariate data analysis to psychology students," Journal of Statistics Education, vol. 19, no. 1, 2011.
[50] E. Frank, M. Hall, G. Holmes et al., "Weka: a machine learning workbench for data mining," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 1269–1277, 2010.
[51] S. S. Prabhume and S. R. Sathe, "Reconstruction of a complete dataset from an incomplete dataset by PCA (principal component analysis) technique: some results," International Journal of Computer Science and Network Security, vol. 10, no. 12, pp. 195–199, 2010.
[52] J. Khan, J. S. Wei, M. Ringner et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.
[53] P. Comon, "Independent component analysis, a new concept?" Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[54] A. Hyvarinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.
[55] V. Zarzoso and P. Comon, "Comparative speed analysis of FastICA," in Independent Component Analysis and Signal Separation, M. E. Davies, C. J. James, S. A. Abdallah, and M. D. Plumbley, Eds., vol. 4666 of Lecture Notes in Computer Science, pp. 293–300, Springer, Berlin, Germany, 2007.
[56] J. F. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," IEE Proceedings, Part F: Radar and Signal Processing, vol. 140, no. 6, pp. 362–370, 1993.
[57] A. Belouchrani, K. Abed-Meraim, J. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434–444, 1997.
[58] L. Tong, V. C. Soon, Y. F. Huang, and R. Liu, "AMUSE: a new blind identification algorithm," in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1784–1787, May 1990.
[59] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Proceedings of the Advances in Neural Information Processing Systems Conference, pp. 757–763, 1996.
[60] A. Delorme and S. Makeig, "EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
[61] A. Biton, "Package 'MineICA'," 2013, http://www.bioconductor.org/packages/2.13/bioc/manuals/MineICA/man/MineICA.pdf.
[62] A. Hyvaerinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.
[63] M. Schmidt, D. Bohm, C. von Torne et al., "The humoral immune system has a key prognostic impact in node-negative breast cancer," Cancer Research, vol. 68, no. 13, pp. 5405–5413, 2008.
[64] I. S. Helland, "On the structure of partial least squares regression," Communications in Statistics: Simulation and Computation, vol. 17, no. 2, pp. 581–607, 1988.
[65] H. Martens and T. Naes, Multivariate Calibration, John Wiley & Sons, London, UK, 1989.
[66] N. Kramer and A. Boulesteix, "Package 'ppls'," 2013, http://cran.r-project.org/web/packages/ppls/ppls.pdf.
[67] H. Wold, "Soft modeling: the basic design and some extensions," in Systems under Indirect Observation: Causality, Structure, Prediction, K. G. Joreskog and H. Wold, Eds., Part 2, pp. 1–54, North-Holland, Amsterdam, The Netherlands, 1982.
[68] H. Akaike, "Likelihood and the Bayes procedure," Trabajos de Estadística y de Investigación Operativa, vol. 31, no. 1, pp. 143–166, 1980.
[69] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[70] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[71] D. de Ridder and R. P. W. Duin, Locally Linear Embedding, Delft University of Technology, Delft, The Netherlands, 2002.


DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

I

CALG

B

CALG

B

CALG

BCA

LGB

CALG

BCA

LGB

CALG

B

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

B

CCG

CCG

CCG

CCG

CCG

St-Ju

de

St-Ju

deSt

-Jude

St-Ju

deSt

-Jude

St-Ju

deSt

-Jude

Sample source

(a)

DFC

ID

FCI

DFC

I

DFC

I

DFC

I

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

CALG

B

CALG

B

CALG

B

CALG

B

CALG

B

CALG

B

CALG

BCA

LGB

CALG

BCA

LGB

CALG

B

CALG

B

CALG

BCA

LGB

CALG

B

CCG

CCG

CCG

CCG

CCG

St-Ju

de

St-Ju

de

St-Ju

de

St-Ju

deSt

-Jude

St-Ju

de

St-Ju

de

St-Ju

de

Sample source

(b)

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

I

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

B

CALG

B

CALG

B

CCG

CCG

CCG

CCG

St-Ju

de

St-Ju

de

St-Ju

de

St-Ju

de

St-Ju

deSt

-Jude

St-Ju

deSt

-Jude

Sample source

(c)

Figure 3 (a b and c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data (a) Heatmapfor the first 1000 genes in the original Golub expression data (b) Heatmap for the first 1000 genes in the adjusted Golub expression dataobtained by use of the R package ber (c) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by the use of our Rpackage svapls

Table 6 A summary of PLS software

Number Software Authoryear Language Features

1 PLS DiscriminantAnalysis

Barker and Rayens[24]

CC++Visual Basic PLS for discriminant analysis

2 Least SquaresndashPLS Joslashrgensen et al [25] R Implementation combining PLS and ordinary least squares

3Powered PLSDiscriminantAnalysis

Liland and Indahl[26] R Extraction of information for multivariate classification problems

4 Penalized PLS Kramer et al (2008)[27] R Extension of PLS regression using penalization technique

5 SlimPLS Gutkin et al [22] R Multivariate feature extraction method which incorporates featuredependencies

6

Sparse PLSDiscriminantAnalysis SparseGeneralized PLS

Chung and Keles[28] R Sparse version techniques employing feature extraction and dimension

reduction simultaneously

7 PLS Degrees ofFreedom

Kramer andSugiyama [29] R Using an unbiased estimation of the degrees of freedom for PLS

regression

8 Surrogate VariableAnalysis PLS

Chakraborty andDatta [30] R Extraction of the informative features with hidden confounders which

are unaccounted for

9 PLS PathModelling

Sanchez andTrinchera [31] R A multivariate feature extraction analysis technique based on the

cause-effect relationships of the unobserved and observed features

10PLS Regression forGeneralized LinearModels

Bertrand et al (2013)[32] R PLS regression is used to extract the predictive features from the

generalized linear models

BioMed Research International 11

Table 7 Sources of PLS software

Number Software Sources1 PLS Discriminant Analysis httpwwwcamocomdownloadssample-datahtml2 Least SquaresndashPLS httpcranr-projectorgwebpackageslsplsindexhtml3 Powered PLS Discriminant Analysis httpcranr-projectorgwebpackagesplsindexhtml4 Penalized PLS httpcranr-projectorgwebpackagespplsindexhtml5 SlimPLS httpcranr-projectorgwebpackagesSlimPLSindexhtml6 Sparse PLS Discriminant Analysis Sparse Generalized PLS httpcranr-projectorgwebpackagessplsindexhtml7 Degrees of Freedom of PLS httpcranr-projectorgwebpackagesplsdofindexhtml8 Surrogate Variable Analysis PLS httpcranr-projectorgwebpackagessvaplsindexhtml9 PLS Path Modelling httpcranr-projectorgwebpackagesplspmindexhtml10 PLS Regression for Generalized Linear Models httpcranr-projectorgwebpackagesplsRglmindexhtml

Table 8: Related work.

plsRglm (R package), Bertrand et al. (2010) [32]. Motivation: (i) to deal with incomplete datasets using cross-validation; (ii) to extend PLS regression to generalized linear models. Advantages: (i) provides formula support; (ii) several new classes and their generics; (iii) custom GLR models and graphics to assess the bootstrap-based significance of the predictors.

SVA-PLS, Chakraborty and Datta [30]. Motivation: (i) to identify the genes that are differentially expressed between samples from two different tissue types; (ii) to identify the hidden effects of the underlying latent factors in a gene expression profiling study. Advantages: (i) relatively better at discovering a higher proportion of the truly significant genes; (ii) low error rate; (iii) high sensitivity and specificity.

SlimPLS, Gutkin et al. [33]. Motivation: to obtain a low-dimensional approximation of a matrix that is "as close as possible" to a given vector. Advantages: (i) focuses solely on feature selection; (ii) can be used as a preprocessing stage with different classifiers.

package is maintained by Diedrich and Abel [34]. The main functions of this package allow users to perform LLE and also to plot the results of LLE. The implementation of LLE is based on the idea of Ridder and Duin [71]. Besides that, some enhancements, such as selection of the subset and calculation of the intrinsic dimension, are offered. This package can be installed on Windows, Linux, and MacOS.
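As a minimal sketch of how the package might be called (the data matrix X and the parameter choices m = 2 and k = 5 are illustrative assumptions, not values prescribed by the package documentation):

> library(lle)
> # X: an n x D numeric matrix with one row per sample;
> # embed it into m = 2 dimensions using k = 5 nearest neighbours
> result = lle(X, m = 2, k = 5)
> # result$Y holds the n x 2 low-dimensional coordinates
> plot(result$Y, xlab = "LLE 1", ylab = "LLE 2")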

5.2. RDRToolbox. RDRToolbox is an R package developed for nonlinear dimension reduction with LLE and Isomap. The package is maintained by Bartenhagen [35]. It offers the transformation of high-dimensional to low-dimensional data by using either LLE or Isomap. Besides that, a plotting function is provided to plot the results. In addition, the Davis-Bouldin Index is provided for the purposes of validating clusters. It is mainly for Linux, MacOS, and Windows.

5.3. Scikit-Learn. Scikit-learn is software implemented in Python by integrating machine learning algorithms [36]. It is simple-to-use software that allows users to implement a variety of machine learning algorithms. The machine learning algorithms include classification, clustering, feature extraction, model selection, manifold learning, and other methods. Isomap, LLE, and Local Tangent Space Alignment (LTSA) are provided by this software. Please see http://scikit-learn.org/stable for further details. This software can be installed on a variety of platforms, such as Windows and Ubuntu.

5.4. Case Study. This section demonstrates the dimension reduction workflow for the publicly available Golub et al. leukemia dataset (see Figure 5). The data is available as an R package and can be downloaded and loaded via:

> source("http://bioconductor.org/biocLite.R")
> biocLite("golubEsets")
> library(golubEsets)
> data(Golub_Merge)

The dataset consists of 72 samples, divided into 47 ALL and 25 AML patients, and 7129 expression values. In this example, we compute a two-dimensional LLE and Isomap embedding and plot the results. At first, we extract the features and class labels:

> golubExprs = t(exprs(Golub_Merge))
> labels = pData(Golub_Merge)$ALL.AML
> dim(golubExprs)
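Because the expression matrix is transposed so that rows correspond to samples and columns to genes, dim(golubExprs) should report the 72 samples and 7129 expression values described above, that is, output of the form:

[1]   72 7129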


Figure 4: Plot of dimension versus residual variance.

Figure 5: Two-dimensional embedding of the Golub et al. [69] leukemia dataset (top: Isomap; bottom: LLE); points are labeled ALL or AML.

Table 9: A summary of LLE software.

1. lle (Diedrich and Abel [34]; R): (i) the LLE algorithm is provided for transforming high-dimensional data into low-dimensional data; (ii) selection of the subset and calculation of the intrinsic dimension are provided.

2. RDRToolbox (Bartenhagen [35]; R): (i) LLE and Isomap for feature extraction; (ii) Davis-Bouldin Index for the purpose of validating clusters.

3. Scikit-learn (Pedregosa et al. [36]; Python): (i) classification, manifold learning, feature extraction, clustering, and other methods are offered; (ii) LLE, Isomap, and LTSA are provided.


Table 10: Sources of LLE software.

1. lle: http://cran.r-project.org/web/packages/lle/index.html
2. RDRToolbox: http://www.bioconductor.org/packages/2.12/bioc/html/RDRToolbox.html
3. Scikit-learn: http://scikit-learn.org/dev/install.html

Table 11: Related work.

RDRToolbox, Bartenhagen [35]. Motivation: (i) to reduce high-dimensionality microarray data; (ii) to preserve most of the significant information and generate data with characteristics similar to the high-dimensional original. Advantages: (i) combines information from all features; (ii) suited for low-dimensional representations of the whole data.

Scikit-learn, Pedregosa et al. [36]. Motivation: to calculate activity index parameters through clustering. Advantages: (i) easy-to-use interface; (ii) can easily be integrated into applications outside the traditional range of statistical data analysis.

lle, Diedrich and Abel [34]. Motivation: currently available dimension reduction methods are either supervised, where data need to be labeled, or computationally complex. Advantages: (i) fast; (ii) purely unsupervised.

The residual variance of Isomap can be used to estimate the intrinsic dimension of the dataset:

> Isomap(data = golubExprs, dims = 1:10, plotResiduals = TRUE, k = 5)

Based on Figure 4, and looking at the dimensions for which the residual variances stop decreasing significantly, we can expect a low intrinsic dimension of two or three and, therefore, a visualization true to the structure of the original data. Next, we compute the LLE and Isomap embeddings for two target dimensions:

> golubIsomap = Isomap(data = golubExprs, dims = 2, k = 5)
> golubLLE = LLE(data = golubExprs, dim = 2, k = 5)

The Davis-Bouldin-Index shows that the ALL and AML patients are well separated into two clusters:

> DBIndex(data = golubIsomap$dim2, labels = labels)
> DBIndex(data = golubLLE, labels = labels)

Finally, we use plotDR to plot the two-dimensional data:

> plotDR(data = golubIsomap$dim2, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "Isomap")
> plotDR(data = golubLLE, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "LLE")

Both visualizations, using either Isomap or LLE, show distinct clusters of ALL and AML patients, although the clusters overlap less in the Isomap embedding. This is consistent with the DB-Index, which is very low for both methods but slightly higher for LLE. A three-dimensional visualization can be generated in the same manner and is best analyzed interactively within R.

5.5. Summary of LLE Software. Tables 9 and 10 show the summary and sources of LLE software, respectively. Table 11 shows related work on the software discussed.

6. Conclusion

Nowadays, numerous software applications have been developed to help users implement feature extraction on gene expression data. In this paper, we present a comprehensive review of software for feature extraction methods. The methods are PCA, ICA, PLS, and LLE. These software applications have some limitations in terms of statistical aspects as well as computational performance. In conclusion, there is a need for the development of better software.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Universiti Teknologi Malaysia for funding this research through Research University Grants (Grant nos. J130000262808J80 and J130000250705H50). The authors would also like to thank the Research Management Centre (RMC), Universiti Teknologi Malaysia, for supporting this research.

References

[1] S. Van Sanden, D. Lin, and T. Burzykowski, "Performance of gene selection and classification methods in a microarray setting: a simulation study," Communications in Statistics: Simulation and Computation, vol. 37, no. 1-2, pp. 409–424, 2008.

[2] Q. Liu, A. H. Sung, Z. Chen et al., "Gene selection and classification for cancer microarray data based on machine learning and similarity measures," BMC Genomics, vol. 12, supplement 5, article S1, 2011.

[3] M. Gheorghe and V. Mitrana, "A formal language-based approach in biology," Comparative and Functional Genomics, vol. 5, no. 1, pp. 91–94, 2004.

[4] P. G. Higgs and T. Attwood, "Bioinformatics and molecular evolution," Comparative and Functional Genomics, vol. 6, pp. 317–319, 2005.

[5] S. Le, J. Josse, and F. Husson, "FactoMineR: an R package for multivariate analysis," Journal of Statistical Software, vol. 25, no. 1, pp. 1–18, 2008.

[6] F. Husson, J. Josse, S. Le, and J. Mazet, "Package 'FactoMineR'," 2013, http://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf.

[7] I. Hoffmann, "Principal Component Analysis with FactoMineR," 2010, http://www.statistik.tuwien.ac.at/public/filz/students/seminar/ws1011/hoffmann_ausarbeitung.pdf.

[8] D. Beaton, C. R. C. Fatt, and H. Abdi, "Package 'ExPosition'," 2013, http://cran.r-project.org/web/packages/ExPosition/ExPosition.pdf.

[9] A. Lucas, "Package 'amap'," 2013, http://cran.r-project.org/web/packages/amap/vignettes/amap.pdf.

[10] J. Thioulouse, D. Chessel, S. Doledec, and J.-M. Olivier, "ADE-4: a multivariate analysis and graphical display software," Statistics and Computing, vol. 7, no. 1, pp. 75–83, 1997.

[11] A. C. Culhane, J. Thioulouse, G. Perriere, and D. G. Higgins, "MADE4: an R package for multivariate analysis of gene expression data," Bioinformatics, vol. 21, no. 11, pp. 2789–2790, 2005.

[12] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier, 2nd edition, 2005.

[13] F. W. Young and C. M. Bann, "ViSta: a visual statistics system," in Statistical Computing Environments for Social Research, R. A. Stine and J. Fox, Eds., pp. 207–235, Sage, 1992.

[14] D. Grapov and J. W. Newman, "imDEV: a graphical user interface to R multivariate analysis tools in Microsoft Excel," Bioinformatics, vol. 28, no. 17, Article ID bts439, pp. 2288–2290, 2012.

[15] The MathWorks, Statistics Toolbox for Use with MATLAB: User's Guide, Version 4, 2003, http://www.pi.ingv.it/~longo/CorsoMatlab/OriginalManuals/stats.pdf.

[16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.

[17] NAG Toolbox for MATLAB, g03aa, G03: Multivariate Methods, http://www.nag.com/numeric/MB/manual_22_1/pdf/G03/g03aa.pdf.

[18] J. L. Marchini, C. Heaton, and B. D. Ripley, "Package 'fastICA'," http://cran.r-project.org/web/packages/fastICA/fastICA.pdf.

[19] K. Nordhausen, J.-F. Cardoso, J. Miettinen, H. Oja, E. Ollila, and S. Taskinen, "Package 'JADE'," http://cran.r-project.org/web/packages/JADE/JADE.pdf.

[20] D. Keith, C. Hoge, R. Frank, and A. D. Malony, HiPerSAT Technical Report, 2005, http://nic.uoregon.edu/docs/reports/HiPerSATTechReport.pdf.

[21] A. Biton, A. Zinovyev, E. Barillot, and F. Radvanyi, "MineICA: independent component analysis of transcriptomic data," 2013, http://www.bioconductor.org/packages/2.13/bioc/vignettes/MineICA/inst/doc/MineICA.pdf.

[22] J. Karnanen, "Independent component analysis using score functions from the Pearson system," 2006, http://cran.r-project.org/web/packages/PearsonICA/PearsonICA.pdf.

[23] A. Teschenforff, Independent Component Analysis Using Maximum Likelihood, 2012, http://cran.r-project.org/web/packages/mlica2/mlica2.pdf.

[24] M. Barker and W. Rayens, "Partial least squares for discrimination," Journal of Chemometrics, vol. 17, no. 3, pp. 166–173, 2003.

[25] K. Jørgensen, V. Segtnan, K. Thyholt, and T. Næs, "A comparison of methods for analysing regression models with both spectral and designed variables," Journal of Chemometrics, vol. 18, no. 10, pp. 451–464, 2004.

[26] K. H. Liland and U. G. Indahl, "Powered partial least squares discriminant analysis," Journal of Chemometrics, vol. 23, no. 1, pp. 7–18, 2009.

[27] N. Krämer, A. Boulesteix, and G. Tutz, "Penalized partial least squares with applications to B-spline transformations and functional data," Chemometrics and Intelligent Laboratory Systems, vol. 94, no. 1, pp. 60–69, 2008.

[28] D. Chung and S. Keles, "Sparse partial least squares classification for high dimensional data," Statistical Applications in Genetics and Molecular Biology, vol. 9, no. 1, 2010.

[29] N. Krämer and M. Sugiyama, "The degrees of freedom of partial least squares regression," Journal of the American Statistical Association, vol. 106, no. 494, pp. 697–705, 2011.

[30] S. Chakraborty and S. Datta, "Surrogate variable analysis using partial least squares (SVA-PLS) in gene expression studies," Bioinformatics, vol. 28, no. 6, pp. 799–806, 2012.

[31] G. Sanchez and L. Trinchera, Tools for Partial Least Squares Path Modeling, 2013, http://cran.r-project.org/web/packages/plspm/plspm.pdf.

[32] F. Bertrand, N. Meyer, and M. M. Bertrand, "Partial least squares regression for generalized linear models," http://cran.r-project.org/web/packages/plsRglm/plsRglm.pdf.

[33] M. Gutkin, R. Shamir, and G. Dror, "SlimPLS: a method for feature selection in gene expression-based disease classification," PLoS ONE, vol. 4, no. 7, Article ID e6416, 2009.

[34] H. Diedrich and M. Abel, "Package 'lle'," http://cran.r-project.org/web/packages/lle/lle.pdf.

[35] C. Bartenhagen, "RDRToolbox: a package for nonlinear dimension reduction with Isomap and LLE," 2013, http://bioconductor.org/packages/2.13/bioc/vignettes/RDRToolbox/inst/doc/vignette.pdf.

[36] F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[37] S. Penel, "Package 'ade4'," 2013, http://cran.r-project.org/web/packages/ade4/ade4.pdf.

[38] S. Dray and A. Dufour, "The ade4 package: implementing the duality diagram for ecologists," Journal of Statistical Software, vol. 22, no. 4, pp. 1–20, 2007.

[39] K. Moorthy, M. S. Mohamad, S. Deris, and Z. Ibrahim, "Multivariate analysis of gene expression data and missing value imputation based on LLSimpute algorithm," International Journal of Innovative Computing, Information and Control, vol. 6, no. 5, pp. 1335–1339, 2012.

[40] A. Culhane, "Package 'made4'," 2013, http://bioconductor.org/packages/release/bioc/manuals/made4/man/made4.pdf.

[41] C. V. Subbulakshmi, S. N. Deepa, and N. Malathi, "Comparative analysis of XLMiner and WEKA for pattern classification," in Proceedings of the IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT '12), pp. 453–457, Ramanathapuram, Tamil Nadu, India, August 2012.

[42] S. Jothi and S. Anita, "Data mining classification techniques applied for cancer disease: a case study using XLMiner," International Journal of Engineering Research & Technology, vol. 1, no. 8, 2012.

[43] T. Anh and S. Magi, Principal Component Analysis: Final Paper in Financial Pricing, National Cheng Kung University, 2009.

[44] L. Tierney, Lisp-Stat: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics, Addison-Wesley, Reading, Mass, USA, 1990.

[45] F. W. Young and D. J. Lubinsky, "Guiding data analysis with visual statistical strategies," Journal of Computational and Graphical Statistics, vol. 4, pp. 229–250, 1995.

[46] F. W. Young and J. B. Smith, "Towards a structured data analysis environment: a cognition-based design," in Computing and Graphics in Statistics, A. Buja and P. A. Tukey, Eds., vol. 36, pp. 253–279, Springer, New York, NY, USA, 1991.

[47] F. W. Young, R. A. Faldowski, and M. M. McFarlane, "Multivariate statistical visualization," in Handbook of Statistics, C. R. Rao, Ed., pp. 958–998, 1993.

[48] M. McFarlane and F. W. Young, "Graphical sensitivity analysis for multidimensional scaling," Journal of Computational and Graphical Statistics, vol. 3, no. 1, pp. 23–34, 1994.

[49] P. M. Valero-Mora and R. D. Ledesma, "Using interactive graphics to teach multivariate data analysis to psychology students," Journal of Statistics Education, vol. 19, no. 1, 2011.

[50] E. Frank, M. Hall, G. Holmes et al., "Weka: a machine learning workbench for data mining," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 1269–1277, 2010.

[51] S. S. Prabhume and S. R. Sathe, "Reconstruction of a complete dataset from an incomplete dataset by PCA (principal component analysis) technique: some results," International Journal of Computer Science and Network Security, vol. 10, no. 12, pp. 195–199, 2010.

[52] J. Khan, J. S. Wei, M. Ringner et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.

[53] P. Comon, "Independent component analysis: a new concept?" Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.

[54] A. Hyvärinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.

[55] V. Zarzoso and P. Comon, "Comparative speed analysis of FastICA," in Independent Component Analysis and Signal Separation, M. E. Davies, C. J. James, S. A. Abdallah, and M. D. Plumbley, Eds., vol. 4666 of Lecture Notes in Computer Science, pp. 293–300, Springer, Berlin, Germany, 2007.

[56] J. F. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," IEE Proceedings, Part F: Radar and Signal Processing, vol. 140, no. 6, pp. 362–370, 1993.

[57] A. Belouchrani, K. Abed-Meraim, J. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434–444, 1997.

[58] L. Tong, V. C. Soon, Y. F. Huang, and R. Liu, "AMUSE: a new blind identification algorithm," in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1784–1787, May 1990.

[59] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Proceedings of the Advances in Neural Information Processing Systems Conference, pp. 757–763, 1996.

[60] A. Delorme and S. Makeig, "EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.

[61] A. Biton, "Package 'MineICA'," 2013, http://www.bioconductor.org/packages/2.13/bioc/manuals/MineICA/man/MineICA.pdf.

[62] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.

[63] M. Schmidt, D. Böhm, C. von Törne et al., "The humoral immune system has a key prognostic impact in node-negative breast cancer," Cancer Research, vol. 68, no. 13, pp. 5405–5413, 2008.

[64] I. S. Helland, "On the structure of partial least squares regression," Communications in Statistics: Simulation and Computation, vol. 17, no. 2, pp. 581–607, 1988.

[65] H. Martens and T. Næs, Multivariate Calibration, John Wiley & Sons, London, UK, 1989.

[66] N. Krämer and A. Boulesteix, "Package 'ppls'," 2013, http://cran.r-project.org/web/packages/ppls/ppls.pdf.

[67] H. Wold, "Soft modeling: the basic design and some extensions," in Systems under Indirect Observations: Causality, Structure, Prediction, K. G. Jöreskog and H. Wold, Eds., Part 2, pp. 1–54, North-Holland, Amsterdam, The Netherlands, 1982.

[68] H. Akaike, "Likelihood and the Bayes procedure," Trabajos de Estadística y de Investigación Operativa, vol. 31, no. 1, pp. 143–166, 1980.

[69] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[70] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.

[71] D. de Ridder and R. P. W. Duin, Locally Linear Embedding, Delft University of Technology, Delft, The Netherlands, 2002.


DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

I

CALG

B

CALG

B

CALG

BCA

LGB

CALG

BCA

LGB

CALG

B

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

B

CCG

CCG

CCG

CCG

CCG

St-Ju

de

St-Ju

deSt

-Jude

St-Ju

deSt

-Jude

St-Ju

deSt

-Jude

Sample source

(a)

DFC

ID

FCI

DFC

I

DFC

I

DFC

I

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

CALG

B

CALG

B

CALG

B

CALG

B

CALG

B

CALG

B

CALG

BCA

LGB

CALG

BCA

LGB

CALG

B

CALG

B

CALG

BCA

LGB

CALG

B

CCG

CCG

CCG

CCG

CCG

St-Ju

de

St-Ju

de

St-Ju

de

St-Ju

deSt

-Jude

St-Ju

de

St-Ju

de

St-Ju

de

Sample source

(b)

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

I

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

DFC

ID

FCI

DFC

ID

FCI

DFC

I

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

BCA

LGB

CALG

B

CALG

B

CALG

B

CCG

CCG

CCG

CCG

St-Ju

de

St-Ju

de

St-Ju

de

St-Ju

de

St-Ju

deSt

-Jude

St-Ju

deSt

-Jude

Sample source

(c)

Figure 3 (a b and c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data (a) Heatmapfor the first 1000 genes in the original Golub expression data (b) Heatmap for the first 1000 genes in the adjusted Golub expression dataobtained by use of the R package ber (c) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by the use of our Rpackage svapls

Table 6 A summary of PLS software

Number Software Authoryear Language Features

1 PLS DiscriminantAnalysis

Barker and Rayens[24]

CC++Visual Basic PLS for discriminant analysis

2 Least SquaresndashPLS Joslashrgensen et al [25] R Implementation combining PLS and ordinary least squares

3Powered PLSDiscriminantAnalysis

Liland and Indahl[26] R Extraction of information for multivariate classification problems

4 Penalized PLS Kramer et al (2008)[27] R Extension of PLS regression using penalization technique

5 SlimPLS Gutkin et al [22] R Multivariate feature extraction method which incorporates featuredependencies

6

Sparse PLSDiscriminantAnalysis SparseGeneralized PLS

Chung and Keles[28] R Sparse version techniques employing feature extraction and dimension

reduction simultaneously

7 PLS Degrees ofFreedom

Kramer andSugiyama [29] R Using an unbiased estimation of the degrees of freedom for PLS

regression

8 Surrogate VariableAnalysis PLS

Chakraborty andDatta [30] R Extraction of the informative features with hidden confounders which

are unaccounted for

9 PLS PathModelling

Sanchez andTrinchera [31] R A multivariate feature extraction analysis technique based on the

cause-effect relationships of the unobserved and observed features

10PLS Regression forGeneralized LinearModels

Bertrand et al (2013)[32] R PLS regression is used to extract the predictive features from the

generalized linear models

BioMed Research International 11

Table 7 Sources of PLS software

Number Software Sources1 PLS Discriminant Analysis httpwwwcamocomdownloadssample-datahtml2 Least SquaresndashPLS httpcranr-projectorgwebpackageslsplsindexhtml3 Powered PLS Discriminant Analysis httpcranr-projectorgwebpackagesplsindexhtml4 Penalized PLS httpcranr-projectorgwebpackagespplsindexhtml5 SlimPLS httpcranr-projectorgwebpackagesSlimPLSindexhtml6 Sparse PLS Discriminant Analysis Sparse Generalized PLS httpcranr-projectorgwebpackagessplsindexhtml7 Degrees of Freedom of PLS httpcranr-projectorgwebpackagesplsdofindexhtml8 Surrogate Variable Analysis PLS httpcranr-projectorgwebpackagessvaplsindexhtml9 PLS Path Modelling httpcranr-projectorgwebpackagesplspmindexhtml10 PLS Regression for Generalized Linear Models httpcranr-projectorgwebpackagesplsRglmindexhtml

Table 8 Related work

Software Author Motivation Advantage

plsRglm (R package) Bertrand et al(2010) [32]

(i) To deal with incomplete datasets usingcross-validation(ii) To extend PLS regression to generalizedlinear models

(i) Provides formula support(ii) Several new classes and their generics(iii) Custom GLR models and graphics toassess the bootstrap based significance of thepredictors

SVA-PLS Chakrabortyand Datta [30]

(i) To identify the genes that are differentiallyexpressed between the samples from twodifferent tissue types(ii) To identify the hidden effects of theunderlying latent factors in a gene expressionprofiling study

(i) Relatively better at discovering a higherproportion of the truly significant genes(ii) Low error rate(iii) High sensitivity and specificity

SlimPLS Gutkin et al[33]

To obtain a low dimensional approximation ofa matrix that is ldquoas close as possiblerdquo to a givenvector

(i) Focuses solely on feature selection(ii) Can be used as a pre-processing stage withdifferent classifiers

package is maintained by Diedrich and Abel [34] The mainfunctions of this package allow users to perform LLE andalso to plot the results of LLE The implementation of LLE isbased on the idea of Ridder and Duin [71] Besides that someenhancements such as selection of the subset and calculationof the intrinsic dimension are offered This package can beinstalled on Windows Linux and MacOS

52 RDRToolbox RDRToolbox is an R package developedfor nonlinear dimension reduction with LLE and IsomapThe package is maintained by Bartenhagen [35] It offersthe transformation of high-dimensional to low-dimensionaldata by using either LLE or Isomap Besides that a plottingfunction is provided to plot the results In addition theDavis-Bouldin Index is provided for the purposes of validatingclusters It is mainly for Linux MacOS and Windows

53 Scikit-Learn Scikit-learn is software implemented inPython by integrating machine learning algorithms [36] Itis a simple-to-use software that allows users to implementa variety of machine learning algorithms The machinelearning algorithms include classification clustering featureextraction model selection manifold learning and othermethods Isomap LLE and Local Tangent Space Align-ment (LTSA) are provided by this software Please see

httpscikit-learnorgstable for further details This soft-ware can be installed on a variety of platforms such asWindows and Ubuntu

54 Case Study This section demonstrates the dimensionreduction workflow for the publicly available the Golub etal leukemia dataset (see Figure 5) The data is available as Rpackage and can be downloaded and loaded via

gt source (ldquohttpbioconductororgbiocLiteRrdquo)gt biocLite (ldquogolubEsetsrdquo)gt library (golubEsets)gt data (Golub Merge)

The dataset consists of 72 samples divided into 47 ALL and25 AML patients and 7129 expression values In this examplewe compute a two-dimensional LLE and Isomap embeddingand plot the results At first we extract the features and classlabels

gt golubExprs = t (exprs (Golub Merge))gt labels = pData (Golub Merge)$ALLAMLgt dim (golubExprs)

12 BioMed Research International

2 4 6 8 10

01

02

03

04

05

06

Dimension

Resid

ual v

aria

nce

Figure 4 Plot of dimension versus residual variance

ALL

Isomap

AML

minus150000

minus50000

0

50000

150000

minus2e + 05 minus1e + 05 0e + 00 1e + 05 2e + 05

(a)

ALLAML

LLE

minus10 minus05 00 05 10 15 20 25

minus2

minus1

0

1

2

(b)

Figure 5 Two-dimensional embedding of the Golub et al [69] leukemia dataset (top Isomap bottom LLE)

Table 9 A summary of LLE software

Number Software Authoryear Language Features

1 lle Diedrich and Abel[34] R

(i) LLE algorithm is provided for transforminghigh-dimensional data into low-dimensional data(ii) Selection of subset and calculation of the intrinsicdimension are provided

2 RDRToolbox Bartenhagen [35] R (i) LLE and Isomap for feature extraction(ii) Davis-Bouldin Index for the purpose of validating clusters

3 Scikit-learn Pedregosa et al[36] Python

(i) Classification manifold learning feature extractionclustering and other methods are offered(ii) LLE Isomap and LTSA are provided

BioMed Research International 13

Table 10 Sources of LLE software

Number Software Sources1 lle httpcranr-projectorgwebpackageslleindexhtml2 RDRToolbox httpwwwbioconductororgpackages212biochtmlRDRToolboxhtml3 Scikit-learn httpscikit-learnorgdevinstallhtml

Table 11 Related work

Software Author Motivation Advantage

RDRToolbox Bartenhagen [35]

(i) To reduce high dimensionalitymicroarray data(ii) To preserve most of the significantinformation and generate data with similarcharacteristics like the high-dimensionaloriginal

(i) Combine information from all features(ii) Suited for low-dimensionalrepresentations of the whole data

Scikit-learn Pedregosa et al [36] To calculate activity index parametersthrough clustering

(i) Easy-to-use interface(ii) Can easily be integrated into applicationsoutside the traditional range of statisticaldata analysis

lle Diedrich and Abel [34]Currently available data dimensionreduction methods are either supervisedwhere data need to be labeled orcomputational complex

(i) Fast(ii) Purely unsupervised

The residual variance of Isomap can be used to estimate the intrinsic dimension of the dataset:

> Isomap(data = golubExprs, dims = 1:10, plotResiduals = TRUE, k = 5)

Based on Figure 4, which shows the dimensions at which the residual variances stop decreasing significantly, we can expect a low intrinsic dimension of two or three, and therefore a visualization true to the structure of the original data. Next, we compute the LLE and Isomap embeddings for two target dimensions:

> golubIsomap = Isomap(data = golubExprs, dims = 2, k = 5)
> golubLLE = LLE(data = golubExprs, dim = 2, k = 5)

The Davis-Bouldin-Index shows that the ALL and AML patients are well separated into two clusters:

> DBIndex(data = golubIsomap$dim2, labels = labels)
> DBIndex(data = golubLLE, labels = labels)

Finally, we use plotDR to plot the two-dimensional data:

> plotDR(data = golubIsomap$dim2, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "Isomap")
> plotDR(data = golubLLE, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "LLE")

Both visualizations, using either Isomap or LLE, show distinct clusters of ALL and AML patients, although the clusters overlap less in the Isomap embedding. This is consistent with the DB-Index, which is very low for both methods but slightly higher for LLE. A three-dimensional visualization can be generated in the same manner and is best analyzed interactively within R.
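As a sketch of the three-dimensional variant (the object names are ours, and we assume plotDR renders three-column embeddings analogously to the two-dimensional case):

> golubIsomap3 = Isomap(data = golubExprs, dims = 3, k = 5)
> golubLLE3 = LLE(data = golubExprs, dim = 3, k = 5)
> # as before, but on three-dimensional coordinates
> plotDR(data = golubIsomap3$dim3, labels = labels, legend = TRUE)
> plotDR(data = golubLLE3, labels = labels, legend = TRUE)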

5.5. Summary of LLE Software. Tables 9 and 10 show the summary and sources of LLE software, respectively. Table 11 shows the related work on the discussed software.

6. Conclusion

Nowadays, numerous software applications have been developed to help users implement feature extraction of gene expression data. In this paper, we present a comprehensive review of software for feature extraction methods. The methods are PCA, ICA, PLS, and LLE. These software applications have some limitations in terms of statistical aspects as well as computational performance. In conclusion, there is a need for the development of better software.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Universiti Teknologi Malaysia for funding this research by Research University Grants (Grant nos. J130000262808J80 and J130000250705H50). The authors would also like to thank the Research Management Centre (RMC), Universiti Teknologi Malaysia, for supporting this research.

References

[1] S. Van Sanden, D. Lin, and T. Burzykowski, "Performance of gene selection and classification methods in a microarray setting: a simulation study," Communications in Statistics: Simulation and Computation, vol. 37, no. 1-2, pp. 409–424, 2008.
[2] Q. Liu, A. H. Sung, Z. Chen et al., "Gene selection and classification for cancer microarray data based on machine learning and similarity measures," BMC Genomics, vol. 12, supplement 5, article S1, 2011.
[3] M. Gheorghe and V. Mitrana, "A formal language-based approach in biology," Comparative and Functional Genomics, vol. 5, no. 1, pp. 91–94, 2004.
[4] P. G. Higgs and T. Attwood, "Bioinformatics and molecular evolution," Comparative and Functional Genomics, vol. 6, pp. 317–319, 2005.
[5] S. Le, J. Josse, and F. Husson, "FactoMineR: an R package for multivariate analysis," Journal of Statistical Software, vol. 25, no. 1, pp. 1–18, 2008.
[6] F. Hussen, J. Josse, S. Le, and J. Mazet, "Package 'FactoMineR'," 2013, http://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf.
[7] I. Hoffmann, "Principal Component Analysis with FactoMineR," 2010, http://www.statistik.tuwien.ac.at/public/filz/students/seminar/ws1011/hoffmann_ausarbeitung.pdf.
[8] D. Beaton, C. R. C. Fatt, and H. Abdi, Package 'ExPosition', 2013, http://cran.r-project.org/web/packages/ExPosition/ExPosition.pdf.
[9] A. Lucas, "Package 'amap'," 2013, http://cran.r-project.org/web/packages/amap/vignettes/amap.pdf.
[10] J. Thioulouse, D. Chessel, S. Doledec, and J.-M. Olivier, "ADE-4: a multivariate analysis and graphical display software," Journal of Statistics and Computing, vol. 7, no. 1, pp. 75–83, 1997.
[11] A. C. Culhane, J. Thioulouse, G. Perriere, and D. G. Higgins, "MADE4: an R package for multivariate analysis of gene expression data," Bioinformatics, vol. 21, no. 11, pp. 2789–2790, 2005.
[12] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier, 2nd edition, 2005.
[13] F. W. Young and C. M. Bann, "ViSta: a visual statistics system," in Statistical Computing Environments for Social Research, R. A. Stine and J. Fox, Eds., pp. 207–235, Sage, 1992.
[14] D. Grapov and J. W. Newman, "imDEV: a graphical user interface to R multivariate analysis tools in Microsoft Excel," Bioinformatics, vol. 28, no. 17, Article ID bts439, pp. 2288–2290, 2012.
[15] The MathWorks, Statistics Toolbox for Use with MATLAB: User's Guide Version 4, 2003, http://www.pi.ingv.it/~longo/CorsoMatlab/OriginalManuals/stats.pdf.
[16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[17] NAG Toolbox for Matlab, g03aa, G03: Multivariate Methods, http://www.nag.com/numeric/MB/manual_22_1/pdf/G03/g03aa.pdf.
[18] J. L. Marchini, C. Heaton, and B. D. Ripley, "Package 'fastICA'," http://cran.r-project.org/web/packages/fastICA/fastICA.pdf.
[19] K. Nordhausen, J.-F. Cardoso, J. Miettinen, H. Oja, E. Ollila, and S. Taskinen, "Package 'JADE'," http://cran.r-project.org/web/packages/JADE/JADE.pdf.
[20] D. Keith, C. Hoge, R. Frank, and A. D. Malony, HiPerSAT Technical Report, 2005, http://nic.uoregon.edu/docs/reports/HiPerSATTechReport.pdf.
[21] A. Biton, A. Zinovyev, E. Barillot, and F. Radvanyi, "MineICA: independent component analysis of transcriptomic data," 2013, http://www.bioconductor.org/packages/2.13/bioc/vignettes/MineICA/inst/doc/MineICA.pdf.
[22] J. Karnanen, "Independent component analysis using score functions from the Pearson system," 2006, http://cran.r-project.org/web/packages/PearsonICA/PearsonICA.pdf.
[23] A. Teschenforff, Independent Component Analysis Using Maximum Likelihood, 2012, http://cran.r-project.org/web/packages/mlica2/mlica2.pdf.
[24] M. Barker and W. Rayens, "Partial least squares for discrimination," Journal of Chemometrics, vol. 17, no. 3, pp. 166–173, 2003.
[25] K. Jørgensen, V. Segtnan, K. Thyholt, and T. Næs, "A comparison of methods for analysing regression models with both spectral and designed variables," Journal of Chemometrics, vol. 18, no. 10, pp. 451–464, 2004.
[26] K. H. Liland and U. G. Indahl, "Powered partial least squares discriminant analysis," Journal of Chemometrics, vol. 23, no. 1, pp. 7–18, 2009.
[27] N. Kramer, A. Boulesteix, and G. Tutz, "Penalized Partial Least Squares with applications to B-spline transformations and functional data," Chemometrics and Intelligent Laboratory Systems, vol. 94, no. 1, pp. 60–69, 2008.
[28] D. Chung and S. Keles, "Sparse partial least squares classification for high dimensional data," Statistical Applications in Genetics and Molecular Biology, vol. 9, no. 1, 2010.
[29] N. Kramer and M. Sugiyama, "The degrees of freedom of partial least squares regression," Journal of the American Statistical Association, vol. 106, no. 494, pp. 697–705, 2011.
[30] S. Chakraborty and S. Datta, "Surrogate variable analysis using partial least squares (SVA-PLS) in gene expression studies," Bioinformatics, vol. 28, no. 6, pp. 799–806, 2012.
[31] G. Sanchez and L. Trinchera, Tools for Partial Least Squares Path Modeling, 2013, http://cran.r-project.org/web/packages/plspm/plspm.pdf.
[32] F. Bertrand, N. Meyer, and M. M. Bertrand, "Partial Least Squares Regression for generalized linear models," http://cran.r-project.org/web/packages/plsRglm/plsRglm.pdf.
[33] M. Gutkin, R. Shamir, and G. Dror, "SlimPLS: a method for feature selection in gene expression-based disease classification," PLoS ONE, vol. 4, no. 7, Article ID e6416, 2009.
[34] H. Diedrich and M. Abel, "Package 'lle'," http://cran.r-project.org/web/packages/lle/lle.pdf.
[35] C. Bartenhagen, "RDRToolbox: a package for nonlinear dimension reduction with Isomap and LLE," 2013, http://bioconductor.org/packages/2.13/bioc/vignettes/RDRToolbox/inst/doc/vignette.pdf.
[36] F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[37] S. Penel, Package 'ade4', 2013, http://cran.r-project.org/web/packages/ade4/ade4.pdf.
[38] S. Dray and A. Dufour, "The ade4 package: implementing the duality diagram for ecologists," Journal of Statistical Software, vol. 22, no. 4, pp. 1–20, 2007.
[39] K. Moorthy, M. S. Mohamad, S. Deris, and Z. Ibrahim, "Multivariate analysis of gene expression data and missing value imputation based on llsimpute algorithm," International Journal of Innovative Computing, Information and Control, vol. 6, no. 5, pp. 1335–1339, 2012.
[40] A. Culhane, Package 'made4', 2013, http://bioconductor.org/packages/release/bioc/manuals/made4/man/made4.pdf.
[41] C. V. Subbulakshmi, S. N. Deepa, and N. Malathi, "Comparative analysis of XLMiner and WEKA for pattern classification," in Proceedings of the IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT '12), pp. 453–457, Ramanathapuram, Tamil Nadu, India, August 2012.
[42] S. Jothi and S. Anita, "Data mining classification techniques applied for cancer disease: a case study using Xlminer," International Journal of Engineering Research & Technology, vol. 1, no. 8, 2012.
[43] T. Anh and S. Magi, Principal Component Analysis, Final Paper in Financial Pricing, National Cheng Kung University, 2009.
[44] L. Tierney, Lisp-Stat: An Object-Oriented Environment for Statistical Computing & Dynamic Graphics, Addison-Wesley, Reading, Mass, USA, 1990.
[45] F. W. Young and D. J. Lubinsky, "Guiding data analysis with visual statistical strategies," Journal of Computational and Graphical Statistics, vol. 4, pp. 229–250, 1995.
[46] F. W. Young and J. B. Smith, "Towards a structured data analysis environment: a cognition-based design," in Computing and Graphics in Statistics, A. Buja and P. A. Tukey, Eds., vol. 36, pp. 253–279, Springer, New York, NY, USA, 1991.
[47] F. W. Young, R. A. Faldowski, and M. M. McFarlane, "Multivariate statistical visualization," in Handbook of Statistics, C. R. Rao, Ed., pp. 958–998, 1993.
[48] M. McFarlane and F. W. Young, "Graphical sensitivity analysis for multidimensional scaling," Journal of Computational and Graphical Statistics, vol. 3, no. 1, pp. 23–34, 1994.
[49] P. M. Valero-Mora and R. D. Ledesma, "Using interactive graphics to teach multivariate data analysis to psychology students," Journal of Statistics Education, vol. 19, no. 1, 2011.
[50] E. Frank, M. Hall, G. Holmes et al., "Weka: a machine learning workbench for data mining," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 1269–1277, 2010.
[51] S. S. Prabhume and S. R. Sathe, "Reconstruction of a complete dataset from an incomplete dataset by PCA (principal component analysis) technique: some results," International Journal of Computer Science and Network Security, vol. 10, no. 12, pp. 195–199, 2010.
[52] J. Khan, J. S. Wei, M. Ringner et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.
[53] P. Comon, "Independent component analysis, a new concept?" Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[54] A. Hyvarinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.
[55] V. Zarzoso and P. Comon, "Comparative speed analysis of FastICA," in Independent Component Analysis and Signal Separation, M. E. Davies, C. J. James, S. A. Abdallah, and M. D. Plumbley, Eds., vol. 4666 of Lecture Notes in Computer Science, pp. 293–300, Springer, Berlin, Germany, 2007.
[56] J. F. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," IEE Proceedings Part F: Radar and Signal Processing, vol. 140, no. 6, pp. 362–370, 1993.
[57] A. Belouchrani, K. Abed-Meraim, J. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434–444, 1997.
[58] L. Tong, V. C. Soon, Y. F. Huang, and R. Liu, "AMUSE: a new blind identification algorithm," in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1784–1787, May 1990.
[59] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Proceedings of the Advances in Neural Information Processing Systems Conference, pp. 757–763, 1996.
[60] A. Delorme and S. Makeig, "EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
[61] A. Biton, "Package 'MineICA'," 2013, http://www.bioconductor.org/packages/2.13/bioc/manuals/MineICA/man/MineICA.pdf.
[62] A. Hyvaerinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.
[63] M. Schmidt, D. Bohm, C. von Torne et al., "The humoral immune system has a key prognostic impact in node-negative breast cancer," Cancer Research, vol. 68, no. 13, pp. 5405–5413, 2008.
[64] I. S. Helland, "On the structure of partial least squares regression," Communications in Statistics: Simulation and Computation, vol. 17, no. 2, pp. 581–607, 1988.
[65] H. Martens and T. Naes, Multivariate Calibration, John Wiley & Sons, London, UK, 1989.
[66] N. Kramer and A. Boulesteix, "Package 'ppls'," 2013, http://cran.r-project.org/web/packages/ppls/ppls.pdf.
[67] H. Wold, "Soft modeling: the basic design and some extensions," in Systems under Indirect Observations: Causality, Structure, Prediction, K. G. Joreskog and H. Wold, Eds., Part 2, pp. 1–54, North-Holland, Amsterdam, The Netherlands, 1982.
[68] H. Akaike, "Likelihood and the Bayes procedure," Trabajos de Estadística y de Investigación Operativa, vol. 31, no. 1, pp. 143–166, 1980.
[69] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[70] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[71] D. Ridder and R. P. W. Duin, Locally Linear Embedding, University of Technology, Delft, The Netherlands, 2002.


[70] S T Roweis and L K Saul ldquoNonlinear dimensionality reduc-tion linear embeddingrdquo Science vol 290 no 5500 pp 2323ndash2326 2000

[71] D Ridder and R P W Duin Locally Linear Embedding Uni-versity of Technology Delft The Netherlands 2002

Page 6: A Review of Feature Extraction Software for Microarray ... › a58e › 818fcd408e... · A Review of Feature Extraction Software for Microarray Gene Expression Data ... When gene

6 BioMed Research International

Table 2: Sources of PCA software

1. FactoMineR: http://cran.r-project.org/web/packages/FactoMineR/index.html
2. ExPosition: http://cran.r-project.org/web/packages/ExPosition/index.html
3. Amap: http://cran.r-project.org/web/packages/amap/index.html
4. ADE-4: http://cran.r-project.org/web/packages/ade4/index.html
5. MADE4: http://www.bioconductor.org/packages/2.11/bioc/html/made4.html
6. XLMiner: http://www.solver.com/xlminer-data-mining
7. ViSta: http://www.visualstats.org; http://www.mdp.edu.ar/psicologia/vista/vista.htm
8. imDEV: http://sourceforge.net/projects/imdev
9. Statistics Toolbox: http://www.mathworks.com/matlabcentral/fileexchange/30792-pca-principal-component-analysis
10. Weka: http://www.cs.waikato.ac.nz/ml/weka/downloading.html
11. NAG Library: http://www.nag.com/downloads/cldownloads.asp

Table 3: Related work

FactoMineR (Le et al. (2009) [5]). Motivation: (i) providing a multivariate data analytic technique for applications in biological systems; (ii) to combine "omics" data structured into groups; (iii) to help with their functional interpretations. Advantages: (i) it provides a geometrical point of view and many graphical outputs; (ii) it can take into account a structure on the data; (iii) a GUI is available.

MADE4 (Culhane et al. [11]). Motivation: to provide a simple-to-use tool for multivariate analysis of microarray data. Advantages: (i) accepts a wide variety of gene-expression data input formats; (ii) no additional data processing is required.

Statistics Toolbox (The MathWorks [15]). Motivation: high-dimensional and complex microarray data need automatic computer-aided tools for analysis. Advantage: elegant matrix support and visualization.

imDEV (Grapov and Newman, 2012 [14]). Motivation: omics experiments generate complex high-dimensional data requiring multivariate analyses. Advantages: (i) user-friendly graphical interface; (ii) visualizations can be exported directly from the R plotting interface in a variety of file formats; (iii) dynamic loading of R objects between analysis sessions.

focused on how MineICA can be utilized to study an ICA-based decomposition. Pseudocode for this case study is as follows (a hedged R sketch of the same workflow follows the list):

(1) Load the library and the data.
(2) Create an IcaSet object:
    (2.1) load an example of expression data;
    (2.2) run ICA;
    (2.3) create a MineICAParams object (function buildMineICAParams);
    (2.4) create an IcaSet instance (function buildIcaSet);
    (2.5) IcaSet basics.
(3) Run the global analysis.
(4) Run the analysis by calling individual functions:
    (4.1) write a description of the contributing genes or features (function writeProjByComp);
    (4.2) plot heatmaps of the contributing elements (function plot_heatmapsOnSel);
    (4.3) gene enrichment analysis (function runEnrich);
    (4.4) association with sample variables;
    (4.5) clustering of the samples according to each component;
    (4.6) comparison of IcaSet objects (function runCompareIcaSets).
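As a rough guide, the steps above map onto MineICA calls as in the following R sketch. The function names are those listed in the pseudocode and the package vignette, but the argument lists are abbreviated and may differ between package versions, so treat the exact signatures as assumptions:

library(MineICA)

# (1)-(2) Build an IcaSet from an expression matrix `dat` (features x samples).
resJade <- runICA(X = dat, nbComp = 5, method = "JADE")
params <- buildMineICAParams(resPath = "mineica_results/")
icaSet <- buildIcaSet(params = params, A = data.frame(resJade$A),
                      S = data.frame(resJade$S), dat = dat)$icaSet

# (3) Global analysis in one call.
runAn(params = params, icaSet = icaSet)

# (4) Or call individual steps from the list above.
writeProjByComp(icaSet = icaSet, params = params)  # (4.1) contributing genes
runEnrich(icaSet = icaSet, params = params)        # (4.3) gene enrichment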

Figure 2 shows the correlation-based graph denoting the relationships between the independent components (ICs) obtained on four breast cancer microarray datasets. Every node represents an IC, and its color denotes the dataset of origin. The thickness of an edge represents the strength of the correlation between the linked ICs, and black edges link reciprocal nodes.

3.8. Summary of ICA Software. Tables 4 and 5 show the summary and sources of ICA software, respectively.

Figure 2: Correlation-based graph.

4. Software for Partial Least Squares (PLS)

The fundamental assumption of PLS is that the observed data are generated by a process driven by a small number of latent features. PLS therefore seeks uncorrelated linear transformations of the original predictor features that have high covariance with the response features. Based on these latent components, PLS predicts the response features $y$ (the regression task) and reconstructs the initial matrix $X$ (the data-modelling task) at the same time. The purpose of building components is to maximize the covariance between the response variable $y$ and the initial predictor variables $X$:

$$w_q = \arg\max_{w^{T}w = 1} \operatorname{Cov}(Xw, y), \qquad (6)$$

subject to the constraint $w_i^{T} S_x w_j = 0$ for all $1 \le i < j$. The crucial task of PLS is to obtain the vectors of optimal weights $w_i$ ($i = 1, \ldots, q$) that build a small number of components, whereas PCA is an "unsupervised" method that uses the $X$ data only. To build the components $[t_1, t_2, \ldots, t_q]$, PLS decomposes $X$ and $y$ to yield a bilinear representation of the data [64]:

$$X = t_1 w_1^{T} + t_2 w_2^{T} + \cdots + t_q w_q^{T} + e,$$
$$y = t_1 v_1^{T} + t_2 v_2^{T} + \cdots + t_q v_q^{T} + f, \qquad (7)$$

where the $w$'s are vectors of weights for building the PLS components $t = Xw$, the $v$'s are scalars, and $e$ and $f$ are the residuals. The idea of PLS is to estimate $w$ and $v$ by regression.
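To make the construction of the components concrete, the following R sketch fits a PLS regression with the CRAN package pls on synthetic data standing in for an expression matrix; scores() returns the components t_i and loading.weights() the weight vectors w_i of (6)-(7):

library(pls)

set.seed(1)
X <- matrix(rnorm(40 * 500), nrow = 40)  # 40 samples, 500 "genes"
y <- rnorm(40)                           # continuous response

fit <- plsr(y ~ X, ncomp = 3, validation = "CV")  # q = 3 latent components

t <- scores(fit)           # the components t_1, ..., t_3
w <- loading.weights(fit)  # the weight vectors w_1, ..., w_3

This is only an illustration of the decomposition; on real microarray data, X would be the (samples x genes) expression matrix.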

4.1. Partial Least Squares Discriminant Analysis. Barker and Rayens [24] developed a PLS for discriminant analysis, as the original PLS was not designed for discriminant purposes. PLS Discriminant Analysis finds a linear regression model by projecting the dependent features and the independent features to a new space; the fundamental relations can then be extracted from the latent variables. This method was implemented in the software Unscrambler, first developed by Martens and Naes [65]. Unscrambler is a commercial software product for multivariate data analysis, used for analysing large and complex datasets quickly and easily, and it also offers exceptional data visualization.

4.2. Least Squares Partial Least Squares. Jørgensen et al. [25] proposed an iterative combination of PLS and ordinary least squares to extract the relationship between the predictor variables and the responses. The method combines least squares estimates for the design variables with PLS regression on the spectra; the PLS scores are incorporated into the ordinary least squares equation on the spectra. The idea is to separate the information from the spectral and design matrices cleanly. The method can extract the information even when fewer components are used, is insensitive to the relative scaling of the spectra and the process, and is less biased than the individual PLS technique.

4.3. Powered Partial Least Squares Discriminant Analysis. Liland and Indahl [26] extended Powered PLS to Powered PLS Discriminant Analysis to improve the extraction of information in multivariate classification problems. The method constructs more efficient group separation and generates more interpretable outcomes than the ordinary PLS Discriminant Analysis technique. The features extracted by Powered PLS can contribute to revealing the relevance of particular predictors and often require fewer and simpler components than ordinary PLS. The optimization task is equivalent to maximizing the correlation between the transformed predictors and the groups, which makes it possible to discard the influence of less important predictors. The authors also made the method available as an R package.

4.4. Penalized Partial Least Squares. Kramer et al. [27] proposed combining the feature extraction technique PLS with a penalization framework, extending PLS regression with a penalization technique. Ordinary PLS is suited to regression problems, minimizing a quadratic loss function iteratively. The representation in terms of kernel matrices provides an intuitive geometric interpretation of the penalty term, which controls the roughness of the estimated functions. With the incorporation of penalization into this framework, the research direction became more promising: the method can extract relevant information for high-dimensional regression problems and for noisy data. Kramer and colleagues also made it available as an R package [66].
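A minimal sketch of penalized PLS with the ppls package [66] follows. We recall the main function as penalized.pls, with a penalty matrix P and a number of components ncomp, but the exact signature is an assumption and should be checked against the package manual:

library(ppls)

set.seed(1)
X <- matrix(rnorm(50 * 200), nrow = 50)
y <- rnorm(50)

# P = NULL means no penalty, i.e., ordinary PLS as a special case;
# a nontrivial P controls the roughness of the estimated coefficients.
fit <- penalized.pls(X, y, P = NULL, ncomp = 5)
str(fit$coefficients)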

Table 4: Summary of ICA software

1. FastICA. Marchini et al. [18]. Language: R and MATLAB. Features: the ICA algorithm is provided for implementing the analysis using ICA.
2. JADE. Nordhausen et al. [19]. Language: R. Features: (i) the JADE algorithm is provided for ICA; (ii) other BSS methods, such as AMUSE and SOBI, are offered.
3. HiPerSAT. Keith et al. [20]. Language: C++, MATLAB, and EEGLAB. Features: (i) integration of the FastICA, Infomax, and SOBI algorithms; (ii) data whitening is provided.
4. MineICA. Biton et al. [21]. Language: R. Features: (i) storage and visualization of ICA results; (ii) annotation of features.
5. Pearson ICA. Karnanen [22]. Language: R. Features: extraction of the independent components by minimizing mutual information from the Pearson system.
6. Maximum Likelihood ICA. Teschenforff [23]. Language: R. Features: implementation of the Maximum Likelihood and fixed-point algorithm into ICA.

Table 5: Sources of ICA software

1. FastICA: R: http://cran.r-project.org/web/packages/fastICA/index.html; MATLAB: http://research.ics.aalto.fi/ica/fastica/
2. JADE: http://cran.r-project.org/web/packages/JADE/index.html
3. HiPerSAT: http://nic.uoregon.edu/projects/hipersat/index.php
4. MineICA: http://www.bioconductor.org/packages/2.12/bioc/html/MineICA.html
5. Pearson ICA: http://cran.r-project.org/web/packages/PearsonICA/index.html
6. Maximum Likelihood ICA: http://cran.r-project.org/web/packages/mlica2/index.html
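As a pointer for the FastICA entry in Tables 4 and 5, its R interface is a single call; the sketch below uses random data in place of an expression matrix:

library(fastICA)

set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100)  # 100 samples, 20 features

ica <- fastICA(X, n.comp = 3)  # decomposes the (preprocessed) X as S %*% A
dim(ica$S)  # 100 x 3: the independent components
dim(ica$A)  # 3 x 20: the estimated mixing matrix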

4.5. SlimPLS. Gutkin et al. [33] proposed a PLS-based feature extraction method called SlimPLS. Ranking-based filters usually use a univariate method when selecting features. Such filters can produce reasonable feature sets, especially when the original features are uncorrelated; however, the chosen feature set will be suboptimal when the original features are dependent, since some features add little discriminative power on top of previously selected ones. SlimPLS is a multivariate feature extraction method that incorporates feature dependencies into the calculation: it combines a highly predictive feature with some less predictive but correlated features, because the added features provide more information on the behaviour of the samples.

4.6. Sparse Partial Least Squares Discriminant Analysis and Sparse Generalized Partial Least Squares. Chung and Keles [28] proposed two feature extraction approaches based on Sparse PLS: Sparse PLS Discriminant Analysis and Sparse Generalized PLS for high-dimensional datasets. Both improve ordinary PLS by performing feature extraction and dimension reduction simultaneously, and both perform well even with unbalanced class sample sizes. Sparse PLS Discriminant Analysis is computationally efficient because it only requires one run of Sparse PLS plus a classifier, while Sparse Generalized PLS extends Sparse PLS to the generalized linear model framework. Chung and Keles also made these methods available in an R package; a short usage sketch follows.
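In the spls package (the source listed in Table 7), splsda couples one run of Sparse PLS with a classifier, with K components and sparsity parameter eta; the argument names below follow the package manual as we recall it and should be treated as version dependent:

library(spls)

set.seed(1)
x <- matrix(rnorm(60 * 300), nrow = 60)       # 60 samples, 300 genes
y <- factor(rep(c("ALL", "AML"), c(40, 20)))  # unbalanced classes

fit <- splsda(x, y, K = 2, eta = 0.8, classifier = "logistic")
print(fit)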

4.7. Degrees of Freedom of Partial Least Squares. Kramer and Sugiyama [29] proposed an unbiased estimator of the degrees of freedom of PLS regression. The construction of latent components from the independent variables also depends on the dependent variable, and for PLS regression the optimal number of components must be determined first. One way of determining it is through the degrees of freedom, which measure the complexity of the fitted models; the degrees-of-freedom estimate can also be used to compare different regression methods. The two implementations of the degrees of freedom exploit the connection between PLS regression and methods from numerical linear algebra. The authors also developed an R package for this unbiased estimation of the degrees of freedom of PLS.

4.8. Surrogate Variable Analysis Partial Least Squares. Chakraborty and Datta [30] proposed a surrogate variable analysis method based on PLS. In differential gene expression analysis, one important issue is avoiding hidden confounders in the dataset; these hidden confounders of gene expression are caused by the samples' different environmental conditions, and the problem cannot simply be overcome by normalizing the gene expression data. The method extracts the informative features by identifying the hidden effects of the underlying latent factors using ordinary PLS and applying analysis of covariance (ANCOVA). ANCOVA is applied with the PLS signatures of these hidden effects as covariates in order to identify the genes that are truly differentially expressed. The authors also made the method available as an R package.

4.9. Partial Least Squares Path Modelling. Sanchez and Trinchera [31] developed an R package for Partial Least Squares Path Modelling (PLS-PM). PLS-PM was first introduced by Wold [67] and is also known as Structural Equation Modelling (SEM); it can be used as a composite-based alternative to factor-based SEM. PLS-PM can be used when the distributions are highly skewed, and it can estimate relationships between latent variables with several indicators even when the sample size is small. Basically, PLS-PM consists of two sets of linear equations: the inner model, which specifies the relations between latent variables, and the outer model, which specifies the relations between a latent variable and its observed indicators. PLS-PM is a multivariate feature extraction technique based on the cause-effect relationships of the unobserved and observed features.
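An illustrative PLS-PM sketch with the plspm package [31]: the lower-triangular path matrix encodes the inner model, blocks assigns observed indicators (columns of a hypothetical data frame mydata) to each latent variable, and mode "A" denotes reflective measurement. The model itself is made up for illustration:

library(plspm)

# Inner model: latent "Profile" influences latent "Outcome".
path <- rbind(Profile = c(0, 0),
              Outcome = c(1, 0))
colnames(path) <- rownames(path)

blocks <- list(Profile = 1:3, Outcome = 4:5)  # outer model
modes <- c("A", "A")

fit <- plspm(mydata, path, blocks, modes = modes)
summary(fit)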

4.10. Partial Least Squares Regression for Generalized Linear Models. Bertrand et al. [32] developed a software application of PLS regression for generalized linear models. Generalized linear models are important because they allow the response features to have a distribution other than the normal. Ordinary linear models can be viewed as the special case with an identity link: the distribution function is the normal distribution with constant variance, and the link function is the identity, which is the canonical link if the variance is known. Generalized linear models, however, preserve the predictive power of the features without assuming that the predicted means are normally distributed. PLS regression is used to extract the predictive features for the generalized linear models.
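A sketch with the plsRglm package [32]: the modele argument selects the GLM family (here logistic, for a binary response) and nt the number of PLS components. The call follows the package manual as we recall it, so version differences are possible:

library(plsRglm)

set.seed(1)
X <- matrix(rnorm(60 * 100), nrow = 60)
y <- rbinom(60, 1, 0.5)  # binary response, so a logistic link is natural

fit <- plsRglm(y, X, nt = 3, modele = "pls-glm-logistic")
fit$Coeffs  # PLS generalized linear regression coefficients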

4.11. Case Study. In this section, we discuss the R package svapls, whose main function is svpls. This function calls the fitModel function to fit a number of ANCOVA models (specified by pmax) to the data and selects the best model by the minimum value of Akaike's Information Criterion (AIC) [68]. The selected model is then used to predict the true pattern of the genes' differential expression. The command lines in R are as follows:

> ## Fitting the optimal ANCOVA model to the data:
> fit <- svpls(10, 10, hidden_fac.dat, pmax = 5, fdr = 0.05)
> ## The optimal ANCOVA model, its AIC value, and the
> ## positive genes detected from it are given by:
> fit$opt.model
> fit$AIC.opt
> fit$genes
> ## The corrected gene expression matrix, obtained after removing
> ## the effects of the hidden variability, and the adjusted p values:
> Y.corrected <- fit$Y.corr
> pval.adj <- fit$pvalues.adj

As an example, we study the efficacy of svapls on the preprocessed ALL/AML dataset [69]. The data consist of log-transformed expression levels of 7129 genes over two groups of patients: 47 reported to suffer from acute lymphoblastic leukemia (ALL) and 25 from acute myeloid leukemia (AML). Using the svpls function, we obtained the corrected expression matrix for the first 1000 genes. The distribution of samples from four sources in this matrix removes the extra effects owing to the batch-specific clustering reported in the initial data. In this setting, svapls performed comparably to ber, another popular R package for removing batch effects from microarray data, as shown in Figure 3.

4.12. Summary of PLS Software. Tables 6 and 7 show the summary and sources of PLS software, respectively. Table 8 shows related work on the discussed software.

5. Software for Local Linear Embedding (LLE)

The LLE algorithm is based on straightforward geometric intuitions. Assume that the given data comprise $N$ real-valued vectors $X_i$, each of dimensionality $D$, sampled from some underlying manifold. Provided there are sufficient data, every data point and its neighbors are expected to lie on or close to a locally linear patch of the manifold. These patches are characterized by linear coefficients that reconstruct each data point from its neighbors. Equation (8) is the cost function used to calculate reconstruction errors; it sums the squared distances between all the data points and their reconstructions. The weights $W_{ij}$ summarize the contribution of the $j$th data point to the $i$th reconstruction, and the optimal weights $W_{ij}$ are found by solving a least-squares problem [70]:

$$\epsilon(W) = \sum_{i=1}^{N} \Big| X_i - \sum_{j=1}^{K} W_{ij} X_j \Big|^2 = \sum_{i=1}^{N} \epsilon_i(W), \qquad (8)$$

$$\epsilon_i(W) = \Big| \sum_{j=1}^{K} W^i_j (x_i - x_j) \Big|^2 = \sum_{j=1}^{K} \sum_{m=1}^{K} W^i_j W^i_m Q^i_{jm}, \qquad (9)$$

$$Q^i_{jm} = (x_i - x_j)^{T} (x_i - x_m) = \frac{D_{ij} + D_{im} - D_{jm}}{2}. \qquad (10)$$
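Equations (8)-(10) translate directly into a few lines of R. The sketch below computes the reconstruction weights for a single point: it builds the local Gram matrix Q^i of (10), solves Q w = 1 for the least-squares weights, and normalizes them to sum to one; a small regularization term is added in case Q is singular, as LLE implementations commonly do. This is a self-contained illustration, not code from any of the reviewed packages:

# Reconstruction weights of point x_i from its K nearest neighbors.
lle_weights <- function(X, i, K = 5, reg = 1e-3) {
  d2 <- colSums((t(X) - X[i, ])^2)       # squared distances to all points
  nbrs <- order(d2)[2:(K + 1)]           # K nearest neighbors (skip x_i itself)
  Z <- sweep(X[nbrs, , drop = FALSE], 2, X[i, ])  # neighbors centered on x_i
  Q <- Z %*% t(Z)                        # local Gram matrix Q^i_{jm}, cf. (10)
  Q <- Q + reg * sum(diag(Q)) * diag(K)  # regularize a possibly singular Q
  w <- solve(Q, rep(1, K))               # least-squares solution of Q w = 1
  w / sum(w)                             # weights constrained to sum to one
}

set.seed(1)
X <- matrix(rnorm(30 * 4), nrow = 30)  # 30 points in D = 4 dimensions
lle_weights(X, i = 1, K = 5)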

5.1. lle. An R package, "lle", has been developed to implement LLE for feature extraction. It provides the LLE algorithm for transforming high-dimensional data into low-dimensional data.

Figure 3: (a, b, and c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data; the horizontal axis gives the sample source (DFCI, CALGB, CCG, St-Jude). (a) Heatmap for the first 1000 genes in the original Golub expression data. (b) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by use of the R package ber. (c) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by use of our R package svapls.

Table 6: A summary of PLS software

1. PLS Discriminant Analysis. Barker and Rayens [24]. Language: C/C++/Visual Basic. Features: PLS for discriminant analysis.
2. Least Squares PLS. Jørgensen et al. [25]. Language: R. Features: implementation combining PLS and ordinary least squares.
3. Powered PLS Discriminant Analysis. Liland and Indahl [26]. Language: R. Features: extraction of information for multivariate classification problems.
4. Penalized PLS. Kramer et al. (2008) [27]. Language: R. Features: extension of PLS regression using a penalization technique.
5. SlimPLS. Gutkin et al. [33]. Language: R. Features: multivariate feature extraction method which incorporates feature dependencies.
6. Sparse PLS Discriminant Analysis, Sparse Generalized PLS. Chung and Keles [28]. Language: R. Features: sparse techniques employing feature extraction and dimension reduction simultaneously.
7. PLS Degrees of Freedom. Kramer and Sugiyama [29]. Language: R. Features: unbiased estimation of the degrees of freedom for PLS regression.
8. Surrogate Variable Analysis PLS. Chakraborty and Datta [30]. Language: R. Features: extraction of the informative features with hidden confounders which are unaccounted for.
9. PLS Path Modelling. Sanchez and Trinchera [31]. Language: R. Features: a multivariate feature extraction technique based on the cause-effect relationships of the unobserved and observed features.
10. PLS Regression for Generalized Linear Models. Bertrand et al. (2013) [32]. Language: R. Features: PLS regression used to extract the predictive features from generalized linear models.


Table 7: Sources of PLS software

1. PLS Discriminant Analysis: http://www.camo.com/downloads/sample-data.html
2. Least Squares PLS: http://cran.r-project.org/web/packages/lspls/index.html
3. Powered PLS Discriminant Analysis: http://cran.r-project.org/web/packages/pls/index.html
4. Penalized PLS: http://cran.r-project.org/web/packages/ppls/index.html
5. SlimPLS: http://cran.r-project.org/web/packages/SlimPLS/index.html
6. Sparse PLS Discriminant Analysis, Sparse Generalized PLS: http://cran.r-project.org/web/packages/spls/index.html
7. Degrees of Freedom of PLS: http://cran.r-project.org/web/packages/plsdof/index.html
8. Surrogate Variable Analysis PLS: http://cran.r-project.org/web/packages/svapls/index.html
9. PLS Path Modelling: http://cran.r-project.org/web/packages/plspm/index.html
10. PLS Regression for Generalized Linear Models: http://cran.r-project.org/web/packages/plsRglm/index.html

Table 8: Related work

plsRglm (R package) (Bertrand et al. (2010) [32]). Motivation: (i) to deal with incomplete datasets using cross-validation; (ii) to extend PLS regression to generalized linear models. Advantages: (i) provides formula support; (ii) several new classes and their generics; (iii) custom GLR models and graphics to assess the bootstrap-based significance of the predictors.

SVA-PLS (Chakraborty and Datta [30]). Motivation: (i) to identify the genes that are differentially expressed between samples from two different tissue types; (ii) to identify the hidden effects of the underlying latent factors in a gene expression profiling study. Advantages: (i) relatively better at discovering a higher proportion of the truly significant genes; (ii) low error rate; (iii) high sensitivity and specificity.

SlimPLS (Gutkin et al. [33]). Motivation: to obtain a low-dimensional approximation of a matrix that is "as close as possible" to a given vector. Advantages: (i) focuses solely on feature selection; (ii) can be used as a preprocessing stage with different classifiers.

The newest version of this package is maintained by Diedrich and Abel [34]. Its main functions allow users to perform LLE and to plot the results. The implementation is based on the idea of Ridder and Duin [71]. Besides that, enhancements such as subset selection and calculation of the intrinsic dimension are offered. The package can be installed on Windows, Linux, and MacOS.
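A minimal usage sketch of the lle package: per its manual, m is the target dimension and k the neighborhood size, while the remaining arguments (such as the regularization switch reg used here) are version-dependent assumptions:

library(lle)

set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)  # 100 observations in 10 dimensions

res <- lle(X, m = 2, k = 5, reg = 2)  # embed into 2 dimensions
plot(res$Y, xlab = "Dim 1", ylab = "Dim 2", main = "LLE embedding")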

5.2. RDRToolbox. RDRToolbox is an R package developed for nonlinear dimension reduction with LLE and Isomap; it is maintained by Bartenhagen [35]. It transforms high-dimensional data into low-dimensional data using either LLE or Isomap, provides a plotting function for the results, and offers the Davis-Bouldin Index for cluster validation. It runs on Linux, MacOS, and Windows.

5.3. Scikit-Learn. Scikit-learn is software implemented in Python that integrates machine learning algorithms [36]. It is simple to use and covers classification, clustering, feature extraction, model selection, manifold learning, and other methods; Isomap, LLE, and Local Tangent Space Alignment (LTSA) are all provided. See http://scikit-learn.org/stable for further details. The software can be installed on a variety of platforms, such as Windows and Ubuntu.

5.4. Case Study. This section demonstrates the dimension reduction workflow for the publicly available Golub et al. leukemia dataset (see Figure 5). The data are available as an R package and can be downloaded and loaded via:

> source("http://bioconductor.org/biocLite.R")
> biocLite("golubEsets")
> library(golubEsets)
> data(Golub_Merge)

The dataset consists of 72 samples, divided into 47 ALL and 25 AML patients, with 7129 expression values each. In this example, we compute two-dimensional LLE and Isomap embeddings and plot the results. First, we extract the features and class labels:

> golubExprs = t(exprs(Golub_Merge))
> labels = pData(Golub_Merge)$ALL.AML
> dim(golubExprs)

Figure 4: Plot of dimension (2-10) versus residual variance.

Figure 5: Two-dimensional embedding of the Golub et al. [69] leukemia dataset, with ALL and AML samples marked (top: Isomap; bottom: LLE).

Table 9: A summary of LLE software

1. lle. Diedrich and Abel [34]. Language: R. Features: (i) the LLE algorithm for transforming high-dimensional data into low-dimensional data; (ii) subset selection and calculation of the intrinsic dimension.
2. RDRToolbox. Bartenhagen [35]. Language: R. Features: (i) LLE and Isomap for feature extraction; (ii) the Davis-Bouldin Index for cluster validation.
3. Scikit-learn. Pedregosa et al. [36]. Language: Python. Features: (i) classification, manifold learning, feature extraction, clustering, and other methods; (ii) LLE, Isomap, and LTSA.


Table 10: Sources of LLE software

1. lle: http://cran.r-project.org/web/packages/lle/index.html
2. RDRToolbox: http://www.bioconductor.org/packages/2.12/bioc/html/RDRToolbox.html
3. Scikit-learn: http://scikit-learn.org/dev/install.html

Table 11: Related work

RDRToolbox (Bartenhagen [35]). Motivation: (i) to reduce high-dimensional microarray data; (ii) to preserve most of the significant information and generate data with characteristics similar to the high-dimensional original. Advantages: (i) combines information from all features; (ii) suited for low-dimensional representations of the whole data.

Scikit-learn (Pedregosa et al. [36]). Motivation: to calculate activity index parameters through clustering. Advantages: (i) easy-to-use interface; (ii) can easily be integrated into applications outside the traditional range of statistical data analysis.

lle (Diedrich and Abel [34]). Motivation: currently available dimension reduction methods are either supervised, where data need to be labeled, or computationally complex. Advantages: (i) fast; (ii) purely unsupervised.

The residual variance of Isomap can be used to estimate the intrinsic dimension of the dataset:

> Isomap(data = golubExprs, dims = 1:10, plotResiduals = TRUE, k = 5)

Based on Figure 4, looking at the dimension at which the residual variance stops decreasing significantly, we can expect a low intrinsic dimension of two or three and, therefore, a visualization true to the structure of the original data. Next, we compute the LLE and Isomap embeddings for two target dimensions:

> golubIsomap = Isomap(data = golubExprs, dims = 2, k = 5)
> golubLLE = LLE(data = golubExprs, dim = 2, k = 5)

The Davis-Bouldin Index shows that the ALL and AML patients are well separated into two clusters:

> DBIndex(data = golubIsomap$dim2, labels = labels)
> DBIndex(data = golubLLE, labels = labels)

Finally, we use plotDR to plot the two-dimensional data:

> plotDR(data = golubIsomap$dim2, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "Isomap")
> plotDR(data = golubLLE, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "LLE")

Both visualizations, using either Isomap or LLE, show distinct clusters of ALL and AML patients, although the clusters overlap less in the Isomap embedding. This is consistent with the Davis-Bouldin Index, which is very low for both methods but slightly higher for LLE. A three-dimensional visualization can be generated in the same manner and is best analyzed interactively within R.

5.5. Summary of LLE Software. Tables 9 and 10 show the summary and sources of LLE software, respectively. Table 11 shows related work on the discussed software.

6. Conclusion

Numerous software applications have been developed to help users perform feature extraction on gene expression data. In this paper, we presented a comprehensive review of software for four feature extraction methods: PCA, ICA, PLS, and LLE. These software applications still have limitations in terms of both statistical capabilities and computational performance, so there remains a need for the development of better software.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Universiti Teknologi Malaysia for funding this research through Research University Grants (Grant nos. J130000262808J80 and J130000250705H50). The authors would also like to thank the Research Management Centre (RMC), Universiti Teknologi Malaysia, for supporting this research.

References

[1] S. Van Sanden, D. Lin, and T. Burzykowski, "Performance of gene selection and classification methods in a microarray setting: a simulation study," Communications in Statistics: Simulation and Computation, vol. 37, no. 1-2, pp. 409–424, 2008.
[2] Q. Liu, A. H. Sung, Z. Chen et al., "Gene selection and classification for cancer microarray data based on machine learning and similarity measures," BMC Genomics, vol. 12, supplement 5, article S1, 2011.
[3] M. Gheorghe and V. Mitrana, "A formal language-based approach in biology," Comparative and Functional Genomics, vol. 5, no. 1, pp. 91–94, 2004.
[4] P. G. Higgs and T. Attwood, "Bioinformatics and molecular evolution," Comparative and Functional Genomics, vol. 6, pp. 317–319, 2005.
[5] S. Le, J. Josse, and F. Husson, "FactoMineR: an R package for multivariate analysis," Journal of Statistical Software, vol. 25, no. 1, pp. 1–18, 2008.
[6] F. Hussen, J. Josse, S. Le, and J. Mazet, "Package 'FactoMineR'," 2013, http://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf.
[7] I. Hoffmann, "Principal Component Analysis with FactoMineR," 2010, http://www.statistik.tuwien.ac.at/public/filz/students/seminar/ws1011/hoffmann_ausarbeitung.pdf.
[8] D. Beaton, C. R. C. Fatt, and H. Abdi, "Package 'ExPosition'," 2013, http://cran.r-project.org/web/packages/ExPosition/ExPosition.pdf.
[9] A. Lucas, "Package 'amap'," 2013, http://cran.r-project.org/web/packages/amap/vignettes/amap.pdf.
[10] J. Thioulouse, D. Chessel, S. Doledec, and J.-M. Olivier, "ADE-4: a multivariate analysis and graphical display software," Statistics and Computing, vol. 7, no. 1, pp. 75–83, 1997.
[11] A. C. Culhane, J. Thioulouse, G. Perriere, and D. G. Higgins, "MADE4: an R package for multivariate analysis of gene expression data," Bioinformatics, vol. 21, no. 11, pp. 2789–2790, 2005.
[12] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier, 2nd edition, 2005.
[13] F. W. Young and C. M. Bann, "ViSta: a visual statistics system," in Statistical Computing Environments for Social Research, R. A. Stine and J. Fox, Eds., pp. 207–235, Sage, 1992.
[14] D. Grapov and J. W. Newman, "imDEV: a graphical user interface to R multivariate analysis tools in Microsoft Excel," Bioinformatics, vol. 28, no. 17, Article ID bts439, pp. 2288–2290, 2012.
[15] The MathWorks, Statistics Toolbox for Use with MATLAB: User's Guide, Version 4, 2003, http://www.pi.ingv.it/~longo/CorsoMatlab/OriginalManuals/stats.pdf.
[16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[17] NAG Toolbox for Matlab, g03aa, G03: Multivariate Methods, http://www.nag.com/numeric/MB/manual_22_1/pdf/G03/g03aa.pdf.
[18] J. L. Marchini, C. Heaton, and B. D. Ripley, "Package 'fastICA'," http://cran.r-project.org/web/packages/fastICA/fastICA.pdf.
[19] K. Nordhausen, J.-F. Cardoso, J. Miettinen, H. Oja, E. Ollila, and S. Taskinen, "Package 'JADE'," http://cran.r-project.org/web/packages/JADE/JADE.pdf.
[20] D. Keith, C. Hoge, R. Frank, and A. D. Malony, HiPerSAT Technical Report, 2005, http://nic.uoregon.edu/docs/reports/HiPerSATTechReport.pdf.
[21] A. Biton, A. Zinovyev, E. Barillot, and F. Radvanyi, "MineICA: independent component analysis of transcriptomic data," 2013, http://www.bioconductor.org/packages/2.13/bioc/vignettes/MineICA/inst/doc/MineICA.pdf.
[22] J. Karnanen, "Independent component analysis using score functions from the Pearson system," 2006, http://cran.r-project.org/web/packages/PearsonICA/PearsonICA.pdf.
[23] A. Teschenforff, Independent Component Analysis Using Maximum Likelihood, 2012, http://cran.r-project.org/web/packages/mlica2/mlica2.pdf.
[24] M. Barker and W. Rayens, "Partial least squares for discrimination," Journal of Chemometrics, vol. 17, no. 3, pp. 166–173, 2003.
[25] K. Jørgensen, V. Segtnan, K. Thyholt, and T. Næs, "A comparison of methods for analysing regression models with both spectral and designed variables," Journal of Chemometrics, vol. 18, no. 10, pp. 451–464, 2004.
[26] K. H. Liland and U. G. Indahl, "Powered partial least squares discriminant analysis," Journal of Chemometrics, vol. 23, no. 1, pp. 7–18, 2009.
[27] N. Kramer, A. Boulesteix, and G. Tutz, "Penalized Partial Least Squares with applications to B-spline transformations and functional data," Chemometrics and Intelligent Laboratory Systems, vol. 94, no. 1, pp. 60–69, 2008.
[28] D. Chung and S. Keles, "Sparse partial least squares classification for high dimensional data," Statistical Applications in Genetics and Molecular Biology, vol. 9, no. 1, 2010.
[29] N. Kramer and M. Sugiyama, "The degrees of freedom of partial least squares regression," Journal of the American Statistical Association, vol. 106, no. 494, pp. 697–705, 2011.
[30] S. Chakraborty and S. Datta, "Surrogate variable analysis using partial least squares (SVA-PLS) in gene expression studies," Bioinformatics, vol. 28, no. 6, pp. 799–806, 2012.
[31] G. Sanchez and L. Trinchera, Tools for Partial Least Squares Path Modeling, 2013, http://cran.r-project.org/web/packages/plspm/plspm.pdf.
[32] F. Bertrand, N. Meyer, and M. M. Bertrand, "Partial Least Squares Regression for generalized linear models," http://cran.r-project.org/web/packages/plsRglm/plsRglm.pdf.
[33] M. Gutkin, R. Shamir, and G. Dror, "SlimPLS: a method for feature selection in gene expression-based disease classification," PLoS ONE, vol. 4, no. 7, Article ID e6416, 2009.
[34] H. Diedrich and M. Abel, "Package 'lle'," http://cran.r-project.org/web/packages/lle/lle.pdf.
[35] C. Bartenhagen, "RDRToolbox: a package for nonlinear dimension reduction with Isomap and LLE," 2013, http://bioconductor.org/packages/2.13/bioc/vignettes/RDRToolbox/inst/doc/vignette.pdf.
[36] F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[37] S. Penel, "Package 'ade4'," 2013, http://cran.r-project.org/web/packages/ade4/ade4.pdf.
[38] S. Dray and A. Dufour, "The ade4 package: implementing the duality diagram for ecologists," Journal of Statistical Software, vol. 22, no. 4, pp. 1–20, 2007.
[39] K. Moorthy, M. S. Mohamad, S. Deris, and Z. Ibrahim, "Multivariate analysis of gene expression data and missing value imputation based on llsimpute algorithm," International Journal of Innovative Computing, Information and Control, vol. 6, no. 5, pp. 1335–1339, 2012.
[40] A. Culhane, "Package 'made4'," 2013, http://bioconductor.org/packages/release/bioc/manuals/made4/man/made4.pdf.
[41] C. V. Subbulakshmi, S. N. Deepa, and N. Malathi, "Comparative analysis of XLMiner and WEKA for pattern classification," in Proceedings of the IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT '12), pp. 453–457, Ramanathapuram, Tamil Nadu, India, August 2012.
[42] S. Jothi and S. Anita, "Data mining classification techniques applied for cancer disease: a case study using Xlminer," International Journal of Engineering Research & Technology, vol. 1, no. 8, 2012.
[43] T. Anh and S. Magi, Principal Component Analysis: Final Paper in Financial Pricing, National Cheng Kung University, 2009.
[44] L. Tierney, Lisp-Stat: An Object-Oriented Environment for Statistical Computing & Dynamic Graphics, Addison-Wesley, Reading, Mass, USA, 1990.
[45] F. W. Young and D. J. Lubinsky, "Guiding data analysis with visual statistical strategies," Journal of Computational and Graphical Statistics, vol. 4, pp. 229–250, 1995.
[46] F. W. Young and J. B. Smith, "Towards a structured data analysis environment: a cognition-based design," in Computing and Graphics in Statistics, A. Buja and P. A. Tukey, Eds., vol. 36, pp. 253–279, Springer, New York, NY, USA, 1991.
[47] F. W. Young, R. A. Faldowski, and M. M. McFarlane, "Multivariate statistical visualization," in Handbook of Statistics, C. R. Rao, Ed., pp. 958–998, 1993.
[48] M. McFarlane and F. W. Young, "Graphical sensitivity analysis for multidimensional scaling," Journal of Computational and Graphical Statistics, vol. 3, no. 1, pp. 23–34, 1994.
[49] P. M. Valero-Mora and R. D. Ledesma, "Using interactive graphics to teach multivariate data analysis to psychology students," Journal of Statistics Education, vol. 19, no. 1, 2011.
[50] E. Frank, M. Hall, G. Holmes et al., "Weka: a machine learning workbench for data mining," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 1269–1277, 2010.
[51] S. S. Prabhume and S. R. Sathe, "Reconstruction of a complete dataset from an incomplete dataset by PCA (principal component analysis) technique: some results," International Journal of Computer Science and Network Security, vol. 10, no. 12, pp. 195–199, 2010.
[52] J. Khan, J. S. Wei, M. Ringner et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.
[53] P. Comon, "Independent component analysis, a new concept?" Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[54] A. Hyvärinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.
[55] V. Zarzoso and P. Comon, "Comparative speed analysis of FastICA," in Independent Component Analysis and Signal Separation, M. E. Davies, C. J. James, S. A. Abdallah, and M. D. Plumbley, Eds., vol. 4666 of Lecture Notes in Computer Science, pp. 293–300, Springer, Berlin, Germany, 2007.
[56] J. F. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," IEE Proceedings, Part F: Radar and Signal Processing, vol. 140, no. 6, pp. 362–370, 1993.
[57] A. Belouchrani, K. Abed-Meraim, J. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434–444, 1997.
[58] L. Tong, V. C. Soon, Y. F. Huang, and R. Liu, "AMUSE: a new blind identification algorithm," in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1784–1787, May 1990.
[59] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Proceedings of the Advances in Neural Information Processing Systems Conference, pp. 757–763, 1996.
[60] A. Delorme and S. Makeig, "EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
[61] A. Biton, "Package 'MineICA'," 2013, http://www.bioconductor.org/packages/2.13/bioc/manuals/MineICA/man/MineICA.pdf.
[62] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.
[63] M. Schmidt, D. Bohm, C. von Torne et al., "The humoral immune system has a key prognostic impact in node-negative breast cancer," Cancer Research, vol. 68, no. 13, pp. 5405–5413, 2008.
[64] I. S. Helland, "On the structure of partial least squares regression," Communications in Statistics: Simulation and Computation, vol. 17, no. 2, pp. 581–607, 1988.
[65] H. Martens and T. Naes, Multivariate Calibration, John Wiley & Sons, London, UK, 1989.
[66] N. Kramer and A. Boulesteix, "Package 'ppls'," 2013, http://cran.r-project.org/web/packages/ppls/ppls.pdf.
[67] H. Wold, "Soft modeling: the basic design and some extensions," in Systems under Indirect Observations: Causality, Structure, Prediction, K. G. Joreskog and H. Wold, Eds., Part 2, pp. 1–54, North-Holland, Amsterdam, The Netherlands, 1982.
[68] H. Akaike, "Likelihood and the Bayes procedure," Trabajos de Estadística y de Investigación Operativa, vol. 31, no. 1, pp. 143–166, 1980.
[69] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[70] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[71] D. Ridder and R. P. W. Duin, Locally Linear Embedding, University of Technology, Delft, The Netherlands, 2002.

Page 7: A Review of Feature Extraction Software for Microarray ... › a58e › 818fcd408e... · A Review of Feature Extraction Software for Microarray Gene Expression Data ... When gene

BioMed Research International 7

22

22

5

4

1

3

3

1

5

4

3

1

5

44

1

3

5

Figure 2 Correlation-based graph

4 Software for Partial Least Squares (PLS)

The fundamental hypothesis of PLS is that the experimentalinformation is created by a framework ormethodologywhichis determined by a small number of latent characteristicsThusly PLS goes for discovering uncorrelated linear transfor-mation of the initial indicator characteristics which have highcovariance with the reaction characteristics In light of theselatent components PLS predicts reaction characteristics 119910the assignment of regression and reproduce initial matrix119883 the undertaking of data modelling in the meantime Thepurpose of building components in PS is to optimize thecovariance among the variable 119910 and the initial predictorvariables 119883

119908119902

= argmax (Cov (119883119908

119910))

119908119879

119908 = 1

(6)

Restricted to constraint 119908119879

119894119878119909119908119895

= 0 for all 1 le 119894 lt 119895The crucial assignment of PLS is to attain the vectors ofmaximum weights 119908

119894(119894 = 1 119902) to build a small number

of components while PCA is an ldquounsupervisedrdquo methodthat utilizes the 119883 data only To develop the components[1199051 1199052 119905

119902] PLS decomposes 119883 and 119910 to yield a bilinear

denotation of the data [64]

119883 = 1199051119908119879

1+ 1199052119908119879

2+ sdot sdot sdot + 119905

119902119908119879

119902+ 119890

119910 = 1199051V1198791

+ 1199052V1198792

+ sdot sdot sdot + 119905119870V119879119902

+ 119891

(7)

where 119908rsquos are vectors of weights for building the PLS compo-nents 119905 = 119883

119908 Vrsquos are scalars and 119890 and 119891 are the residuals

The concept of PLS is to assume 119908 and V by regression

41 Partial Least Squares Discriminant Analysis Barker andRayens [24] developed a PLS for discriminant analysis

However the original PLS was not designed for discriminantpurposes PLS Discriminant Analysis is used to find alinear regression model by projecting the dependent fea-tures and the independent features to a new space Thenthe fundamental relations can be extracted from the latentvariables This method was developed for software calledUnscrambler which was first developed by Martens andNaes [65] Unscrambler is a commercial software product formultivariate data analysis Unscrambler is used for analysinglarge and complex datasets quickly and easily using thepower of multivariate analysis Moreover this multivariatedata analysis also offers exceptional data visualization

42 Least Squares Partial Least Squares Joslashrgensen et al [25]proposed a method of using an iterative combination of PLSand ordinary least squares to extract the relationship betweenthe predictor variable and the responsesThismethod is basedon a combination of least squares estimates for the designvariables and PLS regression on the spectra The PLS scoreswere incorporated into the ordinary least squares equationon the spectra The idea is to separate the information fromthe spectral and design matrices in a nice way Howeverthis method is able to extract the information even whenfewer components are used In addition this method isinsensitive to the relative scaling of the spectra and theprocess Moreover this combination method is also lessbiased than the individual PLS technique

43 Powered Partial Least Squares Discriminant AnalysisLiland and Indahl [26] extended the Powered PLS to Pow-ered PLS Discriminant Analysis to overcome the extractionof information for the multivariate classification problemThis method can construct more efficient group separationand generate more interpretive outcomes than the ordinaryPartial Least Square Discriminant Analysis technique Thefeatures extracted by the Powered PLS can contribute torevealing the relevance of particular predictors and oftenrequires smaller and simpler components than ordinary PLSMoreover the optimization task is equivalent to maximizingthe correlation between the transformed predictors and thegroups This makes it possible to discard the influence of lessimportant predictorsThis method was also developed by theauthors for availability in an R package

44 Penalized Partial Least Squares Kramer et al [27] pro-posed a combination of the feature extraction technique PLSwith a penalization framework This method is an extensionof PLS regression using a penalization technique OrdinaryPLS is suited for regression problems by minimizing aquadratic loss function iteratively In addition the repre-sentation in terms of kernel matrices provides an intuitivegeometric interpretation of the penalty term The penaltyterms control the roughness of the estimated functions Withthe incorporation of penalization into this framework theresearch direction became more promising This method isused to extract relevant information for high-dimensionalregression problems and also for noisy dataThismethod wasalso developed by the Kramer and her colleagues colleagues[66] for availability in an R package

8 BioMed Research International

Table 4 Summary of ICA software

Number Software Authoryear Language Features

1 FastICA Marchini et al[18] R and MATLAB ICA algorithm is provided for implementing the analysis using

ICA

2 JADE Nordhausen etal [19] R (i) JADE algorithm is provided for ICA

(ii) Other BSS methods such as AMUSE and SOBI are offered

3 HiPerSAT Keith et al [20] C++ MATLABand EEGLAB

(i) Integration of FastICA Informax and SOBI algorithms(ii) Data whitening is provided

4 MineICA Biton et al [21] R (i) Storage and visualization of ICA results(ii) Annotation of features

5 Pearson ICA Karnanen [22] R Extraction of the independent components using the minimizationof mutual information from the Pearson system

6 MaximumLikelihood ICA

Teschenforff[23] R Implementation of the Maximum Likelihood and fixed-point

algorithm into ICA

Table 5 Sources of ICA software

Number Software Sources

1 FastICA R httpcranr-projectorgwebpackagesfastICAindexhtmlMATLAB httpresearchicsaaltofiicafastica

2 JADE httpcranr-projectorgwebpackagesJADEindexhtml3 HiPerSAT httpnicuoregoneduprojectshipersatindexphp4 MineICA httpwwwbioconductororgpackages212biochtmlMineICAhtml5 Pearson ICA httpcranr-projectorgwebpackagesPearsonICAindexhtml6 Maximum Likelihood ICA httpcranr-projectorgwebpackagesmlica2indexhtml

45 SlimPLS Gutkin et al [33] proposed a feature extractionmethod based on PLS called SlimPLS Ranking-based filtersusually utilize a univariate method when selecting featuresThe filter methods can produce reasonable feature setsespecially when the original feature sets are uncorrelatedHowever the chosen feature set will be suboptimal whenthe features of the original set are dependent Some ofthe features will add little discriminative power on top ofpreviously selected features SlimPLS is a multivariate featureextraction method which incorporates feature dependenciesinto calculation This multivariate property is constructedby combining the highly predictive feature with some lesspredictive but correlated features This is because the addedfeatures will provide more information on the behaviour ofthe samples

46 Sparse Partial Least Squares Discriminant Analysis andSparse Generalized Partial Least Squares Chung and Keles[28] proposed two extension feature extraction approachesbased on Sparse PLS These approaches are Sparse PLSDiscriminant Analysis and Sparse Generalized PLS for high-dimensional datasets These two approaches improved ordi-nary PLS by employing feature extraction and dimensionreduction simultaneously These two approaches performwell even with unbalanced sample sizes of the classes SparsePLS Discrimination Analysis is computationally efficientbecause it only requires computational time for one run ofSparse PLS and a classifier Moreover Sparse GeneralizedPLS extends Sparse PLS to the generalized linear model

framework These methods were also developed by Chungand Keles for availability in an R package

47 Degrees of Freedom of Partial Least Squares Kramer andSugiyama [29] proposed a method of unbiased estimationof the degrees of freedom for PLS regression The authorsstated that the construction of latent components fromthe independent variable also depended on the dependentvariable However for PLS regression the optimal number ofcomponents needs to be determined first One of the ways ofdetermining the optimal number of components is throughthe degrees of freedom for the complexity of fitted modelsMoreover the degrees of freedom estimate can be used for thecomparison of different regression methods Furthermorethe two implementations for the degrees of freedom utilizethe connection between PLS regression and numerical linearmethods from numerical linear The authors also developedan R package for this unbiased estimation of the degrees offreedom of PLS

48 Surrogate Variable Analysis Partial Least SquaresChakraborty and Datta [30] proposed a surrogate variableanalysismethod based on PLS In differential gene expressionanalysis one of the important issues is to avoid the hiddenconfounders in the dataset The hidden confounders ofgene expression are caused by different environmentalconditions of the samples However this problem cannot besimply overcome by modifying the gene expression data byusing a normalizing technique This method can extract the

BioMed Research International 9

informative features by identifying the hidden effects of theunderlying latent factors using ordinary PLS and applyinganalysis of covariance (ANCOVA) ANCOVA is appliedwith the PLS signatures of these hidden effects as covariatesin order to identify the genes that are truly differentiallyexpressed This method was also developed by the authorsfor availability in an R package

49 Partial Least Squares Path Modelling Sanchez andTrinchera [31] developed an R package for Partial LeastSquares Path Modelling (PLS-PM) PLS-PM was first intro-duced byWold [67] and is also known as Structural EquationModelling (SEM) It can be used as a composite-basedalternative to factor-based SEM PLS-PM can be used whenthe distributions are highly skewed Moreover PLS-PM canalso be used to estimate relationships between latent variableswith several indicators even though the sample size is smallBasically PLS-PM consists of two sets of linear equationsthe inner model and the outer model The inner modelspecifies the relations between latent variables while theouter model specifies the relations between a latent variableand its observed indicator PLS-PM is a multivariate featureextraction analysis technique based on the cause-effect rela-tionships of the unobserved and observed features

4.10. Partial Least Squares Regression for Generalized Linear Models. Bertrand et al. [32] developed a software application of PLS regression for generalized linear models. Generalized linear models are important because they allow the response features to have a distribution other than the normal. Ordinary linear models can be viewed as a special case of generalized linear models with an identity link: the distribution function is the normal distribution with constant variance, and the link function is the identity, which is the canonical link if the variance is known. Generalized linear models preserve the predictive power of the features without assuming that the predicted means are normally distributed. PLS regression is used to extract the predictive features for the generalized linear models.
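A short, hedged sketch on simulated data, assuming the plsRglm interface documented in the package manual (modele = "pls-glm-logistic" requests PLS under a logistic GLM):

library(plsRglm)

set.seed(1)
X <- as.data.frame(matrix(rnorm(60 * 15), nrow = 60))
y <- rbinom(60, 1, plogis(X[, 1] - X[, 2]))   # binary response

# PLS regression under a logistic GLM, extracting three latent components
fit <- plsRglm(dataY = y, dataX = X, nt = 3, modele = "pls-glm-logistic")
print(fit)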

4.11. Case Study. In this section, we discuss the R package svapls, whose main function is svpls. This function calls the fitModel function in order to fit a number of ANCOVA models (up to pmax) to the data and selects the best model according to the minimum value of Akaike's Information Criterion (AIC) [68]. Subsequently, this model is used to predict the true pattern of the genes' differential expression. The command lines in R are as follows:

> ## Fitting the optimal ANCOVA model to the data gives:
> fit <- svpls(10, 10, hidden_fac.dat, pmax = 5, fdr = 0.05)
> ## The optimal ANCOVA model, its AIC value, and the positive genes
> ## detected from it are given by:
> fit$opt.model
> fit$AIC.opt
> fit$genes
> ## The corrected gene expression matrix obtained after removing the
> ## effects of the hidden variability is given by:
> Y.corrected <- fit$Y.corr
> pval.adj <- fit$pvalues.adj

As an example, we study the efficacy of svapls on the preprocessed ALL/AML dataset [69]. These data consist of log-transformed expression levels of 7129 genes over two groups of patients: 47 patients with acute lymphoblastic leukemia (ALL) and 25 patients with acute myeloid leukemia (AML). Using the svpls function, we obtained the corrected expression matrix for the first 1000 genes. Randomly distributing the samples among the four sources in the abovementioned matrix removes the extra effects owing to the batch-specific clustering reported in the original data. In this setting, svapls performed as efficiently as ber, another popular R package for removing batch effects in microarray data, as shown in Figure 3.

4.12. Summary of PLS Software. Tables 6 and 7 show the summary and sources of PLS software, respectively. Table 8 shows the related works on the discussed software.

5. Software for Local Linear Embedding (LLE)

The LLE algorithm is based on straightforward geometric intuitions. Assume that the given data comprise $N$ real-valued vectors $X_i$, each of dimensionality $D$, sampled from some underlying manifold. Provided there are sufficient data, every data point and its neighbors are expected to lie on, or near to, a locally linear patch of the manifold. These patches are characterized by linear coefficients that reconstruct every data point from its neighbors. Equation (8) is the cost function used to calculate reconstruction errors; it sums the squared distances between all the data points and their reconstructions. The weights $W_{ij}$ summarize the contribution of the $j$th data point to the $i$th reconstruction, and the optimal weights $W_{ij}$ are found by solving a least-squares problem [70]:

$$\epsilon(W) = \sum_{i=1}^{N} \Bigg| X_i - \sum_{j=1}^{K} W_{ij} X_j \Bigg|^2 = \sum_{i=1}^{N} \epsilon_i(W), \qquad (8)$$

$$\epsilon_i(W) = \Bigg| \sum_{j=1}^{K} W_{ij} (x_i - x_j) \Bigg|^2 = \sum_{j=1}^{K} \sum_{m=1}^{K} W_{ij} W_{im} Q^{i}_{jm}, \qquad (9)$$

$$Q^{i}_{jm} = (x_i - x_j)^{T} (x_i - x_m) = \frac{D_{ij} + D_{im} - D_{jm}}{2}. \qquad (10)$$
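The following minimal R sketch (with illustrative function and variable names of ours, not taken from any of the reviewed packages) computes the reconstruction weights of equations (8)-(10) by solving the regularized local least-squares problem for each point:

# For each point, solve a regularized least-squares problem over its
# K nearest neighbors and normalize the weights to sum to one.
lleWeights <- function(X, K = 5, reg = 1e-3) {
  N <- nrow(X)
  W <- matrix(0, N, N)
  D <- as.matrix(dist(X))
  for (i in 1:N) {
    nbrs <- order(D[i, ])[2:(K + 1)]                 # K nearest neighbors of point i
    Z <- sweep(X[nbrs, , drop = FALSE], 2, X[i, ])   # x_j - x_i for each neighbor
    Q <- Z %*% t(Z)                                  # local Gram matrix Q^i of eq. (10)
    Q <- Q + reg * diag(K) * sum(diag(Q))            # regularize for numerical stability
    w <- solve(Q, rep(1, K))                         # minimize eq. (9) s.t. sum(w) = 1
    W[i, nbrs] <- w / sum(w)
  }
  W
}

set.seed(1)
X <- matrix(rnorm(30 * 3), ncol = 3)
W <- lleWeights(X, K = 5)
rowSums(W)[1:5]   # each row of weights sums to one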

5.1. lle. An R package "lle" has been developed to implement LLE for feature extraction. This package provides the LLE algorithm for transforming high-dimensional data into low-dimensional data.

Figure 3: (a, b, and c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data: (a) heatmap for the first 1000 genes in the original Golub expression data; (b) heatmap for the first 1000 genes in the adjusted Golub expression data obtained by use of the R package ber; (c) heatmap for the first 1000 genes in the adjusted Golub expression data obtained by use of the R package svapls. The x-axis of each panel shows the sample source (DFCI, CALGB, CCG, and St-Jude).

Table 6: A summary of PLS software.

Number | Software | Author/year | Language | Features
1 | PLS Discriminant Analysis | Barker and Rayens [24] | C/C++, Visual Basic | PLS for discriminant analysis
2 | Least Squares-PLS | Jørgensen et al. [25] | R | Implementation combining PLS and ordinary least squares
3 | Powered PLS Discriminant Analysis | Liland and Indahl [26] | R | Extraction of information for multivariate classification problems
4 | Penalized PLS | Kramer et al. [27] | R | Extension of PLS regression using a penalization technique
5 | SlimPLS | Gutkin et al. [33] | R | Multivariate feature extraction method which incorporates feature dependencies
6 | Sparse PLS Discriminant Analysis, Sparse Generalized PLS | Chung and Keles [28] | R | Sparse techniques employing feature extraction and dimension reduction simultaneously
7 | PLS Degrees of Freedom | Kramer and Sugiyama [29] | R | Unbiased estimation of the degrees of freedom for PLS regression
8 | Surrogate Variable Analysis PLS | Chakraborty and Datta [30] | R | Extraction of the informative features in the presence of hidden confounders which are unaccounted for
9 | PLS Path Modelling | Sanchez and Trinchera [31] | R | A multivariate feature extraction analysis technique based on the cause-effect relationships of the unobserved and observed features
10 | PLS Regression for Generalized Linear Models | Bertrand et al. [32] | R | PLS regression used to extract the predictive features for generalized linear models


Table 7: Sources of PLS software.

Number | Software | Source
1 | PLS Discriminant Analysis | http://www.camo.com/downloads/sample-data.html
2 | Least Squares-PLS | http://cran.r-project.org/web/packages/lspls/index.html
3 | Powered PLS Discriminant Analysis | http://cran.r-project.org/web/packages/pls/index.html
4 | Penalized PLS | http://cran.r-project.org/web/packages/ppls/index.html
5 | SlimPLS | http://cran.r-project.org/web/packages/SlimPLS/index.html
6 | Sparse PLS Discriminant Analysis, Sparse Generalized PLS | http://cran.r-project.org/web/packages/spls/index.html
7 | Degrees of Freedom of PLS | http://cran.r-project.org/web/packages/plsdof/index.html
8 | Surrogate Variable Analysis PLS | http://cran.r-project.org/web/packages/svapls/index.html
9 | PLS Path Modelling | http://cran.r-project.org/web/packages/plspm/index.html
10 | PLS Regression for Generalized Linear Models | http://cran.r-project.org/web/packages/plsRglm/index.html

Table 8: Related work.

Software | Author | Motivation | Advantage
plsRglm (R package) | Bertrand et al. [32] | (i) To deal with incomplete datasets using cross-validation; (ii) to extend PLS regression to generalized linear models | (i) Provides formula support; (ii) several new classes and their generics; (iii) custom GLR models and graphics to assess the bootstrap-based significance of the predictors
SVA-PLS | Chakraborty and Datta [30] | (i) To identify the genes that are differentially expressed between the samples from two different tissue types; (ii) to identify the hidden effects of the underlying latent factors in a gene expression profiling study | (i) Relatively better at discovering a higher proportion of the truly significant genes; (ii) low error rate; (iii) high sensitivity and specificity
SlimPLS | Gutkin et al. [33] | To obtain a low-dimensional approximation of a matrix that is "as close as possible" to a given vector | (i) Focuses solely on feature selection; (ii) can be used as a preprocessing stage with different classifiers

The newest version of this package is maintained by Diedrich and Abel [34]. The main functions of this package allow users to perform LLE and also to plot the results of LLE. The implementation of LLE is based on the idea of Ridder and Duin [71]. Besides that, some enhancements, such as selection of the subset and calculation of the intrinsic dimension, are offered. This package can be installed on Windows, Linux, and MacOS.
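A minimal usage sketch on simulated data, assuming the lle() interface documented in the package manual (id = TRUE additionally requests an intrinsic-dimension estimate):

library(lle)

set.seed(1)
X <- matrix(rnorm(100 * 10), ncol = 10)   # 100 points in 10 dimensions

# Embed into m = 2 dimensions using k = 5 neighbours.
res <- lle(X, m = 2, k = 5, id = TRUE)

head(res$Y)   # the two-dimensional coordinates
res$id        # estimated intrinsic dimension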

5.2. RDRToolbox. RDRToolbox is an R package developed for nonlinear dimension reduction with LLE and Isomap. The package is maintained by Bartenhagen [35]. It offers the transformation of high-dimensional data to low-dimensional data using either LLE or Isomap, and a plotting function is provided to plot the results. In addition, the Davies-Bouldin index is provided for the purpose of validating clusters. It runs on Linux, MacOS, and Windows.

5.3. Scikit-Learn. Scikit-learn is software implemented in Python that integrates a wide range of machine learning algorithms [36]. It is simple to use and covers classification, clustering, feature extraction, model selection, manifold learning, and other methods; Isomap, LLE, and Local Tangent Space Alignment (LTSA) are among the manifold learning methods provided. Please see http://scikit-learn.org/stable for further details. This software can be installed on a variety of platforms, such as Windows and Ubuntu.

5.4. Case Study. This section demonstrates the dimension reduction workflow for the publicly available Golub et al. leukemia dataset (see Figure 5). The data are available as an R package and can be downloaded and loaded via:

> source("http://bioconductor.org/biocLite.R")
> biocLite("golubEsets")
> library(golubEsets)
> data(Golub_Merge)

The dataset consists of 72 samples, divided into 47 ALL and 25 AML patients, with 7129 expression values each. In this example, we compute a two-dimensional LLE and Isomap embedding and plot the results. First, we extract the features and class labels:

> golubExprs = t(exprs(Golub_Merge))
> labels = pData(Golub_Merge)$ALL.AML
> dim(golubExprs)


Figure 4: Plot of dimension versus residual variance.

Figure 5: Two-dimensional embedding of the Golub et al. [69] leukemia dataset ((a) Isomap; (b) LLE).

Table 9: A summary of LLE software.

Number | Software | Author/year | Language | Features
1 | lle | Diedrich and Abel [34] | R | (i) LLE algorithm provided for transforming high-dimensional data into low-dimensional data; (ii) selection of the subset and calculation of the intrinsic dimension are provided
2 | RDRToolbox | Bartenhagen [35] | R | (i) LLE and Isomap for feature extraction; (ii) Davies-Bouldin index for the purpose of validating clusters
3 | Scikit-learn | Pedregosa et al. [36] | Python | (i) Classification, manifold learning, feature extraction, clustering, and other methods are offered; (ii) LLE, Isomap, and LTSA are provided


Table 10: Sources of LLE software.

Number | Software | Source
1 | lle | http://cran.r-project.org/web/packages/lle/index.html
2 | RDRToolbox | http://www.bioconductor.org/packages/2.12/bioc/html/RDRToolbox.html
3 | Scikit-learn | http://scikit-learn.org/dev/install.html

Table 11: Related work.

Software | Author | Motivation | Advantage
RDRToolbox | Bartenhagen [35] | (i) To reduce high-dimensional microarray data; (ii) to preserve most of the significant information and generate data with characteristics similar to the high-dimensional original | (i) Combines information from all features; (ii) suited for low-dimensional representations of the whole data
Scikit-learn | Pedregosa et al. [36] | To calculate activity index parameters through clustering | (i) Easy-to-use interface; (ii) can easily be integrated into applications outside the traditional range of statistical data analysis
lle | Diedrich and Abel [34] | Currently available dimension reduction methods are either supervised, where data need to be labeled, or computationally complex | (i) Fast; (ii) purely unsupervised

The residual variance of Isomap can be used to estimate the intrinsic dimension of the dataset:

> Isomap(data = golubExprs, dims = 1:10, plotResiduals = TRUE, k = 5)

Based on Figure 4, where the residual variances stop decreasing significantly after the first two or three dimensions, we can expect a low intrinsic dimension of two or three and, therefore, a visualization true to the structure of the original data. Next, we compute the LLE and Isomap embeddings for two target dimensions:

> golubIsomap = Isomap(data = golubExprs, dims = 2, k = 5)
> golubLLE = LLE(data = golubExprs, dim = 2, k = 5)

The Davies-Bouldin index shows that the ALL and AML patients are well separated into two clusters:

> DBIndex(data = golubIsomap$dim2, labels = labels)
> DBIndex(data = golubLLE, labels = labels)

Finally, we use plotDR to plot the two-dimensional data:

> plotDR(data = golubIsomap$dim2, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "Isomap")
> plotDR(data = golubLLE, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "LLE")

Both visualizations, using either Isomap or LLE, show distinct clusters of ALL and AML patients, although the clusters overlap less in the Isomap embedding. This is consistent with the Davies-Bouldin index, which is very low for both methods but slightly higher for LLE. A three-dimensional visualization can be generated in the same manner and is best analyzed interactively within R.
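A brief sketch of that three-dimensional step, assuming (as the RDRToolbox documentation suggests) that Isomap accepts dims = 3 and that plotDR can render three-dimensional embeddings; golubIsomap3 is an illustrative name of ours:

> golubIsomap3 = Isomap(data = golubExprs, dims = 3, k = 5)
> plotDR(data = golubIsomap3$dim3, labels = labels, legend = TRUE)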

5.5. Summary of LLE Software. Tables 9 and 10 show the summary and sources of LLE software, respectively. Table 11 shows the related works on the discussed software.

6. Conclusion

Nowadays, numerous software applications have been developed to help users implement feature extraction of gene expression data. In this paper, we present a comprehensive review of software for feature extraction methods, namely, PCA, ICA, PLS, and LLE. These software applications have some limitations in terms of statistical aspects as well as computational performance. In conclusion, there is a need for the development of better software.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Universiti Teknologi Malaysia for funding this research through Research University Grants (Grant nos. J130000262808J80 and J130000250705H50). The authors would also like to thank the Research Management Centre (RMC), Universiti Teknologi Malaysia, for supporting this research.

References

[1] S. Van Sanden, D. Lin, and T. Burzykowski, "Performance of gene selection and classification methods in a microarray setting: a simulation study," Communications in Statistics: Simulation and Computation, vol. 37, no. 1-2, pp. 409–424, 2008.
[2] Q. Liu, A. H. Sung, Z. Chen et al., "Gene selection and classification for cancer microarray data based on machine learning and similarity measures," BMC Genomics, vol. 12, supplement 5, article S1, 2011.
[3] M. Gheorghe and V. Mitrana, "A formal language-based approach in biology," Comparative and Functional Genomics, vol. 5, no. 1, pp. 91–94, 2004.
[4] P. G. Higgs and T. Attwood, "Bioinformatics and molecular evolution," Comparative and Functional Genomics, vol. 6, pp. 317–319, 2005.
[5] S. Le, J. Josse, and F. Husson, "FactoMineR: an R package for multivariate analysis," Journal of Statistical Software, vol. 25, no. 1, pp. 1–18, 2008.
[6] F. Husson, J. Josse, S. Le, and J. Mazet, "Package 'FactoMineR'," 2013, http://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf.
[7] I. Hoffmann, "Principal Component Analysis with FactoMineR," 2010, http://www.statistik.tuwien.ac.at/public/filz/students/seminar/ws1011/hoffmann_ausarbeitung.pdf.
[8] D. Beaton, C. R. C. Fatt, and H. Abdi, "Package 'ExPosition'," 2013, http://cran.r-project.org/web/packages/ExPosition/ExPosition.pdf.
[9] A. Lucas, "Package 'amap'," 2013, http://cran.r-project.org/web/packages/amap/vignettes/amap.pdf.
[10] J. Thioulouse, D. Chessel, S. Dolédec, and J.-M. Olivier, "ADE-4: a multivariate analysis and graphical display software," Statistics and Computing, vol. 7, no. 1, pp. 75–83, 1997.
[11] A. C. Culhane, J. Thioulouse, G. Perriere, and D. G. Higgins, "MADE4: an R package for multivariate analysis of gene expression data," Bioinformatics, vol. 21, no. 11, pp. 2789–2790, 2005.
[12] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier, 2nd edition, 2005.
[13] F. W. Young and C. M. Bann, "ViSta: a visual statistics system," in Statistical Computing Environments for Social Research, R. A. Stine and J. Fox, Eds., pp. 207–235, Sage, 1992.
[14] D. Grapov and J. W. Newman, "imDEV: a graphical user interface to R multivariate analysis tools in Microsoft Excel," Bioinformatics, vol. 28, no. 17, Article ID bts439, pp. 2288–2290, 2012.
[15] The MathWorks, Statistics Toolbox for Use with MATLAB: User's Guide, Version 4, 2003, http://www.pi.ingv.it/~longo/CorsoMatlab/OriginalManuals/stats.pdf.
[16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[17] NAG Toolbox for MATLAB, g03aa, G03: Multivariate Methods, http://www.nag.com/numeric/MB/manual_22_1/pdf/G03/g03aa.pdf.
[18] J. L. Marchini, C. Heaton, and B. D. Ripley, "Package 'fastICA'," http://cran.r-project.org/web/packages/fastICA/fastICA.pdf.
[19] K. Nordhausen, J.-F. Cardoso, J. Miettinen, H. Oja, E. Ollila, and S. Taskinen, "Package 'JADE'," http://cran.r-project.org/web/packages/JADE/JADE.pdf.
[20] D. Keith, C. Hoge, R. Frank, and A. D. Malony, HiPerSAT Technical Report, 2005, http://nic.uoregon.edu/docs/reports/HiPerSATTechReport.pdf.
[21] A. Biton, A. Zinovyev, E. Barillot, and F. Radvanyi, "MineICA: independent component analysis of transcriptomic data," 2013, http://www.bioconductor.org/packages/2.13/bioc/vignettes/MineICA/inst/doc/MineICA.pdf.
[22] J. Karvanen, "Independent component analysis using score functions from the Pearson system," 2006, http://cran.r-project.org/web/packages/PearsonICA/PearsonICA.pdf.
[23] A. Teschendorff, Independent Component Analysis Using Maximum Likelihood, 2012, http://cran.r-project.org/web/packages/mlica2/mlica2.pdf.
[24] M. Barker and W. Rayens, "Partial least squares for discrimination," Journal of Chemometrics, vol. 17, no. 3, pp. 166–173, 2003.
[25] K. Jørgensen, V. Segtnan, K. Thyholt, and T. Næs, "A comparison of methods for analysing regression models with both spectral and designed variables," Journal of Chemometrics, vol. 18, no. 10, pp. 451–464, 2004.
[26] K. H. Liland and U. G. Indahl, "Powered partial least squares discriminant analysis," Journal of Chemometrics, vol. 23, no. 1, pp. 7–18, 2009.
[27] N. Kramer, A. Boulesteix, and G. Tutz, "Penalized partial least squares with applications to B-spline transformations and functional data," Chemometrics and Intelligent Laboratory Systems, vol. 94, no. 1, pp. 60–69, 2008.
[28] D. Chung and S. Keles, "Sparse partial least squares classification for high dimensional data," Statistical Applications in Genetics and Molecular Biology, vol. 9, no. 1, 2010.
[29] N. Kramer and M. Sugiyama, "The degrees of freedom of partial least squares regression," Journal of the American Statistical Association, vol. 106, no. 494, pp. 697–705, 2011.
[30] S. Chakraborty and S. Datta, "Surrogate variable analysis using partial least squares (SVA-PLS) in gene expression studies," Bioinformatics, vol. 28, no. 6, pp. 799–806, 2012.
[31] G. Sanchez and L. Trinchera, Tools for Partial Least Squares Path Modeling, 2013, http://cran.r-project.org/web/packages/plspm/plspm.pdf.
[32] F. Bertrand, N. Meyer, and M. M. Bertrand, "Partial least squares regression for generalized linear models," http://cran.r-project.org/web/packages/plsRglm/plsRglm.pdf.
[33] M. Gutkin, R. Shamir, and G. Dror, "SlimPLS: a method for feature selection in gene expression-based disease classification," PLoS ONE, vol. 4, no. 7, Article ID e6416, 2009.
[34] H. Diedrich and M. Abel, "Package 'lle'," http://cran.r-project.org/web/packages/lle/lle.pdf.
[35] C. Bartenhagen, "RDRToolbox: a package for nonlinear dimension reduction with Isomap and LLE," 2013, http://bioconductor.org/packages/2.13/bioc/vignettes/RDRToolbox/inst/doc/vignette.pdf.
[36] F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[37] S. Penel, "Package 'ade4'," 2013, http://cran.r-project.org/web/packages/ade4/ade4.pdf.
[38] S. Dray and A. Dufour, "The ade4 package: implementing the duality diagram for ecologists," Journal of Statistical Software, vol. 22, no. 4, pp. 1–20, 2007.
[39] K. Moorthy, M. S. Mohamad, S. Deris, and Z. Ibrahim, "Multivariate analysis of gene expression data and missing value imputation based on llsimpute algorithm," International Journal of Innovative Computing, Information and Control, vol. 6, no. 5, pp. 1335–1339, 2012.
[40] A. Culhane, "Package 'made4'," 2013, http://bioconductor.org/packages/release/bioc/manuals/made4/man/made4.pdf.
[41] C. V. Subbulakshmi, S. N. Deepa, and N. Malathi, "Comparative analysis of XLMiner and WEKA for pattern classification," in Proceedings of the IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT '12), pp. 453–457, Ramanathapuram, Tamil Nadu, India, August 2012.
[42] S. Jothi and S. Anita, "Data mining classification techniques applied for cancer disease: a case study using XLMiner," International Journal of Engineering Research & Technology, vol. 1, no. 8, 2012.
[43] T. Anh and S. Magi, Principal Component Analysis: Final Paper in Financial Pricing, National Cheng Kung University, 2009.
[44] L. Tierney, Lisp-Stat: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics, Addison-Wesley, Reading, Mass, USA, 1990.
[45] F. W. Young and D. J. Lubinsky, "Guiding data analysis with visual statistical strategies," Journal of Computational and Graphical Statistics, vol. 4, pp. 229–250, 1995.
[46] F. W. Young and J. B. Smith, "Towards a structured data analysis environment: a cognition-based design," in Computing and Graphics in Statistics, A. Buja and P. A. Tukey, Eds., vol. 36, pp. 253–279, Springer, New York, NY, USA, 1991.
[47] F. W. Young, R. A. Faldowski, and M. M. McFarlane, "Multivariate statistical visualization," in Handbook of Statistics, C. R. Rao, Ed., pp. 958–998, 1993.
[48] M. McFarlane and F. W. Young, "Graphical sensitivity analysis for multidimensional scaling," Journal of Computational and Graphical Statistics, vol. 3, no. 1, pp. 23–34, 1994.
[49] P. M. Valero-Mora and R. D. Ledesma, "Using interactive graphics to teach multivariate data analysis to psychology students," Journal of Statistics Education, vol. 19, no. 1, 2011.
[50] E. Frank, M. Hall, G. Holmes et al., "Weka: a machine learning workbench for data mining," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 1269–1277, 2010.
[51] S. S. Prabhume and S. R. Sathe, "Reconstruction of a complete dataset from an incomplete dataset by PCA (principal component analysis) technique: some results," International Journal of Computer Science and Network Security, vol. 10, no. 12, pp. 195–199, 2010.
[52] J. Khan, J. S. Wei, M. Ringner et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.
[53] P. Comon, "Independent component analysis, a new concept?" Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[54] A. Hyvärinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.
[55] V. Zarzoso and P. Comon, "Comparative speed analysis of FastICA," in Independent Component Analysis and Signal Separation, M. E. Davies, C. J. James, S. A. Abdallah, and M. D. Plumbley, Eds., vol. 4666 of Lecture Notes in Computer Science, pp. 293–300, Springer, Berlin, Germany, 2007.
[56] J. F. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," IEE Proceedings Part F: Radar and Signal Processing, vol. 140, no. 6, pp. 362–370, 1993.
[57] A. Belouchrani, K. Abed-Meraim, J. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434–444, 1997.
[58] L. Tong, V. C. Soon, Y. F. Huang, and R. Liu, "AMUSE: a new blind identification algorithm," in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1784–1787, May 1990.
[59] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Proceedings of the Advances in Neural Information Processing Systems Conference, pp. 757–763, 1996.
[60] A. Delorme and S. Makeig, "EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
[61] A. Biton, "Package 'MineICA'," 2013, http://www.bioconductor.org/packages/2.13/bioc/manuals/MineICA/man/MineICA.pdf.
[62] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.
[63] M. Schmidt, D. Bohm, C. von Torne et al., "The humoral immune system has a key prognostic impact in node-negative breast cancer," Cancer Research, vol. 68, no. 13, pp. 5405–5413, 2008.
[64] I. S. Helland, "On the structure of partial least squares regression," Communications in Statistics: Simulation and Computation, vol. 17, no. 2, pp. 581–607, 1988.
[65] H. Martens and T. Næs, Multivariate Calibration, John Wiley & Sons, London, UK, 1989.
[66] N. Kramer and A. Boulesteix, "Package 'ppls'," 2013, http://cran.r-project.org/web/packages/ppls/ppls.pdf.
[67] H. Wold, "Soft modeling: the basic design and some extensions," in Systems under Indirect Observations: Causality, Structure, Prediction, K. G. Joreskog and H. Wold, Eds., part 2, pp. 1–54, North-Holland, Amsterdam, The Netherlands, 1982.
[68] H. Akaike, "Likelihood and the Bayes procedure," Trabajos de Estadística y de Investigación Operativa, vol. 31, no. 1, pp. 143–166, 1980.
[69] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[70] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[71] D. Ridder and R. P. W. Duin, Locally Linear Embedding, Delft University of Technology, Delft, The Netherlands, 2002.


[56] J F Cardoso and A Souloumiac ldquoBlind beamforming for non-Gaussian signalsrdquo IEE Proceedings Part F Radar and SignalProcessing vol 140 no 6 pp 362ndash370 1993

[57] A Belouchrani K Abed-Meraim J Cardoso and E MoulinesldquoA blind source separation technique using second-order statis-ticsrdquo IEEE Transactions on Signal Processing vol 45 no 2 pp434ndash444 1997

[58] L Tong V C Soon Y F Huang and R Liu ldquoAMUSE anew blind identification algorithmrdquo in Proceedings of the IEEEInternational Symposium onCircuits and Systems pp 1784ndash1787May 1990

[59] S Amari A Cichocki and H H Yang ldquoA new learningalgorithm for blind signal separationrdquo in Proceedings of theAdvances in Neural Information Processing Systems Conferencepp 757ndash763 1996

[60] A Delorme and S Makeig ldquoEEGLAB an open source toolboxfor analysis of single-trial EEG dynamics including indepen-dent component analysisrdquo Journal of NeuroscienceMethods vol134 no 1 pp 9ndash21 2004

[61] A Biton ldquoPackage lsquoMineICArsquordquo 2013 httpwwwbioconductororgpackages213biocmanualsMineICAmanMineICApdf

[62] A Hyvaerinen J Karhunen and E Oja Independent Compo-nent Analysis John Wiley amp Sons New York NY USA 2001

[63] M Schmidt D Bohm C von Torne et al ldquoThe humoralimmune system has a key prognostic impact in node-negativebreast cancerrdquo Cancer Research vol 68 no 13 pp 5405ndash54132008

[64] I S Helland ldquoOn the structure of partial least squares regres-sionrdquo Communications in Statistics Simulation and Computa-tion vol 17 no 2 pp 581ndash607 1988

[65] H Martens and T NaesMultivariate calibration JohnWiley ampSons London UK 1989

[66] N Kramer and A Boulesteix ldquoPackage ldquopplsrdquordquo 2013 httpcranrprojectorgwebpackagespplspplspdf

[67] HWold ldquoSoftmodeling the basic design and some extensionsrdquoin Systems under Indirect Observations Causality StructurePrediction K G Joreskog and H Wold Eds Part 2 pp 1ndash54North-Holland Amsterdam The Netherlands 1982

[68] H Akaike ldquoLikelihood and the Bayes procedurerdquo Trabajos deEstadıstica y de Investigacion Operativa vol 31 no 1 pp 143ndash166 1980

[69] T R Golub D K Slonim P Tamayo et al ldquoMolecularclassification of cancer class discovery and class prediction bygene expressionmonitoringrdquo Science vol 286 no 5439 pp 531ndash537 1999

[70] S T Roweis and L K Saul ldquoNonlinear dimensionality reduc-tion linear embeddingrdquo Science vol 290 no 5500 pp 2323ndash2326 2000

[71] D Ridder and R P W Duin Locally Linear Embedding Uni-versity of Technology Delft The Netherlands 2002

Page 9: A Review of Feature Extraction Software for Microarray ... › a58e › 818fcd408e... · A Review of Feature Extraction Software for Microarray Gene Expression Data ... When gene

BioMed Research International 9

informative features by identifying the hidden effects of the underlying latent factors, using ordinary PLS and applying analysis of covariance (ANCOVA). ANCOVA is applied with the PLS signatures of these hidden effects as covariates in order to identify the genes that are truly differentially expressed. The method is also available as an R package developed by the authors.

4.9. Partial Least Squares Path Modelling. Sanchez and Trinchera [31] developed an R package for Partial Least Squares Path Modelling (PLS-PM). PLS-PM was first introduced by Wold [67] and is also known as Structural Equation Modelling (SEM); more precisely, it can be used as a composite-based alternative to factor-based SEM. PLS-PM can be used when the distributions are highly skewed, and it can estimate relationships between latent variables with several indicators even when the sample size is small. Basically, PLS-PM consists of two sets of linear equations: the inner model, which specifies the relations between latent variables, and the outer model, which specifies the relations between a latent variable and its observed indicators. PLS-PM is therefore a multivariate feature extraction technique based on the cause-effect relationships between the unobserved and observed features; a minimal usage sketch is given below.
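As a brief, hedged illustration (the simulated data, the two-latent-variable model, and the parameter choices below are our own assumptions, not an example from [31]), a PLS-PM analysis with the plspm package specifies the inner model as a lower-triangular matrix and the outer model as a list of indicator blocks:

# Illustrative sketch of PLS-PM with the plspm package (toy data).
library(plspm)

set.seed(1)
dat <- as.data.frame(matrix(rnorm(100 * 6), ncol = 6))

# Inner model: lower-triangular matrix of latent-variable relations
# (here, LV2 depends on LV1).
path <- rbind(LV1 = c(0, 0),
              LV2 = c(1, 0))
colnames(path) <- rownames(path)

# Outer model: which observed indicators measure each latent variable.
blocks <- list(LV1 = 1:3, LV2 = 4:6)

fit <- plspm(dat, path, blocks, modes = c("A", "A"))
fit$path_coefs   # estimated inner-model coefficients
fit$scores       # latent variable scores (the extracted features)

The latent variable scores are the extracted features; mode "A" treats each block as reflective, that is, each indicator is modelled as a reflection of its latent variable.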

4.10. Partial Least Squares Regression for Generalized Linear Models. Bertrand et al. [32] developed a software application of PLS regression for generalized linear models. Generalized linear models are important because they allow the response features to have a distribution other than the normal. An ordinary linear model can be viewed as the special case of a generalized linear model in which the distribution function is the normal distribution with constant variance and the link function is the identity, which is the canonical link when the variance is known. Generalized linear models, however, preserve the predictive power of the features without assuming that the predicted means are normally distributed. In this software, PLS regression is used to extract the predictive features for the generalized linear models, as sketched below.
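The following sketch shows how such a model might be fitted with the plsRglm package on simulated data; the binary outcome, the choice of three components, and the logistic model are illustrative assumptions, and the component names follow the package manual [32]:

# Illustrative sketch of PLS regression for a GLM with plsRglm.
library(plsRglm)

set.seed(1)
X <- matrix(rnorm(60 * 50), nrow = 60)        # 60 samples, 50 "genes"
y <- rbinom(60, 1, plogis(X[, 1] - X[, 2]))   # simulated binary response

# Fit a PLS-GLM with a logistic link, extracting 3 latent components.
fit <- plsRglm(y, X, nt = 3, modele = "pls-glm-logistic")
fit$tt            # extracted PLS components (the new features)
fit$Coeffs        # coefficients on the original predictors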

4.11. Case Study. In this section we illustrate the R package svapls, whose main function is svpls. This function calls the fitModel function in order to fit a number of ANCOVA models, specified by pmax, to the data and selects the best model by the minimum value of Akaike's Information Criterion (AIC) [68]. Subsequently, this model is used to predict the true pattern of differential expression of the genes. The command lines in R are as follows:

> # Fitting the optimal ANCOVA model to the data gives:
> fit <- svpls(10, 10, hidden_fac.dat, pmax = 5, fdr = 0.05)
> # The optimal ANCOVA model, its AIC value, and the positive genes
> # detected from it are given by:
> fit$opt.model
> fit$AIC.opt
> fit$genes
> # The corrected gene expression matrix obtained after removing the
> # effects of the hidden variability is given by:
> Y.corrected <- fit$Y.corr
> pval.adj <- fit$pvalues.adj

For instance, we study the efficacy of svapls on the preprocessed ALL/AML dataset [69]. These data consist of log-transformed expression levels of 7129 genes over two groups of patients: 47 patients reported to suffer from Acute Lymphoblastic Leukemia (ALL) and 25 patients from Acute Myeloid Leukemia (AML). Using the svpls function, we obtained the corrected expression matrix for the first 1000 genes. The random distribution of samples from the four sources in this matrix removes the extra effects owing to the batch-specific clustering reported in the initial data. In this setting, svapls performed about as efficiently as another popular R package, ber, for removing batch effects in microarray data, as shown in Figure 3.

4.12. Summary of PLS Software. Tables 6 and 7 show the summary and sources of the PLS software, respectively. Table 8 shows related work on the discussed software.

5. Software for Local Linear Embedding (LLE)

The LLE algorithm is based on straightforward geometric intuitions. Assume that the given data comprise $N$ real-valued vectors $X_i$, each of dimensionality $D$, sampled from some underlying manifold. Provided there is sufficient data, each data point and its neighbours are expected to lie on or close to a locally linear patch of the manifold. These patches are characterized by linear coefficients that reconstruct each data point from its neighbours. Equation (8) is the cost function used to measure the reconstruction error; it sums the squared distances between all the data points and their reconstructions. The weights $W_{ij}$ summarize the contribution of the $j$th data point to the $i$th reconstruction. The optimal weights $W_{ij}$ are found by solving a least-squares problem [70]:

$$\epsilon(W) = \sum_{i=1}^{N} \Bigl| X_i - \sum_{j=1}^{K} W_{ij} X_j \Bigr|^2 = \sum_{i=1}^{N} \epsilon_i(W), \qquad (8)$$

$$\epsilon_i(W) = \Bigl| \sum_{j=1}^{K} W_{ij} (x_i - x_j) \Bigr|^2 = \sum_{j=1}^{K} \sum_{m=1}^{K} W_{ij} W_{im} Q^{i}_{jm}, \qquad (9)$$

$$Q^{i}_{jm} = (x_i - x_j)^{T} (x_i - x_m) = \frac{D_{ij} + D_{im} - D_{jm}}{2}. \qquad (10)$$
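Because the cost in (9) is minimized subject to the constraint that the weights for each point sum to one, the optimal weights follow in closed form from the local Gram matrix in (10): solve $Q^i w = 1$ and rescale $w$ to sum to one. The following R sketch (our illustration, not code from any reviewed package) computes the reconstruction weights for a single point:

# Illustrative sketch: reconstruction weights for one data point x
# from its K nearest neighbours, following equations (8)-(10).
lle_weights <- function(x, neighbours, reg = 1e-3) {
  # Rows of 'neighbours' are the K neighbouring points x_j.
  G <- sweep(neighbours, 2, x)                 # x_j - x (sign cancels in Q)
  Q <- G %*% t(G)                              # local Gram matrix of eq. (10)
  Q <- Q + reg * diag(nrow(Q)) * sum(diag(Q))  # regularize in case K > D
  w <- solve(Q, rep(1, nrow(Q)))               # solve Q w = 1
  w / sum(w)                                   # enforce sum-to-one constraint
}

# Example: weights reconstructing x from 4 neighbours in 3 dimensions.
set.seed(1)
x  <- rnorm(3)
nb <- matrix(rnorm(12), nrow = 4)
lle_weights(x, nb)

When the number of neighbours $K$ exceeds the dimensionality $D$, the Gram matrix is singular, so a common remedy (used above) is to add a small regularization term to its diagonal before solving.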

5.1. lle. An R package, "lle", has been developed in order to implement LLE for feature extraction. This package provides the LLE algorithm for transforming high-dimensional data into low-dimensional data.



Figure 3: (a, b, and c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data. (a) Heatmap for the first 1000 genes in the original Golub expression data. (b) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by use of the R package ber. (c) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by use of the R package svapls.

Table 6: A summary of PLS software.

Number | Software | Author/year | Language | Features
1 | PLS Discriminant Analysis | Barker and Rayens [24] | C/C++, Visual Basic | PLS for discriminant analysis
2 | Least Squares–PLS | Jørgensen et al. [25] | R | Implementation combining PLS and ordinary least squares
3 | Powered PLS Discriminant Analysis | Liland and Indahl [26] | R | Extraction of information for multivariate classification problems
4 | Penalized PLS | Krämer et al. [27] | R | Extension of PLS regression using a penalization technique
5 | SlimPLS | Gutkin et al. [33] | R | Multivariate feature extraction method which incorporates feature dependencies
6 | Sparse PLS Discriminant Analysis / Sparse Generalized PLS | Chung and Keles [28] | R | Sparse versions employing feature extraction and dimension reduction simultaneously
7 | PLS Degrees of Freedom | Krämer and Sugiyama [29] | R | Uses an unbiased estimate of the degrees of freedom for PLS regression
8 | Surrogate Variable Analysis PLS | Chakraborty and Datta [30] | R | Extraction of informative features under hidden confounders which are otherwise unaccounted for
9 | PLS Path Modelling | Sanchez and Trinchera [31] | R | Multivariate feature extraction based on the cause-effect relationships of the unobserved and observed features
10 | PLS Regression for Generalized Linear Models | Bertrand et al. [32] | R | PLS regression used to extract the predictive features for generalized linear models


Table 7: Sources of PLS software.

Number | Software | Source
1 | PLS Discriminant Analysis | http://www.camo.com/downloads/sample-data.html
2 | Least Squares–PLS | http://cran.r-project.org/web/packages/lspls/index.html
3 | Powered PLS Discriminant Analysis | http://cran.r-project.org/web/packages/pls/index.html
4 | Penalized PLS | http://cran.r-project.org/web/packages/ppls/index.html
5 | SlimPLS | http://cran.r-project.org/web/packages/SlimPLS/index.html
6 | Sparse PLS Discriminant Analysis / Sparse Generalized PLS | http://cran.r-project.org/web/packages/spls/index.html
7 | Degrees of Freedom of PLS | http://cran.r-project.org/web/packages/plsdof/index.html
8 | Surrogate Variable Analysis PLS | http://cran.r-project.org/web/packages/svapls/index.html
9 | PLS Path Modelling | http://cran.r-project.org/web/packages/plspm/index.html
10 | PLS Regression for Generalized Linear Models | http://cran.r-project.org/web/packages/plsRglm/index.html

Table 8: Related work.

Software | Author | Motivation | Advantage
plsRglm (R package) | Bertrand et al. (2010) [32] | (i) To deal with incomplete datasets using cross-validation; (ii) to extend PLS regression to generalized linear models | (i) Provides formula support; (ii) several new classes and their generics; (iii) custom GLR models and graphics to assess the bootstrap-based significance of the predictors
SVA-PLS | Chakraborty and Datta [30] | (i) To identify the genes that are differentially expressed between samples from two different tissue types; (ii) to identify the hidden effects of the underlying latent factors in a gene expression profiling study | (i) Relatively better at discovering a higher proportion of the truly significant genes; (ii) low error rate; (iii) high sensitivity and specificity
SlimPLS | Gutkin et al. [33] | To obtain a low-dimensional approximation of a matrix that is "as close as possible" to a given vector | (i) Focuses solely on feature selection; (ii) can be used as a preprocessing stage with different classifiers

The newest version of this package is maintained by Diedrich and Abel [34]. The main functions of this package allow users to perform LLE and also to plot its results. The implementation of LLE is based on the idea of Ridder and Duin [71]. Besides that, some enhancements, such as subset selection and calculation of the intrinsic dimension, are offered. This package can be installed on Windows, Linux, and MacOS; a brief usage sketch follows.
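A minimal usage sketch on simulated data (the matrix dimensions and the parameter values m, k, and reg below are illustrative assumptions; see the package manual [34] for the full interface):

# Illustrative sketch of the lle package on simulated data.
library(lle)

set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100)   # 100 samples, 20 features

# Embed into m = 2 dimensions using k = 10 neighbours; id = TRUE also
# requests the intrinsic-dimension estimate discussed above.
res <- lle(X, m = 2, k = 10, reg = 2, id = TRUE)
head(res$Y)        # the low-dimensional representation
res$id             # intrinsic dimension estimates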

5.2. RDRToolbox. RDRToolbox is an R package developed for nonlinear dimension reduction with LLE and Isomap. The package is maintained by Bartenhagen [35]. It offers the transformation of high-dimensional data into low-dimensional data by using either LLE or Isomap, and a plotting function is provided to plot the results. In addition, the Davis-Bouldin Index is provided for the purpose of validating clusters. It is available for Linux, MacOS, and Windows.

5.3. Scikit-Learn. Scikit-learn is software implemented in Python that integrates machine learning algorithms [36]. It is simple to use and allows users to apply a variety of machine learning algorithms, including classification, clustering, feature extraction, model selection, and manifold learning. Isomap, LLE, and Local Tangent Space Alignment (LTSA) are provided by this software; see http://scikit-learn.org/stable for further details. The software can be installed on a variety of platforms, such as Windows and Ubuntu.

5.4. Case Study. This section demonstrates the dimension reduction workflow for the publicly available Golub et al. leukemia dataset (see Figure 5). The data are available as an R package and can be downloaded and loaded via:

> source("http://bioconductor.org/biocLite.R")
> biocLite("golubEsets")
> library(golubEsets)
> data(Golub_Merge)

The dataset consists of 72 samples, divided into 47 ALL and 25 AML patients, with 7129 expression values each. In this example we compute a two-dimensional LLE and Isomap embedding and plot the results. First, we extract the features and class labels:

> golubExprs = t(exprs(Golub_Merge))
> labels = pData(Golub_Merge)$ALL.AML
> dim(golubExprs)


Figure 4: Plot of dimension versus residual variance.

Figure 5: Two-dimensional embedding of the Golub et al. [69] leukemia dataset ((a) Isomap; (b) LLE).

Table 9: A summary of LLE software.

Number | Software | Author/year | Language | Features
1 | lle | Diedrich and Abel [34] | R | (i) LLE algorithm for transforming high-dimensional data into low-dimensional data; (ii) subset selection and calculation of the intrinsic dimension
2 | RDRToolbox | Bartenhagen [35] | R | (i) LLE and Isomap for feature extraction; (ii) Davis-Bouldin Index for the purpose of validating clusters
3 | Scikit-learn | Pedregosa et al. [36] | Python | (i) Classification, manifold learning, feature extraction, clustering, and other methods; (ii) LLE, Isomap, and LTSA


Table 10: Sources of LLE software.

Number | Software | Source
1 | lle | http://cran.r-project.org/web/packages/lle/index.html
2 | RDRToolbox | http://www.bioconductor.org/packages/2.12/bioc/html/RDRToolbox.html
3 | Scikit-learn | http://scikit-learn.org/dev/install.html

Table 11: Related work.

Software | Author | Motivation | Advantage
RDRToolbox | Bartenhagen [35] | (i) To reduce high-dimensional microarray data; (ii) to preserve most of the significant information and generate data with characteristics similar to the high-dimensional original | (i) Combines information from all features; (ii) suited for low-dimensional representations of the whole data
Scikit-learn | Pedregosa et al. [36] | To calculate activity index parameters through clustering | (i) Easy-to-use interface; (ii) can easily be integrated into applications outside the traditional range of statistical data analysis
lle | Diedrich and Abel [34] | Currently available dimension reduction methods are either supervised (data need to be labeled) or computationally complex | (i) Fast; (ii) purely unsupervised

The residual variance of Isomap can be used to estimate the intrinsic dimension of the dataset:

> Isomap(data = golubExprs, dims = 1:10, plotResiduals = TRUE, k = 5)

Based on Figure 4, looking at the dimension at which the residual variance stops decreasing significantly, we can expect a low intrinsic dimension of two or three and therefore a visualization true to the structure of the original data. Next, we compute the LLE and Isomap embeddings for two target dimensions:

> golubIsomap = Isomap(data = golubExprs, dims = 2, k = 5)
> golubLLE = LLE(data = golubExprs, dim = 2, k = 5)

The Davis-Bouldin Index shows that the ALL and AML patients are well separated into two clusters:

> DBIndex(data = golubIsomap$dim2, labels = labels)
> DBIndex(data = golubLLE, labels = labels)

Finally, we use plotDR to plot the two-dimensional data:

> plotDR(data = golubIsomap$dim2, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "Isomap")
> plotDR(data = golubLLE, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "LLE")

Both visualizations, using either Isomap or LLE, show distinct clusters of ALL and AML patients, although the clusters overlap less in the Isomap embedding. This is consistent with the DB-Index, which is very low for both methods but slightly higher for LLE. A three-dimensional visualization can be generated in the same manner and is best analyzed interactively within R.

5.5. Summary of LLE Software. Tables 9 and 10 show the summary and sources of the LLE software, respectively. Table 11 shows related work on the discussed software.

6. Conclusion

Nowadays, numerous software applications have been developed to help users implement feature extraction on gene expression data. In this paper, we have presented a comprehensive review of software for feature extraction methods: PCA, ICA, PLS, and LLE. These software applications still have some limitations, in statistical aspects as well as in computational performance. In conclusion, there is a need for the development of better software.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Universiti Teknologi Malaysia for funding this research through Research University Grants (Grant nos. J130000262808J80 and J130000250705H50). The authors would also like to thank the Research Management Centre (RMC), Universiti Teknologi Malaysia, for supporting this research.

References

[1] S. Van Sanden, D. Lin, and T. Burzykowski, "Performance of gene selection and classification methods in a microarray setting: a simulation study," Communications in Statistics – Simulation and Computation, vol. 37, no. 1-2, pp. 409–424, 2008.

[2] Q. Liu, A. H. Sung, Z. Chen et al., "Gene selection and classification for cancer microarray data based on machine learning and similarity measures," BMC Genomics, vol. 12, supplement 5, article S1, 2011.

[3] M. Gheorghe and V. Mitrana, "A formal language-based approach in biology," Comparative and Functional Genomics, vol. 5, no. 1, pp. 91–94, 2004.

[4] P. G. Higgs and T. Attwood, "Bioinformatics and molecular evolution," Comparative and Functional Genomics, vol. 6, pp. 317–319, 2005.

[5] S. Le, J. Josse, and F. Husson, "FactoMineR: an R package for multivariate analysis," Journal of Statistical Software, vol. 25, no. 1, pp. 1–18, 2008.

[6] F. Husson, J. Josse, S. Le, and J. Mazet, "Package 'FactoMineR'," 2013, http://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf.

[7] I. Hoffmann, "Principal Component Analysis with FactoMineR," 2010, http://www.statistik.tuwien.ac.at/public/filz/students/seminar/ws1011/hoffmann_ausarbeitung.pdf.

[8] D. Beaton, C. R. C. Fatt, and H. Abdi, Package 'ExPosition', 2013, http://cran.r-project.org/web/packages/ExPosition/ExPosition.pdf.

[9] A. Lucas, "Package 'amap'," 2013, http://cran.r-project.org/web/packages/amap/vignettes/amap.pdf.

[10] J. Thioulouse, D. Chessel, S. Doledec, and J.-M. Olivier, "ADE-4: a multivariate analysis and graphical display software," Journal of Statistics and Computing, vol. 7, no. 1, pp. 75–83, 1997.

[11] A. C. Culhane, J. Thioulouse, G. Perriere, and D. G. Higgins, "MADE4: an R package for multivariate analysis of gene expression data," Bioinformatics, vol. 21, no. 11, pp. 2789–2790, 2005.

[12] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier, 2nd edition, 2005.

[13] F. W. Young and C. M. Bann, "ViSta: a visual statistics system," in Statistical Computing Environments for Social Research, R. A. Stine and J. Fox, Eds., pp. 207–235, Sage, 1992.

[14] D. Grapov and J. W. Newman, "imDEV: a graphical user interface to R multivariate analysis tools in Microsoft Excel," Bioinformatics, vol. 28, no. 17, Article ID bts439, pp. 2288–2290, 2012.

[15] The MathWorks, Statistics Toolbox for Use with MATLAB: User Guide Version 4, 2003, http://www.pi.ingv.it/~longo/CorsoMatlab/OriginalManuals/stats.pdf.

[16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.

[17] NAG Toolbox for Matlab, g03aa: G03 – Multivariate Methods, http://www.nag.com/numeric/MB/manual_22_1/pdf/G03/g03aa.pdf.

[18] J. L. Marchini, C. Heaton, and B. D. Ripley, "Package 'fastICA'," http://cran.r-project.org/web/packages/fastICA/fastICA.pdf.

[19] K. Nordhausen, J.-F. Cardoso, J. Miettinen, H. Oja, E. Ollila, and S. Taskinen, "Package 'JADE'," http://cran.r-project.org/web/packages/JADE/JADE.pdf.

[20] D. Keith, C. Hoge, R. Frank, and A. D. Malony, HiPerSAT Technical Report, 2005, http://nic.uoregon.edu/docs/reports/HiPerSATTechReport.pdf.

[21] A. Biton, A. Zinovyev, E. Barillot, and F. Radvanyi, "MineICA: independent component analysis of transcriptomic data," 2013, http://www.bioconductor.org/packages/2.13/bioc/vignettes/MineICA/inst/doc/MineICA.pdf.

[22] J. Karnanen, "Independent component analysis using score functions from the Pearson system," 2006, http://cran.r-project.org/web/packages/PearsonICA/PearsonICA.pdf.

[23] A. Teschenforff, Independent Component Analysis Using Maximum Likelihood, 2012, http://cran.r-project.org/web/packages/mlica2/mlica2.pdf.

[24] M. Barker and W. Rayens, "Partial least squares for discrimination," Journal of Chemometrics, vol. 17, no. 3, pp. 166–173, 2003.

[25] K. Jørgensen, V. Segtnan, K. Thyholt, and T. Næs, "A comparison of methods for analysing regression models with both spectral and designed variables," Journal of Chemometrics, vol. 18, no. 10, pp. 451–464, 2004.

[26] K. H. Liland and U. G. Indahl, "Powered partial least squares discriminant analysis," Journal of Chemometrics, vol. 23, no. 1, pp. 7–18, 2009.

[27] N. Krämer, A. Boulesteix, and G. Tutz, "Penalized Partial Least Squares with applications to B-spline transformations and functional data," Chemometrics and Intelligent Laboratory Systems, vol. 94, no. 1, pp. 60–69, 2008.

[28] D. Chung and S. Keles, "Sparse partial least squares classification for high dimensional data," Statistical Applications in Genetics and Molecular Biology, vol. 9, no. 1, 2010.

[29] N. Krämer and M. Sugiyama, "The degrees of freedom of partial least squares regression," Journal of the American Statistical Association, vol. 106, no. 494, pp. 697–705, 2011.

[30] S. Chakraborty and S. Datta, "Surrogate variable analysis using partial least squares (SVA-PLS) in gene expression studies," Bioinformatics, vol. 28, no. 6, pp. 799–806, 2012.

[31] G. Sanchez and L. Trinchera, Tools for Partial Least Squares Path Modeling, 2013, http://cran.r-project.org/web/packages/plspm/plspm.pdf.

[32] F. Bertrand, N. Meyer, and M. M. Bertrand, "Partial Least Squares Regression for generalized linear models," http://cran.r-project.org/web/packages/plsRglm/plsRglm.pdf.

[33] M. Gutkin, R. Shamir, and G. Dror, "SlimPLS: a method for feature selection in gene expression-based disease classification," PLoS ONE, vol. 4, no. 7, Article ID e6416, 2009.

[34] H. Diedrich and M. Abel, "Package 'lle'," http://cran.r-project.org/web/packages/lle/lle.pdf.

[35] C. Bartenhagen, "RDRToolbox: a package for nonlinear dimension reduction with Isomap and LLE," 2013, http://bioconductor.org/packages/2.13/bioc/vignettes/RDRToolbox/inst/doc/vignette.pdf.

[36] F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[37] S. Penel, Package 'ade4', 2013, http://cran.r-project.org/web/packages/ade4/ade4.pdf.

[38] S. Dray and A. Dufour, "The ade4 package: implementing the duality diagram for ecologists," Journal of Statistical Software, vol. 22, no. 4, pp. 1–20, 2007.

[39] K. Moorthy, M. S. Mohamad, S. Deris, and Z. Ibrahim, "Multivariate analysis of gene expression data and missing value imputation based on llsimpute algorithm," International Journal of Innovative Computing, Information and Control, vol. 6, no. 5, pp. 1335–1339, 2012.

[40] A. Culhane, Package 'made4', 2013, http://bioconductor.org/packages/release/bioc/manuals/made4/man/made4.pdf.

[41] C. V. Subbulakshmi, S. N. Deepa, and N. Malathi, "Comparative analysis of XLMiner and WEKA for pattern classification," in Proceedings of the IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT '12), pp. 453–457, Ramanathapuram, Tamil Nadu, India, August 2012.

[42] S. Jothi and S. Anita, "Data mining classification techniques applied for cancer disease – a case study using Xlminer," International Journal of Engineering Research & Technology, vol. 1, no. 8, 2012.

[43] T. Anh and S. Magi, Principal Component Analysis: Final Paper in Financial Pricing, National Cheng Kung University, 2009.

[44] L. Tierney, Lisp-Stat: An Object-Oriented Environment for Statistical Computing & Dynamic Graphics, Addison-Wesley, Reading, Mass, USA, 1990.

[45] F. W. Young and D. J. Lubinsky, "Guiding data analysis with visual statistical strategies," Journal of Computational and Graphical Statistics, vol. 4, pp. 229–250, 1995.

[46] F. W. Young and J. B. Smith, "Towards a structured data analysis environment: a cognition-based design," in Computing and Graphics in Statistics, A. Buja and P. A. Tukey, Eds., vol. 36, pp. 253–279, Springer, New York, NY, USA, 1991.

[47] F. W. Young, R. A. Faldowski, and M. M. McFarlane, "Multivariate statistical visualization," in Handbook of Statistics, C. R. Rao, Ed., pp. 958–998, 1993.

[48] M. McFarlane and F. W. Young, "Graphical sensitivity analysis for multidimensional scaling," Journal of Computational and Graphical Statistics, vol. 3, no. 1, pp. 23–34, 1994.

[49] P. M. Valero-Mora and R. D. Ledesma, "Using interactive graphics to teach multivariate data analysis to psychology students," Journal of Statistics Education, vol. 19, no. 1, 2011.

[50] E. Frank, M. Hall, G. Holmes et al., "Weka – a machine learning workbench for data mining," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 1269–1277, 2010.

[51] S. S. Prabhume and S. R. Sathe, "Reconstruction of a complete dataset from an incomplete dataset by PCA (principal component analysis) technique: some results," International Journal of Computer Science and Network Security, vol. 10, no. 12, pp. 195–199, 2010.

[52] J. Khan, J. S. Wei, M. Ringner et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.

[53] P. Comon, "Independent component analysis, a new concept?" Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.

[54] A. Hyvärinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.

[55] V. Zarzoso and P. Comon, "Comparative speed analysis of FastICA," in Independent Component Analysis and Signal Separation, M. E. Davies, C. J. James, S. A. Abdallah, and M. D. Plumbley, Eds., vol. 4666 of Lecture Notes in Computer Science, pp. 293–300, Springer, Berlin, Germany, 2007.

[56] J. F. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," IEE Proceedings Part F: Radar and Signal Processing, vol. 140, no. 6, pp. 362–370, 1993.

[57] A. Belouchrani, K. Abed-Meraim, J. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434–444, 1997.

[58] L. Tong, V. C. Soon, Y. F. Huang, and R. Liu, "AMUSE: a new blind identification algorithm," in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1784–1787, May 1990.

[59] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Proceedings of the Advances in Neural Information Processing Systems Conference, pp. 757–763, 1996.

[60] A. Delorme and S. Makeig, "EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.

[61] A. Biton, "Package 'MineICA'," 2013, http://www.bioconductor.org/packages/2.13/bioc/manuals/MineICA/man/MineICA.pdf.

[62] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.

[63] M. Schmidt, D. Bohm, C. von Torne et al., "The humoral immune system has a key prognostic impact in node-negative breast cancer," Cancer Research, vol. 68, no. 13, pp. 5405–5413, 2008.

[64] I. S. Helland, "On the structure of partial least squares regression," Communications in Statistics – Simulation and Computation, vol. 17, no. 2, pp. 581–607, 1988.

[65] H. Martens and T. Næs, Multivariate Calibration, John Wiley & Sons, London, UK, 1989.

[66] N. Krämer and A. Boulesteix, "Package 'ppls'," 2013, http://cran.r-project.org/web/packages/ppls/ppls.pdf.

[67] H. Wold, "Soft modeling: the basic design and some extensions," in Systems under Indirect Observations: Causality, Structure, Prediction, K. G. Joreskog and H. Wold, Eds., Part 2, pp. 1–54, North-Holland, Amsterdam, The Netherlands, 1982.

[68] H. Akaike, "Likelihood and the Bayes procedure," Trabajos de Estadística y de Investigación Operativa, vol. 31, no. 1, pp. 143–166, 1980.

[69] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[70] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.

[71] D. Ridder and R. P. W. Duin, Locally Linear Embedding, University of Technology, Delft, The Netherlands, 2002.


Table 7: Sources of PLS software.

1. PLS Discriminant Analysis: http://www.camo.com/downloads/sample-data.html
2. Least Squares-PLS: http://cran.r-project.org/web/packages/lspls/index.html
3. Powered PLS Discriminant Analysis: http://cran.r-project.org/web/packages/pls/index.html
4. Penalized PLS: http://cran.r-project.org/web/packages/ppls/index.html
5. SlimPLS: http://cran.r-project.org/web/packages/SlimPLS/index.html
6. Sparse PLS Discriminant Analysis / Sparse Generalized PLS: http://cran.r-project.org/web/packages/spls/index.html
7. Degrees of Freedom of PLS: http://cran.r-project.org/web/packages/plsdof/index.html
8. Surrogate Variable Analysis PLS: http://cran.r-project.org/web/packages/svapls/index.html
9. PLS Path Modelling: http://cran.r-project.org/web/packages/plspm/index.html
10. PLS Regression for Generalized Linear Models: http://cran.r-project.org/web/packages/plsRglm/index.html

Table 8: Related work. A worked sketch for the plsRglm entry follows the table.

plsRglm (R package), Bertrand et al. (2010) [32]. Motivation: (i) to deal with incomplete datasets using cross-validation; (ii) to extend PLS regression to generalized linear models. Advantages: (i) provides formula support; (ii) several new classes and their generics; (iii) custom GLR models and graphics to assess the bootstrap-based significance of the predictors.

SVA-PLS, Chakraborty and Datta [30]. Motivation: (i) to identify the genes that are differentially expressed between samples from two different tissue types; (ii) to identify the hidden effects of the underlying latent factors in a gene expression profiling study. Advantages: (i) relatively better at discovering a higher proportion of the truly significant genes; (ii) low error rate; (iii) high sensitivity and specificity.

SlimPLS, Gutkin et al. [33]. Motivation: to obtain a low-dimensional approximation of a matrix that is "as close as possible" to a given vector. Advantages: (i) focuses solely on feature selection; (ii) can be used as a preprocessing stage with different classifiers.
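To give a feel for the plsRglm entry above, the following console-style sketch fits a PLS logistic regression with three components; the plsRglm() argument names and the aze_compl example dataset are assumptions taken from the package manual [32], not part of this review:

> library(plsRglm)
> data(aze_compl)    # example dataset assumed to ship with the package
> modpls = plsRglm(aze_compl$y, aze_compl[, -1], nt = 3, modele = "pls-glm-logistic")
> summary(modpls)    # inspect the fitted PLS-GLR components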

5.1. lle. The lle package is maintained by Diedrich and Abel [34]. Its main functions allow users to perform LLE and to plot the results. The implementation of LLE is based on the idea of Ridder and Duin [71]. Besides that, enhancements such as subset selection and calculation of the intrinsic dimension are offered; a short usage sketch is given below. The package can be installed on Windows, Linux, and MacOS.
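To make the interface concrete, here is a minimal console-style sketch, assuming the lle() signature documented in the CRAN manual [34]; the built-in iris table is only a stand-in for a real expression matrix:

> library(lle)
> X = as.matrix(iris[, 1:4])                 # stand-in data matrix: rows = samples, columns = features
> results = lle(X, m = 2, k = 5, id = TRUE)  # m: target dimension; k: neighbours; id: estimate intrinsic dimension
> plot(results$Y, main = "LLE embedding")    # results$Y holds the embedded coordinates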

5.2. RDRToolbox. RDRToolbox is an R package developed for nonlinear dimension reduction with LLE and Isomap. The package is maintained by Bartenhagen [35]. It offers the transformation of high-dimensional data into low-dimensional data using either LLE or Isomap, and a plotting function is provided to visualize the results. In addition, the Davis-Bouldin Index is provided for validating clusters. It runs on Linux, MacOS, and Windows; the case study in Section 5.4 demonstrates its use.
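Installation follows the standard Bioconductor route of that era, mirroring the biocLite calls used in the case study below (a sketch, assuming the Bioconductor package name RDRToolbox):

> source("http://bioconductor.org/biocLite.R")
> biocLite("RDRToolbox")
> library(RDRToolbox)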

5.3. Scikit-Learn. Scikit-learn is software implemented in Python that integrates a broad range of machine learning algorithms [36]. It is simple to use and allows users to apply a variety of methods, including classification, clustering, feature extraction, model selection, and manifold learning; Isomap, LLE, and Local Tangent Space Alignment (LTSA) are among the manifold learning methods provided. Please see http://scikit-learn.org/stable for further details. The software can be installed on a variety of platforms, such as Windows and Ubuntu.

5.4. Case Study. This section demonstrates the dimension reduction workflow for the publicly available Golub et al. leukemia dataset (see Figure 5). The data are available as an R package and can be downloaded and loaded via

> source("http://bioconductor.org/biocLite.R")
> biocLite("golubEsets")
> library(golubEsets)
> data(Golub_Merge)

The dataset consists of 72 samples, divided into 47 ALL and 25 AML patients, with 7129 expression values per sample. In this example, we compute a two-dimensional LLE and Isomap embedding and plot the results. First, we extract the features and class labels:

> golubExprs = t(exprs(Golub_Merge))   # 72 x 7129 matrix: rows = samples, columns = genes
> labels = pData(Golub_Merge)$ALL.AML  # class label (ALL or AML) for each sample
> dim(golubExprs)

Figure 4: Plot of dimension versus residual variance.

Figure 5: Two-dimensional embedding of the Golub et al. [69] leukemia dataset (top: Isomap; bottom: LLE).

Table 9: A summary of LLE software.

1. lle, Diedrich and Abel [34], R. Features: (i) the LLE algorithm for transforming high-dimensional data into low-dimensional data; (ii) subset selection and calculation of the intrinsic dimension.
2. RDRToolbox, Bartenhagen [35], R. Features: (i) LLE and Isomap for feature extraction; (ii) the Davis-Bouldin Index for validating clusters.
3. Scikit-learn, Pedregosa et al. [36], Python. Features: (i) classification, manifold learning, feature extraction, clustering, and other methods; (ii) LLE, Isomap, and LTSA.

Table 10: Sources of LLE software.

1. lle: http://cran.r-project.org/web/packages/lle/index.html
2. RDRToolbox: http://www.bioconductor.org/packages/2.12/bioc/html/RDRToolbox.html
3. Scikit-learn: http://scikit-learn.org/dev/install.html

Table 11: Related work.

RDRToolbox, Bartenhagen [35]. Motivation: (i) to reduce high-dimensional microarray data; (ii) to preserve most of the significant information and generate data with characteristics similar to the high-dimensional original. Advantages: (i) combines information from all features; (ii) suited for low-dimensional representations of the whole data.

Scikit-learn, Pedregosa et al. [36]. Motivation: to calculate activity index parameters through clustering. Advantages: (i) easy-to-use interface; (ii) can easily be integrated into applications outside the traditional range of statistical data analysis.

lle, Diedrich and Abel [34]. Motivation: currently available dimension reduction methods are either supervised, where data need to be labeled, or computationally complex. Advantages: (i) fast; (ii) purely unsupervised.

The residual variance of Isomap can be used to estimate the intrinsic dimension of the dataset:

> Isomap(data = golubExprs, dims = 1:10, plotResiduals = TRUE, k = 5)

Based on Figure 4, and judging by the dimension at which the residual variance stops decreasing significantly, we can expect a low intrinsic dimension of two or three, and therefore a visualization true to the structure of the original data. Next, we compute the LLE and Isomap embeddings for two target dimensions:

> golubIsomap = Isomap(data = golubExprs, dims = 2, k = 5)
> golubLLE = LLE(data = golubExprs, dim = 2, k = 5)

The Davis-Bouldin Index shows that the ALL and AML patients are well separated into two clusters:

> DBIndex(data = golubIsomap$dim2, labels = labels)
> DBIndex(data = golubLLE, labels = labels)

Finally, we use plotDR to plot the two-dimensional data:

> plotDR(data = golubIsomap$dim2, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "Isomap")
> plotDR(data = golubLLE, labels = labels, axesLabels = c("", ""), legend = TRUE)
> title(main = "LLE")

Both visualizations, using either Isomap or LLE, show distinct clusters of ALL and AML patients, although the clusters overlap less in the Isomap embedding. This is consistent with the DB-Index, which is very low for both methods but slightly higher for LLE. A three-dimensional visualization can be generated in the same manner and is best analyzed interactively within R.
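As a closing sketch, the three-dimensional variant mentioned above only changes the target dimension; that plotDR then renders a rotatable three-dimensional scatter plot is our assumption based on the package vignette [35]:

> golubIsomap3 = Isomap(data = golubExprs, dims = 3, k = 5)  # three target dimensions
> golubLLE3 = LLE(data = golubExprs, dim = 3, k = 5)
> plotDR(data = golubIsomap3$dim3, labels = labels, legend = TRUE)  # assumed to open an interactive 3D view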

5.5. Summary of LLE Software. Tables 9 and 10 show the summary and sources of LLE software, respectively. Table 11 shows related work on the discussed software.

6. Conclusion

Nowadays, numerous software applications have been developed to help users perform feature extraction on gene expression data. In this paper, we present a comprehensive review of software for feature extraction methods, namely, PCA, ICA, PLS, and LLE. These software applications still have limitations in terms of statistical capability as well as computational performance, so there remains a need for the development of better software.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Universiti Teknologi Malaysia for funding this research through Research University Grants (Grant nos. J130000262808J80 and J130000250705H50). The authors would also like to thank the Research Management Centre (RMC), Universiti Teknologi Malaysia, for supporting this research.

References

[1] S. Van Sanden, D. Lin, and T. Burzykowski, "Performance of gene selection and classification methods in a microarray setting: a simulation study," Communications in Statistics: Simulation and Computation, vol. 37, no. 1-2, pp. 409–424, 2008.
[2] Q. Liu, A. H. Sung, Z. Chen, et al., "Gene selection and classification for cancer microarray data based on machine learning and similarity measures," BMC Genomics, vol. 12, supplement 5, article S1, 2011.
[3] M. Gheorghe and V. Mitrana, "A formal language-based approach in biology," Comparative and Functional Genomics, vol. 5, no. 1, pp. 91–94, 2004.
[4] P. G. Higgs and T. Attwood, "Bioinformatics and molecular evolution," Comparative and Functional Genomics, vol. 6, pp. 317–319, 2005.
[5] S. Le, J. Josse, and F. Husson, "FactoMineR: an R package for multivariate analysis," Journal of Statistical Software, vol. 25, no. 1, pp. 1–18, 2008.
[6] F. Husson, J. Josse, S. Le, and J. Mazet, "Package 'FactoMineR'," 2013, http://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf.
[7] I. Hoffmann, "Principal Component Analysis with FactoMineR," 2010, http://www.statistik.tuwien.ac.at/public/filz/students/seminar/ws1011/hoffmann_ausarbeitung.pdf.
[8] D. Beaton, C. R. C. Fatt, and H. Abdi, Package 'ExPosition', 2013, http://cran.r-project.org/web/packages/ExPosition/ExPosition.pdf.
[9] A. Lucas, "Package 'amap'," 2013, http://cran.r-project.org/web/packages/amap/vignettes/amap.pdf.
[10] J. Thioulouse, D. Chessel, S. Doledec, and J.-M. Olivier, "ADE-4: a multivariate analysis and graphical display software," Journal of Statistics and Computing, vol. 7, no. 1, pp. 75–83, 1997.
[11] A. C. Culhane, J. Thioulouse, G. Perriere, and D. G. Higgins, "MADE4: an R package for multivariate analysis of gene expression data," Bioinformatics, vol. 21, no. 11, pp. 2789–2790, 2005.
[12] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier, 2nd edition, 2005.
[13] F. W. Young and C. M. Bann, "ViSta: a visual statistics system," in Statistical Computing Environments for Social Research, R. A. Stine and J. Fox, Eds., pp. 207–235, Sage, 1992.
[14] D. Grapov and J. W. Newman, "imDEV: a graphical user interface to R multivariate analysis tools in Microsoft Excel," Bioinformatics, vol. 28, no. 17, Article ID bts439, pp. 2288–2290, 2012.
[15] The MathWorks, Statistics Toolbox for Use with MATLAB: User's Guide, Version 4, 2003, http://www.pi.ingv.it/~longo/CorsoMatlab/OriginalManuals/stats.pdf.
[16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[17] NAG Toolbox for Matlab, g03aa, G03: Multivariate Methods, http://www.nag.com/numeric/MB/manual_22_1/pdf/G03/g03aa.pdf.
[18] J. L. Marchini, C. Heaton, and B. D. Ripley, "Package 'fastICA'," http://cran.r-project.org/web/packages/fastICA/fastICA.pdf.
[19] K. Nordhausen, J.-F. Cardoso, J. Miettinen, H. Oja, E. Ollila, and S. Taskinen, "Package 'JADE'," http://cran.r-project.org/web/packages/JADE/JADE.pdf.
[20] D. Keith, C. Hoge, R. Frank, and A. D. Malony, HiPerSAT Technical Report, 2005, http://nic.uoregon.edu/docs/reports/HiPerSATTechReport.pdf.
[21] A. Biton, A. Zinovyev, E. Barillot, and F. Radvanyi, "MineICA: independent component analysis of transcriptomic data," 2013, http://www.bioconductor.org/packages/2.13/bioc/vignettes/MineICA/inst/doc/MineICA.pdf.
[22] J. Karvanen, "Independent component analysis using score functions from the Pearson system," 2006, http://cran.r-project.org/web/packages/PearsonICA/PearsonICA.pdf.
[23] A. Teschendorff, Independent Component Analysis Using Maximum Likelihood, 2012, http://cran.r-project.org/web/packages/mlica2/mlica2.pdf.
[24] M. Barker and W. Rayens, "Partial least squares for discrimination," Journal of Chemometrics, vol. 17, no. 3, pp. 166–173, 2003.
[25] K. Jørgensen, V. Segtnan, K. Thyholt, and T. Næs, "A comparison of methods for analysing regression models with both spectral and designed variables," Journal of Chemometrics, vol. 18, no. 10, pp. 451–464, 2004.
[26] K. H. Liland and U. G. Indahl, "Powered partial least squares discriminant analysis," Journal of Chemometrics, vol. 23, no. 1, pp. 7–18, 2009.
[27] N. Kramer, A. Boulesteix, and G. Tutz, "Penalized Partial Least Squares with applications to B-spline transformations and functional data," Chemometrics and Intelligent Laboratory Systems, vol. 94, no. 1, pp. 60–69, 2008.
[28] D. Chung and S. Keles, "Sparse partial least squares classification for high dimensional data," Statistical Applications in Genetics and Molecular Biology, vol. 9, no. 1, 2010.
[29] N. Kramer and M. Sugiyama, "The degrees of freedom of partial least squares regression," Journal of the American Statistical Association, vol. 106, no. 494, pp. 697–705, 2011.
[30] S. Chakraborty and S. Datta, "Surrogate variable analysis using partial least squares (SVA-PLS) in gene expression studies," Bioinformatics, vol. 28, no. 6, pp. 799–806, 2012.
[31] G. Sanchez and L. Trinchera, Tools for Partial Least Squares Path Modeling, 2013, http://cran.r-project.org/web/packages/plspm/plspm.pdf.
[32] F. Bertrand, N. Meyer, and M. M. Bertrand, "Partial Least Squares Regression for generalized linear models," http://cran.r-project.org/web/packages/plsRglm/plsRglm.pdf.
[33] M. Gutkin, R. Shamir, and G. Dror, "SlimPLS: a method for feature selection in gene expression-based disease classification," PLoS ONE, vol. 4, no. 7, Article ID e6416, 2009.
[34] H. Diedrich and M. Abel, "Package 'lle'," http://cran.r-project.org/web/packages/lle/lle.pdf.
[35] C. Bartenhagen, "RDRToolbox: a package for nonlinear dimension reduction with Isomap and LLE," 2013, http://bioconductor.org/packages/2.13/bioc/vignettes/RDRToolbox/inst/doc/vignette.pdf.
[36] F. Pedregosa, G. Varoquaux, A. Gramfort, et al., "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[37] S. Penel, Package 'ade4', 2013, http://cran.r-project.org/web/packages/ade4/ade4.pdf.
[38] S. Dray and A. Dufour, "The ade4 package: implementing the duality diagram for ecologists," Journal of Statistical Software, vol. 22, no. 4, pp. 1–20, 2007.
[39] K. Moorthy, M. S. Mohamad, S. Deris, and Z. Ibrahim, "Multivariate analysis of gene expression data and missing value imputation based on llsimpute algorithm," International Journal of Innovative Computing, Information and Control, vol. 6, no. 5, pp. 1335–1339, 2012.
[40] A. Culhane, Package 'made4', 2013, http://bioconductor.org/packages/release/bioc/manuals/made4/man/made4.pdf.
[41] C. V. Subbulakshmi, S. N. Deepa, and N. Malathi, "Comparative analysis of XLMiner and WEKA for pattern classification," in Proceedings of the IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT '12), pp. 453–457, Ramanathapuram, Tamil Nadu, India, August 2012.
[42] S. Jothi and S. Anita, "Data mining classification techniques applied for cancer disease: a case study using XLMiner," International Journal of Engineering Research & Technology, vol. 1, no. 8, 2012.
[43] T. Anh and S. Magi, Principal Component Analysis: Final Paper in Financial Pricing, National Cheng Kung University, 2009.
[44] L. Tierney, Lisp-Stat: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics, Addison-Wesley, Reading, Mass, USA, 1990.
[45] F. W. Young and D. J. Lubinsky, "Guiding data analysis with visual statistical strategies," Journal of Computational and Graphical Statistics, vol. 4, pp. 229–250, 1995.
[46] F. W. Young and J. B. Smith, "Towards a structured data analysis environment: a cognition-based design," in Computing and Graphics in Statistics, A. Buja and P. A. Tukey, Eds., vol. 36, pp. 253–279, Springer, New York, NY, USA, 1991.
[47] F. W. Young, R. A. Faldowski, and M. M. McFarlane, "Multivariate statistical visualization," in Handbook of Statistics, C. R. Rao, Ed., pp. 958–998, 1993.
[48] M. McFarlane and F. W. Young, "Graphical sensitivity analysis for multidimensional scaling," Journal of Computational and Graphical Statistics, vol. 3, no. 1, pp. 23–34, 1994.
[49] P. M. Valero-Mora and R. D. Ledesma, "Using interactive graphics to teach multivariate data analysis to psychology students," Journal of Statistics Education, vol. 19, no. 1, 2011.
[50] E. Frank, M. Hall, G. Holmes, et al., "Weka: a machine learning workbench for data mining," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 1269–1277, 2010.
[51] S. S. Prabhume and S. R. Sathe, "Reconstruction of a complete dataset from an incomplete dataset by PCA (principal component analysis) technique: some results," International Journal of Computer Science and Network Security, vol. 10, no. 12, pp. 195–199, 2010.
[52] J. Khan, J. S. Wei, M. Ringner, et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.
[53] P. Comon, "Independent component analysis: a new concept?" Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[54] A. Hyvarinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.
[55] V. Zarzoso and P. Comon, "Comparative speed analysis of FastICA," in Independent Component Analysis and Signal Separation, M. E. Davies, C. J. James, S. A. Abdallah, and M. D. Plumbley, Eds., vol. 4666 of Lecture Notes in Computer Science, pp. 293–300, Springer, Berlin, Germany, 2007.
[56] J. F. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," IEE Proceedings Part F: Radar and Signal Processing, vol. 140, no. 6, pp. 362–370, 1993.
[57] A. Belouchrani, K. Abed-Meraim, J. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434–444, 1997.
[58] L. Tong, V. C. Soon, Y. F. Huang, and R. Liu, "AMUSE: a new blind identification algorithm," in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1784–1787, May 1990.
[59] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Proceedings of the Advances in Neural Information Processing Systems Conference, pp. 757–763, 1996.
[60] A. Delorme and S. Makeig, "EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
[61] A. Biton, "Package 'MineICA'," 2013, http://www.bioconductor.org/packages/2.13/bioc/manuals/MineICA/man/MineICA.pdf.
[62] A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.
[63] M. Schmidt, D. Bohm, C. von Torne, et al., "The humoral immune system has a key prognostic impact in node-negative breast cancer," Cancer Research, vol. 68, no. 13, pp. 5405–5413, 2008.
[64] I. S. Helland, "On the structure of partial least squares regression," Communications in Statistics: Simulation and Computation, vol. 17, no. 2, pp. 581–607, 1988.
[65] H. Martens and T. Naes, Multivariate Calibration, John Wiley & Sons, London, UK, 1989.
[66] N. Kramer and A. Boulesteix, "Package 'ppls'," 2013, http://cran.r-project.org/web/packages/ppls/ppls.pdf.
[67] H. Wold, "Soft modeling: the basic design and some extensions," in Systems under Indirect Observations: Causality, Structure, Prediction, K. G. Joreskog and H. Wold, Eds., Part 2, pp. 1–54, North-Holland, Amsterdam, The Netherlands, 1982.
[68] H. Akaike, "Likelihood and the Bayes procedure," Trabajos de Estadística y de Investigación Operativa, vol. 31, no. 1, pp. 143–166, 1980.
[69] T. R. Golub, D. K. Slonim, P. Tamayo, et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[70] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[71] D. Ridder and R. P. W. Duin, Locally Linear Embedding, University of Technology, Delft, The Netherlands, 2002.

Page 12: A Review of Feature Extraction Software for Microarray ... › a58e › 818fcd408e... · A Review of Feature Extraction Software for Microarray Gene Expression Data ... When gene

12 BioMed Research International

2 4 6 8 10

01

02

03

04

05

06

Dimension

Resid

ual v

aria

nce

Figure 4 Plot of dimension versus residual variance

ALL

Isomap

AML

minus150000

minus50000

0

50000

150000

minus2e + 05 minus1e + 05 0e + 00 1e + 05 2e + 05

(a)

ALLAML

LLE

minus10 minus05 00 05 10 15 20 25

minus2

minus1

0

1

2

(b)

Figure 5 Two-dimensional embedding of the Golub et al [69] leukemia dataset (top Isomap bottom LLE)

Table 9 A summary of LLE software

Number Software Authoryear Language Features

1 lle Diedrich and Abel[34] R

(i) LLE algorithm is provided for transforminghigh-dimensional data into low-dimensional data(ii) Selection of subset and calculation of the intrinsicdimension are provided

2 RDRToolbox Bartenhagen [35] R (i) LLE and Isomap for feature extraction(ii) Davis-Bouldin Index for the purpose of validating clusters

3 Scikit-learn Pedregosa et al[36] Python

(i) Classification manifold learning feature extractionclustering and other methods are offered(ii) LLE Isomap and LTSA are provided

BioMed Research International 13

Table 10 Sources of LLE software

Number Software Sources1 lle httpcranr-projectorgwebpackageslleindexhtml2 RDRToolbox httpwwwbioconductororgpackages212biochtmlRDRToolboxhtml3 Scikit-learn httpscikit-learnorgdevinstallhtml

Table 11 Related work

Software Author Motivation Advantage

RDRToolbox Bartenhagen [35]

(i) To reduce high dimensionalitymicroarray data(ii) To preserve most of the significantinformation and generate data with similarcharacteristics like the high-dimensionaloriginal

(i) Combine information from all features(ii) Suited for low-dimensionalrepresentations of the whole data

Scikit-learn Pedregosa et al [36] To calculate activity index parametersthrough clustering

(i) Easy-to-use interface(ii) Can easily be integrated into applicationsoutside the traditional range of statisticaldata analysis

lle Diedrich and Abel [34]Currently available data dimensionreduction methods are either supervisedwhere data need to be labeled orcomputational complex

(i) Fast(ii) Purely unsupervised

The residual variance of Isomap can be used to estimate theintrinsic dimension of the dataset

gt Isomap (data = golubExprs dims = 1 10 plotResid-uals = TRUE 119896 = 5)

Based on Figure 4 regarding the dimensions for which theresidual variances stop to decrease significantly we canexpect a low intrinsic dimension of two or three andtherefore visualization true to the structure of the originaldata Next we compute the LLE and Isomap embedding fortwo target dimensions

gt golubIsomap = Isomap (data = golubExprs dims =2 119896 = 5)

gt golubLLE=LLE(data = golubExprs dim= 2 119896 = 5)

The Davis-Bouldin-Index shows that the ALL and AMLpatients are well separated into two clusters

gt DBIndex(data = golubIsomap$dim2 labels =labels)

gt DBIndex(data = golubLLE labels = labels)

Finally we use plotDR to plot the two-dimensional data

gt plotDR(data = golubIsomap$dim2 labels = labelsaxesLabels = c(ldquordquo ldquordquo) legend = TRUE)

gt title (main = ldquoIsomaprdquo)

gt plotDR (data = golubLLE labels = labels axesLabels= c(ldquordquoldquordquo) legend = TRUE)

gt title (main = ldquoLLErdquo)

Both visualizations using either Isomapor LLE showdistinctclusters of ALL and AML patients although the clusteroverlaps less in the Isomap embedding This is consistentwith the DB-Index which is very low for both methods butslightly higher for LLE A three-dimensional visualizationcan be generated in the same manner and is best analyzedinteractively within R

55 Summary of LLE Software Tables 9 and 10 show thesummary and sources of LLE software respectively Table 11shows the related works in discussed software

6 Conclusion

Nowadays numerous software applications have been devel-oped to help users implement feature extraction of geneexpression data In this paper we present a comprehensivereview of software for feature extraction methodsThe meth-ods are PCA ICA PLS and LLEThese software applicationshave some limitations in terms of statistical aspects as well ascomputational performance In conclusion there is a need forthe development of better software

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank Universiti Teknologi Mala-ysia for funding this research by Research University Grants

14 BioMed Research International

(Grant nos J130000262808J80 and J130000250705H50)The authors would also like to thank Research ManagementCentre (RMC) Universiti TeknologiMalaysia for supportingthis research

References

[1] S Van Sanden D Lin and T Burzykowski ldquoPerformanceof gene selection and classification methods in a microarraysetting a simulation studyrdquo Communications in Statistics Sim-ulation and Computation vol 37 no 1-2 pp 409ndash424 2008

[2] Q Liu A H Sung Z Chen et al ldquoGene selection andclassification for cancer microarray data based on machinelearning and similarity measuresrdquo BMC Genomics vol 12supplement 5 article S1 2011

[3] M Gheorghe and V Mitrana ldquoA formal language-basedapproach in biologyrdquo Comparative and Functional Genomicsvol 5 no 1 pp 91ndash94 2004

[4] P G Higgs and T Attwood ldquoBioinformatics and molecularevolutionrdquo Comparative and Functional Genomics vol 6 pp317ndash319 2005

[5] S Le J Josse and F Husson ldquoFactoMineR an R package formultivariate analysisrdquo Journal of Statistical Software vol 25 no1 pp 1ndash18 2008

[6] F Hussen J Josse S Le and J Mazet ldquoPackage lsquoFactoMineRrsquordquo2013 httpcranr-projectorgwebpackagesFactoMineRFac-toMineRpdf

[7] I Hoffmann ldquoPrincipal Component Analysis with Fac-toMineRrdquo 2010 httpwwwstatistiktuwienacatpublicfilzstudentsseminarws1011hoffmann ausarbeitungpdf

[8] D Beaton C R C Fatt and H Abdi Package rsquoExPositionrsquo 2013httpcranr-projectorgwebpackagesExPositionExPositionpdf

[9] A Lucas ldquoPackage lsquoamaprsquordquo 2013 httpcranr-projectorgwebpackagesamapvignettesamappdf

[10] JThioulouse D Chessel S Doledec and J-M Olivier ldquoADE-4a multivariate analysis and graphical display softwarerdquo Journalof Statistics and Computing vol 7 no 1 pp 75ndash83 1997

[11] A C Culhane J Thioulouse G Perriere and D G HigginsldquoMADE4 an R package for multivariate analysis of geneexpression datardquo Bioinformatics vol 21 no 11 pp 2789ndash27902005

[12] I H Witten and E Frank Data Mining Practical MachineLearning Tools and Techniques Elsevier 2nd edition 2005

[13] F W Young and C M Bann ldquoViSta a visual statistics systemrdquoin Statistical Computing Environments for Social Research R AStine and J Fox Eds pp 207ndash235 Sage 1992

[14] D Grapov and J W Newman ldquoimDEV a graphical userinterface to R multivariate analysis tools in Microsoft ExcelrdquoBioinformatics vol 28 no 17 Article ID bts439 pp 2288ndash22902012

[15] The MathWorks Statistics Toolbox for Use with MATLAB UserGuide Version 4 2003 httpwwwpiingvitsimlongoCorsoMat-labOriginalManualsstatspdf

[16] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[17] NAG Toolbox for Matlab g03aa G03-Multivariate Meth-ods httpwwwnagcomnumericMBmanual 22 1pdfG03g03aapdf

[18] J L Marchini C Heaton and B D Ripley ldquoPackage lsquofastICArsquordquohttpcranr-projectorgwebpackagesfastICAfastICApdf

[19] K Nordhausen J-F Cardoso J Miettinen H Oja EOllila and S Taskinen ldquoPackage lsquoJADErsquordquo httpcranr-pro-jectorgwebpackagesJADEJADEpdf

[20] D Keith C Hoge R Frank and A D Malony HiPerSATTechnical Report 2005 httpnicuoregonedudocsreportsHiPerSATTechReportpdf

[21] A Biton A Zinovyev E Barillot and F Radvanyi ldquoMineICAindependent component analysis of transcriptomic datardquo 2013httpwwwbioconductororgpackages213biocvignettesMineICAinstdocMineICApdf

[22] J Karnanen ldquoIndependent component analysis using scorefunctions from the Pearson systemrdquo 2006 httpcranr-projectorgwebpackagesPearsonICAPearsonICApdf

[23] A Teschenforff Independent Component Analysis Using Maxi-mum Likelihood 2012 httpcranr-projectorgwebpackagesmlica2mlica2pdf

[24] M Barker andW Rayens ldquoPartial least squares for discrimina-tionrdquo Journal of Chemometrics vol 17 no 3 pp 166ndash173 2003

[25] K Joslashrgensen V Segtnan K Thyholt and T Naeligs ldquoA com-parison of methods for analysing regression models with bothspectral and designed variablesrdquo Journal of Chemometrics vol18 no 10 pp 451ndash464 2004

[26] K H Liland and U G Indahl ldquoPowered partial least squaresdiscriminant analysisrdquo Journal of Chemometrics vol 23 no 1pp 7ndash18 2009

[27] N Kramer A Boulesteix and G Tutz ldquoPenalized PartialLeast Squares with applications to B-spline transformationsand functional datardquo Chemometrics and Intelligent LaboratorySystems vol 94 no 1 pp 60ndash69 2008

[28] D Chung and S Keles ldquoSparse partial least squares classifi-cation for high dimensional datardquo Statistical Applications inGenetics and Molecular Biology vol 9 no 1 2010

[29] N Kramer andM Sugiyama ldquoThe degrees of freedom of partialleast squares regressionrdquo Journal of the American StatisticalAssociation vol 106 no 494 pp 697ndash705 2011

[30] S Chakraborty and S Datta ldquoSurrogate variable analysis usingpartial least squares (SVA-PLS) in gene expression studiesrdquoBioinformatics vol 28 no 6 pp 799ndash806 2012

[31] G Sanchez and L Trinchera Tools for Partial Least SquaresPath Modeling 2013 httpcranr-projectorgwebpackagesplspmplspmpdf

[32] F Bertrand N Meyer and M M Bertrand ldquoPartialLeast Squares Regression for generalized linear modelsrdquohttpcranr-projectorgwebpackagesplsRglmplsRglmpdf

[33] M Gutkin R Shamir and G Dror ldquoSlimPLS a method for fea-ture selection in gene expression-based disease classificationrdquoPLoS ONE vol 4 no 7 Article ID e6416 2009

[34] H Diedrich and M Abel ldquoPackage lsquollersquordquo httpcranr-pro-jectorgwebpackagesllellepdf

[35] C Bartenhagen ldquoRDRToolbox a package for nonlinear dimen-sion reduction with Isomap and LLErdquo 2013 httpbioconduc-tororgpackages213biocvignettesRDRToolboxinstdocvi-gnettepdf

[36] F Pedregosa G Varoquaux A Gramfort et al ldquoScikit-learnmachine learning in Pythonrdquo Journal of Machine LearningResearch vol 12 pp 2825ndash2830 2011

[37] S Penel Package ldquoade4rdquo 2013 httpcranr-projectorgwebpackagesade4ade4pdf

BioMed Research International 15

[38] S Dray and A Dufour ldquoThe ade4 package implementing theduality diagram for ecologistsrdquo Journal of Statistical Softwarevol 22 no 4 pp 1ndash20 2007

[39] K Moorthy M S Mohamad S Deris and Z Ibrahim ldquoMul-tivariate analysis of gene expression data and missing valueimputation based on llsimpute algorithmrdquo International Journalof Innovative Computing Information and Control vol 6 no 5pp 1335ndash1339 2012

[40] A Culhane Package lsquomade4rsquo 2013 httpbioconductororgpackagesreleasebiocmanualsmade4manmade4pdf

[41] C V Subbulakshmi S N Deepa andNMalathi ldquoComparativeanalysis of XLMiner and WEKA for pattern classificationrdquo inProceedings of the IEEE International Conference on AdvancedCommunication Control and Computing Technologies (ICAC-CCT rsquo12) pp 453ndash457 Ramanathapuram Tamil Nadu IndiaAugust 2012

[42] S Jothi and S Anita ldquoData mining classification techniquesapplied for cancer diseasemdasha case study using Xlminerrdquo Inter-national Journal of Engineering ResearchampTechnology vol 1 no8 2012

[43] T Anh and S Magi Principal Component Analysis Final Paperin Financial Pricing National Cheng Kung University 2009

[44] L Tierney Lisp-Stat An Object-Oriented Environment forStatistical Computing amp Dynamic Graphics Addison-WesleyReading Mass USA 1990

[45] F W Young and D J Lubinsky ldquoGuiding data analysiswith visual statistical strategiesrdquo Journal of Computational andGraphical Statistics vol 4 pp 229ndash250 1995

[46] FW Young and J B Smith ldquoTowards a structured data analysisenvironment a cognition-based designrdquo in Computing andGraphics in Statistics A Buja and P A Tukey Eds vol 36 pp253ndash279 Springer New York NY USA 1991

[47] FW Young R A Faldowski andMMMcFarlane ldquoMultivari-ate statistical visualizationrdquo inHandbook of Statistics C R RaoEd pp 958ndash998 1993

[48] M McFarlane and F W Young ldquoGraphical sensitivity analysisfor multidimensional scalingrdquo Journal of Computational andGraphical Statistics vol 3 no 1 pp 23ndash34 1994

[49] P M Valero-Mora and R D Ledesma ldquoUsing interactivegraphics to teach multivariate data analysis to psychologystudentsrdquo Journal of Statistics Education vol 19 no 1 2011

[50] E Frank M Hall G Holmes et al ldquoWekamdashamachine learningworkbench for data miningrdquo in Data Mining and KnowledgeDiscoveryHandbook OMaimon and L Rokach Eds pp 1269ndash1277 2010

[51] S S Prabhume and S R Sathe ldquoReconstruction of a completedataset from an incomplete dataset by PCA (principal compo-nent analysis) technique some resultsrdquo International Journal ofComputer Science and Network Security vol 10 no 12 pp 195ndash199 2010

[52] J Khan J S Wei M Ringner et al ldquoClassification anddiagnostic prediction of cancers using gene expression profilingand artificial neural networksrdquoNatureMedicine vol 7 no 6 pp673ndash679 2001

[53] P Comon ldquoIndependent component analysis a new conceptrdquoSignal Processing vol 36 no 3 pp 287ndash314 1994

[54] A Hyvarinen ldquoFast and robust fixed-point algorithms forindependent component analysisrdquo IEEE Transactions on NeuralNetworks vol 10 no 3 pp 626ndash634 1999

[55] V Zarzoso and P Comon ldquoComparative speed analysis ofFastICArdquo in Independent Component Analysis and Signal Sep-aration M E Davies C J James S A Abdallah and M D

Plumbley Eds vol 4666 of Lecture Notes in Computer Sciencepp 293ndash300 Springer Berlin Germany 2007

[56] J F Cardoso and A Souloumiac ldquoBlind beamforming for non-Gaussian signalsrdquo IEE Proceedings Part F Radar and SignalProcessing vol 140 no 6 pp 362ndash370 1993

[57] A Belouchrani K Abed-Meraim J Cardoso and E MoulinesldquoA blind source separation technique using second-order statis-ticsrdquo IEEE Transactions on Signal Processing vol 45 no 2 pp434ndash444 1997

[58] L Tong V C Soon Y F Huang and R Liu ldquoAMUSE anew blind identification algorithmrdquo in Proceedings of the IEEEInternational Symposium onCircuits and Systems pp 1784ndash1787May 1990

[59] S Amari A Cichocki and H H Yang ldquoA new learningalgorithm for blind signal separationrdquo in Proceedings of theAdvances in Neural Information Processing Systems Conferencepp 757ndash763 1996

[60] A Delorme and S Makeig ldquoEEGLAB an open source toolboxfor analysis of single-trial EEG dynamics including indepen-dent component analysisrdquo Journal of NeuroscienceMethods vol134 no 1 pp 9ndash21 2004

[61] A Biton ldquoPackage lsquoMineICArsquordquo 2013 httpwwwbioconductororgpackages213biocmanualsMineICAmanMineICApdf

[62] A Hyvaerinen J Karhunen and E Oja Independent Compo-nent Analysis John Wiley amp Sons New York NY USA 2001

[63] M Schmidt D Bohm C von Torne et al ldquoThe humoralimmune system has a key prognostic impact in node-negativebreast cancerrdquo Cancer Research vol 68 no 13 pp 5405ndash54132008

[64] I S Helland ldquoOn the structure of partial least squares regres-sionrdquo Communications in Statistics Simulation and Computa-tion vol 17 no 2 pp 581ndash607 1988

[65] H Martens and T NaesMultivariate calibration JohnWiley ampSons London UK 1989

[66] N Kramer and A Boulesteix ldquoPackage ldquopplsrdquordquo 2013 httpcranrprojectorgwebpackagespplspplspdf

[67] HWold ldquoSoftmodeling the basic design and some extensionsrdquoin Systems under Indirect Observations Causality StructurePrediction K G Joreskog and H Wold Eds Part 2 pp 1ndash54North-Holland Amsterdam The Netherlands 1982

[68] H Akaike ldquoLikelihood and the Bayes procedurerdquo Trabajos deEstadıstica y de Investigacion Operativa vol 31 no 1 pp 143ndash166 1980

[69] T R Golub D K Slonim P Tamayo et al ldquoMolecularclassification of cancer class discovery and class prediction bygene expressionmonitoringrdquo Science vol 286 no 5439 pp 531ndash537 1999

[70] S T Roweis and L K Saul ldquoNonlinear dimensionality reduc-tion linear embeddingrdquo Science vol 290 no 5500 pp 2323ndash2326 2000

[71] D Ridder and R P W Duin Locally Linear Embedding Uni-versity of Technology Delft The Netherlands 2002

Page 13: A Review of Feature Extraction Software for Microarray ... › a58e › 818fcd408e... · A Review of Feature Extraction Software for Microarray Gene Expression Data ... When gene

BioMed Research International 13

Table 10 Sources of LLE software

Number Software Sources1 lle httpcranr-projectorgwebpackageslleindexhtml2 RDRToolbox httpwwwbioconductororgpackages212biochtmlRDRToolboxhtml3 Scikit-learn httpscikit-learnorgdevinstallhtml

Table 11 Related work

Software Author Motivation Advantage

RDRToolbox Bartenhagen [35]

(i) To reduce high dimensionalitymicroarray data(ii) To preserve most of the significantinformation and generate data with similarcharacteristics like the high-dimensionaloriginal

(i) Combine information from all features(ii) Suited for low-dimensionalrepresentations of the whole data

Scikit-learn Pedregosa et al [36] To calculate activity index parametersthrough clustering

(i) Easy-to-use interface(ii) Can easily be integrated into applicationsoutside the traditional range of statisticaldata analysis

lle Diedrich and Abel [34]Currently available data dimensionreduction methods are either supervisedwhere data need to be labeled orcomputational complex

(i) Fast(ii) Purely unsupervised

The residual variance of Isomap can be used to estimate theintrinsic dimension of the dataset

gt Isomap (data = golubExprs dims = 1 10 plotResid-uals = TRUE 119896 = 5)

Based on Figure 4 regarding the dimensions for which theresidual variances stop to decrease significantly we canexpect a low intrinsic dimension of two or three andtherefore visualization true to the structure of the originaldata Next we compute the LLE and Isomap embedding fortwo target dimensions

gt golubIsomap = Isomap (data = golubExprs dims =2 119896 = 5)

gt golubLLE=LLE(data = golubExprs dim= 2 119896 = 5)

The Davis-Bouldin-Index shows that the ALL and AMLpatients are well separated into two clusters

gt DBIndex(data = golubIsomap$dim2 labels =labels)

gt DBIndex(data = golubLLE labels = labels)

Finally we use plotDR to plot the two-dimensional data

gt plotDR(data = golubIsomap$dim2 labels = labelsaxesLabels = c(ldquordquo ldquordquo) legend = TRUE)

gt title (main = ldquoIsomaprdquo)

gt plotDR (data = golubLLE labels = labels axesLabels= c(ldquordquoldquordquo) legend = TRUE)

gt title (main = ldquoLLErdquo)

Both visualizations using either Isomapor LLE showdistinctclusters of ALL and AML patients although the clusteroverlaps less in the Isomap embedding This is consistentwith the DB-Index which is very low for both methods butslightly higher for LLE A three-dimensional visualizationcan be generated in the same manner and is best analyzedinteractively within R

55 Summary of LLE Software Tables 9 and 10 show thesummary and sources of LLE software respectively Table 11shows the related works in discussed software

6 Conclusion

Nowadays numerous software applications have been devel-oped to help users implement feature extraction of geneexpression data In this paper we present a comprehensivereview of software for feature extraction methodsThe meth-ods are PCA ICA PLS and LLEThese software applicationshave some limitations in terms of statistical aspects as well ascomputational performance In conclusion there is a need forthe development of better software

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank Universiti Teknologi Mala-ysia for funding this research by Research University Grants

14 BioMed Research International

(Grant nos J130000262808J80 and J130000250705H50)The authors would also like to thank Research ManagementCentre (RMC) Universiti TeknologiMalaysia for supportingthis research

References

[1] S Van Sanden D Lin and T Burzykowski ldquoPerformanceof gene selection and classification methods in a microarraysetting a simulation studyrdquo Communications in Statistics Sim-ulation and Computation vol 37 no 1-2 pp 409ndash424 2008

[2] Q Liu A H Sung Z Chen et al ldquoGene selection andclassification for cancer microarray data based on machinelearning and similarity measuresrdquo BMC Genomics vol 12supplement 5 article S1 2011

[3] M Gheorghe and V Mitrana ldquoA formal language-basedapproach in biologyrdquo Comparative and Functional Genomicsvol 5 no 1 pp 91ndash94 2004

[4] P G Higgs and T Attwood ldquoBioinformatics and molecularevolutionrdquo Comparative and Functional Genomics vol 6 pp317ndash319 2005

[5] S Le J Josse and F Husson ldquoFactoMineR an R package formultivariate analysisrdquo Journal of Statistical Software vol 25 no1 pp 1ndash18 2008

[6] F Hussen J Josse S Le and J Mazet ldquoPackage lsquoFactoMineRrsquordquo2013 httpcranr-projectorgwebpackagesFactoMineRFac-toMineRpdf

[7] I Hoffmann ldquoPrincipal Component Analysis with Fac-toMineRrdquo 2010 httpwwwstatistiktuwienacatpublicfilzstudentsseminarws1011hoffmann ausarbeitungpdf

[8] D Beaton C R C Fatt and H Abdi Package rsquoExPositionrsquo 2013httpcranr-projectorgwebpackagesExPositionExPositionpdf

[9] A Lucas ldquoPackage lsquoamaprsquordquo 2013 httpcranr-projectorgwebpackagesamapvignettesamappdf

[10] JThioulouse D Chessel S Doledec and J-M Olivier ldquoADE-4a multivariate analysis and graphical display softwarerdquo Journalof Statistics and Computing vol 7 no 1 pp 75ndash83 1997

[11] A C Culhane J Thioulouse G Perriere and D G HigginsldquoMADE4 an R package for multivariate analysis of geneexpression datardquo Bioinformatics vol 21 no 11 pp 2789ndash27902005

[12] I H Witten and E Frank Data Mining Practical MachineLearning Tools and Techniques Elsevier 2nd edition 2005

[13] F W Young and C M Bann ldquoViSta a visual statistics systemrdquoin Statistical Computing Environments for Social Research R AStine and J Fox Eds pp 207ndash235 Sage 1992

[14] D Grapov and J W Newman ldquoimDEV a graphical userinterface to R multivariate analysis tools in Microsoft ExcelrdquoBioinformatics vol 28 no 17 Article ID bts439 pp 2288ndash22902012

[15] The MathWorks Statistics Toolbox for Use with MATLAB UserGuide Version 4 2003 httpwwwpiingvitsimlongoCorsoMat-labOriginalManualsstatspdf

[16] M Hall E Frank G Holmes B Pfahringer P Reutemann andI H Witten ldquoThe WEKA data mining software an updaterdquoACM SIGKDD Explorations Newsletter vol 11 no 1 pp 10ndash182009

[17] NAG Toolbox for Matlab g03aa G03-Multivariate Meth-ods httpwwwnagcomnumericMBmanual 22 1pdfG03g03aapdf

[18] J L Marchini C Heaton and B D Ripley ldquoPackage lsquofastICArsquordquohttpcranr-projectorgwebpackagesfastICAfastICApdf

[19] K Nordhausen J-F Cardoso J Miettinen H Oja EOllila and S Taskinen ldquoPackage lsquoJADErsquordquo httpcranr-pro-jectorgwebpackagesJADEJADEpdf

[20] D Keith C Hoge R Frank and A D Malony HiPerSATTechnical Report 2005 httpnicuoregonedudocsreportsHiPerSATTechReportpdf

[21] A Biton A Zinovyev E Barillot and F Radvanyi ldquoMineICAindependent component analysis of transcriptomic datardquo 2013httpwwwbioconductororgpackages213biocvignettesMineICAinstdocMineICApdf

[22] J Karnanen ldquoIndependent component analysis using scorefunctions from the Pearson systemrdquo 2006 httpcranr-projectorgwebpackagesPearsonICAPearsonICApdf

[23] A Teschenforff Independent Component Analysis Using Maxi-mum Likelihood 2012 httpcranr-projectorgwebpackagesmlica2mlica2pdf

[24] M Barker andW Rayens ldquoPartial least squares for discrimina-tionrdquo Journal of Chemometrics vol 17 no 3 pp 166ndash173 2003

[25] K Joslashrgensen V Segtnan K Thyholt and T Naeligs ldquoA com-parison of methods for analysing regression models with bothspectral and designed variablesrdquo Journal of Chemometrics vol18 no 10 pp 451ndash464 2004

[26] K H Liland and U G Indahl ldquoPowered partial least squaresdiscriminant analysisrdquo Journal of Chemometrics vol 23 no 1pp 7ndash18 2009

[27] N Kramer A Boulesteix and G Tutz ldquoPenalized PartialLeast Squares with applications to B-spline transformationsand functional datardquo Chemometrics and Intelligent LaboratorySystems vol 94 no 1 pp 60ndash69 2008

[28] D Chung and S Keles ldquoSparse partial least squares classifi-cation for high dimensional datardquo Statistical Applications inGenetics and Molecular Biology vol 9 no 1 2010

[29] N Kramer andM Sugiyama ldquoThe degrees of freedom of partialleast squares regressionrdquo Journal of the American StatisticalAssociation vol 106 no 494 pp 697ndash705 2011

[30] S Chakraborty and S Datta ldquoSurrogate variable analysis usingpartial least squares (SVA-PLS) in gene expression studiesrdquoBioinformatics vol 28 no 6 pp 799ndash806 2012

[31] G Sanchez and L Trinchera Tools for Partial Least SquaresPath Modeling 2013 httpcranr-projectorgwebpackagesplspmplspmpdf

[32] F Bertrand N Meyer and M M Bertrand ldquoPartialLeast Squares Regression for generalized linear modelsrdquohttpcranr-projectorgwebpackagesplsRglmplsRglmpdf

[33] M Gutkin R Shamir and G Dror ldquoSlimPLS a method for fea-ture selection in gene expression-based disease classificationrdquoPLoS ONE vol 4 no 7 Article ID e6416 2009

[34] H Diedrich and M Abel ldquoPackage lsquollersquordquo httpcranr-pro-jectorgwebpackagesllellepdf

[35] C Bartenhagen ldquoRDRToolbox a package for nonlinear dimen-sion reduction with Isomap and LLErdquo 2013 httpbioconduc-tororgpackages213biocvignettesRDRToolboxinstdocvi-gnettepdf

[36] F Pedregosa G Varoquaux A Gramfort et al ldquoScikit-learnmachine learning in Pythonrdquo Journal of Machine LearningResearch vol 12 pp 2825ndash2830 2011

[37] S Penel Package ldquoade4rdquo 2013 httpcranr-projectorgwebpackagesade4ade4pdf

BioMed Research International 15

[38] S Dray and A Dufour ldquoThe ade4 package implementing theduality diagram for ecologistsrdquo Journal of Statistical Softwarevol 22 no 4 pp 1ndash20 2007

[39] K Moorthy M S Mohamad S Deris and Z Ibrahim ldquoMul-tivariate analysis of gene expression data and missing valueimputation based on llsimpute algorithmrdquo International Journalof Innovative Computing Information and Control vol 6 no 5pp 1335ndash1339 2012

[40] A Culhane Package lsquomade4rsquo 2013 httpbioconductororgpackagesreleasebiocmanualsmade4manmade4pdf

[41] C V Subbulakshmi S N Deepa andNMalathi ldquoComparativeanalysis of XLMiner and WEKA for pattern classificationrdquo inProceedings of the IEEE International Conference on AdvancedCommunication Control and Computing Technologies (ICAC-CCT rsquo12) pp 453ndash457 Ramanathapuram Tamil Nadu IndiaAugust 2012

[42] S Jothi and S Anita ldquoData mining classification techniquesapplied for cancer diseasemdasha case study using Xlminerrdquo Inter-national Journal of Engineering ResearchampTechnology vol 1 no8 2012

[43] T Anh and S Magi Principal Component Analysis Final Paperin Financial Pricing National Cheng Kung University 2009

[44] L Tierney Lisp-Stat An Object-Oriented Environment forStatistical Computing amp Dynamic Graphics Addison-WesleyReading Mass USA 1990

[45] F W Young and D J Lubinsky ldquoGuiding data analysiswith visual statistical strategiesrdquo Journal of Computational andGraphical Statistics vol 4 pp 229ndash250 1995

[46] FW Young and J B Smith ldquoTowards a structured data analysisenvironment a cognition-based designrdquo in Computing andGraphics in Statistics A Buja and P A Tukey Eds vol 36 pp253ndash279 Springer New York NY USA 1991

[47] FW Young R A Faldowski andMMMcFarlane ldquoMultivari-ate statistical visualizationrdquo inHandbook of Statistics C R RaoEd pp 958ndash998 1993

[48] M McFarlane and F W Young ldquoGraphical sensitivity analysisfor multidimensional scalingrdquo Journal of Computational andGraphical Statistics vol 3 no 1 pp 23ndash34 1994

[49] P M Valero-Mora and R D Ledesma ldquoUsing interactivegraphics to teach multivariate data analysis to psychologystudentsrdquo Journal of Statistics Education vol 19 no 1 2011

[50] E Frank M Hall G Holmes et al ldquoWekamdashamachine learningworkbench for data miningrdquo in Data Mining and KnowledgeDiscoveryHandbook OMaimon and L Rokach Eds pp 1269ndash1277 2010

[51] S S Prabhume and S R Sathe ldquoReconstruction of a completedataset from an incomplete dataset by PCA (principal compo-nent analysis) technique some resultsrdquo International Journal ofComputer Science and Network Security vol 10 no 12 pp 195ndash199 2010

[52] J Khan J S Wei M Ringner et al ldquoClassification anddiagnostic prediction of cancers using gene expression profilingand artificial neural networksrdquoNatureMedicine vol 7 no 6 pp673ndash679 2001

[53] P Comon ldquoIndependent component analysis a new conceptrdquoSignal Processing vol 36 no 3 pp 287ndash314 1994

[54] A Hyvarinen ldquoFast and robust fixed-point algorithms forindependent component analysisrdquo IEEE Transactions on NeuralNetworks vol 10 no 3 pp 626ndash634 1999

[55] V Zarzoso and P Comon ldquoComparative speed analysis ofFastICArdquo in Independent Component Analysis and Signal Sep-aration M E Davies C J James S A Abdallah and M D

Plumbley Eds vol 4666 of Lecture Notes in Computer Sciencepp 293ndash300 Springer Berlin Germany 2007

[56] J F Cardoso and A Souloumiac ldquoBlind beamforming for non-Gaussian signalsrdquo IEE Proceedings Part F Radar and SignalProcessing vol 140 no 6 pp 362ndash370 1993

[57] A Belouchrani K Abed-Meraim J Cardoso and E MoulinesldquoA blind source separation technique using second-order statis-ticsrdquo IEEE Transactions on Signal Processing vol 45 no 2 pp434ndash444 1997

[58] L Tong V C Soon Y F Huang and R Liu ldquoAMUSE anew blind identification algorithmrdquo in Proceedings of the IEEEInternational Symposium onCircuits and Systems pp 1784ndash1787May 1990

[59] S Amari A Cichocki and H H Yang ldquoA new learningalgorithm for blind signal separationrdquo in Proceedings of theAdvances in Neural Information Processing Systems Conferencepp 757ndash763 1996

[60] A Delorme and S Makeig ldquoEEGLAB an open source toolboxfor analysis of single-trial EEG dynamics including indepen-dent component analysisrdquo Journal of NeuroscienceMethods vol134 no 1 pp 9ndash21 2004

[61] A Biton ldquoPackage lsquoMineICArsquordquo 2013 httpwwwbioconductororgpackages213biocmanualsMineICAmanMineICApdf

[62] A Hyvaerinen J Karhunen and E Oja Independent Compo-nent Analysis John Wiley amp Sons New York NY USA 2001

[63] M Schmidt D Bohm C von Torne et al ldquoThe humoralimmune system has a key prognostic impact in node-negativebreast cancerrdquo Cancer Research vol 68 no 13 pp 5405ndash54132008

[64] I S Helland ldquoOn the structure of partial least squares regres-sionrdquo Communications in Statistics Simulation and Computa-tion vol 17 no 2 pp 581ndash607 1988

[65] H Martens and T NaesMultivariate calibration JohnWiley ampSons London UK 1989

[66] N Kramer and A Boulesteix ldquoPackage ldquopplsrdquordquo 2013 httpcranrprojectorgwebpackagespplspplspdf

[67] HWold ldquoSoftmodeling the basic design and some extensionsrdquoin Systems under Indirect Observations Causality StructurePrediction K G Joreskog and H Wold Eds Part 2 pp 1ndash54North-Holland Amsterdam The Netherlands 1982

[68] H Akaike ldquoLikelihood and the Bayes procedurerdquo Trabajos deEstadıstica y de Investigacion Operativa vol 31 no 1 pp 143ndash166 1980

[69] T R Golub D K Slonim P Tamayo et al ldquoMolecularclassification of cancer class discovery and class prediction bygene expressionmonitoringrdquo Science vol 286 no 5439 pp 531ndash537 1999

[70] S T Roweis and L K Saul ldquoNonlinear dimensionality reduc-tion linear embeddingrdquo Science vol 290 no 5500 pp 2323ndash2326 2000

[71] D Ridder and R P W Duin Locally Linear Embedding Uni-versity of Technology Delft The Netherlands 2002

Page 14: A Review of Feature Extraction Software for Microarray ... › a58e › 818fcd408e... · A Review of Feature Extraction Software for Microarray Gene Expression Data ... When gene

14 BioMed Research International

(Grant nos J130000262808J80 and J130000250705H50)The authors would also like to thank Research ManagementCentre (RMC) Universiti TeknologiMalaysia for supportingthis research

References

[1] S. Van Sanden, D. Lin, and T. Burzykowski, "Performance of gene selection and classification methods in a microarray setting: a simulation study," Communications in Statistics: Simulation and Computation, vol. 37, no. 1-2, pp. 409–424, 2008.

[2] Q. Liu, A. H. Sung, Z. Chen et al., "Gene selection and classification for cancer microarray data based on machine learning and similarity measures," BMC Genomics, vol. 12, supplement 5, article S1, 2011.

[3] M. Gheorghe and V. Mitrana, "A formal language-based approach in biology," Comparative and Functional Genomics, vol. 5, no. 1, pp. 91–94, 2004.

[4] P. G. Higgs and T. Attwood, "Bioinformatics and molecular evolution," Comparative and Functional Genomics, vol. 6, pp. 317–319, 2005.

[5] S. Lê, J. Josse, and F. Husson, "FactoMineR: an R package for multivariate analysis," Journal of Statistical Software, vol. 25, no. 1, pp. 1–18, 2008.

[6] F. Husson, J. Josse, S. Lê, and J. Mazet, "Package 'FactoMineR'," 2013, http://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf.

[7] I. Hoffmann, "Principal Component Analysis with FactoMineR," 2010, http://www.statistik.tuwien.ac.at/public/filz/students/seminar/ws1011/hoffmann_ausarbeitung.pdf.

[8] D. Beaton, C. R. C. Fatt, and H. Abdi, "Package 'ExPosition'," 2013, http://cran.r-project.org/web/packages/ExPosition/ExPosition.pdf.

[9] A. Lucas, "Package 'amap'," 2013, http://cran.r-project.org/web/packages/amap/vignettes/amap.pdf.

[10] J. Thioulouse, D. Chessel, S. Dolédec, and J.-M. Olivier, "ADE-4: a multivariate analysis and graphical display software," Statistics and Computing, vol. 7, no. 1, pp. 75–83, 1997.

[11] A. C. Culhane, J. Thioulouse, G. Perrière, and D. G. Higgins, "MADE4: an R package for multivariate analysis of gene expression data," Bioinformatics, vol. 21, no. 11, pp. 2789–2790, 2005.

[12] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier, 2nd edition, 2005.

[13] F. W. Young and C. M. Bann, "ViSta: a visual statistics system," in Statistical Computing Environments for Social Research, R. A. Stine and J. Fox, Eds., pp. 207–235, Sage, 1992.

[14] D. Grapov and J. W. Newman, "imDEV: a graphical user interface to R multivariate analysis tools in Microsoft Excel," Bioinformatics, vol. 28, no. 17, Article ID bts439, pp. 2288–2290, 2012.

[15] The MathWorks, Statistics Toolbox for Use with MATLAB: User's Guide, Version 4, 2003, http://www.pi.ingv.it/~longo/CorsoMatlab/OriginalManuals/stats.pdf.

[16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.

[17] NAG Toolbox for MATLAB, "g03aa," G03: Multivariate Methods, http://www.nag.com/numeric/MB/manual_22_1/pdf/G03/g03aa.pdf.

[18] J. L. Marchini, C. Heaton, and B. D. Ripley, "Package 'fastICA'," http://cran.r-project.org/web/packages/fastICA/fastICA.pdf.

[19] K. Nordhausen, J.-F. Cardoso, J. Miettinen, H. Oja, E. Ollila, and S. Taskinen, "Package 'JADE'," http://cran.r-project.org/web/packages/JADE/JADE.pdf.

[20] D. Keith, C. Hoge, R. Frank, and A. D. Malony, HiPerSAT Technical Report, 2005, http://nic.uoregon.edu/docs/reports/HiPerSATTechReport.pdf.

[21] A. Biton, A. Zinovyev, E. Barillot, and F. Radvanyi, "MineICA: independent component analysis of transcriptomic data," 2013, http://www.bioconductor.org/packages/2.13/bioc/vignettes/MineICA/inst/doc/MineICA.pdf.

[22] J. Karvanen, "Independent component analysis using score functions from the Pearson system," 2006, http://cran.r-project.org/web/packages/PearsonICA/PearsonICA.pdf.

[23] A. Teschendorff, Independent Component Analysis Using Maximum Likelihood, 2012, http://cran.r-project.org/web/packages/mlica2/mlica2.pdf.

[24] M. Barker and W. Rayens, "Partial least squares for discrimination," Journal of Chemometrics, vol. 17, no. 3, pp. 166–173, 2003.

[25] K. Jørgensen, V. Segtnan, K. Thyholt, and T. Næs, "A comparison of methods for analysing regression models with both spectral and designed variables," Journal of Chemometrics, vol. 18, no. 10, pp. 451–464, 2004.

[26] K. H. Liland and U. G. Indahl, "Powered partial least squares discriminant analysis," Journal of Chemometrics, vol. 23, no. 1, pp. 7–18, 2009.

[27] N. Krämer, A. Boulesteix, and G. Tutz, "Penalized Partial Least Squares with applications to B-spline transformations and functional data," Chemometrics and Intelligent Laboratory Systems, vol. 94, no. 1, pp. 60–69, 2008.

[28] D. Chung and S. Keleş, "Sparse partial least squares classification for high dimensional data," Statistical Applications in Genetics and Molecular Biology, vol. 9, no. 1, 2010.

[29] N. Krämer and M. Sugiyama, "The degrees of freedom of partial least squares regression," Journal of the American Statistical Association, vol. 106, no. 494, pp. 697–705, 2011.

[30] S. Chakraborty and S. Datta, "Surrogate variable analysis using partial least squares (SVA-PLS) in gene expression studies," Bioinformatics, vol. 28, no. 6, pp. 799–806, 2012.

[31] G. Sanchez and L. Trinchera, Tools for Partial Least Squares Path Modeling, 2013, http://cran.r-project.org/web/packages/plspm/plspm.pdf.

[32] F. Bertrand, N. Meyer, and M. M. Bertrand, "Partial Least Squares Regression for generalized linear models," http://cran.r-project.org/web/packages/plsRglm/plsRglm.pdf.

[33] M. Gutkin, R. Shamir, and G. Dror, "SlimPLS: a method for feature selection in gene expression-based disease classification," PLoS ONE, vol. 4, no. 7, Article ID e6416, 2009.

[34] H. Diedrich and M. Abel, "Package 'lle'," http://cran.r-project.org/web/packages/lle/lle.pdf.

[35] C. Bartenhagen, "RDRToolbox: a package for nonlinear dimension reduction with Isomap and LLE," 2013, http://bioconductor.org/packages/2.13/bioc/vignettes/RDRToolbox/inst/doc/vignette.pdf.

[36] F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[37] S. Penel, "Package 'ade4'," 2013, http://cran.r-project.org/web/packages/ade4/ade4.pdf.


[38] S. Dray and A. Dufour, "The ade4 package: implementing the duality diagram for ecologists," Journal of Statistical Software, vol. 22, no. 4, pp. 1–20, 2007.

[39] K. Moorthy, M. S. Mohamad, S. Deris, and Z. Ibrahim, "Multivariate analysis of gene expression data and missing value imputation based on llsimpute algorithm," International Journal of Innovative Computing, Information and Control, vol. 6, no. 5, pp. 1335–1339, 2012.

[40] A. Culhane, "Package 'made4'," 2013, http://bioconductor.org/packages/release/bioc/manuals/made4/man/made4.pdf.

[41] C. V. Subbulakshmi, S. N. Deepa, and N. Malathi, "Comparative analysis of XLMiner and WEKA for pattern classification," in Proceedings of the IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT '12), pp. 453–457, Ramanathapuram, Tamil Nadu, India, August 2012.

[42] S. Jothi and S. Anita, "Data mining classification techniques applied for cancer disease—a case study using Xlminer," International Journal of Engineering Research & Technology, vol. 1, no. 8, 2012.

[43] T. Anh and S. Magi, Principal Component Analysis, Final Paper in Financial Pricing, National Cheng Kung University, 2009.

[44] L. Tierney, Lisp-Stat: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics, Addison-Wesley, Reading, Mass, USA, 1990.

[45] F. W. Young and D. J. Lubinsky, "Guiding data analysis with visual statistical strategies," Journal of Computational and Graphical Statistics, vol. 4, pp. 229–250, 1995.

[46] F. W. Young and J. B. Smith, "Towards a structured data analysis environment: a cognition-based design," in Computing and Graphics in Statistics, A. Buja and P. A. Tukey, Eds., vol. 36, pp. 253–279, Springer, New York, NY, USA, 1991.

[47] F. W. Young, R. A. Faldowski, and M. M. McFarlane, "Multivariate statistical visualization," in Handbook of Statistics, C. R. Rao, Ed., pp. 958–998, 1993.

[48] M. McFarlane and F. W. Young, "Graphical sensitivity analysis for multidimensional scaling," Journal of Computational and Graphical Statistics, vol. 3, no. 1, pp. 23–34, 1994.

[49] P. M. Valero-Mora and R. D. Ledesma, "Using interactive graphics to teach multivariate data analysis to psychology students," Journal of Statistics Education, vol. 19, no. 1, 2011.

[50] E. Frank, M. Hall, G. Holmes et al., "Weka—a machine learning workbench for data mining," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds., pp. 1269–1277, 2010.

[51] S. S. Prabhume and S. R. Sathe, "Reconstruction of a complete dataset from an incomplete dataset by PCA (principal component analysis) technique: some results," International Journal of Computer Science and Network Security, vol. 10, no. 12, pp. 195–199, 2010.

[52] J. Khan, J. S. Wei, M. Ringnér et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.

[53] P. Comon, "Independent component analysis, a new concept?" Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.

[54] A. Hyvärinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.

[55] V. Zarzoso and P. Comon, "Comparative speed analysis of FastICA," in Independent Component Analysis and Signal Separation, M. E. Davies, C. J. James, S. A. Abdallah, and M. D. Plumbley, Eds., vol. 4666 of Lecture Notes in Computer Science, pp. 293–300, Springer, Berlin, Germany, 2007.

[56] J. F. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," IEE Proceedings F: Radar and Signal Processing, vol. 140, no. 6, pp. 362–370, 1993.

[57] A. Belouchrani, K. Abed-Meraim, J. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434–444, 1997.

[58] L. Tong, V. C. Soon, Y. F. Huang, and R. Liu, "AMUSE: a new blind identification algorithm," in Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1784–1787, May 1990.

[59] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Proceedings of the Advances in Neural Information Processing Systems Conference, pp. 757–763, 1996.

[60] A. Delorme and S. Makeig, "EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.

[61] A. Biton, "Package 'MineICA'," 2013, http://www.bioconductor.org/packages/2.13/bioc/manuals/MineICA/man/MineICA.pdf.

[62] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.

[63] M. Schmidt, D. Böhm, C. von Törne et al., "The humoral immune system has a key prognostic impact in node-negative breast cancer," Cancer Research, vol. 68, no. 13, pp. 5405–5413, 2008.

[64] I. S. Helland, "On the structure of partial least squares regression," Communications in Statistics: Simulation and Computation, vol. 17, no. 2, pp. 581–607, 1988.

[65] H. Martens and T. Næs, Multivariate Calibration, John Wiley & Sons, London, UK, 1989.

[66] N. Krämer and A. Boulesteix, "Package 'ppls'," 2013, http://cran.r-project.org/web/packages/ppls/ppls.pdf.

[67] H. Wold, "Soft modeling: the basic design and some extensions," in Systems under Indirect Observation: Causality, Structure, Prediction, K. G. Jöreskog and H. Wold, Eds., Part 2, pp. 1–54, North-Holland, Amsterdam, The Netherlands, 1982.

[68] H. Akaike, "Likelihood and the Bayes procedure," Trabajos de Estadística y de Investigación Operativa, vol. 31, no. 1, pp. 143–166, 1980.

[69] T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.

[70] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.

[71] D. de Ridder and R. P. W. Duin, Locally Linear Embedding, Delft University of Technology, Delft, The Netherlands, 2002.
