Current Bioinformatics

ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

Identifying Diagnostic Biomarkers of Breast Cancer Based on Gene Expression Data and Ensemble Feature Selection

Author(s): Lingyu Li, Yousif A. Algabri and Zhi-Ping Liu*

Volume 18, Issue 3, 2023

Published on: 01 March, 2023

Page: [232 - 246] Pages: 15

DOI: 10.2174/1574893618666230111153243

Abstract

Background: In recent years, the identification of biomarkers or signatures from gene expression profiling data has attracted much attention in bioinformatics. The successful discovery of breast cancer (BRCA) biomarkers would benefit patients by enabling early detection and reducing the risk posed by the disease.

Methods: This paper proposes an Ensemble Feature Selection method to screen biomarkers (abbreviated as EFSmarker) for BRCA from publicly available gene expression data. Firstly, we employ twelve filter feature selection methods, namely median, variance, Chi-square, Relief, Pearson correlation, Spearman correlation, mutual information, the minimal-redundancy-maximal-relevance (mRMR) criterion, ridge regression, decision tree, and random forest with the Gini index and with the accuracy index, to calculate the importance (weights or coefficients) of all features on the training dataset. Secondly, we apply a logistic regression classifier to the test dataset to calculate the classification AUC (area under the ROC curve) of the feature subset selected by each of the twelve methods individually. Thirdly, we build the ensemble by aggregating feature importance with the classification AUC values. In particular, we establish a feature importance score (FIS) that evaluates the importance of each feature across all feature selection methods. Finally, the features with the highest FIS are taken as the identified biomarkers.
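The aggregation step can be illustrated with a minimal sketch. The abstract does not give the FIS formula, so the version below is an assumption: each method's absolute importances are min-max scaled and then averaged with weights proportional to that method's AUC. The function names, gene labels, and all numbers are hypothetical and are not taken from the EFSmarker source code.

```python
def min_max_scale(values):
    """Rescale a list of non-negative numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All features tie under this method, so it carries no ranking signal.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def feature_importance_score(importances, aucs):
    """AUC-weighted mean of scaled importances (assumed form, illustrative only).

    importances: dict mapping method name -> list of importances, one per feature
    aucs:        dict mapping method name -> AUC of the subset chosen by that method
    """
    total_auc = sum(aucs[method] for method in importances)
    n_features = len(next(iter(importances.values())))
    fis = [0.0] * n_features
    for method, raw in importances.items():
        # Coefficients (e.g. from ridge regression) may be negative; use magnitude.
        scaled = min_max_scale([abs(v) for v in raw])
        for j, s in enumerate(scaled):
            fis[j] += aucs[method] * s / total_auc
    return fis


# Toy input with made-up values for two of the twelve methods.
genes = ["gene_a", "gene_b", "gene_c"]
importances = {
    "variance": [4.0, 2.0, 0.0],
    "ridge": [0.2, -1.0, 0.0],
}
aucs = {"variance": 0.9, "ridge": 0.6}

fis = feature_importance_score(importances, aucs)
ranked = sorted(zip(genes, fis), key=lambda pair: pair[1], reverse=True)
for gene, score in ranked:
    print(f"{gene}: {score:.3f}")
```

In this toy input the two methods disagree on the top feature, and the weighted score settles the ranking by combining how strongly each method favours a feature with how well that method's subset classified the test data.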

Results: Guided by the FIS index produced by the EFSmarker method, 12 genes (COL10A1, COL11A1, MMP11, LOC728264, FIGF, GJB2, INHBA, CD300LG, IGFBP6, PAMR1, CXCL2 and FXYD1) are identified as diagnostic biomarkers for BRCA. In particular, COL10A1, ranked first with an FIS value of 0.663, is identified as the most credible biomarker. Gene and protein expression validation, functional enrichment analysis, literature review and validation on independent datasets support the effectiveness and efficiency of the selected biomarkers.

Conclusion: Our proposed biomarker discovery strategy uses the contribution of each feature and, at the same time, takes prediction accuracy into account. It may also serve as a model for identifying unknown biomarkers for other diseases from high-throughput gene expression data. The source code and data are available at https://github.com/zpliulab/EFSmarker.

Keywords: Biomarker, machine learning, ensemble feature selection, gene expression data, breast cancer, early detection.
