Feature selection from biological bigdata: identification of significant associations applying multivariate machine learning algorithms to genome-wide association studies (GWAS)
Abstract
Crohn's Disease (CD) is a type of Inflammatory Bowel Disease (IBD) that affects the gastrointestinal tract and presents with diverse symptoms. To date, Genome-Wide Association Studies (GWAS) have discovered over 140 genetic loci associated with CD. The usual univariate GWAS methods have allowed the discovery of minor effects from common variants, but they assume independence among variants, which can lead to missing subtle combinatorial signals. Given the importance of CD, multivariate approaches can help elucidate the etiology of the disease and facilitate the identification of novel associations. However, current univariate-based and multivariate CD models span a broad range of performance and have been assessed in different datasets under diverse methodological settings. Here, multivariate methods and models (LASSO, XGBoost, Random Forest, BSWiMS, and LDpred) were compared under a strict sub-sampling and cross-validation approach to predict CD risk in a GWAS dataset (de Lange et al. 2017). The predictions were explored and compared to assess whether the generated models could provide additional information about variants and genes associated with CD. Additionally, the effect of common strategies that increase or decrease the number of SNP markers (genotype imputation and LD-clumping, respectively) was assessed. The LDpred model without imputation appears to be the best of the tested models at predicting Crohn's disease risk in this dataset (AUROC = 0.667 ± 0.024). The best models were validated in a second dataset (NIDDK IBD Genetics), where LDpred was also the best method, with similar performance (AUROC = 0.634 ± 0.009). Finally, based on the importance of the variants yielded by the multivariate models, a previously unnoticed region was identified on chromosome 6 at SNP rs4945943, close to the gene MARCKS, which appears to contribute to CD risk.
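As a rough illustration of how a figure such as "AUROC = 0.667 ± 0.024" can arise from a cross-validation scheme, the following minimal sketch computes a per-fold AUROC and summarizes it as mean and standard deviation. It is not the pipeline used in the study: the genotypes are simulated, the model is a toy additive allele-count score standing in for methods such as LASSO or LDpred, and all function names (`auroc`, `stratified_folds`, `fit_additive_score`, `cross_validated_auroc`, `simulate`) are hypothetical.

```python
import random
import statistics


def auroc(scores, labels):
    """AUROC as the probability that a random case outscores a random control.

    Ties count as 0.5 (equivalent to the Mann-Whitney U statistic, normalized).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds, preserving the case/control ratio."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def fit_additive_score(X, y, train):
    """Toy stand-in model (an assumption, not a method from the study).

    Each SNP weight is the mean allele count in cases minus that in controls,
    estimated on the training indices only.
    """
    cases = [i for i in train if y[i] == 1]
    ctrls = [i for i in train if y[i] == 0]
    return [
        sum(X[i][j] for i in cases) / len(cases)
        - sum(X[i][j] for i in ctrls) / len(ctrls)
        for j in range(len(X[0]))
    ]


def cross_validated_auroc(X, y, k=5, seed=0):
    """Return (mean AUROC, standard deviation, per-fold AUROCs) over k folds."""
    rng = random.Random(seed)
    folds = stratified_folds(y, k, rng)
    aucs = []
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        weights = fit_additive_score(X, y, train)
        scores = [sum(w * x for w, x in zip(weights, X[i])) for i in test]
        aucs.append(auroc(scores, [y[i] for i in test]))
    return statistics.mean(aucs), statistics.stdev(aucs), aucs


def simulate(n=200, n_snps=20, seed=1):
    """Simulated genotypes (0/1/2 allele counts); the first 3 SNPs are enriched in cases."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = []
        for j in range(n_snps):
            freq = 0.3 + (0.15 if label == 1 and j < 3 else 0.0)
            row.append(sum(rng.random() < freq for _ in range(2)))
        X.append(row)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = simulate()
    mean_auc, sd_auc, _ = cross_validated_auroc(X, y, k=5)
    print(f"AUROC = {mean_auc:.3f} ± {sd_auc:.3f}")
```

Fitting the weights on the training folds only, and scoring the held-out fold, is what keeps the per-fold AUROC an estimate of out-of-sample performance; repeating the procedure on sub-samples would add a further layer of variability to the reported spread.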