S. L. Mestizo-Gutiérrez, J. A. Jácome-Delgado, N. Cruz-Ramírez, A. Guerra-Hernández, J. A. Torres-Sosa, V. Y. Rosales-Morales, G. E. Aranda-Abreu. A Study of Gene Expression Levels of Parkinson’s Disease Using Machine Learning. BioMedInformatics, 2025, 5, 60, October 2025. DOI 10.3390/biomedinformatics5040060 | MDPI.
Abstract. Parkinson’s disease (PD) is the second most common neurodegenerative disorder, characterized primarily by motor impairments due to the loss of dopaminergic neurons. Despite extensive research, the precise causes of PD remain unknown, and reliable non-invasive biomarkers are still lacking. This study aimed to explore gene expression profiles in peripheral blood to identify potential biomarkers for PD using machine learning approaches. We analyzed microarray-based gene expression data from 105 individuals (50 PD patients, 33 with other neurodegenerative diseases, and 22 healthy controls) obtained from the GEO database (GSE6613). Preprocessing was performed using the “affy” package in R with Expresso normalization. Feature selection and classification were conducted using a decision tree approach (C4.5/J48 algorithm in WEKA), and model performance was evaluated with 10-fold cross-validation. Three additional classifiers, a Support Vector Machine (SVM), a Naive Bayes classifier, and a Multilayer Perceptron (MLP) neural network, were used for comparison. ROC curve analysis and Gene Ontology (GO) enrichment analysis were applied to the selected genes. A nine-gene decision tree model (TMEM104, TRIM33, GJB3, SPON2, SNAP25, TRAK2, SHPK, PIEZO1, RPL37) achieved 86.71% accuracy, 88% sensitivity, and 87% specificity. The model significantly outperformed the other classifiers (SVM, Naive Bayes, MLP) in terms of overall predictive accuracy. ROC analysis showed moderate discrimination for some genes (e.g., TRAK2, TRIM33, PIEZO1), and GO enrichment revealed associations with synaptic processes, inflammation, mitochondrial transport, and stress response pathways. Our decision tree model based on blood gene expression profiles effectively discriminates between PD, other neurodegenerative conditions, and healthy controls, offering a non-invasive method for potential early diagnosis. Notably, TMEM104, TRIM33, and SNAP25 emerged as promising candidate biomarkers, warranting further investigation in larger and synthetic datasets to validate their clinical relevance.
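The evaluation protocol named in the abstract (10-fold cross-validation, then accuracy, sensitivity, and specificity) can be illustrated with a minimal Python sketch. This is not the authors' WEKA pipeline. The function names `stratified_folds` and `one_vs_rest_metrics`, the class codes `PD`, `ND`, and `HC`, and the toy predictions are all hypothetical. Stratifying the folds and computing sensitivity and specificity one-vs-rest for the PD class are assumptions, since the abstract does not say how the three-class metrics were derived. Only the class sizes (50, 33, 22) come from the text.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class
        # stopped, so that total fold sizes differ by at most one.
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)
    return folds


def one_vs_rest_metrics(y_true, y_pred, positive):
    """Overall accuracy plus sensitivity/specificity for one class vs the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Class sizes from the abstract; the label codes themselves are hypothetical.
cohort = ["PD"] * 50 + ["ND"] * 33 + ["HC"] * 22
folds = stratified_folds(cohort, k=10)

# Toy predictions for illustration only (not results from the study).
toy_true = ["PD", "PD", "PD", "PD", "ND", "ND", "HC", "HC"]
toy_pred = ["PD", "PD", "PD", "ND", "ND", "PD", "HC", "HC"]
toy_metrics = one_vs_rest_metrics(toy_true, toy_pred, positive="PD")
```

In a full run, each fold would serve once as the held-out test set while a classifier is trained on the other nine, and the predictions pooled across folds would be passed to `one_vs_rest_metrics`.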