Introduction
Alzheimer’s disease (AD) is a prevalent neurodegenerative disorder characterized by memory loss and cognitive decline. AD currently ranks as the fifth leading cause of death among those 65 and older, and is expected to affect 13.8 million Americans by 2050. Further, the economic burden of AD care had reached $339.5 billion by 2022. Early detection of AD considerably affects biological, demographic, psychosocial, and economic outcomes for individuals and families, making it essential to investigate these determinants of health.
In this project, we analyze a large National Institutes of Health dataset collected by the National Alzheimer's Coordinating Center (NACC). The NACC database comprises data from more than 47,000 participants, encompassing over 174,000 clinical assessments, and includes the Uniform Data Set (UDS) and the Biomarker Data Set (BDS). We analyzed cross-sectional measures from patients' initial visits, including health history, demographics, and cerebrospinal fluid biomarkers (Aβ1-42 and T-Tau), all of which are indicative of AD diagnosis.
Methods
We combine inferential and predictive tools to address the two goals of our study: identifying the factors associated with an Alzheimer’s diagnosis and predicting that diagnosis. First, we construct bootstrap confidence intervals for the coefficients of a LASSO logistic regression on the binary response variable “diagnosis”. Its two labels indicate either a diagnosis of at least a mild cognitive progression of the disease or a negative assessment of the disease. The LASSO model allows us to perform variable selection and handle potential multicollinearity among predictors. We then complement this with a Random Forest model to further assess the importance of these variables. Finally, to build a predictive model for Alzheimer’s diagnosis, we implement a one-layer Neural Network.
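As a rough illustration of the Random Forest importance step described above, the sketch below ranks predictors on synthetic data. It assumes scikit-learn, which is not necessarily the software used in the project, and the data and feature layout are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: only feature 0 drives the outcome (assumption for illustration).
rng = np.random.default_rng(0)
X_rf = rng.normal(size=(500, 4))
y_rf = rng.binomial(1, 1 / (1 + np.exp(-2.0 * X_rf[:, 0])))

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_rf, y_rf)

# Impurity-based importances; they are non-negative and sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
print(ranking, importances.round(3))
```

Impurity-based importances can favor continuous or high-cardinality predictors, so they are best read alongside the LASSO results rather than on their own.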
Project Workflow
1. Data Cleaning and Preparation
Imported biomarker and UDS data from the NACCdata package.
Standardized and merged datasets to include first valid visits for each patient.
Addressed missing data and encoded categorical variables:
Recoded unknown and non-collected values as NA.
Converted variables to binary or factors as appropriate.
Final dataset: df with cleaned and structured variables.
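The cleaning steps above can be sketched in pandas as follows. This is only an illustration. The column names (`patient_id`, `visit_num`, `sex`, `educ`, `csf_ttau`), the missing-value codes, and the sex coding are hypothetical stand-ins rather than the actual NACC variable names or codes. "First valid visit" is approximated here by the earliest visit number.

```python
import numpy as np
import pandas as pd

# Hypothetical per-column codes meaning "unknown" or "not collected".
MISSING_CODES = {"sex": [9], "educ": [99], "csf_ttau": [-4]}


def clean_first_visits(uds: pd.DataFrame, bio: pd.DataFrame) -> pd.DataFrame:
    """Keep each patient's earliest visit, merge biomarkers, recode missing codes."""
    first = (
        uds.sort_values(["patient_id", "visit_num"])
        .drop_duplicates("patient_id", keep="first")
    )
    df = first.merge(bio, on="patient_id", how="inner")
    for col, codes in MISSING_CODES.items():
        # mask() replaces entries that match a missing code with NaN.
        df[col] = df[col].mask(df[col].isin(codes))
    # Binary indicator (assumed coding: 1 = male, 2 = female); missing stays NaN.
    df["female"] = np.where(df["sex"].isna(), np.nan, (df["sex"] == 2).astype(float))
    return df.reset_index(drop=True)


# Toy inputs standing in for the UDS and biomarker tables.
uds = pd.DataFrame({
    "patient_id": ["A", "A", "B", "C"],
    "visit_num": [2, 1, 1, 1],
    "sex": [1, 1, 2, 9],
    "educ": [16, 16, 99, 12],
})
bio = pd.DataFrame({"patient_id": ["A", "B", "C"], "csf_ttau": [350.0, -4.0, 410.0]})

cleaned = clean_first_visits(uds, bio)
print(cleaned)
```

Keeping the missing codes per column avoids erasing legitimate values, for example a real value of 9 in a column where 9 is not a missing code.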
2. Linearity Assumption Validation
Checked linearity of log-odds for continuous variables using partial residual plots.
Generated plots for all predictors to confirm that the relationships were approximately linear.
Saved plots in a PDF: partial_residuals_plots.pdf.
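One common way to compute the partial residuals behind such plots is to add each predictor's fitted contribution to the working residuals of the logistic model. The sketch below does this with scikit-learn on synthetic data. It is an assumption about how the check could be done, not the project's own code, and it leaves out the plotting and PDF export.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def partial_residuals(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return an (n_samples, n_features) array of partial residuals on the logit scale."""
    # A very large C makes the default L2 penalty negligible (near-unpenalized fit).
    model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
    p = np.clip(model.predict_proba(X)[:, 1], 1e-6, 1 - 1e-6)
    working = (y - p) / (p * (1 - p))  # working residuals of the logistic GLM
    # Partial residual for feature j: working residual + beta_j * x_j.
    return working[:, None] + X * model.coef_[0]


# Synthetic data whose log-odds are linear in both predictors by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_logit = 0.8 * X[:, 0] - 0.5 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

pres = partial_residuals(X, y)
print(pres.shape)
```

Plotting each column of `pres` against the matching column of `X` with a smoother overlaid gives the diagnostic. A smoother that tracks a straight line supports the linearity assumption for that predictor.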
3. Logistic Regression with LASSO
Built a LASSO logistic regression model to handle multicollinearity and select important predictors.
Tuned the penalty parameter (lambda) with 10-fold cross-validation using a grid search.
In this project, I worked with a team of undergraduates and professors to determine which factors affect the diagnosis of Alzheimer's disease. Although the team used many methods, my task was to determine which factors were significant and to build a neural network that predicts Alzheimer's diagnosis from these predictors. The full project, which was published in our school journal (page 150), is linked below.
Exploring Factors Affecting the Diagnosis of Alzheimer’s Disease: A Machine Learning Approach


Identified the optimal model based on ROC-AUC.
Bootstrapped the LASSO model:
Created 1,000 bootstrap samples.
Extracted bootstrapped coefficient estimates for confidence interval construction.
Visualized coefficient distributions and calculated normal-theory confidence intervals.
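The tuning and bootstrap procedure in step 3 can be sketched as follows, assuming scikit-learn and synthetic data in place of the project's actual tooling and dataset. Note that scikit-learn parameterizes the penalty as `C`, the inverse of lambda. The grid, the 200 resamples (the project used 1,000), and the choice to center the intervals on the full-sample estimate are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Synthetic stand-in data: two informative predictors, three noise predictors.
rng = np.random.default_rng(42)
n, k = 400, 5
X = rng.normal(size=(n, k))
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Grid search over the penalty with 10-fold CV, scored by ROC-AUC.
Cs = np.logspace(-2, 2, 10)  # C = 1 / lambda
cv_model = LogisticRegressionCV(
    Cs=Cs, cv=10, penalty="l1", solver="liblinear", scoring="roc_auc"
).fit(X, y)
best_C = float(np.ravel(cv_model.C_)[0])


def fit_lasso(Xb: np.ndarray, yb: np.ndarray) -> np.ndarray:
    """Fit an L1-penalized logistic regression at the selected penalty."""
    model = LogisticRegression(penalty="l1", C=best_C, solver="liblinear")
    return model.fit(Xb, yb).coef_[0]


point = fit_lasso(X, y)  # full-sample coefficient estimates

# Bootstrap: refit on rows resampled with replacement.
n_boot = 200
boot = np.empty((n_boot, k))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    boot[b] = fit_lasso(X[idx], y[idx])

# Normal-theory 95% interval: estimate +/- 1.96 * bootstrap standard error.
se = boot.std(axis=0, ddof=1)
ci = np.column_stack([point - 1.96 * se, point + 1.96 * se])
print(best_C)
print(ci.round(3))
```

Because the LASSO shrinks coefficients toward zero, these intervals describe the variability of the penalized estimates and should be interpreted with that bias in mind.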


4. Neural Network Modeling
Used H2O to train deep learning models with a hyperparameter grid search:
Explored various architectures (hidden layers, learning rates, regularization, dropout).
Optimized using a random discrete search strategy.
Evaluated models on a validation set:
Selected the best-performing model based on accuracy and F1-score.
Optimized the classification threshold for best accuracy.
Achieved a final accuracy of 83%.
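The threshold-selection step above can be illustrated independently of H2O. The sketch below scans a grid of cutoffs over validation-set probabilities and keeps the one that maximizes a chosen metric. The labels, probabilities, and grid are made up for the example, and the helper names are hypothetical.

```python
import numpy as np


def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(y_true == y_pred))


def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


def best_threshold(y_true, probs, metric, grid=None):
    """Return (threshold, score) maximizing `metric`; ties go to the lowest threshold."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    scores = [metric(y_true, (probs >= t).astype(int)) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])


# Made-up validation labels and predicted probabilities.
y_val = np.array([0, 0, 1, 1, 1, 0])
p_val = np.array([0.12, 0.41, 0.33, 0.81, 0.72, 0.58])

thr_acc, acc = best_threshold(y_val, p_val, accuracy)
thr_f1, f1 = best_threshold(y_val, p_val, f1_score)
print(thr_acc, acc, thr_f1, f1)
```

The threshold should be chosen on validation data only, so that the accuracy reported on held-out data is not inflated by the tuning.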
Key Findings
Predictive Biomarkers: The biomarkers CSFTTAU and CSFABETA, together with demographic factors (age, education, and sex), contributed significantly to Alzheimer's disease predictions.
Model Interpretability: Bootstrapped LASSO coefficients yielded confidence intervals that highlight the key predictors.
Performance: Neural network models demonstrated strong classification performance, with classification thresholds chosen to optimize F1-score.