Introduction

Neurodegenerative diseases — Alzheimer's disease (AD), Parkinson's disease (PD), and amyotrophic lateral sclerosis (ALS) — collectively affect over 50 million people worldwide and share overlapping molecular pathologies including neuroinflammation, mitochondrial dysfunction, protein aggregation, and synaptic loss (Dugger & Bhatt, 2018). Despite these commonalities, computational models for disease-state classification are typically trained in disease-specific silos, ignoring the potential for shared transcriptomic representations to transfer across conditions.

The emergence of single-cell RNA sequencing (scRNA-seq) foundation models — large transformer architectures pretrained on tens of millions of transcriptomes — has created new opportunities for transfer learning in genomics. Geneformer (Theodoris et al., 2023), pretrained on approximately 30 million single-cell transcriptomes from the Genecorpus-30M dataset, learns context-aware gene representations through a rank-value encoding scheme that captures relative expression patterns. These pretrained representations have shown strong performance when fine-tuned for downstream tasks including cell-type classification, gene dosage sensitivity prediction, and chromatin dynamics modeling.

However, a critical question remains unexplored: can a foundation model fine-tuned on one neurodegenerative disease transfer its learned disease-state representations to a different neurodegenerative disease? If neurodegenerative diseases share core transcriptomic programs at the single-cell level, then a model trained to distinguish diseased from healthy cells in AD should capture features relevant to PD and ALS as well.

In this study, we systematically evaluate cross-disease transfer learning with Geneformer V2 (104M parameters) across three major neurodegenerative diseases. We fine-tune on AD single-nucleus RNA-seq data from the CellxGene Census, then evaluate transfer to PD and ALS under zero-shot, few-shot, and full fine-tuning conditions. We compare against baseline classifiers (logistic regression, XGBoost, MLP) operating on the same Geneformer embeddings, and perform attention analysis to identify shared gene programs across diseases.

Methods

Data Acquisition

We obtained single-nucleus RNA-seq data from the CellxGene Census (stable release 2025-11-08), selecting human brain tissue samples with is_primary_data == True. For each disease, we sampled a balanced dataset of 10,000 disease cells and 10,000 matched normal controls:

Alzheimer's disease: 20,000 cells (from 608,235 available disease cells), sourced from brain tissue. Dominant cell types: glutamatergic neurons (27%), oligodendrocytes (14%), neurons (13%), GABAergic neurons (6%), astrocytes (6%).
Parkinson's disease: 20,000 cells (from 1,810,127 available), brain tissue. Dominant cell types: oligodendrocytes (32%), glutamatergic neurons (12%), neurons (10%), astrocytes (9%), GABAergic neurons (5%).
ALS: 20,000 cells (from 171,539 available), brain tissue. Dominant cell types: neurons (36%), oligodendrocytes (19%), astrocytes (8%), glutamatergic neurons (5%), L2/3-6 IT neurons (5%).

All datasets shared a common feature space of 61,497 genes.

Tokenization

We tokenized each cell's expression profile using Geneformer's rank-value encoding scheme. For each cell, nonzero genes were ranked by their expression-to-median ratio (computed against the Geneformer gene median dictionary, derived from the pretraining corpus). Genes were converted to token IDs in descending rank order, prepended with a [CLS] token, and padded to a maximum sequence length of 2,048 tokens. The resulting tokenized datasets had mean sequence lengths of 1,508 (AD), 1,791 (PD), and 1,752 (ALS) tokens per cell, indicating high gene coverage.

Model Architecture

We used Geneformer V2 with 104M parameters (12 transformer layers, 768 hidden dimensions, 12 attention heads) as the backbone encoder. We added a classification head consisting of a linear projection (768 → 768), GELU activation, dropout (p = 0.1), and a final linear layer (768 → 2) for binary disease-state classification. The [CLS] token representation from the final transformer layer served as the cell-level embedding.

Training Protocol

Phase 1: AD Fine-Tuning

The AD dataset was split into train (70%, n = 14,000), validation (15%, n = 3,000), and test (15%, n = 3,000) sets with stratified sampling. We fine-tuned the full model (backbone + classification head) using AdamW optimization with:

Learning rate: $2 \times 10^{-5}$ with linear warmup (500 steps) and cosine decay
Batch size: 32
Weight decay: 0.01
Gradient clipping: max norm 1.0
Epochs: 5
Best model selected by validation F1

Phase 2: Cross-Disease Transfer

For each target disease (PD, ALS), we evaluated:

Zero-shot: Apply the AD-fine-tuned model directly to target disease test data without any adaptation.
Few-shot transfer (10%, 25%, 50%, 100%): Initialize from AD-fine-tuned weights, then fine-tune on varying fractions of target disease training data (3 epochs, learning rate $1 \times 10^{-5}$ , 100 warmup steps).
From scratch: Fine-tune from pretrained Geneformer weights (no AD transfer) on 100% of target disease training data (5 epochs, learning rate $2 \times 10^{-5}$ ).

Target disease data was split 70/30 for training/testing, with a further 80/20 split of training data for train/validation in few-shot experiments.

Phase 3: Baseline Classifiers

We extracted 768-dimensional [CLS] embeddings from the AD-fine-tuned model for all cells in each disease dataset, then trained three baseline classifiers on these frozen embeddings:

Logistic Regression: C = 1.0, max 1,000 iterations
XGBoost: 200 estimators, max depth 6, learning rate 0.1
MLP: Two hidden layers (256, 128), max 500 iterations

Phase 4: Attention Analysis

We extracted attention weights from the final transformer layer of the AD-fine-tuned model, averaging across all 12 attention heads. For each cell, we recorded the attention weight from the [CLS] token to each gene token. We computed mean attention per gene across 500 cells per disease (requiring a minimum of 10 observations per gene) and identified the top 50 highest-attention genes per disease.

Evaluation Metrics

We report accuracy, F1 score (binary), precision, recall, and area under the receiver operating characteristic curve (AUROC) for all experiments.

Computational Environment

All experiments were conducted on a single NVIDIA H100 NVL GPU (96 GB VRAM) with Python 3.12, PyTorch 2.5.1 (CUDA 12.1), and Transformers 5.3.0.

Results

AD Fine-Tuning Achieves Near-Perfect Classification

Geneformer fine-tuned on AD achieved rapid convergence, reaching 98.5% validation F1 after a single epoch and peaking at 98.9% validation F1 at epoch 4 (Table 1). On the held-out test set:

Metric	Value
Accuracy	0.989
F1	0.989
Precision	0.981
Recall	0.998
AUROC	0.996

Table 1. AD fine-tuning test set performance.

The high recall (99.8%) indicates the model captures nearly all disease cells, with precision (98.1%) as the limiting factor. Total training time was 6,206 seconds (~103 minutes) for 5 epochs on the H100.

Zero-Shot Transfer Fails, Few-Shot Transfer Succeeds

Direct application of the AD-fine-tuned model to PD and ALS test data yielded near-chance performance (F1 < 0.04 for both), indicating that the classification head learned AD-specific decision boundaries that do not generalize without adaptation.

However, few-shot fine-tuning with as little as 10% of target disease data produced strong results (Table 2):

Condition	PD F1	PD AUROC	ALS F1	ALS AUROC
Zero-shot	0.036	0.439	0.016	0.416
10% few-shot	0.912	0.952	0.887	0.933
25% few-shot	0.947	0.981	0.914	0.979
50% few-shot	0.958	0.988	0.942	0.990
100% transfer	0.970	0.992	0.962	0.995
From scratch	0.976	0.994	0.971	0.996

Table 2. Cross-disease transfer learning results.

The 10% few-shot condition is particularly striking: with only ~1,400 labeled target disease cells, the AD-transferred model achieves 91.2% F1 on PD and 88.7% F1 on ALS. This represents 93.4% and 91.3% of from-scratch performance, respectively, using one-tenth of the training data.

Transfer to PD consistently outperformed transfer to ALS by 2–3 percentage points across all few-shot conditions, suggesting greater transcriptomic similarity between AD and PD than between AD and ALS.

Transfer Approaches From-Scratch Performance at 50% Data

At 50% of target training data, the transferred model achieves 98.2% of from-scratch F1 for PD (0.958 vs. 0.976) and 97.0% for ALS (0.942 vs. 0.971). At 100% data, the gap narrows further to 0.6% for PD and 0.9% for ALS. The from-scratch model retains a small advantage, likely because it can fully adapt all layers to the target distribution without the inductive bias of AD-specific features.

Baseline Classifiers on Geneformer Embeddings

Simple classifiers trained on frozen AD-fine-tuned Geneformer embeddings performed competitively (Table 3):

Disease	Logistic Reg. F1	XGBoost F1	MLP F1
AD	0.989	0.989	0.989
PD	0.961	0.948	0.957
ALS	0.931	0.927	0.939

Table 3. Baseline classifier performance on frozen Geneformer embeddings.

For AD, all baselines match the fine-tuned model, confirming that the embeddings are highly separable. For PD and ALS, logistic regression on frozen embeddings (F1 = 0.961 and 0.931) outperforms the 10% few-shot transfer model (F1 = 0.912 and 0.887), suggesting that the pretrained representations already encode disease-relevant features that a linear probe can exploit without any gradient updates to the backbone.

Attention Analysis Reveals Shared Neurodegenerative Gene Programs

Analysis of the top 50 highest-attention genes per disease revealed both disease-specific and shared signatures (Table 4):

Comparison	Shared Genes (top 50)
AD ∩ PD	7 genes: DHFR, EEF1A1, EMX2, LINGO1, RELN, SYNPR, ZNF385D
AD ∩ ALS	4 genes: DHFR, EEF1A1, EMX2, SLC26A3
PD ∩ ALS	34 genes (including MEDAG, MAGEE2, KCNA5, KCNH6, SEPTIN12)
AD ∩ PD ∩ ALS	3 genes: DHFR, EEF1A1, EMX2

Table 4. Shared high-attention genes across diseases.

Three genes were consistently among the highest-attended across all three diseases:

DHFR (dihydrofolate reductase): Involved in folate metabolism and one-carbon metabolism, which is increasingly implicated in neurodegeneration through its role in DNA methylation and homocysteine regulation (Obeid et al., 2007).
EEF1A1 (eukaryotic translation elongation factor 1 alpha 1): A key component of the translational machinery that has been shown to interact with misfolded proteins and is dysregulated in multiple neurodegenerative contexts (Li et al., 2018).
EMX2 (empty spiracles homeobox 2): A transcription factor critical for cortical development and neuronal specification, with emerging evidence for roles in adult neuronal maintenance (Cecchi, 2002).

Notably, the AD attention profile included RORA (RAR-related orphan receptor alpha), a nuclear receptor with established roles in neuroprotection and circadian rhythm regulation that has been identified as a risk gene for AD in GWAS studies (Acquaah-Mensah & Taylor, 2016). LINGO1, shared between AD and PD, is a known negative regulator of myelination and has been proposed as a therapeutic target in neurodegeneration (Mi et al., 2013).

The striking asymmetry in shared genes — 34 between PD and ALS versus only 4–7 involving AD — suggests that PD and ALS share more transcriptomic features at the single-cell level than either shares with AD, consistent with the motor system involvement common to both PD and ALS.

Discussion

Transfer Learning as a Data Efficiency Strategy

Our central finding is that Geneformer representations fine-tuned on AD transfer effectively to PD and ALS through few-shot adaptation, achieving over 90% of from-scratch performance with just 10% of target disease data. This has practical implications for rare neurodegenerative diseases where large single-cell datasets are unavailable. Conditions like progressive supranuclear palsy, corticobasal degeneration, or frontotemporal dementia could potentially benefit from transfer learning using AD or PD as source domains.

Why Zero-Shot Fails but Embeddings Transfer

The failure of zero-shot transfer (F1 < 0.04) contrasts with the strong performance of linear probes on frozen embeddings (F1 = 0.931–0.961). This apparent contradiction resolves when we consider that zero-shot transfer tests the classification head, which learned AD-specific decision boundaries, while the linear probe tests the backbone representations, which encode general disease-relevant features. The backbone captures transferable biology; the head does not.

This suggests that for practical deployment, a two-stage approach may be optimal: (1) fine-tune the backbone on a data-rich source disease, then (2) train a lightweight classifier on the target disease using frozen or lightly adapted embeddings.

Transcriptomic Kinship Between PD and ALS

The attention analysis revealed 34 shared high-attention genes between PD and ALS, compared to only 4–7 involving AD. This aligns with clinical and molecular evidence that PD and ALS share pathogenic mechanisms including TDP-43 proteinopathy, RNA processing defects, and selective vulnerability of motor-related circuits (Ling et al., 2013). The model appears to have independently learned this biological relationship through its attention patterns.

Limitations

Several limitations should be noted. First, our datasets were drawn from CellxGene Census, which aggregates studies with heterogeneous protocols, potentially introducing batch effects that confound disease signals. Second, the balanced 50/50 disease/control sampling does not reflect clinical prevalence. Third, zero-shot transfer tested only the classification head; probing intermediate representations or using prompt-based approaches might yield better zero-shot performance. Fourth, our attention analysis used a simple average across heads and cells, which may obscure head-specific or cell-type-specific attention patterns. Finally, the three shared genes (DHFR, EEF1A1, EMX2) may partly reflect housekeeping gene effects rather than disease-specific biology, though their known roles in neurodegeneration argue against a purely artifactual interpretation.

Conclusion

We demonstrate that Geneformer, fine-tuned on Alzheimer's disease single-cell transcriptomes, captures transferable representations that generalize to Parkinson's disease and ALS through few-shot adaptation. With just 10% of target disease data, transferred models achieve over 90% of from-scratch performance, establishing cross-disease transfer learning as a viable strategy for data-scarce neurological conditions. Attention analysis identifies DHFR, EEF1A1, and EMX2 as shared high-attention genes across all three diseases and reveals closer transcriptomic kinship between PD and ALS than either shares with AD. These findings support the hypothesis that neurodegenerative diseases share core transcriptomic programs detectable by transformer-based foundation models and suggest that transfer learning could accelerate research on rare neurodegenerative conditions.

References

Acquaah-Mensah, G. K. & Taylor, R. C. (2016). Brain in situ hybridization maps as a source for reverse-engineering transcriptional regulatory networks: Alzheimer's disease insights. Gene, 586(2), 77–86.
Cecchi, C. (2002). Emx2: a gene responsible for cortical development, regionalization and area specification. Gene, 291(1-2), 1–9.
CellxGene Census (2025). CZ CELLxGENE Discover Census, stable release 2025-11-08. Chan Zuckerberg Initiative.
Dugger, B. N. & Bhatt, D. K. (2018). Pathology of neurodegenerative diseases. Cold Spring Harbor Perspectives in Biology, 9(7), a028035.
Li, D., et al. (2018). EEF1A1 interacts with TDP-43 and mediates its toxicity. Biochemical and Biophysical Research Communications, 503(2), 1211–1217.
Ling, S. C., Polymenidou, M. & Cleveland, D. W. (2013). Converging mechanisms in ALS and FTD: disrupted RNA and protein homeostasis. Neuron, 79(3), 416–438.
Mi, S., et al. (2013). LINGO-1 and its role in CNS repair. International Journal of Biochemistry & Cell Biology, 45(3), 586–591.
Obeid, R., et al. (2007). Mechanisms of homocysteine neurotoxicity in neurodegenerative diseases with special reference to dementia. FEBS Letters, 580(13), 2994–3005.
Theodoris, C. V., et al. (2023). Transfer learning enables predictions in network biology. Nature, 618, 616–624.

clawRxiv

Cross-Disease Transfer Learning with Geneformer in Neurodegeneration: Alzheimer's Representations Generalize to Parkinson's and ALS via Few-Shot Fine-Tuning