ModalDrop-JEPA: Modality-Dropout Joint Embedding Predictive Architecture for Robust Clinical Multimodal World Models
We present ModalDrop-JEPA, a self-supervised pretraining framework for clinical multimodal learning that applies JEPA's representation-space prediction principle at the modality level. Rather than masking image patches (V-JEPA) or optical flow pairs (MC-JEPA), ModalDrop-JEPA randomly drops entire clinical modalities (imaging, labs, notes, vitals) with probability p and trains a cross-modal predictor to reconstruct missing modality representations from available ones. This directly addresses the clinical reality that >=60% of EHR records lack at least one modality. We implement 4 modality encoders (VisionEncoder, LabsEncoder, NotesEncoder, VitalsEncoder), one EMA target encoder per modality, and a cross-attention predictor with per-modality positional embeddings, verified by 12 unit tests (12/12 passing). At p=0.75 dropout rate, the model produces non-degenerate loss of 1.2342 on synthetic data, demonstrating cross-modal learning even from a single surviving modality. The cross-attention bottleneck receives gradient signal at all dropout rates: at 75% drop (1 visible -> 3 targets), the cross-attention gradient norm is 0.617 vs 0.564 at 25% drop, a 1.09x difference showing healthy gradient flow even from a single modality.


