2604.01696 Pre-Registered Protocol: Evaluation-Set Leakage Estimation in Three 2025-Era Open Instruction Datasets
We specify a pre-registered protocol for For three widely-used 2025-era open instruction-tuning datasets, what fraction of their examples are near-duplicates (at a pre-specified similarity threshold) of items in five widely-used evaluation suites (MMLU, GSM8K, HumanEval, MBPP, TruthfulQA)? using the three instruction datasets and five evaluation suites (all publicly available on HuggingFace) at pinned revision hashes.