2604.01686 Pre-Registered Protocol: JSON-Mode Field-Order Divergence Across Providers
We specify a pre-registered protocol for the following question: for a schema-constrained JSON output mode from two widely used LLM API providers, at matched temperature, matched prompt, and matched schema, what fraction of calls produce JSON outputs whose field order differs between providers even when semantic content is equivalent? The protocol uses a pre-registered prompt corpus of 200 prompts with a fixed JSON schema (mixed types: strings, arrays, nested objects), published alongside the pre-registration.
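As an illustration of the outcome measure, the sketch below classifies one pair of provider outputs. It is not the protocol's own code: the function names are hypothetical, and "semantically equivalent" is assumed here to mean that the parsed values compare equal, which the protocol may define differently.

```python
import json


def key_order_signature(value):
    """Capture the key order at every nesting level of a parsed JSON value."""
    if isinstance(value, dict):
        return tuple((k, key_order_signature(v)) for k, v in value.items())
    if isinstance(value, list):
        return tuple(key_order_signature(v) for v in value)
    return None  # scalars carry no ordering information


def classify_pair(raw_a: str, raw_b: str) -> str:
    """Compare two JSON strings produced for the same prompt (hypothetical helper)."""
    a = json.loads(raw_a)  # json.loads keeps keys in document order
    b = json.loads(raw_b)
    if a != b:  # dict equality ignores key order (assumed equivalence test)
        return "content_differs"
    if key_order_signature(a) != key_order_signature(b):
        return "order_differs"
    return "identical_order"
```

Under these assumptions, the reported fraction would be the share of prompt pairs classified as `order_differs` among those that are not `content_differs`.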
2604.01685 Pre-Registered Protocol: Temperature-0 Sampling Determinism Across Three Inference Stacks
We specify a pre-registered protocol for the following question: given the same open-weights model, the same prompt, and temperature=0 settings, do three widely used inference stacks (vLLM, llama.cpp, HuggingFace transformers) produce byte-identical completions, and if not, how do outputs diverge?
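A minimal sketch of the byte-identity check, assuming completions have already been collected as strings keyed by stack name. The helper names are hypothetical and UTF-8 encoding is an assumption; the protocol may compare raw bytes or token IDs instead.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple


def first_divergence(a: bytes, b: bytes) -> Optional[int]:
    """Return the index of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one output is a strict prefix of the other
    return None


def compare_stacks(outputs: Dict[str, str]) -> Dict[Tuple[str, str], Optional[int]]:
    """Pairwise first-divergence offsets between completions (hypothetical helper)."""
    encoded = {name: text.encode("utf-8") for name, text in outputs.items()}
    return {
        (m, n): first_divergence(encoded[m], encoded[n])
        for m, n in combinations(sorted(encoded), 2)
    }
```

A value of `None` for a pair means the two completions are byte-identical; an integer gives the offset at which they first differ.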
2604.01684 Pre-Registered Protocol: Three Published Self-Refine Prompts on GSM8K
We specify a pre-registered protocol for the following question: when three published self-refine-style prompting strategies are applied as written to a shared modern open-weights base model on the GSM8K test set, how different are their measured reasoning accuracy gains over a common baseline prompt, and are the gains within expected variance? The protocol uses the GSM8K test set (Cobbe et al.
2604.01683 Hearthstone: A Content-Hash-Keyed Persistent Cache for Idempotent Agent Tool Calls
We describe Hearthstone, a persistent cache for idempotent agent tool calls keyed on the hash of their inputs. Agents frequently re-call the same tool with the same inputs across runs and even within a single run.
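The idea of keying a persistent cache on a hash of the call's inputs can be sketched as follows. This is not Hearthstone's API: the class name, the SQLite storage, the SHA-256 hash, and the canonical-JSON key are all assumptions chosen for illustration, and inputs and results are assumed to be JSON-serializable.

```python
import hashlib
import json
import sqlite3


class ToolCallCache:
    """Hypothetical content-hash-keyed cache for idempotent tool calls."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to persist across runs.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS calls (key TEXT PRIMARY KEY, result TEXT NOT NULL)"
        )

    @staticmethod
    def make_key(tool_name, inputs):
        # Canonical JSON (sorted keys, fixed separators) so that equal inputs
        # hash identically regardless of argument order.
        canonical = json.dumps(
            {"tool": tool_name, "inputs": inputs},
            sort_keys=True,
            separators=(",", ":"),
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def call(self, tool_name, fn, inputs):
        key = self.make_key(tool_name, inputs)
        row = self.db.execute(
            "SELECT result FROM calls WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: the tool is not invoked
        result = fn(**inputs)
        self.db.execute(
            "INSERT INTO calls (key, result) VALUES (?, ?)", (key, json.dumps(result))
        )
        self.db.commit()
        return result
```

Such a cache is only safe for tools that are genuinely idempotent; a call with side effects or time-dependent output would return stale results.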
2604.01682 Inkwell: A Tiny Streaming-JSON-Repair Library for Byte-Level LLM Output Fixing
We describe Inkwell, a streaming repairer that converts almost-valid LLM JSON into valid JSON without a second model call. LLM JSON output frequently has small errors: trailing commas, unescaped quotes inside string values, missing closing braces, or a truncated tail.
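To make the error classes concrete, here is a minimal single-pass sketch that handles three of them: trailing commas, missing closing brackets, and a string cut off mid-value. It is not Inkwell's algorithm; the function names are hypothetical, and the sketch does not attempt unescaped inner quotes or truncation in the middle of a key or number.

```python
import json


def _drop_trailing_comma(out):
    """Remove a comma that is the last non-whitespace character emitted so far."""
    i = len(out) - 1
    while i >= 0 and out[i].isspace():
        i -= 1
    if i >= 0 and out[i] == ",":
        del out[i]


def repair_json(text: str) -> str:
    """Best-effort repair of almost-valid JSON (hypothetical helper)."""
    out = []      # characters emitted so far
    stack = []    # closers owed for currently open containers
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            _drop_trailing_comma(out)  # e.g. "[1, 2,]" becomes "[1, 2]"
            if stack:
                stack.pop()
        out.append(ch)
    if in_string:
        if escaped:
            out.pop()  # drop a dangling backslash at the cut point
        out.append('"')  # close a string truncated mid-value
    _drop_trailing_comma(out)
    while stack:
        out.append(stack.pop())  # close containers innermost first
    return "".join(out)
```

For example, `json.loads(repair_json('{"a": 1, "b": [1, 2,],}'))` parses successfully, whereas the unrepaired input raises `json.JSONDecodeError`.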
2604.01681 Ledger: A Minimal Structured-Trace Format for Agents That Is Grep-Friendly and Diff-Friendly
We describe Ledger, a line-oriented, grep-able structured trace format for agent runs that diffs cleanly. Agent traces today are either opaque proprietary formats (vendor-specific, non-portable) or deeply nested JSON that is hard to search with grep and produces noisy diffs when tool output changes.
2604.01680 Kerf: A Minimum-Viable Sandbox for Running Untrusted Agent-Generated Python Snippets
We describe Kerf, a minimum-viable, single-process Python sandbox tuned for short-lived agent snippets. Most agent stacks execute LLM-generated Python either in the main process (catastrophic) or in a full container (expensive).
2604.01679 Rampart: A Syscall-Level Allowlist Front-End for Agent Execution Sandboxes
We describe Rampart, a thin declarative front-end that compiles simple allowlists to seccomp-bpf filters for agent sandboxes. Agents executing generated code need a sandbox, but configuring seccomp-bpf or an equivalent is error-prone.
2604.01678 Nettle: A Minimal Artifact Store for Agent Tool Calls with Content-Addressable Links to Reasoning Traces
We describe Nettle, a tiny artifact store that makes every agent tool call cite its inputs and outputs by hash. Agent traces are unreadable because tool inputs and outputs are either dumped inline (blowing up trace size) or elided (destroying reviewability).
2604.01677 Halberd: A Fault-Injection Harness for Evaluating Agent Recovery from Tool Failures
We describe Halberd, a deterministic fault-injection harness that lets you grade agent recovery against a pre-specified failure taxonomy. Agents are evaluated mostly on happy-path tasks; their behaviour under tool failure (timeout, partial output, garbled JSON, rate limit, auth revocation) is measured only anecdotally.
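One way to make fault injection deterministic is to derive the whole fault schedule from a seed before the agent runs. The sketch below illustrates that idea only; it is not Halberd's interface. All names, the default fault rate, and the way each fault is simulated are assumptions, and the wrapped tool is assumed to return a JSON string.

```python
import random

# Fault kinds taken from the failure classes listed in the abstract.
FAULTS = ("timeout", "partial_output", "garbled_json", "rate_limit", "auth_revoke")


class ToolFault(Exception):
    """Raised for faults that surface as errors rather than as bad output."""

    def __init__(self, kind):
        super().__init__(kind)
        self.kind = kind


def fault_schedule(seed, n_calls, fault_rate=0.3):
    """Return one entry per call: a fault kind, or None for a clean call."""
    rng = random.Random(seed)  # private generator, so the schedule is reproducible
    return [
        rng.choice(FAULTS) if rng.random() < fault_rate else None
        for _ in range(n_calls)
    ]


def inject(tool, schedule):
    """Wrap a tool so that its i-th call suffers the i-th scheduled fault."""
    pending = iter(schedule)

    def wrapped(*args, **kwargs):
        kind = next(pending, None)  # calls beyond the schedule run clean
        if kind in ("timeout", "rate_limit", "auth_revoke"):
            raise ToolFault(kind)
        out = tool(*args, **kwargs)
        if kind == "partial_output":
            return out[: len(out) // 2]  # simulate a truncated response
        if kind == "garbled_json":
            return out.replace('"', "'")  # simulate invalid JSON quoting
        return out

    return wrapped
```

Because the schedule is a plain list, it can be logged alongside the run and replayed, so two agents can be graded against exactly the same sequence of failures.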
2604.01676 Lethe-2: Controlled Forgetting with Explicit Eviction Costs in Multi-Agent Swarms
We describe Lethe-2, a per-agent forgetting controller that treats eviction as a budgeted, auditable operation. Multi-agent swarms accumulate shared context that, over long runs, drifts from the actual task and silently inflates token cost across every agent.
2604.01675 Aphex: A Hash-Indexed, Token-Budgeted Working-Memory Layer for Long-Horizon Coding Agents
We describe Aphex, a content-addressed, token-budgeted working memory for coding agents that doesn't balloon the context window. Long-horizon coding agents repeatedly re-read large files and recompute summaries across turns because their working memory has no durable, addressable index.
2604.01674 Publication-Bias Correction Inverts the Headline Effect of Two Recent Dietary-Intervention Meta-Analyses: A Reproducible Reanalysis
We apply a disclosed correction method: trim-and-fill and the precision-effect test and precision-effect estimate with standard errors (PET-PEESE) are applied to the included-study effect sizes, together with selection-model (Andrews-Kasy 2019) corrections. We then re-estimate the pooled effect on the selection-adjusted scale.
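For readers unfamiliar with PET-PEESE, the sketch below shows the two regressions in their usual form: effect sizes are regressed on the standard error (PET) or on its square (PEESE) with inverse-variance weights, and the intercept is read as the bias-adjusted effect. This is a generic illustration rather than the reanalysis code. The function names are hypothetical, and the conditional rule for choosing between the two intercepts, which depends on a significance test of the PET intercept, is omitted.

```python
def wls_line(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def pet_peese(effects, std_errors):
    """Return the PET and PEESE intercepts for a set of study estimates."""
    weights = [1.0 / se ** 2 for se in std_errors]  # inverse-variance weights
    pet_intercept, _ = wls_line(std_errors, effects, weights)
    peese_intercept, _ = wls_line([se ** 2 for se in std_errors], effects, weights)
    return {"pet": pet_intercept, "peese": peese_intercept}
```

If small studies (large standard errors) systematically report larger effects, the slope absorbs that pattern and the intercept estimates the effect of a hypothetical study with zero standard error, which is how a correction of this kind can shrink or reverse a pooled estimate.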
2604.01673 Pre-Registered Protocol: External Replication of a Published Pancreatic-Cancer Radiomics Signature
We specify a pre-registered protocol for the following question: does the top-cited pre-2024 pancreatic-cancer radiomics signature (selected by pre-specified citation criteria) replicate its reported AUC on an external, publicly accessible CT cohort when applied without retraining, using the authors' released feature definitions? The protocol uses The Cancer Imaging Archive (TCIA) pancreatic CT collections (e.
2604.01672 Obol: A Hash-Based Cell-Identity Fingerprint for Cross-Study Concordance in scRNA-seq
We describe Obol, a reproducible, hash-based fingerprint for single-cell identity that lets two studies compare cell populations without sharing raw counts. Cross-study comparisons in scRNA-seq commonly rely on re-integrating raw count matrices, which is slow, requires raw data access, and re-opens batch-correction choices already made by the original authors.
2604.01671 COVID-LONG v1: Pre-Validation Framework for Long-COVID Probability at 6 Months by Acute-Phase Features
COVID-LONG v1: We present a pre-validation composite scoring framework for symptom persistence meeting WHO PCC criteria at 6 months post-index in adults with a confirmed SARS-CoV-2 infection who have completed the acute phase (>=28 days post-symptom onset). Published literature reports long-COVID prevalence 10-30% at 6 months depending on population and definition [Davis 2021; Global Burden Collaborators 2022; NICE 2022], with effect sizes for individual modifiers reported inconsistently across study designs and grading conventions.
2604.01670 HIV-ART-START-ADV v1: Transparent Framework for Early IRIS Risk in Advanced-Disease ART Start
HIV-ART-START-ADV v1: We present a pre-validation composite scoring framework for clinical IRIS per the INSHI 2008 consensus definition within 12 weeks in adult people with HIV and CD4<200 cells/uL at ART initiation, particularly those with concurrent opportunistic infection. Published literature reports IRIS incidence of 10-25% in low-CD4 ART initiation cohorts, stratified by underlying OI [Muller 2010; Walker 2015], with effect sizes for individual modifiers reported inconsistently across study designs and grading conventions.
2604.01669 ALZ-ANTI-AMYL v1: Pre-Validation Framework for ARIA-E Risk in Anti-Amyloid Therapy Across APOE Genotypes
ALZ-ANTI-AMYL v1: We present a pre-validation composite scoring framework for any ARIA-E event detected on surveillance MRI within 18 months of therapy initiation in adult patients with mild cognitive impairment or mild Alzheimer dementia being considered for lecanemab or donanemab. Published literature reports ARIA-E incidence 10-35% in anti-amyloid trials, strongly modified by APOE epsilon-4 dose [van Dyck 2023; Sims 2023], with effect sizes for individual modifiers reported inconsistently across study designs and grading conventions.
2604.01668 Pre-Registered Protocol: Operating-Point Disclosure Audit of 40 Early-Detection AI Preprints
We specify a pre-registered protocol for the following question: across a pre-registered sample of 40 recent preprints claiming 'early-detection' performance for a medical AI classifier, what fraction disclose a pre-specified operating point (threshold, sensitivity/specificity target) as opposed to reporting only AUC? The protocol uses arXiv, medRxiv, and bioRxiv preprint metadata and full-text PDFs, accessed via their public APIs on a pre-registered sampling date, with specific preprint identifiers logged at inclusion.
2604.01667 Pre-Registered Protocol: RECIST 1.1 vs iRECIST vs imRECIST Progression Concordance in Immunotherapy Trials
We specify a pre-registered protocol for the following question: on the same patient-level tumour measurement data from publicly accessible immunotherapy trial datasets, what fraction of patients receive a different first-progression timing under RECIST 1.1, iRECIST, and imRECIST, and how does this affect PFS curve endpoints?