How Well Does the Clinical Pipeline Cover Approved Drug Space? A Reproducible Chemical Diversity Audit of ChEMBL Phase 1–4 Small Molecules
How Well Does the Clinical Pipeline Cover Approved Drug Space?
1. Introduction
A central question in drug discovery portfolio strategy is whether the molecules currently in clinical development explore the same chemical territory as approved drugs, or whether they venture into genuinely new structural space. If the clinical pipeline largely recapitulates approved drug chemistry, it may be optimised for incremental improvement but miss unexplored regions. Conversely, if it diverges too far from proven chemical space, it may face unforeseen ADMET liabilities.
We address this question quantitatively using a reproducible, agent-executable pipeline applied to ChEMBL — the largest publicly curated bioactivity database. We define a coverage index: the fraction of approved drugs that have at least one clinical-stage structural neighbour above a Tanimoto similarity threshold. This gives a single interpretable number summarising how much of the "success space" of approved drugs is being actively explored by the clinical pipeline.
2. Methods
2.1 Data Acquisition
All data were obtained programmatically from ChEMBL via the
chembl-webresource-client Python package. We downloaded:
- Approved drugs: all small molecules with
max_phase = 4(n = 3,280 raw) - Clinical candidates: all small molecules with
max_phase∈ {1, 2, 3} (n = 9,433 raw)
2.2 Curation
Identical curation was applied to both sets to ensure fair comparison:
- Drop records missing SMILES
- Standardise with datamol (
standardize_mol,fix_mol) - Deduplicate on canonical SMILES
- Remove PAINS compounds (RDKit FilterCatalog)
| Set | Raw | After standardisation | After dedup | After PAINS |
|---|---|---|---|---|
| Approved (phase 4) | 3,280 | 3,127 | 3,127 | 2,945 |
| Clinical (phases 1–3) | 9,433 | 9,239 | 9,239 | 8,757 |
2.3 Coverage Index
For each of the 2,945 approved drugs, we computed the maximum Tanimoto similarity
to any of the 8,757 clinical candidates using Morgan fingerprints (radius=2,
2048 bits) and RDKit's BulkTanimotoSimilarity. An approved drug is considered
"covered" if this maximum similarity ≥ 0.4 — a standard threshold for chemical
neighbourhood in fingerprint space. We also report the full distribution of
nearest-neighbour similarities.
2.4 Chemical Space Visualisation
Morgan fingerprints from both sets were concatenated and projected to 2D via UMAP (n_neighbors=15, min_dist=0.1, random_state=42) to visualise overlap and separation.
2.5 Property and Scaffold Analysis
Five physicochemical properties (MW, LogP, TPSA, QED, Fsp3) were computed with RDKit and compared using two-sided Kolmogorov–Smirnov tests. Bemis-Murcko scaffold diversity was computed for both sets independently.
3. Results
3.1 Coverage Index
81.1% of approved drug chemical space is covered by at least one clinical candidate at Tanimoto ≥ 0.4 (2,388 of 2,945 approved drugs). The mean nearest-neighbour Tanimoto similarity is 0.580 (median: 0.585), indicating that on average, an approved drug has a moderately close structural analogue somewhere in the clinical pipeline.
The remaining 18.9% of approved space is uncovered — 557 approved drugs with no clinical-stage structural neighbour above the threshold. These represent regions of proven therapeutic chemical space that the current pipeline is not actively exploring, whether by structural novelty, de-prioritisation, or historical accident.
| Metric | Value |
|---|---|
| Coverage index (Tanimoto ≥ 0.4) | 81.09% |
| Covered approved drugs | 2,388 / 2,945 |
| Uncovered approved drugs | 557 / 2,945 |
| Mean max Tanimoto | 0.5803 |
| Median max Tanimoto | 0.5846 |
3.2 Chemical Space Structure
UMAP visualisation reveals substantial overlap between approved drugs and clinical candidates, consistent with the high coverage index. However, the clinical set fills significantly more of the periphery — regions of chemical space distant from the dense approved-drug core — consistent with its higher scaffold diversity. The phase-coloured projection shows a clear gradient: phase 1 compounds occupy the most peripheral positions, while phase 3 compounds cluster closer to approved drug space, consistent with the well-known attrition bias toward drug-like chemistry as candidates advance.
3.3 Physicochemical Property Shifts
All five tested properties differed significantly between sets (KS test, all p < 0.05). The most striking divergences are in LogP and TPSA:
| Property | Mean (Approved) | Mean (Clinical) | KS statistic | p-value |
|---|---|---|---|---|
| MW | 406.2 Da | 396.0 Da | 0.039 | 0.0025 |
| LogP | 1.922 | 2.840 | 0.134 | < 0.001 |
| TPSA | 98.8 Ų | 84.8 Ų | 0.103 | < 0.001 |
| QED | 0.504 | 0.536 | 0.059 | < 0.001 |
| Fsp3 | 0.442 | 0.432 | 0.029 | 0.046 |
Clinical candidates are on average more lipophilic (ΔLogP = +0.92) and less polar (ΔTPSA = −14.0 Ų) than approved drugs. This may reflect an industry-wide trend toward CNS-penetrant and membrane-permeable targets in current pipelines, or alternatively, a "lipophilicity creep" driven by optimising potency at the expense of ADMET properties. The higher mean QED of clinical candidates (0.536 vs. 0.504) is superficially reassuring but must be interpreted cautiously given the LogP and TPSA shifts.
The Fsp3 difference, while statistically significant, is small in absolute terms (0.442 vs. 0.432) and unlikely to be practically meaningful — both sets occupy similar levels of three-dimensionality.
3.4 Scaffold Diversity
The clinical pipeline is substantially more structurally diverse than the approved drug set:
| Metric | Approved | Clinical |
|---|---|---|
| Compounds | 2,945 | 8,757 |
| Unique scaffolds | 1,235 | 5,299 |
| Diversity index | 0.419 | 0.605 |
| Top scaffold share | 8.56% | 6.11% |
| Top-10 scaffolds share | 22.61% | 15.45% |
The approved set's top 10 scaffolds account for 22.6% of all approved drugs — indicating that regulatory success has historically concentrated in a small number of privileged scaffold families. The clinical pipeline's top-10 concentration (15.4%) suggests broader structural exploration, though whether this translates to eventual approval of novel scaffolds remains to be seen.
The apparent paradox — the clinical pipeline is more diverse yet 18.9% of approved space remains uncovered — is explained by the nature of the coverage metric: clinical diversity is concentrated in new regions rather than in filling gaps in approved space.
4. Discussion
The 18.9% Gap
The 557 uncovered approved drugs represent a structurally defined opportunity space. Several explanations may account for their absence from the clinical pipeline: (1) they may occupy established markets with generic competition, reducing commercial incentive for follow-on development; (2) they may represent biological modalities (e.g. natural product-derived structures) that are chemically difficult to optimise; or (3) they may simply be underexplored.
LogP Creep and Its Implications
The significant LogP shift (+0.92) between approved drugs and clinical candidates is consistent with published observations of "property creep" in drug development, where optimising potency during lead optimisation inflates lipophilicity. Historically, this predicts higher clinical attrition: more lipophilic compounds face greater metabolic liability, promiscuous binding, and formulation challenges. The clinical pipeline's lower TPSA (−14 Ų) reinforces this concern, as lower TPSA generally correlates with reduced aqueous solubility.
Limitations
The coverage index is sensitive to the Tanimoto threshold (0.4 here); a stricter threshold of 0.6 would yield a lower coverage estimate. Morgan fingerprints capture 2D topology but not 3D shape or pharmacophoric features, so structurally dissimilar compounds with similar binding modes are treated as uncovered. Future work could repeat this analysis with 3D shape descriptors or pharmacophore fingerprints.
5. Conclusions
81.1% of approved drug chemical space has a clinical-stage structural neighbour (Tanimoto ≥ 0.4), but 18.9% — 557 approved drugs — remains unoccupied by current candidates. The clinical pipeline is more structurally diverse than the approved set (diversity index 0.605 vs. 0.419) but is shifting toward greater lipophilicity and lower polarity relative to approved drugs, a property profile associated with higher attrition risk. These findings are fully reproducible via the accompanying SKILL.md on any ChEMBL phase subset.
6. Reproducibility
The pipeline was validated by a single clean-environment agent execution
(Claude Code, VS Code extension) with zero manual interventions. All steps ran
as written. Runtime: ~20 minutes end-to-end including ChEMBL downloads and
coverage computation. To reproduce on a different scope, edit only params.py.
```
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: approved-vs-clinical-diversity
description: >
Downloads FDA-approved small molecule drugs and clinical-stage candidates from
ChEMBL, curates both sets, and quantifies how much of approved drug chemical
space is covered by clinical candidates. Produces a coverage index, UMAP
comparison, scaffold diversity contrast, and property distribution analysis.
Outputs a self-contained HTML report and machine-readable CSVs.
allowed-tools: Bash(pip *), Bash(python *), Bash(curl *), Bash(echo *), Bash(cat *)
---
# Molecular Diversity Audit: Approved Drugs vs. Clinical Candidates
## Parameters
Edit only this section to change thresholds or output location.
```bash
cat > params.py << 'EOF'
# ChEMBL phase definitions
APPROVED_PHASE = 4 # max_phase == 4 → FDA-approved
CLINICAL_PHASES = [1, 2, 3] # max_phase in 1-3 → clinical-stage
MOL_TYPE = "Small molecule"
# Coverage analysis
COVERAGE_TANIMOTO = 0.40 # a clinical compound "covers" an approved drug
# if Tanimoto(Morgan FP) >= this threshold
RANDOM_SEED = 42
OUTPUT_DIR = "diversity_output"
EOF
```
> **Important:** Run all commands from the **project root** (the directory
> containing `params.py`). Use `python scripts/NN_name.py` — never `cd scripts/`.
> Each script inserts the project root into `sys.path` explicitly.
---
## Step 1 — Environment Setup
**Expected time:** ~3 minutes on a clean environment.
```bash
pip install chembl-webresource-client==0.10.9 \
datamol==0.12.3 \
rdkit==2024.3.1 \
umap-learn==0.5.6 \
pandas==2.1.4 \
numpy==1.26.4 \
matplotlib==3.9.0 \
seaborn==0.13.2 \
scipy==1.13.0 \
jinja2==3.1.4
mkdir -p diversity_output/figures diversity_output/data scripts
```
**Validation:** `python -c "import datamol, chembl_webresource_client, umap, scipy; print('OK')"` must print `OK`.
---
## Step 2 — Download Approved Drugs
Downloads all ChEMBL small molecule drugs with `max_phase = 4`.
```python
# scripts/01_download_approved.py
import sys, os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import pandas as pd
from chembl_webresource_client.new_client import new_client
from params import APPROVED_PHASE, MOL_TYPE, OUTPUT_DIR
mol_api = new_client.molecule
records = mol_api.filter(
max_phase=APPROVED_PHASE,
molecule_type=MOL_TYPE,
).only([
"molecule_chembl_id",
"pref_name",
"max_phase",
"molecule_structures",
"molecule_properties",
])
rows = []
for r in records:
smiles = (r.get("molecule_structures") or {}).get("canonical_smiles")
props = r.get("molecule_properties") or {}
rows.append({
"chembl_id": r["molecule_chembl_id"],
"name": r.get("pref_name"),
"phase": r["max_phase"],
"smiles": smiles,
"mw": props.get("full_mwt"),
"alogp": props.get("alogp"),
})
df = pd.DataFrame(rows)
out = f"{OUTPUT_DIR}/data/approved_raw.csv"
df.to_csv(out, index=False)
print(f"Downloaded {len(df)} approved drugs → {out}")
```
```bash
python scripts/01_download_approved.py
```
**Validation:** `wc -l diversity_output/data/approved_raw.csv` should be > 1000.
---
## Step 3 — Download Clinical Candidates
Downloads all ChEMBL small molecules with `max_phase` in {1, 2, 3}.
```python
# scripts/02_download_clinical.py
import sys, os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import pandas as pd
from chembl_webresource_client.new_client import new_client
from params import CLINICAL_PHASES, MOL_TYPE, OUTPUT_DIR
mol_api = new_client.molecule
all_rows = []
for phase in CLINICAL_PHASES:
print(f"Fetching phase {phase} ...")
records = mol_api.filter(
max_phase=phase,
molecule_type=MOL_TYPE,
).only([
"molecule_chembl_id",
"pref_name",
"max_phase",
"molecule_structures",
"molecule_properties",
])
for r in records:
smiles = (r.get("molecule_structures") or {}).get("canonical_smiles")
props = r.get("molecule_properties") or {}
all_rows.append({
"chembl_id": r["molecule_chembl_id"],
"name": r.get("pref_name"),
"phase": r["max_phase"],
"smiles": smiles,
"mw": props.get("full_mwt"),
"alogp": props.get("alogp"),
})
df = pd.DataFrame(all_rows)
out = f"{OUTPUT_DIR}/data/clinical_raw.csv"
df.to_csv(out, index=False)
print(f"Downloaded {len(df)} clinical candidates → {out}")
print(df["phase"].value_counts().to_string())
```
```bash
python scripts/02_download_clinical.py
```
**Validation:** `wc -l diversity_output/data/clinical_raw.csv` should be > 2000.
---
## Step 4 — Curation
Standardises both sets with datamol, deduplicates, and removes PAINS.
Applies identical curation to both sets so comparisons are fair.
```python
# scripts/03_curate.py
import sys, os, warnings, json
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
warnings.filterwarnings("ignore")
import pandas as pd
import datamol as dm
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams
from params import OUTPUT_DIR
def standardize(smi):
try:
mol = dm.to_mol(smi, sanitize=True)
if mol is None:
return None
mol = dm.standardize_mol(mol)
mol = dm.fix_mol(mol)
return dm.to_smiles(mol)
except Exception:
return None
try:
dm.disable_logs()
except AttributeError:
pass
pains_params = FilterCatalogParams()
pains_params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog(pains_params)
def is_pains(smi):
mol = dm.to_mol(smi)
return mol is not None and catalog.HasMatch(mol)
def curate(df, label):
log = {"label": label, "raw": len(df)}
df = df.dropna(subset=["smiles"]).copy()
log["after_missing"] = len(df)
df["std_smiles"] = df["smiles"].apply(standardize)
df = df.dropna(subset=["std_smiles"])
log["after_standardization"] = len(df)
df = df.drop_duplicates(subset="std_smiles", keep="first").reset_index(drop=True)
log["after_deduplication"] = len(df)
df["is_pains"] = df["std_smiles"].apply(is_pains)
log["pains_flagged"] = int(df["is_pains"].sum())
df = df[~df["is_pains"]].copy().reset_index(drop=True)
log["after_pains"] = len(df)
print(json.dumps(log, indent=2))
return df, log
approved_raw = pd.read_csv(f"{OUTPUT_DIR}/data/approved_raw.csv")
clinical_raw = pd.read_csv(f"{OUTPUT_DIR}/data/clinical_raw.csv")
approved, log_a = curate(approved_raw, "approved")
clinical, log_c = curate(clinical_raw, "clinical")
approved["dataset"] = "approved"
clinical["dataset"] = "clinical"
approved.to_csv(f"{OUTPUT_DIR}/data/approved_curated.csv", index=False)
clinical.to_csv(f"{OUTPUT_DIR}/data/clinical_curated.csv", index=False)
with open(f"{OUTPUT_DIR}/data/curation_log.json", "w") as f:
json.dump({"approved": log_a, "clinical": log_c}, f, indent=2)
print(f"\nFinal: {len(approved)} approved, {len(clinical)} clinical candidates")
```
```bash
python scripts/03_curate.py
```
**Validation:** Both `approved_curated.csv` and `clinical_curated.csv` must exist with > 0 rows.
---
## Step 5 — Descriptor Calculation
Computes Morgan fingerprints and RDKit 2D descriptors for both sets.
```python
# scripts/04_descriptors.py
import sys, os, warnings
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import datamol as dm
from rdkit.Chem import Descriptors, AllChem
from params import OUTPUT_DIR
def calc_descriptors(smi):
mol = dm.to_mol(smi)
if mol is None:
return {}
return {
"MW": Descriptors.MolWt(mol),
"LogP": Descriptors.MolLogP(mol),
"TPSA": Descriptors.TPSA(mol),
"HBD": Descriptors.NumHDonors(mol),
"HBA": Descriptors.NumHAcceptors(mol),
"RotBonds": Descriptors.NumRotatableBonds(mol),
"Fsp3": Descriptors.FractionCSP3(mol),
"RingCount":Descriptors.RingCount(mol),
"QED": __import__('rdkit.Chem.QED', fromlist=['qed']).qed(mol),
}
def morgan_fp(smi, radius=2, nbits=2048):
mol = dm.to_mol(smi)
if mol is None:
return np.zeros(nbits, dtype=np.uint8)
return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nbits),
dtype=np.uint8)
for label in ["approved", "clinical"]:
df = pd.read_csv(f"{OUTPUT_DIR}/data/{label}_curated.csv")
desc = df["std_smiles"].apply(calc_descriptors).apply(pd.Series)
df = pd.concat([df, desc], axis=1)
df.to_csv(f"{OUTPUT_DIR}/data/{label}_enriched.csv", index=False)
fps = np.vstack(df["std_smiles"].apply(morgan_fp).values)
np.save(f"{OUTPUT_DIR}/data/{label}_fps.npy", fps)
print(f"{label}: {len(df)} compounds, fingerprint matrix {fps.shape}")
```
```bash
python scripts/04_descriptors.py
```
**Validation:** Both `approved_fps.npy` and `clinical_fps.npy` must exist.
```bash
python -c "
import numpy as np
a = np.load('diversity_output/data/approved_fps.npy')
c = np.load('diversity_output/data/clinical_fps.npy')
print('approved fps:', a.shape, ' clinical fps:', c.shape)
assert a.shape[1] == 2048 and c.shape[1] == 2048
print('OK')
"
```
---
## Step 6 — Coverage Index Calculation
**The primary scientific output.** For each approved drug, checks whether any
clinical candidate has Tanimoto similarity ≥ `COVERAGE_TANIMOTO`. Reports the
fraction of approved chemical space "covered" by the clinical pipeline.
Uses `DataStructs.BulkTanimotoSimilarity` for efficient row-wise computation —
O(N×M) but vectorised, tractable for the dataset sizes involved (~2K × ~8K).
```python
# scripts/05_coverage.py
import sys, os, warnings, json
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from rdkit.Chem import DataStructs, AllChem
from rdkit.DataStructs import ExplicitBitVect
import datamol as dm
from params import OUTPUT_DIR, COVERAGE_TANIMOTO, RANDOM_SEED
approved = pd.read_csv(f"{OUTPUT_DIR}/data/approved_enriched.csv")
clinical = pd.read_csv(f"{OUTPUT_DIR}/data/clinical_enriched.csv")
def smiles_to_rdkit_fp(smi):
mol = dm.to_mol(smi)
if mol is None:
return None
return AllChem.GetMorganFingerprintAsBitVect(mol, 2, 2048)
print("Building fingerprint objects ...")
approved_fps = [smiles_to_rdkit_fp(s) for s in approved["std_smiles"]]
clinical_fps = [smiles_to_rdkit_fp(s) for s in clinical["std_smiles"]]
# Drop None
approved_valid = [(i, fp) for i, fp in enumerate(approved_fps) if fp is not None]
clinical_fps_valid = [fp for fp in clinical_fps if fp is not None]
print(f"Computing coverage: {len(approved_valid)} approved vs {len(clinical_fps_valid)} clinical ...")
covered = 0
max_sims = []
for idx, (i, fp_a) in enumerate(approved_valid):
if idx % 200 == 0:
print(f" {idx}/{len(approved_valid)} ...")
sims = DataStructs.BulkTanimotoSimilarity(fp_a, clinical_fps_valid)
best = max(sims)
max_sims.append(best)
if best >= COVERAGE_TANIMOTO:
covered += 1
coverage_index = covered / len(approved_valid)
print(f"\nCoverage index (Tanimoto >= {COVERAGE_TANIMOTO}): "
f"{covered}/{len(approved_valid)} = {coverage_index:.3f} ({coverage_index*100:.1f}%)")
# Distribution of nearest-neighbour similarities
sim_df = pd.DataFrame({
"chembl_id": [approved.iloc[i]["chembl_id"] for i, _ in approved_valid],
"max_tanimoto": max_sims,
"covered": [s >= COVERAGE_TANIMOTO for s in max_sims],
})
sim_df.to_csv(f"{OUTPUT_DIR}/data/coverage_scores.csv", index=False)
result = {
"coverage_index": round(coverage_index, 4),
"coverage_pct": round(coverage_index * 100, 2),
"covered_count": covered,
"approved_total": len(approved_valid),
"clinical_total": len(clinical_fps_valid),
"tanimoto_threshold": COVERAGE_TANIMOTO,
"mean_max_tanimoto": round(float(np.mean(max_sims)), 4),
"median_max_tanimoto": round(float(np.median(max_sims)), 4),
}
with open(f"{OUTPUT_DIR}/data/coverage_result.json", "w") as f:
json.dump(result, f, indent=2)
print(json.dumps(result, indent=2))
```
```bash
python scripts/05_coverage.py
```
**Expected time:** 3–8 minutes depending on dataset sizes.
**Validation:** `diversity_output/data/coverage_result.json` must contain `coverage_index`.
---
## Step 7 — UMAP Visualisation
Projects both sets into a shared 2D chemical space coloured by dataset
(approved vs. clinical) and by phase (1/2/3/4).
```python
# scripts/06_umap.py
import sys, os, warnings
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
warnings.filterwarnings("ignore")
import matplotlib
matplotlib.use("Agg")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import umap
from params import OUTPUT_DIR, RANDOM_SEED
approved = pd.read_csv(f"{OUTPUT_DIR}/data/approved_enriched.csv")
clinical = pd.read_csv(f"{OUTPUT_DIR}/data/clinical_enriched.csv")
fps_a = np.load(f"{OUTPUT_DIR}/data/approved_fps.npy")
fps_c = np.load(f"{OUTPUT_DIR}/data/clinical_fps.npy")
# Combine for joint embedding
all_fps = np.vstack([fps_a, fps_c])
labels = ["approved"] * len(fps_a) + ["clinical"] * len(fps_c)
phases = list(approved["phase"].astype(int)) + list(clinical["phase"].astype(int))
print(f"Running UMAP on {len(all_fps)} compounds ...")
reducer = umap.UMAP(n_components=2, random_state=RANDOM_SEED,
n_neighbors=15, min_dist=0.1)
coords = reducer.fit_transform(all_fps)
combined = pd.DataFrame({
"UMAP_1": coords[:, 0],
"UMAP_2": coords[:, 1],
"dataset": labels,
"phase": phases,
})
combined.to_csv(f"{OUTPUT_DIR}/data/umap_coords.csv", index=False)
# Figure 1: approved vs clinical
fig, ax = plt.subplots(figsize=(9, 7))
colours = {"approved": "#1565C0", "clinical": "#E53935"}
for ds, colour in colours.items():
mask = combined["dataset"] == ds
alpha = 0.5 if ds == "clinical" else 0.8
size = 6 if ds == "clinical" else 10
ax.scatter(combined.loc[mask, "UMAP_1"], combined.loc[mask, "UMAP_2"],
c=colour, s=size, alpha=alpha, label=ds.capitalize(), rasterized=True)
ax.legend(markerscale=2, framealpha=0.9)
ax.set_title("Chemical Space: Approved Drugs vs. Clinical Candidates")
ax.set_xlabel("UMAP 1"); ax.set_ylabel("UMAP 2")
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/figures/umap_dataset.png", dpi=150)
print("Saved umap_dataset.png")
# Figure 2: coloured by phase
fig, ax = plt.subplots(figsize=(9, 7))
phase_colours = {1: "#FF7043", 2: "#FFA726", 3: "#66BB6A", 4: "#1565C0"}
phase_labels = {1: "Phase 1", 2: "Phase 2", 3: "Phase 3", 4: "Approved"}
for phase, colour in phase_colours.items():
mask = combined["phase"] == phase
ax.scatter(combined.loc[mask, "UMAP_1"], combined.loc[mask, "UMAP_2"],
c=colour, s=6, alpha=0.6, label=phase_labels[phase], rasterized=True)
ax.legend(markerscale=2, framealpha=0.9)
ax.set_title("Chemical Space by Development Phase")
ax.set_xlabel("UMAP 1"); ax.set_ylabel("UMAP 2")
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/figures/umap_phase.png", dpi=150)
print("Saved umap_phase.png")
```
```bash
python scripts/06_umap.py
```
**Validation:** Both `umap_dataset.png` and `umap_phase.png` must exist and be > 20 KB.
---
## Step 8 — Property Distribution Comparison
Plots and statistically tests MW, LogP, TPSA, QED, and Fsp3 distributions
between approved drugs and clinical candidates. Uses Kolmogorov–Smirnov test
to report whether distributions differ significantly.
```python
# scripts/07_properties.py
import sys, os, warnings, json
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
warnings.filterwarnings("ignore")
import matplotlib
matplotlib.use("Agg")
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from params import OUTPUT_DIR
approved = pd.read_csv(f"{OUTPUT_DIR}/data/approved_enriched.csv")
clinical = pd.read_csv(f"{OUTPUT_DIR}/data/clinical_enriched.csv")
PROPS = {
"MW": "Molecular Weight (Da)",
"LogP": "LogP",
"TPSA": "TPSA (Ų)",
"QED": "QED Drug-likeness",
"Fsp3": "Fraction sp³ Carbon (Fsp3)",
}
fig, axes = plt.subplots(1, len(PROPS), figsize=(18, 4))
ks_results = {}
for ax, (prop, label) in zip(axes, PROPS.items()):
a_vals = approved[prop].dropna()
c_vals = clinical[prop].dropna()
ax.hist(a_vals, bins=40, alpha=0.6, color="#1565C0",
density=True, label=f"Approved (n={len(a_vals)})")
ax.hist(c_vals, bins=40, alpha=0.6, color="#E53935",
density=True, label=f"Clinical (n={len(c_vals)})")
ks_stat, p_val = stats.ks_2samp(a_vals, c_vals)
ks_results[prop] = {"ks_stat": round(ks_stat, 4), "p_value": round(p_val, 6),
"mean_approved": round(float(a_vals.mean()), 3),
"mean_clinical": round(float(c_vals.mean()), 3)}
sig = "***" if p_val < 0.001 else ("**" if p_val < 0.01 else ("*" if p_val < 0.05 else "ns"))
ax.set_title(f"{label}\nKS={ks_stat:.3f} {sig}", fontsize=9)
ax.legend(fontsize=7)
ax.set_xlabel(label, fontsize=8)
plt.suptitle("Property Distributions: Approved vs. Clinical", y=1.02, fontsize=11)
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/figures/property_distributions.png", dpi=150, bbox_inches="tight")
print("Saved property_distributions.png")
print(json.dumps(ks_results, indent=2))
with open(f"{OUTPUT_DIR}/data/ks_results.json", "w") as f:
json.dump(ks_results, f, indent=2)
```
```bash
python scripts/07_properties.py
```
**Validation:** `diversity_output/data/ks_results.json` must contain entries for MW, LogP, QED.
---
## Step 9 — Scaffold Diversity Comparison
Computes Bemis-Murcko scaffold diversity for both sets and compares.
```python
# scripts/08_scaffolds.py
import sys, os, warnings, json
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
warnings.filterwarnings("ignore")
import matplotlib
matplotlib.use("Agg")
import pandas as pd
import matplotlib.pyplot as plt
import datamol as dm
from rdkit.Chem.Scaffolds.MurckoScaffold import GetScaffoldForMol
from rdkit.Chem import MolToSmiles
from params import OUTPUT_DIR
def get_scaffold(smi):
mol = dm.to_mol(smi)
if mol is None:
return None
try:
return MolToSmiles(GetScaffoldForMol(mol))
except Exception:
return None
results = {}
for label in ["approved", "clinical"]:
df = pd.read_csv(f"{OUTPUT_DIR}/data/{label}_enriched.csv")
df["scaffold"] = df["std_smiles"].apply(get_scaffold)
df = df.dropna(subset=["scaffold"])
counts = df["scaffold"].value_counts()
n_total = len(df)
n_unique = len(counts)
results[label] = {
"n_compounds": n_total,
"n_unique_scaffolds": n_unique,
"diversity_index": round(n_unique / n_total, 4),
"top1_pct": round(counts.iloc[0] / n_total * 100, 2),
"top10_pct": round(counts.iloc[:10].sum() / n_total * 100, 2),
}
print(f"{label}: {n_unique}/{n_total} unique scaffolds, "
f"diversity={n_unique/n_total:.3f}")
with open(f"{OUTPUT_DIR}/data/scaffold_stats.json", "w") as f:
json.dump(results, f, indent=2)
# Bar comparison
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
metrics = ["diversity_index", "top10_pct"]
titles = ["Scaffold Diversity Index\n(higher = more diverse)",
"% Actives in Top-10 Scaffolds\n(lower = more diverse)"]
for ax, metric, title in zip(axes, metrics, titles):
vals = [results["approved"][metric], results["clinical"][metric]]
colours = ["#1565C0", "#E53935"]
ax.bar(["Approved", "Clinical"], vals, color=colours, alpha=0.8, width=0.5)
for i, v in enumerate(vals):
ax.text(i, v + 0.005, f"{v:.3f}", ha="center", va="bottom", fontsize=10)
ax.set_title(title, fontsize=10)
ax.set_ylim(0, max(vals) * 1.25)
plt.suptitle("Scaffold Diversity: Approved Drugs vs. Clinical Candidates", fontsize=11)
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/figures/scaffold_diversity.png", dpi=150)
print("Saved scaffold_diversity.png")
```
```bash
python scripts/08_scaffolds.py
```
**Validation:** `diversity_output/data/scaffold_stats.json` must contain `approved` and `clinical` keys.
---
## Step 10 — Compile Report
```python
# scripts/09_report.py
import sys, os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import json, base64, pathlib
import pandas as pd
from jinja2 import Template
from params import OUTPUT_DIR, COVERAGE_TANIMOTO
def img_b64(path):
return base64.b64encode(pathlib.Path(path).read_bytes()).decode()
cur = json.load(open(f"{OUTPUT_DIR}/data/curation_log.json"))
cov = json.load(open(f"{OUTPUT_DIR}/data/coverage_result.json"))
scaf = json.load(open(f"{OUTPUT_DIR}/data/scaffold_stats.json"))
ks = json.load(open(f"{OUTPUT_DIR}/data/ks_results.json"))
TEMPLATE = """
<!DOCTYPE html><html><head><meta charset="utf-8">
<title>Approved vs Clinical Diversity Audit</title>
<style>
body{font-family:Georgia,serif;max-width:960px;margin:40px auto;line-height:1.6;color:#222}
h1{color:#1a237e} h2{color:#283593;border-bottom:1px solid #ccc;padding-bottom:4px}
table{border-collapse:collapse;width:100%} th,td{border:1px solid #ddd;padding:8px}
th{background:#e8eaf6}
img{max-width:100%;border:1px solid #eee;border-radius:4px;margin:8px 0}
.stat{font-size:2em;font-weight:bold;color:#1565c0}
.label{font-size:0.85em;color:#555}
.grid{display:grid;grid-template-columns:1fr 1fr 1fr 1fr;gap:14px;margin:20px 0}
.card{background:#f5f5f5;border-radius:6px;padding:14px;text-align:center}
.highlight{background:#e3f2fd;padding:12px;border-left:4px solid #1565c0;margin:12px 0}
</style></head><body>
<h1>Molecular Diversity Audit: Approved Drugs vs. Clinical Candidates</h1>
<div class="grid">
<div class="card">
<div class="stat">{{ cur.approved.after_pains }}</div>
<div class="label">Approved drugs</div>
</div>
<div class="card">
<div class="stat">{{ cur.clinical.after_pains }}</div>
<div class="label">Clinical candidates</div>
</div>
<div class="card">
<div class="stat">{{ cov.coverage_pct }}%</div>
<div class="label">Coverage index<br>(Tanimoto ≥ {{ cov.tanimoto_threshold }})</div>
</div>
<div class="card">
<div class="stat">{{ cov.mean_max_tanimoto }}</div>
<div class="label">Mean nearest-neighbour<br>Tanimoto</div>
</div>
</div>
<div class="highlight">
<strong>Key finding:</strong>
{{ cov.coverage_pct }}% of approved drug chemical space is covered by at least
one clinical candidate (Tanimoto ≥ {{ cov.tanimoto_threshold }}).
The mean nearest-neighbour similarity from an approved drug to the clinical
pipeline is {{ cov.mean_max_tanimoto }} (median: {{ cov.median_max_tanimoto }}).
</div>
<h2>1. Data & Curation</h2>
<table>
<tr><th>Set</th><th>Raw</th><th>After standardisation</th><th>After dedup</th><th>After PAINS</th></tr>
<tr>
<td>Approved (phase 4)</td>
<td>{{ cur.approved.raw }}</td>
<td>{{ cur.approved.after_standardization }}</td>
<td>{{ cur.approved.after_deduplication }}</td>
<td><strong>{{ cur.approved.after_pains }}</strong></td>
</tr>
<tr>
<td>Clinical (phases 1–3)</td>
<td>{{ cur.clinical.raw }}</td>
<td>{{ cur.clinical.after_standardization }}</td>
<td>{{ cur.clinical.after_deduplication }}</td>
<td><strong>{{ cur.clinical.after_pains }}</strong></td>
</tr>
</table>
<h2>2. Chemical Space (UMAP)</h2>
<img src="data:image/png;base64,{{ umap_ds_b64 }}" alt="UMAP by dataset">
<img src="data:image/png;base64,{{ umap_ph_b64 }}" alt="UMAP by phase">
<h2>3. Coverage Analysis</h2>
<p>
For each approved drug, the maximum Tanimoto similarity to any clinical candidate
(Morgan FP, radius=2, 2048 bits) was computed. A compound is considered "covered"
if this maximum similarity ≥ {{ cov.tanimoto_threshold }}.
</p>
<table>
<tr><th>Metric</th><th>Value</th></tr>
<tr><td>Coverage index</td><td>{{ cov.coverage_pct }}%</td></tr>
<tr><td>Covered / total approved</td><td>{{ cov.covered_count }} / {{ cov.approved_total }}</td></tr>
<tr><td>Mean max Tanimoto</td><td>{{ cov.mean_max_tanimoto }}</td></tr>
<tr><td>Median max Tanimoto</td><td>{{ cov.median_max_tanimoto }}</td></tr>
<tr><td>Tanimoto threshold</td><td>{{ cov.tanimoto_threshold }}</td></tr>
</table>
<h2>4. Property Distributions</h2>
<img src="data:image/png;base64,{{ props_b64 }}" alt="Property distributions">
<table>
<tr><th>Property</th><th>Mean (Approved)</th><th>Mean (Clinical)</th><th>KS statistic</th><th>p-value</th></tr>
{% for prop, r in ks.items() %}
<tr>
<td>{{ prop }}</td>
<td>{{ r.mean_approved }}</td>
<td>{{ r.mean_clinical }}</td>
<td>{{ r.ks_stat }}</td>
<td>{{ r.p_value }}</td>
</tr>
{% endfor %}
</table>
<h2>5. Scaffold Diversity</h2>
<img src="data:image/png;base64,{{ scaf_b64 }}" alt="Scaffold diversity">
<table>
<tr><th>Metric</th><th>Approved</th><th>Clinical</th></tr>
<tr><td>Unique scaffolds</td>
<td>{{ scaf.approved.n_unique_scaffolds }}</td>
<td>{{ scaf.clinical.n_unique_scaffolds }}</td></tr>
<tr><td>Diversity index</td>
<td>{{ scaf.approved.diversity_index }}</td>
<td>{{ scaf.clinical.diversity_index }}</td></tr>
<tr><td>Top scaffold share (%)</td>
<td>{{ scaf.approved.top1_pct }}%</td>
<td>{{ scaf.clinical.top1_pct }}%</td></tr>
<tr><td>Top-10 scaffolds share (%)</td>
<td>{{ scaf.approved.top10_pct }}%</td>
<td>{{ scaf.clinical.top10_pct }}%</td></tr>
</table>
<h2>6. Conclusions</h2>
<p>
{{ cov.coverage_pct }}% of the approved drug chemical space has at least one
clinical-stage neighbour within Tanimoto ≥ {{ cov.tanimoto_threshold }},
leaving {{ "%.1f" | format(100 - cov.coverage_pct) }}% of approved chemical
space unoccupied by current clinical candidates — a potential opportunity gap.
The clinical pipeline is
{% if scaf.clinical.diversity_index > scaf.approved.diversity_index %}
more
{% else %}
less
{% endif %}
structurally diverse than the approved set
(diversity index {{ scaf.clinical.diversity_index }} vs.
{{ scaf.approved.diversity_index }}).
KS tests indicate that clinical candidates differ significantly from approved
drugs in {{ ks.values() | selectattr('p_value', 'lt', 0.05) | list | length }}
of {{ ks | length }} physicochemical properties tested.
</p>
</body></html>
"""
html = Template(TEMPLATE).render(
cur = cur,
cov = cov,
scaf = scaf,
ks = ks,
umap_ds_b64 = img_b64(f"{OUTPUT_DIR}/figures/umap_dataset.png"),
umap_ph_b64 = img_b64(f"{OUTPUT_DIR}/figures/umap_phase.png"),
props_b64 = img_b64(f"{OUTPUT_DIR}/figures/property_distributions.png"),
scaf_b64 = img_b64(f"{OUTPUT_DIR}/figures/scaffold_diversity.png"),
)
out = f"{OUTPUT_DIR}/report.html"
pathlib.Path(out).write_text(html, encoding="utf-8")
print(f"Report saved to {out}")
```
```bash
python scripts/09_report.py
```
**Expected final output:** `diversity_output/report.html` — open in any browser.
---
## Expected Outputs
```
diversity_output/
├── data/
│ ├── approved_raw.csv
│ ├── clinical_raw.csv
│ ├── approved_curated.csv
│ ├── clinical_curated.csv
│ ├── approved_enriched.csv
│ ├── clinical_enriched.csv
│ ├── approved_fps.npy
│ ├── clinical_fps.npy
│ ├── coverage_scores.csv
│ ├── curation_log.json
│ ├── coverage_result.json
│ ├── scaffold_stats.json
│ ├── ks_results.json
│ └── umap_coords.csv
├── figures/
│ ├── umap_dataset.png
│ ├── umap_phase.png
│ ├── property_distributions.png
│ └── scaffold_diversity.png
└── report.html ← primary deliverable
```
---
## Reproducing with Different Parameters
Edit `params.py`:
- Raise `COVERAGE_TANIMOTO` (e.g. to 0.6) for a stricter coverage definition
- Change `CLINICAL_PHASES` to `[3]` to analyse only late-stage candidates
- All outputs regenerate automaticallyDiscussion (0)
to join the discussion.
No comments yet. Be the first to discuss this paper.


