GWASEngine is a complete GWAS analysis pipeline implemented entirely in Python using NumPy, SciPy, and scikit-learn. Six modules: QC, linear regression GWAS, LD clumping, polygenic risk scores (C+T), Bayesian fine-mapping (Wakefield ABF), and LD Score Regression.
We present MetaGenomics, a pure NumPy/SciPy/scikit-learn metagenomics analysis engine implemented entirely in Python without external bioinformatics frameworks (no QIIME2, mothur, HUMAnN3, or R). MetaGenomics bundles six published statistical methods: (1) taxonomic profiling with rarefaction and CLR normalization, (2) alpha diversity (Shannon, Simpson, Chao1, Pielou evenness), (3) beta diversity with PCoA ordination and PERMANOVA significance testing, (4) differential abundance via LEfSe, ALDEx2, and ANCOM-BC, (5) functional profiling with COG/KEGG mapping and ARG detection across 20 resistance gene classes, and (6) SparCC-inspired co-occurrence network inference.
CancerGenomics is a self-contained Python pipeline for tumor genomic analysis using only NumPy, SciPy, and scikit-learn — no GATK, CNVkit, maftools, or R required. The engine provides six analysis modules: (1) Circular Binary Segmentation for copy-number variation detection, (2) TMB/MSI computation from somatic mutation calls, (3) COSMIC SBS96 mutational signature decomposition via NNLS, (4) MHC-I neoantigen prediction using position weight matrices, (5) clonal architecture inference via cancer cell fraction estimation and KMeans clustering, and (6) genomic instability scoring including LOH fraction and HRD score.
We present HiCAnalysis, a complete Hi-C chromatin 3D genome analysis pipeline implemented entirely in NumPy/SciPy — no cooler, no cooltools, no Juicer, no HiCExplorer, no R HiTC. The engine provides five analysis modules: (1) ICE normalization for bias correction, (2) insulation score and directionality index for TAD boundary detection, (3) PCA-based A/B compartment calling with GC-content guided eigenvector orientation, (4) HICCUPS-inspired chromatin loop detection using enrichment and Poisson p-values, and (5) differential TAD analysis with permutation significance testing.
We present ProteinStability, a training-free protein thermodynamic stability prediction pipeline implemented in pure NumPy. Given only a protein sequence, it estimates ΔΔG for all possible single-point mutations using a 19-feature model combining Miyazawa-Jernigan inter-residue potentials, hydrophobicity, secondary structure context, and sequence-derived contact maps.
We conduct the largest study to date on type annotations, analyzing 40,799 instances across 8 datasets spanning multiple domains. Our key finding is that python accounts for 16.