Protein-Report: A Reproducible, One-Command Protein Sequence Analysis Pipeline with Domain, Homology, and Report-First Outputs
Abstract
We present protein-report, a Python-based, one-command pipeline that transforms a raw protein FASTA sequence into a comprehensive, publication-ready analysis report (bookmarked PDF + Markdown). The pipeline integrates physicochemical property computation (Biopython ProtParam), Kyte-Doolittle hydropathy profiling, asynchronous EBI InterProScan domain annotation, EBI BLASTP homology search against SwissProt/Reviewed, and structured AI-assisted functional prediction. Each analysis run is fully isolated into timestamped output folders, ensuring reproducibility and non-destructive workflows. Network-dependent steps (InterProScan, BLAST) employ async submit/poll/fetch with retry logic and graceful timeout degradation, guaranteeing that a partial network failure never blocks report generation. We demonstrate the pipeline on a 317-residue Ribose-phosphate pyrophosphokinase sequence, achieving complete domain annotation (15 domains across 8 databases) and a 100% identity top BLAST hit (P14193). protein-report is designed as a skill for AI agent platforms, enabling any agent to execute end-to-end protein bioinformatics analysis without manual intervention. Source code and example outputs are available at https://github.com/Wuhl00/protein-report.
Keywords: protein analysis, reproducible research, bioinformatics pipeline, InterProScan, BLAST, AI agent skill
1. Introduction
1.1 Background
Protein sequence analysis is a foundational task in bioinformatics. A typical workflow involves multiple steps: computing physicochemical properties, generating hydropathy profiles, running domain annotation via InterProScan, performing homology searches via BLAST, and synthesizing results into a coherent report. Each step typically requires a different tool, format conversion, and manual integration — a process that is time-consuming, error-prone, and difficult to reproduce.
1.2 Motivation
The rise of AI agent platforms (such as OpenClaw, Claude Code, and similar systems) introduces a new paradigm: skills — executable, self-contained instructions that allow AI agents to perform complex tasks autonomously. Unlike traditional bioinformatics pipelines (e.g., Galaxy, Snakemake workflows), which require a dedicated environment and manual configuration, a skill can be executed by any AI agent with access to a standard Python environment and the internet.
This paper presents protein-report, a protein sequence analysis pipeline packaged as a skill. The design goals are:
- One-command execution: A single `python protein_analyzer.py` command produces a complete report.
- Reproducibility: Each run is isolated; all outputs are timestamped and self-contained.
- Resilience: Network failures in external API calls (InterProScan, BLAST) never block the full report.
- Agent-native: Packaged as a SKILL.md file that any compatible AI agent can consume and execute.
1.3 Contributions
- A fully integrated, one-command protein analysis pipeline covering physicochemical profiling, domain annotation, homology search, and AI-assisted functional prediction.
- An async submit/poll/fetch architecture for external API calls with retry logic and graceful degradation.
- A reproducibility-oriented skill format (SKILL.md) that enables any AI agent to clone, install, and execute the pipeline from a single instruction set.
- Demonstration on a real-world sequence with complete results.
2. Methodology
2.1 Pipeline Architecture
The pipeline follows a sequential architecture with five analysis modules followed by a report-generation step:
Input (FASTA)
|
v
[1] Physicochemical Properties (Biopython ProtParam)
|
v
[2] Hydropathy Plot (Kyte-Doolittle, Matplotlib)
|
v
[3] Domain Analysis (EBI InterProScan, async)
|
v
[4] Homology Search (EBI BLASTP vs SwissProt, async)
|
v
[5] AI Functional Summary (structured synthesis)
|
v
[6] Report Generation (PDF + Markdown)
2.2 Module Details
2.2.1 Physicochemical Properties
Computed locally using Biopython's ProtParam module. Metrics include:
| Metric | Description |
|---|---|
| Length | Number of amino acid residues |
| Molecular Weight | Estimated in Daltons |
| Isoelectric Point (pI) | Bjellqvist scale |
| Instability Index | <40 stable, >40 unstable |
| Aromaticity | Relative frequency of aromatic residues |
| GRAVY | Grand Average of Hydropathy; negative = hydrophilic, positive = hydrophobic |
No external API calls are required. This module always succeeds.
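To make two of the simpler metrics concrete, here is a minimal standard-library sketch. It is not the pipeline's code, which relies on Biopython's ProtParam. The function names are hypothetical; the cutoffs are the ones given in the table above, and the handling of a GRAVY of exactly zero is an assumption.

```python
from collections import Counter

AROMATIC = "FWY"  # phenylalanine, tryptophan, tyrosine


def aromaticity(seq: str) -> float:
    """Relative frequency of aromatic residues in the sequence."""
    counts = Counter(seq)
    return sum(counts[aa] for aa in AROMATIC) / len(seq)


def stability_label(instability_index: float) -> str:
    """Apply the 40-point instability cutoff from the metrics table."""
    return "Stable" if instability_index < 40 else "Unstable"


def gravy_label(gravy: float) -> str:
    """Interpret the sign of GRAVY; treating 0 as neutral is an assumption."""
    if gravy < 0:
        return "hydrophilic"
    if gravy > 0:
        return "hydrophobic"
    return "neutral"
```

For example, `stability_label(39.11)` gives `"Stable"`, which matches the label reported for the demonstration sequence in Section 3.2.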
2.2.2 Hydropathy Plot
The Kyte-Doolittle hydropathy scale is applied with a sliding window of 19 residues (standard for transmembrane helix prediction). The plot is generated using Matplotlib and saved as hydrophobicity.png.
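The sliding-window computation can be sketched as follows. This is an illustrative standard-library version, not the pipeline's implementation; the function name `hydropathy_profile` is hypothetical, and the choice to emit one value per full window (no padding at the ends) is an assumption. The scale values are those of Kyte and Doolittle (1982).

```python
# Kyte-Doolittle hydropathy values per residue.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19) -> list:
    """Mean hydropathy of every full window along the sequence."""
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KD[aa] for aa in seq]
    total = sum(scores[:window])
    profile = [total / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]
        profile.append(total / window)
    return profile
```

A sequence of length n yields n - window + 1 values, so the 317-residue demonstration sequence would give 299 points with the default window of 19.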
2.2.3 Domain Analysis (InterProScan)
This module interfaces with the EBI InterProScan REST API using an asynchronous submit/poll/fetch pattern:
- Submit: POST the protein sequence to `https://www.ebi.ac.uk/interpro/api/sequence/segment/`. Returns a submission ID and a status URL.
- Poll: Periodically GET the status URL. Retries with exponential backoff on transient HTTP errors.
- Fetch: Once status is `DONE`, retrieve the XML/JSON results. Parse domain hits including position, name, accession, and source database.
Results are rendered as both a visual domain map (domain_map.png) and a tabular summary sorted by sequence position. Clickable links to InterPro, Pfam, SMART, PANTHER, CDD, and other databases are embedded in the report.
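The poll step with exponential backoff might look roughly like the sketch below. It is deliberately network-free: `get_status` is a caller-supplied callable standing in for the HTTP GET on the status URL, and `TransientError`, `poll_until_done`, and all default values except the five-attempt limit (taken from Section 2.4) are assumptions made for illustration.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as an HTTP 5xx response."""


def poll_until_done(get_status, max_attempts=5, base_delay=1.0,
                    interval=5.0, max_polls=60, sleep=time.sleep):
    """Poll get_status() until it returns "DONE".

    Consecutive transient errors are retried with exponentially growing
    delays (base_delay, 2 * base_delay, 4 * base_delay, ...). The error is
    re-raised once max_attempts consecutive failures have occurred.
    Returns True when the job is done, False if max_polls runs out first.
    """
    failures = 0
    for _ in range(max_polls):
        try:
            status = get_status()
        except TransientError:
            failures += 1
            if failures >= max_attempts:
                raise
            sleep(base_delay * 2 ** (failures - 1))
            continue
        failures = 0  # a successful call resets the backoff
        if status == "DONE":
            return True
        sleep(interval)
    return False
```

Injecting `sleep` keeps the function testable without real waiting; in production the default `time.sleep` applies.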
2.2.4 Homology Search (BLASTP)
BLASTP is run against the SwissProt/Reviewed database via the EBI NCBI BLAST REST API:
- Submit: POST to the BLAST API with the query sequence and database parameter set to `swissprot`.
- Poll: Check job status with retry logic.
- Parse: Extract top hits with accession, identity percentage, E-value, description, and clickable UniProt links.
Timeout and degradation: A hard timeout of 180 seconds is enforced. If BLAST does not complete within this window, the module gracefully degrades — the report is generated with a note that BLAST timed out, and a direct link to the NCBI BLAST web portal is provided for manual retry.
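A hard deadline with graceful degradation can be sketched as below. This is an assumed shape, not the pipeline's code: `wait_for_job` and `blast_section` are hypothetical names, `is_finished` stands in for the real status check, and the portal URL is the public NCBI BLAST address, which may differ from the link the report actually embeds.

```python
import time

NCBI_BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"  # assumed link target


def wait_for_job(is_finished, timeout_s=180.0, interval_s=5.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Return True if is_finished() succeeds before the deadline, else False."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_finished():
            return True
        sleep(interval_s)
    return False


def blast_section(hits):
    """Render hit lines, or a timeout note with a manual-retry link if hits is None."""
    if hits is None:
        return f"BLAST timed out after 180 s. Retry manually: {NCBI_BLAST_URL}"
    return "\n".join(hits)
```

The caller would pass `None` to `blast_section` whenever `wait_for_job` returns `False`, so the rest of the report is still written.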
2.2.5 AI Functional Summary
Based on the collected data, a structured English-language summary is synthesized with three sections:
- Investigation Summary: Key findings from physicochemical analysis, domain annotation, and homology search.
- Functional Prediction: Inference of potential biochemical function based on domain composition and homology.
- Related Literature: A PubMed search link constructed from identified domain names for further reading.
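One plausible way to build the PubMed link is shown below. The query format (quoted domain names joined with OR) is an assumption, as is the function name `pubmed_search_url`; only the PubMed base URL and its `term` parameter are standard.

```python
from urllib.parse import urlencode


def pubmed_search_url(domain_names):
    """Build a PubMed search URL that ORs the quoted domain names together."""
    term = " OR ".join(f'"{name}"' for name in domain_names)
    return "https://pubmed.ncbi.nlm.nih.gov/?" + urlencode({"term": term})
```

Using `urlencode` ensures that quotes and spaces in domain names are percent-encoded rather than breaking the link.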
2.2.6 Report Generation
Two output formats are produced:
- PDF (`<FASTA_ID>_report.pdf`): Generated with `fpdf`, then post-processed with `PyPDF2` to add sidebar bookmarks corresponding to each major section. External links (UniProt, InterPro, AlphaFold, PubMed) are clickable.
- Markdown (`<FASTA_ID>_report.md`): A fully structured Markdown file with tables, image references, and hyperlinks that is easy to edit, share, or import into other tools.
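As an illustration of the Markdown side only, a table such as the physicochemical summary could be rendered with a small helper like this. The helper name `markdown_table` is hypothetical and the sketch does not reflect how the pipeline's report writer is actually organized.

```python
def markdown_table(headers, rows):
    """Render a pipe-delimited Markdown table from headers and row tuples."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "---|" * len(headers),
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)
```

Calling `markdown_table(["Metric", "Value"], [("Length", "317 aa")])` produces a header row, a separator row, and one data row in the same style as the tables in this paper.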
2.3 Reproducibility Design
Each run creates an isolated output folder:
analysis_runs/<FASTA_ID>_YYYYMMDD_HHMMSS/
This design ensures:
- Multiple analyses on different sequences never overwrite each other.
- Re-running the same sequence at different times produces separate, timestamped results.
- The entire output folder can be archived or shared as a self-contained result set.
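The folder naming scheme above can be reproduced with the standard library as in the following sketch. The function names are hypothetical, and refusing to reuse an existing folder (`exist_ok=False`) is an assumption consistent with the non-overwriting goal rather than a confirmed detail of the pipeline.

```python
from datetime import datetime
from pathlib import Path


def run_dir_name(fasta_id: str, when: datetime) -> str:
    """Format <FASTA_ID>_YYYYMMDD_HHMMSS for a given timestamp."""
    return f"{fasta_id}_{when:%Y%m%d_%H%M%S}"


def make_run_dir(fasta_id: str, root: str = "analysis_runs", now=None) -> Path:
    """Create and return a fresh, isolated output folder for one run."""
    when = now or datetime.now()
    path = Path(root) / run_dir_name(fasta_id, when)
    # exist_ok=False makes an accidental overwrite fail loudly.
    path.mkdir(parents=True, exist_ok=False)
    return path
```

With the timestamp of the example run, `run_dir_name("UserSeq_1", datetime(2026, 3, 24, 12, 40, 42))` returns `UserSeq_1_20260324_124042`.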
2.4 Error Handling and Resilience
| Scenario | Behavior |
|---|---|
| InterProScan transient error | Retry with exponential backoff (up to 5 attempts) |
| InterProScan timeout | Skip domain section; report generated with remaining sections |
| BLAST timeout (180s) | Graceful degradation; report includes NCBI BLAST portal link |
| Network unavailable | Offline modules (physicochemical, plotting) complete normally |
| Invalid FASTA input | Early validation with clear error message |
The pipeline follows a "best-effort completion" principle: any module failure degrades the report gracefully rather than blocking it entirely.
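The best-effort principle can be expressed as a small runner that isolates each module, sketched here under the assumption that every module is a zero-argument callable. The name `run_best_effort` and the note format are hypothetical.

```python
def run_best_effort(modules):
    """Run (name, callable) pairs in order; a failure never stops later modules.

    Returns a dict of results (None for failed modules) and a list of
    human-readable notes that a report writer could include.
    """
    results, notes = {}, []
    for name, func in modules:
        try:
            results[name] = func()
        except Exception as exc:  # deliberately broad: any failure only degrades
            results[name] = None
            notes.append(f"{name} unavailable: {exc}")
    return results, notes
```

A report generator would then render a section for every non-`None` result and append the notes where a section is missing.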
3. Results
3.1 Demonstration Sequence
We demonstrate the pipeline on a 317-residue protein sequence (UserSeq_1) from the repository's example dataset. The input FASTA sequence:
>UserSeq_1
MSNQYGDKNLKIFSLNSNPELAKEIADIVGVQLGKCSVTRFSDGEVQINIEESIRGCDCY
IIQSTSDPVNEHIMELLIMVDALKRASAKTINIVIPYYGYARQDRKARSREPITAKLFAN
LLETAGATRVIALDLHAPQIQGFFDIPIDHLMGVPILGEYFEGKNLEDIVIVSPDHGGVT
RARKLADRLKAPIAIIDKRRPRPNVAEVMNIVGNIEGKTAILIDDIIDTAGTITLAANAL
VENGAKEVYACCTHPVLSGPAVERINNSTIKELVVTNSIKLPEEKKIERFKQLSVGPLLA
EAIIRVHEQQSVSYLFS
3.2 Physicochemical Properties
| Metric | Value |
|---|---|
| Length | 317 aa |
| Molecular Weight | 34,867.86 Da |
| Isoelectric Point (pI) | 5.94 |
| Instability Index | 39.11 (Stable) |
| Aromaticity | 0.050 |
| GRAVY | -0.018 (slightly hydrophilic) |
3.3 Domain Annotation
InterProScan identified 15 domain hits across 8 databases; the table below lists 10 of them, covering all 8 databases:
| Position | Domain Name | Accession | Database |
|---|---|---|---|
| 7-317 | PRK01259.1 | NF002320 | NCBIFAM |
| 8-317 | Ribose-phosphate diphosphokinase family | PTHR10210 | PANTHER |
| 10-126 | Pribosyltran_N_2 | SM01400 | SMART |
| 10-317 | RibP_PPkinase_B | MF_00583_B | HAMAP |
| 10-126 | Pribosyltran_N | PF13793 | PFAM |
| 11-317 | ribP_PPkin | TIGR01251 | NCBIFAM |
| 75-308 | PRTase-like | SSF53271 | SUPERFAMILY |
| 134-149 | PRPP_SYNTHASE | PS00114 | PROSITE |
| 154-278 | PRTases_typeI | cd06223 | CDD |
| 208-316 | Pribosyl_synth | PF14572 | PFAM |
The domain architecture reveals this protein belongs to the ribose-phosphate pyrophosphokinase (PRPP synthase) family, a well-characterized enzyme in nucleotide biosynthesis.
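The rows listed in the table can be tallied and ordered with a few lines of Python. The tuples below are copied from the table; the tuple layout and function names are assumptions for illustration and do not reflect the pipeline's internal data model.

```python
from collections import Counter

# (start, end, accession, database) rows copied from the table above.
HITS = [
    (7, 317, "NF002320", "NCBIFAM"),
    (8, 317, "PTHR10210", "PANTHER"),
    (10, 126, "SM01400", "SMART"),
    (10, 317, "MF_00583_B", "HAMAP"),
    (10, 126, "PF13793", "PFAM"),
    (11, 317, "TIGR01251", "NCBIFAM"),
    (75, 308, "SSF53271", "SUPERFAMILY"),
    (134, 149, "PS00114", "PROSITE"),
    (154, 278, "cd06223", "CDD"),
    (208, 316, "PF14572", "PFAM"),
]


def hits_per_database(hits):
    """Count how many hits each source database contributed."""
    return Counter(db for _, _, _, db in hits)


def sort_by_position(hits):
    """Order hits by start position, then by end position."""
    return sorted(hits, key=lambda hit: (hit[0], hit[1]))
```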
3.4 Homology Search
BLASTP against SwissProt/Reviewed returned a 100% identity top hit:
| Rank | Accession | Identity | E-value | Description |
|---|---|---|---|---|
| 1 | P14193 | 100% | 0.0 | Ribose-phosphate pyrophosphokinase (Bacillus subtilis) |
| 2 | Q81J97 | 85% | 0.0 | Ribose-phosphate pyrophosphokinase (Bacillus cereus) |
| 3 | Q81VZ0 | 85% | 0.0 | Ribose-phosphate pyrophosphokinase (Bacillus anthracis) |
| 4 | O33924 | 85% | 0.0 | Ribose-phosphate pyrophosphokinase (Corynebacterium ammoniagenes) |
| 5 | Q8EU34 | 79% | 0.0 | Ribose-phosphate pyrophosphokinase (Oceanobacillus iheyensis) |
3.5 Secondary Structure Prediction
The AI synthesis module estimated secondary structure propensity:
- Alpha-helix: 32.2%
- Beta-sheet: 37.2%
- Coil/Loop: 27.1%
3.6 Output Files
The pipeline produced the following outputs in analysis_runs/UserSeq_1_20260324_124042/:
| File | Description | Size |
|---|---|---|
| UserSeq_1_report.pdf | Bookmarked PDF report with clickable links | ~124 KB |
| UserSeq_1_report.md | Markdown report with embedded tables | ~6 KB |
| hydrophobicity.png | Kyte-Doolittle hydropathy profile | ~57 KB |
| domain_map.png | InterProScan domain architecture | ~52 KB |
4. Discussion
4.1 Design Trade-offs
SwissProt vs. nr: We deliberately limit BLAST to SwissProt/Reviewed sequences. This sacrifices coverage (SwissProt contains ~570K sequences vs. ~200M in nr) but improves result reliability, because every hit comes from a manually reviewed entry. Users requiring broader searches are directed to the NCBI BLAST web portal.
Timeout strategy: The 180-second BLAST timeout was chosen as a practical balance. For most sequences under 1000 residues, SwissProt BLAST completes within 60-120 seconds. Longer sequences or complex queries may time out, but the graceful degradation ensures users still receive all other analysis results plus a path to retry.
Local vs. Cloud: All compute-intensive steps (physicochemical analysis, plotting) run locally. Only database lookups (InterProScan, BLAST) require network access. This minimizes dependency on external service availability.
4.2 Agent-Native Design
The skill format (SKILL.md) is designed for AI agent consumption. Unlike traditional software documentation, it follows a strict structure:
- Clone: one `git clone` command
- Install: one `pip install` command
- Run: one `python` command
- Verify: comparison against known example output
Any agent with shell access and Python can execute this pipeline without understanding the underlying bioinformatics. This is the key advantage over traditional pipelines: the skill is the documentation is the reproducibility protocol.
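A minimal form of the verify step is to check that a run folder contains the four files listed in Section 3.6. The sketch below does only that; it does not compare file contents with the example run, and the function names are hypothetical.

```python
from pathlib import Path


def expected_outputs(fasta_id: str) -> list:
    """File names a complete run is expected to produce."""
    return [
        f"{fasta_id}_report.pdf",
        f"{fasta_id}_report.md",
        "hydrophobicity.png",
        "domain_map.png",
    ]


def missing_outputs(run_dir, fasta_id: str) -> list:
    """Return the expected files that are absent from run_dir."""
    root = Path(run_dir)
    return [name for name in expected_outputs(fasta_id)
            if not (root / name).is_file()]
```

An empty list from `missing_outputs` means every expected file is present. A stricter check could additionally compare the Markdown report with the example, keeping in mind that network-dependent sections may legitimately differ between runs.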
4.3 Limitations
- Single sequence per run: The pipeline currently processes one FASTA entry at a time.
- No structural prediction: While links to AlphaFold/ColabFold are provided, the pipeline does not perform structure prediction.
- EBI API dependency: InterProScan and BLAST depend on EBI service availability; the fallback strategy mitigates but does not eliminate this dependency.
- PDF rendering: The fpdf library has limited Unicode support; CJK characters in PDFs may not render correctly.
5. Conclusion
protein-report demonstrates that a complete protein bioinformatics analysis pipeline can be packaged as a single, reproducible skill executable by any AI agent. The combination of local computation, async API integration, graceful degradation, and timestamped output isolation achieves both robustness and reproducibility. With a 317-residue demonstration sequence, the pipeline successfully identified 15 domain annotations across 8 databases and a 100% identity homology match, producing a publication-ready report in under 5 minutes.
The skill format (SKILL.md) with its clone-install-run-verify structure represents a new paradigm for reproducible bioinformatics: instead of sharing environments, we share instructions that any agent can execute autonomously.
5.1 Future Work
- Multi-sequence batch processing support
- Integration with AlphaFold DB API for automated structure retrieval
- Interactive HTML report output
- Support for nucleotide-to-protein workflows (BLASTX, ORF finding)
6. References
- The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Research, 2023.
- Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics, 2014.
- Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 1997.
- Kyte, J. & Doolittle, R.F. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 1982.
- Gasteiger, E. et al. ProtParam: computing physicochemical properties from amino acid sequences. Nucleic Acids Research, 2005.
Appendix: Skill File
See the accompanying SKILL.md for the complete reproduction protocol. The skill enables any AI agent to:
git clone https://github.com/Wuhl00/protein-report.git
cd protein-report
pip install -r main_scripts/requirements.txt
# Place sequence in main_scripts/input.fasta, then:
cd main_scripts
python protein_analyzer.py
Example output is available at: example/UserSeq_1_20260324_124042/
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: protein-report
description: >-
  Protein sequence analysis Skill. Takes a protein FASTA sequence and
  automatically runs physicochemical analysis, hydropathy plotting, EBI
  InterProScan domain analysis, EBI BLAST homology search, and a structured AI
  summary. Outputs a bookmarked PDF and a Markdown report (one output folder
  per run).
---

# Protein Sequence Deep Analysis Skill (protein-report)

## Overview

A reproducible, one-command protein sequence analysis pipeline. Provide a protein FASTA sequence and receive a publication-ready PDF report (with sidebar bookmarks) plus a Markdown report, all in a single run, fully isolated into timestamped output folders.

## Reproduction Steps

### 1. Clone the repository

```bash
git clone https://github.com/Wuhl00/protein-report.git
cd protein-report
```

### 2. Install dependencies

Requires Python >= 3.8 (recommended: 3.10+).

```bash
pip install -r main_scripts/requirements.txt
```

Dependencies: biopython, requests, matplotlib, fpdf, pandas, numpy, lxml, PyPDF2

### 3. Prepare input

Place your protein sequence in standard FASTA format into `main_scripts/input.fasta`. Example:

```fasta
>Sample_Protein
MAVSRSSRLRLGRALAAAAAATAVALPAVAVAGPPAVAAAAA
```

A sample input is also available at `example/input.fasta`.

### 4. Run the analysis

```bash
cd main_scripts
python protein_analyzer.py
```

### 5. Locate outputs

Each run creates an isolated output folder:

```
analysis_runs/<FASTA_ID>_YYYYMMDD_HHMMSS/
```

Inside you will find:

- `<FASTA_ID>_report.pdf`: bookmarked PDF report
- `<FASTA_ID>_report.md`: Markdown report
- `hydrophobicity.png`: Kyte-Doolittle hydropathy plot
- `domain_map.png`: InterProScan domain architecture visualization

### 6. Verify reproduction

Compare your output against the example run in `example/UserSeq_1_20260324_124042/`. The example uses a 317 aa Ribose-phosphate pyrophosphokinase sequence and produces a full report with 15 domain annotations and a 100% identity BLAST hit (P14193).

## Analysis Modules

| Module | Method | Source |
|---|---|---|
| Physicochemical properties | ProtParam | Biopython (local) |
| Hydropathy plot | Kyte-Doolittle | Matplotlib (local) |
| Domain analysis | InterProScan REST API | EBI (async submit/poll/fetch) |
| Homology search | BLASTP vs SwissProt/Reviewed | EBI (async submit/poll/fetch) |
| AI functional summary | Structured synthesis | English report sections |
| PDF bookmarks | PyPDF2 outline | Post-generation (local) |

## Network Dependency Notes

- InterProScan and BLAST rely on EBI web services.
- Transient network errors are retried automatically.
- BLAST has a hard timeout of 180 seconds; if exceeded, the report is still generated with remaining sections intact (graceful degradation).
- Physicochemical analysis and plotting run entirely offline.

## Tech Stack

- **Parsing & analysis**: Biopython (FASTA parsing, ProtParam metrics)
- **Plotting**: Matplotlib (hydropathy + domain map)
- **Domain search**: EBI InterProScan REST API (async submit/poll/fetch)
- **Homology search**: EBI NCBI BLAST REST API (async submit/poll/fetch)
- **PDF generation**: fpdf + PyPDF2 (sidebar bookmarks/outline)