Use when designing software architecture for bioinformatics pipelines, defining data structures, planning scalability, or making technical design decisions for complex systems.
Install: `npx skill4agent add dangeles/claude systems-architect`

Workflow:

Biologist Commentator validates requirements
↓
Systems Architect designs architecture
↓
Produces technical specification
↓
Software Developer implements from spec

Related files and markers: `.architecture/context.md`, `references/architecture-context-template.md`, `src/modules/`, `[TBD]`, `[UNKNOWN]`, `[INFERRED]`

`assets/architecture_template.md`:

# System Architecture: [Project Name]
## Overview
[1-2 sentence system description]
## Components
1. [Component Name]: [Purpose]
2. [Component Name]: [Purpose]
## Data Flow
[Input] → [Processing] → [Output]
## Technology Stack
- Language: Python 3.11
- Key Libraries: pandas, numpy, scikit-learn
- Storage: HDF5 for matrices, SQLite for metadata
- Execution: Snakemake on HPC cluster
## Scalability
- Dataset size: [Expected range]
- Memory: [Requirements]
- Compute: [CPU cores, time estimates]
- Storage: [Space requirements]
## Error Handling
[Strategy for failures, retries, logging]
## Deployment
[Installation, configuration, execution]

`references/data_structure_guide.md`:

| Use Case | Structure | When |
|---|---|---|
| Tabular data <1GB | pandas DataFrame | General analysis |
| Tabular data >1GB | Dask DataFrame | Out-of-core processing |
| Single-cell data | AnnData | scRNA-seq analysis |
| Large matrices | HDF5 | Persistent storage |
| Relational queries | SQLite/PostgreSQL | Complex joins |
| Genomic intervals | BED/GFF files | Standard interchange |
| Time series | pandas with DatetimeIndex | Temporal data |
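The dense-vs-sparse cutoffs in the table follow from quick size arithmetic; a minimal sketch of that decision, where the function names and the 8 GB RAM budget are illustrative assumptions rather than part of the guide:

```python
def estimate_dense_mb(n_rows, n_cols, itemsize=8):
    # Dense in-memory size in MB for a numeric matrix of 8-byte floats.
    return n_rows * n_cols * itemsize / 1e6

def recommend_structure(n_rows, n_cols, ram_budget_mb=8_000):
    # Illustrative chooser: dense DataFrame if it fits comfortably in the
    # RAM budget, otherwise fall back to sparse or chunked storage.
    if estimate_dense_mb(n_rows, n_cols) < ram_budget_mb:
        return 'dense (pandas DataFrame)'
    return 'sparse or chunked (HDF5 / Dask)'

# 20,000 genes x 1,000 samples: 160 MB, fits in RAM
print(recommend_structure(20_000, 1_000))
# 20,000 genes x 100,000 cells: 16 GB, does not fit the budget
print(recommend_structure(20_000, 100_000))
```

The same arithmetic drives the memory estimates in the scalability notes below.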
Memory estimate (RNA-seq count matrix): genes × samples × 8 bytes

- 20,000 genes × 1,000 samples × 8 = 160 MB (fits in RAM)
- 20,000 genes × 100,000 cells × 8 = 16 GB (needs sparse storage or chunking)

Compute estimate (DESeq2 analysis): O(n_genes × n_samples²)

- 100 samples: ~5 minutes
- 1,000 samples: ~8 hours
- Strategy: subset for testing, full run overnight

Storage estimate (FASTQ, compressed): 50-100 MB per million reads

- 50M reads = 5 GB
- 100 samples × 50M reads = 500 GB
- Strategy: delete FASTQ after alignment, keep BAM

```python
# Pattern 1: Subprocess call
import subprocess

result = subprocess.run(
    ['fastqc', input_file, '-o', output_dir],
    capture_output=True, check=True
)

# Pattern 2: Python binding (preferred if available)
import pysam

bam = pysam.AlignmentFile(bam_file, 'rb')
```

```dockerfile
# Dockerfile approach for reproducibility
FROM python:3.11-slim
RUN pip install numpy pandas scikit-learn
COPY pipeline.py /app/
ENTRYPOINT ["python", "/app/pipeline.py"]
```

Bundled references: `.architecture/context.md`, `references/architecture_patterns.md`, `references/data_structure_guide.md`, `references/scalability_considerations.md`, `references/integration_patterns.md`, `references/architecture-context-template.md`.

`references/architecture-context-template.md`:

## Architecture Specification
### Overview
Parallel QC pipeline processing 1,000 bulk RNA-seq FASTQ files with automated report generation.
### Components
1. Validator: Check FASTQ integrity, format
2. QC Runner: Execute FastQC in parallel
3. Aggregator: Combine metrics with MultiQC
4. Reporter: Generate summary statistics and plots
### Data Flow
FASTQ files → Validator → QC Runner (parallel) → Aggregator → HTML Report
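The parallel QC Runner stage could be sketched in plain Python (in the actual pipeline Snakemake does this scheduling on the cluster); `fastqc_command`, `run_one`, and `run_all` are illustrative names, not part of the spec:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

def fastqc_command(fastq_path, output_dir):
    # Build the FastQC invocation for one sample.
    return ['fastqc', fastq_path, '-o', output_dir]

def run_one(fastq_path, output_dir):
    # check=True raises on a non-zero exit, so failures surface per sample.
    return subprocess.run(fastqc_command(fastq_path, output_dir),
                          capture_output=True, check=True)

def run_all(fastq_paths, output_dir, max_workers=4):
    # A process pool approximates Snakemake's per-sample parallel jobs.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_one, p, output_dir) for p in fastq_paths]
        return [f.result() for f in futures]
```

Keeping command construction separate from execution makes each sample's invocation easy to log and to replay by hand when a job fails.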
### Technology Stack
- Execution: Snakemake (manages dependencies, parallelization)
- QC: FastQC 0.12.1
- Aggregation: MultiQC 1.14
- Custom code: Python 3.11, pandas, matplotlib
- Storage: FASTQ (gzip), QC metrics (JSON), report (HTML)
### Scalability
- Data: 1,000 samples × 50M reads × 100 bp ≈ 5 GB compressed FASTQ per sample, 5 TB total
- Compute: 100 parallel jobs on HPC cluster
- Time: 30 min per sample, 100 at a time → 10 waves × 30 min = 300 min (5 hours)
- Memory: 4 GB per FastQC job = 400 GB total (distributed)
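The wall-time figure is a waves calculation: with 100 parallel jobs, 1,000 samples run in 10 waves of 30 minutes each. A sketch (the helper name is illustrative):

```python
import math

def wall_time_minutes(n_samples, parallel_jobs, minutes_per_sample):
    # Samples run in "waves" of parallel_jobs at a time; the last wave
    # may be partial, hence the ceiling.
    waves = math.ceil(n_samples / parallel_jobs)
    return waves * minutes_per_sample

# 1,000 samples, 100 parallel jobs, 30 min each -> 300 minutes (5 hours)
print(wall_time_minutes(1_000, 100, 30))
```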
### Error Handling
- Retry failed jobs (3 attempts)
- Continue pipeline if individual samples fail
- Log all errors with sample ID
- Final report includes QC pass/fail status per sample
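One way to realize the retry-and-continue policy is a small per-sample wrapper; a sketch under the spec's three-attempt rule, with illustrative names:

```python
import logging

def run_with_retries(step, sample_id, attempts=3):
    # Try a per-sample step up to `attempts` times. Every failure is
    # logged with the sample ID; after the final failure we return None
    # so the pipeline continues past the bad sample.
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:
            logging.error('sample %s attempt %d failed: %s',
                          sample_id, attempt, exc)
    return None
```

A `None` result marks the sample as QC-failed in the final report rather than aborting the whole run.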
### Deployment
- Install: micromamba env from environment.yml
- Config: samples.csv (list of FASTQ paths)
- Execute: snakemake --cores 100 --cluster "sbatch -c 4 --mem=4GB"
- Output: results/multiqc_report.html
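Before submitting cluster jobs, the Validator component could pre-flight `samples.csv`; a minimal sketch that assumes the file has a `fastq` column of paths (that column name is an assumption about the config layout):

```python
import csv
from pathlib import Path

def read_sample_paths(samples_csv):
    # samples.csv is assumed to have a 'fastq' header column listing
    # one FASTQ path per row.
    with open(samples_csv, newline='') as fh:
        return [row['fastq'] for row in csv.DictReader(fh)]

def missing_files(paths):
    # Report paths that do not exist, before any cluster jobs are submitted.
    return [p for p in paths if not Path(p).exists()]
```

Failing fast here is cheap compared to discovering a bad path 100 jobs into a 5-hour run.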