SAMtools Plugin
SAMtools plugin for GeneForgeLang that enables sequence alignment manipulation and analysis directly from GFL workflows.
Overview
This plugin provides integration with SAMtools, allowing users to manipulate and analyze sequence alignment files (SAM/BAM/CRAM) directly from GeneForgeLang workflows. It supports key SAMtools operations including sorting, indexing, filtering, and statistics generation.
Features
- File Manipulation: Sort, index, and convert alignment files
- Filtering: Filter alignments based on various criteria
- Statistics: Generate alignment statistics and metrics
- Configurable Parameters: Set parameters for each SAMtools operation
- Structured Output: Returns parsed results in a clean JSON format
- GFL Integration: Seamlessly integrates with GeneForgeLang's plugin system
Installation
pip install gfl-plugin-samtools
Note: This plugin requires SAMtools to be installed separately. Install SAMtools using your system's package manager or download from the official website.
Usage
Once installed, the plugin is automatically discovered by the GeneForgeLang service through entry points.
Example GFL Workflow
run:
- plugin: "samtools"
operation: "sort"
input_bam: "/path/to/unsorted.bam"
output_bam: "/path/to/sorted.bam"
as_var: "sorting_results"
- plugin: "samtools"
operation: "index"
input_bam: "/path/to/sorted.bam"
as_var: "indexing_results"
output:
- sorted_bam: "${sorting_results.output_bam}"
- bam_index: "${indexing_results.output_bai}"
Parameters
operation: The SAMtools operation to perform (sort, index, view, stats, etc.)input_bam: Path to the input BAM fileoutput_bam: Path to the output BAM file (for operations that create new files)- Additional operation-specific parameters
Supported Operations
sort
Sorts SAM/BAM/CRAM files by chromosome and genomic coordinate.
Parameters:
- input_bam: Input alignment file
- output_bam: Output sorted file
- sort_order: Sort order (coordinate, queryname, etc.)
- memory_limit: Memory limit for sorting
index
Creates an index for a sorted BAM/CRAM file.
Parameters:
- input_bam: Input sorted alignment file
- output_bai: Output index file (optional, auto-generated if not specified)
view
Converts between SAM/BAM/CRAM formats and filters alignments.
Parameters:
- input_file: Input alignment file
- output_file: Output file
- regions: Genomic regions to include (optional)
- filters: Filtering criteria (mapping quality, flags, etc.)
stats
Generates comprehensive statistics for alignment files.
Parameters:
- input_bam: Input alignment file
- output_stats: Output statistics file
- regions: Genomic regions to analyze (optional)
flagstat
Counts the number of alignments for each FLAG type.
Parameters:
- input_bam: Input alignment file
- output_flagstat: Output flagstat file
Output Format
The plugin returns structured results containing:
operation: The SAMtools operation that was performedoutput_files: Dictionary of output files createdmetrics: Performance metrics and statisticsstatus: Execution status (success/failure)
Requirements
- GeneForgeLang >= 1.0.0
- SAMtools >= 1.9 (separate installation required)
- pysam >= 0.15.0
API Reference
Class: SamtoolsPlugin
Methods
__init__(self, config: Optional[Dict[str, Any]] = None)
Initialize the SAMtools plugin.
Parameters:
- config: Optional configuration dictionary
run(self, input_data: Any, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]
Execute a SAMtools operation.
Parameters:
- input_data: Input data containing file paths and parameters
- params: Optional parameters for the SAMtools operation
Returns: - Dictionary containing SAMtools results
validate_input(self, input_data: Any) -> bool
Validate input data for the plugin.
Parameters:
- input_data: Input data to validate
Returns: - True if input is valid, False otherwise
Configuration
The plugin accepts optional configuration parameters:
config = {
"samtools_path": "/path/to/samtools/binaries", # Path to SAMtools executables
"default_memory": "2g", # Default memory allocation
"temp_dir": "/tmp/samtools" # Temporary directory for operations
}
plugin = SamtoolsPlugin(config=config)
Development
Setting Up for Development
git clone https://github.com/Fundacion-de-Neurociencias/gfl-plugin-samtools.git
cd gfl-plugin-samtools
pip install -e ".[dev]"
Running Tests
pytest
Code Formatting
black gfl_plugin_samtools/
ruff check gfl_plugin_samtools/
Troubleshooting
Common Issues
- SAMtools Not Found: Ensure SAMtools is installed and in your PATH
- File Format Errors: Verify input file formats (SAM/BAM/CRAM)
- Memory Errors: For large files, increase memory allocation
- Indexing Issues: Ensure input files are properly sorted before indexing
Debugging
Enable verbose logging:
import logging
logging.basicConfig(level=logging.DEBUG)
Examples
Basic Sorting and Indexing Workflow
input:
unsorted_bam: "/data/alignments/sample1_unsorted.bam"
output_dir: "/data/alignments"
run:
- plugin: "samtools"
operation: "sort"
input_bam: "${unsorted_bam}"
output_bam: "${output_dir}/sample1_sorted.bam"
sort_order: "coordinate"
as_var: "sorting_results"
- plugin: "samtools"
operation: "index"
input_bam: "${sorting_results.output_bam}"
as_var: "indexing_results"
output:
- sorted_bam: "${sorting_results.output_bam}"
- bam_index: "${indexing_results.output_bai}"
Alignment Statistics and Filtering
input:
input_bam: "/data/alignments/sample1.bam"
output_dir: "/data/stats"
run:
- plugin: "samtools"
operation: "stats"
input_bam: "${input_bam}"
output_stats: "${output_dir}/sample1_stats.txt"
as_var: "stats_results"
- plugin: "samtools"
operation: "view"
input_file: "${input_bam}"
output_file: "${output_dir}/high_quality.bam"
filters: [
"-q 30", # Minimum mapping quality
"-F 2304" # Exclude duplicates and secondary alignments
]
as_var: "filtering_results"
- plugin: "samtools"
operation: "flagstat"
input_bam: "${filtering_results.output_file}"
output_flagstat: "${output_dir}/high_quality_flagstat.txt"
as_var: "flagstat_results"
output:
- alignment_stats: "${stats_results.output_stats}"
- filtered_bam: "${filtering_results.output_file}"
- flag_statistics: "${flagstat_results.output_flagstat}"
Multi-Sample Processing Pipeline
input:
samples: [
{"name": "sample1", "bam": "/data/alignments/sample1.bam"},
{"name": "sample2", "bam": "/data/alignments/sample2.bam"},
{"name": "sample3", "bam": "/data/alignments/sample3.bam"}
]
output_dir: "/data/processed"
process:
- name: "process_sample"
for_each: "sample in samples"
run:
- plugin: "samtools"
operation: "sort"
input_bam: "${sample.bam}"
output_bam: "${output_dir}/${sample.name}_sorted.bam"
as_var: "sort_result_${sample.name}"
- plugin: "samtools"
operation: "index"
input_bam: "${sort_result_${sample.name}.output_bam}"
as_var: "index_result_${sample.name}"
output:
- sorted_bam_${sample.name}: "${sort_result_${sample.name}.output_bam}"
- index_file_${sample.name}: "${index_result_${sample.name}.output_bai}"
output:
- processed_samples: "All samples processed successfully"
License
This project is licensed under the MIT License.