GATK Plugin
Genome Analysis Toolkit (GATK) plugin for GeneForgeLang (GFL) that runs variant calling, joint genotyping, and variant filtering directly from GFL workflows.
Overview
This plugin integrates the Genome Analysis Toolkit (GATK) with GeneForgeLang, so that GATK tools can be invoked as steps in a GeneForgeLang workflow. It supports key GATK tools, including HaplotypeCaller, GenotypeGVCFs, and VariantFiltration.
Features
- Variant Calling: HaplotypeCaller for SNP and indel detection
- Genotyping: GenotypeGVCFs for joint genotyping
- Variant Filtering: VariantFiltration for quality control
- Configurable Parameters: Set parameters for each GATK tool
- Structured Output: Returns parsed results in a clean JSON format
- GFL Integration: Seamlessly integrates with GeneForgeLang's plugin system
Installation
pip install gfl-plugin-gatk
Note: This plugin requires GATK to be installed separately. Download GATK from the Broad Institute website.
Usage
Once installed, the plugin is automatically discovered by the GeneForgeLang service through entry points.
Example GFL Workflow
run:
  - plugin: "gatk"
    tool: "HaplotypeCaller"
    reference: "/path/to/reference.fasta"
    input_bam: "/path/to/sample.bam"
    output_vcf: "/path/to/output.vcf"
    as_var: "variant_calling_results"
output:
  - vcf_file: "${variant_calling_results.output_vcf}"
Parameters
- tool: The GATK tool to run (HaplotypeCaller, GenotypeGVCFs, VariantFiltration, etc.)
- reference: Path to the reference genome FASTA file
- input_bam: Path to the input BAM file
- output_vcf: Path to the output VCF file
- Additional tool-specific parameters
Supported Tools
HaplotypeCaller
Calls germline SNPs and indels via local re-assembly of haplotypes.
Parameters:
- input_bam: Input BAM file
- reference: Reference genome FASTA
- output_vcf: Output VCF file
- intervals: Genomic intervals to process (optional)
- stand_call_conf: Minimum phred-scaled confidence threshold (default: 30)
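For orientation, the parameters above correspond roughly to GATK 4 command-line arguments. The sketch below shows one way such a mapping could look. `build_haplotypecaller_cmd` is a hypothetical helper, not part of the plugin's documented API, and the plugin's actual translation may differ.

```python
from typing import Any, Dict, List


def build_haplotypecaller_cmd(params: Dict[str, Any], gatk: str = "gatk") -> List[str]:
    """Build a GATK 4 HaplotypeCaller argument list from plugin-style parameters.

    Hypothetical helper for illustration only; the plugin's own mapping may differ.
    """
    # Required parameters; a missing key raises KeyError.
    cmd = [
        gatk, "HaplotypeCaller",
        "-R", str(params["reference"]),
        "-I", str(params["input_bam"]),
        "-O", str(params["output_vcf"]),
    ]
    # Optional: restrict calling to the given intervals.
    if params.get("intervals"):
        cmd += ["-L", str(params["intervals"])]
    # Optional: minimum phred-scaled confidence threshold for calling.
    if "stand_call_conf" in params:
        cmd += [
            "--standard-min-confidence-threshold-for-calling",
            str(params["stand_call_conf"]),
        ]
    return cmd


example_cmd = build_haplotypecaller_cmd({
    "reference": "/path/to/reference.fasta",
    "input_bam": "/path/to/sample.bam",
    "output_vcf": "/path/to/output.vcf",
    "stand_call_conf": 30,
})
print(" ".join(example_cmd))
```

Returning a list rather than a single string lets the result be passed to `subprocess.run` without shell quoting concerns.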
GenotypeGVCFs
Performs joint genotyping on one or more samples pre-called with HaplotypeCaller.
Parameters:
- input_gvcfs: List of input GVCF files
- reference: Reference genome FASTA
- output_vcf: Output VCF file
- intervals: Genomic intervals to process (optional)
VariantFiltration
Filters variant calls based on INFO and/or FORMAT annotations.
Parameters:
- input_vcf: Input VCF file
- output_vcf: Output VCF file
- filter_expressions: List of filter expressions
- filter_names: Names for the filters
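On the GATK command line, each filter expression is paired by position with a filter name, so the two lists are expected to have the same length. The sketch below illustrates that pairing. `build_variantfiltration_cmd` is a hypothetical helper written for this example, not the plugin's documented API.

```python
from typing import List, Sequence


def build_variantfiltration_cmd(
    input_vcf: str,
    output_vcf: str,
    filter_expressions: Sequence[str],
    filter_names: Sequence[str],
    gatk: str = "gatk",
) -> List[str]:
    """Build a GATK 4 VariantFiltration argument list (illustrative only)."""
    # Expressions and names are matched by position, so lengths must agree.
    if len(filter_expressions) != len(filter_names):
        raise ValueError(
            f"{len(filter_expressions)} filter expressions "
            f"but {len(filter_names)} filter names"
        )
    cmd = [gatk, "VariantFiltration", "-V", input_vcf, "-O", output_vcf]
    for expression, name in zip(filter_expressions, filter_names):
        cmd += ["--filter-expression", expression, "--filter-name", name]
    return cmd


filtration_cmd = build_variantfiltration_cmd(
    "/path/to/input.vcf",
    "/path/to/filtered.vcf",
    ["QD < 2.0", "FS > 60.0"],
    ["LowQualityDepth", "HighStrandBias"],
)
print(filtration_cmd)
```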
Output Format
The plugin returns structured results containing:
- tool: The GATK tool that was run
- output_files: Dictionary of output files created
- metrics: Performance metrics and statistics
- status: Execution status (success/failure)
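As a rough illustration of consuming such a result, the sketch below builds a made-up result dictionary and reads an output path from it only when the run succeeded. Only the four top-level keys come from the description above. The `output_vcf` key inside `output_files`, the empty `metrics`, and the helper `get_output_file` are assumptions for this example.

```python
from typing import Any, Dict

# Hypothetical result; only the four top-level keys are documented above.
example_result: Dict[str, Any] = {
    "tool": "HaplotypeCaller",
    "output_files": {"output_vcf": "/path/to/output.vcf"},  # key name is an assumption
    "metrics": {},  # contents are not specified in this document
    "status": "success",
}


def get_output_file(result: Dict[str, Any], key: str) -> str:
    """Return one output path, refusing to read from a failed run."""
    if result.get("status") != "success":
        raise RuntimeError(
            f"{result.get('tool')} did not succeed: status={result.get('status')!r}"
        )
    try:
        return result["output_files"][key]
    except KeyError:
        raise KeyError(f"no output file named {key!r}") from None


print(get_output_file(example_result, "output_vcf"))
```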
Requirements
- GeneForgeLang >= 1.0.0
- GATK >= 4.0.0 (separate installation required)
- Java >= 8 (GATK 4.4 and later require Java 17; match the Java version to your GATK release)
API Reference
Class: GatkPlugin
Methods
__init__(self, config: Optional[Dict[str, Any]] = None)
Initialize the GATK plugin.
Parameters:
- config: Optional configuration dictionary
run(self, input_data: Any, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]
Execute a GATK tool.
Parameters:
- input_data: Input data containing file paths and parameters
- params: Optional parameters for the GATK tool
Returns:
- Dictionary containing GATK results
validate_input(self, input_data: Any) -> bool
Validate input data for the plugin.
Parameters:
- input_data: Input data to validate
Returns:
- True if input is valid, False otherwise
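The exact checks performed by validate_input are not specified here. As a sketch of what such validation could involve, the example below checks that every parameter not marked optional in the Supported Tools section is present for the chosen tool. `REQUIRED_PARAMS`, `missing_params`, and `is_valid` are hypothetical names, and treating unlisted tools as invalid is an assumption.

```python
from typing import Any, Dict, List

# Required keys per tool, taken from the Supported Tools section.
# Treating every non-optional parameter as required is an assumption.
REQUIRED_PARAMS: Dict[str, List[str]] = {
    "HaplotypeCaller": ["input_bam", "reference", "output_vcf"],
    "GenotypeGVCFs": ["input_gvcfs", "reference", "output_vcf"],
    "VariantFiltration": [
        "input_vcf", "output_vcf", "filter_expressions", "filter_names",
    ],
}


def missing_params(input_data: Any) -> List[str]:
    """Return the names of required parameters that are absent or empty."""
    if not isinstance(input_data, dict):
        return ["tool"]
    tool = input_data.get("tool")
    # Assumption: tools not listed above are reported as an invalid "tool".
    if tool not in REQUIRED_PARAMS:
        return ["tool"]
    return [key for key in REQUIRED_PARAMS[tool] if not input_data.get(key)]


def is_valid(input_data: Any) -> bool:
    """Mirror the bool contract of validate_input: True only if nothing is missing."""
    return not missing_params(input_data)


print(missing_params({"tool": "HaplotypeCaller", "reference": "/path/to/reference.fasta"}))
```

Reporting the missing names, rather than only a boolean, makes it easier to produce a useful error message before GATK is launched.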
Configuration
The plugin accepts optional configuration parameters:
config = {
    "gatk_path": "/path/to/gatk/binaries",  # Path to GATK executables
    "java_path": "/path/to/java",           # Path to Java executable
    "default_memory": "4g",                 # Default memory allocation
    "temp_dir": "/tmp/gatk"                 # Temporary directory for GATK operations
}
plugin = GatkPlugin(config=config)
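How the plugin applies these settings internally is not documented here. One plausible reading, sketched below, is that `default_memory` becomes a JVM heap limit passed through the GATK launcher's `--java-options` argument and `temp_dir` is passed as `--tmp-dir`. The helpers `launcher_prefix` and `tmp_dir_args` are hypothetical, `java_path` is left out of the sketch, and the assumption that `gatk_path` is a directory containing the `gatk` launcher is a guess.

```python
import os
from typing import Any, Dict, List, Optional


def launcher_prefix(config: Optional[Dict[str, Any]], tool: str) -> List[str]:
    """Build the leading arguments of a GATK invocation from a config dict."""
    config = config or {}
    # Assumption: gatk_path is a directory that contains the `gatk` launcher.
    gatk_dir = config.get("gatk_path")
    gatk = os.path.join(gatk_dir, "gatk") if gatk_dir else "gatk"
    # default_memory is interpreted as the JVM maximum heap size (-Xmx).
    memory = config.get("default_memory", "4g")
    return [gatk, "--java-options", f"-Xmx{memory}", tool]


def tmp_dir_args(config: Optional[Dict[str, Any]]) -> List[str]:
    """Return trailing arguments that point GATK at a temporary directory."""
    tmp = (config or {}).get("temp_dir")
    return ["--tmp-dir", str(tmp)] if tmp else []


example_config = {"default_memory": "8g", "temp_dir": "/tmp/gatk"}
print(launcher_prefix(example_config, "HaplotypeCaller") + tmp_dir_args(example_config))
```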
Development
Setting Up for Development
git clone https://github.com/Fundacion-de-Neurociencias/gfl-plugin-gatk.git
cd gfl-plugin-gatk
pip install -e ".[dev]"
Running Tests
pytest
Code Formatting
black gfl_plugin_gatk/
ruff check gfl_plugin_gatk/
Troubleshooting
Common Issues
- GATK Not Found: Ensure GATK is installed and on your PATH, or set gatk_path in the plugin configuration
- Java Issues: Verify Java installation and version compatibility
- Memory Errors: For large datasets, increase the memory allocation (for example, raise default_memory in the plugin configuration)
- File Permissions: Check read/write permissions for input/output files
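Several of these issues can be caught before a workflow starts. The sketch below is a minimal pre-flight check using only the Python standard library. `preflight` is a hypothetical helper, not part of the plugin, and it assumes the executables are named `gatk` and `java` unless told otherwise.

```python
import os
import shutil
from typing import List, Sequence


def preflight(
    input_paths: Sequence[str],
    output_path: str,
    gatk: str = "gatk",
    java: str = "java",
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems: List[str] = []
    # GATK Not Found / Java Issues: look the executables up on PATH.
    if shutil.which(gatk) is None:
        problems.append(f"executable not found on PATH: {gatk}")
    if shutil.which(java) is None:
        problems.append(f"executable not found on PATH: {java}")
    # File Permissions: inputs must exist and be readable.
    for path in input_paths:
        if not os.path.isfile(path):
            problems.append(f"input file not found: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"input file not readable: {path}")
    # File Permissions: the output directory must exist and be writable.
    out_dir = os.path.dirname(os.path.abspath(output_path))
    if not os.path.isdir(out_dir):
        problems.append(f"output directory not found: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        problems.append(f"output directory not writable: {out_dir}")
    return problems


for problem in preflight(["/path/to/sample.bam"], "/path/to/output.vcf"):
    print(problem)
```

The check does not verify the Java version or available memory, so those still need to be confirmed separately.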
Debugging
Enable verbose logging:
import logging
logging.basicConfig(level=logging.DEBUG)
Examples
Basic Variant Calling
input:
  reference_genome: "/data/reference/hg38.fasta"
  sample_bam: "/data/samples/sample1.bam"
  output_dir: "/data/results"
run:
  - plugin: "gatk"
    tool: "HaplotypeCaller"
    reference: "${reference_genome}"
    input_bam: "${sample_bam}"
    output_vcf: "${output_dir}/sample1.vcf"
    stand_call_conf: 30
    as_var: "haplotypecaller_results"
output:
  - variant_calls: "${haplotypecaller_results.output_vcf}"
Joint Genotyping Workflow
input:
  reference_genome: "/data/reference/hg38.fasta"
  sample_gvcfs: [
    "/data/gvcfs/sample1.g.vcf",
    "/data/gvcfs/sample2.g.vcf",
    "/data/gvcfs/sample3.g.vcf"
  ]
  output_dir: "/data/results"
run:
  - plugin: "gatk"
    tool: "GenotypeGVCFs"
    reference: "${reference_genome}"
    input_gvcfs: "${sample_gvcfs}"
    output_vcf: "${output_dir}/joint_genotyped.vcf"
    as_var: "genotyping_results"
process:
  - name: "filter_variants"
    plugin: "gatk"
    tool: "VariantFiltration"
    input_vcf: "${genotyping_results.output_vcf}"
    output_vcf: "${output_dir}/filtered_variants.vcf"
    filter_expressions: [
      "QD < 2.0",
      "MQ < 40.0",
      "FS > 60.0"
    ]
    filter_names: [
      "LowQualityDepth",
      "LowMappingQuality",
      "HighStrandBias"
    ]
    as_var: "filtered_results"
output:
  - final_variants: "${filtered_results.output_vcf}"
License
This project is licensed under the MIT License.