GeneForgeLang quickstart guide
GeneForgeLang (GFL) is a domain-specific language designed for genomic workflow specification, validation, and AI-powered analysis. This guide walks you through installation, your first workflows, custom schemas, and best practices to get you up and running quickly.
Prerequisites
Before starting, make sure you have:
1. Python 3.9 or higher installed
2. GeneForgeLang installed (see Installation Guide)
3. Basic understanding of biological workflows
1. Installation
# Check GeneForgeLang installation
python -c "import geneforgelang; print(f'Version: {geneforgelang.__version__}')"
# Check available plugins
gfl --list-plugins
If both commands run without errors, you are ready to create your first workflow.
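The same checks can be scripted. The following is a minimal sketch, not part of GeneForgeLang itself: it only assumes the module name `geneforgelang` and the `gfl` command shown above, and the helper name `check_environment` is hypothetical.

```python
import importlib.util
import shutil
import sys


def check_environment(module_name="geneforgelang", cli_name="gfl", min_python=(3, 9)):
    """Return a dict of booleans describing whether each prerequisite is met."""
    return {
        # Tuple comparison: (3, 11, ...) >= (3, 9) is True.
        "python_ok": sys.version_info >= min_python,
        # find_spec returns None when a top-level module cannot be located.
        "module_found": importlib.util.find_spec(module_name) is not None,
        # shutil.which returns None when the command is not on PATH.
        "cli_on_path": shutil.which(cli_name) is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

A `MISSING` line points at the prerequisite to revisit before moving on.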
2. Your First Workflow
Create a file called my_first_workflow.gfl with the following content. It runs a BLAST search and filters the results:
# my_first_workflow.gfl
input:
  sequence: "ATGCGATCGATCGATCGATCGATCGATCG"
  database: "nt"
run:
  - plugin: "blast"
    operation: "blastn"
    sequence: "${sequence}"
    database: "${database}"
    expect_threshold: 0.001
    as_var: "blast_results"
process:
  - name: "filter_hits"
    input: "${blast_results}"
    operation: "filter"
    condition: "evalue < 0.01"
    as_var: "filtered_hits"
output:
  - blast_hits: "${filtered_hits}"
  - summary: "Found ${len(filtered_hits)} significant hits"
Run it and optionally save the output in different formats:
# Run the workflow
gfl run my_first_workflow.gfl
# Save results as JSON
gfl run my_first_workflow.gfl --output-format json --output-file results.json
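To make the `filter_hits` step concrete, here is a plain-Python sketch of what a condition such as `evalue < 0.01` does to a list of hits. The hit records and the field names `id` and `evalue` are hypothetical; the actual structure returned by the blast plugin may differ.

```python
def filter_hits(hits, max_evalue=0.01):
    """Keep hits whose E-value is strictly below max_evalue (mirrors `evalue < 0.01`)."""
    return [hit for hit in hits if hit["evalue"] < max_evalue]


# Hypothetical hit records, for illustration only.
example_hits = [
    {"id": "hit_a", "evalue": 1e-30},
    {"id": "hit_b", "evalue": 0.005},
    {"id": "hit_c", "evalue": 0.5},
]

filtered_hits = filter_hits(example_hits)
print(f"Found {len(filtered_hits)} significant hits")  # prints "Found 2 significant hits"
```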
The following workflow uses Genesis plugins to design and evaluate gRNA candidates:
# crispr_design.gfl
input:
  target_sequence: "ATGCGATCGATCGATCGATCG"
  genome_context: "ACGTGCAATGGAGCGGCTTGCGG"
design:
  entity: "gRNA"
  count: 5
  constraints:
    - "length(20)"
    - "gc_content(40, 60)"
evaluate:
  candidates: "${design.candidates}"
  - plugin: "gfl-ontarget-scorer"
    input:
      grna_sequence: "${candidate.sequence}"
      genome_sequence: "${genome_context}"
    as_var: "on_target_score"
  - plugin: "gfl-offtarget-scorer"
    input:
      grna_sequence: "${candidate.sequence}"
    params:
      max_mismatches: 3
      genome_reference: "GRCh38"
    as_var: "off_target_risk"
  - plugin: "gfl-crispr-evaluator"
    input:
      grna_candidates: "${evaluate.candidates}"
    params:
      weight_factor: 0.3
    as_var: "final_scores"
output:
  - ranked_candidates: "${final_scores.results}"
  - summary: "Evaluated ${len(final_scores.results)} gRNA candidates"
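The two design constraints can be illustrated outside GFL. This sketch assumes that `length(20)` means exactly 20 bases and that `gc_content(40, 60)` means a GC percentage between 40 and 60 inclusive. The guide does not spell out either interpretation, so treat both as assumptions. The function names are hypothetical.

```python
def gc_percent(sequence):
    """Percentage of G and C bases in a DNA sequence (0.0 for an empty string)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def meets_constraints(sequence, length=20, gc_min=40.0, gc_max=60.0):
    """Check the assumed meaning of length(20) and gc_content(40, 60)."""
    return len(sequence) == length and gc_min <= gc_percent(sequence) <= gc_max


# First 20 bases of the target_sequence used in crispr_design.gfl.
candidate = "ATGCGATCGATCGATCGATC"
print(gc_percent(candidate))         # 50.0
print(meets_constraints(candidate))  # True
```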
GFL supports dynamic variables and conditional steps:
input:
  organism: "human"
  quality_score: 85
variables:
  database_map:
    human: "GRCh38"
    mouse: "GRCm39"
  threads: "${os.cpu_count() // 2}"
run:
  - plugin: "gatk"
    operation: "variant_calling"
    reference_genome: "${database_map[organism]}"
    threads: "${threads}"
    as_var: "variants"
process:
  - name: "high_quality_analysis"
    condition: "${quality_score >= 90}"
    plugin: "advanced_analysis"
  - name: "standard_analysis"
    condition: "${quality_score < 90}"
    plugin: "basic_analysis"
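As a rough mental model, the variable lookups and conditions in this workflow behave like the plain-Python sketch below. It is not how GFL evaluates expressions internally, and `resolve_workflow` is a hypothetical name. The guard around `os.cpu_count()` is also an addition: the function can return `None`, and halving a single core would give zero threads.

```python
import os

DATABASE_MAP = {"human": "GRCh38", "mouse": "GRCm39"}


def resolve_workflow(organism, quality_score):
    """Resolve the dynamic values and pick the conditional step for the given inputs."""
    cpu_total = os.cpu_count() or 1   # cpu_count() may return None
    threads = max(1, cpu_total // 2)  # never fall below one thread

    # Exactly one of the two conditions holds for any numeric score.
    if quality_score >= 90:
        step, plugin = "high_quality_analysis", "advanced_analysis"
    else:
        step, plugin = "standard_analysis", "basic_analysis"

    return {
        "reference_genome": DATABASE_MAP[organism],
        "threads": threads,
        "step": step,
        "plugin": plugin,
    }


print(resolve_workflow("human", 85))
```

With the inputs from the workflow (`organism: "human"`, `quality_score: 85`), the reference genome resolves to GRCh38 and the `standard_analysis` step is selected.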
3. Custom Schemas and IO Contracts
Custom schemas let you define typed data contracts between workflow steps, catching errors before execution. Schemas are defined in separate YAML files and imported into your workflow.
Define a schema file:
# gene_expression_schemas.yml
version: 1.0
schemas:
  - name: GeneExpressionMatrix
    base_type: CSV
    description: "Gene expression data matrix"
    attributes:
      normalized:
        required: true
        value: true
      sample_count:
        required: true
        type: integer
      gene_count:
        required: true
        type: integer
  - name: DifferentialExpressionResults
    base_type: CSV
    description: "Results from differential expression analysis"
    attributes:
      p_value_adjusted:
        required: true
        value: true
      fold_change_threshold:
        required: true
        type: float
Use the schema in a workflow
Import the schema file and reference schema types in your contract blocks:
# rna_seq_analysis.gfl
import_schemas:
  - ./schemas/gene_expression_schemas.yml
experiment:
  tool: RNAseq
  type: sequencing
  contract:
    outputs:
      raw_expression:
        type: GeneExpressionMatrix
        attributes:
          normalized: true
          sample_count: 12
          gene_count: 25000
  params:
    protocol: standard
    read_length: 150
    paired_end: true
  output: expression_data
analyze:
  strategy: differential
  contract:
    inputs:
      expression_data:
        type: GeneExpressionMatrix
        attributes:
          normalized: true
    outputs:
      de_results:
        type: DifferentialExpressionResults
        attributes:
          p_value_adjusted: true
          fold_change_threshold: 1.5
  data: expression_data
  thresholds:
    p_value: 0.05
    fold_change: 1.5
  output: differential_results
Validate Your Workflow
Run the validator after adding schemas to catch contract mismatches early:
gfl-validate rna_seq_analysis.gfl
# Expected output:
# ✓ Parsing successful
# ✓ Validation passed: 0 errors, 0 warnings
Common Schema Validation Errors
# Missing required attribute
Error: Required attribute 'sample_count' missing in contract outputs
'raw_expression' for schema type 'GeneExpressionMatrix'
Fix: Add the required 'sample_count' attribute to your contract definition
# Invalid attribute value
Error: Attribute 'clinical_significance' must have one of the allowed values,
got 'unknown'
Fix: Use one of: benign, likely_benign, uncertain, likely_pathogenic, pathogenic
NOTE: For advanced schema features such as VCF/CUSTOM base types, complex validation rules, and phylogenetic data types, see the Advanced Schema Definitions reference document.
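The idea behind the "missing required attribute" error can be sketched in a few lines of Python. This is an illustration of the concept only, not the logic of `gfl-validate`: the `SCHEMAS` dictionary is a hand-written mirror of the `GeneExpressionMatrix` entry above, `check_output_contract` is a hypothetical helper, and the sketch covers output contracts only.

```python
# Assumption: a hand-written mirror of gene_expression_schemas.yml, not loaded from disk.
SCHEMAS = {
    "GeneExpressionMatrix": {
        "normalized": {"required": True, "value": True},
        "sample_count": {"required": True, "type": int},
        "gene_count": {"required": True, "type": int},
    },
}


def check_output_contract(output_name, schema_type, attributes):
    """Return a list of error strings for one entry under `contract: outputs:`."""
    errors = []
    for attr, rule in SCHEMAS[schema_type].items():
        if attr not in attributes:
            if rule.get("required"):
                errors.append(
                    f"Required attribute '{attr}' missing in contract outputs "
                    f"'{output_name}' for schema type '{schema_type}'"
                )
            continue
        value = attributes[attr]
        if "value" in rule and value != rule["value"]:
            errors.append(f"Attribute '{attr}' must be {rule['value']!r}, got {value!r}")
        # type(...) is compared directly so that True is not accepted as an integer.
        if "type" in rule and type(value) is not rule["type"]:
            errors.append(f"Attribute '{attr}' must be of type {rule['type'].__name__}")
    return errors


print(check_output_contract(
    "raw_expression",
    "GeneExpressionMatrix",
    {"normalized": True, "gene_count": 25000},  # sample_count left out on purpose
))
```

Passing the full attribute set from rna_seq_analysis.gfl (`normalized: true`, `sample_count: 12`, `gene_count: 25000`) returns an empty list.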
4. Best Practices
Workflow Organization
- Modular Design: Break complex workflows into smaller, reusable components.
- Clear Naming: Use descriptive names for variables, steps, and outputs.
- Documentation: Add comments and include a metadata block with author, date, and version.
- Error Handling: Include validation and conditional steps to handle edge cases.
Example metadata block:
metadata:
  author: "Dr. Jane Smith"
  institution: "University Research Lab"
  date: "2025-01-15"
  version: "1.0"
  notes: |
    This workflow validates CRISPR knockout efficiency.
    Expected results: >80% knockout efficiency.
Plugin Usage
- Plugin Selection: Choose plugins that match your specific needs and verify compatibility.
- Parameter Tuning: Experiment with parameters for optimal results on your data.
- Version Pinning: Specify plugin versions explicitly for reproducibility.
- Resource Management: Monitor memory and CPU usage for large datasets.
Schema Best Practices
- Organize schema files: Group related schemas in logical YAML files.
- Version your schemas: Include version information to track changes over time.
- Validate early: Run gfl-validate after every schema change.
- Reuse schemas: Design schemas to be shared across multiple projects.
Common Syntax Pitfalls
# ❌ Wrong: dash in unquoted value
target_gene: TP-53
# ✅ Correct: quote special characters
target_gene: "TP-53"
# ❌ Wrong: lowercase tool name
tool: crispr_cas9
# ✅ Correct: standard tool name
tool: CRISPR_cas9
# ❌ Wrong: string instead of number
efficiency: "0.8"
# ✅ Correct: proper data type
efficiency: 0.8
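The last pitfall, a number written as a quoted string, is easy to detect programmatically. The helper below is a hypothetical sketch and not a GeneForgeLang function: it flags string values that would parse as a number and leaves genuine strings such as "TP-53" alone.

```python
def looks_like_quoted_number(value):
    """Return True if value is a string that would parse as a float, e.g. "0.8"."""
    if not isinstance(value, str):
        return False
    try:
        float(value)
    except ValueError:
        return False
    return True


print(looks_like_quoted_number("0.8"))    # True: probably meant 0.8
print(looks_like_quoted_number(0.8))      # False: already a number
print(looks_like_quoted_number("TP-53"))  # False: a genuine string
```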
5. Next Steps
Now that you have a working setup, explore these resources to go further:
- Plugin Ecosystem (plugins_overview.md): browse available plugins and their parameters.
- Language Specification (gfl_yaml/): complete syntax reference.
- Advanced Schema Definitions (advanced_schemas.md): VCF/CUSTOM types and complex validation rules.
- Tutorials (tutorials/): advanced workflow examples including batch processing and AI inference.
- API Reference (api/): REST API and client SDK documentation.
- Community: https://github.com/Fundacion-de-Neurociencias/GeneForgeLang/discussions
Getting Help
1. Check the Troubleshooting Guide (installation.md#troubleshooting).
2. Review the API Reference (api/).
3. Search the project documentation.
4. Ask questions in the GitHub Discussion Forum.