Schema Registry & Type System
GeneForgeLang v1.0.0 introduces a Schema Registry that lets users define custom data types in external schema files and reference them from GFL documents. Because types live outside the language core, new domain-specific types can be added without changing GeneForgeLang itself.
Overview
The Schema Registry enables users to:
- Define custom data types in YAML schema files
- Import schema definitions using the import_schemas directive
- Use custom types in IO contracts for enhanced validation
- Create domain-specific types for specialized workflows
Schema Definition Files
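A schema file is a YAML document with a top-level schemas list. Each entry declares a name, a base_type, a description, and an attributes map; each attribute states whether it is required and may constrain its type or value: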
# custom_types.yml
schemas:
  - name: GeneExpressionMatrix
    base_type: CSV
    description: "Gene expression data matrix with genes as rows and samples as columns"
    attributes:
      normalized:
        required: true
        value: true
      log_transformed:
        required: false
      sample_count:
        required: true
        type: integer
  - name: ProteinStructure
    base_type: CUSTOM
    description: "3D protein structure in PDB format with metadata"
    attributes:
      resolution:
        required: true
        type: float
      atom_count:
        required: false
        type: integer
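To make the structure concrete, here is a minimal Python sketch of how a parsed schema file could be indexed by type name. It is an illustration only, not GeneForgeLang's actual implementation. The function name build_registry is hypothetical, the inline dictionary stands in for the output of a YAML parser, and the rule that name and base_type are mandatory keys is an assumption.

```python
# Hypothetical sketch: index parsed schema definitions by type name.
# `parsed` mimics what a YAML parser would return for custom_types.yml.

def build_registry(parsed):
    """Return {schema name: definition}, rejecting malformed or duplicate entries."""
    registry = {}
    for entry in parsed.get("schemas", []):
        # Assumption: every schema entry must declare at least these two keys.
        missing = [key for key in ("name", "base_type") if key not in entry]
        if missing:
            raise ValueError(f"schema entry is missing keys: {missing}")
        name = entry["name"]
        if name in registry:
            raise ValueError(f"duplicate schema name: {name}")
        registry[name] = entry
    return registry


parsed = {
    "schemas": [
        {
            "name": "GeneExpressionMatrix",
            "base_type": "CSV",
            "attributes": {
                "normalized": {"required": True, "value": True},
                "log_transformed": {"required": False},
                "sample_count": {"required": True, "type": "integer"},
            },
        },
        {
            "name": "ProteinStructure",
            "base_type": "CUSTOM",
            "attributes": {
                "resolution": {"required": True, "type": "float"},
                "atom_count": {"required": False, "type": "integer"},
            },
        },
    ]
}

registry = build_registry(parsed)
print(sorted(registry))
```

Keying the registry by name makes a later lookup of a contract's type a single dictionary access, and rejecting duplicates early avoids one file silently overriding a type from another.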
Importing Schemas
Use the import_schemas directive at the top level of your GFL file to import schema definitions:
import_schemas:
  - ./schemas/custom_types.yml
  - ./schemas/domain_specific.yml
experiment:
  tool: RNAseq
  type: sequencing
  contract:
    outputs:
      expression_data:
        type: GeneExpressionMatrix
        attributes:
          normalized: true
          sample_count: 12
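The sketch below shows one way a tool could read the import_schemas directive from a parsed GFL document and turn its entries into file paths. It is not GeneForgeLang's implementation. The function name resolve_schema_imports and the file name project/workflow.gfl are hypothetical, and resolving relative paths against the directory of the GFL file (rather than the working directory) is an assumption.

```python
from pathlib import PurePosixPath

# Hypothetical sketch: read the top-level import_schemas directive.

def resolve_schema_imports(gfl_doc, gfl_path):
    """Return the import_schemas entries resolved against the GFL file's directory."""
    imports = gfl_doc.get("import_schemas", [])
    if not isinstance(imports, list) or not all(isinstance(p, str) for p in imports):
        raise TypeError("import_schemas must be a list of file paths")
    # Assumption: relative paths are interpreted relative to the GFL file.
    base_dir = PurePosixPath(gfl_path).parent
    return [str(base_dir / p) for p in imports]


# Stand-in for a parsed GFL document (attributes and contract omitted).
gfl_doc = {
    "import_schemas": ["./schemas/custom_types.yml", "./schemas/domain_specific.yml"],
    "experiment": {"tool": "RNAseq", "type": "sequencing"},
}

resolved = resolve_schema_imports(gfl_doc, "project/workflow.gfl")
for path in resolved:
    print(path)
```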
Using Custom Types in Contracts
Once imported, custom types can be used in IO contracts just like built-in types:
analyze:
  strategy: differential
  contract:
    inputs:
      expression_data:
        type: GeneExpressionMatrix
        attributes:
          normalized: true
          sample_count: 12
    outputs:
      significant_genes:
        type: CSV
        attributes:
          format: gene_list
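A custom type is only usable once its schema has been imported, so a natural first check is whether every type named in a contract is known. The following Python sketch illustrates that check under stated assumptions: the helper names are hypothetical, the set of built-in types contains only the base types that appear in this document, and the analyze dictionary mirrors the block above with attributes omitted for brevity.

```python
# Hypothetical sketch: find contract entries whose type is neither built in nor imported.

# Assumption: only the base types shown in this document are treated as built in.
BUILTIN_TYPES = {"CSV", "VCF", "FASTA", "CUSTOM"}


def contract_types(block):
    """Yield (direction, entry name, type name) for each contract entry in a block."""
    contract = block.get("contract", {})
    for direction in ("inputs", "outputs"):
        for entry_name, spec in contract.get(direction, {}).items():
            yield direction, entry_name, spec.get("type")


def unknown_types(block, registry):
    """Return the contract entries that reference a type the registry does not know."""
    known = BUILTIN_TYPES | set(registry)
    return [item for item in contract_types(block) if item[2] not in known]


analyze = {
    "strategy": "differential",
    "contract": {
        "inputs": {"expression_data": {"type": "GeneExpressionMatrix"}},
        "outputs": {"significant_genes": {"type": "CSV"}},
    },
}

# With the schema imported, nothing is reported.
print(unknown_types(analyze, registry={"GeneExpressionMatrix": {}}))
# Without the import, the input entry is flagged.
print(unknown_types(analyze, registry={}))
```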
Schema Validation
The Schema Registry validates each use of a custom type against its schema definition. It performs four checks:
- Required Attributes: Ensures all required attributes are present
- Value Constraints: Validates attribute values against expected values
- Type Checking: Verifies attribute types match schema definitions
- Base Type Compatibility: Checks compatibility with base types
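The first three checks can be sketched in a few lines of Python. This is an illustration under assumptions, not GeneForgeLang's validator: the function names are hypothetical, a list under value is assumed to mean "any one of these values", integers are assumed to be acceptable where a float is declared, and base type compatibility is left out because its rules are not described here.

```python
# Hypothetical sketch of required-attribute, value and type checks.

# Assumption: these are the type names a schema may use (the ones seen in this document).
TYPE_MAP = {"integer": int, "float": float, "boolean": bool}


def matches_type(value, type_name):
    """Check a Python value against a schema type name."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so it must be handled first.
        return type_name == "boolean"
    if type_name == "float":
        return isinstance(value, (int, float))  # assumption: integers count as floats
    return isinstance(value, TYPE_MAP[type_name])


def validate_attributes(schema, attributes):
    """Return a list of problems; an empty list means the attributes are valid."""
    problems = []
    for name, rule in schema.get("attributes", {}).items():
        if name not in attributes:
            if rule.get("required", False):
                problems.append(f"required attribute '{name}' is missing")
            continue
        value = attributes[name]
        if "value" in rule:
            expected = rule["value"]
            # Assumption: a list means "one of", a scalar means "exactly this".
            allowed = expected if isinstance(expected, list) else [expected]
            if value not in allowed:
                problems.append(f"attribute '{name}' has value {value!r}, expected one of {allowed}")
        if "type" in rule and not matches_type(value, rule["type"]):
            problems.append(f"attribute '{name}' should be of type {rule['type']}")
    return problems


genome_assembly = {
    "name": "GenomeAssembly",
    "base_type": "FASTA",
    "attributes": {
        "contig_count": {"required": True, "type": "integer"},
        "assembly_level": {"required": True, "value": ["complete", "chromosome", "scaffold", "contig"]},
        "gc_content": {"required": False, "type": "float"},
    },
}

# contig_count is missing and assembly_level is not an allowed value.
problems = validate_attributes(genome_assembly, {"assembly_level": "draft", "gc_content": 0.41})
for problem in problems:
    print(problem)
```

Collecting every problem in a list, rather than stopping at the first one, lets a user fix all attribute errors in a contract in a single pass.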
Example Schema File
Here's a complete example of a schema definition file:
# genomics_schemas.yml
version: 1.0
schemas:
  - name: VariantCallFormat
    base_type: VCF
    description: "Variant call format file with genomic variants"
    attributes:
      compressed:
        required: false
        value: true
      annotated:
        required: true
        value: true
      sample_count:
        required: true
        type: integer
  - name: PhylogeneticTree
    base_type: CUSTOM
    description: "Phylogenetic tree representation in Newick format"
    attributes:
      rooted:
        required: false
        type: boolean
      bootstrap_values:
        required: false
        value: true
      leaf_count:
        required: true
        type: integer
  - name: GenomeAssembly
    base_type: FASTA
    description: "Genome assembly in FASTA format"
    attributes:
      contig_count:
        required: true
        type: integer
      assembly_level:
        required: true
        value: [complete, chromosome, scaffold, contig]
      gc_content:
        required: false
        type: float
Error Handling
When schema validation fails, GeneForgeLang reports the offending attribute, the contract entry it belongs to, and the schema type being checked. For example:
Required attribute 'sample_count' missing in contract outputs 'expression_data' for schema type 'GeneExpressionMatrix'
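The message above has a regular shape: attribute, contract direction, entry name, schema type. The Python sketch below reproduces it for a contract entry that omits a required attribute. The function names are hypothetical and the code is not taken from GeneForgeLang; only the message wording comes from the example above.

```python
# Hypothetical sketch: build the missing-attribute message shown above.

def missing_attribute_message(attribute, direction, entry, schema_type):
    """Format one error in the same shape as the example message."""
    return (
        f"Required attribute '{attribute}' missing in contract {direction} "
        f"'{entry}' for schema type '{schema_type}'"
    )


def missing_required(schema, direction, entry, attributes):
    """Return one message per required attribute that the contract entry omits."""
    return [
        missing_attribute_message(name, direction, entry, schema["name"])
        for name, rule in schema.get("attributes", {}).items()
        if rule.get("required", False) and name not in attributes
    ]


gene_expression_matrix = {
    "name": "GeneExpressionMatrix",
    "attributes": {
        "normalized": {"required": True, "value": True},
        "log_transformed": {"required": False},
        "sample_count": {"required": True, "type": "integer"},
    },
}

# The contract output declares normalized but leaves out sample_count.
messages = missing_required(
    gene_expression_matrix, "outputs", "expression_data", {"normalized": True}
)
for message in messages:
    print(message)
```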
Best Practices
- Organize Schemas: Group related types in logical schema files
- Version Control: Include version information in schema files
- Documentation: Provide clear descriptions for each schema and attribute
- Validation: Define required attributes to ensure data quality
- Reusability: Design schemas to be reusable across multiple workflows
Next Steps
- IO Contracts Documentation - Learn how to use custom types in contracts
- Design Block Documentation - See how custom types work with AI workflows
- Error Handling Guide - Understand schema validation errors