YAML Project configuration breakdown¶
Configuration Validation
Always validate your YAML configuration before using it:
# Validate configuration syntax and structure
depictio-cli config validate-project-config \
--project-config-path ./my_project.yaml --verbose
For YAML syntax highlighting in VS Code: install the YAML extension and save files with a .yaml or .yml extension.
Configuration Template Warning
Do not copy-paste entire configurations blindly. This reference shows all available options with their defaults and descriptions. Only specify values that differ from defaults or are required for your specific use case.
This guide provides comprehensive YAML configuration examples for Depictio projects, from simple setups to complex bioinformatics workflows. For a complete reference of all options, see the Configuration Reference.
Quick Start Examples¶
Choose your starting point based on your project complexity:
Basic project (for direct file upload and analysis, or CLI-based projects with direct files):
name: "CSV Analysis Project"
project_type: "basic"
is_public: false
data_collections:
- data_collection_tag: "main_data"
description: "Primary dataset for analysis"
config:
type: "table"
metatype: "metadata"
scan:
mode: "single"
scan_parameters:
filename: "/path/to/data.csv"
dc_specific_properties:
format: "csv"
polars_kwargs:
separator: ","
has_header: true
Advanced project (for workflow-generated data with pattern matching):
name: "RNA-seq Analysis"
project_type: "advanced"
workflows:
- name: "rnaseq_pipeline"
engine:
name: "nextflow"
version: "24.10.3"
data_location:
structure: "sequencing-runs"
locations:
- "{DATA_LOCATION}/results"
runs_regex: "run_.*"
data_collections:
- data_collection_tag: "gene_counts"
config:
type: "table"
metatype: "aggregate"
scan:
mode: "recursive"
scan_parameters:
regex_config:
pattern: "counts/.*\\.tsv"
dc_specific_properties:
format: "tsv"
polars_kwargs:
separator: "\t"
has_header: true
Configuration Schema¶
Project-Level Configuration¶
All projects share these top-level configuration options:
# === REQUIRED FIELDS ===
# Project identification
name: string # Required: Human-readable project name
# Must be non-empty string
# Example: "Multi-omics Cancer Study"
# Advanced projects only: Workflow definitions
workflows: [Workflow] # Default: [] (empty for basic projects)
# Array of workflow configurations
# See "Workflow Configuration" section
# === REQUIRED FIELDS WITH DEFAULT VALUES ===
# Project type determines the configuration structure
project_type: "basic" | "advanced" # Default: "basic"
# Options:
# - basic: Direct file upload/processing
# - advanced: Workflow-integrated projects
# Project visibility
is_public: boolean # Default: false
# true: Visible to all users (read-only)
# false: Restricted to project members
# CLI integration (typically auto-managed)
yaml_config_path: string | null # Default: null
# Path to this YAML configuration file
# Auto-populated by CLI, rarely set manually
# === OPTIONAL FIELDS ===
# External project management integration
data_management_platform_project_url: string | null # Default: null
# URL to external project system
# Must start with http:// or https://
# Example: "https://labid.embl.org/projects/123"
Project Types Deep Dive¶
Basic Projects¶
Designed to be minimal and easy to set up, WebUI compatible, and suitable for small-scale analyses.
Use cases:
- Direct CSV/Excel file analysis
- Ad-hoc data exploration
- Small-scale studies (< 100 files)
- Quick prototyping and visualization
Configuration:
- Minimal setup required
- Data uploaded via web interface or defined in `data_collections`
- No workflow integration needed (a default workflow is created under the hood for system compatibility)
Advanced Projects¶
Designed for complex data processing pipelines, automated workflows, and large-scale analyses.
Use cases:
- Bioinformatics pipeline outputs
- Multi-sample studies
- Automated data ingestion and updates
- Core facility workflows
Configuration:
- Requires `workflows` and `data_collections` definitions (each workflow contains one or more data collections; see the skeleton after this list)
- CLI-driven data processing
- Regex-based file discovery
- Multi-run aggregation capabilities
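A minimal structural skeleton of an advanced project, showing how each workflow owns its data collections (names, tags, and paths are placeholders):
name: "Advanced Project Skeleton"
project_type: "advanced"
workflows:
  - name: "pipeline_a"
    engine:
      name: "nextflow"
    data_location:
      structure: "flat"
      locations:
        - "{DATA_ROOT}/pipeline_a/results"
    data_collections:
      - data_collection_tag: "dc_one"
        config: {}   # see "Data Collections Configuration"
      - data_collection_tag: "dc_two"
        config: {}   # see "Data Collections Configuration"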
Workflow Configuration (Advanced)¶
Advanced projects use workflows to describe data organization patterns. Each workflow corresponds to a computational pipeline that generates structured data.
workflows:
- # === REQUIRED FIELDS ===
name: string # Required: Workflow identifier
# Must be non-empty
# Example: "rnaseq_pipeline", "variant_calling"
engine: # Required: Execution engine information
name: string # Required: Engine name
# Examples: "nextflow", "snakemake", "python",
# "galaxy", "cwl", "shell", "r"
# Note: currently not validated against a list
version: string | null # Optional: Engine version for reproducibility
# Example: "24.10.3", "7.32.0"
# Note: version is currently stored for documentation purposes only; it has no functional effect yet (planned for a future release)
data_location: # Required: Where to find workflow outputs
structure: string # Required: Directory organization pattern
# Options:
# - "flat": All files in single directory level
# - "sequencing-runs": Hierarchical run-based structure
locations: [string] # Required: Root directories to search
# Supports environment variable expansion: {VAR_NAME}
# Examples:
# - "/absolute/path/to/data"
# - "{DATA_ROOT}/project1"
# - "{HOME}/workflows/results"
runs_regex: string | null # Required if structure="sequencing-runs"
# Optional if structure="flat"
# Regex pattern to identify individual runs
# Example: "run_\\d+", "sample_.*", "batch[A-Z]"
data_collections: [DataCollection] # Required: Data collection definitions
# Array of data collections for this workflow
# See "Data Collections Configuration"
# === OPTIONAL FIELDS ===
version: string | null # Optional: Workflow version
# Example: "1.0.0", "v2.3.1"
# Note: version is currently stored for documentation purposes only; it has no functional effect yet (planned for a future release)
catalog: # Optional: Workflow registry information
name: string | null # Options: "nf-core", "smk-wf-catalog", "workflowhub"
url: string | null # Catalog URL
# Example: "https://nf-co.re/rnaseq"
repository_url: string | null # Optional: Source code repository
# Example: "https://github.com/user/workflow"
workflow_tag: string | null # Optional: Auto-generated workflow identifier
# Format: "engine/name" or "catalog/name"
# Usually auto-populated, rarely set manually
config: # Optional: Workflow-specific configuration
version: string | null # Workflow configuration version
workflow_parameters: object | null # Workflow-specific parameters
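As a concrete illustration of the schema above, a filled-in workflow entry might look like this sketch (the catalog and repository values reuse the nf-core example from the field descriptions; adjust them to your own pipeline):
workflows:
  - name: "rnaseq_pipeline"
    version: "1.0.0"
    engine:
      name: "nextflow"
      version: "24.10.3"
    catalog:
      name: "nf-core"
      url: "https://nf-co.re/rnaseq"
    repository_url: "https://github.com/nf-core/rnaseq"
    data_location:
      structure: "sequencing-runs"
      locations:
        - "{DATA_ROOT}/rnaseq/results"
      runs_regex: "run_\\d+"
    data_collections: []   # see "Data Collections Configuration"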
Workflow Data Location Patterns¶
Flat Structure¶
data_location:
structure: "flat"
locations:
- "/data/project1/results"
- "{BACKUP_LOCATION}/project1" # Environment variable expansion
# runs_regex not needed for flat structure
# Directory layout:
# /data/project1/results/
# ├── sample1_stats.csv
# ├── sample2_stats.csv
# ├── sample1_counts.tsv
# └── sample2_counts.tsv
Sequencing-Runs Structure¶
data_location:
structure: "sequencing-runs"
locations:
- "{DATA_ROOT}/rnaseq_study"
runs_regex: "run_\\d+" # Required: matches run_001, run_002, etc.
# Directory layout:
# ${DATA_ROOT}/rnaseq_study/
# ├── run_001/
# │ ├── sample_A/
# │ │ ├── stats.tsv
# │ │ └── counts.tsv
# │ └── sample_B/
# │ ├── stats.tsv
# │ └── counts.tsv
# └── run_002/
# ├── sample_C/
# │ ├── stats.tsv
# │ └── counts.tsv
# └── sample_D/
# ├── stats.tsv
# └── counts.tsv
Environment Variable Expansion¶
Depictio supports environment variable expansion in file paths:
# Environment setup
# export DATA_ROOT="/mnt/storage/projects"
# export PROJECT_NAME="cancer_study"
# export BACKUP_LOCATION="/backup/data"
data_location:
locations:
- "{DATA_ROOT}/{PROJECT_NAME}/results" # Expands to: /mnt/storage/projects/cancer_study/results
- "{BACKUP_LOCATION}/{PROJECT_NAME}" # Expands to: /backup/data/cancer_study
Common Environment Variables:
- `DATA_ROOT`, `DATA_LOCATION` - Primary data storage
- `PROJECT_ROOT` - Project base directory
- `HOME`, `USER` - User-specific paths
- `SCRATCH_DIR`, `TEMP_DIR` - Temporary storage locations
Data Collections Configuration¶
Data collections define how to discover, process, and structure your data files. They are the core building blocks that connect file system data to Depictio dashboards.
data_collections:
- # === REQUIRED FIELDS ===
data_collection_tag: string # Required: Unique identifier within project
# Must be unique across all data collections
# Used for referencing in links and dashboards
# Example: "gene_counts", "quality_metrics"
config: # Required: Data collection configuration
type: string # Required: Data collection type
# Options (only table for now):
# - "table": Tabular data (CSV, TSV, Excel, Parquet, Feather)
metatype: string | null # Required for table type
# Required: Data collection metatype
# Options:
# - "metadata": Single annotation/metadata file per project
# - "aggregate": Multiple files combined into unified dataset
scan: # Required: File discovery configuration
mode: string # Required: Scanning strategy
# Options:
# - "single": Single file per project/run
# - "recursive": Pattern-based file discovery
scan_parameters: # Required: Mode-specific parameters
# For mode: "single"
filename: string # Required: Relative or absolute file path
# Example: "metadata/sample_info.csv"
# For mode: "recursive"
regex_config: # Required: Pattern matching configuration
pattern: string # Required: Regex pattern for file discovery
# Example: "stats/.*_stats\\.tsv"
wildcards: [...] # Optional: Named capture groups for metadata extraction
max_depth: int | null # Optional: Maximum directory depth to search
ignore: [string] | null # Optional: Patterns to exclude from search
dc_specific_properties: # Required: Type-specific configuration
# See "Table Configuration" section for details
# === OPTIONAL FIELDS ===
description: string | null # Optional: Human-readable description
# Example: "Per-sample quality control metrics"
Specific Data Collection Types¶
Data Collection Types
Data collections can be of different types, each with its own configuration requirements. Currently, only the "table" type is supported, which handles structured tabular data. Future versions may introduce additional types for other data formats (e.g., Omics data, Images, GeoJSON).
Table Data Collection Configuration¶
Table data collections handle structured tabular data (CSV, TSV, Excel, Parquet, Feather):
dc_specific_properties:
# === REQUIRED FIELDS ===
format: string # Required: File format
# Values: "csv", "tsv", "xlsx", "xls", "parquet", "feather"
# Case-insensitive, normalized to lowercase
polars_kwargs: object # Required: Polars DataFrame configuration
# Polars-specific parameters for data reading
# See "Polars Configuration" section
# === OPTIONAL FIELDS ===
keep_columns: [string] | null # Optional: Column filtering
# If specified, only these columns are retained
# Improves performance for large datasets
# Example: ["sample_id", "expression", "p_value"]
columns_description: {string: string} | null # Optional: Column documentation
# Human-readable column descriptions
# Used in dashboard tooltips and documentation
# Example:
# sample_id: "Unique sample identifier"
# expression: "Log2 expression level"
MultiQC Data Collection Configuration (v0.5.0+)¶
MultiQC data collections provide specialized handling for quality control reports with automatic detection:
config:
# === REQUIRED FIELD ===
type: "MultiQC" # Required: Identifies this as a MultiQC data collection
# === AUTOMATIC HANDLING ===
# The following are NOT required for MultiQC type:
# - metatype: Automatically handled
# - scan: Auto-detects multiqc_data/multiqc.parquet in each run
# - dc_specific_properties: Not needed
# === REQUIREMENTS ===
# - MultiQC 1.29+ must be used to generate parquet format
# - Each run directory must contain: multiqc_data/multiqc.parquet
Linking MultiQC with Metadata (Recommended):
Use links for interactive filtering between metadata tables and MultiQC:
# At project level - enables cross-DC filtering
links:
- source_dc_id: sample_metadata # Filter from metadata table
source_column: sample_id # Filter by sample ID
target_dc_id: multiqc_data # Update MultiQC visualizations
target_type: multiqc # Target type
link_config:
resolver: sample_mapping # Maps canonical IDs to MultiQC sample variants
MultiQC-Specific Notes:
- Automatic Detection: No need to specify file paths or patterns
- Fixed Location: The system looks for `multiqc_data/multiqc.parquet` in each run
- Format Requirement: Requires MultiQC 1.29+ to generate `.parquet` output
- No Configuration Overhead: Just specify `type: "MultiQC"` and you're done
- Link Column: Use the `sample_mapping` resolver in links to connect with metadata tables
Example Directory Structure:
project_root/
├── run_001/
│ └── multiqc_data/
│ └── multiqc.parquet # Auto-detected
├── run_002/
│ └── multiqc_data/
│ └── multiqc.parquet # Auto-detected
└── run_003/
└── multiqc_data/
└── multiqc.parquet # Auto-detected
Polars Configuration Options¶
Polars is the high-performance DataFrame library Depictio uses for data processing. Configure data reading with these options:
polars_kwargs:
# === COMMON OPTIONS ===
# CSV/TSV specific
separator: string # Column separator character
# Default: "," for CSV, "\t" for TSV
# Example: ",", "\t", "|", ";"
has_header: boolean # First row contains column names
# Default: true
# Set false if first row is data
skip_rows: int # Number of rows to skip at file beginning
# Default: 0
# Useful for files with metadata headers
# Example: 3 (skip first 3 lines)
# === ADVANCED OPTIONS ===
# Data types
column_types: object # Explicit column type mapping
# Forces specific data types
# Example:
# sample_id: "String"
# count: "Int64"
# p_value: "Float64"
# significant: "Boolean"
column_names: [string] # Override column names
# Useful when has_header: false
# Example: ["sample", "gene", "expression"]
# Missing data handling
null_values: [string] # Values to treat as null/missing
# Default: ["", "NULL", "null", "None"]
# Example: ["NA", "N/A", "", "null", "-"]
# Performance options
n_rows: int # Limit number of rows to read
# Useful for testing configurations
# Example: 1000 (read only first 1000 rows)
# Encoding
encoding: string # File encoding
# Default: "utf8"
# Example: "utf8", "latin1", "ascii"
# Excel specific (for .xlsx, .xls files)
sheet_name: string # Excel sheet name to read
# Default: first sheet
# Example: "Results", "Sheet1"
sheet_id: int # Excel sheet index (0-based)
# Alternative to sheet_name
# Example: 0 (first sheet), 1 (second sheet)
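Combining several of the options above, a sketch for a headerless, pipe-separated export with explicit column types and missing-value handling (column names and types are illustrative):
polars_kwargs:
  separator: "|"
  has_header: false
  skip_rows: 2
  column_names: ["sample", "gene", "expression"]
  column_types:
    sample: "String"
    gene: "String"
    expression: "Float64"
  null_values: ["NA", "N/A", "", "-"]
  n_rows: 1000   # limit rows while testing the configuration, then remove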
File Scanning Patterns¶
Single File Mode¶
Best for metadata files or summary statistics generated once per project:
scan:
mode: "single"
scan_parameters:
filename: "multiqc_data/multiqc_general_stats.txt"
# Finds exactly one file:
# project_root/multiqc_data/multiqc_general_stats.txt
Recursive Mode¶
Uses regex patterns to discover files across directory structures:
scan:
mode: "recursive"
scan_parameters:
regex_config:
pattern: "star_salmon/.*/quant.sf"
# Wildcards for metadata extraction (optional)
wildcards:
- name: "sample_id"
wildcard_regex: "star_salmon/([^/]+)/quant.sf"
max_depth: 5 # Optional: limit search depth
# Matches files like:
# run_001/star_salmon/sample_A/quant.sf
# run_001/star_salmon/sample_B/quant.sf
# run_002/star_salmon/sample_C/quant.sf
Cross-DC Links (Interactive Filtering)¶
Links enable cross-DC filtering at runtime—filter one data collection and automatically update related visualizations without pre-computed joins.
# Project-level links configuration
links:
- source_dc_id: string # Required: DC containing the filter
source_column: string # Required: Column to filter on
target_dc_id: string # Required: DC to receive filtered values
target_type: string # Required: "table" or "multiqc"
link_config:
resolver: string # Required: "direct", "sample_mapping", or "pattern"
target_field: string # Optional: Field to match in target DC
Resolvers:
| Resolver | Use Case |
|---|---|
| `direct` | Same value in both DCs |
| `sample_mapping` | Canonical ID → MultiQC sample variants |
| `pattern` | Template substitution (e.g., `{sample}.bam`) |
Example:
name: "My Project"
project_type: "advanced"
links:
- source_dc_id: sample_metadata
source_column: sample_id
target_dc_id: multiqc_fastqc
target_type: multiqc
link_config:
resolver: sample_mapping
workflows:
# ... workflow configuration ...
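As a further sketch (not taken from the reference above): when the identifier lives under a different column name in the target DC, target_field can name that column. This assumes target_field designates the target column to match, per its description in the schema; check the Cross-DC Filtering documentation for the exact semantics.
links:
  - source_dc_id: sample_metadata
    source_column: sample_id
    target_dc_id: alignment_stats      # hypothetical target data collection
    target_type: table
    link_config:
      resolver: direct
      target_field: sample             # assumed: column in the target DC holding the same identifiers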
For detailed documentation, see Cross-DC Filtering.
Data Collection Joins (Client-Side Pre-computed)¶
Joins vs Links
Joins and Links serve different purposes:
- Joins: Combine Table DCs into a single pre-computed view during `depictio-cli run`. The joined dataset is pushed to the server as one unified Delta table. No dynamic joining happens on the server.
- Links: Enable runtime cross-DC filtering in the dashboard UI. Data collections remain separate; filtering happens dynamically.
Joins are processed client-side when running depictio-cli and create a merged view that gets uploaded to the server. They only work with Table-type data collections.
# Join configuration (in data collection)
join:
on_columns: [string] # Column names for joining
how: string # "inner", "outer", "left", "right"
with_dc: [string] # Target data collection tags to join with
Example: Joining sample metadata with expression data
data_collections:
- data_collection_tag: "sample_metadata"
config:
type: "table"
metatype: "metadata"
# ... scan and dc_specific_properties ...
- data_collection_tag: "expression_with_metadata"
config:
type: "table"
metatype: "aggregate"
# ... scan and dc_specific_properties ...
join:
on_columns: ["sample_id"]
how: "left"
with_dc: ["sample_metadata"]
When you run depictio-cli run, the CLI will:
- Load both data collections locally
- Perform the join operation on the client
- Push the resulting joined table to the server
When to use Joins vs Links:
| Use Case | Recommended |
|---|---|
| Need a single combined dataset for analysis | Joins |
| Want to filter one DC based on another in UI | Links |
| Working with MultiQC data | Links (joins don't support MultiQC) |
| Need to reduce server-side data duplication | Joins |
| Want dynamic, runtime filtering | Links |
Configuration Patterns Library¶
Pattern 1: Multi-sample RNA-seq Study¶
name: "RNA-seq Expression Analysis"
project_type: "advanced"
is_public: false
# Cross-DC Links for interactive filtering
links:
- source_dc_id: qc_summary # Filter from QC data
source_column: Sample # Column to filter on
target_dc_id: salmon_gene_tpm # Target expression data
target_type: table # Target type
link_config:
resolver: direct # Same sample IDs in both DCs
workflows:
- name: "nextflow-custom-rnaseq"
engine:
name: "nextflow"
version: "24.10.3"
version: "3.18.0"
data_location:
structure: "sequencing-runs"
locations:
- "{DATA_ROOT}/rnaseq_studies/cohort_2024"
runs_regex: "batch_[A-C]"
data_collections:
# Quality control metrics
- data_collection_tag: "qc_summary"
description: "Aggregated quality control summary across samples"
config:
type: "table"
metatype: "metadata"
scan:
mode: "single"
scan_parameters:
filename: "qc_reports/multiqc_general_stats.txt"
dc_specific_properties:
format: "tsv"
polars_kwargs:
separator: "\t"
has_header: true
keep_columns:
- "Sample"
- "fastqc-total_sequences"
- "fastqc-percent_duplicates"
- "fastqc-percent_gc"
- "fastqc-avg_sequence_length"
- "fastqc-percent_fails"
columns_description:
"Sample": "Sample identifier"
"fastqc-total_sequences": "Total number of sequences processed by FastQC"
"fastqc-percent_duplicates": "Percentage of duplicate sequences"
"fastqc-percent_gc": "Overall GC content percentage"
"fastqc-avg_sequence_length": "Average read length"
"fastqc-percent_fails": "Percentage of FastQC modules that failed"
# Gene expression quantification
- data_collection_tag: "salmon_gene_tpm"
description: "Salmon merged gene-level TPM expression matrix"
config:
type: "table"
metatype: "metadata"
scan:
mode: "single"
scan_parameters:
filename: "salmon/salmon.gene_tpm.melted.tsv"
dc_specific_properties:
format: "tsv"
polars_kwargs:
separator: "\t"
has_header: true
columns_description:
sample_id: "Sample identifier"
gene_id: "Gene identifier"
gene_name: "Gene symbol/name"
condition: "Experimental condition (e.g., treatment, control)"
replicate: "Biological replicate identifier"
tpm: "Transcripts per million (TPM) expression value"
# Note: Use project-level links (above) for interactive filtering
Pattern 2: Multi-sample Strand-seq (single-cell) Study¶
name: "Strand-Seq data analysis"
project_type: "advanced"
data_management_platform_project_url: "https://labid.embl.org/core/projects/default/5baa8f07-bd00-46e7-b3cb-ec79d01f6f3c"
# Cross-DC Links for interactive filtering
links:
- source_dc_id: mosaicatcher_samples_metadata
source_column: sample
target_dc_id: mosaicatcher_stats
target_type: table
link_config:
resolver: direct
- source_dc_id: mosaicatcher_samples_metadata
source_column: sample
target_dc_id: ashleys_labels
target_type: table
link_config:
resolver: direct
- source_dc_id: mosaicatcher_samples_metadata
source_column: sample
target_dc_id: sv_calls
target_type: table
link_config:
resolver: direct
workflows:
- name: "mosaicatcher-pipeline"
engine:
name: "snakemake"
version: "2.3.5"
catalog:
name: "smk-wf-catalog"
url: "https://snakemake.github.io/snakemake-workflow-catalog/docs/workflows/friendsofstrandseq/mosaicatcher-pipeline.html"
repository_url: "https://github.com/friendsofstrandseq/mosaicatcher-pipeline"
workflow_tag: "snakemake/mosaicatcher-pipeline"
data_location:
structure: "sequencing-runs"
locations:
- "/Data/mosaicatcher-pipeline-results/"
runs_regex: ".*"
data_collections:
# MosaiCatcher statistics per cell
- data_collection_tag: "mosaicatcher_stats"
description: "Statistics file generated by MosaiCatcher"
config:
type: "table"
metatype: "aggregate"
scan:
mode: "recursive"
scan_parameters:
regex_config:
pattern: ".*\\.info_raw"
dc_specific_properties:
format: "tsv"
polars_kwargs:
skip_rows: 13
separator: "\t"
has_header: true
keep_columns:
- "sample"
- "cell"
- "mapped"
- "dupl"
- "pass1"
- "good"
columns_description:
sample: "Sample ID"
cell: "Cell ID"
mapped: "Total number of reads seen"
dupl: "Reads filtered out as PCR duplicates"
pass1: "Coverage compliant cells (binary)"
good: "Reads used for counting"
# Ashleys QC labels
- data_collection_tag: "ashleys_labels"
description: "Probabilities generated by ashleys-qc model"
config:
type: "table"
metatype: "aggregate"
scan:
mode: "recursive"
scan_parameters:
regex_config:
pattern: ".*cell_selection/labels\\.tsv"
dc_specific_properties:
format: "tsv"
polars_kwargs:
separator: "\t"
has_header: true
# Note: Use project-level links (above) for interactive filtering
# Structural variant calls
- data_collection_tag: "sv_calls"
description: "SV calls generated by MosaiCatcher (stringent)"
config:
type: "table"
metatype: "aggregate"
scan:
mode: "recursive"
scan_parameters:
regex_config:
pattern: "stringent_filterTRUE\\.tsv"
dc_specific_properties:
format: "tsv"
polars_kwargs:
separator: "\t"
has_header: true
columns_description:
sample: "Sample identifier"
cell: "Single cell identifier"
chrom: "Chromosome name"
start: "SV start position"
end: "SV end position"
sv_call_name: "Structural variant call identifier"
# Note: Use project-level links (above) for interactive filtering
# Sample metadata
- data_collection_tag: "mosaicatcher_samples_metadata"
description: "Metadata file for MosaiCatcher samples"
config:
type: "table"
metatype: "metadata"
scan:
mode: "single"
scan_parameters:
filename: "/Data/mosaicatcher-pipeline-results/metadata/mosaicatcher_samples_metadata_2024.xlsx"
dc_specific_properties:
format: "xlsx"
polars_kwargs:
has_header: true
columns_description:
sample: "Sample identifier"
patient_id: "Patient identifier"
tissue_type: "Type of tissue analyzed"
collection_date: "Date of sample collection"
# Note: Use project-level links (above) for interactive filtering
Pattern 3: MultiQC Quality Control Integration (v0.5.0+)¶
name: "MultiQC Quality Control Analysis"
project_type: "advanced"
# Cross-DC Links for interactive filtering
links:
- source_dc_id: sample_metadata # Filter from metadata table
source_column: sample # Column to filter on
target_dc_id: multiqc_data # Target MultiQC visualizations
target_type: multiqc # Target type
link_config:
resolver: sample_mapping # Maps canonical IDs to MultiQC sample variants
workflows:
- name: "qc-pipeline"
engine:
name: "python"
data_location:
structure: "sequencing-runs"
runs_regex: "run_*"
locations:
- "{DATA_ROOT}/qc_results"
data_collections:
# MultiQC data collection - automatically detected
- data_collection_tag: "multiqc_data"
description: "MultiQC quality control report data"
config:
type: "MultiQC"
# NOTE: No scan, metatype, or dc_specific_properties needed
# MultiQC automatically detects multiqc_data/multiqc.parquet in each run
# Requires MultiQC 1.29+ to generate parquet format
# Sample metadata for context
- data_collection_tag: "sample_metadata"
description: "Sample metadata table for MultiQC integration"
config:
type: "Table"
metatype: "Metadata"
scan:
mode: "single"
scan_parameters:
filename: "metadata/sample_info.csv"
dc_specific_properties:
format: "CSV"
polars_kwargs:
separator: ","
has_header: true
columns_description:
"sample": "Sample identifier matching MultiQC sample names"
"treatment": "Treatment condition applied to the sample"
"batch": "Batch identifier for experimental runs"
# Note: Use project-level links (above) for interactive filtering
# instead of join configuration
# Additional per-sample QC metrics
- data_collection_tag: "sample_qc_metrics"
description: "Per-sample quality control metrics and statistics"
config:
type: "Table"
metatype: "Aggregate"
scan:
mode: "recursive"
scan_parameters:
regex_config:
pattern: "qc_metrics/.*\\.csv"
dc_specific_properties:
format: "CSV"
polars_kwargs:
separator: ","
has_header: true
columns_description:
"sample": "Sample identifier"
"total_reads": "Total number of reads"
"mapped_reads": "Number of successfully mapped reads"
"mapping_rate": "Percentage of reads mapped to reference"
# Note: Use project-level links (above) for interactive filtering
Key Points for MultiQC Integration:
- Automatic Detection: MultiQC data is automatically detected from `multiqc_data/multiqc.parquet` in each run
- Minimal Configuration: Only requires `type: "MultiQC"`; no scan parameters needed
- Format Requirement: Requires MultiQC 1.29+ to generate parquet output format
- Cross-DC Filtering: Use project-level links with the `sample_mapping` resolver to connect metadata tables with MultiQC visualizations
- Location: Each run must contain a `multiqc_data/multiqc.parquet` file generated by MultiQC
🔍 Validation¶
CLI Validation Commands¶
# Validate configuration file syntax and structure
depictio-cli config validate-project-config \
--project-config-path ./my_project.yaml \
--verbose
# Check server connectivity and permissions
depictio-cli config check-server-accessibility
# Dry-run mode: validate without processing data
depictio-cli run --project-config-path ./my_project.yaml \
--dry-run --verbose
# Test file discovery patterns
depictio-cli data scan --project-config-path ./my_project.yaml \
--verbose --verbose-level DEBUG
📖 Additional Resources¶
- Projects Guide - Comprehensive project management guide
- Configuration Reference - Complete YAML reference documentation
- CLI Reference - Complete CLI command documentation
- Dashboard Creation - Building interactive dashboards
- API Documentation - Programmatic project management
- Pydantic Models - Schema definitions and validation
This reference covers all configuration options available in Depictio. Start with the Quick Start examples and gradually add complexity as needed for your specific use case.