
YAML Project Configuration Breakdown

Configuration Validation

Always validate your YAML configuration before use:

# Validate configuration syntax and structure
depictio-cli config validate-project-config \
  --project-config-path ./my_project.yaml --verbose

For YAML syntax highlighting in VS Code, install the YAML extension and save files with a .yaml or .yml extension.

Configuration Template Warning

Do not copy-paste entire configurations blindly. This reference shows all available options with their defaults and descriptions. Only specify values that differ from defaults or are required for your specific use case.

This guide provides comprehensive YAML configuration examples for Depictio projects, from simple setups to complex bioinformatics workflows. For a complete reference of all options, see the Configuration Reference.

Quick Start Examples

Choose your starting point based on your project complexity:

For web-based basic projects, perfect for direct file upload and analysis:

name: "My Analysis Project"
project_type: "basic"

# Files will be uploaded through the web interface
# No additional configuration needed!

For CLI-based basic projects with direct files:

name: "CSV Analysis Project"
project_type: "basic"
is_public: false

data_collections:
  - data_collection_tag: "main_data"
    description: "Primary dataset for analysis"
    config:
      type: "table"
      metatype: "metadata"
      scan:
        mode: "single"
        scan_parameters:
          filename: "/path/to/data.csv"
      dc_specific_properties:
        format: "csv"
        polars_kwargs:
          separator: ","
          has_header: true

For workflow-generated data with pattern matching:

name: "RNA-seq Analysis"
project_type: "advanced"

workflows:
  - name: "rnaseq_pipeline"
    engine:
      name: "nextflow"
      version: "24.10.3"
    data_location:
      structure: "sequencing-runs"
      locations:
        - "{DATA_LOCATION}/results"
      runs_regex: "run_.*"
    data_collections:
      - data_collection_tag: "gene_counts"
        config:
          type: "table"
          metatype: "aggregate"
          scan:
            mode: "recursive"
            scan_parameters:
              regex_config:
                pattern: "counts/.*\\.tsv"
          dc_specific_properties:
            format: "tsv"
            polars_kwargs:
              separator: "\t"
              has_header: true

Configuration Schema

Project-Level Configuration

All projects share these top-level configuration options:

# === REQUIRED FIELDS ===

# Project identification
name: string                              # Required: Human-readable project name
                                          # Must be non-empty string
                                          # Example: "Multi-omics Cancer Study"


# Advanced projects only: Workflow definitions
workflows: [Workflow]                     # Required for advanced projects
                                          # Default: [] (omitted for basic projects)
                                          # Array of workflow configurations
                                          # See "Workflow Configuration" section

# === REQUIRED FIELDS WITH DEFAULT VALUES ===

# Project type determines the configuration structure
project_type: "basic" | "advanced"        # Default: "basic"
                                          # Options:
                                          # - basic: Direct file upload/processing
                                          # - advanced: Workflow-integrated projects

# Project visibility
is_public: boolean                        # Default: false
                                          # true: Visible to all users (read-only)
                                          # false: Restricted to project members

# CLI integration (typically auto-managed)
yaml_config_path: string | null          # Default: null
                                          # Path to this YAML configuration file
                                          # Auto-populated by CLI, rarely set manually

# === OPTIONAL FIELDS ===

# External project management integration
data_management_platform_project_url: string | null    # Default: null
                                                        # URL to external project system
                                                        # Must start with http:// or https://
                                                        # Example: "https://labid.embl.org/projects/123"

Project Types Deep Dive

Basic Projects

Designed to be minimal, easy to set up, compatible with the web interface, and suitable for small-scale analyses.

Use cases:

  • Direct CSV/Excel file analysis
  • Ad-hoc data exploration
  • Small-scale studies (< 100 files)
  • Quick prototyping and visualization

Configuration:

  • Minimal setup required
  • Data uploaded via web interface or defined in data_collections
  • No workflow integration needed (default workflow is created under the hood for system compatibility)

Advanced Projects

Designed for complex data processing pipelines, automated workflows, and large-scale analyses.

Use cases:

  • Bioinformatics pipeline outputs
  • Multi-sample studies
  • Automated data ingestion and updates
  • Core facility workflows

Configuration:

  • Requires workflows and data_collections definitions (each workflow contains at least one data collection; see the skeleton below)
  • CLI-driven data processing
  • Regex-based file discovery
  • Multi-run aggregation capabilities
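
A minimal skeleton of the required nesting for an advanced project is shown below (names and paths are placeholders; each field follows the schema documented in the sections that follow):

name: "Example Advanced Project"
project_type: "advanced"

workflows:
  - name: "example_workflow"              # one workflow ...
    engine:
      name: "snakemake"
    data_location:
      structure: "flat"
      locations:
        - "{DATA_ROOT}/example_results"
    data_collections:                     # ... containing at least one data collection
      - data_collection_tag: "example_table"
        config:
          type: "table"
          metatype: "aggregate"
          scan:
            mode: "recursive"
            scan_parameters:
              regex_config:
                pattern: ".*\\.tsv"
          dc_specific_properties:
            format: "tsv"
            polars_kwargs:
              separator: "\t"
              has_header: true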

Workflow Configuration (Advanced)

Advanced projects use workflows to describe data organization patterns. Each workflow corresponds to a computational pipeline that generates structured data.

workflows:
  - # === REQUIRED FIELDS ===

    name: string                          # Required: Workflow identifier
                                          # Must be non-empty
                                          # Example: "rnaseq_pipeline", "variant_calling"

    engine:                               # Required: Execution engine information
      name: string                        # Required: Engine name
                                          # Examples: "nextflow", "snakemake", "python",
                                          #          "galaxy", "cwl", "shell", "r"
                                          # Note: currently not validated against a list

      version: string | null              # Optional: Engine version for reproducibility
                                          # Example: "24.10.3", "7.32.0"
                                          # Note: currently stored for documentation only; no
                                          #       functional impact yet (planned for a future release)

    data_location:                        # Required: Where to find workflow outputs
      structure: string                   # Required: Directory organization pattern
                                          # Options:
                                          # - "flat": All files in single directory level
                                          # - "sequencing-runs": Hierarchical run-based structure

      locations: [string]                 # Required: Root directories to search
                                          # Supports environment variable expansion: {VAR_NAME}
                                          # Examples:
                                          #   - "/absolute/path/to/data"
                                          #   - "{DATA_ROOT}/project1"
                                          #   - "{HOME}/workflows/results"

      runs_regex: string | null           # Required if structure="sequencing-runs"
                                          # Optional if structure="flat"
                                          # Regex pattern to identify individual runs
                                          # Example: "run_\\d+", "sample_.*", "batch[A-Z]"

    data_collections: [DataCollection]    # Required: Data collection definitions
                                          # Array of data collections for this workflow
                                          # See "Data Collections Configuration"

    # === OPTIONAL FIELDS ===

    version: string | null                # Optional: Workflow version
                                          # Example: "1.0.0", "v2.3.1"
                                          # Note: currently stored for documentation only; no
                                          #       functional impact yet (planned for a future release)

    catalog:                              # Optional: Workflow registry information
      name: string | null                 # Options: "nf-core", "smk-wf-catalog", "workflowhub"
      url: string | null                  # Catalog URL
                                          # Example: "https://nf-co.re/rnaseq"

    repository_url: string | null         # Optional: Source code repository
                                          # Example: "https://github.com/user/workflow"

    workflow_tag: string | null           # Optional: Auto-generated workflow identifier
                                          # Format: "engine/name" or "catalog/name"
                                          # Usually auto-populated, rarely set manually

    config:                               # Optional: Workflow-specific configuration
      version: string | null              # Workflow configuration version
      workflow_parameters: object | null # Workflow-specific parameters

Workflow Data Location Patterns

Flat Structure

data_location:
  structure: "flat"
  locations:
    - "/data/project1/results"
    - "{BACKUP_LOCATION}/project1"  # Environment variable expansion
  # runs_regex not needed for flat structure

# Directory layout:
# /data/project1/results/
# ├── sample1_stats.csv
# ├── sample2_stats.csv
# ├── sample1_counts.tsv
# └── sample2_counts.tsv
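
For instance, a recursive (pattern-based) scan over this flat layout could pick up only the per-sample stats files; the pattern below is illustrative:

scan:
  mode: "recursive"
  scan_parameters:
    regex_config:
      pattern: ".*_stats\\.csv"

# Would match:
# /data/project1/results/sample1_stats.csv
# /data/project1/results/sample2_stats.csv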

Sequencing-Runs Structure

data_location:
  structure: "sequencing-runs"
  locations:
    - "{DATA_ROOT}/rnaseq_study"
  runs_regex: "run_\\d+"  # Required: matches run_001, run_002, etc.

# Directory layout:
# ${DATA_ROOT}/rnaseq_study/
# ├── run_001/
# │   ├── sample_A/
# │   │   ├── stats.tsv
# │   │   └── counts.tsv
# │   └── sample_B/
# │       ├── stats.tsv
# │       └── counts.tsv
# └── run_002/
#     ├── sample_C/
#     │   ├── stats.tsv
#     │   └── counts.tsv
#     └── sample_D/
#         ├── stats.tsv
#         └── counts.tsv

Environment Variable Expansion

Depictio supports environment variable expansion in file paths:

# Environment setup
# export DATA_ROOT="/mnt/storage/projects"
# export PROJECT_NAME="cancer_study"
# export BACKUP_LOCATION="/backup/data"

data_location:
  locations:
    - "{DATA_ROOT}/{PROJECT_NAME}/results"     # Expands to: /mnt/storage/projects/cancer_study/results
    - "{BACKUP_LOCATION}/{PROJECT_NAME}"       # Expands to: /backup/data/cancer_study

Common Environment Variables:

  • DATA_ROOT, DATA_LOCATION - Primary data storage
  • PROJECT_ROOT - Project base directory
  • HOME, USER - User-specific paths
  • SCRATCH_DIR, TEMP_DIR - Temporary storage locations

Data Collections Configuration

Data collections define how to discover, process, and structure your data files. They are the core building blocks that connect file system data to Depictio dashboards.

data_collections:
  - # === REQUIRED FIELDS ===

    data_collection_tag: string    # Required: Unique identifier within project
                                  # Must be unique across all data collections
                                  # Used for referencing in links and dashboards
                                  # Example: "gene_counts", "quality_metrics"

    config:                     # Required: Data collection configuration
      type: string                # Required: Data collection type
                                  # Options:
                                  #  - "table": Tabular data (CSV, TSV, Excel, Parquet, Feather)
                                  #  - "MultiQC": MultiQC report data (v0.5.0+, see below)

      metatype: string | null    # Data collection metatype
                                  # Required for "table" type
                                  # Options:
                                  # - "metadata": Single annotation/metadata file per project
                                  # - "aggregate": Multiple files combined into unified dataset

      scan:                     # Required: File discovery configuration
        mode: string              # Required: Scanning strategy
                                  # Options:
                                  # - "single": Single file per project/run
                                  # - "recursive": Pattern-based file discovery

        scan_parameters:        # Required: Mode-specific parameters
          # For mode: "single"
          filename: string      # Required: Relative or absolute file path
                               # Example: "metadata/sample_info.csv"

          # For mode: "recursive"
          regex_config:         # Required: Pattern matching configuration
            pattern: string    # Required: Regex pattern for file discovery
                              # Example: "stats/.*_stats\\.tsv"
            wildcards: [...]   # Optional: Named capture groups for metadata extraction
          max_depth: int | null # Optional: Maximum directory depth to search
          ignore: [string] | null  # Optional: Patterns to exclude from search

      dc_specific_properties:   # Required: Type-specific configuration
        # See "Table Configuration" section for details

    # === OPTIONAL FIELDS ===

    description: string | null  # Optional: Human-readable description
                               # Example: "Per-sample quality control metrics"

Specific Data Collection Types

Data Collection Types

Data collections can be of different types, each with its own configuration requirements. Currently, the "table" type (structured tabular data) and the "MultiQC" type (v0.5.0+, see below) are supported. Future versions may introduce additional types for other data formats (e.g., Omics data, Images, GeoJSON).

Table Data Collection Configuration

Table data collections handle structured tabular data (CSV, TSV, Excel, Parquet, Feather):

dc_specific_properties:
  # === REQUIRED FIELDS ===

  format: string               # Required: File format
                              # Values: "csv", "tsv", "xlsx", "xls", "parquet", "feather"
                              # Case-insensitive, normalized to lowercase

  polars_kwargs: object       # Required: Polars DataFrame configuration
                             # Polars-specific parameters for data reading
                             # See "Polars Configuration" section

  # === OPTIONAL FIELDS ===

  keep_columns: [string] | null    # Optional: Column filtering
                                  # If specified, only these columns are retained
                                  # Improves performance for large datasets
                                  # Example: ["sample_id", "expression", "p_value"]

  columns_description: {string: string} | null  # Optional: Column documentation
                                               # Human-readable column descriptions
                                               # Used in dashboard tooltips and documentation
                                               # Example:
                                               #   sample_id: "Unique sample identifier"
                                               #   expression: "Log2 expression level"

MultiQC Data Collection Configuration (v0.5.0+)

MultiQC data collections provide specialized handling for quality control reports with automatic detection:

config:
  # === REQUIRED FIELD ===

  type: "MultiQC"          # Required: Identifies this as a MultiQC data collection

  # === AUTOMATIC HANDLING ===
  # The following are NOT required for MultiQC type:
  # - metatype: Automatically handled
  # - scan: Auto-detects multiqc_data/multiqc.parquet in each run
  # - dc_specific_properties: Not needed

  # === REQUIREMENTS ===
  # - MultiQC 1.29+ must be used to generate parquet format
  # - Each run directory must contain: multiqc_data/multiqc.parquet

Linking MultiQC with Metadata (Recommended):

Use links for interactive filtering between metadata tables and MultiQC:

# At project level - enables cross-DC filtering
links:
  - source_dc_id: sample_metadata    # Filter from metadata table
    source_column: sample_id          # Filter by sample ID
    target_dc_id: multiqc_data        # Update MultiQC visualizations
    target_type: multiqc              # Target type
    link_config:
      resolver: sample_mapping        # Maps canonical IDs to MultiQC sample variants

MultiQC-Specific Notes:

  • Automatic Detection: No need to specify file paths or patterns
  • Fixed Location: System looks for multiqc_data/multiqc.parquet in each run
  • Format Requirement: Requires MultiQC 1.29+ to generate .parquet output
  • No Configuration Overhead: Just specify type: "MultiQC" and you're done
  • Link Column: Use sample_mapping resolver in links to connect with metadata tables

Example Directory Structure:

project_root/
├── run_001/
│   └── multiqc_data/
│       └── multiqc.parquet    # Auto-detected
├── run_002/
│   └── multiqc_data/
│       └── multiqc.parquet    # Auto-detected
└── run_003/
    └── multiqc_data/
        └── multiqc.parquet    # Auto-detected

Polars Configuration Options

Polars is Depictio's high-performance data processing engine. Configure data reading with these options:

polars_kwargs:
  # === COMMON OPTIONS ===

  # CSV/TSV specific
  separator: string           # Column separator character
                             # Default: "," for CSV, "\t" for TSV
                             # Example: ",", "\t", "|", ";"

  has_header: boolean        # First row contains column names
                            # Default: true
                            # Set false if first row is data

  skip_rows: int            # Number of rows to skip at file beginning
                           # Default: 0
                           # Useful for files with metadata headers
                           # Example: 3 (skip first 3 lines)

  # === ADVANCED OPTIONS ===

  # Data types
  column_types: object      # Explicit column type mapping
                           # Forces specific data types
                           # Example:
                           #   sample_id: "String"
                           #   count: "Int64"
                           #   p_value: "Float64"
                           #   significant: "Boolean"

  column_names: [string]    # Override column names
                           # Useful when has_header: false
                           # Example: ["sample", "gene", "expression"]

  # Missing data handling
  null_values: [string]     # Values to treat as null/missing
                           # Default: ["", "NULL", "null", "None"]
                           # Example: ["NA", "N/A", "", "null", "-"]

  # Performance options
  n_rows: int               # Limit number of rows to read
                           # Useful for testing configurations
                           # Example: 1000 (read only first 1000 rows)

  # Encoding
  encoding: string          # File encoding
                           # Default: "utf8"
                           # Example: "utf8", "latin1", "ascii"

  # Excel specific (for .xlsx, .xls files)
  sheet_name: string        # Excel sheet name to read
                           # Default: first sheet
                           # Example: "Results", "Sheet1"

  sheet_id: int            # Excel sheet index (0-based)
                          # Alternative to sheet_name
                          # Example: 0 (first sheet), 1 (second sheet)
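
As a combined sketch, the options above can be mixed for a TSV file that carries two metadata lines before its header (column names and types are illustrative only):

polars_kwargs:
  separator: "\t"
  has_header: true
  skip_rows: 2                      # skip the two metadata lines before the header
  null_values: ["NA", "N/A", "-"]
  column_types:
    sample_id: "String"
    count: "Int64"
    p_value: "Float64"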

File Scanning Patterns

Single File Mode

Best for metadata files or summary statistics generated once per project:

scan:
  mode: "single"
  scan_parameters:
    filename: "multiqc_data/multiqc_general_stats.txt"

# Finds exactly one file:
# project_root/multiqc_data/multiqc_general_stats.txt

Recursive Mode

Uses regex patterns to discover files across directory structures:

scan:
  mode: "recursive"
  scan_parameters:
    regex_config:
      pattern: "star_salmon/.*/quant.sf"
      # Wildcards for metadata extraction (optional)
      wildcards:
        - name: "sample_id"
          wildcard_regex: "star_salmon/([^/]+)/quant.sf"
    max_depth: 5        # Optional: limit search depth

# Matches files like:
# run_001/star_salmon/sample_A/quant.sf
# run_001/star_salmon/sample_B/quant.sf
# run_002/star_salmon/sample_C/quant.sf
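
As a rough illustration of what the named wildcard captures from each match (whether and how the captured value is surfaced as extracted metadata is an assumption; the files are the matches listed above):

# Matched file                               Captured sample_id (hypothetical illustration)
# run_001/star_salmon/sample_A/quant.sf  ->  "sample_A"
# run_001/star_salmon/sample_B/quant.sf  ->  "sample_B"
# run_002/star_salmon/sample_C/quant.sf  ->  "sample_C"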

Data Collection Links (Runtime Cross-DC Filtering)

Links enable cross-DC filtering at runtime: filter one data collection and related visualizations update automatically, without pre-computed joins.

# Project-level links configuration
links:
  - source_dc_id: string      # Required: DC containing the filter
    source_column: string     # Required: Column to filter on
    target_dc_id: string      # Required: DC to receive filtered values
    target_type: string       # Required: "table" or "multiqc"
    link_config:
      resolver: string        # Required: "direct", "sample_mapping", or "pattern"
      target_field: string    # Optional: Field to match in target DC

Resolvers:

Resolver         Use case
direct           Same value in both DCs
sample_mapping   Canonical ID → MultiQC sample variants
pattern          Template substitution (e.g., {sample}.bam)

Example:

name: "My Project"
project_type: "advanced"

links:
  - source_dc_id: sample_metadata
    source_column: sample_id
    target_dc_id: multiqc_fastqc
    target_type: multiqc
    link_config:
      resolver: sample_mapping

workflows:
  # ... workflow configuration ...

For detailed documentation, see Cross-DC Filtering.

Data Collection Joins (Client-Side Pre-computed)

Joins vs Links

Joins and Links serve different purposes:

  • Joins: Combine Table DCs into a single pre-computed view during depictio-cli run. The joined dataset is pushed to the server as one unified Delta table. No dynamic joining happens on the server.
  • Links: Enable runtime cross-DC filtering in the dashboard UI. Data collections remain separate; filtering happens dynamically.

Joins are processed client-side when running depictio-cli and create a merged view that gets uploaded to the server. They only work with Table-type data collections.

# Join configuration (in data collection)
join:
  on_columns: [string]        # Column names for joining
  how: string                 # "inner", "outer", "left", "right"
  with_dc: [string]           # Target data collection tags to join with

Example: Joining sample metadata with expression data

data_collections:
  - data_collection_tag: "sample_metadata"
    config:
      type: "table"
      metatype: "metadata"
      # ... scan and dc_specific_properties ...

  - data_collection_tag: "expression_with_metadata"
    config:
      type: "table"
      metatype: "aggregate"
      # ... scan and dc_specific_properties ...
    join:
      on_columns: ["sample_id"]
      how: "left"
      with_dc: ["sample_metadata"]

When you run depictio-cli run, the CLI will:

  1. Load both data collections locally
  2. Perform the join operation on the client
  3. Push the resulting joined table to the server

When to use Joins vs Links:

Use case                                         Recommended
Need a single combined dataset for analysis      Joins
Want to filter one DC based on another in UI     Links
Working with MultiQC data                        Links (joins don't support MultiQC)
Need to reduce server-side data duplication      Joins
Want dynamic, runtime filtering                  Links

Configuration Patterns Library

Pattern 1: Multi-sample RNA-seq Study

name: "RNA-seq Expression Analysis"
project_type: "advanced"
is_public: false

# Cross-DC Links for interactive filtering
links:
  - source_dc_id: qc_summary            # Filter from QC data
    source_column: Sample               # Column to filter on
    target_dc_id: salmon_gene_tpm       # Target expression data
    target_type: table                  # Target type
    link_config:
      resolver: direct                  # Same sample IDs in both DCs

workflows:
  - name: "nextflow-custom-rnaseq"
    engine:
      name: "nextflow"
      version: "24.10.3"
    version: "3.18.0"

    data_location:
      structure: "sequencing-runs"
      locations:
        - "{DATA_ROOT}/rnaseq_studies/cohort_2024"
      runs_regex: "batch_[A-C]"

    data_collections:
      # Quality control metrics
      - data_collection_tag: "qc_summary"
        description: "Aggregated quality control summary across samples"
        config:
          type: "table"
          metatype: "metadata"
          scan:
            mode: "single"
            scan_parameters:
              filename: "qc_reports/multiqc_general_stats.txt"
          dc_specific_properties:
            format: "tsv"
            polars_kwargs:
              separator: "\t"
              has_header: true
            keep_columns:
              - "Sample"
              - "fastqc-total_sequences"
              - "fastqc-percent_duplicates"
              - "fastqc-percent_gc"
              - "fastqc-avg_sequence_length"
              - "fastqc-percent_fails"
            columns_description:
              "Sample": "Sample identifier"
              "fastqc-total_sequences": "Total number of sequences processed by FastQC"
              "fastqc-percent_duplicates": "Percentage of duplicate sequences"
              "fastqc-percent_gc": "Overall GC content percentage"
              "fastqc-avg_sequence_length": "Average read length"
              "fastqc-percent_fails": "Percentage of FastQC modules that failed"

      # Gene expression quantification
      - data_collection_tag: "salmon_gene_tpm"
        description: "Salmon merged gene-level TPM expression matrix"
        config:
          type: "table"
          metatype: "metadata"
          scan:
            mode: "single"
            scan_parameters:
              filename: "salmon/salmon.gene_tpm.melted.tsv"
          dc_specific_properties:
            format: "tsv"
            polars_kwargs:
              separator: "\t"
              has_header: true
            columns_description:
              sample_id: "Sample identifier"
              gene_id: "Gene identifier"
              gene_name: "Gene symbol/name"
              condition: "Experimental condition (e.g., treatment, control)"
              replicate: "Biological replicate identifier"
              tpm: "Transcripts per million (TPM) expression value"
        # Note: Use project-level links (above) for interactive filtering

Pattern 2: Multi-sample Strand-seq (single-cell) Study

name: "Strand-Seq data analysis"
project_type: "advanced"
data_management_platform_project_url: "https://labid.embl.org/core/projects/default/5baa8f07-bd00-46e7-b3cb-ec79d01f6f3c"

# Cross-DC Links for interactive filtering
links:
  - source_dc_id: mosaicatcher_samples_metadata
    source_column: sample
    target_dc_id: mosaicatcher_stats
    target_type: table
    link_config:
      resolver: direct
  - source_dc_id: mosaicatcher_samples_metadata
    source_column: sample
    target_dc_id: ashleys_labels
    target_type: table
    link_config:
      resolver: direct
  - source_dc_id: mosaicatcher_samples_metadata
    source_column: sample
    target_dc_id: sv_calls
    target_type: table
    link_config:
      resolver: direct

workflows:
  - name: "mosaicatcher-pipeline"
    engine:
      name: "snakemake"
    version: "2.3.5"
    catalog:
      name: "smk-wf-catalog"
      url: "https://snakemake.github.io/snakemake-workflow-catalog/docs/workflows/friendsofstrandseq/mosaicatcher-pipeline.html"
    repository_url: "https://github.com/friendsofstrandseq/mosaicatcher-pipeline"
    workflow_tag: "snakemake/mosaicatcher-pipeline"

    data_location:
      structure: "sequencing-runs"
      locations:
        - "/Data/mosaicatcher-pipeline-results/"
      runs_regex: ".*"

    data_collections:
      # MosaiCatcher statistics per cell
      - data_collection_tag: "mosaicatcher_stats"
        description: "Statistics file generated by MosaiCatcher"
        config:
          type: "table"
          metatype: "aggregate"
          scan:
            mode: "recursive"
            scan_parameters:
              regex_config:
                pattern: ".*\\.info_raw"
          dc_specific_properties:
            format: "tsv"
            polars_kwargs:
              skip_rows: 13
              separator: "\t"
              has_header: true
            keep_columns:
              - "sample"
              - "cell"
              - "mapped"
              - "dupl"
              - "pass1"
              - "good"
            columns_description:
              sample: "Sample ID"
              cell: "Cell ID"
              mapped: "Total number of reads seen"
              dupl: "Reads filtered out as PCR duplicates"
              pass1: "Coverage compliant cells (binary)"
              good: "Reads used for counting"

      # Ashleys QC labels
      - data_collection_tag: "ashleys_labels"
        description: "Probabilities generated by ashleys-qc model"
        config:
          type: "table"
          metatype: "aggregate"
          scan:
            mode: "recursive"
            scan_parameters:
              regex_config:
                pattern: ".*cell_selection/labels\\.tsv"
          dc_specific_properties:
            format: "tsv"
            polars_kwargs:
              separator: "\t"
              has_header: true
        # Note: Use project-level links (above) for interactive filtering

      # Structural variant calls
      - data_collection_tag: "sv_calls"
        description: "SV calls generated by MosaiCatcher (stringent)"
        config:
          type: "table"
          metatype: "aggregate"
          scan:
            mode: "recursive"
            scan_parameters:
              regex_config:
                pattern: "stringent_filterTRUE\\.tsv"
          dc_specific_properties:
            format: "tsv"
            polars_kwargs:
              separator: "\t"
              has_header: true
            columns_description:
              sample: "Sample identifier"
              cell: "Single cell identifier"
              chrom: "Chromosome name"
              start: "SV start position"
              end: "SV end position"
              sv_call_name: "Structural variant call identifier"
        # Note: Use project-level links (above) for interactive filtering

      # Sample metadata
      - data_collection_tag: "mosaicatcher_samples_metadata"
        description: "Metadata file for MosaiCatcher samples"
        config:
          type: "table"
          metatype: "metadata"
          scan:
            mode: "single"
            scan_parameters:
              filename: "/Data/mosaicatcher-pipeline-results/metadata/mosaicatcher_samples_metadata_2024.xlsx"
          dc_specific_properties:
            format: "xlsx"
            polars_kwargs:
              has_header: true
            columns_description:
              sample: "Sample identifier"
              patient_id: "Patient identifier"
              tissue_type: "Type of tissue analyzed"
              collection_date: "Date of sample collection"
        # Note: Use project-level links (above) for interactive filtering

Pattern 3: MultiQC Quality Control Integration (v0.5.0+)

name: "MultiQC Quality Control Analysis"
project_type: "advanced"

# Cross-DC Links for interactive filtering
links:
  - source_dc_id: sample_metadata       # Filter from metadata table
    source_column: sample               # Column to filter on
    target_dc_id: multiqc_data          # Target MultiQC visualizations
    target_type: multiqc                # Target type
    link_config:
      resolver: sample_mapping          # Maps canonical IDs to MultiQC sample variants

workflows:
  - name: "qc-pipeline"
    engine:
      name: "python"
    data_location:
      structure: "sequencing-runs"
      runs_regex: "run_*"
      locations:
        - "{DATA_ROOT}/qc_results"

    data_collections:
      # MultiQC data collection - automatically detected
      - data_collection_tag: "multiqc_data"
        description: "MultiQC quality control report data"
        config:
          type: "MultiQC"
          # NOTE: No scan, metatype, or dc_specific_properties needed
          # MultiQC automatically detects multiqc_data/multiqc.parquet in each run
          # Requires MultiQC 1.29+ to generate parquet format

      # Sample metadata for context
      - data_collection_tag: "sample_metadata"
        description: "Sample metadata table for MultiQC integration"
        config:
          type: "Table"
          metatype: "Metadata"
          scan:
            mode: "single"
            scan_parameters:
              filename: "metadata/sample_info.csv"
          dc_specific_properties:
            format: "CSV"
            polars_kwargs:
              separator: ","
              has_header: true
            columns_description:
              "sample": "Sample identifier matching MultiQC sample names"
              "treatment": "Treatment condition applied to the sample"
              "batch": "Batch identifier for experimental runs"
          # Note: Use project-level links (above) for interactive filtering
          # instead of join configuration

      # Additional per-sample QC metrics
      - data_collection_tag: "sample_qc_metrics"
        description: "Per-sample quality control metrics and statistics"
        config:
          type: "Table"
          metatype: "Aggregate"
          scan:
            mode: "recursive"
            scan_parameters:
              regex_config:
                pattern: "qc_metrics/.*\\.csv"
          dc_specific_properties:
            format: "CSV"
            polars_kwargs:
              separator: ","
              has_header: true
            columns_description:
              "sample": "Sample identifier"
              "total_reads": "Total number of reads"
              "mapped_reads": "Number of successfully mapped reads"
              "mapping_rate": "Percentage of reads mapped to reference"
          # Note: Use project-level links (above) for interactive filtering

Key Points for MultiQC Integration:

  • Automatic Detection: MultiQC data is automatically detected from multiqc_data/multiqc.parquet in each run
  • Minimal Configuration: Only requires type: "MultiQC" - no scan parameters needed
  • Format Requirement: Requires MultiQC 1.29+ to generate parquet output format
  • Cross-DC Filtering: Use project-level links with sample_mapping resolver to connect metadata tables with MultiQC visualizations
  • Location: Each run must contain a multiqc_data/multiqc.parquet file generated by MultiQC

🔍 Validation

CLI Validation Commands

# Validate configuration file syntax and structure
depictio-cli config validate-project-config \
  --project-config-path ./my_project.yaml \
  --verbose

# Check server connectivity and permissions
depictio-cli config check-server-accessibility

# Dry-run mode: validate without processing data
depictio-cli run --project-config-path ./my_project.yaml \
  --dry-run --verbose

# Test file discovery patterns
depictio-cli data scan --project-config-path ./my_project.yaml \
  --verbose --verbose-level DEBUG

📖 Additional Resources


This reference covers all configuration options available in Depictio. Start with the Quick Start examples and gradually add complexity as needed for your specific use case.