Recipes¶
Recipes are the data transformation layer of the Depictio CLI. They convert raw bioinformatics pipeline output files into clean, dashboard-ready Polars DataFrames — automatically, reproducibly, and with full validation.
Figure: pipeline output from nf-core / Nextflow / Snakemake (wide CSV, nested TSV, non-standard headers) → recipe (reformat & reshape, wide → long, rename columns, compute metrics) → dashboard-ready table (long format, clean schema, validated types).
What is a Recipe?¶
A recipe is a plain Python module that describes how to transform one or more raw files into a single tidy DataFrame. Each recipe lives in depictio/projects/<pipeline>/recipes/ and declares:
- SOURCES: input files to read (paths relative to --data-dir, or references to other data collections via dc_ref)
- EXPECTED_SCHEMA: required output columns and their Polars data types
- OPTIONAL_SCHEMA (optional): columns that may or may not be present (e.g. user-defined metadata columns)
- transform(sources): a pure function that takes loaded DataFrames and returns the output DataFrame
Recipes are used in two ways:
- Standalone testing: depictio recipe run executes a recipe locally against a data directory, so you can validate your data before registering a project.
- Project integration: a data_collection with source: "transformed" tells the CLI to run the recipe during depictio run, storing the result as a Delta Lake table.
The 4-Checkpoint Validation Pipeline¶
Every recipe execution — whether via depictio recipe run or depictio run — runs through four automatic checkpoints:
| # | Checkpoint | What it checks |
|---|---|---|
| 1 | Load | Import the recipe module; verify SOURCES, EXPECTED_SCHEMA, and a callable transform() exist |
| 2 | Resolve | Find each file under --data-dir; skip optional sources gracefully; fail fast if required files are missing |
| 3 | Transform | Call transform(sources), verify it returns a non-empty pl.DataFrame |
| 4 | Schema | Assert every column in EXPECTED_SCHEMA is present with the correct dtype; validate OPTIONAL_SCHEMA columns if present |
If any checkpoint fails, execution stops with a clear error message pointing to the exact problem.
CLI Commands¶
depictio recipe list¶
List all bundled recipes.
Output:
Available recipes (6):
nf-core/ampliseq/alpha_diversity.py
nf-core/ampliseq/alpha_rarefaction.py
nf-core/ampliseq/ancombc.py
nf-core/ampliseq/taxonomy_composition.py
nf-core/ampliseq/taxonomy_heatmap.py
nf-core/ampliseq/taxonomy_rel_abundance.py
depictio recipe info <name>¶
Show recipe details: docstring, sources, and expected output schema. Pass --version to inspect a version-specific override.
Output:
Recipe: nf-core/ampliseq/alpha_diversity.py
Description: Transform QIIME2 alpha diversity vector to per-sample Faith PD table.
Sources (1):
faith_pd: qiime2/diversity/alpha_diversity/faith_pd_vector/metadata.tsv (TSV)
Expected output schema (2 columns):
sample: Utf8
faith_pd: Float64
Optional schema: {} (metadata columns passed through dynamically)
depictio recipe run <name>¶
Execute a recipe against local data with all 4 validation checkpoints.
Options:
| Flag | Short | Default | Description |
|---|---|---|---|
| --data-dir | -d | required | Root directory with workflow output files |
| --version | -v | null | Pipeline version for version-specific recipe (e.g. 2.14.0) |
| --output | -o | null | Save result to .parquet or .csv file |
| --head | -n | 20 | Number of rows to display |
dc_ref sources
Recipes that reference another data collection via dc_ref (e.g. taxonomy_rel_abundance.py) cannot be fully executed standalone. The CLI will report which sources are skipped and exit with code 0. These sources are resolved automatically during depictio run when all data collections are available.
Recipe Anatomy: Ampliseq Examples¶
The six bundled ampliseq recipes cover the full spectrum of transformation patterns. Study them as templates for writing your own recipes.
Example 1 — Simple file transformation (alpha_diversity.py)¶
Pattern: Filter comment rows, rename columns, cast types. Metadata columns embedded by the pipeline are passed through dynamically.
"""Transform QIIME2 alpha diversity vector to per-sample Faith PD table."""

import polars as pl

from depictio.models.models.transforms import RecipeSource

SOURCES: list[RecipeSource] = [
    RecipeSource(
        ref="faith_pd",
        path="qiime2/diversity/alpha_diversity/faith_pd_vector/metadata.tsv",
        format="TSV",
    ),
]

EXPECTED_SCHEMA: dict[str, type[pl.DataType]] = {
    "sample": pl.Utf8,
    "faith_pd": pl.Float64,
}

# Any metadata columns embedded by QIIME2 (e.g. habitat) are passed through.
OPTIONAL_SCHEMA: dict[str, type[pl.DataType]] = {}


def transform(sources: dict[str, pl.DataFrame]) -> pl.DataFrame:
    df = sources["faith_pd"]
    df = df.filter(~pl.col("id").str.starts_with("#"))
    df = df.rename({"id": "sample"})
    df = df.with_columns(pl.col("faith_pd").cast(pl.Float64))
    return df  # all columns preserved, including any embedded metadata

Example 2 — Wide-to-long reshape (alpha_rarefaction.py)¶
Pattern: Unpivot wide columns into long format with regex extraction.
Input: Wide CSV where columns are depth-500_iter-3, depth-1000_iter-1, etc.
Output: Long-format (sample, depth, iter, faith_pd) — ready for line charts.
"""Transform QIIME2 alpha rarefaction wide CSV to long-format rarefaction curves."""

import polars as pl

from depictio.models.models.transforms import RecipeSource

SOURCES: list[RecipeSource] = [
    RecipeSource(
        ref="faith_pd_csv",
        path="qiime2/alpha-rarefaction/faith_pd.csv",
        format="CSV",
    ),
]

EXPECTED_SCHEMA = {
    "sample": pl.Utf8,
    "depth": pl.Int64,
    "iter": pl.Int64,
    "faith_pd": pl.Float64,
}


def transform(sources: dict[str, pl.DataFrame]) -> pl.DataFrame:
    df = sources["faith_pd_csv"]
    # Only the depth-<N>_iter-<M> columns carry values; everything else is an identifier.
    value_cols = [c for c in df.columns if c.startswith("depth-")]
    df = df.unpivot(
        on=value_cols, index="sample-id", variable_name="depth_iter", value_name="faith_pd"
    )
    df = df.rename({"sample-id": "sample"})
    df = df.with_columns(
        pl.col("depth_iter").str.extract(r"depth-(\d+)", 1).cast(pl.Int64).alias("depth"),
        pl.col("depth_iter").str.extract(r"iter-(\d+)", 1).cast(pl.Int64).alias("iter"),
        pl.col("faith_pd").cast(pl.Float64),
    )
    return df.drop_nulls(subset=["faith_pd"]).select("sample", "depth", "iter", "faith_pd")

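The two regular expressions in this recipe can be tried in isolation. The sketch below uses Python's standard re module instead of Polars (an illustrative substitution; parse_depth_iter is a hypothetical helper, not part of the recipe) to show what each wide column name decomposes into.

```python
import re


def parse_depth_iter(column_name):
    """Split a rarefaction column name such as 'depth-500_iter-3' into (depth, iter)."""
    depth = re.search(r"depth-(\d+)", column_name)
    iteration = re.search(r"iter-(\d+)", column_name)
    # Like str.extract in Polars, a pattern that does not match yields a missing value.
    return (
        int(depth.group(1)) if depth else None,
        int(iteration.group(1)) if iteration else None,
    )


parsed = [parse_depth_iter(c) for c in ["depth-500_iter-3", "depth-1000_iter-1", "sample-id"]]
print(parsed)  # [(500, 3), (1000, 1), (None, None)]
```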
Example 3 — Cross-DC join with optional metadata (taxonomy_rel_abundance.py)¶
Pattern: Reference another data collection via dc_ref, join generically on a shared key. The metadata source is optional — when absent, the recipe produces core columns only.
"""Transform QIIME2 relative abundance table to long-format per-sample taxonomy table."""

import polars as pl

from depictio.models.models.transforms import RecipeSource

SOURCES: list[RecipeSource] = [
    RecipeSource(
        ref="rel_table",
        path="qiime2/rel_abundance_tables/rel-table-2.tsv",
        format="TSV",
        read_kwargs={"skip_rows": 1},
    ),
    RecipeSource(
        ref="metadata",
        dc_ref="metadata",  # references another DC by tag
        optional=True,  # absent when no metadata provided
    ),
]

EXPECTED_SCHEMA = {
    "sample": pl.Utf8,
    "taxonomy": pl.Utf8,
    "rel_abundance": pl.Float64,
    "Kingdom": pl.Utf8,
    "Phylum": pl.Utf8,
}

# Metadata columns are user-defined; validated dynamically
OPTIONAL_SCHEMA: dict[str, type[pl.DataType]] = {}


def transform(sources: dict[str, pl.DataFrame]) -> pl.DataFrame:
    df = sources["rel_table"]
    df = df.rename({"#OTU ID": "taxonomy"})
    sample_cols = [c for c in df.columns if c != "taxonomy"]
    df = df.with_columns(pl.col(sample_cols).cast(pl.Float64))
    df = df.unpivot(
        on=sample_cols, index="taxonomy", variable_name="sample", value_name="rel_abundance"
    )
    df = df.filter(pl.col("rel_abundance").is_not_null() & (pl.col("rel_abundance") > 0))
    df = df.with_columns(
        pl.col("taxonomy").str.split(";").list.get(0).alias("Kingdom"),
        pl.col("taxonomy").str.split(";").list.get(1).fill_null("Unclassified").alias("Phylum"),
    )

    # Join ALL metadata columns generically when metadata is available
    metadata = sources.get("metadata")
    if metadata is not None:
        metadata = metadata.rename({"ID": "sample"})
        df = df.join(metadata, on="sample", how="left")

    # Core columns first, then whatever metadata columns the join added.
    core = ["sample", "taxonomy", "rel_abundance", "Kingdom", "Phylum"]
    extra = [c for c in df.columns if c not in core]
    return df.select(core + extra)

Example 4 — Multi-file merge with source overrides (ancombc.py)¶
Pattern: Merge multiple slices of the same analysis into one long-format table. The recipe declares default source paths, but the template overrides them via source_overrides to parameterize the directory name with {GROUP_COL}.
"""Merge ANCOM-BC differential abundance results (5 files) into one long-format table."""

import polars as pl

from depictio.models.models.transforms import RecipeSource

# Default paths, overridden by template.yaml source_overrides with {GROUP_COL}
SOURCES = [
    RecipeSource(ref="lfc", path="qiime2/ancombc/differentials/Category-habitat-level-2/lfc_slice.csv", format="CSV"),
    RecipeSource(ref="p_val", path="qiime2/ancombc/differentials/Category-habitat-level-2/p_val_slice.csv", format="CSV"),
    RecipeSource(ref="q_val", path="qiime2/ancombc/differentials/Category-habitat-level-2/q_val_slice.csv", format="CSV"),
    RecipeSource(ref="w", path="qiime2/ancombc/differentials/Category-habitat-level-2/w_slice.csv", format="CSV"),
    RecipeSource(ref="se", path="qiime2/ancombc/differentials/Category-habitat-level-2/se_slice.csv", format="CSV"),
]

EXPECTED_SCHEMA = {
    "id": pl.Utf8, "contrast": pl.Utf8,
    "lfc": pl.Float64, "p_val": pl.Float64, "q_val": pl.Float64,
    "w": pl.Float64, "se": pl.Float64,
    "Kingdom": pl.Utf8, "Phylum": pl.Utf8,
    "neg_log10_qval": pl.Float64, "significant": pl.Boolean,
}


def transform(sources: dict[str, pl.DataFrame]) -> pl.DataFrame:
    # Contrast columns are everything except the taxon id and the intercept.
    contrast_cols = [c for c in sources["lfc"].columns if c not in ("id", "(Intercept)")]

    # Unpivot each slice to (id, contrast, <statistic>) so the slices can be joined.
    melted = {
        name: sources[name].select("id", *contrast_cols)
        .unpivot(on=contrast_cols, index="id", variable_name="contrast", value_name=name)
        for name in ["lfc", "p_val", "q_val", "w", "se"]
    }

    result = melted["lfc"]
    for name in ["p_val", "q_val", "w", "se"]:
        result = result.join(melted[name], on=["id", "contrast"], how="left")

    return result.with_columns(
        pl.col("id").str.split(";").list.get(0).alias("Kingdom"),
        pl.col("id").str.split(";").list.get(1).fill_null("Unclassified").alias("Phylum"),
        (-pl.col("q_val").log(base=10)).alias("neg_log10_qval"),
        (pl.col("q_val") < 0.05).alias("significant"),
    ).select("id", "contrast", "lfc", "p_val", "q_val", "w", "se",
             "Kingdom", "Phylum", "neg_log10_qval", "significant")

Source overrides
The Category-habitat-level-2 directory name in the default SOURCES paths is a fallback. In the ampliseq template, these paths are overridden to Category-{GROUP_COL}-level-2/ so they resolve dynamically based on the user's metadata grouping column.
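As a rough illustration of how a {GROUP_COL} placeholder could expand into concrete paths, the sketch below fills the template with Python's str.format. This is an assumption made for illustration: the source does not show how Depictio performs the substitution, and build_overrides is a hypothetical helper.

```python
# Path template taken from the ampliseq override; {slice} is added here
# (hypothetically) so one template can cover all five source refs.
TEMPLATE = "qiime2/ancombc/differentials/Category-{GROUP_COL}-level-2/{slice}_slice.csv"
REFS = ["lfc", "p_val", "q_val", "w", "se"]


def build_overrides(group_col):
    """Return one resolved path per source ref for the given grouping column."""
    return {ref: TEMPLATE.format(GROUP_COL=group_col, slice=ref) for ref in REFS}


overrides = build_overrides("treatment")
print(overrides["lfc"])
# qiime2/ancombc/differentials/Category-treatment-level-2/lfc_slice.csv
```

With group_col set to "habitat", the same helper reproduces the default paths declared in SOURCES, which is why those defaults work as a fallback.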
RecipeSource Reference¶
RecipeSource is the Pydantic model that describes one input to a recipe.
| Field | Type | Required | Description |
|---|---|---|---|
| ref | str | yes | Key in the sources dict passed to transform() |
| path | str | if no dc_ref | File path relative to --data-dir |
| dc_ref | str | if no path | Tag of another data collection to inject (resolved by the API) |
| format | str | yes if path set | CSV, TSV, or Parquet (case-insensitive) |
| read_kwargs | dict | no | Extra kwargs forwarded to the Polars reader (e.g. {"skip_rows": 1}) |
| optional | bool | no | If true, source is skipped when unavailable instead of failing |
| glob_pattern | str | no | Glob pattern for matching multiple files (concatenated) |
Exactly one of path, dc_ref, or glob_pattern must be set per source.
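The "exactly one" rule can be expressed as a small validation step. The sketch below uses a standard-library dataclass named SourceSpec as a hypothetical stand-in; the real RecipeSource is a Pydantic model whose validation code is not shown here, so the error message and structure are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceSpec:
    """Hypothetical stand-in for RecipeSource (not the real Pydantic model)."""

    ref: str
    path: Optional[str] = None
    dc_ref: Optional[str] = None
    glob_pattern: Optional[str] = None
    format: Optional[str] = None
    optional: bool = False

    def __post_init__(self):
        # Count how many of the three mutually exclusive location fields are set.
        locations = [self.path, self.dc_ref, self.glob_pattern]
        set_count = sum(value is not None for value in locations)
        if set_count != 1:
            raise ValueError(
                f"source '{self.ref}': set exactly one of path, dc_ref, glob_pattern "
                f"(got {set_count})"
            )


ok = SourceSpec(ref="metadata", dc_ref="metadata", optional=True)

try:
    SourceSpec(ref="broken", path="a.tsv", dc_ref="metadata")
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```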
glob_pattern (per-sample inputs)¶
Use glob_pattern instead of path to fan multiple per-sample files
into one DataFrame. The glob is expanded by the recipe runner relative
to the project's data_dir (typically --data-root):
SOURCES: list[RecipeSource] = [
    RecipeSource(
        ref="pangolin_raw",
        glob_pattern="variants/ivar/consensus/bcftools/pangolin/*.pangolin.csv",
        format="CSV",
    ),
]
Matched files are read individually and concatenated with
pl.concat([...], how="diagonal_relaxed"). glob_pattern is mutually
exclusive with path and dc_ref.
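To see which files a pattern like the one above would pick up, the sketch below expands it with pathlib against a temporary directory. The helper expand_glob and the sorted ordering are assumptions for illustration; the source does not specify the order in which the recipe runner reads matched files.

```python
import tempfile
from pathlib import Path


def expand_glob(data_dir, glob_pattern):
    """Return files under data_dir matching glob_pattern, in sorted order."""
    return sorted(Path(data_dir).glob(glob_pattern))


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny fake data directory with two per-sample files and one unrelated file.
    pangolin_dir = Path(tmp) / "variants/ivar/consensus/bcftools/pangolin"
    pangolin_dir.mkdir(parents=True)
    for sample in ("sampleB", "sampleA"):
        (pangolin_dir / f"{sample}.pangolin.csv").write_text("taxon,lineage\n")
    (pangolin_dir / "notes.txt").write_text("ignored\n")

    matches = expand_glob(tmp, "variants/ivar/consensus/bcftools/pangolin/*.pangolin.csv")
    matched_names = [p.name for p in matches]

print(matched_names)  # ['sampleA.pangolin.csv', 'sampleB.pangolin.csv']
```

Each matched file would then be read on its own and the resulting frames concatenated, which is the step the diagonal_relaxed strategy handles when per-sample files differ slightly in columns or dtypes.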
Declaring a Recipe in project.yaml¶
To use a recipe in a project, set source: "transformed" on the data collection and specify the recipe name under transform.recipe:
data_collections:
  - data_collection_tag: "alpha_diversity"
    description: "Per-sample alpha diversity"
    config:
      type: "Table"
      source: "transformed"
      transform:
        recipe: "nf-core/ampliseq/alpha_diversity.py"
      dc_specific_properties:
        format: "TSV"
        columns_description:
          "sample": "Sample identifier"
          "faith_pd": "Faith's Phylogenetic Diversity"
Use source_overrides to parameterize recipe source paths via template variables:
  - data_collection_tag: "ancombc_results"
    config:
      type: "Table"
      source: "transformed"
      transform:
        recipe: "nf-core/ampliseq/ancombc.py"
        source_overrides:
          lfc:
            path: "qiime2/ancombc/differentials/Category-{GROUP_COL}-level-2/lfc_slice.csv"
          # ... one override per source ref
The recipe is executed during depictio data process (Step 5 of depictio run). All 4 checkpoints run automatically. If the recipe fails, the data collection is skipped and an error is logged.
Using templates
For nf-core/ampliseq, all six recipes are pre-configured in the bundled template. Use depictio run --template nf-core/ampliseq/2.16.0 --data-root /your/data --var SAMPLESHEET_FILE=samplesheet.csv to set up the complete project without writing any YAML. See Templates.
Recipe File Locations¶
Recipes live inside the depictio/projects/ directory, co-located with the templates that use them:
depictio/projects/
└── nf-core/
    └── ampliseq/
        ├── recipes/                        ← shared recipes (all pipeline versions)
        │   ├── alpha_diversity.py
        │   ├── alpha_rarefaction.py
        │   ├── ancombc.py
        │   ├── taxonomy_composition.py
        │   ├── taxonomy_heatmap.py
        │   └── taxonomy_rel_abundance.py
        ├── 2.14.0/
        │   ├── template.yaml
        │   └── recipes/                    ← version-specific overrides
        │       └── taxonomy_rel_abundance.py
        └── 2.16.0/
            ├── template.yaml               ← no overrides, inherits shared
            └── dashboards/
                ├── base.yaml               ← minimal (no metadata)
                └── full_analysis.yaml      ← metadata-aware with {GROUP_COL}
To add a recipe for a new pipeline, create depictio/projects/{org}/{pipeline}/recipes/{name}.py following the contract: define SOURCES, EXPECTED_SCHEMA, and transform().
Version-Specific Recipes¶
When a pipeline output format changes between versions (column renames, file moves, schema changes), you can add a version-specific override without touching the shared recipe.
Resolution order when running pipeline version 2.14.0:
1. projects/{pipeline}/2.14.0/recipes/{name}.py: checked first (override)
2. projects/{pipeline}/recipes/{name}.py: used if no override exists (shared)
Most recipes are shared across all versions. Only the ones that actually differ need an override.
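The resolution order above amounts to a two-step file lookup. The sketch below is one way to express it with pathlib; resolve_recipe is a hypothetical function written for illustration, not the CLI's actual resolver, and it is exercised against a temporary directory that mimics the ampliseq layout.

```python
import tempfile
from pathlib import Path


def resolve_recipe(projects_root, pipeline, name, version=None):
    """Return the version-specific recipe if present, otherwise the shared one."""
    root = Path(projects_root) / pipeline
    candidates = []
    if version is not None:
        candidates.append(root / version / "recipes" / name)  # override, checked first
    candidates.append(root / "recipes" / name)  # shared fallback
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no recipe '{name}' found for {pipeline} (version={version})")


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "nf-core" / "ampliseq"
    (base / "recipes").mkdir(parents=True)
    (base / "2.14.0" / "recipes").mkdir(parents=True)
    (base / "recipes" / "alpha_diversity.py").write_text("")
    (base / "recipes" / "taxonomy_rel_abundance.py").write_text("")
    (base / "2.14.0" / "recipes" / "taxonomy_rel_abundance.py").write_text("")

    override = resolve_recipe(tmp, "nf-core/ampliseq", "taxonomy_rel_abundance.py", "2.14.0")
    fallback = resolve_recipe(tmp, "nf-core/ampliseq", "alpha_diversity.py", "2.14.0")
    override_rel = override.relative_to(tmp).as_posix()
    fallback_rel = fallback.relative_to(tmp).as_posix()

print(override_rel)  # nf-core/ampliseq/2.14.0/recipes/taxonomy_rel_abundance.py
print(fallback_rel)  # nf-core/ampliseq/recipes/alpha_diversity.py
```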
To test a version-specific recipe standalone:
# Uses shared recipe
depictio recipe run nf-core/ampliseq/taxonomy_rel_abundance.py --data-dir /data/run

# Uses the v2.14.0 override if it exists, falls back to shared otherwise
depictio recipe run nf-core/ampliseq/taxonomy_rel_abundance.py \
    --data-dir /data/run \
    --version 2.14.0
Additional Resources¶
- Templates — pre-packaged project configs that use recipes automatically
- YAML Examples — recipe data collection patterns
- CLI Usage — full CLI command reference