πŸ“Š Depictio Project Types: Choose Your Data Strategy

From simple CSV files to complex bioinformatics pipelines - discover which project type fits your needs.

🎬 Projects Management Overview: Discover how Depictio's project types organize your data workflow - from simple file uploads to complex bioinformatics pipelines

General features of Depictio Projects

Projects are the top-level entity in Depictio, designed to help you organize your data and structure your analysis. Each project can contain multiple workflows and data collections (advanced type), which are essentially groups of related data files relevant to your analysis. Users can then create interactive dashboards from these data collections, allowing for flexible and powerful data visualization.

General features of Depictio projects include:

  • Data Organization: Projects help you structure your data, making it easier to manage and analyze.
  • Workflow Integration (advanced): Projects can be linked to specific workflows, allowing you to track and manage your data processing pipelines.
  • Interactive Dashboards: Create visualizations and dashboards based on the data within your projects.
  • Role Management: Control access to projects, related data, and resulting dashboards through user roles and permissions. Each user can be an Owner, an Editor, or a Viewer of a project.

🎯 Two Approaches, Many Possibilities

Depictio offers two distinct project types designed for different data scenarios. Think of them as different strategies for organizing your data - each optimized for specific use cases.

Whether you're looking for a Plotly Studio-like experience for immediate analysis of a tabular dataset, or you're a bioinformatician managing complex pipeline outputs, choosing the right project type sets the foundation for effective data exploration.

🏠 Basic Projects: Your Gateway to Interactive Dashboards

Perfect for: Direct data analysis

Basic projects are designed for users who want to quickly visualize and explore tabular data without the overhead of complex configurations. They are ideal for one-off analyses or exploratory data visualization.

Main Features of Basic Projects

  • Start Quickly - Upload files and start visualizing within minutes
  • Minimal configuration required - Works with any tabular data format
  • Perfect for exploration - Ideal when you want to "see what the data tells you"

🎬 Basic Project Creation: Watch how to create a basic project from scratch - upload data, configure settings, and start visualizing in minutes

🧬 Advanced Projects: Suited for Bioinformatics Workflows

Perfect for: Standardized workflow outputs, multi-sample studies

Advanced projects are tailored for bioinformatics workflows executed in core-facility-like setups. Users can discover and organize files generated by complex pipelines such as nf-core, Snakemake, or Nextflow in a structured, centralized manner, and create interactive dashboards that aggregate data across multiple samples, timepoints, or experimental conditions.

Main Features of Advanced Projects compared to Basic Projects

  • Sequencing run data organization - Automatically finds and organizes files based on naming conventions and directory structures; perfect for core facilities managing recurrent sequencing projects that rely on standardized processing pipelines.
  • Multi-sample analysis - Handles large datasets with hundreds of samples, automatically aggregating results across multiple runs or timepoints.
  • Combine multiple data collections - Joins different data types (e.g., gene expression, variant calls) into unified dashboards. Data is ingested without modifying the original files at the workflow level; any post-processing happens inside Depictio.

πŸ“‹ Quick Decision Guide

Choose Basic when:

βœ… You have a limited number of files ready to analyze (CSV, Excel, Parquet)
βœ… One-time analysis or ad-hoc exploration
βœ… Manual data preparation is acceptable
βœ… Quick insights are the primary goal

Choose Advanced when:

βœ… Automated pipeline generates your data (nf-core, Snakemake, etc.)
βœ… Standardized file organization and naming conventions exist
βœ… You need to aggregate data across multiple samples or runs
βœ… Regular data updates are expected

🏭 Getting Started Paths

Basic Project Quickstart

Through the Web Interface:

  1. Visit demo.depictio.embl.org
  2. Click "Create Project" β†’ Choose "Basic" and fill in project details
  3. Navigate to your newly created project and click "Create Data Collection"
  4. Fill in the details and upload your tabular file as a data collection
  5. Go to the "Dashboards" tab
  6. Start creating dashboard components based on your project's data

Common File Formats Supported:

  • CSV files (most common) & TSV files (tab-separated values)
  • Excel spreadsheets (.xlsx, .xls)
  • Parquet files (efficient for large datasets)
  • Feather files (.feather)
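
As a quick illustration, a file as simple as the following is enough to start a basic project (columns and values are hypothetical):

sample_id,measurement,condition
s1,42.5,control
s2,38.1,treated
s3,40.9,treated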

Advanced Project Quickstart

Project Structure Setup: Advanced projects require a YAML configuration file that describes your data organization patterns. This tells Depictio how to automatically discover and organize your files.

Example Study Structure:

study_directory/
β”œβ”€β”€ depictio_project.yaml    # Depictio configuration
β”œβ”€β”€ run_001/                 # First batch of samples - run 001
β”‚   β”œβ”€β”€ sample_A/
β”‚   β”‚   β”œβ”€β”€ stats/
β”‚   β”‚   β”‚   └── sample_A_stats.tsv # Statistics for sample A
β”‚   β”‚   └── analysis_results/
β”‚   β”‚       └── sample_A_analysis.tsv # Analysis results for sample A
β”‚   └── sample_B/
β”‚       β”œβ”€β”€ stats/
β”‚       β”‚   └── sample_B_stats.tsv # Statistics for sample B
β”‚       └── analysis_results/
β”‚           └── sample_B_analysis.tsv
└── run_002/                 # Second batch of samples - run 002
    β”œβ”€β”€ sample_C/
    β”‚   β”œβ”€β”€ stats/
    β”‚   β”‚   └── sample_C_stats.tsv
    β”‚   └── analysis_results/
    β”‚       └── sample_C_analysis.tsv
    └── sample_D/
        β”œβ”€β”€ stats/
        β”‚   └── sample_D_stats.tsv
        └── analysis_results/
            └── sample_D_analysis.tsv

The configuration file below describes patterns for finding and organizing these files automatically.

Example depictio_project.yaml:

# =============================================================================
# DEPICTIO PROJECT CONFIGURATION
# Complete configuration for the study structure shown above
# =============================================================================


# Project identification - displayed in Depictio web interface
name: "My Bioinformatics Study"

# Description of the project
description: "A comprehensive study of multiple samples with detailed statistics and analysis results"

# Project type
# This is an advanced project with structured data discovery
type: "advanced"

# Data management platform project URL
# Optional: link to external project management system
data_management_platform_project_url: "https://example.com/project/my-bioinformatics-study"

# Public/private visibility
# Set to true if you want this project to be publicly accessible
is_public: false

# =============================================================================
# WORKFLOW DEFINITIONS
# Define the pipelines that generated your data
# =============================================================================
workflows:
  - name: "bioinformatics_pipeline"

    # Engine that executed the workflow
    engine:
      name: "nextflow"           # Workflow management system used
      version: "24.10.3"         # Version for reproducibility

    description: "Multi-sample bioinformatics analysis pipeline"
    version: "3.1"               # Your pipeline version

    # =============================================================================
    # DATA DISCOVERY CONFIGURATION  
    # Tell Depictio where to find your data and how it's organized
    # =============================================================================
    config:
      # Where your workflow output directories are located
      parent_runs_location:
        # Environment variables (like {DATA_LOCATION}) are resolved at runtime
        # This allows flexible deployment across different systems
        - "{DATA_LOCATION}/study_directory"

      # Regular expression to identify run directories
      # "run_.*" matches: run_001, run_002, run_abc, etc.
      runs_regex: "run_.*"

      # =============================================================================
      # DATA COLLECTIONS
      # Define the different types of data files to be ingested
      # =============================================================================
      data_collections:

        # COLLECTION 1: Sample Statistics
        # ========================================
        - data_collection_tag: "sample_stats"
          description: "Statistics for each sample"

          config:
            # Table = tabular data (CSV, TSV, Excel, etc.)
            # Future alternatives: JBrowse2, GeoJSON, etc.
            type: "Table"

            # Aggregate = combine multiple files into one dataset
            # Alternative: Metadata = single file per run
            metatype: "Aggregate"

            # File discovery settings
            scan:
              # recursive = search through subdirectories
              # single = look for one specific file per run
              mode: "recursive"

              scan_parameters:
                regex_config:
                  # Find all files matching this pattern within each run directory
                  # Example matches: run_001/sample_A/stats/sample_A_stats.tsv
                  #                 run_001/sample_B/stats/sample_B_stats.tsv
                  pattern: "stats/.*_stats.tsv"

            # Data processing configuration specific to type Table
            dc_specific_properties:
              format: "TSV"        # Tab-separated values

              # Polars DataFrame configuration (high-performance data processing)
              polars_kwargs:
                separator: "\t"    # Tab separator
                has_header: true   # First row contains column names
                # Other options: skip_rows, column_types, etc.

              # Only keep these columns (improves performance and reduces memory)
              keep_columns:
                - "sample_id"      # Links samples across datasets
                - "total_reads"    # Sequencing depth metric
                - "mapped_reads"   # Alignment quality metric  
                - "quality_score"  # Overall sample quality

              # Human-readable descriptions for dashboard tooltips
              columns_description:
                sample_id: "Unique sample identifier"
                total_reads: "Total number of sequencing reads"
                mapped_reads: "Successfully aligned reads" 
                quality_score: "Overall sample quality metric"

        # COLLECTION 2: Analysis Results  
        # ========================================
        - data_collection_tag: "analysis_results"
          description: "Analysis results for each sample"

          config:
            type: "Table"
            metatype: "Aggregate"

            scan:
              mode: "recursive"
              scan_parameters:
                regex_config:
                  # Find analysis result files in each sample directory
                  # Example: run_001/sample_A/analysis_results/sample_A_analysis.tsv
                  pattern: "analysis_results/.*_analysis.tsv"

            dc_specific_properties:
              format: "TSV"
              polars_kwargs:
                separator: "\t"
                has_header: true

              keep_columns:
                - "sample_id"        # Join key for linking datasets
                - "gene_expression"  # Expression analysis results
                - "variant_count"    # Variant calling results
                - "pathway_enrichment"  # Functional analysis results

          # =============================================================================
          # DATA JOINING
          # Combine this collection with others for integrated analysis
          # =============================================================================
          join:
            # Columns used to match records between datasets
            on_columns:
              - "sample_id"        # Common identifier across collections

            # Type of join (inner = only samples present in both datasets)
            # Options: inner, left, right, outer
            how: "inner"

            # Other collections to join with
            with_dc:
              - "sample_stats"     # Combine analysis with quality metrics

πŸ”§ Technical Deep Dive

CLI Workflow Commands

Once you have your configuration file, use the CLI to process your project:

# Complete workflow: validate β†’ sync β†’ scan β†’ process
depictio-cli run --project-config-path ./depictio_project.yaml

This single command handles the entire pipeline automatically. For detailed CLI usage and individual step commands, see the CLI documentation.
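
Because the example configuration references {DATA_LOCATION}, that variable needs to be set in the environment where the CLI runs. A minimal sketch, assuming the variable is resolved from your shell and using an illustrative path:

export DATA_LOCATION=/path/to/data
depictio-cli run --project-config-path ./depictio_project.yaml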

Data Processing Pipeline using depictio-cli

Using the run command, the CLI executes this pipeline for advanced projects:

  1. βœ… Server Check - Verify connection to Depictio backend
  2. βœ… S3 Storage Check - Validate cloud storage configuration
  3. βœ… Config Validation - Ensure YAML structure is correct
  4. βœ… Config Sync - Register project with server
  5. βœ… File Scan - Discover files matching patterns
  6. βœ… Data Process - Convert files to Delta Lake format for dashboarding

Each step can be skipped with flags like --skip-scan or --skip-process for debugging.
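
For example, to re-run data processing without rescanning files (reusing the configuration from above):

# Skip file discovery and only re-run the remaining steps
depictio-cli run --project-config-path ./depictio_project.yaml --skip-scan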

🎬 `depictio-cli run` command example

File Discovery Patterns

Depictio supports two main scanning modes that adapt to different data organization structures:

Single File Collection:

This mode is usually suitable for metadata files or summary statistics that are generated once per run.

scan:
  mode: "single" 
  scan_parameters:
    filename: "multiqc_data/multiqc_general_stats.txt"

Finds one specific file per run directory, suited to cases where a single summary file exists per run.
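
Given the example study layout above, a single-mode scan like this would resolve to one path per run, for instance (assuming each run produced a MultiQC report):

run_001/multiqc_data/multiqc_general_stats.txt
run_002/multiqc_data/multiqc_general_stats.txt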

Recursive File Collection:

scan:
  mode: "recursive"
  scan_parameters:
    regex_config:
      pattern: "star_salmon/.*/quant.sf"

Uses regex patterns to find files at any depth in the directory structure.
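
For instance, with an nf-core/rnaseq-style output layout (sample names are illustrative), this pattern would match files such as:

run_001/star_salmon/sample_A/quant.sf
run_001/star_salmon/sample_B/quant.sf
run_002/star_salmon/sample_C/quant.sf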

πŸ“Š Project Types Comparison

Choosing the right project type is crucial for your data analysis success. Here's a comprehensive comparison to help you decide:

Feature              | Basic Projects                    | Advanced Projects
---------------------|-----------------------------------|---------------------------------------------
Setup Complexity     | Minimal - Web UI or CLI           | CLI with YAML config required
Data Compatibility   | Simple tabular data (CSV, Excel)  | Complex bioinformatics workflows
Multi-sample Support | Limited to single datasets        | Designed for hundreds of samples
Data Processing      | Direct conversion to Delta table  | Aggregation & joining capabilities
Best For             | Quick analysis                    | Production workflows, core facilities
Learning Curve       | Immediate - no learning required  | Moderate - requires YAML reference knowledge
Scalability          | Small to medium datasets          | Large-scale, multi-run studies

πŸ’‘ Key Takeaway

Basic Projects excel at getting you from data to insights quickly, perfect for exploratory analysis and presentations. Advanced Projects are useful when you need systematic, reproducible data management for complex, multi-sample studies with standardized workflows.

Both project types deliver the same rich, interactive dashboard experience - the difference lies in how your data is ingested and processed by the system.

πŸ—ΊοΈ What's Next?

Now that you understand project types, you're ready to create your first interactive dashboard.


Thomas Weber
August 2025