Scalable, Config-Driven Data Ingestion Across Any Orchestrator

Streamline data ingestion across your platform with Datilis' Ingestion as a Service: a scalable framework built on Spark and Scala that provides configuration-driven pipelines, data validation during ingestion, and integration with any orchestrator.

A Scalable, Orchestrator-Agnostic Data Ingestion Framework Built on Spark & Scala

Introduction

Data ingestion is often the most underestimated layer in modern data platforms. While organizations invest heavily in analytics, AI, and dashboards, they frequently rely on fragmented, hardcoded ingestion pipelines that are difficult to scale, maintain, and standardize.

At Datilis, we address this challenge with Ingestion as a Service, a Spark- and Scala-based ingestion framework designed to be reusable, configurable, and fully decoupled from orchestration tools.

This approach transforms ingestion from a collection of pipelines into a standardized, scalable platform capability.

The Problem: Fragmented and Rigid Ingestion Pipelines

Most organizations face common ingestion challenges:

  • Pipelines tightly coupled to specific orchestrators
  • Hardcoded logic for each data source
  • Limited reusability across teams
  • Difficult environment management (dev, test, prod)
  • No built-in data quality validation during ingestion
  • Scaling challenges as data volume grows

The result is:

  • Increased development effort
  • Inconsistent ingestion patterns
  • Higher operational risk
  • Slower onboarding of new data sources

The Datilis Approach: Ingestion as a Service

Our framework introduces a modular ingestion architecture built around three core principles:

1. Configuration-Driven Ingestion

Instead of writing pipelines per dataset, ingestion is defined using YAML-based configuration:

  • Source type (JDBC, Kafka, SFTP, etc.)
  • Target system (e.g., BigQuery, Hive)
  • Load strategy (full, partitioned, incremental)
  • Execution mode (cluster or local)
  • Data quality checks
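
As an illustration, a dataset definition covering the fields above might look like the following once the YAML has been loaded. This is a minimal sketch, not the Datilis schema: the key names, the allowed values, and the IngestionConfig and parse_config names are assumptions made for the example. The YAML is shown in a comment and its parsed form is written as a plain dictionary so the snippet runs without a YAML library.

```python
from dataclasses import dataclass

# Hypothetical YAML for one dataset (all key names are assumptions):
#
#   name: orders
#   source: {type: jdbc}
#   target: {type: bigquery}
#   load_strategy: incremental
#   execution_mode: cluster
#   quality_checks: ["not_null:order_id"]

# Assumed allowed values, taken from the options listed in the text.
ALLOWED_SOURCES = {"jdbc", "kafka", "sftp"}
ALLOWED_STRATEGIES = {"full", "partitioned", "incremental"}
ALLOWED_MODES = {"cluster", "local"}


@dataclass(frozen=True)
class IngestionConfig:
    name: str
    source_type: str
    target_type: str
    load_strategy: str
    execution_mode: str
    quality_checks: tuple

    def __post_init__(self):
        # Fail fast on values the engine would not understand.
        if self.source_type not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source type: {self.source_type}")
        if self.load_strategy not in ALLOWED_STRATEGIES:
            raise ValueError(f"unknown load strategy: {self.load_strategy}")
        if self.execution_mode not in ALLOWED_MODES:
            raise ValueError(f"unknown execution mode: {self.execution_mode}")


def parse_config(raw: dict) -> IngestionConfig:
    """Turn the dictionary produced by a YAML loader into a validated config."""
    return IngestionConfig(
        name=raw["name"],
        source_type=raw["source"]["type"].lower(),
        target_type=raw["target"]["type"].lower(),
        load_strategy=raw.get("load_strategy", "full"),
        execution_mode=raw.get("execution_mode", "cluster"),
        quality_checks=tuple(raw.get("quality_checks", [])),
    )


# The dictionary a YAML loader would return for the file sketched above.
raw = {
    "name": "orders",
    "source": {"type": "jdbc"},
    "target": {"type": "bigquery"},
    "load_strategy": "incremental",
    "execution_mode": "cluster",
    "quality_checks": ["not_null:order_id"],
}
config = parse_config(raw)
print(config.name, config.source_type, config.load_strategy)
```

Validating at load time means a misspelled strategy or source type is rejected before any Spark job is submitted.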

This enables:

  • Rapid onboarding of new datasets
  • Consistent ingestion patterns
  • Reduced development effort

2. Decoupled Execution Layer (Spark & Scala)

At the core of the framework is a Spark-based ingestion engine implemented in Scala:

  • Handles large-scale distributed ingestion
  • Supports multiple ingestion patterns
  • Optimized for performance and scalability
  • Reusable across all ingestion use cases

Most importantly:

The ingestion engine is completely independent of orchestration tools.

It can be triggered by:

  • Dagster
  • Airflow
  • Oozie
  • Any scheduler or event-driven system
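
One way to picture this independence: if the engine is packaged as a Spark application, every orchestrator only needs to run the same spark-submit command. The sketch below builds that command from an execution mode. The jar path, the main class name, the --config application argument, and the choice of YARN as cluster manager are all hypothetical; only the spark-submit options --class, --master, and --deploy-mode are standard.

```python
import shlex


def build_spark_submit(config_path: str, execution_mode: str = "cluster") -> list:
    """Return the argument list for launching the (hypothetical) ingestion jar."""
    if execution_mode == "cluster":
        # Assumption: the cluster is managed by YARN.
        master, deploy_mode = "yarn", "cluster"
    elif execution_mode == "local":
        master, deploy_mode = "local[*]", "client"
    else:
        raise ValueError(f"unknown execution mode: {execution_mode}")
    return [
        "spark-submit",
        "--class", "com.example.ingestion.Main",   # hypothetical main class
        "--master", master,
        "--deploy-mode", deploy_mode,
        "/opt/ingestion/ingestion-engine.jar",     # hypothetical jar location
        "--config", config_path,                   # hypothetical application argument
    ]


command = build_spark_submit("configs/orders.yaml", "local")
# Print a shell-safe version of the command for inspection.
print(shlex.join(command))
```

Because the result is an ordinary command line, an Airflow task, a Dagster asset, or an Oozie shell action could each launch the same engine without any change to the engine itself.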

3. Orchestrator Integration (Flexible & Pluggable)

Using an asset-factory approach for Dagster or Airflow, ingestion jobs are created dynamically from configuration:

  • YAML → Airflow/Dagster assets
  • Automatic context preparation
  • Standardized execution patterns

This allows:

  • Native integration with orchestration tools
  • Consistent pipeline definitions
  • Simplified scheduling and dependency management
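
The factory idea can be sketched without any orchestrator installed: one function turns each configuration entry into a named, callable job. This is an illustration under assumptions, not the Datilis implementation. The names make_ingestion_job, build_jobs, and fake_runner are hypothetical, and the runner stands in for whatever actually submits the Spark job.

```python
from typing import Callable, Dict, List


def make_ingestion_job(config: dict, runner: Callable[[dict], str]) -> Callable[[], str]:
    """Create one job function for one dataset configuration."""

    def job() -> str:
        # "Automatic context preparation": derive the run context from config.
        context = {
            "dataset": config["name"],
            "load_strategy": config.get("load_strategy", "full"),
        }
        return runner(context)

    # Give every generated job a predictable, unique name.
    job.__name__ = f"ingest_{config['name']}"
    return job


def build_jobs(configs: List[dict], runner: Callable[[dict], str]) -> Dict[str, Callable[[], str]]:
    """Translate a list of configurations into a name -> job mapping."""
    jobs = {}
    for config in configs:
        job = make_ingestion_job(config, runner)
        jobs[job.__name__] = job
    return jobs


def fake_runner(context: dict) -> str:
    # Stand-in for submitting the Spark job; it only reports what it would run.
    return f"ran {context['dataset']} ({context['load_strategy']})"


configs = [
    {"name": "orders", "load_strategy": "incremental"},
    {"name": "customers"},
]
jobs = build_jobs(configs, fake_runner)
print(sorted(jobs))              # ['ingest_customers', 'ingest_orders']
print(jobs["ingest_orders"]())   # ran orders (incremental)
```

In a real deployment, each generated function would be registered with the orchestrator, for example wrapped as a Dagster asset or an Airflow task, so that adding a dataset means adding a configuration entry rather than writing new pipeline code.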

Architecture Overview

The ingestion framework operates as a layered pipeline:

Step 1: Configuration Layer
  • YAML files define ingestion behavior
  • Supports environment separation (dev, test, prod)
  • Enables CI/CD-driven deployment

Step 2: Asset Generation Layer
  • Airflow/Dagster asset factory creates ingestion jobs
  • Translates configuration into executable pipelines

Step 3: Execution Layer (Spark)
  • Distributed ingestion processing
  • Handles large-scale data movement
  • Supports multiple load strategies

Step 4: Data Platform Integration
  • Data lands in target systems (e.g., BigQuery, Hive)
  • Immediately available for transformation (dbt)

Step 5: Transformation & Validation
  • dbt transformations applied
  • Data quality checks executed during and after ingestion

Key Capabilities

Multi-Source Ingestion

Supports:

  • JDBC databases
  • Kafka streams
  • SFTP/file-based ingestion

Flexible Load Strategies
  • Full load
  • Partition-based ingestion
  • Incremental ingestion
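
To make the difference between the three strategies concrete, the sketch below shows how each one might narrow the query sent to a JDBC source. It rests on assumptions: that a partitioned load selects a single partition value and that an incremental load uses a watermark column. The function name and the table and column names are hypothetical.

```python
from typing import Optional

STRATEGIES = {"full", "partitioned", "incremental"}


def build_source_query(
    table: str,
    strategy: str,
    column: Optional[str] = None,
    value: Optional[str] = None,
) -> str:
    """Return the SELECT statement a given load strategy would issue."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown load strategy: {strategy}")
    base = f"SELECT * FROM {table}"
    if strategy == "full":
        # Full load: read the whole table on every run.
        return base
    if column is None or value is None:
        raise ValueError(f"strategy '{strategy}' needs a column and a value")
    if strategy == "partitioned":
        # Partitioned load: read exactly one partition, e.g. one day.
        return f"{base} WHERE {column} = '{value}'"
    # Incremental load: read only rows newer than the last stored watermark.
    return f"{base} WHERE {column} > '{value}'"


print(build_source_query("sales.orders", "full"))
print(build_source_query("sales.orders", "partitioned", "order_date", "2024-01-01"))
print(build_source_query("sales.orders", "incremental", "updated_at", "2024-01-01 00:00:00"))
```

String interpolation keeps the example short; production code would use bound parameters or validated identifiers instead of building SQL from raw strings.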

Built-In Data Quality Checks

Data validation is executed during ingestion, ensuring:

  • Early detection of issues
  • Reduced downstream failures
  • Higher data reliability
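
As a rough picture of what a check during ingestion involves, the sketch below runs two common rules, not-null and uniqueness, over a small batch of rows before they would be written. It uses plain Python lists in place of Spark DataFrames, and the check names and the validate function are assumptions chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Check = Tuple[str, Callable[[List[Row]], List[int]]]


def not_null(column: str) -> Check:
    """Flag the indexes of rows where the column is missing or None."""
    def check(rows: List[Row]) -> List[int]:
        return [i for i, row in enumerate(rows) if row.get(column) is None]
    return f"not_null({column})", check


def unique(column: str) -> Check:
    """Flag the indexes of rows that repeat an earlier value of the column."""
    def check(rows: List[Row]) -> List[int]:
        seen, duplicates = set(), []
        for i, row in enumerate(rows):
            value = row.get(column)
            if value in seen:
                duplicates.append(i)
            seen.add(value)
        return duplicates
    return f"unique({column})", check


def validate(rows: List[Row], checks: List[Check]) -> Dict[str, List[int]]:
    """Run every check and keep only those that found failing rows."""
    failures = {}
    for name, check in checks:
        failing = check(rows)
        if failing:
            failures[name] = failing
    return failures


rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 5.0},
]
checks = [not_null("order_id"), not_null("amount"), unique("order_id")]
print(validate(rows, checks))
# {'not_null(amount)': [1], 'unique(order_id)': [2]}
```

An ingestion run could then fail, quarantine the flagged rows, or only log a warning, depending on what the dataset's configuration asks for.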

Reusable Framework Components
  • Shared ingestion logic
  • Centralized utilities
  • Standardized processing patterns

Environment Isolation
  • Separate dev, test, and production environments
  • Safe deployment and testing
  • CI/CD integration
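
One common way to keep environments separate while sharing a single dataset definition is to layer small per-environment overrides on top of a base configuration. The sketch below assumes that pattern. The keys, dataset names, and the config_for helper are hypothetical and do not describe how Datilis stores its settings.

```python
from copy import deepcopy


def merge(base: dict, override: dict) -> dict:
    """Recursively apply override on top of base without mutating either."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Shared definition (hypothetical keys and values).
BASE = {
    "name": "orders",
    "execution_mode": "cluster",
    "target": {"type": "bigquery", "dataset": "staging"},
}

# Only the settings that differ per environment.
OVERRIDES = {
    "dev": {"execution_mode": "local", "target": {"dataset": "staging_dev"}},
    "test": {"target": {"dataset": "staging_test"}},
    "prod": {},
}


def config_for(env: str) -> dict:
    if env not in OVERRIDES:
        raise ValueError(f"unknown environment: {env}")
    return merge(BASE, OVERRIDES[env])


dev = config_for("dev")
print(dev["execution_mode"], dev["target"])
# local {'type': 'bigquery', 'dataset': 'staging_dev'}
```

Keeping overrides small makes differences between environments easy to review in a pull request, which fits the CI/CD-driven deployment described above.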

Integration with Data Platforms

The framework integrates seamlessly into modern data platforms:

  • Data is ingested into staging layers
  • dbt transformations build downstream models
  • Data becomes immediately usable for analytics and AI

This creates a continuous flow from ingestion to insight.

Business Benefits

Organizations adopting Ingestion as a Service achieve:

  • 70–80% reduction in ingestion development effort
  • Faster onboarding of new data sources
  • Consistent and standardized ingestion patterns
  • Improved data quality from the start
  • Reduced operational complexity
  • Scalability across growing data volumes

Strategic Value

Ingestion is not just a technical step—it is the foundation of your data platform.

By productizing ingestion, organizations gain:

  • A reusable ingestion capability
  • Faster time-to-data
  • Improved governance and control
  • A strong foundation for analytics and AI

Why Datilis

Datilis brings together:

  • Deep expertise in data engineering and distributed systems
  • Proven frameworks built on Spark, Scala, dbt, and modern orchestration tools
  • A strong focus on standardization and reusability
  • A commitment to delivering business value, not just pipelines

Conclusion

Ingestion as a Service transforms ingestion from a bottleneck into a scalable, reusable platform capability.

By combining:

  • Configuration-driven design
  • Distributed Spark execution
  • Orchestrator flexibility
  • Built-in data quality

Datilis enables organizations to build robust, future-proof data platforms.

Next Steps

  • Identify key ingestion bottlenecks
  • Standardize ingestion patterns across teams
  • Launch a pilot with 1–2 data sources

Contact Datilis to implement your ingestion framework.