Ingestion as a Service

A Scalable, Orchestrator-Agnostic Data Ingestion Framework Built on Spark & Scala

Introduction

Data ingestion is often the most underestimated layer in modern data platforms. While organizations invest heavily in analytics, AI, and dashboards, they frequently rely on fragmented, hardcoded ingestion pipelines that are difficult to scale, maintain, and standardize.

At Datilis, we address this challenge with Ingestion as a Service (IaaS), a Spark & Scala-based ingestion framework designed to be reusable, configurable, and fully decoupled from orchestration tools.

This approach transforms ingestion from a collection of pipelines into a standardized, scalable platform capability.

The Problem: Fragmented and Rigid Ingestion Pipelines

Most organizations face common ingestion challenges:

Pipelines tightly coupled to specific orchestrators
Hardcoded logic for each data source
Limited reusability across teams
Difficult environment management (dev, test, prod)
No built-in data quality validation during ingestion
Scaling challenges as data volume grows

The result is:

Increased development effort
Inconsistent ingestion patterns
Higher operational risk
Slower onboarding of new data sources

The Datilis Approach: Ingestion as a Service

Our framework introduces a modular ingestion architecture built around three core principles:

1. Configuration-Driven Ingestion

Instead of writing pipelines per dataset, ingestion is defined using YAML-based configuration:

Source type (JDBC, Kafka, SFTP, etc.)
Target system (e.g., BigQuery, Hive)
Load strategy (full, partitioned, incremental)
Execution mode (cluster or local)
Data quality checks

This enables:

Rapid onboarding of new datasets
Consistent ingestion patterns
Reduced development effort
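
To make the configuration idea concrete, here is a minimal sketch of how one dataset definition might be validated once its YAML has been parsed into a dictionary. The key names (name, source, target, load_strategy, execution_mode, quality_checks) and the IngestionConfig class are assumptions made for illustration, not the framework's actual schema.

```python
from dataclasses import dataclass

# Allowed values mirror the options listed above; the key names are assumptions.
SOURCE_TYPES = {"jdbc", "kafka", "sftp"}
LOAD_STRATEGIES = {"full", "partitioned", "incremental"}
EXECUTION_MODES = {"cluster", "local"}


@dataclass(frozen=True)
class IngestionConfig:
    name: str
    source_type: str
    target_system: str
    target_table: str
    load_strategy: str = "full"
    execution_mode: str = "cluster"
    quality_checks: tuple = ()

    def __post_init__(self):
        # Fail fast on values the engine would not know how to handle.
        if self.source_type not in SOURCE_TYPES:
            raise ValueError(f"unsupported source type: {self.source_type!r}")
        if self.load_strategy not in LOAD_STRATEGIES:
            raise ValueError(f"unsupported load strategy: {self.load_strategy!r}")
        if self.execution_mode not in EXECUTION_MODES:
            raise ValueError(f"unsupported execution mode: {self.execution_mode!r}")


def parse_config(raw: dict) -> IngestionConfig:
    """Build a validated config from a dict, such as a parsed YAML document."""
    return IngestionConfig(
        name=raw["name"],
        source_type=raw["source"]["type"],
        target_system=raw["target"]["system"],
        target_table=raw["target"]["table"],
        load_strategy=raw.get("load_strategy", "full"),
        execution_mode=raw.get("execution_mode", "cluster"),
        quality_checks=tuple(raw.get("quality_checks", [])),
    )


# Hypothetical dataset definition, shown as the dict a YAML parser would return.
orders = parse_config({
    "name": "orders",
    "source": {"type": "jdbc"},
    "target": {"system": "bigquery", "table": "staging.orders"},
    "load_strategy": "incremental",
    "quality_checks": [{"type": "not_null", "column": "order_id"}],
})
```

Validating at parse time means a typo in a dataset definition surfaces before any Spark job is launched.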

2. Decoupled Execution Layer (Spark & Scala)

At the core of the framework is a Spark-based ingestion engine implemented in Scala:

Handles large-scale distributed ingestion
Supports multiple ingestion patterns
Optimized for performance and scalability
Reusable across all ingestion use cases

Most importantly:

The ingestion engine is completely independent of orchestration tools.

It can be triggered by:

Dagster
Airflow
Oozie
Any scheduler or event-driven system
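
One way to picture this independence: if the engine is packaged as a Spark application, any scheduler only needs to assemble a spark-submit command. The sketch below assumes a YARN cluster for cluster mode, and the jar path, main class, and the --config-json application argument are hypothetical names chosen for illustration.

```python
import json


def build_spark_submit(config: dict, jar: str, main_class: str) -> list:
    """Assemble a spark-submit command line for one ingestion run."""
    # Assumption: "cluster" mode targets YARN; "local" runs on all local cores.
    master = "local[*]" if config.get("execution_mode") == "local" else "yarn"
    cmd = ["spark-submit", "--class", main_class, "--master", master]
    if master == "yarn":
        cmd += ["--deploy-mode", "cluster"]
    # Everything after the jar is passed to the application itself.
    # "--config-json" is a hypothetical application argument, not a Spark flag.
    cmd += [jar, "--config-json", json.dumps(config, sort_keys=True)]
    return cmd


cluster_cmd = build_spark_submit(
    {"name": "orders", "execution_mode": "cluster"},
    jar="ingestion-engine.jar",
    main_class="com.example.ingestion.Main",
)
local_cmd = build_spark_submit(
    {"name": "orders", "execution_mode": "local"},
    jar="ingestion-engine.jar",
    main_class="com.example.ingestion.Main",
)
```

Because the result is a plain argument list, an orchestrator task, a cron job, or an event handler can each pass it to a subprocess without the engine knowing which one called it.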

3. Orchestrator Integration (Flexible & Pluggable)

Using an asset-centric factory approach for Dagster or Airflow, ingestion jobs are created dynamically from configuration:

YAML → Airflow/Dagster assets
Automatic context preparation
Standardized execution patterns

This allows:

Native integration with orchestration tools
Consistent pipeline definitions
Simplified scheduling and dependency management
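
The factory pattern described above can be sketched without any orchestrator library: one function per dataset is generated from its configuration, and the orchestrator-specific wrapping (a Dagster asset or an Airflow task) would be applied on top. The function names and the runner callback below are illustrative assumptions.

```python
from typing import Callable, Dict, List


def make_ingestion_job(config: dict, runner: Callable[[dict], str]) -> Callable[[], str]:
    """Create one job function for one dataset configuration."""
    job_name = f"ingest_{config['name']}"

    def job() -> str:
        # Context preparation: enrich the raw config before handing it to the engine.
        context = {**config, "job_name": job_name}
        return runner(context)

    job.__name__ = job_name
    return job


def build_jobs(configs: List[dict], runner: Callable[[dict], str]) -> Dict[str, Callable[[], str]]:
    """Turn a list of dataset configs into named job functions."""
    jobs: Dict[str, Callable[[], str]] = {}
    for config in configs:
        job = make_ingestion_job(config, runner)
        if job.__name__ in jobs:
            raise ValueError(f"duplicate dataset name: {config['name']!r}")
        jobs[job.__name__] = job
    return jobs


# Stand-in runner; a real one would launch the Spark engine.
def fake_runner(context: dict) -> str:
    return f"submitted {context['job_name']}"


jobs = build_jobs([{"name": "orders"}, {"name": "customers"}], fake_runner)
result = jobs["ingest_orders"]()
```

Each generated function closes over its own configuration, so adding a dataset means adding a configuration entry rather than writing a new pipeline.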

Architecture Overview

The ingestion framework operates as a layered pipeline:

Step 1: Configuration Layer

YAML files define ingestion behavior
Supports environment separation (dev, test, prod)
Enables CI/CD-driven deployment

Step 2: Asset Generation Layer

Airflow/Dagster asset factory creates ingestion jobs
Translates configuration into executable pipelines

Step 3: Execution Layer (Spark)

Distributed ingestion processing
Handles large-scale data movement
Supports multiple load strategies

Step 4: Data Platform Integration

Data lands in target systems (e.g., BigQuery, Hive)
Immediately available for transformation (dbt)

Step 5: Transformation & Validation

dbt transformations applied
Data quality checks executed during and after ingestion

Key Capabilities

Multi-Source Ingestion

Supports:

JDBC databases
Kafka streams
SFTP/file-based ingestion
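
As a rough illustration of multi-source support, a source definition can be mapped to a Spark reader format and its options. The jdbc and kafka formats and the option names url, dbtable, kafka.bootstrap.servers, and subscribe are standard Spark options; the configuration keys, and the assumption that SFTP files are first copied to a landing path and read as CSV, are hypothetical.

```python
def reader_spec(source: dict) -> dict:
    """Map a source definition to a Spark reader format and options."""
    kind = source["type"]
    if kind == "jdbc":
        return {
            "format": "jdbc",
            "options": {"url": source["url"], "dbtable": source["table"]},
        }
    if kind == "kafka":
        return {
            "format": "kafka",
            "options": {
                "kafka.bootstrap.servers": source["brokers"],
                "subscribe": source["topic"],
            },
        }
    if kind == "sftp":
        # Assumption: a separate step has already copied the files from the
        # SFTP server to a landing path that Spark can read as CSV.
        return {
            "format": "csv",
            "options": {"path": source["landing_path"], "header": "true"},
        }
    raise ValueError(f"unsupported source type: {kind!r}")


jdbc_spec = reader_spec(
    {"type": "jdbc", "url": "jdbc:postgresql://db.example.com/shop", "table": "orders"}
)
kafka_spec = reader_spec(
    {"type": "kafka", "brokers": "broker1:9092", "topic": "orders"}
)
```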

Flexible Load Strategies

Full load
Partition-based ingestion
Incremental ingestion
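
The difference between the three strategies is easiest to see in the query sent to the source. The sketch below assumes a SQL source and a single partition or watermark column; the function and parameter names are illustrative, and production code would prefer bind parameters over string literals.

```python
def sql_literal(value) -> str:
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def build_source_query(table: str, strategy: str, column=None, value=None) -> str:
    """Build the extraction query for a full, partitioned, or incremental load."""
    base = f"SELECT * FROM {table}"
    if strategy == "full":
        # Full load: read the whole table on every run.
        return base
    if strategy == "partitioned":
        # Partitioned load: read exactly one partition, e.g. one business date.
        if column is None or value is None:
            raise ValueError("partitioned load needs a column and a value")
        return f"{base} WHERE {column} = {sql_literal(value)}"
    if strategy == "incremental":
        # Incremental load: read only rows newer than the last stored watermark.
        if column is None:
            raise ValueError("incremental load needs a watermark column")
        if value is None:
            return base  # first run: no watermark yet, so load everything
        return f"{base} WHERE {column} > {sql_literal(value)}"
    raise ValueError(f"unsupported load strategy: {strategy!r}")


full_query = build_source_query("orders", "full")
partition_query = build_source_query("orders", "partitioned", "order_date", "2024-01-01")
incremental_query = build_source_query("orders", "incremental", "updated_at", "2024-01-01 00:00:00")
```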

Built-In Data Quality Checks

Data validation is executed during ingestion, ensuring:

Early detection of issues
Reduced downstream failures
Higher data reliability
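
A minimal sketch of configuration-driven checks follows. In the framework such checks would run against Spark DataFrames; this plain-Python version over a list of row dictionaries only illustrates the logic, and the check types not_null and min_row_count are assumed names.

```python
def run_checks(rows: list, checks: list) -> list:
    """Run declarative checks over rows and return a list of failure messages."""
    failures = []
    for check in checks:
        kind = check["type"]
        if kind == "not_null":
            column = check["column"]
            # A missing key counts as null, the same as an explicit None.
            nulls = sum(1 for row in rows if row.get(column) is None)
            if nulls:
                failures.append(f"not_null({column}): {nulls} null value(s)")
        elif kind == "min_row_count":
            if len(rows) < check["min"]:
                failures.append(f"min_row_count: got {len(rows)}, expected at least {check['min']}")
        else:
            raise ValueError(f"unknown check type: {kind!r}")
    return failures


rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": None, "amount": 5.0},
]
checks = [
    {"type": "not_null", "column": "order_id"},
    {"type": "min_row_count", "min": 1},
]
failures = run_checks(rows, checks)
```

An ingestion run can then stop, or quarantine the batch, whenever the returned list is not empty.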

Reusable Framework Components

Shared ingestion logic
Centralized utilities
Standardized processing patterns

Environment Isolation

Separate dev, test, and production environments
Safe deployment and testing
CI/CD integration
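
One common way to keep environments separate while sharing a single definition is to layer per-environment overrides on top of a base configuration. The sketch below assumes an environments key holding those overrides; that layout is an illustration, not the framework's documented format.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override values win, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_for_env(config: dict, env: str) -> dict:
    """Apply the overrides for one environment to the shared base settings."""
    overrides = config.get("environments", {})
    if env not in overrides:
        raise KeyError(f"no settings defined for environment {env!r}")
    base = {key: value for key, value in config.items() if key != "environments"}
    return deep_merge(base, overrides[env])


# Hypothetical dataset definition with per-environment targets.
config = {
    "name": "orders",
    "target": {"system": "bigquery", "table": "staging.orders"},
    "execution_mode": "cluster",
    "environments": {
        "dev": {"execution_mode": "local", "target": {"table": "dev_staging.orders"}},
        "prod": {},
    },
}
dev_config = resolve_for_env(config, "dev")
prod_config = resolve_for_env(config, "prod")
```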

Integration with Data Platforms

The framework integrates seamlessly into modern data platforms:

Data is ingested into staging layers
dbt transformations build downstream models
Data becomes immediately usable for analytics and AI

This creates a continuous flow from ingestion to insight.

Business Benefits

Organizations adopting Ingestion as a Service achieve:

70–80% reduction in ingestion development effort
Faster onboarding of new data sources
Consistent and standardized ingestion patterns
Improved data quality from the start
Reduced operational complexity
Scalability across growing data volumes

Strategic Value

Ingestion is not just a technical step; it is the foundation of your data platform.

By productizing ingestion, organizations gain:

A reusable ingestion capability
Faster time-to-data
Improved governance and control
A strong foundation for analytics and AI

Why Datilis

Datilis brings together:

Deep expertise in data engineering and distributed systems
Proven frameworks built on Spark, Scala, dbt, and modern orchestration tools
A strong focus on standardization and reusability
A commitment to delivering business value, not just pipelines

Conclusion

Ingestion as a Service transforms ingestion from a bottleneck into a scalable, reusable platform capability.

By combining:

Configuration-driven design
Distributed Spark execution
Orchestrator flexibility
Built-in data quality

Datilis enables organizations to build robust, future-proof data platforms.

Next Steps

Identify key ingestion bottlenecks
Standardize ingestion patterns across teams
Launch a pilot with 1–2 data sources

Contact Datilis to implement your ingestion framework