Databricks Open Sources Spark Declarative Pipelines for Scalable Data Engineering
Databricks announced it is open-sourcing its core ETL framework, Spark Declarative Pipelines, enabling developers to build scalable, reliable data pipelines using SQL or Python. The framework automates complex pipeline operations, supports both batch and streaming data, and simplifies data engineering, positioning Databricks against Snowflake’s Openflow. Proven at scale, it reduces development and maintenance time significantly.
Databricks has taken a significant step by open-sourcing its core declarative ETL framework, known as Apache Spark Declarative Pipelines. Originally launched as Delta Live Tables (DLT) in 2022, this framework is designed to simplify the creation and operation of reliable, scalable data pipelines, supporting both batch and streaming workloads seamlessly.
Traditionally, data engineering involves complex pipeline authoring, heavy manual operations, and maintaining separate systems for batch and streaming data. Spark Declarative Pipelines changes this by allowing engineers to declare their pipelines using SQL or Python, while Apache Spark manages execution details like dependency tracking, table management, parallelism, checkpoints, and retries automatically.
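For a sense of what this looks like in practice, here is a minimal Python sketch written in the DLT-style decorator API the framework grew out of. The module name, decorators, table names, and storage path are illustrative assumptions; the exact API surface in the open-source release may differ.

```python
# A minimal, illustrative sketch using DLT-style Python decorators.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw events ingested incrementally from object storage")
def raw_events():
    # 'spark' is provided by the pipeline runtime; the path is hypothetical.
    return spark.readStream.format("json").load("s3://example-bucket/raw/events/")

@dlt.table(comment="Cleaned events derived from raw_events")
def clean_events():
    # The dependency on raw_events is inferred from the read, so the engine
    # orders execution, manages checkpoints, and retries failures automatically.
    return (
        dlt.read_stream("raw_events")
        .where(F.col("user_id").isNotNull())
        .withColumn("event_date", F.to_date("event_ts"))
    )
```

The engineer only declares what each table should contain; the framework works out the execution graph from those declarations.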
This declarative approach supports modern data realities such as change data feeds, message buses, and real-time analytics, making it ideal for powering AI systems. It integrates smoothly with object storage platforms like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage, enabling unified processing of batch and streaming data through a single API.
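As a rough sketch of that single-API idea, the same transformation logic can be applied to a batch read or a streaming read of the same object-storage location; only the read call changes. The bucket path below is hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("unified-batch-streaming").getOrCreate()

def enrich(df):
    # Identical transformation logic, whether the input is batch or streaming.
    return df.withColumn("event_date", F.to_date("event_ts"))

# Batch: process the historical data already sitting in object storage.
batch_df = enrich(spark.read.parquet("s3a://example-bucket/events/"))

# Streaming: pick up new files arriving in the same location incrementally.
stream_df = enrich(
    spark.readStream.format("parquet")
    .schema(batch_df.schema)  # streaming file sources need an explicit schema
    .load("s3a://example-bucket/events/")
)
```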
The framework’s effectiveness is proven at scale. Enterprises like Block have cut development time by over 90%, while Navy Federal Credit Union reduced pipeline maintenance by 99%. These improvements translate into faster time-to-insight, lower operational costs, and enhanced performance tailored to specific latency requirements, including sub-second streaming.
Databricks’ open-source move also sets it apart from competitors like Snowflake, whose recently launched Openflow service focuses mainly on data ingestion into its own platform. In contrast, Spark Declarative Pipelines empowers users to build end-to-end pipelines, from raw data sources to transformed, usable data, within an open ecosystem accessible beyond Databricks’ own platform.
The upcoming integration of Spark Declarative Pipelines into the Apache Spark codebase marks a milestone in open data infrastructure. While the exact release date is pending, the technology is already commercially available through Databricks Lakeflow Declarative Pipelines, offering enterprise-grade features and support.
By open-sourcing this framework, Databricks reinforces its commitment to simplifying data engineering and fostering an open ecosystem where users can innovate without vendor lock-in. For organizations aiming to streamline their data pipelines, reduce operational overhead, and accelerate AI initiatives, Spark Declarative Pipelines represents a powerful new tool in the data infrastructure landscape.