Bright Sparks: Databricks emits system to sort out ‘data mess’

Data-nom from stream, lake and warehouse, they chirp

Apache Spark-wrangling biz Databricks has added a third pillar to its Unified Analytics Platform aimed at unifying data management.

The unified data management system, Delta, aims to simplify enterprises’ complex data architecture, which sees data spread across multiple data lakes and data warehouses.

CEO and co-founder Ali Ghodsi told The Register that Delta addressed one of three major roadblocks to widespread use of data analytics.

These are the need for data scientists to collaborate with non-experts, the need to manage complex infrastructure, and the need to ensure good performance, often in real time, on data in many formats.

Ghodsi said Delta – launched today at the Spark Summit in Dublin – aims to tackle the third problem, which sees customers dealing with a “data mess”, with data in data lakes and data warehouses.

At the same time, they also run streaming systems, thanks to an increased need for real-time analytics, such as fraud detection, that can’t operate on stale data.

The idea of Delta, Databricks said, is to let customers cut out "complex, brittle extract, transform, and load processes that run across a variety of systems".

Ghodsi said it will combine streaming and batch processing, and do it with “the performance and reliability of data warehouses, with the advantages of data lakes - essentially that it’s separating compute and storage”.

Delta will store its data in Amazon S3 - Databricks said this would offer the scale of a data lake, and that it would be stored in a non-proprietary and open file format “to ensure data portability and prevent data lock-in”.

Meanwhile, the company said, Delta tables are used as data source and sink, and will provide transactional guarantees for multiple concurrent writes for batch and streaming jobs.
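
Databricks has not said how Delta delivers those guarantees, so the following is only a toy sketch of one common way to get all-or-nothing concurrent writes: optimistic concurrency over a versioned commit log. Every name here (ToyTable, try_commit, writer, run_demo) is hypothetical and the "table" lives in memory; none of it is Delta's API.

```python
import threading


class ToyTable:
    """Hypothetical in-memory table backed by an append-only commit log.

    Illustration only: this is NOT how Delta is implemented.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._commits = []  # each entry is one committed batch of rows

    def version(self):
        """Number of commits so far; writers use it as their snapshot."""
        with self._lock:
            return len(self._commits)

    def try_commit(self, expected_version, rows):
        """Append rows as one unit, but only if nobody committed since
        the writer took its snapshot. Returns True on success."""
        with self._lock:
            if len(self._commits) != expected_version:
                return False  # lost the race; caller must retry
            self._commits.append(list(rows))
            return True

    def commit(self, rows):
        """Retry loop: re-read the version until the commit lands."""
        while not self.try_commit(self.version(), rows):
            pass

    def read(self):
        """Readers only ever see whole commits, never partial ones."""
        with self._lock:
            return [row for batch in self._commits for row in batch]


def writer(table, name, batches):
    """Stand-in for a batch or streaming job writing 3 rows at a time."""
    for i in range(batches):
        table.commit([(name, i, j) for j in range(3)])


def run_demo(batches=50):
    """Run one 'batch' and one 'stream' writer against the same table."""
    table = ToyTable()
    jobs = [threading.Thread(target=writer, args=(table, n, batches))
            for n in ("batch", "stream")]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    return table
```

Calling run_demo() lets the two writers interleave freely, yet each three-row batch lands whole or not at all, which is the property the transactional claim is about.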

Delta also claims a number of automated abilities, including automated performance management that cuts out the need for manual tuning, a self-optimising data layout, and intelligent data skipping and indexing.
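
The company gave no detail on how the skipping works. As a rough illustration of the general idea, and assuming per-file minimum and maximum statistics (a common approach, not one Databricks has confirmed for Delta), a range query can ignore any file whose value range cannot overlap the filter. DataFile and scan_with_skipping are made-up names for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Hypothetical data file holding one integer column.

    Min/max statistics are computed once, at write time.
    Assumes the file is non-empty.
    """
    name: str
    values: list
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self):
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_with_skipping(files, low, high):
    """Return (matching values, names of files actually opened)
    for the filter low <= value <= high."""
    hits, opened = [], []
    for f in files:
        # Skip the file if its range cannot overlap [low, high].
        if f.max_value < low or f.min_value > high:
            continue
        opened.append(f.name)
        hits.extend(v for v in f.values if low <= v <= high)
    return hits, opened


files = [
    DataFile("part-a", [1, 5, 9]),
    DataFile("part-b", [10, 15, 19]),
    DataFile("part-c", [20, 25, 29]),
]
hits, opened = scan_with_skipping(files, 12, 21)
print(hits, opened)
```

Here part-a is never opened because its maximum sits below the filter's lower bound. The trick pays off most when the data layout keeps similar values in the same files, which is presumably where a self-optimising layout would come in.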

Ghodsi said that, as a cloud company, Databricks' “number one priority” was security, listing security accreditations and its partnership with the CIA’s investment arm In-Q-Tel.

He said that customers can be given access to full audits and logs for metadata and data, for data governance requirements, claiming that - because all data is validated when it is brought into the system - it is also reliable. ®


Biting the hand that feeds IT © 1998–2017