Data Lakehouse Solutions

Explore the most popular content on LinkedIn, contributed by experts.

  • View the profile of Benjamin Rogojan

    Fractional Head of Data | Tool-Agnostic. Outcome-Obsessed

    178,553 followers

    Data lakes went from being the answer to the data bottleneck to becoming data swamps at many companies I have talked to. When I first broke into the data world, everyone wanted to build a data lake. They thought it was the key to letting data scientists and ML engineers deliver value quickly. Schema-on-read was hailed as a revolutionary idea! But you don't hear too many people talking about it anymore.

    Data teams quickly found that many data lakes still required some level of pre-processing. To make matters worse, it wasn't uncommon to find data workflows with thousands of lines of untested, hard-to-trace SQL queries and Python scripts, all to calculate a few metrics. I recall helping a few friends look through script after script, trying to figure out where their small change request would actually need to go because of how complex some workflows were.

    This isn't to say that data lakes don't have their place. I have seen them used successfully at companies where they served as a layer for developing MVPs of ML models before a more reliable process moved the data from the data lake into a more standard data storage solution. But in these cases, there was generally a clear process and some level of governance (both in terms of data and code).

    To a large degree, data lakes and data lakehouses were developed around technologies and vendors (data lakes: Hadoop; data lakehouses: Databricks), and whether you think they are right or wrong is less of the point. You need to get past the marketing and figure out what processes make your implementation successful, otherwise we'll just keep going through the same cycle every decade or so.

    But I'd love to hear your thoughts: how can companies successfully create processes that build reliable data systems and teams?

  • View the profile of Roy Hasson

    Product @ Microsoft | Data engineer | Advocate for better data

    10,008 followers

    We have the data warehouse and the data lake, but what does "Data Lakehouse" actually mean?

    A data warehouse offers a lot of capabilities built into the platform that make it easy to author queries and transformations, process data very efficiently, and manage the data (sort, distribute, partition, etc.), and it provides some built-in tools to bring data in and out. A data lake, on the other hand, is mostly a highly scalable, cost-effective way to store lots of data of varying types - text, images, videos, etc. Data lakes lack much of what's needed to manage and analyze the data. So we added those external components, like catalogs, query engines, and processing engines, and wrapped them all up in many engineering best practices to ensure everything works as expected - which is a nightmare to maintain.

    A data lakehouse is the combination of many core warehouse capabilities with scalable data lake storage. But OK, you're probably saying this is no different from "separation of compute and storage". True, but it goes far beyond that. A data lakehouse decouples the following components:

    1. Storage - open file formats
    2. Metadata/Transactions - standard handling, maintained with the data
    3. Table services - standard implementation of best practices
    4. Processing/Compute - choose your weapon...

    There are two unique aspects of the Lakehouse that don't exist with data lakes or warehouses.

    🔥 Shared storage: A dataset is composed of physical files, in open formats, and metadata files that describe how the data is structured - schema, partitions, sort order, row min/max, etc. This package, managed by Apache Iceberg, can be shared and accessed by many tools and services, making it ubiquitous and independent of any single cloud or tool vendor.

    🔥 Table services: Best practices as processes and "glue code" that we've developed over the years to accomplish a task well. In data lakes that's compacting small files, partitioning and sorting data, etc. In a lakehouse, these are still required. However, Apache Iceberg provides a reference implementation that makes them easy to use without writing custom code. The same applies regardless of the cloud or vendor you use, which makes building a robust lakehouse that's open, portable, and performant easier.

    The future of data platforms will be based on Lakehouse patterns, in particular shared storage. We're already seeing this take shape with Snowflake, AWS Athena/Redshift/Glue, BigQuery, Azure Fabric (Delta Lake), Databricks (Delta Lake), and others.

    ----

    Upsolver provides #zeroETL ingestion from production sources into #ApacheIceberg shared storage. Upsolver also provides powerful table services to continuously optimize and manage your lake.
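    A minimal sketch (not from the post) of those built-in table services: Iceberg ships compaction and snapshot cleanup as Spark stored procedures, so no custom glue code is needed. It assumes the Iceberg Spark runtime jar is on the classpath; the catalog name `lake`, the table `db.events`, and the S3 warehouse path are placeholders.

```python
# Hedged sketch: Iceberg "table services" exposed as Spark stored procedures.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-table-services")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# Placeholder table so the procedures have something to operate on
spark.sql("CREATE TABLE IF NOT EXISTS lake.db.events (id BIGINT, payload STRING) USING iceberg")
spark.sql("INSERT INTO lake.db.events VALUES (1, 'a'), (2, 'b')")

# Compact small files into larger ones (the classic data lake chore)
spark.sql("CALL lake.system.rewrite_data_files(table => 'db.events')")

# Expire old snapshots and their metadata once they are no longer needed
spark.sql(
    "CALL lake.system.expire_snapshots("
    "table => 'db.events', older_than => TIMESTAMP '2024-01-01 00:00:00')"
)
```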

  • View the profile of Lars Kamp

    I write about cloud security and capital markets.

    9,909 followers

    A year ago I had Jason Reid from Tabular on my podcast, and we talked about Apache Iceberg. Yesterday, on the Snowflake earnings call, the word "Iceberg" was mentioned 18 times. Why? 👇 Because Iceberg has the potential to disintermediate cloud warehouses like Snowflake.

    Some background on Apache Iceberg: it's a table format that offers both the simplicity of SQL and the separation of storage and compute. The Iceberg table format works with any compute engine, so users are not limited to a single engine. Popular engines (e.g., Spark, Trino, Flink, and Hive) and modern cloud warehouses (e.g., Snowflake, Redshift, and BigQuery) can work with Iceberg tables at the same time.

    A table format is a layer that sits between the file format and the database. Iceberg is an abstraction layer above file formats like Parquet, Avro, and ORC, born out of necessity at Netflix. Like many other companies at the time, Netflix shifted from MPP data warehouses to the Hadoop ecosystem in the 2010s. MPP warehouses like Teradata were hitting scale limitations and becoming too expensive at Netflix's scale.

    The Hadoop ecosystem abandoned the table abstraction layer in favor of scale. In Hadoop, we deal directly with file systems like HDFS. The conventional wisdom at the time was that bringing compute to storage was easier than moving the data to compute. Hadoop scales compute and disk together, which turned out to be incredibly hard to manage in the on-premise world. Early on, Netflix shifted to the cloud and started storing data in Amazon S3 instead, which separated storage from compute. Snowflake also picked up on that principle, bringing back SQL semantics and tables from "old" data warehouses.

    Netflix wanted both the separation of storage and compute and SQL table semantics. They wanted to add, remove, and rename columns without having to rewrite S3 paths. But rather than going with another proprietary vendor, Netflix wanted to stay with open source and open formats. And thus, Iceberg was developed and eventually donated to the Apache Foundation. Today, Iceberg is also in use at companies like Apple and LinkedIn.

    Tabular is the company behind Apache Iceberg. Working with open-source Iceberg tables still requires an understanding of object stores and distributed data processing engines and how the various components interact with each other. Tabular lowers the bar for adoption and removes the heavy lifting.

    And so the analyst community picked up on Iceberg - because with Iceberg, customers essentially don't need to copy their data into a cloud warehouse anymore. And that means less storage revenue for Snowflake (and any other warehouse that charges for storage).

    Hindsight is 20/20, but on the podcast, Jason and I talked about the impact of Iceberg on the cloud warehouse market. Jason made a few predictions and looked into the future - funny how just a year later these predictions have already played out! #data #snowflake #iceberg
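    To illustrate the schema evolution mentioned above (this is my sketch, not from the post): with Iceberg, column adds, renames, and drops are metadata-only operations, so no data files get rewritten. The catalog name `lake`, the table `db.movies`, and the warehouse path are placeholders, and it assumes the Iceberg Spark runtime is available.

```python
# Hedged sketch: Iceberg schema evolution through Spark SQL.
# Column changes only touch table metadata; Parquet files in S3 stay as they are.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-schema-evolution")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS lake.db.movies (title STRING, year INT) USING iceberg")

# Add, rename, and drop columns without rewriting any data files
spark.sql("ALTER TABLE lake.db.movies ADD COLUMN imdb_rating DOUBLE")
spark.sql("ALTER TABLE lake.db.movies RENAME COLUMN imdb_rating TO rating")
spark.sql("ALTER TABLE lake.db.movies DROP COLUMN rating")
```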

  • View the profile of Alex Merced

    Co-Author of “Apache Iceberg: The Definitive Guide” | Head of DevRel at Dremio | LinkedIn Learning Instructor | Tech Content Creator

    33,807 followers

    🏠 Example Streaming Data Lakehouse Architecture with Kafka, Flink, Iceberg, Nessie, and Dremio!

    Data ingestion, processing, and analytics have become crucial in today's fast-paced digital world. To shed light on one such architecture, let's dive into an interesting combination of technologies that provides a seamless flow of streaming data:

    1️⃣ Data Ingestion: Data is first published onto Apache Kafka topics, making Kafka the starting point and heartbeat of our architecture.
    2️⃣ Stream Processing: Apache Flink comes into play, processing data in near real-time and ensuring it's ready for the next steps.
    3️⃣ Data Storage: Processed data lands in Apache Iceberg tables. Iceberg provides atomicity and fine-grained incremental data access.
    4️⃣ Catalog Management: The Nessie catalog maintains and versions our Iceberg tables, enabling better organization and management of our data.
    5️⃣ Query & Analytics: Dremio facilitates querying this structured data. Dremio's semantic layer allows you to organize, document, and govern the data, and Dremio's data reflections provide additional Iceberg-backed performance boosts.

    All the data curated in Dremio can be accessed by data consumers using Dremio's intuitive Web UI, REST API, Arrow Flight, or traditional ODBC/JDBC connectivity.

    Pros:
    🟢 Real-time Processing: With Kafka and Flink, get insights in near real-time, aiding quick decision-making.
    🟢 Scalability: This stack scales to handle large volumes of data.
    🟢 Flexibility & Version Control: Nessie ensures data versioning while Dremio offers multiple data access methods. (Add in automated table maintenance if using Dremio's Nessie-based Arctic catalog service.)
    🟢 Efficient Storage: Iceberg metadata on top of Apache Parquet data files optimizes query planning and processing.

    #DataLakehouse #Kafka #Flink #Iceberg #Nessie #Dremio #DataArchitecture
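    To make the Kafka → Flink → Iceberg portion concrete, here is a hedged sketch (mine, not from the post) using PyFlink SQL with a Nessie-backed Iceberg catalog. The topic, endpoints, bucket, catalog, and table names are placeholders, and it assumes the Flink Kafka connector and the Iceberg/Nessie runtime jars are on the classpath.

```python
# Hedged sketch: Kafka -> Flink -> Iceberg (Nessie catalog) with PyFlink SQL.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# 1️⃣ Kafka source: streaming click events arriving as JSON
t_env.execute_sql("""
    CREATE TABLE clicks_src (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'clicks',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# 4️⃣ Nessie-backed Iceberg catalog that tracks and versions the tables
t_env.execute_sql("""
    CREATE CATALOG lakehouse WITH (
        'type' = 'iceberg',
        'catalog-impl' = 'org.apache.iceberg.nessie.NessieCatalog',
        'uri' = 'http://localhost:19120/api/v1',
        'ref' = 'main',
        'warehouse' = 's3a://my-bucket/warehouse'
    )
""")
t_env.execute_sql("CREATE DATABASE IF NOT EXISTS lakehouse.analytics")
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.analytics.clicks (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP(3)
    )
""")

# 2️⃣ + 3️⃣ Continuous insert: lightly processed rows land in the Iceberg table.
# Note: in streaming mode, Iceberg commits happen on Flink checkpoints, so
# checkpointing must be enabled for data to become visible.
t_env.execute_sql("""
    INSERT INTO lakehouse.analytics.clicks
    SELECT user_id, LOWER(url) AS url, ts
    FROM clicks_src
""")
```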

  • View the profile of Andrew Madson MSc, MBA

    Data Professor | 250K Followers | O'Reilly Author

    92,263 followers

    Hey, Data Analysts! Should you compress your data?

    Efficiently storing and processing large volumes of data is necessary for effective analytics. One technique that makes this possible is compression.

    👉 What is compression, and why does it matter?

    Compression is the process of reducing data size to save storage space and improve data transfer speeds. By minimizing the amount of space data occupies, we can store more information in the same amount of storage, reduce costs, and speed up data processing tasks.

    One popular file format is Apache Parquet, a columnar storage format that supports efficient compression and encoding schemes. Parquet employs various compression techniques, including run-length encoding (RLE). Run-length encoding is particularly effective for compressing data with repeated values. It replaces consecutive identical values with a single instance of the value and a count of its occurrences. For example, the sequence "AAAABBBCCCDDDD" would be encoded as "4A3B3C4D". RLE excels at compressing datasets with low-cardinality columns, making it highly efficient for storing and processing data in Parquet files.

    But compression alone isn't enough to build a robust and efficient data lake. That's where technologies like Apache Iceberg come into play. Apache Iceberg is an open table format for huge analytic datasets, designed to solve the challenges faced in large-scale data lake implementations. With Apache Iceberg, data engineers can build more effective data lakes by leveraging features such as schema evolution, time travel, and atomic commits. Iceberg's snapshot-based architecture enables data practitioners to track changes over time, roll back to previous versions, and ensure data consistency across multiple readers and writers.

    By combining efficient compression techniques like run-length encoding on Parquet files with the power of Apache Iceberg, data engineers can create data lakes that are more manageable, scalable, and performant. This synergy allows organizations to extract valuable insights from their data while optimizing storage costs and processing times. Run-length encoding on Parquet files, coupled with technologies like Apache Iceberg, empowers data engineers to build efficient and effective data lakes.

    Want to learn more? Check out:
    ➡️ Alex Merced - "Apache Iceberg: The Definitive Guide" 🔗 https://lnkd.in/gXSEQDEA
    ➡️ Zach Wilson - "Data Lake Fundamentals, Apache Iceberg and Parquet" on YouTube 🔗 https://lnkd.in/gX2X3PCT
    ➡️ Joe Reis 🤓 and Matthew Housley - "The Fundamentals of Data Engineering" 🔗 https://lnkd.in/gz2ZteAR

    Happy Learning! #dataanalytics #dataengineering #apacheiceberg
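    A toy illustration (mine, not from the post) of the RLE idea, plus writing a low-cardinality column to Parquet with pyarrow. Parquet applies dictionary/RLE encodings internally, so in practice you only choose the block compression codec; the hand-rolled function below just mirrors the "AAAABBBCCCDDDD" example, and the column values and file name are made up.

```python
# Hedged sketch: run-length encoding by hand, then a Parquet write with pyarrow.
from itertools import groupby

import pyarrow as pa
import pyarrow.parquet as pq


def run_length_encode(s: str) -> str:
    """Collapse runs of identical characters into <count><char> pairs."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(s))


assert run_length_encode("AAAABBBCCCDDDD") == "4A3B3C4D"  # the example from the post

# A low-cardinality column compresses extremely well: Parquet dictionary/RLE-encodes
# it automatically, and the chosen codec (zstd here) compresses the resulting pages.
table = pa.table({"status": ["ok"] * 1_000 + ["error"] * 10})
pq.write_table(table, "events.parquet", compression="zstd")
```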

  • View the profile of Chitrang Davé

    Data + AI + Analytics + Technology in Healthcare & Life Sciences | Chief Data and Analytics Officer | Growth, Innovation, and Efficiency

    4,417 followers

    This whitepaper is a must-read for anyone thinking about the future of data. Raghu Ramakrishnan and Josh Caplan lay out Microsoft's vision and the design principles behind OneLake & #Fabric.

    I appreciate the open approach, based on a desire to:
    ✅ minimize data movement and eliminate multiple copies of data, and
    ✅ get a consolidated enterprise view – the "entire data estate"!

    Are we "entering the golden age of analytics"? 💯 Here's why:
    1️⃣ Elastic cloud compute and storage, and
    2️⃣ The best tools that can take advantage of it.

    What does this mean for data? That we can:
    🚂 bring the most appropriate tools (compute engines) to bear on the task at hand (workloads)
    💾 using data stored in open, efficient, accessible formats

    So what are these workloads and engines?
    • #Analytics – #SQL & BI tools like Tableau, Power BI, Qlik, Sigma, Trino
    • Data integration/engineering for #ETL or #ELT, using SQL, Python, Spark
    • Data streaming – KQL, or Kafka with ksql and/or Flink
    • AI/ML – R, #Python, #GenAI using open-source models
    • Statistical analysis using tools like SAS

    Open Parquet-based table formats Delta Lake, Iceberg, and Hudi dominate the open storage discussions. Microsoft (& Databricks) use Delta in Fabric while actively supporting the Apache XTable project for interoperability with Iceberg and Hudi. It is worth noting that Snowflake's native format is proprietary, but they now support Iceberg tables.

    🔐 Security and governance are top of mind for enterprises, and it only makes sense for this to be designed and built into the data layer from the ground up. The Purview integration with OneLake in Fabric addresses a big challenge for enterprises.

    If you are interested in the future of data & analytics, this will be worth your time - better than any webinar, conference, or sales briefing. https://lnkd.in/gTQtczX9

  • View the profile of Dmitriy Braverman

    Data Architect @ Western Midstream

    7,563 followers

    🚀 The Evolution of Data Warehousing: From ETL to Lakehouse

    The data warehousing landscape has undergone a massive #transformation over the past few decades — driven by growing data volumes, the demand for agility, and the need for faster, more reliable insights.

    🏛️ The Birth of the Enterprise Data Warehouse (EDW)
    35–40 years ago, the Enterprise Data Warehouse (EDW) emerged as a centralized repository for reporting and analytics.
    * Data was integrated from multiple operational systems via #ETL (Extract → Transform → Load).
    * Tables were predefined, and transformations happened before loading — a #schema-on-write approach.
    * Reporting tools relied on consistent, structured, relational data.
    * This model prioritized #governance, #quality, and #reliability, but struggled with flexibility and scalability.

    🌊 The Rise of the Data Lake
    About 15 years ago, the Data Lake emerged — first via the Hadoop Distributed File System (#HDFS) and later through cloud-native object storage like #Amazon S3 and Azure Data Lake Storage (#ADLS). This era introduced two key shifts:
    * #ELT (Extract → Load → Transform) replaced traditional ETL, allowing more flexibility by performing transformations post-load.
    * A #schema-on-read approach enabled storing raw, #unstructured, or semi-structured data without enforcing a schema upfront.

    🔻 Limitations of Classic Data Lakes
    Despite their flexibility and scalability, traditional data lakes had critical shortcomings:
    ❌ Lack of schema enforcement – made it harder to manage and validate data.
    ❌ No ACID guarantees – data consistency was not ensured in concurrent environments.
    ❌ No transactional consistency – no safe way to update or delete data without risks.
    As a result, data lakes were often unsuitable for BI, governance, or regulatory use cases.

    ☁️ The #Cloud #Data #Warehouse Era (2012–Present)
    To address the limitations of both EDWs and classic data lakes, cloud data warehouses emerged. They brought scalability, performance, and accessibility by leveraging cloud infrastructure.
    Key platforms include:
    * Snowflake
    * Google BigQuery
    * Azure Synapse Analytics
    * Amazon Redshift
    Key benefits:
    * Fully managed infrastructure
    * High performance and concurrency
    * Familiar #SQL interfaces
    However, these systems still had limitations, including closed formats, vendor lock-in, and cost challenges at extreme scale.

    🏠 The Data Lakehouse: The Best of Both Worlds (2019–Present)
    The Lakehouse architecture emerged as a hybrid solution, combining the cost-efficiency and flexibility of data lakes with the structure and reliability of data warehouses.
    Key components:
    * Open table formats like Apache Iceberg and Delta Lake
    * Open, scalable storage (e.g., S3, ADLS)
    * ACID transactions directly on the data lake (see the sketch after this post)
    * Query engines like #Presto, Trino, #Spark SQL, and Athena enable #SQL queries directly on lake data
    This unified architecture allows organizations to support #BI, data #engineering, #datascience, and #ML.
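    A minimal sketch (mine, not from the post) of what "ACID transactions directly on the data lake" can look like in practice, assuming an Iceberg table in an S3-backed catalog with placeholder names and the Iceberg Spark runtime on the classpath: row-level UPDATE and DELETE statements commit atomically as new table snapshots, which classic file-based data lakes could not guarantee.

```python
# Hedged sketch: warehouse-style ACID operations on lake storage via Spark + Iceberg.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-acid-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS lake.db.orders (id BIGINT, status STRING) USING iceberg")
spark.sql("INSERT INTO lake.db.orders VALUES (1, 'open'), (2, 'open')")

# Row-level changes commit atomically; concurrent readers see either the old
# snapshot or the new one, never a half-applied state.
spark.sql("UPDATE lake.db.orders SET status = 'shipped' WHERE id = 1")
spark.sql("DELETE FROM lake.db.orders WHERE id = 2")

# Time travel: every committed transaction is a snapshot you can inspect
spark.sql("SELECT * FROM lake.db.orders.snapshots").show()
```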