Imagine: a data management and storage layer, developed by big tech companies operating at web scale, promised as the future of all data processing.
Sound familiar?
No, I’m not talking about Hadoop. I’m talking about Iceberg. And there are echoes of that bygone era when everyone had to be ready for “BIG DATA”.
I’m not the first to say this, but the similarities are impossible to ignore. To be clear, I do think Iceberg is different, and that it will likely be broadly adopted and useful.
BUT, we’re in an interesting time right now. Why? Well, in terms of actual adoption, Iceberg hasn’t been adopted all that broadly yet. The ecosystem is still young.
Iceberg had great momentum in 2024: Snowflake was all in on it, and Confluent announced they were moving in that direction…
Then Databricks announced their acquisition of Tabular, during Snowflake Summit 😬, for a rumoured “more than” $1B (I’ve heard everything from $1.5B to $2B 🤯).
Tabular didn’t have meaningful customers or revenue. What Tabular did have was their founders, Ryan Blue and Dan Weeks, the creators of Iceberg.
This was about DRAMA! (As my wife put it when I told her the story, “These tech companies are more dramatic than the real housewives…”)
No, this was a major shot across the bow of Snowflake. This was Databricks stating they were going to win no matter what. Behind the scenes it had been a heated battle, with both Snowflake and Databricks bidding the price up. The acquisition really made more sense for Snowflake; Databricks was taking it from them.
Databricks had some advantages in the bidding war: it is the best-funded venture-backed company in history, with finances and a valuation not subject to the vagaries of the public markets ($43B at the time of the acquisition, now $62B). This meant that a stock acquisition could be marked up, and public investors were not going to crush them the next day for overpaying.
Anyway, that was a fun story, but where does it leave the data landscape today?
In a strange place.
Let’s take a step back and explain what Iceberg is. Apache Iceberg, officially, like other open table formats such as Apache Hudi and Delta Lake, is a storage format for data. I won’t go into the full history (there is a nice summary here), but the essence is that it allows data teams to store data on object storage, like Amazon S3, Azure Data Lake Storage, or Google Cloud Storage, and retain metadata that allows multiple services to query and manage the underlying data. Iceberg and the other open table formats generally came out of the requirements of big tech companies like Netflix and Uber and gained initial adoption with those sorts of businesses. Those with truly big data.
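To make the “metadata on top of object storage” idea concrete, here’s a simplified, hypothetical sketch in plain Python (no real Iceberg library; all paths and field names are made up for illustration) of how an Iceberg table’s metadata layers point down to the actual data files, and how any engine that can read that chain can plan a query:

```python
# Toy model of Iceberg's metadata hierarchy. Real tables store these
# layers as JSON/Avro files on object storage; this just shows the shape.

# Data files: the actual Parquet files, each tagged with partition values.
data_files = [
    {"path": "s3://lake/events/date=2024-01-01/part-0.parquet",
     "partition": {"date": "2024-01-01"}, "records": 1000},
    {"path": "s3://lake/events/date=2024-01-02/part-0.parquet",
     "partition": {"date": "2024-01-02"}, "records": 1200},
]

# A manifest lists a group of data files plus per-file stats.
manifest = {"path": "s3://lake/events/metadata/manifest-1.avro",
            "entries": data_files}

# A snapshot points to manifests; the table metadata file points at the
# current snapshot. Any engine that walks this chain can query the table,
# which is what makes the format engine-agnostic.
snapshot = {"snapshot_id": 1, "manifest_list": [manifest]}
table_metadata = {"table": "db.events", "current_snapshot": snapshot}

def plan_scan(metadata, date_filter=None):
    """Walk metadata -> snapshot -> manifests -> data files, skipping
    files whose partition values can't match the filter."""
    files = []
    for m in metadata["current_snapshot"]["manifest_list"]:
        for entry in m["entries"]:
            if date_filter is None or entry["partition"]["date"] == date_filter:
                files.append(entry["path"])
    return files

print(plan_scan(table_metadata, date_filter="2024-01-02"))
# -> ['s3://lake/events/date=2024-01-02/part-0.parquet']
```

The pruning step is the point: the engine knows which files to read from metadata alone, with no directory listing against S3.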
This is in contrast to the other major model of data architecture, which gained so much adoption in the past 10 years: the cloud data warehouse, e.g. Snowflake, BigQuery, Redshift, and increasingly Databricks SQL. Snowflake and BigQuery added the further advantage of separating compute and storage within their services, which made them incredibly easy to adopt and scale. It also meant that if all your data was stored in Snowflake and you wanted to do some analysis on that data, you had to use Snowflake compute to do it. It was a beautiful product to build a business around, because the data was valuable, and Snowflake made it more valuable! But you had to pay!
Coming back to Iceberg, one of the main advantages is the ability to use any compatible compute engine on top of it. This means you can use Trino/Presto, Snowflake, BigQuery, Databricks/Spark, DuckDB, Daft, Flink (inc. Streamkap👀) etc. etc. etc.
So the promise is this: now that your data is not locked in with any one vendor, you can use whatever compute is cheapest or best for your particular workload.
That sounds great, but the reality is we’re still very early. Databricks buying Tabular accelerated part of the adoption curve: the bit where everyone (data vendors and end practitioners/users alike) now knew that Iceberg had won and was going to be THE format.
This consensus is accelerating development and commitment to Iceberg industry-wide, but the reality on the ground is that it’s still hard to run a stack around Iceberg that truly gives you access to all of the benefits described above. Amazon launched S3 Tables last year, but indications are that it’s still pretty half-baked.
You still need to figure out your Iceberg catalogue, which is often a separate service (some examples here). As a confusing aside, there are five different “catalogs” in data.
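For intuition, the table-format sense of “catalog” is a small service that maps a table name to the location of its current metadata file, and makes commits atomic so concurrent writers can’t clobber each other. A toy sketch (hypothetical class and paths, plain Python, not any real catalog API):

```python
import threading

class ToyCatalog:
    """Minimal sketch of an Iceberg-style catalog: maps table names to
    the current metadata file location; commits are a compare-and-swap
    so a stale writer fails instead of overwriting a newer snapshot."""

    def __init__(self):
        self._tables = {}          # "db.table" -> metadata file path
        self._lock = threading.Lock()

    def register(self, name, metadata_path):
        with self._lock:
            self._tables[name] = metadata_path

    def current_metadata(self, name):
        return self._tables[name]

    def commit(self, name, expected, new):
        # Only succeed if nobody committed in between (compare-and-swap).
        with self._lock:
            if self._tables.get(name) != expected:
                return False       # another writer won; caller must retry
            self._tables[name] = new
            return True

catalog = ToyCatalog()
catalog.register("db.events", "s3://lake/events/metadata/v1.json")
ok = catalog.commit("db.events", "s3://lake/events/metadata/v1.json",
                    "s3://lake/events/metadata/v2.json")
stale = catalog.commit("db.events", "s3://lake/events/metadata/v1.json",
                       "s3://lake/events/metadata/v3.json")
print(ok, stale)   # True False
```

That single pointer-swap is the whole job, which is why the catalog is such a small piece yet every engine in your stack has to agree on which one you’re using.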
It’s not clear to me that there are good services yet which help you a) decide which compute engine to use and b) manage the interface with the other tools that use that compute, e.g. dbt, BI tools, etc. Though I know some people are working on it.
So, where does this leave data teams today? You’re going to need headcount to manage Iceberg.
If you spend millions per year on a cloud data warehouse, it probably makes sense to start looking at making the move (though if that’s the case, you probably already are). If you’re smaller but doing a platform shift right now, maybe it’s worth it, but it will take more work than just adopting a cloud data warehouse, and the landscape is almost certainly going to change in the next couple of years, so you risk betting on the wrong architecture or tooling. Tread carefully.
My prediction is that most mid-market companies don’t have enough pain and will just wait until things get a lot easier and then make the move.
All that said, get in touch if you are interested in streaming to Iceberg. We’re accepting companies as beta testers for Streamkap’s Iceberg destination :)
Warmly,
Paul Dudley
P.S. Iceberg Summit 2025 is on my calendar, and if you're working with Apache Iceberg or just curious about where the ecosystem is headed, you should join too! Let’s connect and talk data! Check it out here.