If you are storing raw data in Snowflake’s native tables, you are trusting that you will always want to be in Snowflake, that the pricing will always be acceptable, and that no other tool will ever need to read that data. That is a lot of trust.

Apache Iceberg tables stored in S3, with AWS Glue as the catalog, fix this. The files are yours, in an open format, in your own object storage. You define the Iceberg table in Glue, and Snowflake reads it as an externally managed Iceberg table through a catalog integration and an external volume. Snowflake becomes one reader, not the authority.
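To make the Snowflake side of that concrete, here is a minimal sketch in Python that renders the two statements involved: one to create a catalog integration pointing at Glue, one to register an Iceberg table that Glue already knows about. Every name, ARN, account ID, and region below is a hypothetical placeholder. The sketch also assumes an external volume for the S3 bucket already exists. It only builds SQL strings, so treat it as an outline of the moving parts and not as a finished setup script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlueIcebergConfig:
    """Settings for reading a Glue-cataloged Iceberg table from Snowflake."""
    integration_name: str   # name of the Snowflake catalog integration
    glue_database: str      # Glue database that holds the Iceberg table
    glue_role_arn: str      # IAM role Snowflake assumes to call Glue
    glue_catalog_id: str    # AWS account ID that owns the Glue catalog
    glue_region: str
    external_volume: str    # pre-existing Snowflake external volume for the bucket
    table_name: str         # table name as registered in Glue


def catalog_integration_sql(cfg: GlueIcebergConfig) -> str:
    """Render the statement that tells Snowflake where the Glue catalog lives."""
    return (
        f"CREATE CATALOG INTEGRATION {cfg.integration_name}\n"
        f"  CATALOG_SOURCE = GLUE\n"
        f"  CATALOG_NAMESPACE = '{cfg.glue_database}'\n"
        f"  TABLE_FORMAT = ICEBERG\n"
        f"  GLUE_AWS_ROLE_ARN = '{cfg.glue_role_arn}'\n"
        f"  GLUE_CATALOG_ID = '{cfg.glue_catalog_id}'\n"
        f"  GLUE_REGION = '{cfg.glue_region}'\n"
        f"  ENABLED = TRUE;"
    )


def iceberg_table_sql(cfg: GlueIcebergConfig) -> str:
    """Render the statement that registers the Glue table in Snowflake.

    No data is copied: Snowflake records where the table metadata lives.
    """
    return (
        f"CREATE ICEBERG TABLE {cfg.table_name}\n"
        f"  EXTERNAL_VOLUME = '{cfg.external_volume}'\n"
        f"  CATALOG = '{cfg.integration_name}'\n"
        f"  CATALOG_TABLE_NAME = '{cfg.table_name}';"
    )


# Hypothetical values, for illustration only.
EXAMPLE = GlueIcebergConfig(
    integration_name="glue_catalog_int",
    glue_database="raw_events_db",
    glue_role_arn="arn:aws:iam::123456789012:role/snowflake-glue-read",
    glue_catalog_id="123456789012",
    glue_region="us-east-1",
    external_volume="raw_events_volume",
    table_name="page_views",
)

if __name__ == "__main__":
    print(catalog_integration_sql(EXAMPLE))
    print()
    print(iceberg_table_sql(EXAMPLE))
```

Neither statement contains a column list or a load step. The schema and the data stay where Glue and S3 already have them, which is the whole point.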

What makes the Glue catalog worth using is that it is not exclusive. Spark on EMR, Athena, Flink, your ML pipeline, and Snowflake can all read from it. None of them needs to talk to the others, nothing is staged into a separate system, and every engine points at the same Parquet files. Last year a client asked me to add Athena access to data that was already in Snowflake. The answer was: “We can’t, the files are in Snowflake’s internal storage.” We had to copy everything. I’m sure they were not the first team to have that conversation.

People worry about performance, and there is something to it. Snowflake has to get the table metadata through Glue, parse the Iceberg manifest files, and then read the actual Parquet data. That adds latency compared to native Snowflake tables. In my experience the overhead is real but small, and it mostly disappears with a decent partition strategy, because partitioning lets the engine skip files that cannot match the query. The alternative, copying raw data into Snowflake’s proprietary format just to avoid that overhead, is not a performance choice. It is a lock-in choice dressed up as one.
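Why partitioning matters so much here can be shown with a toy model. The sketch below is not how Iceberg or Snowflake implement pruning; real engines work from manifest-level partition summaries and column statistics. It only illustrates the effect: if each data file records which partition it belongs to, a date filter can discard most files before any Parquet is read. The bucket name, paths, table layout, and file sizes are all made up.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DataFile:
    """A stand-in for one manifest entry: a file plus its partition value."""
    path: str
    event_date: date   # partition value; one day per file in this toy model
    size_mb: int


def prune(files, start, end):
    """Keep only files whose partition value falls in [start, end].

    This decision needs metadata only; no data file is opened.
    """
    return [f for f in files if start <= f.event_date <= end]


def scanned_mb(files):
    """Total size of the files a query would actually have to read."""
    return sum(f.size_mb for f in files)


# Hypothetical table: 30 daily partitions, one 128 MB file each.
first_day = date(2024, 1, 1)
table_files = [
    DataFile(
        path=f"s3://example-bucket/events/event_date={first_day + timedelta(days=i)}/part-0.parquet",
        event_date=first_day + timedelta(days=i),
        size_mb=128,
    )
    for i in range(30)
]

# A query that filters on three days of data.
needed = prune(table_files, date(2024, 1, 10), date(2024, 1, 12))

if __name__ == "__main__":
    print(f"files in table: {len(table_files)}, MB: {scanned_mb(table_files)}")
    print(f"files after pruning: {len(needed)}, MB: {scanned_mb(needed)}")
```

In this model the three-day query reads 3 of 30 files, 384 MB instead of 3840 MB. Without a partition column that matches the common filters, the same query would have to consider every file, and that is where most of the perceived slowness comes from.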

You own the files. You own the catalog. You can add a new query engine without moving a single byte.

If you are building a new data platform now and not seriously looking at Iceberg with an open catalog, I would want to know what you are optimizing for.

If you want the step-by-step setup, I wrote a guide: Reading Iceberg Tables from S3 in Snowflake with Glue as the Catalog.