Selecting the correct warehouse size in Snowflake can be tricky. If a query spills to remote storage, you either need to rewrite the query or increase the warehouse size. But how do you find the problematic queries that are running on an undersized warehouse? Identifying them is an instance of the maximum vector problem, which can be solved with a skyline query. A skyline query finds the points in a vector space that are not dominated by any other point. Here is how a skyline query can be used to identify problematic Snowflake queries.
with query_history as (
select
query_text
, bytes_spilled_to_remote_storage
, decode(warehouse_size, 'X-Small', 1, 'Small', 2, 'Medium', 3, 'Large', 4, 'X-Large', 5, '2X-Large', 6, '3X-Large', 7, '4X-Large', 8) as warehouse_size -- map size names to a sortable rank; unlisted sizes decode to null and are filtered out below
from snowflake.account_usage.query_history
where warehouse_size is not null and bytes_spilled_to_remote_storage > 1000
order by start_time desc
limit 10000
)
select
query_text
, bytes_spilled_to_remote_storage
, warehouse_size
from query_history as o
where not exists ( -- Skyline query to identify worst performing queries
select 1
from query_history as i
where i.warehouse_size <= o.warehouse_size -- i ran on a warehouse no larger than o's
and i.bytes_spilled_to_remote_storage >= o.bytes_spilled_to_remote_storage -- and spilled at least as much
and (i.warehouse_size < o.warehouse_size or i.bytes_spilled_to_remote_storage > o.bytes_spilled_to_remote_storage) -- and is strictly worse on at least one axis
);
| QUERY_TEXT | BYTES_SPILLED_TO_REMOTE_STORAGE (GB) | WAREHOUSE_SIZE |
|---|---|---|
| select * from … | 4 | 1 |
| select * from … | 5 | 4 |
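To make the dominance test in the `not exists` subquery concrete, here is a minimal Python sketch of the same skyline computation over (warehouse_size, bytes_spilled) pairs. The sample data is made up for illustration; the logic mirrors the SQL above.

```python
def dominates(a, b):
    """True if query a dominates query b: a spilled at least as much
    on a warehouse no larger than b's, and is strictly worse on at
    least one axis. Each query is a (warehouse_size, bytes_spilled) tuple."""
    return (a[0] <= b[0] and a[1] >= b[1]
            and (a[0] < b[0] or a[1] > b[1]))

def skyline(queries):
    """Keep only the queries not dominated by any other query --
    the frontier of worst offenders."""
    return [q for q in queries
            if not any(dominates(other, q) for other in queries if other != q)]

sample = [
    (1, 4),  # X-Small warehouse, 4 GB spilled
    (4, 5),  # Large warehouse, 5 GB spilled
    (4, 3),  # Large warehouse, 3 GB spilled -- dominated by (1, 4)
]
print(skyline(sample))  # -> [(1, 4), (4, 5)]
```

Note that (4, 3) is pruned: another query spilled more (4 GB vs 3 GB) on a smaller warehouse, so it tells us nothing new about undersizing.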