Selecting the correct Warehouse size in Snowflake can be tricky. If your queries are spilling to Remote Storage, you either need to rewrite the query or increase the Warehouse size. But how do you identify the problematic queries that are running on an undersized Warehouse? This is a Maximum Vector problem that can be solved with a Skyline query. A Skyline query finds the points in a vector space that are not dominated by any other point. Here is how a Skyline query can be used to identify problematic Snowflake queries.
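
To make the dominance test concrete, here is a minimal, self-contained sketch over a handful of made-up (warehouse_size, gb_spilled) points; the table name, values, and column names are illustrative only:

with points as (
  select * from values (1, 10), (1, 5), (2, 20), (3, 8) as t(warehouse_size, gb_spilled)
)
select *
from points as o
where not exists (
  select 1
  from points as i
  where i.warehouse_size <= o.warehouse_size
    and i.gb_spilled >= o.gb_spilled
    and (i.warehouse_size < o.warehouse_size or i.gb_spilled > o.gb_spilled)
);
-- returns (1, 10) and (2, 20); (1, 5) and (3, 8) are dominated by those points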

Skyline Query

with query_history as (
  select 
    query_text
    , bytes_spilled_to_remote_storage
    -- rank warehouse sizes so they can be compared; extend the list if your account uses larger sizes
    , decode(warehouse_size, 'X-Small', 1, 'Small', 2, 'Medium', 3, 'Large', 4, 'X-Large', 5, '2X-Large', 6, '3X-Large', 7, '4X-Large', 8) as warehouse_size
  from snowflake.account_usage.query_history
  where warehouse_size is not null and bytes_spilled_to_remote_storage > 1000
  order by start_time desc
  limit 10000
)
select 
  query_text 
  , bytes_spilled_to_remote_storage
  , warehouse_size
from query_history as o
where not exists ( -- Skyline query to identify worst performing queries
  select 1
  from query_history as i
  -- another query i dominates o if it ran on an equal or smaller warehouse,
  -- spilled at least as much, and is strictly worse in at least one dimension
  where i.warehouse_size <= o.warehouse_size and i.bytes_spilled_to_remote_storage >= o.bytes_spilled_to_remote_storage
  and (i.warehouse_size < o.warehouse_size or i.bytes_spilled_to_remote_storage > o.bytes_spilled_to_remote_storage)
);
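
Note that bytes_spilled_to_remote_storage comes back in raw bytes, while the sample output below is shown in gigabytes. One way to convert (the divisor and rounding precision are just one reasonable choice) is to swap the spill column in the outer select for round(bytes_spilled_to_remote_storage / power(1024, 3), 2). For example, on a hypothetical spill of 5,368,709,120 bytes:

select round(5368709120 / power(1024, 3), 2) as gb_spilled_to_remote_storage; -- returns 5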

Query Output

QUERY_TEXT      | BYTES_SPILLED_TO_REMOTE_STORAGE (GB) | WAREHOUSE_SIZE
select * from … | 4                                    | 1
select * from … | 5                                    | 4
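
Once a query shows up on the skyline, the two options from the intro apply: rewrite it, or run it on a bigger warehouse. As a sketch, resizing a hypothetical warehouse named analytics_wh would look like this:

alter warehouse analytics_wh set warehouse_size = 'MEDIUM';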

See also

Skyline Query