Data Metric Functions (DMF) allow you to automate data quality validation and monitoring. DMFs allow you to:
Snowflake has the following DMF pre-defined:
FRESHNESS
NULL_COUNT
DUPLICATE_COUNT
UNIQUE_COUNT
ROW_COUNT
You can use try out the pre-defined DMF functions in Snowflake as following:
select snowflake.core.null_count(select email_address from customer_dim);
select snowflake.core.duplicate_count(select email_address from customer_dim);
select snowflake.core.unique_count(select email_address from customer_dim);
The following ALTER
will attach a built-in Data Metric Function NULL_COUNT
to the EMAIL_ADDRESS
column in the CUSTOMER_DIM
table
ALTER TABLE customer_dim ADD DATA METRIC FUNCTION NULL_COUNT ON (EMAIL_ADDRESS);
Tell Snowflake to evaluate the DMF every hour:
ALTER TABLE customer_dim SET DATA_METRIC_SCHEDULE = '1 HOUR';
Inpect the quality metrics:
SELECT * FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
Creating a Data Metric Function is as simple as creating a UDF. The function takes a new TABLE data type with one or more column arguments and returns a single result value.
CREATE DATA METRIC FUNCTION count_of_nulls(
ARG_T TABLE (ARG_C1 STRING, ARG_C2 STRING, ARG_C3 STRING)
)
RETURNS NUMBER
AS
$$
SELECT COUNT_IF(ARG_C1 IS NULL OR ARG_C2 IS NULL OR ARG_C3 IS NULL)
from ARG_T
$$;
All Data Metric Functions are added to your warehouse so other users can reuse them in a consistent manner without having to recreate the same rules over and over.
You can see which metrics are available by running:
SHOW DATA METRIC FUNCTIONS IN <DATABASE_NAME>.<SCHEMA_NAME>;
You attach the custom Data Metric Function to the columns you want to evaluate:
ALTER TABLE CUSTOMER_DIM
ADD DATA METRIC FUNCTION count_of_nulls ON (FIRST_NAME, LAST_NAME, EMAIL_ADDRESS);
Next, you set the interval at which metrics should be calculated. This is configured at the table level and can be as frequent as 1 minute.
ALTER TABLE some_table SET DATA_METRIC_SCHEDULE = '1 HOUR';
You can test a customer DMFs by executing them ad-hoc or include them directly in your data pipelines running on Snowflake for in-line validation. This is how you execute a DMF:
SELECT scratch.saqib_ali.count_of_nulls (
SELECT FIRST_NAME, LAST_NAME, EMAIL_ADDRESS FROM CUSTOMER_DIM
) AS VALUE;