Sessionization is the act of turning event-based data into sessions. It is widely used in several domains, such as: Web Analytics and Trip Analytics. Basically given a “user id” or a “machine id”, the question is: how can we recreate sessions? Or more precisely, how do we choose the boundaries of a session?

A session, for example, can be defined by threshold time such as if the next action is in a range of time greater than T, it defines the start of a new session. Sessionization of event-based data can be easily implemented in Snowflake using the powerful CONDITIONAL_TRUE_EVENT Window function. The following query creates session boundaries using T=50 mins i.e. if more than 50 minutes have elapsed between the events for a given visitor a new Session is defined.

Sample Clickstream data

create or replace table clickstream (post_visid_high number, post_visid_low number, date_time timestamp);
insert into clickstream values (1,0, '2013-09-04T15:49:49');
insert into clickstream values (1,0, '2013-09-04T15:49:58');
insert into clickstream values (1,0, '2013-09-04T15:49:49');
insert into clickstream values (1,0, '2013-09-04T15:49:58');
insert into clickstream values (1,0, '2013-09-04T16:37:11');
insert into clickstream values (1,0, '2013-09-04T16:37:18');
insert into clickstream values (1,0, '2013-09-04T16:39:27');
insert into clickstream values (1,0, '2013-09-04T16:43:57');
insert into clickstream values (1,0, '2013-09-04T20:12:03');
insert into clickstream values (1,0, '2013-09-05T00:00:17');
insert into clickstream values (1,0, '2013-09-05T00:20:35');
insert into clickstream values (2,0, '2013-09-05T00:22:37');
insert into clickstream values (2,0, '2013-09-05T01:19:29');
insert into clickstream values (1,0, '2013-09-05T01:19:39');
insert into clickstream values (1,0, '2013-09-05T01:20:03');
insert into clickstream values (1,0, '2013-09-05T01:20:17');
insert into clickstream values (1,0, '2013-09-05T02:33:42');

select * from clickstream;
POST_VISID_HIGH POST_VISID_LOW DATE_TIME
1 0 2013-09-04 15:49:49.000
1 0 2013-09-04 15:49:58.000
1 0 2013-09-04 15:49:49.000
1 0 2013-09-04 15:49:58.000
1 0 2013-09-04 16:37:11.000
1 0 2013-09-04 16:37:18.000
1 0 2013-09-04 16:39:27.000
1 0 2013-09-04 16:43:57.000
1 0 2013-09-04 20:12:03.000
1 0 2013-09-05 00:00:17.000
1 0 2013-09-05 00:20:35.000
2 0 2013-09-05 00:22:37.000
2 0 2013-09-05 01:19:29.000
1 0 2013-09-05 01:19:39.000
1 0 2013-09-05 01:20:03.000
1 0 2013-09-05 01:20:17.000
1 0 2013-09-05 02:33:42.000

Sessionization using conditional_true_event

select 
  post_visid_high || ':' || post_visid_low as visitor_id
  , date_time
  , datediff(
      minute
      , lag(date_time) over (partition by visitor_id order by date_time asc)
      , date_time
    ) as minutes_since_last_action
  , conditional_true_event(minutes_since_last_action > 50) 
      over (partition by visitor_id order by date_time asc)
    as session_count
from clickstream;

Query output with Sessions

VISITOR_ID DATE_TIME MINUTES_SINCE_LAST_ACTION SESSION_COUNT
2:0 2013-09-05 00:22:37.000   0
2:0 2013-09-05 01:19:29.000 57 1
1:0 2013-09-04 15:49:49.000   0
1:0 2013-09-04 15:49:49.000 0 0
1:0 2013-09-04 15:49:58.000 0 0
1:0 2013-09-04 15:49:58.000 0 0
1:0 2013-09-04 16:37:11.000 48 0
1:0 2013-09-04 16:37:18.000 0 0
1:0 2013-09-04 16:39:27.000 2 0
1:0 2013-09-04 16:43:57.000 4 0
1:0 2013-09-04 20:12:03.000 209 1
1:0 2013-09-05 00:00:17.000 228 2
1:0 2013-09-05 00:20:35.000 20 2
1:0 2013-09-05 01:19:39.000 59 3
1:0 2013-09-05 01:20:03.000 1 3
1:0 2013-09-05 01:20:17.000 0 3
1:0 2013-09-05 02:33:42.000 73 4

image

Edit this page on GitHub