Finding a Gap in Timeseries data and / or Gaps-and-Islands using SQL
A common problem in analyzing a Log data from applications is to create sessions from logged user activity. A user may use the app for couple of hours in day and then come back to it the next day. The log will register the user activity, but the logs will not tell us when the session ended for the day, and a new session started the next day. Typically a sessions are based on m minutes of activity followed by n minutes of no activity. n could be 10 minutes, for example. This is also known as Gaps-and-Islands problem in Computer Science.
Sample log data
username | log_timestmap |
---|---|
Angela | 2020-08-07 20:10:00.000 |
Scott | 2020-08-07 20:10:00.000 |
Bob | 2020-08-07 20:10:00.000 |
Bob | 2020-08-07 20:20:00.000 |
Angela | 2020-08-07 20:20:00.000 |
Scott | 2020-08-07 20:20:00.000 |
Bob | 2020-08-07 20:30:00.000 |
Angela | 2020-08-07 20:30:00.000 |
Scott | 2020-08-07 20:30:00.000 |
Angela | 2020-08-07 20:40:00.000 |
Scott | 2020-08-07 20:40:00.000 |
Bob | 2020-08-07 20:50:00.000 |
Angela | 2020-08-07 20:50:00.000 |
Scott | 2020-08-07 20:50:00.000 |
Bob | 2020-08-07 21:00:00.000 |
Bob | 2020-08-07 21:10:00.000 |
Scott | 2020-08-07 22:00:00.000 |
Scott | 2020-08-07 22:20:00.000 |
Scott | 2020-08-07 22:30:00.000 |
Analysis Goal / Desired Output
From the above App Log, let’s say we need to display
- user
- begin_timestamp (begin of the session)
- end_timestamp (end of the session)
Figure 1. Sessionized User Activity. No activity for 10 minutes or more indicates and end of session. |
We will define the end of a session to be 10+ minutes i.e. If there is no activity for more than 10 minutes, the session should be considered as ended.
Using MATCH_RECOGNIZE to Sessionize
Below we will use SQL’s MATCH_RECOGNIZE to sessionize this data
SELECT username,
session_start
, session_end
FROM app_log
MATCH_RECOGNIZE(
PARTITION BY username
ORDER BY log_timestamp
MEASURES
first_value(log_timestamp) AS session_start,
last_value(log_timestamp) AS session_end
PATTERN (session_start continuous_activity * )
DEFINE
continuous_activity AS log_timestamp <= dateadd('minute', 10, lag(log_timestamp))
)
Query output
Username | session_start | session_end |
---|---|---|
Scott | 2020-08-07 20:10:00.000 | 2020-08-07 20:50:00.000 |
Scott | 2020-08-07 22:00:00.000 | 2020-08-07 22:07:00.000 |
Scott | 2020-08-07 22:20:00.000 | 2020-08-07 22:30:00.000 |
Bob | 2020-08-07 20:10:00.000 | 2020-08-07 20:30:00.000 |
Bob | 2020-08-07 20:50:00.000 | 2020-08-07 21:10:00.000 |
Angela | 2020-08-07 20:10:00.000 | 2020-08-07 20:50:00.000 |
Using CONDITIONAL_TRUE_EVENT to Sessionize
Another way to sessionize this App Log is to use CONDITIONAL_TRUE_EVENT in SQL
select
username
, log_timestamp
, datediff(
minute
, lag(log_timestamp) over (partition by username order by log_timestamp asc)
, log_timestamp
) as minutes_since_last_action
, conditional_true_event(minutes_since_last_action > 10)
over (partition by username order by log_timestamp asc)
as session_count
from app_log;
Query output
usernane | log_timestamp | minutes_since_last_action | session_count |
---|---|---|---|
Scott | 2020-08-07 20:10:00.000 | 0 | |
Scott | 2020-08-07 20:20:00.000 | 10 | 0 |
Scott | 2020-08-07 20:30:00.000 | 10 | 0 |
Scott | 2020-08-07 20:40:00.000 | 10 | 0 |
Scott | 2020-08-07 20:50:00.000 | 10 | 0 |
Scott | 2020-08-07 22:00:00.000 | 70 | 1 |
Scott | 2020-08-07 22:07:00.000 | 7 | 1 |
Scott | 2020-08-07 22:20:00.000 | 13 | 2 |
Scott | 2020-08-07 22:30:00.000 | 10 | 2 |
Bob | 2020-08-07 20:10:00.000 | 0 | |
Bob | 2020-08-07 20:20:00.000 | 10 | 0 |
Bob | 2020-08-07 20:30:00.000 | 10 | 0 |
Bob | 2020-08-07 20:50:00.000 | 20 | 1 |
Bob | 2020-08-07 21:00:00.000 | 10 | 1 |
Bob | 2020-08-07 21:10:00.000 | 10 | 1 |
Angela | 2020-08-07 20:10:00.000 | 0 | |
Angela | 2020-08-07 20:20:00.000 | 10 | 0 |
Angela | 2020-08-07 20:30:00.000 | 10 | 0 |
Angela | 2020-08-07 20:40:00.000 | 10 | 0 |
Angela | 2020-08-07 20:50:00.000 | 10 | 0 |
See also
- A/B Analysis on Streaming Data using MATCH_RECOGNIZE
- Applied overview of MATCH_RECOGNIZE clause
- Conditional True Event
- Correlated pattern definitions and Snowflake’s MATCH_RECOGNIZE
- Finding a Gap in Timeseries data and / or Gaps-and-Islands using SQL
- How to Predict Customer Churn Using SQL Pattern Detection - MATCH_RECOGNIZE
- Association Rule Mining using MATCH_RECOGNIZE
- MATCH_RECOGNIZE
- Sessionization, or solving the Gaps-and-Islands Problem using RANGE_SESSIONIZE in Google BigQuery
- Sessionization, or solving the Gaps-and-Islands Problem using conditional_true_event
- Stock Analysis using MATCH_RECOGNIZE
- MATCH_RECOGNIZE and Data Vault Effectivity Satellite - Part II
- MATCH_RECOGNIZE and Data Vault Effectivity Satellite
- Using conditional_true_event to get a subset of data from timeseries data based on an event
- What’s eating up your Snowflake Virtual Warehouse - Part II featuring MATCH_RECOGNIZE