A common problem in analyzing a Log data from applications is to create sessions from logged user activity. A user may use the app for couple of hours in day and then come back to it the next day. The log will register the user activity, but the logs will not tell us when the session ended for the day, and a new session started the next day. Typically a sessions are based on m minutes of activity followed by n minutes of no activity. n could be 10 minutes, for example. This is also known as Gaps-and-Islands problem in Computer Science.
username | log_timestmap |
---|---|
Angela | 2020-08-07 20:10:00.000 |
Scott | 2020-08-07 20:10:00.000 |
Bob | 2020-08-07 20:10:00.000 |
Bob | 2020-08-07 20:20:00.000 |
Angela | 2020-08-07 20:20:00.000 |
Scott | 2020-08-07 20:20:00.000 |
Bob | 2020-08-07 20:30:00.000 |
Angela | 2020-08-07 20:30:00.000 |
Scott | 2020-08-07 20:30:00.000 |
Angela | 2020-08-07 20:40:00.000 |
Scott | 2020-08-07 20:40:00.000 |
Bob | 2020-08-07 20:50:00.000 |
Angela | 2020-08-07 20:50:00.000 |
Scott | 2020-08-07 20:50:00.000 |
Bob | 2020-08-07 21:00:00.000 |
Bob | 2020-08-07 21:10:00.000 |
Scott | 2020-08-07 22:00:00.000 |
Scott | 2020-08-07 22:20:00.000 |
Scott | 2020-08-07 22:30:00.000 |
From the above App Log, let’s say we need to display
Figure 1. Sessionized User Activity. No activity for 10 minutes or more indicates and end of session. |
We will define the end of a session to be 10+ minutes i.e. If there is no activity for more than 10 minutes, the session should be considered as ended.
Below we will use SQL’s MATCH_RECOGNIZE to sessionize this data
SELECT username,
session_start
, session_end
FROM app_log
MATCH_RECOGNIZE(
PARTITION BY username
ORDER BY log_timestamp
MEASURES
first_value(log_timestamp) AS session_start,
last_value(log_timestamp) AS session_end
PATTERN (session_start continuous_activity * )
DEFINE
continuous_activity AS log_timestamp <= dateadd('minute', 10, lag(log_timestamp))
)
Username | session_start | session_end |
---|---|---|
Scott | 2020-08-07 20:10:00.000 | 2020-08-07 20:50:00.000 |
Scott | 2020-08-07 22:00:00.000 | 2020-08-07 22:07:00.000 |
Scott | 2020-08-07 22:20:00.000 | 2020-08-07 22:30:00.000 |
Bob | 2020-08-07 20:10:00.000 | 2020-08-07 20:30:00.000 |
Bob | 2020-08-07 20:50:00.000 | 2020-08-07 21:10:00.000 |
Angela | 2020-08-07 20:10:00.000 | 2020-08-07 20:50:00.000 |
Another way to sessionize this App Log is to use CONDITIONAL_TRUE_EVENT in SQL
select
username
, log_timestamp
, datediff(
minute
, lag(log_timestamp) over (partition by username order by log_timestamp asc)
, log_timestamp
) as minutes_since_last_action
, conditional_true_event(minutes_since_last_action > 10)
over (partition by username order by log_timestamp asc)
as session_count
from app_log;
usernane | log_timestamp | minutes_since_last_action | session_count |
---|---|---|---|
Scott | 2020-08-07 20:10:00.000 | 0 | |
Scott | 2020-08-07 20:20:00.000 | 10 | 0 |
Scott | 2020-08-07 20:30:00.000 | 10 | 0 |
Scott | 2020-08-07 20:40:00.000 | 10 | 0 |
Scott | 2020-08-07 20:50:00.000 | 10 | 0 |
Scott | 2020-08-07 22:00:00.000 | 70 | 1 |
Scott | 2020-08-07 22:07:00.000 | 7 | 1 |
Scott | 2020-08-07 22:20:00.000 | 13 | 2 |
Scott | 2020-08-07 22:30:00.000 | 10 | 2 |
Bob | 2020-08-07 20:10:00.000 | 0 | |
Bob | 2020-08-07 20:20:00.000 | 10 | 0 |
Bob | 2020-08-07 20:30:00.000 | 10 | 0 |
Bob | 2020-08-07 20:50:00.000 | 20 | 1 |
Bob | 2020-08-07 21:00:00.000 | 10 | 1 |
Bob | 2020-08-07 21:10:00.000 | 10 | 1 |
Angela | 2020-08-07 20:10:00.000 | 0 | |
Angela | 2020-08-07 20:20:00.000 | 10 | 0 |
Angela | 2020-08-07 20:30:00.000 | 10 | 0 |
Angela | 2020-08-07 20:40:00.000 | 10 | 0 |
Angela | 2020-08-07 20:50:00.000 | 10 | 0 |