Logpickr UDF Sessions (Task Mining)
Overview
UDTF that splits a collection of lines into different sessions.
Used in Task Mining, it makes it possible to generate CaseIDs that are initially missing, according to user-defined criteria.
UDF Sessions
This UDF, of tabular type (UDTF), divides a collection of ordered lines into sessions (a session has its own ID and corresponds to a group of lines sharing a common point). To create the sessions, Regex are used to describe the lines belonging to the same group, the lines starting a session, the lines ending a session, and the lines that need to be ignored.
To retrieve information about this UDF directly in ksqlDB, use the following command:
DESCRIBE FUNCTION LOGPICKR_SESSIONS;
UDF Signature:
def logpickrSessions(
    inputLines: util.List[String],
    ignorePattern: String,
    groupSessionPattern: String,
    startSessionPattern: String,
    endSessionPattern: String,
    sessionIdPattern: String,
    isSessionIdHash: Boolean,
    isIgnoreIfNoStart: Boolean,
    isIgnoreIfNoEnd: Boolean
): util.List[Struct]
Here, the input is a collection of rows (where each row has several columns). The rows are first regrouped according to the values of the columns specified by groupSessionPattern: two rows having the same values for those columns end up in the same group. Then, within each group, sessions are computed over consecutive rows: some rows correspond to the start of a session, others to its end, and the rows in between simply belong to the session (a row counts as a start/end of a session depending on whether it matches the startSessionPattern/endSessionPattern defined by the user). Groups are therefore used to separate and reorganize the rows of the input collection: each session is computed only from the rows of a single group and corresponds to a gathering of related events, with limits defined by the user.
Additionally, options let you choose whether the sessionId created for each new session is hashed, and whether sessions that have no row matching the description of a session start, or no row matching the description of a session end, are ignored.
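As an illustration of this logic, the following Scala sketch mimics the grouping and the session delimitation on a few hard-coded lines (a simplified sketch of the behaviour described above, not the UDF's actual implementation; the object and helper names are hypothetical):
import scala.util.matching.Regex

object SessionGroupingSketch {
  // Concatenates the values captured by `pattern`, or returns "" if the line does not match.
  def captureKey(pattern: Regex, line: String): String =
    pattern.unapplySeq(line).map(_.mkString).getOrElse("")

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "2020-06-16T04;1;appli1;Start",
      "2020-06-16T04;2;appli2;Start",
      "2020-06-16T04;1;appli1;event1",
      "2020-06-16T04;1;appli1;End",
      "2020-06-16T04;2;appli2;End")

    val groupPattern = ".*;(.*);.*;.*".r  // regroup rows by the userID column
    val startPattern = ".*;.*;.*;Start".r // rows opening a session
    val endPattern   = ".*;.*;.*;End".r   // rows closing a session

    // 1) Regroup the rows sharing the same captured value(s), keeping their relative order.
    val groups = lines.groupBy(line => captureKey(groupPattern, line))

    // 2) Within each group, a row matching the start pattern opens a session, a row matching
    //    the end pattern closes it, and the rows in between belong to the current session.
    groups.foreach { case (key, rows) =>
      println(s"group $key")
      rows.foreach { row =>
        val role =
          if (startPattern.pattern.matcher(row).matches()) "starts a session"
          else if (endPattern.pattern.matcher(row).matches()) "ends the current session"
          else "belongs to the current session"
        println(s"  $row -> $role")
      }
    }
  }
}
Each group is processed independently, which is why sessions from different users can be interleaved in the input while still being separated in the output.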
The structure used in the function's return value:
STRUCT<SESSION_ID VARCHAR(STRING), LINE VARCHAR(STRING)>
The LINE field corresponds to a line from the initial input collection, and the SESSION_ID field corresponds to the ID of the session associated with that line.
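ksqlDB UDTFs commonly return Kafka Connect Struct values; the following Scala snippet is only a sketch of what one element of the returned util.List[Struct] could look like (the schema definition shown here is an assumption, not the UDF's actual code):
import org.apache.kafka.connect.data.{Schema, SchemaBuilder, Struct}

object SessionStructSketch {
  // Hypothetical schema mirroring the documented STRUCT<SESSION_ID, LINE>.
  val sessionSchema: Schema = SchemaBuilder.struct()
    .field("SESSION_ID", Schema.OPTIONAL_STRING_SCHEMA)
    .field("LINE", Schema.OPTIONAL_STRING_SCHEMA)
    .build()

  def main(args: Array[String]): Unit = {
    // One element of the returned list could look like this.
    val element = new Struct(sessionSchema)
      .put("SESSION_ID", "sessionId1")
      .put("LINE", "2020-06-16T04;1;appli1;Start")
    println(element)
  }
}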
Parameters:
- inputLines: The initial collection of rows.
- ignorePattern: Regex describing the rows to ignore. Rows matching this pattern are not used for session creation and are not returned by the function.
- groupSessionPattern: Regex used to regroup lines having the same values for the specified columns. Sessions are determined within these groups. For instance, for lines with the following format:
timeStamp;userID;targetApp;eventType
and the following pattern:
".*;(.*);.*;(.*)"
the group of a row is determined by concatenating the values of its userID and eventType columns (the columns captured by the parentheses in the Regex).
- startSessionPattern: Regex describing the lines that can be considered as the start of a session.
- endSessionPattern: Regex describing the lines that can be considered as the end of a session.
- sessionIdPattern: Regex indicating which parts of a line are used to create the sessionId. For instance, for lines with the following format:
timeStamp;userID;targetApp;eventType
and the following pattern:
".*;(.*);(.*);.*"
the sessionId is created by concatenating the values of the userID and targetApp columns (the columns captured by the parentheses in the Regex).
- isSessionIdHash: A sessionId is created from the columns specified by the sessionIdPattern parameter. If isSessionIdHash is false, the sessionId is simply the concatenation of the values of those columns; if it is true, the result of this concatenation is hashed to create the sessionId. The hash function used is MD5 (see the sketch below).
- isIgnoreIfNoStart: Boolean indicating whether sessions that have no line matching the startSessionPattern are kept. If true, the corresponding sessions are not returned; if false, they are returned.
- isIgnoreIfNoEnd: Boolean indicating whether sessions that have no line matching the endSessionPattern are kept. If true, the corresponding sessions are not returned; if false, they are returned.
For more information about Regex, follow this link: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
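To make sessionIdPattern and isSessionIdHash more concrete, here is a minimal Scala sketch of how a sessionId could be derived from a session's starting row (an illustration of the behaviour described above, not the UDF's actual code; the object and helper names are hypothetical):
import java.security.MessageDigest

object SessionIdSketch {
  // Concatenates the values captured by sessionIdPattern on the session's starting row.
  def rawSessionId(sessionIdPattern: String, startingRow: String): String =
    sessionIdPattern.r.unapplySeq(startingRow).map(_.mkString).getOrElse("")

  // Hashes the concatenation with MD5 when isSessionIdHash is true.
  def sessionId(raw: String, isSessionIdHash: Boolean): String =
    if (isSessionIdHash)
      MessageDigest.getInstance("MD5").digest(raw.getBytes("UTF-8")).map("%02x".format(_)).mkString
    else raw

  def main(args: Array[String]): Unit = {
    val startingRow = "2020-06-16T04;1;appli1;Start"
    val raw = rawSessionId(".*;(.*);(.*);.*", startingRow)
    println(sessionId(raw, isSessionIdHash = false)) // prints "1appli1" (userID and targetApp concatenated)
    println(sessionId(raw, isSessionIdHash = true))  // prints the MD5 hex digest of "1appli1"
  }
}
Running this sketch prints the raw concatenation first, then its 32-character MD5 hex digest.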
Examples in ksqlDB
To follow these examples, start by running the following command so that queries on a STREAM also take into account data inserted before the STREAM was created or displayed:
SET 'auto.offset.reset'='earliest';
We consider the following collection of rows:
timeStamp | userID | targetApp | eventType |
---|---|---|---|
2020-06-16T04 | 1 | appli1 | Start |
2020-06-16T04 | 1 | appli1 | event1 |
2020-06-16T04 | 1 | appli1 | event2 |
2020-06-16T04 | 2 | appli1 | Start |
2020-06-16T04 | 2 | appli1 | event4 |
2020-06-16T04 | 2 | appli2 | Start |
2020-06-16T04 | 2 | appli3 | event5 |
2020-06-16T04 | 1 | appli1 | event3 |
2020-06-16T04 | 1 | appli1 | ignoreEvent |
2020-06-16T04 | 1 | appli1 | End |
2020-06-16T04 | 1 | appli2 | aloneEvent1 |
2020-06-16T04 | 1 | appli2 | aloneEvent2 |
2020-06-16T04 | 2 | appli2 | event6 |
2020-06-16T04 | 2 | appli2 | End |
2020-06-16T04 | 2 | appli2 | event7 |
2020-06-16T04 | 2 | appli3 | End |
The first STREAM to create in ksqlDB is:
CREATE STREAM s1 (
lines ARRAY<VARCHAR>
) WITH (
kafka_topic = 's1',
partitions = 1,
value_format = 'avro'
);
where each row of the collection is an element of the lines array.
It is then possible to insert data:
INSERT INTO s1 (lines) VALUES (ARRAY[
'2020-06-16T04;1;appli1;Start',
'2020-06-16T04;1;appli1;event1',
'2020-06-16T04;1;appli1;event2',
'2020-06-16T04;2;appli1;Start',
'2020-06-16T04;2;appli1;event4',
'2020-06-16T04;2;appli2;Start',
'2020-06-16T04;2;appli3;event5',
'2020-06-16T04;1;appli1;event3',
'2020-06-16T04;1;appli1;ignoreEvent',
'2020-06-16T04;1;appli1;End',
'2020-06-16T04;1;appli2;aloneEvent1',
'2020-06-16T04;1;appli2;aloneEvent2',
'2020-06-16T04;2;appli2;event6',
'2020-06-16T04;2;appli2;End',
'2020-06-16T04;2;appli2;event7',
'2020-06-16T04;2;appli3;End']);
And everything is ready to call the UDF.
In these examples, the rows matching:
- ignorePattern = '.*;.*;.*;ignoreEvent' are ignored
- startSessionPattern = '.*;.*;.*;Start' correspond to the start of a new session
- endSessionPattern = '.*;.*;.*;End' correspond to the end of the current session
Furthermore, the rows follow the format: timeStamp;userID;targetApp;eventType
The group pattern used in the examples is:
- groupSessionPattern = '.*;(.*);.*;.*' meaning that rows are divided into groups according to the value of the userID column
And the pattern used to create the sessionId of a session's starting row is:
- sessionIdPattern = '.*;(.*);(.*);.*' meaning that for a session's starting row the sessionId is calculated from the values of the userID and targetApp columns
Moreover, depending on the isSessionIdHash value, the sessionId1/sessionId2/sessionId3/sessionId4 placeholders below correspond either to the concatenation of the userID and targetApp column values, or to the MD5 hash of that concatenation (those columns being the ones captured by sessionIdPattern = '.*;(.*);(.*);.*').
Numerous combinations of parameters for the logpickr_sessions function are possible:
- isIgnoreIfNoStart = true and isIgnoreIfNoEnd = true
CREATE STREAM s2 AS SELECT
logpickr_sessions(lines, '.*;.*;.*;ignoreEvent', '.*;(.*);.*;.*', '.*;.*;.*;Start', '.*;.*;.*;End', '.*;(.*);(.*);.*', true, true, true) AS sessions
FROM s1 EMIT CHANGES;
CREATE STREAM s3 AS SELECT
sessions->session_id AS session_id,
sessions->line AS session_line
FROM s2 EMIT CHANGES;
SELECT session_id, session_line FROM s3 EMIT CHANGES;
The expected result is then:
session_id | session_line |
---|---|
sessionId1 | 2020-06-16T04;1;appli1;Start |
sessionId1 | 2020-06-16T04;1;appli1;event1 |
sessionId1 | 2020-06-16T04;1;appli1;event2 |
sessionId1 | 2020-06-16T04;1;appli1;event3 |
sessionId1 | 2020-06-16T04;1;appli1;End |
sessionId2 | 2020-06-16T04;2;appli2;Start |
sessionId2 | 2020-06-16T04;2;appli3;event5 |
sessionId2 | 2020-06-16T04;2;appli2;event6 |
sessionId2 | 2020-06-16T04;2;appli2;End |
- isIgnoreIfNoStart = false and isIgnoreIfNoEnd = true
CREATE STREAM s4 AS SELECT
logpickr_sessions(lines, '.*;.*;.*;ignoreEvent', '.*;(.*);.*;.*', '.*;.*;.*;Start', '.*;.*;.*;End', '.*;(.*);(.*);.*', true, false, true) AS sessions
FROM s1 EMIT CHANGES;
CREATE STREAM s5 AS SELECT
sessions->session_id AS session_id,
sessions->line AS session_line
FROM s4 EMIT CHANGES;
SELECT session_id, session_line FROM s5 EMIT CHANGES;
The expected result is then:
session_id | session_line |
---|---|
sessionId1 | 2020-06-16T04;1;appli1;Start |
sessionId1 | 2020-06-16T04;1;appli1;event1 |
sessionId1 | 2020-06-16T04;1;appli1;event2 |
sessionId1 | 2020-06-16T04;1;appli1;event3 |
sessionId1 | 2020-06-16T04;1;appli1;End |
sessionId2 | 2020-06-16T04;2;appli2;Start |
sessionId2 | 2020-06-16T04;2;appli3;event5 |
sessionId2 | 2020-06-16T04;2;appli2;event6 |
sessionId2 | 2020-06-16T04;2;appli2;End |
sessionId2 | 2020-06-16T04;2;appli2;event7 |
sessionId2 | 2020-06-16T04;2;appli3;End |
- isIgnoreIfNoStart = true and isIgnoreIfNoEnd = false
CREATE STREAM s6 AS SELECT
logpickr_sessions(lines, '.*;.*;.*;ignoreEvent', '.*;(.*);.*;.*', '.*;.*;.*;Start', '.*;.*;.*;End', '.*;(.*);(.*);.*', true, true, false) AS sessions
FROM s1 EMIT CHANGES;
CREATE STREAM s7 AS SELECT
sessions->session_id AS session_id,
sessions->line AS session_line
FROM s6 EMIT CHANGES;
SELECT session_id, session_line FROM s7 EMIT CHANGES;
The expected result is then:
session_id | session_line |
---|---|
sessionId1 | 2020-06-16T04;1;appli1;Start |
sessionId1 | 2020-06-16T04;1;appli1;event1 |
sessionId1 | 2020-06-16T04;1;appli1;event2 |
sessionId1 | 2020-06-16T04;1;appli1;event3 |
sessionId1 | 2020-06-16T04;1;appli1;End |
sessionId2 | 2020-06-16T04;2;appli1;Start |
sessionId2 | 2020-06-16T04;2;appli1;event4 |
sessionId3 | 2020-06-16T04;2;appli2;Start |
sessionId3 | 2020-06-16T04;2;appli3;event5 |
sessionId3 | 2020-06-16T04;2;appli2;event6 |
sessionId3 | 2020-06-16T04;2;appli2;End |
- isIgnoreIfNoStart = false and isIgnoreIfNoEnd = false
CREATE STREAM s8 AS SELECT
logpickr_sessions(lines, '.*;.*;.*;ignoreEvent', '.*;(.*);.*;.*', '.*;.*;.*;Start', '.*;.*;.*;End', '.*;(.*);(.*);.*', true, false, false) AS sessions
FROM s1 EMIT CHANGES;
CREATE STREAM s9 AS SELECT
sessions->session_id AS session_id,
sessions->line AS session_line
FROM s8 EMIT CHANGES;
SELECT session_id, session_line FROM s9 EMIT CHANGES;
The expected result is then:
session_id | session_line |
---|---|
sessionId1 | 2020-06-16T04;1;appli1;Start |
sessionId1 | 2020-06-16T04;1;appli1;event1 |
sessionId1 | 2020-06-16T04;1;appli1;event2 |
sessionId1 | 2020-06-16T04;1;appli1;event3 |
sessionId1 | 2020-06-16T04;1;appli1;End |
sessionId2 | 2020-06-16T04;1;appli2;aloneEvent1 |
sessionId2 | 2020-06-16T04;1;appli2;aloneEvent2 |
sessionId3 | 2020-06-16T04;2;appli1;Start |
sessionId3 | 2020-06-16T04;2;appli1;event4 |
sessionId4 | 2020-06-16T04;2;appli2;Start |
sessionId4 | 2020-06-16T04;2;appli3;event5 |
sessionId4 | 2020-06-16T04;2;appli2;event6 |
sessionId4 | 2020-06-16T04;2;appli2;End |
sessionId4 | 2020-06-16T04;2;appli2;event7 |
sessionId4 | 2020-06-16T04;2;appli3;End |
Further Information
When the same row matches both startSessionPattern and endSessionPattern, the row is considered as the start of a new session and ends the previous session. The new session is then kept regardless of the isIgnoreIfNoEnd value.
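As a hedged illustration of this rule, the sketch below uses hypothetical patterns under which a single row matches both the start and the end description:
object StartAndEndSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical patterns: a "Restart" event matches both descriptions.
    val startPattern = ".*;.*;.*;(Start|Restart)".r
    val endPattern   = ".*;.*;.*;(End|Restart)".r

    val row = "2020-06-16T04;1;appli1;Restart"
    val isStart = startPattern.pattern.matcher(row).matches()
    val isEnd   = endPattern.pattern.matcher(row).matches()

    // A row matching both patterns closes the current session and is kept
    // as the starting row of a new session.
    if (isStart && isEnd)
      println(s"$row ends the previous session and starts a new one")
  }
}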