To prevent spam users, you can only post on this forum after registration, which is by invitation. If you want to post on the forum, please send me a mail (h DOT m DOT w DOT verbeek AT tue DOT nl) and I'll send you an invitation in return for an account.
Clustering traces based on edit distance
Hi everyone,
Is there a ProM plugin that can cluster traces based on edit distance (also called Levenshtein distance)? I see that there are some papers [1] mentioning the approach, but I'm unable to find a plugin for the same.
Please tell me if you know any other plugin for clustering that is uses similar approach, that will also work.
I'm working with data that come from place where there is no underlying normative process model. Basically, its kids using educational software, and so things differ from teacher to teacher, and also from student to student so event log has quite a lot of variety. And so process models I get are always complicated. I'm hoping that clustering traces beforehand might help.
Thanks,
Nirmal
[1] Context Aware Trace Clustering: Towards Improving Process Mining Results
Is there a ProM plugin that can cluster traces based on edit distance (also called Levenshtein distance)? I see that there are some papers [1] mentioning the approach, but I'm unable to find a plugin for the same.
Please tell me if you know any other plugin for clustering that is uses similar approach, that will also work.
I'm working with data that come from place where there is no underlying normative process model. Basically, its kids using educational software, and so things differ from teacher to teacher, and also from student to student so event log has quite a lot of variety. And so process models I get are always complicated. I'm hoping that clustering traces beforehand might help.
Thanks,
Nirmal
[1] Context Aware Trace Clustering: Towards Improving Process Mining Results
Comments

Hi Folks,
Right now I've done a little arrangement to do clustering outside ProM. I'm trying out hierarchical clustering to cluster traces. I use R, and R's hclust() function that does hierarchical clustering requires user to provide a distance matrix that contains pairwise distances between observations. So I just plug in edit distance matrix of all traces and I am getting very interesting results. I haven't optimized code, so it takes a bit of time, but results are looking good to me.
One problem with using centroid based algorithm (like KMeans) for trace clustering is deciding up on centroid, which is a bit difficult. Earlier I tried to make centroid as trace of length that occurs most in the cluster and events in trace were most common events in each position. But I think hierarchical clustering seems more sound to use, without any kind of tricks involved.
 Nirmal

Hi Nirmal,
Sorry, I do not know the answer. I am contacting one of my colleague to help you. Please be patient.
Cong

Dear Nirmal,
If I understand correctly, your question is twofold: you want to provide a certain similarity/distance measure yourself (e.g. Levenshtein distance), and would like to obtain the number of clusters automatically (i.e. not having to select the k for kMeans).
At the moment, the easiest thing to do is to run the plugin called "Cluster traces using Markov clustering". The default implementation uses the cosine similarity between profile vectors created from the traces using a selected set of 'perspectives'. The perspectives are the properties used to cluster upon, such as attribute values or controlflow constructs. Then, the Markov cluster algorithm (MCL) is used to cluster the traces. MCL is a clustering algorithm that does not take the number of clusters as input.
The MCL clustering technique uses two parameters, expansion and inflation, which both affect clustering granularity. When a low level (finegrained) clustering is of interest, expansion can be decreased and inflation increased, and viceversa for when only highlevel clustering is required.
The technique used in the plugin is described in our paper titled "Discovering Deviating Cases and Process Variants Using Trace Clustering".
There's also some other topics on this forum talking about this technique here, here, and here.
Would you prefer to use another similarity measure, you'd need to run the plugin (and ProM) from code and implement the similarity measure. The code is prepared for this (interfaced).
Let me know whether this plugin suits your needs and if you need more help with the custom implementation.
For the record, the TraceClustering package also contains the code for the plugin called "Cluster traces over time using Markov clustering", which takes into account the changes over time in the similarity matrix, in order to let the user detect concept drift (see our paper titled "Detecting Changes in Process Behavior Using Comparative Case Clustering").
Kind regards,
Bart
Bart Hompes  Eindhoven University of Technology 
Thanks a lot Bart! I read first 4 sections of your paper, and changed parameters of clustering algorithm (I decreased expansion to 1 for start) and I'm seeing all the different variants now!
I'll try and make more use of the MCL plugin, and see how results turn out. Hopefully, I'll be able to post some updates here later.
 Nirmal
Howdy, Stranger!
Categories
 1.5K All Categories
 45 Announcements / News
 214 Process Mining
 6  BPI Challenge 2020
 9  BPI Challenge 2019
 24  BPI Challenge 2018
 27  BPI Challenge 2017
 8  BPI Challenge 2016
 65 Research
 961 ProM 6
 371  Usage
 284  Development
 8 RapidProM
 1  Usage
 6  Development
 54 ProM5
 19  Usage
 185 Event Logs
 30  ProMimport
 75  XESame