Performance with large logs
Hi all,
after having played a little with test log files, scoped to 1-2 traces and around 100 events (works perfectly, nice tool), I am now trying to use a real log of mine.
The CSV file is 108 MB, containing around 1 million events, and it took 6 hours 25 minutes to convert it into a 231 MB XES file using XESame. As this operation is not performed regularly, this was not really an issue; I just let it run during the night.
The problems start when using ProM. Importing the file takes around 2 minutes, and applying the Alpha algorithm worked in about 5 minutes.
But trying to visualize the log (summary) is endless (I killed the process after some time).
So my question is: do any of you have experience with large logs, and how did you manage them? Split the log (which is not really good, as the underlying process holds some very long-running events that could be spread across the entire file)? Is there a way in ProM to speed up the process (ignore certain steps, ...), or to get a summary of the log in another way?
thanks
Comments
-
Hi gSeb,
The first thing you might want to try is to increase the amount of memory ProM can use. If you start ProM from the prom6.bat file, you can set the amount of memory it is allowed to use there. There might already be an option like -Xmx1000M or -Xmx1G; try to increase this value. If this option is not in the prom6.bat file already, add it after the 'java' command.
The following blog post mentions some 'ideal' values for these options: http://www.linuxquestions.org/questions/debian-26/how-to-allocate-memory-to-java-763703/#post3729440
More information about these (and other) options can be found at the following Java page: http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html#PerformanceTuning
Furthermore, if you are running a 64bit operating system, make sure that ProM is run using a 64-bit version of Java.
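Putting this together, the launch line in the batch file could for example look like this (the jar name and heap size are just an illustration; they depend on your ProM version and machine):
java -ea -Xmx2G -classpath ProM6.jar org.processmining.contexts.uitopia.UI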
Recently, Michael Westergaard made a quick analysis of the ProM memory usage which might also help you to increase performance:
http://westergaard.eu/2011/05/how-much-memory-is-needed-to-store-a-log/
Let us know what worked!
Joos Buijs
Senior Data Scientist and process mining expert at APG (Dutch pension fund executor).
Previously Assistant Professor in Process Mining at Eindhoven University of Technology
-
Minor addition for other people who are trying to convert large text files to XES/MXML:
XESame performs significantly faster when run on a 'real' database source.
I have just created an event log with 100,000 events (spread over 5,000+ traces) with 10+ attributes per trace from an Oracle database in just over 5 minutes.
Therefore it might be a good idea to import text/CSV/Excel files into a database (MS Access, Oracle, MySQL, ...) and then run XESame on that database. Adding the correct indexes (on columns that appear in your where clauses) also helps to improve performance; a small sketch follows below.
Just thought that I should share this tip.
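For illustration, a minimal JDBC sketch of the index part (the database name, table, and column names are purely illustrative, they are not part of XESame; adjust the connection string and credentials to your own database):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddEventLogIndexes {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for a hypothetical MySQL database.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/eventdb", "user", "password");
             Statement st = con.createStatement()) {
            // Index the columns that appear in the where clauses of the
            // XESame mapping, e.g. the case identifier and the event timestamp.
            st.executeUpdate("CREATE INDEX idx_events_case ON events (case_id)");
            st.executeUpdate("CREATE INDEX idx_events_time ON events (event_time)");
        }
    }
}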
Joos Buijs
Senior Data Scientist and process mining expert at APG (Dutch pension fund executor).
Previously Assistant Professor in Process Mining at Eindhoven University of Technology
-
Hi Joos,
thanks for the answer; here is some feedback.
I have set up a clean environment from scratch, using a Xeon 3.2 GHz server with 2 GB RAM.
As a reminder, my log file contains 1,092,971 events over 141,381 traces. The generated XES file is 237.4 MB.
The initial loading of ProM 6 takes 4-5 seconds, and the JVM (Java virtual machine) uses 48 MB.
Importing the log took around 2 minutes (3 tests: 2:05, 2:04 & 2:10), with the JVM growing to 766 MB.
When working with ProM, memory consumption varied from 750 MB to 1.2 GB. Interestingly enough, when staying idle for quite a long time (15+ minutes), the Java garbage collector must be doing its work, as consumption goes down.
One big issue is that, after having imported the log, closing ProM took 3 minutes. When restarting ProM afterwards, it tried to read the persisted file, which took 9 minutes. So in fact it was faster to kill the JVM and re-import the log than to wait for it to be reloaded cleanly from the previous execution.
From the imported log, clicking on visualize took 2 minutes 45 seconds. Loading the summary took 6 minutes, and the explorer 1 minute. By contrast, the loader could be opened in 1 second.
As a test, I ran the Alpha miner, which (to my surprise) produced a result after 40 minutes.
In general, visualization takes quite a long time (which matches Westergaard's analysis).
These values were quite constant throughout the day:
- load ProM: 5 sec
- import log: 2 min
- visualize log: from 2 min 45 sec to 6 min
- memory consumption: 750 MB to 1.2 GB
So, it is possible (meaning, it can be done) to work with such a large log, but in practice it is not really feasible, as every operation takes too much time.
However, visualizing the log was very useful: it showed me that the log could be cleaned/aggregated/improved a lot, which should reduce its size and complexity.
Once that is done, I will re-post the performance results. I have the feeling that the performance drop could be exponential in the size of the log rather than linear, but that would be an interesting thing to verify.
It would be nice to determine the maximum log size that is still reasonable to work with in ProM.
--Sebastien
-
Hi Sebastien,
Thank you for your extensive post!!!
And good luck analyzing your 1 million events
Joos Buijs
Senior Data Scientist and process mining expert at APG (Dutch pension fund executor).
Previously Assistant Professor in Process Mining at Eindhoven University of Technology
-
Hi,
some news: after having cleaned the logs and using XESame directly to query the database, I was able to bring the export time down to 3 hours 31 minutes. Which is still too long...
Mining 1 million records is a bit too ambitious, so I have reduced the scope to 3 months of logs (instead of 2 years). This reduces the size to around 10% of the initial amount (100k events, 13k traces, but still a 41 MB file), and such a size is manageable. It takes 15 minutes to export it to XES format, and the operations in ProM take less than 1 minute. Moreover, it allows me to split the log into many 3-month slices, which will be useful to validate the results.
--Sebastien
-
Hi Sebastien,
Although time-slicing might sound like a good option, please take into account the effect this will have on process discovery algorithms. The chance that you capture the complete 'process' for a single case becomes smaller, e.g. the head or tail of a case might be chopped off. This will influence the results of the process discovery algorithms.
If your focus is more on the 'dotted chart' type of analysis of those 3 months then this option sounds good.
An alternative would be to make chunks of 100,000 cases or so; a rough sketch of how this could be done is shown below.
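A rough sketch of such chunking with the OpenXES library (the XES library underlying ProM); treat it as an outline under the assumption that the whole log fits in memory, with placeholder file names, and double-check the method names against the OpenXES javadoc:

import java.io.File;
import java.io.FileOutputStream;

import org.deckfour.xes.factory.XFactory;
import org.deckfour.xes.factory.XFactoryRegistry;
import org.deckfour.xes.in.XesXmlParser;
import org.deckfour.xes.model.XLog;
import org.deckfour.xes.model.XTrace;
import org.deckfour.xes.out.XesXmlSerializer;

public class ChunkLog {
    public static void main(String[] args) throws Exception {
        // Parse the full log; the parser returns a list of logs, take the first.
        XLog log = new XesXmlParser().parse(new File("big.xes")).get(0);
        XFactory factory = XFactoryRegistry.instance().currentDefault();
        int chunkSize = 100000; // cases per chunk, as suggested above

        for (int i = 0; i * chunkSize < log.size(); i++) {
            // Start a new log that reuses the log-level attributes.
            XLog chunk = factory.createLog(log.getAttributes());
            // An XLog is a List<XTrace>, so it can be sliced directly.
            for (XTrace trace : log.subList(i * chunkSize,
                    Math.min((i + 1) * chunkSize, log.size()))) {
                chunk.add(trace);
            }
            try (FileOutputStream out = new FileOutputStream("chunk-" + i + ".xes")) {
                new XesXmlSerializer().serialize(chunk, out);
            }
        }
    }
}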
I'm also curious what you mean exactly by "using directly XESame to query the database". Do you mean that XESame now directly queries the data source, without intermediate files?
Joos Buijs
Senior Data Scientist and process mining expert at APG (Dutch pension fund executor).
Previously Assistant Professor in Process Mining at Eindhoven University of Technology
-
Hi Joos,
yes, as you suggested, I used XESame to query my SQL database directly, without intermediate files.
However, I finally implemented my own stored procedure that queries and filters the data exactly as I want before saving it as an XES file. With that solution, performance is again a little better.
--Sebastien
-
Hi Sebastien,
Regarding the performance of converting large CSV logs to XES, you may want to try our tool Nitro, which has been built specifically for this task: http://fluxicon.com/nitro/
In my experience, Nitro will convert a log around the size you have reported in a few minutes, so that may cut down the waiting time significantly.
Best,
Christian
-
Hi all,
I modified the ProM62.bat as described by Joos to speed up the computation of large logs (in this case the public log from the Dutch hospital). But when I use that batch version and run the transition system miner, the visualization fails. The transition system itself seems to be generated correctly (at least it doesn't fail during computation). If I switch back to the original batch file, everything works fine again.
I tried some parameter variations, like 2G and 512m or 4G and 1024m, as well as forcing 64-bit mode with the -d64 flag:
java -ea -Xmx1G -XX:MaxPermSize=512m -classpath ProM62.jar org.processmining.contexts.uitopia.UI
java -d64 -ea -Xmx4G -XX:MaxPermSize=1024m -classpath ProM62.jar org.processmining.contexts.uitopia.UI
Every variation produces the same issue.
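For reference, a quick way to check which JVM actually handles these flags (generic Java advice, not specific to ProM):
java -version
If a 64-bit VM is on the path, the output mentions a '64-Bit Server VM'; a 32-bit JVM on Windows usually refuses a 4 GB heap outright.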
Apart from that, I wonder whether ProM has support for multi-core processors...? I know the paper from Wil about distributed genetic mining, but maybe scaling ProM with multi-core support would be worth a try. Are there any provisions for that?
Best regards,
Marian