
event log classification

ajay_devils
edited June 2011 in Event Logs
I am a bachelor-level student doing my final-year project. My project title is 'event log classification'. What I wish to do is this: I would be given, say, 10,000 short one-line texts, and I want to classify them, or let's say cluster them, in the best possible way. For example, if 'jack logs in' and 'marshall logs in' are two texts, then they should fall into the same class, and so on. (I think classifying generic logs of any type won't be a problem after this one.)
What I have thought of is representing each text by some equivalent hash value and clustering the texts according to that value. Each text can be passed over just once, so at the moment I need a way to develop an algorithm that represents a text with a hash value.
Can anybody do me a favor? I just need some guidelines. If you didn't understand my problem, I can express it in more detail. Thanks.

Comments

  • Dear Ajay,

    I would suggest the use of classic vector-based statistical clustering approaches. You need to transform your text/input into a vector representation using some features (e.g., bag of words, n-grams, etc.) and use standard distance/similarity measures (e.g., Euclidean distance, Mahalanobis distance, etc.) for clustering.
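    To make the vector idea concrete, here is a minimal bag-of-words sketch in Python; the whitespace tokenizer and the Euclidean choice are illustrative assumptions, not a prescribed setup:

```python
from collections import Counter
from math import sqrt

def bag_of_words(text):
    # Lowercase and split on whitespace; a real tokenizer would do more.
    return Counter(text.lower().split())

def euclidean(a, b):
    # Distance over the union of both vocabularies; missing words count as 0.
    vocab = set(a) | set(b)
    return sqrt(sum((a[w] - b[w]) ** 2 for w in vocab))

v1 = bag_of_words("jack logs in")
v2 = bag_of_words("marshall logs in")
v3 = bag_of_words("password authentication failed")
# v1 and v2 share 'logs' and 'in', so they end up closer to each other than to v3.
```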

    If all of your input is similar to what you have mentioned, I would recommend that you first preprocess the data with some natural language processing such as POS (part-of-speech) tagging and consider only verbs etc. as features. For the above example, based on how you expect the two sentences to be in the same cluster, I infer that the names of the persons (Jack/Marshall, which are nouns) are irrelevant. Then using verbs such as "logs" would be an appropriate feature space to consider.
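    A toy sketch of that verb-only feature space; the hand-made lexicon below is purely illustrative and stands in for a real POS tagger, which a library such as NLTK would provide:

```python
# Toy POS lexicon; an assumption for illustration -- a real tagger replaces this.
POS = {"jack": "NOUN", "marshall": "NOUN", "logs": "VERB", "in": "PREP"}

def verb_features(text):
    # Keep only the tokens tagged as verbs.
    return [w for w in text.lower().split() if POS.get(w) == "VERB"]

# Both example lines reduce to the same feature list: ['logs'].
```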

    Remember that clustering is a subjective notion, and it largely depends on the context of analysis. Based on the focus/context of the analysis, an appropriate feature space, distance/similarity metric, and clustering algorithm need to be chosen. There are no rules of thumb as such that can give you exact clusters of your liking.

    You can have a look at the paper "Context Aware Trace Clustering: Towards Improving Process Mining Results"  (R. P. J.C. Bose and W. M. P. van der Aalst, SIAM SDM, 2009, 401-412) for further details.

    Hope this helps.

    JC


  • Thanks JC for such a wonderful response.
    Well, it's a sorry situation here, as I don't have much time for research, but I promise I can learn. At the moment, what I need is just a little motivation from guys like you, and some help.

    About the project, as far as I have thought, the first steps are:
    - some logs would be passed in, and the frequent words identified
    - cluster candidates would be created by combining the fixed attributes (for example, if the line is "Password authentication for john accepted", and the words 'Password', 'authentication', 'for', and 'accepted' are frequent, then the candidate is 'Password authentication for * accepted')
    - candidates would be mapped to some numeric value, so that for real-time streaming data each text can be passed just once and placed into a suitable cluster
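    The candidate idea in the steps above could be sketched like this; the frequency threshold of 2 is an arbitrary assumption:

```python
from collections import Counter

def make_template(line, freq, min_count=2):
    # Frequent words stay as fixed attributes; rare words become a '*' wildcard.
    return " ".join(w if freq[w] >= min_count else "*" for w in line.split())

lines = [
    "Password authentication for john accepted",
    "Password authentication for mary accepted",
]
freq = Counter(w for line in lines for w in line.split())
candidates = [make_template(line, freq) for line in lines]
# Both lines collapse to the same candidate: 'Password authentication for * accepted'.
```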

    The problems are:
    - nothing has been started yet, but I promise it will be soon.
    - what if new incoming data represents a new cluster (and it's not an outlier either)? I hope I'll think about that as well.
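    One common way to handle both points at once is single-pass (streaming) clustering with a distance threshold: each incoming line joins the nearest existing cluster, or opens a new one when nothing is close enough. A sketch, where the Jaccard distance and the 0.5 threshold are assumptions:

```python
def jaccard_distance(a, b):
    # 1 - |intersection| / |union| over the word sets of two lines.
    a, b = set(a.split()), set(b.split())
    return 1 - len(a & b) / len(a | b)

def stream_cluster(lines, threshold=0.5):
    # Each cluster is (representative line, member lines); lines are seen once.
    clusters = []
    for line in lines:
        best = min(clusters, key=lambda c: jaccard_distance(c[0], line), default=None)
        if best is not None and jaccard_distance(best[0], line) <= threshold:
            best[1].append(line)
        else:
            clusters.append((line, [line]))
    return clusters

clusters = stream_cluster(
    ["jack logs in", "marshall logs in", "password authentication failed"]
)
# The two login lines share a cluster; the third line opens a new one.
```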

    I hope you guys can help me with how to actually start and get going. My principal aim is to classify or cluster general text files, or at least log files.

    Thanks.

  • Dear Ajay,

    Considering the examples that you have quoted, I think that the bag-of-words approach, with features defined after pruning some words by POS tagging, seems to be a good starting point (e.g., you can ignore nouns (names of persons, etc.) and some prepositions (e.g., for, of, on, into, etc.)). You can also define some stop words (words that you can safely ignore) based on the domain. Apart from these, you would need to apply stemming to the words to retrieve the root words as features (e.g., the root word for authentication, authenticating, and authenticated is authenticate). There are good open-source tools such as Lucene that you may want to use for pre-processing.
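    As a toy illustration of the stop-word and stemming steps (the crude suffix stripping below is only a stand-in for a real stemmer such as Porter's, or for Lucene's analyzers, and the stop-word set is an assumption):

```python
STOPWORDS = {"for", "of", "on", "into"}

def crude_stem(word):
    # Naive suffix stripping; real stemmers are far more careful than this.
    for suffix in ("ation", "ating", "ated", "ate", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def features(text):
    # Drop stop words, then reduce the remaining words to their stems.
    return [crude_stem(w) for w in text.lower().split() if w not in STOPWORDS]

# 'authentication', 'authenticating', and 'authenticated' all share one stem here.
```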

    Classic metrics such as "within-cluster distance" and "between-cluster distance" may help you decide on the number of clusters to form.
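    A sketch of those two metrics over word-set distances; the Jaccard distance and the tiny example clusters are illustrative assumptions:

```python
from statistics import mean

def dist(a, b):
    # Word-set Jaccard distance between two log lines.
    a, b = set(a.split()), set(b.split())
    return 1 - len(a & b) / len(a | b)

def within(cluster):
    # Average pairwise distance inside one cluster (0 for singletons).
    pairs = [dist(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
    return mean(pairs) if pairs else 0.0

def between(c1, c2):
    # Average distance across two clusters.
    return mean(dist(a, b) for a in c1 for b in c2)

logins = ["jack logs in", "marshall logs in"]
errors = ["disk error on node one", "disk error on node two"]
# A good clustering keeps within() small and between() large.
```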

    JC
