
hashing logs

abinash
edited June 2011 in Event Logs
hey guys,
 
I want to cluster a group of texts, or just log messages. I want to assign a hash value to each string in such a way that two texts in the same cluster have more similar values than texts in different clusters. Suppose there are many texts like:
' user1 logged in'
'hari logged in vsat1'
.........................
..........................
'hari logged in isat'
'user2 logged in'

Well, what I need to do is assign a value to every text such that 'user1 logged in' and 'user2 logged in' have similar values, and 'hari logged in vsat1' and 'hari logged in isat' have similar values.

There could be thousands of logs or texts, and it is almost impossible to compare each one with every other. Do you have any ideas, given that the texts may be context-free and of any type? I don't want to go towards NLP. What I want is just to devise an algorithm that can give a suitable value to each text.
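One family of hashes with the property asked for above is a similarity-preserving fingerprint such as SimHash. The thread itself does not name this technique, so the following is only a minimal sketch under assumptions: messages are split on whitespace, every token has equal weight, and the names `simhash` and `hamming` are illustrative.

```python
import hashlib


def simhash(text, bits=64):
    """Return a SimHash-style fingerprint of a message.

    Assumption: tokens are whitespace-separated words, lower-cased,
    each with weight 1. `bits` should not exceed 64, because each
    token hash below is only 8 bytes long.
    """
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Messages that share most of their tokens tend to get fingerprints a small Hamming distance apart, so a threshold on `hamming(simhash(x), simhash(y))` could stand in for "similar value". With messages of only three or four words the signal is noisy, and any threshold would have to be tuned on real logs.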

Comments

  • Hi Abinash,

    I'm not sure, but do JC's replies in your other topic answer your question?
    E.g. using existing linguistic clustering techniques?
    Joos Buijs

    Senior Data Scientist and process mining expert at APG (Dutch pension fund executor).
    Previously Assistant Professor in Process Mining at Eindhoven University of Technology
  • hey Jbuijs,

    Actually, the other topic was about my idea for another approach to solving the problem. Ajay and I are members of the group working on this project. I want to classify the logs according to their messages only, and I want to classify streaming logs in only one pass or iteration. This is just my thought, and if it can be done, it would be the simplest way possible.
    The idea is to map the strings to hash values such that two interrelated strings/logs have similar values. I would take about 1000 logs to find the patterns this way, and then I hope classifying the streaming logs would be easier to accomplish in one pass.
    The other approach could involve syntactic analysis or some contextual basis, which would include some form of NLP. So I wondered whether I could do it in a simpler and more efficient way. The efficiency of the classification depends on the efficiency of the hash function. So I hope you can suggest some techniques for mapping strings (general, unstructured, context-free) by effective hashing.
    If the idea seems somewhat immature, please help me. I am just a bachelor-level student without much experience. But I promise I can learn and try hard.
    Thank you!
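The one-pass idea in the comment above can be illustrated without any NLP. The sketch below swaps the hash for plain position-wise template matching, which is not what the poster proposes, only one simple way to group a stream in a single pass. Assumptions: messages are split on whitespace, only messages with the same token count are compared, and `OnePassGrouper`, `WILDCARD`, and the 0.5 threshold are hypothetical choices.

```python
WILDCARD = "<*>"


class OnePassGrouper:
    """Assign each incoming message to a cluster in a single pass.

    Each cluster keeps one template (a list of tokens). A message joins
    the best-matching template of the same length if enough positions
    agree; positions that disagree are replaced by a wildcard.
    """

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.templates = []  # index in this list is the cluster id

    def add(self, message):
        tokens = message.split()
        best_id, best_score = None, 0.0
        for cid, template in enumerate(self.templates):
            if len(template) != len(tokens):
                continue
            same = sum(1 for t, m in zip(template, tokens)
                       if t == m or t == WILDCARD)
            # An empty message matches an empty template completely.
            score = same / len(tokens) if tokens else 1.0
            if score > best_score:
                best_id, best_score = cid, score
        if best_id is not None and best_score >= self.threshold:
            old = self.templates[best_id]
            self.templates[best_id] = [t if t == m else WILDCARD
                                       for t, m in zip(old, tokens)]
            return best_id
        self.templates.append(tokens)
        return len(self.templates) - 1


grouper = OnePassGrouper()
for line in ["user1 logged in", "hari logged in vsat1",
             "hari logged in isat", "user2 logged in"]:
    print(grouper.add(line), line)
```

Each message is compared only against the current templates, not against every earlier message, so the cost per message grows with the number of clusters rather than the number of logs. On the four example messages this puts the two "user... logged in" lines in cluster 0 and the two "hari logged in ..." lines in cluster 1.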