Clustering tree structured data

醉话见心 2021-02-01 21:27

Suppose we are given data in a semi-structured format as a tree. As an example, the tree can be formed as a valid XML document or as a valid JSON document. You could imagine it

2 Answers
  • 2021-02-01 21:41

    Here you may find a paper that seems closely related to your problem.

    From the abstract:

    This thesis presents Ixor, a system which collects, stores, and analyzes stack traces in distributed Java systems. When combined with third-party clustering software and adaptive cluster filtering, unusual executions can be identified.

  • 2021-02-01 21:44

    Given the nature of your problem (stack traces), I would reduce it to a string matching problem. Representing a stack trace as a tree adds overhead: each element in a stack trace has exactly one parent, so a trace is effectively a linear sequence.

    If string matching is indeed more appropriate for your problem, you can run through your data, map each node onto a hash, and build the n-grams for each 'document'.

    Example:

    Mapping:

    • Exception A -> 0
    • Exception B -> 1
    • Exception C -> 2
    • Exception D -> 3

    Doc A: 0-1-2

    Doc B: 1-2-3

    2-grams for doc A (X marks the start or end of the sequence): X0, 01, 12, 2X

    2-grams for doc B: X1, 12, 23, 3X

    Using the n-grams, you will be able to cluster similar sequences of events regardless of the root node (in this example, both documents share the 2-gram 12).
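The n-gram steps above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the answerer's implementation: the function names `padded_ngrams` and `jaccard` are hypothetical, and Jaccard similarity is only one possible way to compare the resulting n-gram sets.

```python
def padded_ngrams(seq, n=2, pad="X"):
    """Return the set of n-grams of seq, with n-1 boundary markers on each side."""
    padded = [pad] * (n - 1) + list(seq) + [pad] * (n - 1)
    return {tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed choice of similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Mapping from the example: each distinct node gets an integer ID.
mapping = {"Exception A": 0, "Exception B": 1, "Exception C": 2, "Exception D": 3}

doc_a = [mapping[name] for name in ("Exception A", "Exception B", "Exception C")]
doc_b = [mapping[name] for name in ("Exception B", "Exception C", "Exception D")]

grams_a = padded_ngrams(doc_a)  # X0, 01, 12, 2X
grams_b = padded_ngrams(doc_b)  # X1, 12, 23, 3X

# The two documents share one 2-gram (1, 2) out of seven distinct 2-grams.
similarity = jaccard(grams_a, grams_b)
```

The pairwise similarities (or the n-gram sets themselves, encoded as feature vectors) could then be passed to whatever clustering algorithm you prefer.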

    However, if you are still convinced that you need trees rather than strings, consider the following: measuring similarity between trees is considerably more complex. You will want to find similar subtrees, where subtrees that match over a greater depth yield a higher similarity score. For this purpose, you will need to discover closed frequent subtrees (frequent subtrees that cannot be extended to a larger subtree without lowering their support). What you don't want is a data collection containing subtrees that are very rare, or that are present in every document you are processing (which is what you will get if you do not look for frequent patterns).

    Here are some pointers:

    • http://portal.acm.org/citation.cfm?id=1227182
    • http://www.springerlink.com/content/yu0bajqnn4cvh9w9/

    Once you have your frequent subtrees, you can use them in the same way as you would use the n-grams for clustering.
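To illustrate using subtrees as clustering features, here is a rough Python sketch. It is a deliberately simplified stand-in, not the closed-subtree mining described in the linked papers: it only considers bottom-up subtrees (a node together with all of its descendants), ignores sibling order, and filters by document frequency. The tree encoding as `(label, children)` tuples, the toy documents, and all function names are assumptions made for illustration.

```python
from collections import Counter


def canon(node):
    """Canonical string for the subtree rooted at node; sibling order is ignored."""
    label, children = node
    return label + "(" + ",".join(sorted(canon(c) for c in children)) + ")"


def bottom_up_subtrees(node):
    """Set of canonical strings, one per node (that node plus all its descendants)."""
    _, children = node
    found = {canon(node)}
    for child in children:
        found |= bottom_up_subtrees(child)
    return found


def frequent_features(docs, min_docs, max_docs):
    """Keep only subtrees occurring in at least min_docs and at most max_docs documents."""
    per_doc = {name: bottom_up_subtrees(tree) for name, tree in docs.items()}
    doc_freq = Counter(s for feats in per_doc.values() for s in feats)
    keep = {s for s, count in doc_freq.items() if min_docs <= count <= max_docs}
    return {name: feats & keep for name, feats in per_doc.items()}


def jaccard(a, b):
    """Jaccard similarity of two feature sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def leaf(label):
    return (label, [])


# Hypothetical toy documents.
docs = {
    "doc1": ("root", [("a", [leaf("b"), leaf("c")]), leaf("d")]),
    "doc2": ("root", [("a", [leaf("c"), leaf("b")]), leaf("e")]),
    "doc3": ("root", [leaf("d"), leaf("f")]),
}

# Drop subtrees seen in only one document; max_docs=2 would also drop
# any subtree present in all three documents.
features = frequent_features(docs, min_docs=2, max_docs=2)
sim_12 = jaccard(features["doc1"], features["doc2"])
sim_13 = jaccard(features["doc1"], features["doc3"])
```

With real data, the `bottom_up_subtrees` step would be replaced by a proper frequent-subtree miner; the filtering and similarity steps stay the same in spirit.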
