There are two binary trees T1 and T2 which store character data, duplicates allowed.
How can I find whether T2 is a subtree of T1 ? .
T1 has millions of nodes and
If your trees are not sorted in any way, I don't see any other way than to do a brute-force search: walk through tree T1
and check to see if you reach a node which matches the first node of tree T2
. If not, continue traversing T1
. If so, check if the next nodes match, until you find the end of T2
, in which case you have a hit: your tree T2
is indeed a subtree of T1
.
If you know the depth of every single node of T1
(from leaf to node), you could skip any nodes which are not as deep as the subtree you are looking for. This could help you eliminate a lot of needless comparisons. Say that T1
and T2
are well balanced, then tree T1
will have a total depth of 20 (2**20
> 1,000,000
) and tree T2
will have a depth of 7 (2**7
> 100
). You'll just have to walk the 13 first layers of T1
(8192 nodes -- or is that 14 layers and 16384 nodes?) and will be able to skip about 90% of T1
...
However, if by subtree you mean that leaf nodes of T2
are also leaf nodes of T1
, then you could do a first traversal of T1
and compute the depth of every node (distance from leaf to node) and then only check the subtrees which have the same depth as T2
.
I think we need to go by brute force, but why do you need to match the trees after you find your root of T2 in T1. Its not same as when we are not supposed to find if the tree are identical.(Then only we need to compare the whole tree)
You are given trees T1 and T2,pointers, not the values.
If Node T2 (which is a pointer), is present in T1 tree.
Then the Tree T2 is subtree of T1.
Edit:
Only if its said that T2 is actually a different Tree by object wise, and we need to find out that if there is a Subtree in T1, which is identical to T2.
Then this wont work.
And we have no other option than to go for the solutions already discussed here.
I assume that your tree are immutable trees so you never change any subtree (you don't do set-car!
in Scheme parlance), but just you are constructing new trees from leaves or from existing trees.
Then I would advise to keep in every node (or subtree) an hash code of that node. In C parlance, declare the tree-s to be
struct tree_st {
const unsigned hash;
const bool isleaf;
union {
const char*leafstring; // when isleaf is true
struct { // when isleaf is false
const struct tree_st* left;
const struct tree_st* right;
};
};
};
then compute the hash at construction time, and when comparing nodes for equality first compare their hash for equality; most of the time the hash code would be different (and you won't bother comparing content).
Here is a possible leaf construction function:
struct tree_st* make_leaf (const char*string) {
assert (string != NULL);
struct tree_st* t = malloc(sizeof(struct tree_st));
if (!t) { perror("malloc"); exit(EXIT_FAILURE); };
t->hash = hash_of_string(string);
t->isleaf = true;
t->leafstring = string;
return t;
}
The function to compute an hash code is
unsigned tree_hash(const struct tree_st *t) {
return (t==NULL)?0:t->hash;
}
The function to construct a node from two subtrees sleft
& sright
is
struct tree_st*make_node (const struct tree_st* sleft,
const struct tree_st* sright) {
struct tree_st* t = malloc(sizeof(struct tree_st));
if (!t) { perror("malloc"); exit(EXIT_FAILURE); };
/// some hashing composition, e.g.
unsigned h = (tree_hash(sleft)*313) ^ (tree_hash(sright)*617);
t->hash = h;
t->left = sleft;
t->right = sright;
return t;
}
The compare function (of two trees tx
& ty
) take advantage that if the hashcodes are different the comparands are different
bool equal_tree (const struct tree_st* tx, const struct tree_st* ty) {
if (tx==ty) return true;
if (tree_hash(tx) != tree_hash(ty)) return false;
if (!tx || !ty) return false;
if (tx->isleaf != ty->isleaf) return false;
if (tx->isleaf) return !strcmp(tx->leafstring, ty->leafstring);
else return equal_tree(tx->left, ty->left)
&& equal_tree(tx->right, ty->right);
}
Most of the time the tree_hash(tx) != tree_hash(ty)
test would succeed and you won't have to recurse.
Read about hash consing.
Once you have such an efficient hash-based equal_tree
function you could use the techniques mentioned in other answers (or in handbooks).
If you are memory/storage bound (i.e. can't pre-manipulate and store the trees in alternate forms), you probably won't be able to do anything better than the brute force search suggested by some other answers (traverse P1 looking for matching root of P2, traverse both to determine whether the node is in-fact the root of a matching sub-tree, continue with original traversal if it isn't a match). This search operates in O(n * m) where n is the size of P1 and m is the size of P2. With depth checks and other potential optimizations depending on the tree data you have available, this man be optimized a bit, but it's still generally O(n * m). Depending on your specific circumstances, this may be the only reasonable approach.
If you're not memory/storage bound and don't mind a bit of complexity, I believe this could be improved to O(n + m) by reducing to a longest common substring problem with the help of a generalized suffix tree. Some discussion on this for a similar problem can be found here. Maybe when I have more time, I'll come back and edit with more specifics on an implementation.
I suggest an algorithm, that uses preprocessing:
1) Pre-process the T1 tree, keeping the list of all possible subtree roots (the cache list will have a millions of entries);
2) Sort the cached list of roots by the descending order of data, kept in root. You may choose more elegant sorting function, for example, parse a character tree into string.
3) Use binary search to locate the necessary sub-tree. You may quickly reject subtrees, with the number of nodes, not equal to the T2 number of nodes (or with the different depth).
Note that steps 1) and 2) must be done only once, then you can test many sub-tree candidates, using the same preprocessed result.
Traverse T1. If the current node is equal to the root node of T2, traverse both of the trees (T2 and the current subtree of T1) at the same time. Compare the current node. If they are always equal, T2 is a subtree of T1.