I know that Google’s search algorithm is mainly based on pagerank. However, it also does analysis and uses the structure of the document H1
, H2
,
I believe what you are interested in is called structural-fingerprinting, and it is often used to determine the similarity of two structures. In Google's case, applying a weight to different tags and applying to a secret algorithm that (probably) uses the frequencies of the different elements in the fingerprint. This is deeply routed in information theory - if you are looking for academic papers on information theory, I would start with "A Mathematical Theory of Communication" by Claude Shannon