Building an HTML Diff/Patch Algorithm

日久生厌 2021-02-04 03:48

Here is what I'm trying to accomplish:

  • Take two HTML documents as input (support for N documents is not essential).
  • Normalize both documents to a standard HTML format.
  • Diff the two documents.
3 Answers
  • 2021-02-04 04:11

    If you were going to start from scratch, a useful search term would be "tree diff".

    There's a pretty awesome blog post here, although I just found it by googling "daisydiff python" so I bet you've already seen it. Besides all the interesting theoretical stuff, he mentions the existence of Logilab's xmldiff, an open-source XML differ written in Python. That might be a decent starting point: maybe less correct than trying to wrap or reimplement DaisyDiff, but probably easier to get up and running quickly.

    There's also html-tree-diff on pypi, which I found via this Quora link: http://www.quora.com/Is-there-any-good-Python-implementation-of-a-tree-diff-algorithm

    There's some theoretical material on tree diffing at "efficient diff algorithm for trees" and "Levenshtein distance" on cstheory.stackexchange.

    BTW, just to clarify, you are talking about diffing two DOM trees, but not necessarily rendering the diff/merge back into any particular HTML, right? (EDIT: Right.) A lot of the similarly-worded questions on here are really asking "how can I color deleted lines red and added lines green" or "how can I make matching paragraphs line up visually", skipping right over the theoretical hard part of "how do I diff two DOM trees in the first place" and the practical hard part of "how do I parse possibly malformed HTML into a DOM tree even before that". :)
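
    To make the two hard parts above concrete, here is a minimal sketch using only the Python standard library. It is not the algorithm used by DaisyDiff or xmldiff; it is a naive top-down heuristic that aligns the children of matching nodes with difflib.SequenceMatcher, and it is not an optimal tree edit distance. The names Node, TreeBuilder, parse and tree_diff are hypothetical, and the stack-based recovery from unclosed tags is an assumption, not what a browser does.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical tree node: a tag name, or '#text:...' for text."""

    def __init__(self, label, attrs=()):
        self.label = label
        # Sort attributes so their order in the source does not matter.
        self.attrs = tuple(sorted((k, v or "") for k, v in attrs))
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree while tolerating unclosed or stray tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.stack[-1].children.append(Node("#text:" + text))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tree_diff(old, new, path="/"):
    """Yield (op, parent_path, label) for children that differ."""
    a = [(c.label, c.attrs) for c in old.children]
    b = [(c.label, c.attrs) for c in new.children]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Same tag and attributes: descend and compare the subtrees.
            for k in range(i2 - i1):
                child = old.children[i1 + k]
                yield from tree_diff(child, new.children[j1 + k],
                                     f"{path}{child.label}[{i1 + k}]/")
        else:
            for c in old.children[i1:i2]:
                yield ("delete", path, c.label)
            for c in new.children[j1:j2]:
                yield ("insert", path, c.label)


old = parse("<div><p>Hello</p><p>World</p></div>")
new = parse("<div><p>Hello</p><p>There</p><br></div>")
for change in tree_diff(old, new):
    print(change)
```

    For the two sample documents, the changed text in the second paragraph is reported as a delete plus an insert under that paragraph, and the new br as an insert under the div. A node whose attributes changed is treated as a whole-subtree delete and insert, which is one of the places a real tree differ does better.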

  • 2021-02-04 04:25

    I know this question is about Python, but you could take a look at 3DM, an XML 3-way merging and differencing tool (the default implementation is in Java). The paper describing the algorithm it uses is at http://www.cs.hut.fi/~ctl/3dm/thesis.pdf, and here is the link to the site.

    The drawback is that you have to clean up the document first so that it can be parsed as XML.

  • 2021-02-04 04:30

    You could start by using BeautifulSoup to parse both documents.

    Then you have a choice:

    • use prettify to render both documents as more or less standardized HTML and diff those.
    • compare the parse trees.

    The latter allows you to e.g. discard elements that only affect the presentation, not the content. The former is probably easier.
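
    A rough sketch of the first option, with an optional step borrowed from the second: parse with BeautifulSoup, optionally drop presentation-only markup, render with prettify, and feed the lines to difflib. The helper names normalize and html_diff and the PRESENTATIONAL tag list are assumptions made for illustration; which elements count as "presentation only" depends on your documents.

```python
import difflib

from bs4 import BeautifulSoup

# Assumption: tags treated as presentation-only. Adjust for your documents.
PRESENTATIONAL = ["b", "i", "u", "font", "center"]


def normalize(html, strip_presentation=False):
    """Return the document as a list of prettified lines."""
    soup = BeautifulSoup(html, "html.parser")
    if strip_presentation:
        for tag in soup.find_all("style"):
            tag.decompose()               # drop the element and its contents
        for tag in soup.find_all(PRESENTATIONAL):
            tag.unwrap()                  # keep the contents, drop the tag
        for tag in soup.find_all(True):
            tag.attrs.pop("style", None)  # drop inline styles
        # Re-parse so text fragments left side by side by unwrap() are merged.
        soup = BeautifulSoup(str(soup), "html.parser")
    return soup.prettify().splitlines()


def html_diff(old_html, new_html, strip_presentation=False):
    """Line-based unified diff of the two normalized documents."""
    return list(difflib.unified_diff(
        normalize(old_html, strip_presentation),
        normalize(new_html, strip_presentation),
        fromfile="old", tofile="new", lineterm="",
    ))


old = "<p>Hello <b>world</b></p>"
new = "<p>Hello world</p>"
print("\n".join(html_diff(old, new)))                # shows the <b> change
print(html_diff(old, new, strip_presentation=True))  # nothing left to report
```

    This stays a line diff of text, so it knows nothing about the tree: moving a subtree shows up as one large deletion and one large insertion.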
