How to distinguish between added sentences and altered sentences with difflib and nltk?

Submitted by 有些话、适合烂在心里 on 2019-12-24 15:32:46

Question


I downloaded this page and made a very minor edit to it, changing the first 65 in this paragraph to 68:

I then run it through the following code to pull out the diffs.

import difflib
import bs4
from bs4 import BeautifulSoup
import urllib2
import lxml.html as lh
url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
response = urllib2.urlopen(url)
content = response.read()  # read the whole response body as a single string
root = lh.fromstring(content)
section1 = root.xpath("//div[@class = 'column-12']")[0]
section1_text = section1.text_content()

url2 = 'file:///Users/Pyderman/repos/02092016062645AM-modified.html'
response2 = urllib2.urlopen(url2)
content2 = response2.read()  # read the whole response body as a single string
root2 = lh.fromstring(content2)
section2 = root2.xpath("//div[@class = 'column-12']")[0]
section2_text = section2.text_content()

d = difflib.Differ()

soup = bs4.BeautifulSoup(unicode(section1_text))
soup2= bs4.BeautifulSoup(unicode(section2_text))

from nltk import sent_tokenize

sentences = [sentence for string in soup.stripped_strings for sentence in sent_tokenize(string)]
sentences2 = [sentence for string in soup2.stripped_strings for sentence in sent_tokenize(string)]

diff = d.compare(sentences, sentences2)
changes = [change for change in diff if change.startswith(('-', '+'))]
for change in changes:
    print(change)

Printing the changes gives:

- It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).
+ It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).

So a sentence gets marked with a + whether it is a brand new addition or a minor change to an existing sentence. As it stands, unless my program does some additional processing, it will conclude that one sentence was removed and a different one was added.
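For illustration, here is a small self-contained sketch of that behaviour. The `old` and `new` lists are made-up sentences, not text from the page. It prints every line that `difflib.Differ` produces, including the `? ` hint lines it emits between a removed and an added line that it considers close matches; the `-`/`+` filter in the code above discards those hint lines.

```python
import difflib

# Hypothetical sentences: one altered in place, one unchanged, one brand new.
old = [
    "The offset ends at age 65.",
    "This sentence is unchanged.",
]
new = [
    "The offset ends at age 68.",
    "This sentence is unchanged.",
    "This sentence is brand new.",
]

diff_lines = list(difflib.Differ().compare(old, new))
for line in diff_lines:
    # Hint lines carry their own trailing newline, so strip it before printing.
    print(line.rstrip("\n"))
```

Both the altered sentence and the brand new one show up as `+` lines, so the prefix alone cannot tell them apart.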

How can we take advantage of the fact that what difflib sees as the apparently 'removed' sentence and the apparently 'added' sentence are very similar, in order to determine that we are in fact dealing with an in-place change to an existing sentence?

NOTE: The solution needs to handle potentially several changes in a single page, so a single check along the lines of "if sentence1 is very similar to sentence2, then it's a modification" won't be sufficient: there will be several diffs to compare and contrast.
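One possible direction, offered as a rough sketch rather than a definitive answer: run `difflib.SequenceMatcher` over the two sentence lists, and within each non-equal block pair every removed sentence with its most similar added sentence, using `SequenceMatcher.ratio()` as the similarity score. Pairing only inside a block keeps several independent changes on one page from being mixed up. The function name `classify_changes`, the sample sentences, and the `threshold` value of 0.75 are assumptions made for this example; the threshold in particular would need tuning on real data.

```python
import difflib


def classify_changes(old_sentences, new_sentences, threshold=0.75):
    """Return (kind, old, new) tuples, kind being 'modified', 'removed' or 'added'."""
    results = []
    matcher = difflib.SequenceMatcher(a=old_sentences, b=new_sentences, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed = old_sentences[i1:i2]
        added = list(new_sentences[j1:j2])  # copy, since matched items are consumed
        for old in removed:
            # Look for the most similar still-unmatched added sentence in this block.
            best, best_ratio = None, threshold
            for candidate in added:
                ratio = difflib.SequenceMatcher(None, old, candidate).ratio()
                if ratio >= best_ratio:
                    best, best_ratio = candidate, ratio
            if best is None:
                results.append(("removed", old, None))
            else:
                added.remove(best)
                results.append(("modified", old, best))
        # Whatever was not paired with a removed sentence is a genuine addition.
        results.extend(("added", None, sentence) for sentence in added)
    return results


# Hypothetical input: one in-place edit, one deletion, one new sentence.
old = ["Alpha stays the same.", "The offset ends at age 65.", "Zebras graze quietly."]
new = ["Alpha stays the same.", "The offset ends at age 68.",
       "Benefits are recalculated annually by the agency."]

changes = classify_changes(old, new)
for kind, before, after in changes:
    print("{}: {!r} -> {!r}".format(kind, before, after))
```

The greedy pairing is a simplification: it takes the best match for each removed sentence in order, which may not be the globally best assignment when a block contains many similar sentences.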

Source: https://stackoverflow.com/questions/35513457/how-to-distinguish-between-added-sentences-and-altered-sentences-with-difflib-an
