I downloaded this page and made a minor edit to the copy, changing the first 65 in one paragraph to 68.
I then parse both sources with BeautifulSoup, which gives the soup and soup2 objects used below.
You can use nltk.sent_tokenize() to split soup strings into sentences:
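from difflib import Differ
from nltk import sent_tokenize

# Split the text of each parsed page into sentences.
sentences = [sentence for string in soup.stripped_strings for sentence in sent_tokenize(string)]
sentences2 = [sentence for string in soup2.stripped_strings for sentence in sent_tokenize(string)]
# d was not defined in the snippet; a difflib.Differ instance is assumed here.
d = Differ()
diff = d.compare(sentences, sentences2)
# Keep only the removed ('-') and added ('+') entries.
changes = [change for change in diff if change.startswith(('-', '+'))]
for change in changes:
    print(change)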
This prints only the sentence in which the change was detected:
- It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).
+ It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).
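If you want to try the idea without installing anything, here is a minimal sketch that uses only the standard library. It is not the code above: a naive regex splitter stands in for nltk.sent_tokenize (it will mis-split abbreviations such as "e.g."), the helper names split_sentences and changed_sentences are made up for illustration, and the two sample texts are invented rather than taken from the page.

```python
import re
from difflib import Differ


def split_sentences(text):
    # Naive stand-in for nltk.sent_tokenize: split on whitespace that
    # follows '.', '!' or '?'. Good enough for a demo, not for real prose.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def changed_sentences(old_text, new_text):
    # Compare the two texts sentence by sentence and keep only the
    # removed ('- ') and added ('+ ') entries, dropping unchanged
    # sentences and Differ's '? ' hint lines.
    diff = Differ().compare(split_sentences(old_text), split_sentences(new_text))
    return [line for line in diff if line.startswith(('- ', '+ '))]


# Invented sample texts that differ in a single number.
old = "The offset applies to disability beneficiaries. It ends at age 65. Other rules are unchanged."
new = "The offset applies to disability beneficiaries. It ends at age 68. Other rules are unchanged."

for line in changed_sentences(old, new):
    print(line)
```

Comparing at sentence level rather than on raw strings keeps each reported change short, which is the point of tokenizing before diffing.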