Getting more granular diffs from difflib (or a way to post-process a diff to achieve the same thing)

Downloading this page and making a minor edit to it, changing the first 65 in this paragraph to 68:

I then parse both sources with

1 Answer

悲哀的现实

2021-01-16 05:13

You can use nltk.sent_tokenize() to split soup strings into sentences:

from nltk import sent_tokenize

# soup, soup2 and d come from the question's code;
# d is assumed to be a difflib.Differ() instance.
sentences = [sentence for string in soup.stripped_strings for sentence in sent_tokenize(string)]
sentences2 = [sentence for string in soup2.stripped_strings for sentence in sent_tokenize(string)]

diff = d.compare(sentences, sentences2)
# Keep only removed ('-') and added ('+') lines; drop unchanged and '?' hint lines.
changes = [change for change in diff if change.startswith(('-', '+'))]
for change in changes:
    print(change)

This prints only the sentences in which a change was detected:

- It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).
+ It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).
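
If the sentence-level output is still too coarse, the changed pair can be post-processed down to the words that differ. The following is only a minimal, standard-library sketch of that idea, not the code from the question: split_sentences() is a naive regex stand-in for nltk.sent_tokenize(), changed_words() is a hypothetical helper built on difflib.SequenceMatcher, and the sample texts are made up for illustration.

```python
import difflib
import re


def split_sentences(text):
    # Naive stand-in for nltk.sent_tokenize(): split on whitespace
    # that follows '.', '!' or '?'. Real text needs a proper tokenizer.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def changed_sentences(old_text, new_text):
    # Sentence-level diff: keep only the removed ('- ') and added ('+ ') lines.
    differ = difflib.Differ()
    diff = differ.compare(split_sentences(old_text), split_sentences(new_text))
    return [line for line in diff if line.startswith(('- ', '+ '))]


def changed_words(old_sentence, new_sentence):
    # Hypothetical helper: word-level diff of one sentence pair.
    # Returns (old_fragment, new_fragment) for every non-equal opcode.
    old_words, new_words = old_sentence.split(), new_sentence.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    return [
        (' '.join(old_words[i1:i2]), ' '.join(new_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != 'equal'
    ]


# Made-up sample texts that differ in a single number.
old = ("This is the first sentence. "
       "The offset ends at age 65 for beneficiaries. "
       "This is the last one.")
new = ("This is the first sentence. "
       "The offset ends at age 68 for beneficiaries. "
       "This is the last one.")

for line in changed_sentences(old, new):
    print(line)

print(changed_words("The offset ends at age 65 for beneficiaries.",
                    "The offset ends at age 68 for beneficiaries."))
```

Splitting on words rather than characters keeps the result readable (a whole number or word is reported as replaced), at the cost of treating punctuation attached to a word as part of that word.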
