Getting more granular diffs from difflib (or a way to post-process a diff to achieve the same thing)

一向 2021-01-16 04:39

I downloaded this page and made a minor edit to it, changing the first 65 in this paragraph to 68:

I then parse both sources with

1 answer
  • 2021-01-16 05:13

    You can use nltk.sent_tokenize() to split soup strings into sentences:

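    import difflib
    from nltk import sent_tokenize  # needs NLTK's Punkt tokenizer data to be installed

    # soup and soup2 are the two parsed documents from the question.
    sentences = [sentence for string in soup.stripped_strings for sentence in sent_tokenize(string)]
    sentences2 = [sentence for string in soup2.stripped_strings for sentence in sent_tokenize(string)]

    d = difflib.Differ()  # assumption: the question's definition of d is not shown here
    diff = d.compare(sentences, sentences2)
    # Keep only removed ('- ') and added ('+ ') sentences; skip unchanged and '? ' hint lines.
    changes = [change for change in diff if change.startswith(('- ', '+ '))]
    for change in changes:
        print(change)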
    

    This prints only the sentences in which a change was detected:

    - It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).
    + It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).
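For anyone who wants to try the idea without the original page or an NLTK install, here is a minimal, self-contained sketch using only the standard library. It is not the code from the answer above: `split_sentences` is a hypothetical, naive regex stand-in for `nltk.sent_tokenize`, the sample strings are made up, and `changed_words` is an assumed extra step that narrows a changed sentence down to the differing words with `difflib.SequenceMatcher`.

```python
import difflib
import re


def split_sentences(text):
    # Hypothetical stand-in for nltk.sent_tokenize: split after '.', '!' or '?'
    # when followed by whitespace. Real text needs a proper tokenizer.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def changed_sentences(old_text, new_text):
    # Differ prefixes each line with '- ', '+ ', '? ' or '  ';
    # keep only the removed and added sentences.
    diff = difflib.Differ().compare(split_sentences(old_text), split_sentences(new_text))
    return [line for line in diff if line.startswith(("- ", "+ "))]


def changed_words(old_sentence, new_sentence):
    # Word-level refinement: report only the runs of words that differ.
    old_words, new_words = old_sentence.split(), new_sentence.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    return [
        (tag, old_words[i1:i2], new_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up sample input, not the text from the question.
old = "The offset ends at age 65. Nothing else changed."
new = "The offset ends at age 68. Nothing else changed."

for line in changed_sentences(old, new):
    print(line)
print(changed_words("The offset ends at age 65.", "The offset ends at age 68."))
```

Splitting into sentences first keeps each reported change short, and the word-level pass can then be run only on the sentence pairs that `Differ` flags.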
    