How to check if content of webpage has been changed?

失恋的感觉 2020-12-30 11:23

Basically I'm trying to run some code (Python 2.7) if the content of a website changes, and otherwise wait a bit and check again later.

I'm thinking of comparing

6 answers
  • 2020-12-30 11:49

    Safest solution:

    Download the content, compute a SHA-512 hash of it, store the hash in a database, and compare it against the stored value on each check.

    Pros: You do not depend on any server headers, and you will detect any modification.
    Cons: High bandwidth usage. You have to download the full content every time.
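
A minimal sketch of the hash comparison described above, written for Python 3 with only the standard library. The function names `content_digest` and `has_changed` are illustrative, and the in-memory byte strings stand in for two downloads of the same page; where the digest is stored (database, file) is left to the caller.

```python
import hashlib
from typing import Optional


def content_digest(body: bytes) -> str:
    """Return the SHA-512 hex digest of a downloaded page body."""
    return hashlib.sha512(body).hexdigest()


def has_changed(stored_digest: Optional[str], body: bytes) -> bool:
    """Compare a freshly downloaded body against the digest saved earlier.

    A missing stored digest (first run) is reported as a change.
    """
    return stored_digest != content_digest(body)


# In-memory bytes standing in for two downloads of the same URL.
first = b"<html><body>Hello</body></html>"
second = b"<html><body>Hello, world</body></html>"
saved = content_digest(first)
print(has_changed(saved, first))   # False
print(has_changed(saved, second))  # True
```

Storing only the digest keeps the stored state small, but any byte-level difference, including a timestamp or rotating ad, counts as a change.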

    Using HEAD

    Request the page using the HEAD method and check the response headers:

    • Last-Modified: The server should report the last time the page was generated or modified.
    • ETag: A checksum-like value defined by the server, which should change whenever the content changes.

    Pros: Much less bandwidth usage and very quick checks.
    Cons: Not all servers provide these headers or follow the guidelines. You still need a GET request to fetch the actual resource once you find that it has changed.
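
One possible way to sketch the HEAD check in Python 3 with the standard library. The helper names `head_validators` and `validators_changed` are made up for this example, and treating "no comparable header" as "changed" is an assumption chosen so the caller falls back to a full GET.

```python
import urllib.request
from typing import Optional, Tuple

Validators = Tuple[Optional[str], Optional[str]]  # (ETag, Last-Modified)


def head_validators(url: str, timeout: float = 10.0) -> Validators:
    """Send a HEAD request and return (ETag, Last-Modified); either may be None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("ETag"), response.headers.get("Last-Modified")


def validators_changed(old: Validators, new: Validators) -> bool:
    """Decide whether the page changed, preferring ETag over Last-Modified."""
    old_etag, old_modified = old
    new_etag, new_modified = new
    if old_etag is not None and new_etag is not None:
        return old_etag != new_etag
    if old_modified is not None and new_modified is not None:
        return old_modified != new_modified
    # Assumption: with no comparable header we cannot tell, so report a
    # change and let the caller fall back to downloading the page.
    return True
```

A caller would keep the tuple returned by `head_validators` from the previous check and pass both tuples to `validators_changed` on the next one.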

    Using GET

    Request the page using the GET method with conditional request headers:

    • If-Modified-Since: The server checks whether the resource has been modified since the given time and returns either the content or 304 Not Modified.

    Pros: Still uses less bandwidth, and needs only a single round trip to receive the data.
    Cons: Again, not all resources support this header.
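
A rough sketch of a conditional GET in Python 3 using `urllib`. The names `conditional_headers` and `fetch_if_modified` are illustrative. `If-None-Match` is the ETag counterpart of `If-Modified-Since` and is included here as an addition to what the answer lists. Note that `urllib` reports a 304 response by raising `HTTPError`, which the sketch converts to `None`.

```python
import urllib.error
import urllib.request
from typing import Dict, Optional


def conditional_headers(last_modified: Optional[str],
                        etag: Optional[str] = None) -> Dict[str, str]:
    """Build conditional request headers from values saved on the last fetch."""
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers


def fetch_if_modified(url: str,
                      last_modified: Optional[str] = None,
                      etag: Optional[str] = None,
                      timeout: float = 10.0) -> Optional[bytes]:
    """Return the new body, or None when the server answers 304 Not Modified."""
    request = urllib.request.Request(
        url, headers=conditional_headers(last_modified, etag))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 304:  # unchanged since the saved validators
            return None
        raise
```

A server that ignores the conditional headers simply returns the full body with a 200 status, so the result still needs a content comparison in that case.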

    Finally, a mix of the above solutions may be the best approach.

  • 2020-12-30 11:51

    If you're trying to make a tool that can be applied to arbitrary sites, then you could still start by getting it working for a few specific ones - downloading them repeatedly and identifying exact differences you'd like to ignore, trying to deal with the issues reasonably generically without ignoring meaningful differences. Such a quick hands-on sampling should give you much more concrete ideas about the challenge you face. Whatever solution you attempt, test it against increasing numbers of sites and tweak as you go.

    Would you look at the Kb size of the HTML? Would you look at the string length and check if for example the length has changed more than 5%, the content has been "changed"?

    That's incredibly rough, and I'd avoid that if at all possible. But, you do need to weigh up the costs of mistakenly deeming a page unchanged vs. mistakenly deeming it changed.

    Or is there some kind of hashing algorithm where the hashes stay the same if only small parts of the string/content has been changed?

    You can make such a "hash", but it's very hard to tune its sensitivity to meaningful change in the document. Anyway, as an example: you could sort the 256 possible byte values by their frequency in the document and treat that ordering as a 2-kilobit (256-byte) hash; you can later do a "diff" to see how much the ordering has changed in a later download. (To save memory, you might get away with using just the printable ASCII values, or even just letters after standardising capitalisation.)
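
The byte-frequency ordering could be sketched like this in Python 3. The function names are illustrative, ties are broken by byte value so the result is deterministic, and the distance measure (sum of how far each byte value moved in the ordering) is just one plausible way to "diff" two orderings, not something the answer prescribes.

```python
from collections import Counter
from typing import List


def byte_rank_signature(data: bytes) -> List[int]:
    """Return all 256 byte values ordered from most to least frequent.

    Ties are broken by byte value so the signature is deterministic.
    """
    counts = Counter(data)
    return sorted(range(256), key=lambda value: (-counts[value], value))


def signature_distance(a: List[int], b: List[int]) -> int:
    """Sum, over all byte values, of how far each one moved in the ordering."""
    position_in_b = {value: index for index, value in enumerate(b)}
    return sum(abs(index - position_in_b[value]) for index, value in enumerate(a))


old = byte_rank_signature(b"aab")
new = byte_rank_signature(b"abb")
print(signature_distance(old, old))  # 0
print(signature_distance(old, new))  # 2
```

What distance counts as a meaningful change has to be found by experiment on real pages.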

    An alternative is to generate a set of hashes for different slices of the document: e.g. dividing it into header vs. body, body by heading levels then paragraphs, until you've got at least a desired level of granularity (e.g. 30 slices). You can then say that if only 2 slices of 30 have changed you'll consider the document the same.
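
As a rough illustration of the slice idea in Python 3: this sketch assumes the page has already been reduced to text and simply slices it on blank lines, which is much cruder than the header/body/heading split described above. The function names and the default of 2 tolerated slices are assumptions taken from the "2 of 30" example.

```python
import hashlib
import re
from collections import Counter
from typing import List


def slice_hashes(text: str) -> List[str]:
    """Split text into paragraph-like slices on blank lines and hash each one."""
    slices = [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]
    return [hashlib.sha256(part.encode("utf-8")).hexdigest() for part in slices]


def changed_slice_count(old: List[str], new: List[str]) -> int:
    """Count slices present in only one version (the larger of added/removed)."""
    old_counts, new_counts = Counter(old), Counter(new)
    removed = sum((old_counts - new_counts).values())
    added = sum((new_counts - old_counts).values())
    return max(removed, added)


def is_same_document(old_text: str, new_text: str, max_changed: int = 2) -> bool:
    """Treat the document as unchanged if at most max_changed slices differ."""
    changed = changed_slice_count(slice_hashes(old_text), slice_hashes(new_text))
    return changed <= max_changed
```

Comparing the hashes as multisets rather than by position means an inserted paragraph does not make every later slice look changed.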

    You might also try replacing certain types of content before hashing - e.g. use regular expression matching to replace times with "<time>".
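
For instance, a small Python 3 sketch of replacing clock times before hashing. The pattern only covers `HH:MM` and `HH:MM:SS` and is an assumption for illustration; real pages will need patterns tuned to their own volatile content.

```python
import hashlib
import re

# Assumed pattern: clock times such as 11:23 or 11:23:05.
TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")


def normalize(text: str) -> str:
    """Replace every clock time with the placeholder <time>."""
    return TIME_PATTERN.sub("<time>", text)


def normalized_digest(text: str) -> str:
    """Hash the text after volatile times have been replaced."""
    return hashlib.sha512(normalize(text).encode("utf-8")).hexdigest()


print(normalize("Page generated at 11:23:05"))  # Page generated at <time>
```

Two downloads that differ only in such a timestamp then produce the same digest.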

    You could also do things like lower the tolerance to change more as the time since you last processed the page increases, which could lessen or cap the "cost" of mistakenly deeming it unchanged.
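
One hypothetical way to express a tolerance that shrinks over time, in Python 3. The exponential decay, the 10% starting tolerance, the 24-hour half-life, and the function names are all assumptions made up for this sketch; the answer only says the tolerance should drop as time passes.

```python
def change_threshold(hours_since_processed: float,
                     initial: float = 0.10,
                     floor: float = 0.0,
                     half_life_hours: float = 24.0) -> float:
    """Fraction of the page that must differ before it counts as changed.

    The threshold starts at `initial` and halves every `half_life_hours`,
    never dropping below `floor`.
    """
    if hours_since_processed < 0:
        raise ValueError("hours_since_processed must be non-negative")
    decayed = initial * 0.5 ** (hours_since_processed / half_life_hours)
    return max(floor, decayed)


def should_process(changed_fraction: float, hours_since_processed: float) -> bool:
    """Process the page when the observed change reaches the current threshold."""
    return changed_fraction >= change_threshold(hours_since_processed)
```

With these defaults a 7% change is ignored right after processing but accepted a day later, which caps how long a moderately changed page can be skipped.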

  • 2020-12-30 11:52

    You should do an HTTP HEAD request (so you don't download the file) and look at the "Last-Modified" header in the response.

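    import requests

    url = "https://example.com/"  # placeholder: the page to monitor

    # HEAD fetches only the headers; .get() avoids a KeyError if the header is absent.
    response = requests.head(url, timeout=10)
    datetime_str = response.headers.get("Last-Modified")
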

    Then keep checking that field in a while loop and compare the datetime values to detect a change.
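
A minimal sketch of such a loop in Python 3. It is not the linked program; the names `wait_for_change` and `is_newer` are hypothetical, and the function that fetches the header is passed in as `fetch_marker` so the loop itself does not depend on any HTTP library.

```python
import time
from email.utils import parsedate_to_datetime
from typing import Callable, Optional


def is_newer(previous: str, current: str) -> bool:
    """Compare two HTTP dates such as 'Wed, 30 Dec 2020 11:23:00 GMT'."""
    return parsedate_to_datetime(current) > parsedate_to_datetime(previous)


def wait_for_change(fetch_marker: Callable[[], Optional[str]],
                    interval_seconds: float = 60.0,
                    max_checks: Optional[int] = None,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll fetch_marker() until its value differs from the first reading.

    Returns True when a change is seen, False if max_checks runs out.
    """
    baseline = fetch_marker()
    checks = 0
    while max_checks is None or checks < max_checks:
        sleep(interval_seconds)
        checks += 1
        if fetch_marker() != baseline:
            return True
    return False
```

With the `requests` snippet above, `fetch_marker` could be something like `lambda: requests.head(url, timeout=10).headers.get("Last-Modified")`.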

    I wrote a small Python program that does this:

    https://github.com/javierdechile/check_updates_http

  • 2020-12-30 11:56

    Use git, which has excellent reporting capabilities on what has changed between two states of a file; plus you won't eat up disk space as git manages the deltas for you.

    You can even tell git to ignore "trivial" changes, such as adding and removing of whitespace characters to further optimize the search.

    In practice, this comes down to parsing the output of git diff -b --numstat HEAD^ HEAD, which roughly translates to "find what has changed in all the files, ignoring whitespace changes, between the previous state and the current state" and produces output like this:

    2       37      en/index.html
    

    This means 2 insertions and 37 deletions were made to en/index.html.

    Next you'll have to experiment to find a "threshold" at which you consider a change significant enough to process the files further; this will take time, as you will have to train the system (you can also automate this part, but that is another topic altogether).
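
Parsing that output could look roughly like this in Python 3. The function names and the default threshold of 10 changed lines are assumptions for illustration; `--numstat` separates its three columns with tabs and prints `-` for binary files, which this sketch treats as always significant.

```python
from typing import Dict, Optional, Tuple

Stat = Tuple[Optional[int], Optional[int]]  # (insertions, deletions)


def parse_numstat(output: str) -> Dict[str, Stat]:
    """Map each path to (insertions, deletions); binary files map to (None, None)."""
    stats = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        stats[path] = (
            None if added == "-" else int(added),
            None if deleted == "-" else int(deleted),
        )
    return stats


def is_significant(stats: Dict[str, Stat], threshold: int = 10) -> bool:
    """True when any file's insertions plus deletions reach the threshold."""
    return any(
        added is None or added + deleted >= threshold  # binary change: significant
        for added, deleted in stats.values()
    )


stats = parse_numstat("2\t37\ten/index.html\n")
print(stats)                  # {'en/index.html': (2, 37)}
print(is_significant(stats))  # True
```

The text passed to `parse_numstat` would be the standard output of the git diff command, for example captured with `subprocess.run(..., capture_output=True, text=True)`.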

    Unless you have a very good reason to do so, don't use your traditional relational database as a file system. Let the operating system take care of files, which it's very good at (and which a relational database is not designed to manage).

  • 2020-12-30 11:57

    Hope this helps.

    Store two versions of the HTML file:

    The first is the HTML fetched an hour ago: first.html

    The second is the HTML fetched just now: second.html

    Run the command:

    $ diff first.html second.html > diffs.txt
    

    If diffs.txt contains any text, the file has changed.
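
Since the question is about Python, the same comparison can be sketched without the shell using the standard `difflib` module (Python 3 shown). The function name `html_diff` is illustrative, and the two strings stand in for the contents of first.html and second.html.

```python
import difflib
from typing import List


def html_diff(old_html: str, new_html: str) -> List[str]:
    """Return unified-diff lines between two snapshots; empty when identical."""
    return list(difflib.unified_diff(
        old_html.splitlines(), new_html.splitlines(),
        fromfile="first.html", tofile="second.html", lineterm="",
    ))


old = "<html>\n<p>Hello</p>\n</html>"
new = "<html>\n<p>Hello, world</p>\n</html>"
print(bool(html_diff(old, old)))  # False: no differences
print(bool(html_diff(old, new)))  # True: the page changed
```

An empty list plays the role of an empty diffs.txt.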

  • 2020-12-30 11:59

    There is no universal solution.

    • Use If-Modified-Since or HEAD when possible (dynamic pages usually ignore them).
    • Use RSS when possible.
    • Extract the last-modification timestamp in a site-specific way (news sites have a publication date for each article, easily extractable via XPath).
    • Hash only the interesting elements of the page (build a site-specific model), excluding volatile parts.
    • Hash the whole content (useless for dynamic pages).
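
As an illustration of hashing only the interesting part of a page, here is a Python 3 sketch built on the standard `html.parser` module. It assumes the interesting content lives in a single element with a known `id` and that the markup inside it is reasonably well balanced; the class and function names are made up for the example, and a real site-specific model would likely use a proper HTML library with XPath or CSS selectors.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ElementTextExtractor(HTMLParser):
    """Collect the text inside the first element whose id matches target_id."""

    def __init__(self, target_id: str) -> None:
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while inside the target element
        self.done = False   # set once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def element_digest(html: str, target_id: str) -> str:
    """Hash the whitespace-normalized text of the element with the given id."""
    parser = ElementTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return hashlib.sha512(text.encode("utf-8")).hexdigest()
```

Changes outside the chosen element, such as ads or a clock in the page footer, then leave the digest untouched.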