How can I extract the list of URLs requested during an HTML page render in Python?

谎友^ 2021-01-22 01:59

I want to get the list of all URLs that a browser issues a GET request for when it opens a page. For example, if we try to open cnn.com, there are multiple URLs wit

2 answers
  •  北海茫月
    2021-01-22 03:04

    I guess you will have to create a list of all known file extensions that you do NOT want, and then scan the content of the HTTP response, checking with "if substring not in nono-list:"

    The problem is all the hrefs ending with TLDs, forward slashes, URL-delivered variables and so on, so I think it would be easier to check for the stuff you know you don't want.
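
A minimal sketch of the blocklist idea described above, using only the standard library. The names `UNWANTED_EXTENSIONS`, `URLCollector` and `extract_urls` are hypothetical, as is the choice of extensions. It assumes you already have the HTML as a string and only looks at `href` and `src` attributes, so it will not see URLs that scripts request while the page renders. Checking the extension of the URL path (rather than the whole string) sidesteps the TLD, trailing-slash and query-variable cases mentioned in the answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Hypothetical blocklist: extensions you know you do NOT want.
UNWANTED_EXTENSIONS = {".pdf", ".zip", ".exe"}


class URLCollector(HTMLParser):
    """Collects every href/src attribute value as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page URL.
                self.urls.append(urljoin(self.base_url, value))


def extract_urls(html, base_url, unwanted=UNWANTED_EXTENSIONS):
    """Return URLs found in html whose path extension is not blocklisted."""
    collector = URLCollector(base_url)
    collector.feed(html)
    kept = []
    for url in collector.urls:
        # Look only at the path, so "example.com", "/news/" and "?v=2"
        # are not mistaken for file extensions.
        ext = posixpath.splitext(urlparse(url).path)[1].lower()
        if ext not in unwanted:
            kept.append(url)
    return kept
```

For example, `extract_urls(page_source, "https://example.com/")` would return the absolute form of each link and resource reference in `page_source`, minus those ending in a blocklisted extension.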
