How can I extract the list of URLs obtained during an HTML page render in Python?

谎友^ 2021-01-22 01:59

I want to get the list of all URLs that a browser will issue a GET request for when it opens a page. For example, if we try to open cnn.com, there are multiple URLs wit

2 Answers

北海茫月 (original poster)

2021-01-22 03:04

I guess you will have to create a list of all the file extensions that you do NOT want, and then scan the content of the HTTP response, checking each candidate with something like "if substring not in nono_list:".

The problem is all the hrefs ending with TLDs, forward slashes, variables passed in the URL and so on, so I think it would be easier to check for the things you know you don't want.
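For what it's worth, here is a rough sketch of that blocklist idea using only the standard library. The names `UNWANTED_EXTENSIONS`, `LinkCollector` and `extract_urls` are made up for illustration, and the extension list is just a placeholder you would adapt. It only looks at `href` and `src` attributes in the HTML you already fetched, so it will miss anything the browser requests from JavaScript or from inside CSS files.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Hypothetical blocklist: file extensions we do NOT want to keep.
UNWANTED_EXTENSIONS = {".css", ".ico", ".woff", ".woff2"}


class LinkCollector(HTMLParser):
    """Collect every href/src attribute value found in the markup."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page URL.
                self.urls.append(urljoin(self.base_url, value))


def extract_urls(html, base_url, unwanted=UNWANTED_EXTENSIONS):
    """Return the de-duplicated URLs whose path extension is not blocked."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    kept = []
    for url in collector.urls:
        # Check only the path, so TLDs, trailing slashes and query
        # strings do not get mistaken for a file extension.
        ext = posixpath.splitext(urlparse(url).path)[1].lower()
        if ext not in unwanted and url not in kept:
            kept.append(url)
    return kept
```

You would call it as `extract_urls(page_source, "http://example.com/news/")`, where `page_source` is the HTML text of the response. Checking the parsed path rather than doing a raw substring test is a deliberate change from the "if substring not in" idea, since it sidesteps the TLD and query-string problem mentioned above.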
