Web Crawler to Get Links From a News Website

忘了有多久 2021-01-26 10:49

I am trying to get the links from a page of a news website (one of its archive pages). I wrote the following lines of code in Python:

main.py contains:

3 Answers

-上瘾入骨i (OP)

2021-01-26 11:42

You are using link_dictionary without a clear purpose. If you do not need to read from it later, try the following code:

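    import re
    import mechanize
    from bs4 import BeautifulSoup

    # `url` is the archive page URL, defined as in the question's main.py.
    br = mechanize.Browser()
    htmltext = br.open(url).read()
    soup = BeautifulSoup(htmltext, "html.parser")

    # Each Op-Ed entry is an <li data-section="Op-Ed"> that holds the article link.
    for tag_li in soup.find_all('li', attrs={"data-section": "Op-Ed"}):
        for link in tag_li.find_all('a'):
            urlnew = link.get('href')
            if not urlnew:
                continue  # skip anchors without an href
            brnew = mechanize.Browser()
            htmltextnew = brnew.open(urlnew).read()
            soupnew = BeautifulSoup(htmltextnew, "html.parser")
            # Concatenate the text of every paragraph in the article.
            articletext = ""
            for tag in soupnew.find_all('p'):
                articletext += tag.text
            # Collapse runs of whitespace into single spaces before printing.
            print(re.sub(r'\s+', ' ', articletext))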
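The link-selection step can be tried offline, without fetching any page. The following is only a minimal sketch: it swaps BeautifulSoup for the standard library's html.parser so it runs with no extra packages, and the names SectionLinkParser, extract_section_links, and SAMPLE_HTML, as well as the example.com URLs, are made up for illustration and do not come from the question.

```python
from html.parser import HTMLParser


class SectionLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <li data-section=...>."""

    def __init__(self, section):
        super().__init__()
        self.section = section
        self.depth = 0      # greater than 0 while inside a matching <li>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            # Enter a matching <li>, or track an <li> nested inside one.
            if self.depth or attrs.get("data-section") == self.section:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "li" and self.depth:
            self.depth -= 1


# Hypothetical archive markup, shaped like the <li> items the answer filters on.
SAMPLE_HTML = """
<ul>
  <li data-section="Op-Ed"><a href="https://example.com/op-ed/1">First</a></li>
  <li data-section="Sports"><a href="https://example.com/sports/1">Skip</a></li>
  <li data-section="Op-Ed"><a href="https://example.com/op-ed/2">Second</a></li>
</ul>
"""


def extract_section_links(html, section):
    """Return the links found inside list items of the given section."""
    parser = SectionLinkParser(section)
    parser.feed(html)
    return parser.links


print(extract_section_links(SAMPLE_HTML, "Op-Ed"))
```

Only the two Op-Ed links are collected; the Sports entry is skipped because its data-section attribute does not match.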

Note: re is Python's regular expression module. Import it (import re) before calling re.sub.
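To see what the re.sub call does on its own, here is a small sketch. The helper name normalize_whitespace and the sample string are assumptions for illustration, and the trailing strip() is an extra step that is not part of the answer's code.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical article text with newlines, tabs, and repeated spaces.
sample = "First paragraph.\n\n   Second\tparagraph.  "
print(normalize_whitespace(sample))  # First paragraph. Second paragraph.
```

The pattern \s+ already matches newlines and tabs, so no regex flags are needed for this substitution.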
