Web Crawler to Get Links from a News Website

忘了有多久 2021-01-26 10:49

I am trying to get the links from a page of a news website (one of its archive pages). I wrote the following code in Python:

main.py contains:

3 answers
  •  -上瘾入骨i
    2021-01-26 11:42

    It is not clear what you are using link_dictionary for. If you do not need to read from it later, try the following code instead:

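     import re
     import mechanize
     from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 is installed

     br = mechanize.Browser()
     htmltext = br.open(url).read()  # url: the archive page URL from the question
     soup = BeautifulSoup(htmltext, "html.parser")  # parse the archive page

     for tag_li in soup.findAll('li', attrs={"data-section": "Op-Ed"}):
         for link in tag_li.findAll('a'):
             urlnew = link.get('href')
             if not urlnew:
                 continue  # skip anchors without an href
             brnew = mechanize.Browser()
             htmltextnew = brnew.open(urlnew).read()
             soupnew = BeautifulSoup(htmltextnew, "html.parser")
             # Concatenate the text of every paragraph in the article.
             articletext = ""
             for tag in soupnew.findAll('p'):
                 articletext += tag.text
             # Collapse runs of whitespace into single spaces.
             print(re.sub(r'\s+', ' ', articletext))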
    

    Note: re is Python's regular expression module, so the script needs import re at the top.
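
    If you want to check the link selection and the whitespace clean-up without hitting the site, a small offline sketch may help. It is only an illustration under a few assumptions: it targets Python 3, it uses the standard library's html.parser in place of BeautifulSoup and mechanize, and the names OpEdLinkParser, collapse_whitespace and SAMPLE, as well as the example.com URLs, are made up for this example.

```python
import re
from html.parser import HTMLParser


class OpEdLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <li data-section="Op-Ed">.

    Hypothetical helper: a stdlib stand-in for the BeautifulSoup lookup above.
    """

    def __init__(self, section="Op-Ed"):
        super().__init__()
        self.section = section
        self.li_stack = []  # one bool per open <li>: does it match the section?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self.li_stack.append(attrs.get("data-section") == self.section)
        elif tag == "a" and any(self.li_stack) and attrs.get("href"):
            # Only keep links that sit inside a matching <li>.
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "li" and self.li_stack:
            self.li_stack.pop()


def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up markup that imitates an archive page.
SAMPLE = """
<ul>
  <li data-section="Op-Ed"><a href="http://example.com/a">A</a></li>
  <li data-section="Sports"><a href="http://example.com/b">B</a></li>
  <li data-section="Op-Ed"><p><a href="http://example.com/c">C</a></p></li>
</ul>
"""

parser = OpEdLinkParser()
parser.feed(SAMPLE)
print(parser.links)
print(collapse_whitespace("First  paragraph.\n\n   Second\tparagraph. "))
```

    Once the selection looks right on a saved copy of the page, the same data-section filter can be applied to the live page with the BeautifulSoup code above.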
