Can't scrape webpage with Python Requests library

一向 2020-12-11 10:20

I am trying to get some info from a webpage (link below) using Requests in Python; however, the HTML data that I see in my browser doesn't seem to exist when I connect via

2 answers
  • 2020-12-11 10:54

    Here is the code I use to scrape a table from one site. On that site the table has no id or class, so the XPath needs no attribute filter. If your table does have an id or class, use html.xpath('//table[@id="id_val"]/tr') instead of html.xpath('//table/tr'), where id_val stands for the table's actual id.

    from urllib.request import urlopen  # Python 3 replacement for Python 2's urllib.urlopen
    from lxml import etree

    web = urlopen("http://www.yourpage.com/")
    html = etree.HTML(web.read())
    tr_nodes = html.xpath('//table/tr')

    main_list = []
    for tr in tr_nodes:
        cells = tr.xpath('td')
        if len(cells) < 7:  # skip header rows and rows too short to index
            continue
        city = cells[2].text or ''        # third column, compared against city names
        experience = cells[5].text or ''  # sixth column, compared against 'Freshers' / '0'
        if not (city == 'Across India' or 'Chennai' in city.split('/')):
            continue
        if 'Freshers' in experience.split('/') or '0' in experience.split(' '):
            sub_list = [td.text for td in cells]
            # insert the absolute URL built from the link in the seventh column
            sub_list.insert(6, 'http://yourpage.com/%s' % cells[6].xpath('a')[0].get('href'))
            main_list.append(sub_list)
    print('main_list', main_list)
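The city and experience checks in the code above can be pulled out into small functions and tried on sample values. This is only a sketch: the function names `city_matches` and `is_fresher` and the sample strings are hypothetical and do not come from the original answer.

```python
def city_matches(cell_text):
    """True if the cell is 'Across India' or lists 'Chennai' among '/'-separated cities."""
    text = cell_text or ''  # lxml returns None for a <td> with no text
    return text == 'Across India' or 'Chennai' in text.split('/')


def is_fresher(cell_text):
    """True if the cell lists 'Freshers' (split on '/') or has a standalone '0' token."""
    text = cell_text or ''
    return 'Freshers' in text.split('/') or '0' in text.split(' ')


# Hypothetical sample values, for illustration only
samples = ['Chennai', 'Bangalore/Chennai', 'Across India', 'Mumbai', None]
results = [city_matches(s) for s in samples]
print(results)
```

Splitting on '/' before testing membership means a cell such as 'Bangalore/Chennai' matches while a longer name that merely contains the substring would not.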
    
  • 2020-12-11 11:06

    The element is generated by JavaScript, so it is not in the HTML that Requests downloads. You can use Selenium to get the rendered source; for headless browsing, combine it with PhantomJS:

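    from bs4 import BeautifulSoup
    from selenium import webdriver

    url = 'http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673&sindex=0&discovery=search&ref=qd_men_sr_1_0'

    # PhantomJS support was removed in Selenium 4, so this call needs an older Selenium release
    browser = webdriver.PhantomJS()
    browser.get(url)
    _html = browser.page_source  # HTML after the page's JavaScript has run
    browser.quit()

    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    print(BeautifulSoup(_html, "html.parser").find("span", {"id": "ourPrice"}).text)
    # prints: $50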
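One hint that this page is assembled in the browser is the URL itself: everything after the `#` is a fragment, which HTTP clients do not send to the server. Here is a minimal sketch using the standard library's `urllib.parse` to show what a plain HTTP request for this URL would actually ask for. The variable names are illustrative, and this only inspects the URL; it does not contact the site.

```python
from urllib.parse import urlsplit, parse_qs

url = ('http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS'
       '&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673'
       '&sindex=0&discovery=search&ref=qd_men_sr_1_0')

parts = urlsplit(url)

# The server only receives scheme, host, path and query; the fragment is dropped.
requested = parts._replace(fragment='').geturl()
print(requested)

# The product parameters live in the fragment, where client-side JavaScript can read them.
fragment_params = parse_qs(parts.fragment)
print(fragment_params['asin'])
```

Because the query string is empty and all product parameters sit in the fragment, a request made with Requests fetches only the bare home page, which is consistent with the missing element described in the question.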
    