Web scraping results in 403 Forbidden Error

时光取名叫无心 2021-01-06 08:14

I'm trying to scrape the earnings for each company from SeekingAlpha using BeautifulSoup. However, the site seems to detect that a web scraper is being used, and the request fails with 403 Forbidden.

3 Answers
  • 2021-01-06 08:35

    I was able to access the site contents by using a proxy taken from this list:

    https://free-proxy-list.net/

    Then, sending the request through that proxy with the requests module, you can scrape the site:

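    import re
    import requests
    from bs4 import BeautifulSoup as soup

    # Note: requests chooses a proxy by URL scheme, so this 'http' entry
    # does not cover https:// URLs; add an 'https' key to proxy those too.
    proxies = {'http': '50.207.31.221:80'}
    html = requests.get('https://seekingalpha.com/symbol/AMAT/earnings', proxies=proxies).text

    # Pull the revenue strings straight from the raw HTML.
    revenues = re.findall(r'Revenue of \$[a-zA-Z0-9.]+', html)
    s = soup(html, 'lxml')
    titles = [span.text for span in s.find_all('span', {'class': 'title-period'})]
    eps = [span.text for span in s.find_all('span', {'class': 'eps'})]
    # Beat/miss markers; collected here but not used in the final result.
    deciding = [span.text for span in s.find_all('span', {'class': re.compile('green|red')})]
    # One row per quarter; the EPS text appears twice, as in the output below.
    results = [list(row) for row in zip(titles, eps, revenues, eps)]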
    

    Output:

    [[u'Q4: 11-16-17', u'EPS of $0.93 beat by $0.02', u'Revenue of $3.97B', u'EPS of $0.93 beat by $0.02'],
     [u'Q3: 08-17-17', u'EPS of $0.86 beat by $0.02', u'Revenue of $3.74B', u'EPS of $0.86 beat by $0.02'],
     [u'Q2: 05-18-17', u'EPS of $0.79 beat by $0.03', u'Revenue of $3.55B', u'EPS of $0.79 beat by $0.03'],
     [u'Q1: 02-15-17', u'EPS of $0.67 beat by $0.01', u'Revenue of $3.28B', u'EPS of $0.67 beat by $0.01'],
     [u'Q4: 11-17-16', u'EPS of $0.66 beat by $0.01', u'Revenue of $3.30B', u'EPS of $0.66 beat by $0.01'],
     [u'Q3: 08-18-16', u'EPS of $0.50 beat by $0.02', u'Revenue of $2.82B', u'EPS of $0.50 beat by $0.02'],
     [u'Q2: 05-19-16', u'EPS of $0.34 beat by $0.02', u'Revenue of $2.45B', u'EPS of $0.34 beat by $0.02'],
     [u'Q1: 02-18-16', u'EPS of $0.26 beat by $0.01', u'Revenue of $2.26B', u'EPS of $0.26 beat by $0.01'],
     [u'Q4: 11-12-15', u'EPS of $0.29  in-line ', u'Revenue of $2.37B', u'EPS of $0.29  in-line '],
     [u'Q3: 08-13-15', u'EPS of $0.33  in-line ', u'Revenue of $2.49B', u'EPS of $0.33  in-line '],
     [u'Q2: 05-14-15', u'EPS of $0.29 beat by $0.01', u'Revenue of $2.44B', u'EPS of $0.29 beat by $0.01'],
     [u'Q1: 02-11-15', u'EPS of $0.27  in-line ', u'Revenue of $2.36B', u'EPS of $0.27  in-line '],
     [u'Q4: 11-13-14', u'EPS of $0.27  in-line ', u'Revenue of $2.26B', u'EPS of $0.27  in-line '],
     [u'Q3: 08-14-14', u'EPS of $0.28 beat by $0.01', u'Revenue of $2.27B', u'EPS of $0.28 beat by $0.01'],
     [u'Q2: 05-15-14', u'EPS of $0.28  in-line ', u'Revenue of $2.35B', u'EPS of $0.28  in-line '],
     [u'Q1: 02-11-14', u'EPS of $0.23 beat by $0.01', u'Revenue of $2.19B', u'EPS of $0.23 beat by $0.01']]
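
    One detail worth checking when routing through a proxy: requests looks up the proxy by the scheme of the target URL, so a mapping with only an 'http' key is not applied to an https:// page. The following is a minimal sketch of a mapping that covers both schemes. The helper names make_proxies and fetch_via_proxy are hypothetical, and it assumes the chosen proxy accepts HTTPS traffic, which not every entry on a free list does.

```python
def make_proxies(host_port):
    """Build a requests-style proxies mapping that covers both URL schemes.

    host_port is a "host:port" string, e.g. an entry from a public proxy list.
    """
    proxy_url = "http://" + host_port
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, host_port, timeout=10):
    """GET url through the proxy and return the body; raises on 4xx/5xx."""
    import requests  # third-party; imported lazily so make_proxies has no dependencies

    response = requests.get(url, proxies=make_proxies(host_port), timeout=timeout)
    response.raise_for_status()
    return response.text
```

    For example, fetch_via_proxy('https://seekingalpha.com/symbol/AMAT/earnings', '50.207.31.221:80') would send the request through the proxy used above, provided that proxy is still alive.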
    
  • 2021-01-06 08:36

    For anyone out there using PyQuery:

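    from pyquery import PyQuery as pq
    import requests  # not called directly; PyQuery uses it for the fetch when installed

    # PyQuery downloads the URL itself and forwards extra keyword
    # arguments such as proxies to the underlying request.
    page = pq('https://seekingalpha.com/article/4151372-tesla-fools-media-model-s-model-x-demand', proxies={'http':'34.231.147.235:8080'})
    print(page)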
    
    • The proxy address was taken from https://free-proxy-list.net/.
    • Make sure you are using the Requests library rather than urllib; don't try to load the page with 'urlopen'.
  • 2021-01-06 08:39

    Try setting a User-Agent request header. Its value can be the User-Agent string of any well-known browser.

    Example:

    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
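
    A minimal sketch of how that header could be sent with requests, using the example string above. The names BROWSER_USER_AGENT, build_headers, and fetch are hypothetical, and a browser-like header may not be enough on its own if the site applies other checks.

```python
# Browser-like User-Agent string, copied from the example above.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
)


def build_headers(user_agent=BROWSER_USER_AGENT):
    """Return request headers carrying the given User-Agent string."""
    return {"User-Agent": user_agent}


def fetch(url, timeout=10):
    """GET the page with the browser-like header; raises on 4xx/5xx such as 403."""
    import requests  # third-party; imported lazily so build_headers has no dependencies

    response = requests.get(url, headers=build_headers(), timeout=timeout)
    response.raise_for_status()
    return response.text
```

    Calling fetch('https://seekingalpha.com/symbol/AMAT/earnings') would then return the page HTML, or raise requests.HTTPError if the server still answers 403.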
