Cannot get table data - HTML

悲&欢浪女 2021-01-21 22:32

I am trying to get the 'Earnings Announcements' table from: https://www.zacks.com/stock/research/amzn/earnings-announcements

I have tried several BeautifulSoup options, but none of them returns the table data.

1 answer
  • 2021-01-21 23:21

    The solution is to parse the whole HTML document with Python's string and regular-expression functions instead of BeautifulSoup, because the data is not in HTML tags but inside JavaScript code embedded in the page.

    The code below extracts the JS array assigned to "earnings_announcements_earnings_table". Since this JS array literal has the same syntax as a Python list, I parse it with ast.literal_eval. The result is a list you can loop over, and it contains the data from all pages of the table.

    import ast
    import re
    import urllib.request
    
    url = 'https://www.zacks.com/stock/research/amzn/earnings-announcements'
    user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'}
    req = urllib.request.Request(url, None, user_agent)
    # read() returns bytes in Python 3, so decode before applying str regexes
    source = urllib.request.urlopen(req).read().decode('utf-8', errors='replace')
    
    # Drop everything up to and including the key that precedes the array
    match = re.search(r'"earnings_announcements_earnings_table"\s*:', source, flags=re.IGNORECASE)
    if match:
        source = source[match.end():]
    
    # Drop everything from the next key onwards
    match = re.search(r'"earnings_announcements_webcasts_table"', source, flags=re.IGNORECASE)
    if match:
        source = source[:match.start()]
    
    # What remains is the array literal; strip whitespace and the trailing comma
    result = ast.literal_eval(source.strip('\r\n\t, '))
    print(result)
    

    Let me know if you need clarifications.
