Using Regex to get multiple data on single line by scraping stocks from yahoo [closed]

Submitted on 2019-12-13 04:23:35


import urllib
import re

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']

for i in range(len(stocks_symbols)):
    htmlfile = urllib.urlopen("" + stocks_symbols[i])
    htmltext = htmlfile.read()
    regex = '<span id="yfs_l84_' + stocks_symbols[i] + '">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)

    regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
    pattern1 = re.compile(regex1)
    name1 = re.findall(pattern1, htmltext)
    print "Price of", stocks_symbols[i].upper(), name1, "is", price[0]

I guess the problem is in regex1,

regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'

I tried reading documentation but was unable to figure it out.

In this program I am trying to scrape the stock name and stock price, taking a list of stock symbols as input.

What I think I am doing is passing two (.+?) groups in one variable, which seems incorrect.


Traceback (most recent call last):
  File "C:\Py\stock\", line 14, in <module>
    pattern1 = re.compile(regex1)
  File "C:\\lib\", line 190, in compile
    return _compile(pattern, flags)
  File "C:\\lib\", line 242, in _compile
    raise error, v # invalid expression
error: nothing to repeat 


^ matches the start of a string, and a ? directly after it has nothing to repeat, so (.^?) is not a legal regex. If you change your regex to regex1 = '<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>' it should work. Note that you also had one closing parenthesis too many.
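
To check this without touching the network, the question's pattern and a version with (.^?)) replaced by (.+?) can both be run against a small hand-written snippet. This is only a sketch: the sample HTML below is made up for illustration (the id suffix and the heading text are assumptions, not Yahoo's actual markup).

```python
import re

# Hypothetical markup for illustration only; the real page may differ.
sample_html = '<h2 id="yui_3_9_1_9_1400000000000_38">Apple Inc. (AAPL)</h2>'

broken = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
fixed = '<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>'

# The broken pattern fails at compile time, before any HTML is read.
try:
    re.compile(broken)
    compile_error = None
except re.error as exc:
    compile_error = str(exc)
print(compile_error)

# With two capture groups, findall returns a list of tuples:
# one tuple per match, one element per group.
matches = re.findall(fixed, sample_html)
print(matches)
```

Because findall returns tuples once a pattern has two groups, the heading text would be matches[0][1] rather than matches[0].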

Furthermore, there is a better way to get Yahoo's stock information. You can query many tables (including stock info) with YQL, and there is also a YQL Console where you can try out your queries.

The result you get from there is JSON or XML, both of which Python libraries handle quite well.
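
As a rough sketch of the parsing side only, the standard json module turns a JSON response into plain dicts and lists. The payload below is a made-up stand-in: its layout and field names (query, results, quote, symbol, Name, LastTradePriceOnly) are assumptions for illustration, not a documented response format.

```python
import json

# Hypothetical response body; the structure and field names are assumed.
raw = '''
{"query": {"results": {"quote": [
    {"symbol": "AAPL", "Name": "Apple Inc.", "LastTradePriceOnly": "94.03"},
    {"symbol": "MSFT", "Name": "Microsoft Corporation", "LastTradePriceOnly": "41.80"}
]}}}
'''

data = json.loads(raw)
quotes = data["query"]["results"]["quote"]

# Map each symbol to a (name, price) pair, converting the price string to float.
prices = {q["symbol"]: (q["Name"], float(q["LastTradePriceOnly"])) for q in quotes}

for symbol, (name, price) in sorted(prices.items()):
    print("Price of {} ({}) is {:.2f}".format(symbol, name, price))
```

No regex is involved here, so a change in page layout would not break the extraction; only a change in the response fields would.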


You can extract the price using BeautifulSoup:

import requests
from bs4 import BeautifulSoup

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']

for stock in stocks_symbols:
    htmlfile = requests.get("{}".format(stock))
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(htmlfile.content, "html.parser")
    # Collect the text of every span whose id matches this symbol
    price = [x.text for x in soup.find_all("span", id="yfs_l84_{}".format(stock))]
    print("Price of {}  is {}".format(stock.upper(), price[0]))

Output:

Price of AAPL  is 94.03
Price of SPY  is 198.20
Price of GOOG  is 584.73
Price of NFLX  is 472.35
Price of MSFT  is 41.80


An example using requests and lxml with CSS selectors:

import requests
import lxml.html       # needed for lxml.html.fromstring below
import lxml.cssselect

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']

for symbol in stocks_symbols:

    r = requests.get("" + symbol)
    html = lxml.html.fromstring(r.text)

    price = html.cssselect('span#yfs_l84_' + symbol)
    print('%s: %s' % (symbol.upper(), price[0].text))

    # there is no `h2` with an `id` starting with "yui_3_9_1_9_",
    # so I can't test this part of the code

    #names = html.cssselect('h2[id^="yui_3_9_1_9_"]')
    #for x in names:
    #    print(x.text, x.attrib['id'][len('yui_3_9_1_9_'):])


AAPL: 94.03
SPY: 198.20
GOOG: 584.73
NFLX: 472.35
MSFT: 41.80

