Question
import urllib
import re

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']
for i in range(len(stocks_symbols)):
    htmlfile = urllib.urlopen("https://finance.yahoo.com/q?s=" + stocks_symbols[i])
    htmltext = htmlfile.read(htmlfile)
    regex = '<span id="yfs_l84_' + stocks_symbols[i] + '">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
    pattern1 = re.compile(regex1)
    name1 = re.findall(pattern1, htmltext)
    print "Price of", stocks_symbols[i].upper(), name1, "is", price[0]
I guess the problem is in regex1:
regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
I tried reading the documentation but was unable to figure it out.
In this program I am trying to scrape the stock name and stock price, given a list of stock symbols as input.
I think what I am doing is passing two (.+?) groups in one variable, which seems incorrect.
Output:
Traceback (most recent call last):
  File "C:\Py\stock\stocks.py", line 14, in <module>
    pattern1 = re.compile(regex1)
  File "C:\canopy-1.4.0.1938.win-x86\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\canopy-1.4.0.1938.win-x86\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
error: nothing to repeat
Answer 1:
^ matches the start of a string, and a ? directly after it is not legal in a regex, because an anchor gives ? nothing to repeat. If you change your regex to regex1 = '<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>' it should work. Note that you also had one parenthesis too many.
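To check this without hitting the network, here is a minimal sketch that compares the question's pattern with a version where ^ is replaced by + and the extra parenthesis is removed. The HTML fragment and its id suffix are invented for illustration and are not taken from Yahoo's actual page; the helper name compiles is also hypothetical.

```python
import re

# The question's pattern: "^?" tries to repeat an anchor, and there is an extra ")".
broken = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
# Corrected pattern: "+" instead of "^", and balanced parentheses.
fixed = '<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>'

# Hypothetical HTML fragment, invented for this example.
sample_html = '<h2 id="yui_3_9_1_9_1404_25">Apple Inc. (AAPL)</h2>'


def compiles(regex):
    """Return True if the regex compiles, False if re raises an error."""
    try:
        re.compile(regex)
        return True
    except re.error:
        return False


print(compiles(broken))
print(compiles(fixed))

# With two capture groups, findall returns a list of (id_suffix, name) tuples.
matches = re.findall(fixed, sample_html)
for id_suffix, name in matches:
    print(id_suffix, name)
```

Having two (.+?) groups in one pattern is fine: re.findall then returns one tuple per match, with one element per group, so the name is the second element of each tuple.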
Furthermore, there is a better way to get Yahoo's stock information. You can query a lot of tables (including stock info) with YQL, and there is also a YQL Console where you can try out your queries.
The result you get from there is JSON or XML, both of which can be handled well with existing Python libraries.
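As a rough illustration of the JSON handling, the sketch below parses a response with Python's standard json module. The payload is entirely made up for this example: its field names and nesting are assumptions, not the actual YQL response format, so the keys would need to be adjusted to whatever the service really returns.

```python
import json

# Hypothetical JSON payload, invented for illustration only;
# the real response format may be structured differently.
response_text = '''
{
  "quotes": [
    {"symbol": "AAPL", "name": "Apple Inc.", "price": "94.03"},
    {"symbol": "MSFT", "name": "Microsoft Corporation", "price": "41.80"}
  ]
}
'''

data = json.loads(response_text)

# Build a {symbol: (name, price)} lookup from the parsed structure.
quotes = {q["symbol"]: (q["name"], float(q["price"])) for q in data["quotes"]}

for symbol, (name, price) in quotes.items():
    print("Price of {} ({}) is {:.2f}".format(symbol, name, price))
```

Compared with regex scraping, the parsed result is a plain dict, so both the name and the price can be looked up by key instead of being tied to the page's markup.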
Answer 2:
You can extract the price using BeautifulSoup:
import requests
from bs4 import BeautifulSoup

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']
for stock in stocks_symbols:
    htmlfile = requests.get("https://finance.yahoo.com/q?s={}".format(stock))
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(htmlfile.content, "html.parser")
    # The price is in the <span> whose id ends with the lowercase symbol.
    price = [x.text for x in soup.find_all("span", id="yfs_l84_{}".format(stock))]
    print("Price of {} is {}".format(stock.upper(), price[0]))
Output:
Price of AAPL is 94.03
Price of SPY is 198.20
Price of GOOG is 584.73
Price of NFLX is 472.35
Price of MSFT is 41.80
Answer 3:
Example with requests, lxml, and CSS selection:
import requests
import lxml.html  # .cssselect() also needs the cssselect package installed

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']
for symbol in stocks_symbols:
    r = requests.get("https://finance.yahoo.com/q?s=" + symbol)
    html = lxml.html.fromstring(r.text)
    price = html.cssselect('span#yfs_l84_' + symbol)
    print('%s: %s' % (symbol.upper(), price[0].text))

    # there is no `h2` with an `id` starting with "yui_3_9_1_9_",
    # so I can't test this part of the code
    #names = html.cssselect('h2[id^="yui_3_9_1_9_"]')
    #for x in names:
    #    print('%s %s' % (x.text, x.attrib['id'][len('yui_3_9_1_9_'):]))
Result:
AAPL: 94.03
SPY: 198.20
GOOG: 584.73
NFLX: 472.35
MSFT: 41.80
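The commented-out name lookup in the code above could not be tested against the live page. As a sketch of the same idea (select h2 elements whose id starts with a given prefix and keep the rest of the id), here is a version that runs on an invented HTML fragment. Note that it swaps lxml for the standard library's html.parser so it needs no extra packages; the class name H2PrefixCollector and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

PREFIX = "yui_3_9_1_9_"


class H2PrefixCollector(HTMLParser):
    """Collect (id_suffix, text) for every <h2> whose id starts with PREFIX."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._current_suffix = None  # set only while inside a matching <h2>
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id") or ""
        if tag == "h2" and element_id.startswith(PREFIX):
            self._current_suffix = element_id[len(PREFIX):]
            self._text_parts = []

    def handle_data(self, data):
        if self._current_suffix is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._current_suffix is not None:
            text = "".join(self._text_parts).strip()
            self.results.append((self._current_suffix, text))
            self._current_suffix = None


# Hypothetical markup, invented for this example.
sample_html = (
    '<div><h2 id="yui_3_9_1_9_1404_25">Apple Inc. (AAPL)</h2>'
    '<h2 id="other">Not a match</h2></div>'
)

collector = H2PrefixCollector()
collector.feed(sample_html)
collector.close()
print(collector.results)
```

The startswith check plays the role of the CSS selector h2[id^="yui_3_9_1_9_"], and slicing off the prefix gives the same suffix the commented-out loop was meant to print.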
Source: https://stackoverflow.com/questions/24587592/using-regex-to-get-multiple-data-on-single-line-by-scraping-stocks-from-yahoo