Using Regex to get multiple data on single line by scraping stocks from yahoo [closed]

Question

import urllib
import re

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']

for i in range(len(stocks_symbols)):
    htmlfile = urllib.urlopen("https://finance.yahoo.com/q?s=" + stocks_symbols[i])
    htmltext = htmlfile.read(htmlfile)
    regex = '<span id="yfs_l84_' + stocks_symbols[i] + '">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)

    regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
    pattern1 = re.compile(regex1)
    name1 = re.findall(pattern1, htmltext)
    print "Price of", stocks_symbols[i].upper(), name1, "is", price[0]

I guess the problem is in regex1,

regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'

I tried reading the documentation but was unable to figure it out.

In this program I am trying to scrape the stock name and stock price, given a list of stock symbols as input.

What I think I am doing is passing two (.+?) groups in one pattern, which seems incorrect.

Output:

Traceback (most recent call last):
  File "C:\Py\stock\stocks.py", line 14, in <module>
    pattern1 = re.compile(regex1)
  File "C:\canopy-1.4.0.1938.win-x86\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\canopy-1.4.0.1938.win-x86\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
error: nothing to repeat

Answer 1:

^ matches the start of a string, so a ? after it has nothing to repeat and the pattern is not a legal regex. If you change your regex to regex1 = '<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>' it should work. Note that you also had one closing parenthesis too many.
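As a minimal offline sketch of the point about regex1: the HTML fragment and its id suffix below are made up for illustration, and the real Yahoo markup may differ. It shows that the pattern from the question does not compile, and that with two capturing groups re.findall returns a list of tuples (one entry per group), so having two (.+?) groups in one pattern is fine.

```python
import re

# Hypothetical fragment standing in for the Yahoo page; the id suffix is invented.
sample_html = '<h2 id="yui_3_9_1_9_1404567890_42">Apple Inc. (AAPL)</h2>'

# The pattern from the question: '^' is an anchor, so '?' has nothing to repeat,
# and there is one ')' too many. Compiling it raises re.error.
try:
    re.compile(r'<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>')
    compiled_ok = True
except re.error:
    compiled_ok = False

# Corrected pattern: two lazy capturing groups.
pattern = re.compile(r'<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>')

# With more than one group, findall returns a list of tuples.
matches = pattern.findall(sample_html)

for id_suffix, name in matches:
    print("id suffix: {}, name: {}".format(id_suffix, name))
```

Each tuple can be unpacked directly, as in the loop above, so the stock name is the second element of the first match.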

Furthermore, there is a better way to get Yahoo's stock information. You can query many tables (including stock info) with YQL, and there is also a YQL Console where you can try out your queries.

The result you get back is JSON or XML, both of which Python libraries handle well.
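To illustrate the JSON side, here is a small sketch using the standard json module. The payload and its field names ("quotes", "symbol", "price") are invented for this example and are not the actual YQL response schema; only the parsing approach is the point.

```python
import json

# Hypothetical payload: field names are made up for illustration,
# not taken from the real YQL response format.
payload = (
    '{"quotes": ['
    '{"symbol": "AAPL", "price": "94.03"}, '
    '{"symbol": "MSFT", "price": "41.80"}'
    ']}'
)

data = json.loads(payload)

# Build a symbol -> price mapping, converting the price strings to floats.
prices = {q["symbol"]: float(q["price"]) for q in data["quotes"]}

for symbol, price in sorted(prices.items()):
    print("Price of {} is {:.2f}".format(symbol, price))
```

Compared with a regex over HTML, structured data like this does not break when the page layout changes.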

Answer 2:

You can extract the price using BeautifulSoup:

import requests
from bs4 import BeautifulSoup

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']

for stock in stocks_symbols:
    htmlfile = requests.get("https://finance.yahoo.com/q?s={}".format(stock))
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(htmlfile.content, "html.parser")
    # The price sits in a <span> whose id ends with the lowercase symbol
    price = [x.text for x in soup.find_all("span", id="yfs_l84_{}".format(stock))]
    print("Price of {}  is {}".format(stock.upper(), price[0]))

Output:

Price of AAPL  is 94.03
Price of SPY  is 198.20
Price of GOOG  is 584.73
Price of NFLX  is 472.35
Price of MSFT  is 41.80
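
If installing a third-party parser is not an option, the same idea (select the span by its id instead of matching it with a regex) can be sketched with the standard library's html.parser in Python 3. The SpanTextById class and the sample fragment are hypothetical helpers written for this example, and the sketch assumes the target span contains no nested spans.

```python
from html.parser import HTMLParser


class SpanTextById(HTMLParser):
    """Collect the text of every <span> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False  # True while inside a matching <span>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("id") == self.target_id:
            self.inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data


# Hypothetical fragment imitating the markup the question targets.
sample_html = '<div><span id="yfs_l84_aapl">94.03</span><span id="other">x</span></div>'

parser = SpanTextById("yfs_l84_aapl")
parser.feed(sample_html)
parser.close()

# Guard against an empty result instead of indexing blindly.
price = parser.texts[0] if parser.texts else None
print("Price of AAPL is {}".format(price))
```

Checking for an empty result before indexing avoids an IndexError when the page no longer contains the expected span.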

Answer 3:

An example using requests, lxml, and CSS selectors:

import requests
import lxml.html  # the cssselect() method also needs the cssselect package installed

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']

for symbol in stocks_symbols:

    r = requests.get("https://finance.yahoo.com/q?s=" + symbol)
    html = lxml.html.fromstring(r.text)

    price = html.cssselect('span#yfs_l84_' + symbol)
    print('%s: %s' % (symbol.upper(), price[0].text))

    # there is no `h2` with an `id` starting with "yui_3_9_1_9_",
    # so I can't test this part of the code

    #names = html.cssselect('h2[id^="yui_3_9_1_9_"]')
    #for x in names:
    #    print('%s %s' % (x.text, x.attrib['id'][len('yui_3_9_1_9_'):]))

result:

AAPL: 94.03
SPY: 198.20
GOOG: 584.73
NFLX: 472.35
MSFT: 41.80

Source: https://stackoverflow.com/questions/24587592/using-regex-to-get-multiple-data-on-single-line-by-scraping-stocks-from-yahoo

Tags

python

regex

urllib

scrape

stock