Question
I'm running a program to pull some info from Yahoo! Finance. It runs fine as a for
loop, but it takes a long time (about 10 minutes for 7,000 inputs) because it has to process each requests.get(url)
individually (or am I mistaken about the main bottleneck?).
Anyway, I came across multithreading as a potential solution. This is what I have tried:
import requests
import pprint
import threading

with open('MFTop30MinusAFew.txt', 'r') as ins:  # input file for tickers
    for line in ins:
        ticker_array = ins.read().splitlines()

ticker = ticker_array
url_array = []
url_data = []
data_array = []

for i in ticker:
    url = 'https://query2.finance.yahoo.com/v10/finance/quoteSummary/'+i+'?formatted=true&crumb=8ldhetOu7RJ&lang=en-US&region=US&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents&corsDomain=finance.yahoo.com'
    url_array.append(url)  # loading each complete url at one time

def fetch_data(url):
    urlHandler = requests.get(url)
    data = urlHandler.json()
    data_array.append(data)
    pprint.pprint(data_array)

threads = [threading.Thread(target=fetch_data, args=(url,)) for url in url_array]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
fetch_data(url_array)
The error I get is InvalidSchema: No connection adapters were found for '['https://query2.finance.... [url continues].
PS. I've also read that using multithread approach to scrape websites is bad/can get you blocked. Would Yahoo! Finance mind if I'm pulling data from a couple thousand tickers at once? Nothing happened when I did them sequentially.
Answer 1:
If you look carefully at the error, you will notice that it doesn't show one url but all the urls you appended, enclosed in brackets. Indeed, the last line of your code actually calls your function fetch_data with the full array as a parameter, which doesn't make sense. If you remove that last line, the code runs just fine, and your threads are called as expected.
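For illustration, here is a minimal sketch of the corrected structure: the stray fetch_data(url_array) call is removed, each thread appends its result to a shared list under a lock, and a stand-in function (hypothetical, not from the original post) takes the place of the real requests.get(url).json() call so the pattern can be shown without hitting Yahoo! Finance:

```python
import threading

def fetch_data(url, results, lock):
    # In the real script this line would be: data = requests.get(url).json()
    data = {"url": url}  # stand-in for the JSON payload
    with lock:           # make the append to the shared list explicitly thread-safe
        results.append(data)

url_array = ['https://example.com/a', 'https://example.com/b']
results = []
lock = threading.Lock()

# One thread per url -- each gets a single url string, never the whole array
threads = [threading.Thread(target=fetch_data, args=(u, results, lock))
           for u in url_array]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Note: no extra fetch_data(url_array) call after the join loop --
# passing the whole list is what produced the InvalidSchema error.
print(len(results))  # 2
```

Passing the results list and lock as arguments (instead of mutating a global as in the question) also makes it easier to reason about what each thread touches.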
Source: https://stackoverflow.com/questions/39358982/multithreading-to-scrape-yahoo-finance