Question
I'm running a program to pull some info from Yahoo! Finance. It runs fine as a for
loop, but it takes a long time (about 10 minutes for 7,000 inputs) because it has to process each requests.get(url)
individually (or am I mistaken about the main bottleneck?).
Anyway, I came across multithreading as a potential solution. This is what I have tried:
import requests
import pprint
import threading

with open('MFTop30MinusAFew.txt', 'r') as ins:  # input file for tickers
    for line in ins:
        ticker_array = ins.read().splitlines()

ticker = ticker_array
url_array = []
url_data = []
data_array = []

for i in ticker:
    url = 'https://query2.finance.yahoo.com/v10/finance/quoteSummary/'+i+'?formatted=true&crumb=8ldhetOu7RJ&lang=en-US&region=US&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents&corsDomain=finance.yahoo.com'
    url_array.append(url)  # loading each complete url at one time

def fetch_data(url):
    urlHandler = requests.get(url)
    data = urlHandler.json()
    data_array.append(data)
    pprint.pprint(data_array)

threads = [threading.Thread(target=fetch_data, args=(url,)) for url in url_array]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
fetch_data(url_array)
The error I get is InvalidSchema: No connection adapters were found for '['https://query2.finance.... [url continues].
PS. I've also read that using multithread approach to scrape websites is bad/can get you blocked. Would Yahoo! Finance mind if I'm pulling data from a couple thousand tickers at once? Nothing happened when I did them sequentially.
Answer 1:
If you look carefully at the error, you will notice that it doesn't show one url but all the urls you appended, enclosed in brackets. Indeed, the last line of your code actually calls your function fetch_data with the full array as a parameter, which doesn't make sense. If you remove that last line, the code runs just fine, and your threads are called as expected.
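For illustration, here is a minimal sketch of the corrected structure: the stray fetch_data(url_array) call is removed, each thread appends its result to a shared list under a lock, and a stand-in function (hypothetical, not from the original post) takes the place of the real requests.get(url).json() call so the pattern can be shown without hitting Yahoo! Finance:

```python
import threading

def fetch_data(url, results, lock):
    # In the real script this line would be: data = requests.get(url).json()
    data = {"url": url}  # stand-in for the JSON payload
    with lock:           # make the append to the shared list explicitly thread-safe
        results.append(data)

url_array = ['https://example.com/a', 'https://example.com/b']
results = []
lock = threading.Lock()

# One thread per url -- each gets a single url string, never the whole array
threads = [threading.Thread(target=fetch_data, args=(u, results, lock))
           for u in url_array]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Note: no extra fetch_data(url_array) call after the join loop --
# passing the whole list is what produced the InvalidSchema error.
print(len(results))  # 2
```

Passing the results list and lock as arguments (instead of mutating a global as in the question) also makes it easier to reason about what each thread touches.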
Source: https://stackoverflow.com/questions/39358982/multithreading-to-scrape-yahoo-finance