Question
So currently I have a for loop which causes the Python program to die, with the program saying 'Killed'. It slows down around 6000 items in, and the program slowly dies at around 6852 list items. How do I fix this?
I assume it's due to the list being too large.
I've tried splitting the list in two around the 6000 mark. Maybe it's due to memory management or something. Help would be appreciated.
import psycopg2

listoftexts = []  # accumulates the sitetext column for every id

for id in listofids:
    connection = psycopg2.connect(user="username", password="password", host="localhost", port="5432", database="darkwebscraper")
    cursor = connection.cursor()
    cursor.execute("select darkweb.site_id, darkweb.site_title, darkweb.sitetext from darkweb where darkweb.online='true' AND darkweb.site_id = %s", ([id]))
    print(len(listoftexts))
    try:
        row = cursor.fetchone()
    except:
        print("failed to fetch one")
    try:
        listoftexts.append(row[2])
        cursor.close()
        connection.close()
    except:
        print("failed to print")
Answer 1:
You're right, it's probably because the list becomes large: Python lists are contiguous blocks of memory. Each time you append, Python checks whether there is room at the next position in the allocated block, and if not it relocates the whole array somewhere with enough room. The bigger your array, the more data Python has to relocate.
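You can actually watch those reallocations happen with sys.getsizeof, which reports the list's currently allocated size. This is just a minimal sketch; the exact growth steps are a CPython implementation detail:

import sys

# sys.getsizeof reports the list's allocated size, which jumps whenever an
# append forces a reallocation to a bigger block (CPython-specific behaviour).
L = []
last = sys.getsizeof(L)
for i in range(64):
    L.append(i)
    size = sys.getsizeof(L)
    if size != last:
        print(f"after {len(L)} items: allocation grew {last} -> {size} bytes")
        last = size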
One way around this would be to create an array of the right size beforehand.
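Applied to the loop in the question, that could look like the sketch below; fetch_text is a hypothetical stand-in for the per-id database lookup:

def fetch_text(site_id):
    return str(site_id)  # placeholder for the real query

listofids = list(range(10))            # placeholder data
listoftexts = [None] * len(listofids)  # right size up front: no growth needed
for i, site_id in enumerate(listofids):
    listoftexts[i] = fetch_text(site_id)  # index assignment never reallocates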
EDIT: Just to make sure it's clear, I made up an example to illustrate my point. I wrote two functions: the first appends the stringified index (to make the items bigger) to a list at each iteration, and the other just fills a preallocated numpy array:
import numpy as np
import matplotlib.pyplot as plt
from time import time

def test_bigList(N):
    # Time each append to a plain Python list.
    L = []
    times = np.zeros(N, dtype=np.float32)
    for i in range(N):
        t0 = time()
        L.append(str(i))
        times[i] = time() - t0
    return times

def test_bigList_numpy(N):
    # Time each assignment into a preallocated numpy string array.
    L = np.empty(N, dtype="<U32")
    times = np.zeros(N, dtype=np.float32)
    for i in range(N):
        t0 = time()
        L[i] = str(i)
        times[i] = time() - t0
    return times

N = int(1e7)
res1 = test_bigList(N)
res2 = test_bigList_numpy(N)

plt.plot(res1, label="list")
plt.plot(res2, label="numpy array")
plt.xlabel("Iteration")
plt.ylabel("Running time")
plt.legend()
plt.title("Evolution of iteration time with the size of an array")
plt.show()
I get the following result (the original answer includes a figure plotting per-iteration running time against iteration index for both functions):
You can see in the figure that in the list case you regularly get spikes (probably due to relocation), and they seem to grow with the size of the list. This example appends short strings, but the bigger the strings, the more pronounced the effect.
If that does not do the trick, then the problem might be linked to the database itself, but I can't help you without knowing the specifics of the database.
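That said, reconnecting to Postgres once per id, as the question's loop does, is itself expensive. Here is a minimal sketch of the database side, assuming the schema and credentials from the question: open one connection and fetch every matching row in a single query (psycopg2 adapts a Python list parameter to a Postgres array, so ANY(%s) works):

import psycopg2

listofids = [1, 2, 3]  # placeholder: the question's list of ids

# One connection and one query instead of reconnecting for every id.
connection = psycopg2.connect(user="username", password="password",
                              host="localhost", port="5432",
                              database="darkwebscraper")
cursor = connection.cursor()
cursor.execute(
    "SELECT site_id, site_title, sitetext FROM darkweb "
    "WHERE online = 'true' AND site_id = ANY(%s)",
    (listofids,))  # the Python list is adapted to a Postgres array
listoftexts = [row[2] for row in cursor.fetchall()]
cursor.close()
connection.close()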
Source: https://stackoverflow.com/questions/56719455/python-for-loop-slows-due-to-large-list