Python For Loop Slows Due To Large List

Submitted by 戏子无情 on 2021-01-29 12:52:10

Question


I currently have a for loop that causes my Python program to die with the message 'Killed'. It slows down around 6,000 items in, and the program finally dies at around 6,852 list items. How do I fix this?

I assume it's due to the list being too large.

I've tried splitting the list in two at around item 6,000. Maybe it's a memory-management issue or something similar. Any help would be appreciated.

    import psycopg2

    # listofids and listoftexts are built earlier in the program
    for id in listofids:
        # a fresh connection is opened for every single id
        connection = psycopg2.connect(user = "username", password = "password", host = "localhost", port = "5432", database = "darkwebscraper")

        cursor = connection.cursor()
        cursor.execute("select darkweb.site_id, darkweb.site_title, darkweb.sitetext from darkweb where darkweb.online='true' AND darkweb.site_id = %s", ([id]))
        print(len(listoftexts))

        try:
            row = cursor.fetchone()
        except:
            print("failed to fetch one")
        try:
            listoftexts.append(row[2])
            cursor.close()
            connection.close()
        except:
            print("failed to append")

Answer 1:


You're right, it's probably because the list becomes large: Python lists are contiguous blocks of memory. Each time you append to the list, Python checks whether there is room at the next position, and if not it relocates the whole array somewhere with enough room. The bigger your array, the more Python has to relocate.

One way around this would be to create an array of the right size beforehand, as in the sketch below.
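As a rough sketch of what that could look like for your loop (assuming listofids and the credentials from your question; everything else keeps your names), you preallocate listoftexts once and assign by index instead of appending:

import psycopg2

# assuming listofids is defined as in the question
connection = psycopg2.connect(user="username", password="password",
                              host="localhost", port="5432",
                              database="darkwebscraper")
cursor = connection.cursor()

# preallocate the result list so it never has to grow
listoftexts = [None] * len(listofids)

for i, id in enumerate(listofids):
    cursor.execute("select darkweb.site_id, darkweb.site_title, darkweb.sitetext "
                   "from darkweb where darkweb.online='true' AND darkweb.site_id = %s",
                   (id,))
    row = cursor.fetchone()
    if row is not None:
        listoftexts[i] = row[2]  # assign by index instead of appending

cursor.close()
connection.close()

Reusing one connection instead of opening a new one per id is a side change here, but it also removes a lot of per-iteration overhead.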

EDIT: Just to make sure it was clear, I made up an example to illustrate my point. I wrote two functions: the first appends the stringified index (to make the items bigger) to a list at each iteration, and the second just fills a preallocated numpy array:

import numpy as np
import matplotlib.pyplot as plt
from time import time

def test_bigList(N):
    # grow a plain Python list by appending, timing each append
    L = []
    times = np.zeros(N, dtype=np.float32)

    for i in range(N):
        t0 = time()
        L.append(str(i))
        times[i] = time() - t0

    return times

def test_bigList_numpy(N):
    # fill a preallocated numpy array of fixed-width strings instead
    L = np.empty(N, dtype="<U32")
    times = np.zeros(N, dtype=np.float32)

    for i in range(N):
        t0 = time()
        L[i] = str(i)
        times[i] = time() - t0
    return times

N = int(1e7)
res1 = test_bigList(N)
res2 = test_bigList_numpy(N)

plt.plot(res1, label="list")
plt.plot(res2, label="numpy array")
plt.xlabel("Iteration")
plt.ylabel("Running time")
plt.legend()
plt.title("Evolution of iteration time with the size of an array")
plt.show()

I get the following result:

[figure: per-iteration running time for the list vs. the numpy array]

You can see in the figure that, in the list case, there are regular spikes (probably due to relocation), and they seem to grow with the size of the list. This example uses short appended strings, but the bigger the strings, the more pronounced the effect.
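If you want to watch the relocations happen directly, a small sketch (not from the original answer) using sys.getsizeof shows the list's allocated size growing in jumps as items are appended:

import sys

L = []
last = sys.getsizeof(L)
for i in range(100):
    L.append(i)
    size = sys.getsizeof(L)
    if size != last:  # the allocation grew, i.e. the list was resized
        print(f"len={len(L):>3}  allocated bytes={size}")
        last = size

Each printed line marks an append where CPython had to grow the underlying buffer; between those points, appends reuse the over-allocated space, which is why most iterations in the plot are fast.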

If that does not do the trick, then the problem might be linked to the database itself, but I can't help you further without knowing the specifics of the database.
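As an aside, if the slowdown turns out to come from the one-query-per-id round trips rather than the list, one option (again not from the original answer, just a sketch reusing the question's names) is to fetch all rows in a single query; psycopg2 adapts a Python list to a PostgreSQL array, so ANY(%s) works:

import psycopg2

connection = psycopg2.connect(user="username", password="password",
                              host="localhost", port="5432",
                              database="darkwebscraper")
cursor = connection.cursor()

# one round trip: psycopg2 adapts the Python list to a PostgreSQL array
cursor.execute("select darkweb.site_id, darkweb.sitetext from darkweb "
               "where darkweb.online='true' AND darkweb.site_id = ANY(%s)",
               (listofids,))
listoftexts = [row[1] for row in cursor.fetchall()]

cursor.close()
connection.close()

Note that fetchall() materialises the whole result set at once, so this trades many small queries for one big one.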



Source: https://stackoverflow.com/questions/56719455/python-for-loop-slows-due-to-large-list
