Upper memory limit?

暖寄归人 2020-11-27 19:51

Is there a limit to memory for Python? I've been using a Python script to calculate the average values from a file which is at least 150 MB in size.

Depending on the

5 answers
  • 2020-11-27 20:32

    Not only are you reading the whole of each file into memory, but also you laboriously replicate the information in a table called list_of_lines.

    You have a secondary problem: your choices of variable names severely obfuscate what you are doing.

    Here is your script rewritten with the readlines() caper removed and with meaningful names:

    file_A1_B1 = open("A1_B1_100000.txt", "r")
    file_A2_B2 = open("A2_B2_100000.txt", "r")
    file_A1_B2 = open("A1_B2_100000.txt", "r")
    file_A2_B1 = open("A2_B1_100000.txt", "r")
    file_write = open ("average_generations.txt", "w")
    mutation_average = open("mutation_average", "w") # not used
    files = [file_A1_B1, file_A2_B2, file_A1_B2, file_A2_B1]
    for afile in files:
        table = []
        for aline in afile:
            values = aline.split('\t')
            values.remove('\n') # why?
            table.append(values)
        row_count = len(table)
        row0length = len(table[0])
        print_counter = 4
        for column_index in range(row0length):
            column_total = 0
            for row_index in range(row_count):
                number = float(table[row_index][column_index])
                column_total = column_total + number
            column_average = column_total/row_count
            print(column_average)
            if print_counter == 4:
                file_write.write(str(column_average)+'\n')
                print_counter = 0
            print_counter +=1
    file_write.write('\n')
    

    It rapidly becomes apparent that (1) you are calculating column averages and (2) the obfuscation led some others to think you were calculating row averages.

    As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.

    Here is a revised version of the outer loop code:

    for afile in files:
        for row_count, aline in enumerate(afile, start=1):
            values = aline.split('\t')
            values.remove('\n') # why?
            fvalues = list(map(float, values))  # list() so len() and indexing also work on Python 3
            if row_count == 1:
                row0length = len(fvalues)
                column_index_range = range(row0length)
                column_totals = fvalues
            else:
                assert len(fvalues) == row0length
                for column_index in column_index_range:
                    column_totals[column_index] += fvalues[column_index]
        print_counter = 4
        for column_index in column_index_range:
            column_average = column_totals[column_index] / row_count
            print(column_average)
            if print_counter == 4:
                file_write.write(str(column_average)+'\n')
                print_counter = 0
            print_counter +=1
    
  • 2020-11-27 20:39

    Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about

    1959167 [MiB]
    

    On Jython 2.5 it crashes earlier:

     239000 [MiB]
    

    I can probably configure Jython to use more memory (it takes its limits from the JVM).

    Test app:

    import sys
    
    sl = []
    i = 0
    # some magic 1024 - overhead of string object
    fill_size = 1024
    if sys.version.startswith('2.7'):
        fill_size = 1003
    if sys.version.startswith('3'):
        fill_size = 497
    print(fill_size)
    MiB = 0
    while True:
        s = str(i).zfill(fill_size)
        sl.append(s)
        if i == 0:
            try:
                sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
            except AttributeError:
                pass
        i += 1
        if i % 1024 == 0:
            MiB += 1
            if MiB % 25 == 0:
                sys.stderr.write('%d [MiB]\n' % (MiB))
    

    In your app you read the whole file at once. For files this big you should read them line by line.

  • 2020-11-27 20:45

    No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.

    In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
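    One limit that does come from the interpreter build rather than from the language is address space: a 32-bit build cannot address more than 4 GiB, however much RAM is installed. A minimal sketch for checking which build is running follows; the helper names `pointer_bits` and `address_space_bytes` are made up for this illustration.

```python
import struct
import sys


def pointer_bits():
    """Return the pointer width of the running interpreter in bits."""
    return struct.calcsize("P") * 8


def address_space_bytes():
    """Theoretical upper bound on addressable memory from pointer width alone."""
    return 2 ** pointer_bits()


print("pointer width: %d bits" % pointer_bits())
print("64-bit build: %s" % (sys.maxsize > 2 ** 32))
```

    The figure from `address_space_bytes()` is only a ceiling. The memory a process can really obtain is set by the operating system, the installed RAM and the swap space.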

    Edit:

    Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.

    A better approach would be to read the files one line at a time:

    for u in files:
        for line in u:  # iterates over the file one line at a time
            pass  # read the values from the line and do the necessary calculations here
    
  • 2020-11-27 20:50

    (This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second; hopefully three's a charm.)

    Edits: Since this seems to be a popular answer, I've made a few modifications to improve its implementation over the years, most of them not too major. This is so that if folks use it as a template, it will provide an even better basis.

    As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.

    Python's memory limits are determined by how much physical RAM and virtual memory (disk space) your computer and operating system have available. Even if you don't use it all up and your program "works", using that much may be impractical because it takes too long.

    Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.

    To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.

    try:
        from itertools import izip_longest
    except ImportError:    # Python 3
        from itertools import zip_longest as izip_longest
    
    GROUP_SIZE = 4
    input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                        "A2_B1_100000.txt"]
    file_write = open("average_generations.txt", 'w')
    mutation_average = open("mutation_average", 'w')  # left in, but nothing written
    
    for file_name in input_file_names:
        with open(file_name, 'r') as input_file:
            print('processing file: {}'.format(file_name))
    
            totals = []
            # rstrip() drops the trailing tab/newline so that float() is never given an empty field
            for count, fields in enumerate((line.rstrip().split('\t') for line in input_file), 1):
                totals = [sum(values) for values in
                            izip_longest(totals, map(float, fields), fillvalue=0)]
            averages = [total/count for total in totals]
    
            for print_counter, average in enumerate(averages):
                print('  {:9.4f}'.format(average))
                if print_counter % GROUP_SIZE == 0:
                    file_write.write(str(average)+'\n')
    
    file_write.write('\n')
    file_write.close()
    mutation_average.close()
    
  • 2020-11-27 20:56

    You're reading the entire file into memory (line = u.readlines()) which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.

    Better iterate over each line:

    for current_line in u:
        do_something_with(current_line)
    

    is the recommended approach.

    Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much more simply.

    This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.
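    As a small illustration of the point about letting Python handle iteration, the sketch below compares an index-based loop with direct iteration. Both function names are hypothetical and the rows are made-up sample data, not taken from the script in the question.

```python
def column_sum_indexed(rows, column):
    """Index-based loop: needs len() and indexing, so rows must be a full list."""
    total = 0.0
    for row_index in range(len(rows)):
        total += float(rows[row_index][column])
    return total


def column_sum_direct(rows, column):
    """Direct iteration: works on any iterable, including one produced lazily."""
    return sum(float(row[column]) for row in rows)


sample_rows = [["1", "2"], ["3", "4"]]
print(column_sum_indexed(sample_rows, 0))
print(column_sum_direct(iter(sample_rows), 0))  # an iterator is enough here
```

    The direct version never asks for the length of its input, so it can consume rows as they are read from a file without first building a list of them.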

    Also, since you appear to be processing TSV (tab-separated values) files, take a look at the csv module, which handles the splitting, the removal of trailing \n characters and so on for you.
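    A rough sketch of what the csv-based approach could look like on Python 3, keeping only one running total per column in memory. The function name `column_averages` and the sample data are invented for this example. Skipping empty fields is an assumption based on the input lines appearing to end with a trailing tab.

```python
import csv
import io


def column_averages(tsv_file):
    """Return per-column averages of a tab-separated file object, read row by row."""
    totals = []
    row_count = 0
    for row in csv.reader(tsv_file, delimiter="\t"):
        # Assumption: empty fields (e.g. after a trailing tab) carry no data.
        values = [float(field) for field in row if field.strip()]
        if not values:
            continue  # skip blank lines
        if not totals:
            totals = [0.0] * len(values)
        if len(values) != len(totals):
            raise ValueError("row %d has %d fields, expected %d"
                             % (row_count + 1, len(values), len(totals)))
        for index, value in enumerate(values):
            totals[index] += value
        row_count += 1
    return [total / row_count for total in totals]


# In-memory stand-in for a real file opened with open(name, newline="")
sample = io.StringIO("1\t2\t\n3\t4\t\n")
print(column_averages(sample))
```

    Memory use here grows with the number of columns, not with the number of lines, so the same function should cope with very large files.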
