Parsing a large (9GB) file using Python

Asked by 陌清茗, 2021-01-03 05:57

I have a large text file that I need to parse into a pipe-delimited text file using Python. The file looks like this (basically):

product/productId: D7SDF9S9

3 Answers
  • 2021-01-03 06:45

    Don't read the whole file into memory in one go; build records up as you go, using the blank lines that separate them. Write the data with the csv module for ease of producing your pipe-delimited records.

    The following code reads the input file one line at a time and writes out a CSV row per record as it goes. It never holds more than one line in memory, plus the one record currently being constructed.

    import csv

    fields = ('productId', 'userId', 'profileName', 'helpfulness', 'rating', 'time', 'summary', 'text')

    # 'output.txt' is a placeholder name; newline='' is required for the csv module
    with open("largefile.txt", "r") as myfile, open('output.txt', 'w', newline='') as fw:
        writer = csv.DictWriter(fw, fields, delimiter='|')

        record = {}
        for line in myfile:
            if not line.strip():
                # an empty line marks the end of a record
                if record:
                    writer.writerow(record)
                    record = {}
                continue

            # 'category/key: value' -> store value under 'key'
            field, value = line.split(': ', 1)
            record[field.partition('/')[-1].strip()] = value.strip()

        if record:
            # write out the last record
            writer.writerow(record)
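
    (Optionally, a header row naming the columns can be added by calling writer.writeheader() right after creating the DictWriter.)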
    

    This code assumes that the text before each colon has the form category/key, so product/productId, review/userId, etc. The part after the slash is used for the CSV column; the fields list at the top reflects these keys.
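
    As a quick illustration (not part of the original answer), this is what that partition('/')[-1] expression extracts:

    >>> 'product/productId'.partition('/')[-1]
    'productId'
    >>> 'review/userId'.partition('/')[-1]
    'userId'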

    Alternatively, you can remove that fields list and use a csv.writer, gathering the record values in a list instead:

    import csv

    # 'output.txt' is a placeholder name; open in text mode with newline='' for csv
    with open("largefile.txt", "r") as myfile, open('output.txt', 'w', newline='') as fw:
        writer = csv.writer(fw, delimiter='|')

        record = []
        for line in myfile:
            if not line.strip():
                # an empty line marks the end of a record
                if record:
                    writer.writerow(record)
                    record = []
                continue

            # keep only the value; the column is implied by position
            record.append(line.split(': ', 1)[1].strip())

        if record:
            # write out the last record
            writer.writerow(record)
    

    This version requires that all of a record's fields are present in the input and appear in a fixed order, since each column is determined by position rather than by name.

  • 2021-01-03 06:57

    Don't read the whole file into memory at once; instead, iterate over it line by line, grouping the lines into records, and use Python's csv module to write out the pipe-delimited rows:

    import csv
    from itertools import groupby

    with open('hugeinputfile.txt', 'r') as infile, open('outputfile.txt', 'w', newline='') as outfile:
        writer = csv.writer(outfile, delimiter='|')

        # csv.reader cannot split records on blank lines (lineterminator is
        # ignored when reading), so group lines into records with groupby:
        # each run of non-blank lines is one record, runs of blank lines are skipped
        for has_text, lines in groupby(infile, key=lambda line: bool(line.strip())):
            if has_text:
                values = [line.split(':', 1)[-1].strip() for line in lines]
                writer.writerow(values)
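
    (itertools.groupby yields each group lazily, so this still holds at most one record's worth of lines in memory at a time.)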
    

    A couple of things to note here:

    • Use with to open files. Why? Because using with ensures that the file is close()d, even if an exception interrupts the script.

    Thus:

    with open('myfile.txt') as f:
        do_stuff_to_file(f)
    

    is equivalent to:

    f = open('myfile.txt')
    try:
        do_stuff_to_file(f)
    finally:
        f.close()
    

    To be continued... (I'm out of time ATM)

  • 2021-01-03 07:03

    Use "readline()" to read the fields of a record one by one. Or you can use read(n) to read "n" bytes.
