Read a very big single-line txt file and split it

耶瑟儿~ 2021-01-19 20:59

I have the following problem: I have a file that is nearly 500 MB in size. It's text, all on one line. The text is separated by a virtual line ending called ROW_DEL and is

3 answers
  • 2021-01-19 21:39

    Read the file in chunks, for example with StreamReader.ReadBlock in C#, which lets you set the maximum number of characters to read.

    For each chunk you read, replace ROW_DEL with \r\n and append the result to a new file.

    Just remember to advance the current index by the number of characters you just read.
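
    The same idea can be sketched in Python rather than C#. This is only a minimal sketch, not the answerer's code: the function name split_rows, its parameters, and the UTF-8 encoding are assumptions made for illustration. It holds back the last len(sep) - 1 characters of each chunk so that a separator cut in half by a chunk boundary is still found once the next chunk arrives.

```python
import os


def split_rows(src, dst, sep="ROW_DEL", newline=os.linesep, chunk_size=1 << 20):
    """Copy src to dst chunk by chunk, replacing every sep with newline."""
    keep = len(sep) - 1   # longest piece of a separator that can end a chunk
    tail = ""             # characters held back from the previous chunk
    # newline="" disables newline translation, so what we write is what lands on disk.
    with open(src, "r", encoding="utf-8", newline="") as f_in, \
         open(dst, "w", encoding="utf-8", newline="") as f_out:
        while True:
            block = f_in.read(chunk_size)
            if not block:
                break
            data = (tail + block).replace(sep, newline)
            # The last `keep` characters may be the start of a separator
            # whose remainder is in the next chunk, so do not write them yet.
            cut = max(0, len(data) - keep)
            f_out.write(data[:cut])
            tail = data[cut:]
        f_out.write(tail)  # nothing more can follow, flush what was held back
```

    Only about one chunk is in memory at a time, so the chunk size can be tuned to taste; the input file is left untouched.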

  • 2021-01-19 21:47

    Actually, 500 MB of text is not that big; it's just that Notepad sucks. You probably don't have sed available since you're on Windows, but at least try the naive solution in Python. I think it will work fine:

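    # Text mode already translates '\n' to the platform newline on write,
    # so write '\n' here; writing os.linesep would give '\r\r\n' on Windows.
    with open('infile.txt') as f_in, open('outfile.txt', 'w') as f_out:
      f_out.write(f_in.read().replace('ROW_DEL ', '\n'))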
    
  • 2021-01-19 21:59

    Here's my solution.
    It is easy in principle (ŁukaszW.pl gave it) but not so easy to code if one wants to handle the special cases (which ŁukaszW.pl did not).

    The special cases are when the separator ROW_DEL is split across two of the chunks read (as I4V pointed out) and, more subtly, when there are two contiguous ROW_DEL of which the second is split across two chunks.
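
    To make the special cases concrete, here is a small Python 3 illustration (not part of the original answer; naive_replace and the sample string are made up for the demonstration). It replaces the separator chunk by chunk without looking across chunk boundaries, which is deliberately wrong.

```python
def naive_replace(text, sep, newline, chunk_size):
    """Replace sep in each chunk separately, ignoring chunk boundaries."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return "".join(chunk.replace(sep, newline) for chunk in chunks)


sample = "aaaROW_DELbbbROW_DELROW_DELccc"
expected = sample.replace("ROW_DEL", "\n")

for size in (5, 10, 16):
    result = naive_replace(sample, "ROW_DEL", "\n", size)
    # True only when no separator happens to be cut by a chunk boundary.
    print(size, result == expected, repr(result))
```

    With a chunk size of 10 every separator happens to fall inside one chunk, so the result is right by luck. With 16, the first of the two contiguous separators is cut in two and survives in the output.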

    Since ROW_DEL is longer than any of the possible newlines ('\r', '\n', '\r\n'), it can be replaced in place in the file by the newline used by the OS. That's why I chose to rewrite the file onto itself.
    For that I use mode 'r+', which doesn't create a new file.
    It's also absolutely mandatory to use binary mode 'b'.

    The principle is to read a chunk (in real life its size will be 262144, for example) plus x additional characters, where x is the length of the separator minus 1.
    Then examine whether the separator is present in the last x characters of the chunk followed by the x additional characters.
    Depending on whether it is present, the chunk is shortened or not before the ROW_DEL replacement is performed and the result is rewritten in place.
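
    The decision described above can be isolated in a few lines of Python 3. This is a sketch for illustration only: safe_length is a hypothetical helper, not a function from the answer, although x and pt follow the answer's naming. Given a chunk and the characters that follow it, it returns how much of the chunk can be processed now.

```python
def safe_length(chunk, lookahead, sep):
    """Return how many leading characters of chunk can be processed now.

    lookahead holds up to len(sep) - 1 characters that follow chunk.
    If a separator starts in the tail of chunk and ends in lookahead,
    the chunk is cut just before that separator.
    """
    x = len(sep) - 1
    tail = chunk[-x:] if x else ""
    window = tail + lookahead          # the only place a straddling sep can be
    pt = window.find(sep)
    if pt == -1:
        return len(chunk)              # no straddling separator: keep it all
    return len(chunk) - len(tail) + pt  # stop right before the separator


# "ROW_" ends the chunk and "DEL" starts the lookahead: cut before "ROW_".
print(safe_length("abcdefgROW_", "DELxyz", "ROW_DEL"))
# Two contiguous separators, the second one straddling: keep the first whole.
print(safe_length("aROW_DELROW_", "DELbbb", "ROW_DEL"))
```

    The shortened chunk never ends in the middle of a separator, so a plain replace on it is safe, and the next read can start exactly at the separator that was cut off.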

    The bare code (Python 2) is:

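    # -*- coding: utf-8 -*-
    # Python 2 needs the declaration above: the sample text contains non-ASCII dashes.
    text = ('The hospital roommate of a man infected ROW_DEL'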
            'with novel coronavirus (NCoV)ROW_DEL'
            '—a SARS-related virus first identified ROW_DELROW_DEL'
            'last year and already linked to 18 deaths—ROW_DEL'
            'has contracted the illness himself, ROW_DEL'
            'intensifying concerns about the ROW_DEL'
            "virus's ability to spread ROW_DEL"
            'from person to person.')
    
    with open('eessaa.txt','w') as f:
        f.write(text)
    
    with open('eessaa.txt','rb') as f:
        ch = f.read()
        print ch.replace('ROW_DEL','ROW_DEL\n')
        print '\nlength of the text : %d chars\n' % len(text)
    
    #==========================================
    
    from os.path import getsize
    from os import fsync,linesep
    
    def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
        if chunk_length<len(sep):
            print 'Length of second argument, %d , is '\
                  'the minimum value for the third argument'\
                  % len(sep)
            return
    
        x = len(sep)-1
        x2 = 2*x
        file_length = getsize(whichfile)
        with open(whichfile,'rb+') as fR,\
             open(whichfile,'rb+') as fW:
            while True:
                chunk = fR.read(chunk_length)
                pch = fR.tell()
                twelve = chunk[-x:] + fR.read(x)
                ptw = fR.tell()
    
                if sep in twelve:
                    pt = twelve.find(sep)
                    m = ("\n   !! %r is "
                         "at position %d in twelve !!" % (sep,pt))
                    y = chunk[0:-x+pt].replace(sep,OSeol)
                else:
                    pt = x
                    m = ''
                    y = chunk.replace(sep,OSeol)
    
                pos = fW.tell()
                fW.write(y)
                fW.flush()
                fsync(fW.fileno())
    
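                if fR.tell()<file_length:
                    fR.seek(-x2+pt,1)
                elif pt<x:
                    # sep straddles the chunk end and the file ends within
                    # the x extra characters: resume at the start of sep
                    fR.seek(pch-x+pt)
                else:
                    fW.truncate()
                    break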
    
    rewrite('eessaa.txt','ROW_DEL',14)
    
    with open('eessaa.txt','rb') as f:
        ch = f.read()
        print '\n'.join(repr(line)[1:-1] for line in ch.splitlines(1))
        print '\nlength of the text : %d chars\n' % len(ch)
    

    To follow the execution, here's another version that prints messages along the way:

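    # -*- coding: utf-8 -*-
    # Python 2 needs the declaration above: the sample text contains non-ASCII dashes.
    text = ('The hospital roommate of a man infected ROW_DEL'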
            'with novel coronavirus (NCoV)ROW_DEL'
            '—a SARS-related virus first identified ROW_DELROW_DEL'
            'last year and already linked to 18 deaths—ROW_DEL'
            'has contracted the illness himself, ROW_DEL'
            'intensifying concerns about the ROW_DEL'
            "virus's ability to spread ROW_DEL"
            'from person to person.')
    
    with open('eessaa.txt','w') as f:
        f.write(text)
    
    with open('eessaa.txt','rb') as f:
        ch = f.read()
        print ch.replace('ROW_DEL','ROW_DEL\n')
        print '\nlength of the text : %d chars\n' % len(text)
    
    #==========================================
    
    from os.path import getsize
    from os import fsync,linesep
    
    def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
        if chunk_length<len(sep):
            print 'Length of second argument, %d , is '\
                  'the minimum value for the third argument'\
                  % len(sep)
            return
    
        x = len(sep)-1
        x2 = 2*x
        file_length = getsize(whichfile)
        with open(whichfile,'rb+') as fR,\
             open(whichfile,'rb+') as fW:
            while True:
                chunk = fR.read(chunk_length)
                pch = fR.tell()
                twelve = chunk[-x:] + fR.read(x)
                ptw = fR.tell()
    
                if sep in twelve:
                    pt = twelve.find(sep)
                    m = ("\n   !! %r is "
                         "at position %d in twelve !!" % (sep,pt))
                    y = chunk[0:-x+pt].replace(sep,OSeol)
                else:
                    pt = x
                    m = ''
                    y = chunk.replace(sep,OSeol)
                print ('chunk  == %r   %d chars\n'
                       ' -> fR now at position  %d\n'
                       'twelve == %r   %d chars   %s\n'
                       ' -> fR now at position  %d'
                       % (chunk ,len(chunk),      pch,
                          twelve,len(twelve),m,   ptw) )
    
                pos = fW.tell()
                fW.write(y)
                fW.flush()
                fsync(fW.fileno())
                print ('          %r   %d long\n'
                       ' has been written from position %d\n'
                       ' => fW now at position  %d'
                       % (y,len(y),pos,fW.tell()))
    
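                if fR.tell()<file_length:
                    fR.seek(-x2+pt,1)
                    print ' -> fR moved %d characters back to position %d'\
                           % (x2-pt,fR.tell())
                elif pt<x:
                    # sep straddles the chunk end and the file ends within
                    # the x extra characters: resume at the start of sep
                    fR.seek(pch-x+pt)
                    print ' -> fR moved back to position %d' % fR.tell()
                else:
                    print (" => fR is at position %d == file's size\n"
                           '    File has thoroughly been read'
                           % fR.tell())
                    fW.truncate()
                    break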
    
                raw_input('\npress any key to continue')
    
    
    rewrite('eessaa.txt','ROW_DEL',14)
    
    with open('eessaa.txt','rb') as f:
        ch = f.read()
        print '\n'.join(repr(line)[1:-1] for line in ch.splitlines(1))
        print '\nlength of the text : %d chars\n' % len(ch)
    

    There's some subtlety in handling the ends of the chunks in order to detect whether ROW_DEL straddles two chunks and whether there are two contiguous ROW_DEL. That's why it took me a long time to post my solution: in the end I had to write fR.seek(-x2+pt,1), and not simply fR.seek(-2*x,1) or fR.seek(-x,1) depending on whether sep is straddling or not (2*x is x2 in the code; with ROW_DEL, x and x2 are 6 and 12). Anybody interested in this point can examine it by changing the code in the branches that depend on whether 'ROW_DEL' is in twelve or not.
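
    The seek arithmetic can be checked in isolation. The following Python 3 sketch is not the answer's code: it uses an in-memory io.BytesIO instead of a real file, and landing_position is a hypothetical helper. It performs one read step the way rewrite does and reports where the read pointer ends up after fR.seek(-x2+pt,1).

```python
import io

sep = b"ROW_DEL"
x = len(sep) - 1       # 6
x2 = 2 * x             # 12
chunk_length = 14


def landing_position(data):
    """Do one read step as in rewrite(), seek back, and return the position."""
    f = io.BytesIO(data)
    chunk = f.read(chunk_length)
    twelve = chunk[-x:] + f.read(x)
    pt = twelve.find(sep) if sep in twelve else x
    f.seek(-x2 + pt, 1)            # relative seek from chunk_length + x
    return f.tell()


# Slide the separator across the chunk boundary at offset 14.
landings = {}
for start in range(8, 14):
    data = b"." * start + sep + b"." * 20
    landings[start] = landing_position(data)
    print(start, landings[start])

# Without any separator the pointer simply returns to the end of the chunk.
no_straddle = landing_position(b"." * 40)
print(no_straddle)
```

    In every straddling case the pointer lands on the first byte of the separator, which is also where the written part chunk[0:-x+pt] stops. A fixed fR.seek(-2*x,1) would always land at offset 8 and so would re-read bytes that had already been written whenever the separator starts later than that.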
