How to edit 300 GB text file (genomics data)?

旧巷少年郎 2020-12-18 06:34

I have a 300 GB text file that contains genomics data with over 250k records. There are some records with bad data and our genomics program 'Popoolution' allows us to comm

4 Answers
  • 2020-12-18 07:14

    A basic pattern in R is to read the data in chunks, edit each chunk, and write it out:

    fin = file("fin.txt", "r")
    fout = file("fout.txt", "w")
    while (length(txt <- readLines(fin, n=1000000))) {
        ## txt now holds up to 1000000 lines; add an asterisk to problem lines
        ## bad = <create logical vector indicating bad lines here>
        ## txt[bad] = paste0("*", txt[bad])
        writeLines(txt, fout)
    }
    close(fin); close(fout)
    

    While not ideal, this works on Windows (implied by the mention of Notepad++) and uses a language you are presumably familiar with (R). Using sed (definitely the appropriate tool in the long run) would require installing additional software and getting up to speed with sed.

  • 2020-12-18 07:24

    If you are required to have a person mark these records manually with a text editor, for whatever reason, you should probably use split to split the file up into manageable pieces.

    split -a4 -d -l100000 hugefile.txt part.
    

    This will split the file up into pieces with 100000 lines each. The names of the files will be part.0000, part.0001, etc. Then, after all the files have been edited, you can combine them back together with cat:

    cat part.* > new_hugefile.txt
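
    Before handing the pieces to anyone, it may be worth confirming that they reassemble into the original file. The following is a minimal sketch on a 25-line stand-in file; the scratch directory and the 10-line piece size are assumptions made only for the demo, and -d (numeric suffixes) is a GNU split option.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Small stand-in for hugefile.txt: 25 numbered lines.
seq 1 25 > hugefile.txt

# Same split call as in the answer, with 10 lines per piece for the demo.
split -a4 -d -l10 hugefile.txt part.

# Lists part.0000, part.0001 and part.0002.
ls part.*

# Before any editing, the pieces should reassemble into the original.
cat part.* > new_hugefile.txt
cmp hugefile.txt new_hugefile.txt && echo "round trip OK"
```

    After the manual edits, cmp will of course report differences; at that stage comparing wc -l on both files is a cheap check that no line was added or lost.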
    
  • 2020-12-18 07:25

    The simplest solution is to use a stream-oriented editor such as sed. All you need is to be able to write one or more regular expression(s) that will identify all (and only) the bad records. Since you haven't provided any details on how to identify the bad records, this is the only possible answer.
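
    As an illustration only, here is a sketch of what such a command could look like. The record layout and the rule "third field is N" are invented for the demo, since the question does not say what makes a record bad; the file names are likewise placeholders.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Tiny stand-in for the real file; this record layout is hypothetical.
printf '%s\n' 'chr1 100 A 12' 'chr1 101 N 0' 'chr1 102 G 9' > sample.txt

# Hypothetical rule: a record is bad when its third field is "N".
# On matching lines only, insert "*" at the start of the line.
sed '/^[^ ]* [^ ]* N / s/^/*/' sample.txt > sample.marked.txt

cat sample.marked.txt
```

    sed reads one line at a time, so memory use should not depend on the size of the file; only the address pattern would need to change to fit the real definition of a bad record.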

  • 2020-12-18 07:28

    Based on your update:

    One more thought... Is there an approach that would allow us to add the asterisk to the line without opening the entire text file at once? This could be very useful given that we will have to repeat the process an unknown number of times.

    Here is one approach: if you know the line number, you can add an asterisk at the beginning of that line with:

    sed 'LINE_NUMBER s/^/*/' file
    

    See an example:

    $ cat file
    aa
    bb
    cc
    dd
    ee
    $ sed '3 s/^/*/' file
    aa
    bb
    *cc
    dd
    ee
    

    If you add -i, the file is edited in place:

    $ sed -i '3 s/^/*/' file
    $ cat file
    aa
    bb
    *cc
    dd
    ee
    

    That said, I think it is generally safer to redirect the output to another file

    sed '3 s/^/*/' file > new_file
    

    so that your original file stays intact and the updated version is saved in new_file.
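
    If several lines need marking, running sed once per line number would mean reading the whole file each time. One possible way around that, sketched here under the assumption that the bad line numbers are kept one per row in a file (bad_lines.txt and mark.sed are hypothetical names), is to generate a small sed script and apply it in a single pass:

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same five-line example file as above.
printf '%s\n' aa bb cc dd ee > file

# bad_lines.txt (hypothetical name): one line number per row.
printf '%s\n' 2 5 > bad_lines.txt

# Turn each number N into the sed command "N s/^/*/".
sed 's|$| s/^/*/|' bad_lines.txt > mark.sed

# Apply every command in one pass and write the result to a new file.
sed -f mark.sed file > new_file

cat new_file
```

    Here lines 2 and 5 come out as *bb and *ee, and the other lines are unchanged. sed tests each input line against every command in the script, so for a very long list of line numbers a lookup-based tool such as awk may scale better.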
