Bash: looping over txt files and extracting a fragment from each

backend · unresolved · 3 answers · 1666 views
长情又很酷
长情又很酷 asked 2021-01-23 13:26

I am analyzing a large number of .dlg text files located in the working directory. Each file has a table (usually located at a different position in the log) in the following format.

3 Answers
  • 2021-01-23 13:50

    I would suggest processing using awk:

    for i in $FILES
    do
        echo -n "\"$i\": "
        awk 'BEGIN {
               output="";
               outputlength=0
             }
             /(^ *[0-9]+)/ {                                    # process only lines that start with a number
               if (length(substr($10, 2)) > outputlength) {     # if line has more hashes, store it
                 output=$0;
                 outputlength=length(substr($10, 2))
               }
             }
             END {
               print output                                     # output the resulting line
             }' "$i"
    done
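    To see what the script compares: awk splits a table row on whitespace, so the hash bar lands in field 10 with the leading | attached, and substr($10, 2) drops that bar. A quick check with a made-up row:

```shell
# A fabricated table row; after whitespace splitting, field 10 is "|##########".
echo '   2 |     -5.12 |   9 |     -5.01 |   4 |##########' |
  awk '{ print $10, length(substr($10, 2)) }'
# prints: |########## 10
```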
    
  • 2021-01-23 13:51

    Probably makes more sense as an Awk script.

    This picks the first line with the widest histogram in the case of a tie within an input file.

    #!/bin/bash
    
    awk 'FNR == 1 { if(sel) print sel; sel = ""; max = 0 }
       FNR < 9 { next }
       length($10) > max { max = length($10); sel = FILENAME ":" $0 }
       END { if (sel) print sel }' ./"$prot"/*.dlg
    

    This assumes the histograms are always the tenth field; if your input format is even messier than the lump you show, maybe adapt to taste.

    In some more detail, the first line triggers on the first line of each input file. If we have collected a previous line (meaning this is not the first input file), print that, and start over. Otherwise, initialize for the first input file. Set sel to nothing and max to zero.

    The second line skips lines 1-8 which contain the header.

    The third line checks if the current line's histogram is longer than max. If it is, update max to this histogram's length, and remember the current line in sel.

    The last line is spillover for when we have processed all files. We never printed the sel from the last file, so print that too, if it's set.
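    A minimal reproduction of that per-file bookkeeping, using two invented .dlg files with eight stand-in header lines and a two-row table each:

```shell
# Build two identical made-up input files (names and rows are invented).
mkdir -p demo
printf '%s\n' h1 h2 h3 h4 h5 h6 h7 h8 \
  '   1 |  -5.0 |  1 |  -5.0 |  1 |###' \
  '   2 |  -4.0 |  1 |  -4.0 |  1 |#####' > demo/a.dlg
cp demo/a.dlg demo/b.dlg

# Same program as above: prints one widest-histogram line per input file.
awk 'FNR == 1 { if (sel) print sel; sel = ""; max = 0 }
     FNR < 9 { next }
     length($10) > max { max = length($10); sel = FILENAME ":" $0 }
     END { if (sel) print sel }' demo/*.dlg
```

    Each output line is the five-hash row, prefixed with its filename.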

    If you mean that we should find the lines between CLUSTERING HISTOGRAM and the end of the table, we would need more information about what the surrounding lines look like. Maybe something like this, though:

    awk '/CLUSTERING HISTOGRAM/ { if (sel) print sel; looking = 1; sel = ""; max = 0 }
       !looking { next }
       looking > 1 && $1 != looking { looking = 0; nextfile }
       $1 == looking && length($10) > max { max = length($10); sel = FILENAME ":" $0 }
       $1 == looking { ++looking }
       END { if (sel) print sel }' ./"$prot"/*.dlg
    

    This sets looking to 1 when it sees CLUSTERING HISTOGRAM, then counts up through the numbered table rows (incrementing looking on each row whose first field matches it) and stops at the first line where the rank no longer matches the counter.
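    As a sanity check, the same state machine run on a throwaway file with junk around the table (all contents invented; the table rows are assumed to be numbered consecutively from 1):

```shell
# Fabricated .dlg fragment: preamble, marker, one header line, two rows, a separator.
cat > demo_cluster.dlg <<'EOF'
some preamble
CLUSTERING HISTOGRAM
Rank | text | header | line
   1 |  -5.0 |  1 |  -5.0 |  1 |###
   2 |  -4.0 |  1 |  -4.0 |  1 |#####
_____________________________________
trailing output
EOF

# Only rows whose rank matches the looking counter are considered;
# the separator line breaks the sequence and ends the scan for this file.
awk '/CLUSTERING HISTOGRAM/ { if (sel) print sel; looking = 1; sel = ""; max = 0 }
   !looking { next }
   looking > 1 && $1 != looking { looking = 0; nextfile }
   $1 == looking && length($10) > max { max = length($10); sel = FILENAME ":" $0 }
   $1 == looking { ++looking }
   END { if (sel) print sel }' demo_cluster.dlg
```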

  • 2021-01-23 13:57

    You can use this one; it should be fast enough, and extra lines in your files besides the tables should not be a problem.

    grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++'
    

    grep fetches all the histogram lines, which are then sorted in reverse order by the last field, so the lines with the most # characters come first; finally awk keeps only the first line seen for each file. Note that when grep parses more than one file it prints the filename at the start of each line by default (the -H behavior), so if you test with a single file, use grep -H.

    Result should be like this:

    file1.dlg:   3 |     -5.47 |  17 |     -5.44 |   2 |##########
    file2.dlg:   3 |     -5.47 |  17 |     -5.44 |   2 |####
    file3.dlg:   3 |     -5.47 |  17 |     -5.44 |   2 |#######
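    A self-contained check of the pipeline, feeding it fabricated grep-style lines (filename prefix included, as grep -H would print it):

```shell
# Three invented histogram lines from two files; sort -rk11 puts the
# longest hash bar first, and awk keeps the first line seen per file.
printf '%s\n' \
  'file1.dlg:   1 | -5.0 | 1 | -5.0 | 1 |###' \
  'file1.dlg:   2 | -4.0 | 1 | -4.0 | 1 |#####' \
  'file2.dlg:   1 | -5.0 | 1 | -5.0 | 1 |##' |
  sort -rk11 | awk '!seen[$1]++'
```

    This leaves exactly one line per file, each carrying that file's longest histogram.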
    

    Here is a modification that keeps the first appearance in case a file contains several equally long max lines:

    grep "#$" *.dlg | sort -k11 | tac | awk '!seen[$1]++'
    

    We replaced sort's reverse flag with the tac command, which reverses the stream, so for equal lines the initial order is preserved.


    Second solution

    Here is one using only awk:

    awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
               END {for (i in row) print i ":" row[i]}' *.dlg
    

    Update: if you execute it from a different directory and want to keep only the basename of each file, remove the path prefix:

    awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
               END {for (i in row) {sub(".*/","",i); print i ":" row[i]}}' *.dlg
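    The sub() call simply strips everything up to the last slash, for example (path invented):

```shell
# sub(".*/", "", i) greedily removes the directory part of a path.
awk 'BEGIN { i = "./run1/file1.dlg"; sub(".*/", "", i); print i }'
# prints: file1.dlg
```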
    