bash looping and extracting of the fragment of txt file

I am analyzing a large number of dlg text files located in the workdir. Each file has a table (usually located at a different position in the log) in the follo

3 Answers

再見小時候

2021-01-23 13:50

I would suggest processing using awk:

```
for i in $FILES
do
    echo -n \""$i\": "
    awk 'BEGIN {
           output="";
           outputlength=0
         }
         /(^ *[0-9]+)/ {                                    # process only lines that start with a number
           if (length(substr($10, 2)) > outputlength) {     # if line has more hashes, store it
             output=$0;
             outputlength=length(substr($10, 2))
           }
         }
         END {
           print output                                     # output the resulting line
         }' "$i"
done
```


没有蜡笔的小新

2021-01-23 13:51
Probably makes more sense as an Awk script.

This picks the first line with the widest histogram in the case of a tie within an input file.
```
#!/bin/bash

awk 'FNR == 1 { if(sel) print sel; sel = ""; max = 0 }
   FNR < 9 { next }
   length($10) > max { max = length($10); sel = FILENAME ":" $0 }
   END { if (sel) print sel }' ./"$prot"/*.dlg
```
This assumes the histograms are always the tenth field; if your input format is even messier than the lump you show, maybe adapt to taste.

In some more detail, the first line triggers on the first line of each input file. If we have collected a previous line (meaning this is not the first input file), print that, and start over. Otherwise, initialize for the first input file. Set sel to nothing and max to zero.

The second line skips lines 1-8 which contain the header.

The third line checks if the current line's histogram is longer than max. If it is, update max to this histogram's length, and remember the current line in sel.

The last line is spillover for when we have processed all files. We never printed the sel from the last file, so print that too, if it's set.
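To try the script above without real data, here is a minimal sketch that builds two throwaway files and runs the same awk program on them. The file names, the eight `header line` placeholders, and the row values are all invented for illustration; only the row layout follows the table rows quoted in this thread.

```shell
#!/bin/bash
# Hypothetical sample data: 8 header lines, then table rows (made-up values).
tmp=$(mktemp -d)

make_file() {   # usage: make_file NAME ROW...
    name=$1; shift
    {
        printf 'header line %d\n' 1 2 3 4 5 6 7 8
        printf '%s\n' "$@"
    } > "$tmp/$name"
}

make_file a.dlg \
    '   1 |     -5.78 |  11 |     -5.78 |   1 |#' \
    '   2 |     -5.47 |  17 |     -5.44 |   3 |###' \
    '   3 |     -5.10 |   4 |     -5.05 |   3 |###'
make_file b.dlg \
    '   1 |     -6.02 |   2 |     -6.02 |   2 |##' \
    '   2 |     -5.90 |   9 |     -5.90 |   1 |#'

# Same program as above; cd first so FILENAME is the bare file name.
result=$(cd "$tmp" && awk 'FNR == 1 { if (sel) print sel; sel = ""; max = 0 }
   FNR < 9 { next }
   length($10) > max { max = length($10); sel = FILENAME ":" $0 }
   END { if (sel) print sel }' a.dlg b.dlg)

printf '%s\n' "$result"
rm -r "$tmp"
```

In a.dlg rows 2 and 3 tie, and the strict `>` comparison keeps row 2, the first of the two.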

If you mean to say we should find the lines between CLUSTERING HISTOGRAM and the end of the table, we should probably have more information about what the surrounding lines look like. Maybe something like this, though:
```
awk '/CLUSTERING HISTOGRAM/ { if (sel) print sel; looking = 1; sel = ""; max = 0 }
   !looking { next }
   looking > 1 && $1 != looking { looking = 0; nextfile }
   $1 == looking && length($10) > max { max = length($10); sel = FILENAME ":" $0 }
   $1 == looking { ++looking }    # advance to the next expected rank
   END { if (sel) print sel }' ./"$prot"/*.dlg
```
This sets looking to 1 when we see CLUSTERING HISTOGRAM, then counts up with each table row and stops at the first line whose first field no longer matches the expected rank.
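As a way to check the counting logic, here is a small self-contained sketch. The sample files and their surrounding lines are invented, since the real layout around the table is not shown here. It is a variant rather than the exact script: it increments the counter explicitly after each matching row and uses `next` instead of `nextfile`, so it also runs on awk implementations that lack `nextfile`.

```shell
#!/bin/bash
# Hypothetical sample logs; only the row layout follows the thread.
tmp=$(mktemp -d)
cat > "$tmp/a.dlg" <<'EOF'
some unrelated log text
CLUSTERING HISTOGRAM
Rank | Energy | Run | Mean | Num | Histogram
   1 |     -5.78 |  11 |     -5.78 |   1 |#
   2 |     -5.47 |  17 |     -5.44 |   4 |####
   3 |     -5.10 |   4 |     -5.05 |   2 |##
end of table
   9 |     -1.00 |   1 |     -1.00 |   9 |#########
EOF
cat > "$tmp/b.dlg" <<'EOF'
CLUSTERING HISTOGRAM
Rank | Energy | Run | Mean | Num | Histogram
   1 |     -6.02 |   2 |     -6.02 |   3 |###
   2 |     -5.90 |   9 |     -5.90 |   1 |#
EOF

result=$(cd "$tmp" && awk '
    /CLUSTERING HISTOGRAM/ { if (sel) print sel; looking = 1; sel = ""; max = 0; next }
    !looking { next }                                    # outside any table
    looking > 1 && $1 != looking { looking = 0; next }   # rank sequence broke: table ended
    $1 == looking {                                      # row with the expected rank
        if (length($10) > max) { max = length($10); sel = FILENAME ":" $0 }
        looking++
    }
    END { if (sel) print sel }' a.dlg b.dlg)

printf '%s\n' "$result"
rm -r "$tmp"
```

The row-like line after `end of table` in a.dlg is ignored even though it has the longest histogram, because the counter has already been reset by then.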

[愿得一人]

2021-01-23 13:57
You can use this one-liner, which should be fast enough. Extra lines in your files, besides the tables, should not be a problem.
```
grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++'
```
grep fetches all the histogram lines, which are then sorted in reverse order by the last field, so the lines with the most # characters come first; finally, awk keeps only the first line seen for each file. Note that when grep is given more than one file, it prints the filename at the beginning of each line by default, so if you test it on a single file, use grep -H.

Result should be like this:
```
file1.dlg:   3 |     -5.47 |  17 |     -5.44 |   2 |##########
file2.dlg:   3 |     -5.47 |  17 |     -5.44 |   2 |####
file3.dlg:   3 |     -5.47 |  17 |     -5.44 |   2 |#######
```
Here is a modification to get the first appearance when several lines in a file tie for the maximum:
```
grep "#$" *.dlg | sort -s -rk11 | awk '!seen[$1]++'
```
We added the -s (stable) option to sort. Without it, sort breaks ties between equal keys by comparing the whole lines, and the same happens when the output of a plain sort -k11 is reversed with tac; with -s, lines with equal keys keep their initial order.
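To make the tie-breaking explicit, here is a sketch with invented sample files in which a.dlg has two rows tied for the longest histogram. It assumes a sort that supports `-s` (GNU, BSD, and BusyBox sort do) and pins `LC_ALL=C` so the key comparison is bytewise; both choices are assumptions of this example, not something stated in the answer.

```shell
#!/bin/bash
# Hypothetical sample data: rows 1 and 3 of a.dlg tie for the maximum.
tmp=$(mktemp -d)
cat > "$tmp/a.dlg" <<'EOF'
   1 |     -5.78 |  11 |     -5.78 |   3 |###
   2 |     -5.47 |  17 |     -5.44 |   1 |#
   3 |     -5.10 |   4 |     -5.05 |   3 |###
EOF
cat > "$tmp/b.dlg" <<'EOF'
   1 |     -6.02 |   2 |     -6.02 |   1 |#
   2 |     -5.90 |   9 |     -5.90 |   2 |##
EOF

# -s disables the whole-line fallback comparison, so rows with equal
# histogram keys stay in the order grep produced them.
result=$(cd "$tmp" && grep '#$' a.dlg b.dlg | LC_ALL=C sort -s -rk11 | awk '!seen[$1]++')

printf '%s\n' "$result"
rm -r "$tmp"
```

With the stable sort, row 1 of a.dlg is reported, which is the first of the two tied rows in the file.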

Second solution

Here using only awk:
```
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
           END {for (i in row) print i ":" row[i]}' *.dlg
```
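Update: if the files live in a different directory and you want to keep only the basename of every file, strip the path prefix from a copy of the key, because sub() modifies its target in place and row[i] still needs the full name (path/to below is a placeholder for your directory):
```
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
           END {for (i in row) {name=i; sub(".*/","",name); print name ":" row[i]}}' path/to/*.dlg
```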
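One detail worth checking in the basename variant is that `sub()` edits its target in place, so stripping the path from the loop variable itself would change the key used for the `row[...]` lookup. The sketch below copies the key first. The `run1` directory, the file contents, and the variable names `path` and `name` are made up for the example, and the output is piped through `sort` only because the order of `for (... in ...)` is unspecified.

```shell
#!/bin/bash
# Hypothetical sample data in a subdirectory, so FILENAME carries a path.
tmp=$(mktemp -d)
mkdir "$tmp/run1"
cat > "$tmp/run1/a.dlg" <<'EOF'
some log text
   1 |     -5.78 |  11 |     -5.78 |   2 |##
   2 |     -5.47 |  17 |     -5.44 |   4 |####
   3 |     -5.10 |   4 |     -5.05 |   4 |####
EOF
cat > "$tmp/run1/b.dlg" <<'EOF'
   1 |     -6.02 |   2 |     -6.02 |   1 |#
EOF

result=$(awk -F'|' '
    /#$/ && $NF > max[FILENAME] { max[FILENAME] = $NF; row[FILENAME] = $0 }
    END {
        for (path in row) {
            name = path             # copy the key; sub() edits its target in place
            sub(".*/", "", name)    # strip the directory prefix from the copy
            print name ":" row[path]
        }
    }' "$tmp"/run1/*.dlg | LC_ALL=C sort)

printf '%s\n' "$result"
rm -r "$tmp"
```

The comparison `$NF > max[FILENAME]` is a string comparison here; it works because the last field consists only of `#` characters, so a longer run always compares greater.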