AWK: go through the file twice, doing different tasks

若如初见. 提交于 2019-12-04 02:59:44

问题


I am processing a fairly big collection of Tweets and I'd like to obtain, for each tweet, its mentions (other user's names, prefixed with an @), if the mentioned user is also in the file:

users = new Dictionary()
for each line in file:
   username = get_username(line)
   userid   = get_userid(line)
   users.add(key = userid, value = username)
for each line in file:
   mentioned_names = get_mentioned_names(line)
   mentioned_ids = mentioned_names.map(x => if x in users: users[x] else null)
   print "$line | $mentioned_ids"

I was already processing the file with GAWK, so instead of processing it again in Python or C I decided to try and add this to my AWK script. However, I can't find a way to make to passes over the same file, executing different code for each one. Most solutions imply calling AWK several times, but then I'd loose the associative array I made in the first pass.

I could do it in very hacky ways (like cat'ing the file twice, passing it through sed to add a different preffix to all the lines in each cat), but I'd like to be able to understand this code in a couple of months without hating myself.

What would be the AWK way to do this?

PD:

The less terrible way I've found:

function rewind(    i)
{
    # from https://www.gnu.org/software/gawk/manual/html_node/Rewind-Function.html
    # shift remaining arguments up
    for (i = ARGC; i > ARGIND; i--)
        ARGV[i] = ARGV[i-1]

    # make sure gawk knows to keep going
    ARGC++

    # make current file next to get done
    ARGV[ARGIND+1] = FILENAME

    # do it
    nextfile
}

BEGIN {
 count = 1;
}

count == 1 {
 # first pass, fills an associative array
}

count == 2 {
 # second pass, uses the array
}

FNR == 30 { 
   # handcoded length, horrible
   # could also be automated calling wc -l, passing as parameter
  if (count == 1) {
        count = 2;
        rewind(1)
    }
}

回答1:


The idiomatic way to process two separate files, or the same file twice in awk is like this:

awk 'NR==FNR{ 
    # fill associative array 
    next
}
{
    # use the array
}' file1 file2

The total record number NR is only equal to the record number for the current file FNR on the first file. next skips the second block for the first file. The second block is then processed for the second file. If file1 and file2 are the same file, then this passes through the file twice.



来源:https://stackoverflow.com/questions/28544105/awk-go-through-the-file-twice-doing-different-tasks

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!